I wasn't able to get llama.cpp to run it, even after pulling the latest master and rebuilding, because it failed with an unknown-architecture error. ChatGPT told me to fetch a specific PR branch and rebuild:
git fetch origin pull/18058/head:nemotron3
git checkout nemotron3
cmake -S . -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j --clean-first --target llama-server
and that did the trick.
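For anyone following along, once that build finishes the server binary lands in build/bin. A minimal launch sketch — the model path is a placeholder, and the flags shown are standard llama-server options, not ones from the original post:

```shell
# Launch the freshly built server. Adjust paths/values for your setup:
#   -m       path to your GGUF model file (placeholder here)
#   -ngl 99  offload all layers to the GPU
#   -c 8192  context window size in tokens
#   --port   port for the HTTP / OpenAI-compatible API
./build/bin/llama-server \
  -m ./models/model-Q4_K_M.gguf \
  -ngl 99 -c 8192 --port 8080
```

Once it's up, the web UI is reachable at http://localhost:8080.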
Also, this thing is flying: with the Q4_K_M quant on my 5090 I'm getting 220 t/s on average.
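If you want a comparable number on your own hardware, llama.cpp ships a benchmarking tool in the same build tree that reports tokens/second directly. A sketch, with the model path again a placeholder:

```shell
# Build the bench tool from the same checkout/build directory
cmake --build build --config Release -j --target llama-bench

# Measure throughput:
#   -p 512   prompt-processing test with 512 tokens
#   -n 128   text-generation test producing 128 tokens
#   -ngl 99  offload all layers to the GPU
./build/bin/llama-bench -m ./models/model-Q4_K_M.gguf -p 512 -n 128 -ngl 99
```

It prints a small table with prompt-processing and generation t/s, which is an apples-to-apples way to compare against numbers like the one above.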
