Running the Qwen3-30B-A3B model at Q8 with llama.cpp locally is really responsive. Even on my Ryzen 5 5600G's integrated graphics, it runs at almost 8 tokens per second.
Since my machine has 96GB of RAM (with 70GB allocatable to the integrated GPU), I can even leave it running in the background all the time. With a context size of 40960 and flash attention enabled, it takes up only about 36GB of GPU memory when fully offloaded.
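As a rough sanity check on that 36GB figure (back-of-the-envelope only; the 48 layers, 4 KV heads, and head dim of 128 are my assumptions about the Qwen3-30B-A3B config, with an f16 KV cache):

# bytes per token = 2 (K+V) x layers x KV heads x head dim x 2 bytes (f16)
echo $(( 2 * 48 * 4 * 128 * 2 ))          # 98304 bytes, about 96KB per token
echo $(( 2 * 48 * 4 * 128 * 2 * 40960 ))  # about 3.75GB at 40960 context

Add that to the roughly 32GB of Q8_0 weights and you land right around 36GB, so the number checks out.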
./llama-cli -m ../../models/Qwen3-30B-A3B-Q8_0.gguf --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 40960 -n 32768 --no-context-shift
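Since the whole point is leaving it resident in the background, llama-server is arguably the nicer front end; a minimal sketch reusing the same settings (the port is just my choice):

./llama-server -m ../../models/Qwen3-30B-A3B-Q8_0.gguf --jinja -ngl 99 -fa -c 40960 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 --port 8080

It then answers OpenAI-style chat requests, e.g.:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello"}]}'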
The only downside: as is usual for models from Alibaba in China, it refuses to answer "sensitive questions"...