Wednesday, April 2, 2025

Running llama.cpp locally to test code suggestions

 

Yesterday I was playing with the Qwen 2.5 Coder 32B model to generate some Python code, but the generated program didn't work.

So today I tried running the llama.cpp server to see whether the Continue plugin for Visual Studio Code could help fix the code.

To keep it responsive, I ran the server with a 7B model:

./llama-server -m ../../models/qwen2.5-coder-7b-instruct-q8_0.gguf -c 4096 --host 0.0.0.0 -ngl 99
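
To sanity-check that the server is answering before pointing Continue at it, something like the following should work against the OpenAI-compatible chat endpoint that llama-server exposes (assuming the default port 8080, since no --port was given above):

# Quick check of the local llama-server via its OpenAI-compatible endpoint.
# Port 8080 is an assumption, taken from llama-server's default.
import json
import urllib.request

payload = {
    "model": "qwen2.5-coder-7b-instruct",  # largely informational for llama-server
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])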

After pointing Continue at the local llama server and asking it to fix the code, the model replied with some convincing suggestions.



However, the code still didn't work, so I followed up again...



And third time's the charm? Nope.




Tuesday, April 1, 2025

Trying out code generation with Qwen 2.5 32B

I was playing with some of the smaller LLM models (Llama 2 7B Q4, Llama 3.1 8B Q8, etc.) and found that they hallucinate a lot (as expected).

So I decided to add more RAM to my PC so I could run a bigger model. My computer is a Ryzen 5 5600G with integrated graphics. Although it definitely lacks GPU compute power, one benefit of an AMD iGPU is that it can borrow system memory on demand as GTT. On Linux, the amdgpu driver by default allows up to 50% of system memory to be allocated to the GPU (more if the amdgpu.gttsize kernel parameter is specified).
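
To see how much GTT the driver is actually willing to hand out, the amdgpu sysfs counters can be read directly (assuming the iGPU shows up as card0; adjust the path if other GPUs are present):

# Read the GTT limit the amdgpu driver reports for this GPU.
# The card0 path is an assumption; pick the right cardN for your system.
path = "/sys/class/drm/card0/device/mem_info_gtt_total"
with open(path) as f:
    gtt_bytes = int(f.read())
print(f"GTT available to the GPU: {gtt_bytes / 2**30:.1f} GiB")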



So with 96 GB of system memory, I was able to run Qwen 2.5 Coder 32B to try out some code generation.

./llama-cli -m ../../models/qwen2.5-coder-32b-instruct-q8_0.gguf -c 16384 -e -ngl 99

Here is a prompt I found somewhere on the internet:

Write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically

The result was much better than with Qwen 2.5 14B, but the generated code was still wrong: it didn't get the physics right, and the ball simply fell through the hexagon.
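
For what it's worth, the step these models tend to get wrong is the collision response against a moving wall: the ball's velocity has to be reflected relative to the wall's own velocity at the contact point, otherwise the ball just tunnels through. A rough sketch of that one step, in plain Python with no rendering and placeholder constants of my own choosing:

import math

GRAVITY = (0.0, -9.81)   # m/s^2
OMEGA = 1.0              # hexagon angular velocity, rad/s (assumed value)
RADIUS = 5.0             # hexagon circumradius (assumed value)
BALL_R = 0.3             # ball radius
RESTITUTION = 0.85       # fraction of normal velocity kept on a bounce
FRICTION = 0.98          # tangential damping applied per bounce

def hexagon_vertices(angle):
    """Vertices of a hexagon centred on the origin, rotated by `angle` radians."""
    return [(RADIUS * math.cos(angle + k * math.pi / 3),
             RADIUS * math.sin(angle + k * math.pi / 3)) for k in range(6)]

def step(pos, vel, angle, dt):
    """Advance the ball by one time step dt; returns new (pos, vel, angle)."""
    vx, vy = vel[0] + GRAVITY[0] * dt, vel[1] + GRAVITY[1] * dt
    px, py = pos[0] + vx * dt, pos[1] + vy * dt
    angle += OMEGA * dt

    verts = hexagon_vertices(angle)
    for i in range(6):
        (x1, y1), (x2, y2) = verts[i], verts[(i + 1) % 6]
        # Inward-pointing unit normal of this wall.
        ex, ey = x2 - x1, y2 - y1
        elen = math.hypot(ex, ey)
        nx, ny = -ey / elen, ex / elen
        if nx * -x1 + ny * -y1 < 0:          # flip so the normal faces the centre
            nx, ny = -nx, -ny
        # Signed distance from the ball centre to the wall.
        dist = (px - x1) * nx + (py - y1) * ny
        if dist < BALL_R:
            # Wall velocity at the contact point: omega cross r, in 2D.
            cx, cy = px - nx * dist, py - ny * dist
            wvx, wvy = -OMEGA * cy, OMEGA * cx
            # Work in the wall's frame: reflect the relative velocity.
            rvx, rvy = vx - wvx, vy - wvy
            vn = rvx * nx + rvy * ny
            if vn < 0:                        # only if moving into the wall
                tx, ty = rvx - vn * nx, rvy - vn * ny
                rvx = FRICTION * tx - RESTITUTION * vn * nx
                rvy = FRICTION * ty - RESTITUTION * vn * ny
                vx, vy = rvx + wvx, rvy + wvy
            # Push the ball back inside so it cannot sink through the wall.
            px += (BALL_R - dist) * nx
            py += (BALL_R - dist) * ny
    return (px, py), (vx, vy), angle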



And the LLM output speed was around 1.35 tokens per second.

llama_perf_sampler_print:    sampling time =     182.90 ms /  1662 runs   (    0.11 ms per token,  9087.18 tokens per second)
llama_perf_context_print:        load time =    9746.62 ms
llama_perf_context_print: prompt eval time =    3818.63 ms /    43 tokens (   88.81 ms per token,    11.26 tokens per second)
llama_perf_context_print:        eval time = 1202554.57 ms /  1618 runs   (  743.24 ms per token,     1.35 tokens per second)
llama_perf_context_print:       total time = 1310292.02 ms /  1661 tokens
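
That 1.35 figure matches the eval counters above:

# Back-of-the-envelope check of the reported generation speed.
eval_ms, eval_runs = 1202554.57, 1618
print(eval_runs / (eval_ms / 1000.0))   # ~1.35 tokens per second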


Thursday, March 27, 2025

Building llama.cpp with Vulkan on openSUSE



When trying to run llama.cpp locally, I found that the instructions for building the Docker image with Vulkan acceleration don't work on my openSUSE Tumbleweed machine.

Instead, I needed to build and run the client directly on my host machine.

First, make sure both the "vulkan-devel" and "shaderc" packages are installed.

Next, build it with Vulkan enabled:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DGGML_VULKAN=on -DCMAKE_BUILD_TYPE=Release
make

The client should then detect and use the GPU via the Vulkan library.

[~/work/llama.cpp/build/bin] $ ./llama-cli -m ../../models/Meta-Llama-3.1-8B-Instruct-Q5_K_L.gguf -p "Building a website can be done in 10 simple steps:" -n 600 -e -ngl 99  

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none
build: 4967 (f17a3bb4) with cc (SUSE Linux) 14.2.1 20250220 [revision 9ffecde121af883b60bbe60d00425036bc873048] for x86_64-suse-linux
main: llama backend init
......