I was playing with some of the smaller LLMs (Llama 2 7B Q4, Llama 3.1 8B Q8, etc.) and found that they hallucinate a lot (as expected).
So I decided to add more RAM to my PC to run a bigger model. My computer is a Ryzen 5 5600G with integrated graphics. While it definitely lacks GPU compute power, one benefit of an AMD iGPU is that it can access extra system memory on demand as GTT (Graphics Translation Table) memory. On Linux, it can allocate up to 50% of system memory to the GPU by default, and more if you set the amdgpu.gttsize kernel parameter.
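For example, on a GRUB-based distro the limit can be raised like this (a sketch; amdgpu.gttsize is in MiB, and 24576 is an assumed value allowing up to 24 GiB):

# /etc/default/grub: append amdgpu.gttsize (in MiB) to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="... amdgpu.gttsize=24576"
# then regenerate the GRUB config and reboot
sudo update-grub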
With the extra RAM installed, I ran Qwen2.5-Coder 32B (Q8_0) with llama.cpp:

./llama-cli -m ../../models/qwen2.5-coder-32b-instruct-q8_0.gguf -c 16384 -e -ngl 99
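Here -m points at the model file, -c 16384 sets the context window, -e enables escape processing in the prompt, and -ngl 99 offloads (up to) all of the model's layers to the GPU.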
Here is a prompt I found somewhere on the internet:
Write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically
The result was much better than with Qwen 2.5 14B, but the generated code was still wrong: it didn't get the physics right, and the ball simply fell through the hexagon.
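For reference, here is a minimal sketch of the collision step the generated code kept getting wrong. This is my own illustration, not the model's output; it assumes a regular hexagon centered at the origin with counter-clockwise vertices, and gravity/friction would be applied in the integration step, which is omitted.

import math

def hexagon_vertices(angle, size=200.0):
    # Regular hexagon centered at the origin, rotated by `angle` (counter-clockwise).
    return [(size * math.cos(angle + i * math.pi / 3),
             size * math.sin(angle + i * math.pi / 3)) for i in range(6)]

def collide(pos, vel, angle, omega, ball_r=10.0, restitution=0.9):
    # Push the ball back inside the hexagon and bounce it off any wall it crosses.
    px, py = pos
    vx, vy = vel
    verts = hexagon_vertices(angle)
    for i in range(6):
        ax, ay = verts[i]
        bx, by = verts[(i + 1) % 6]
        ex, ey = bx - ax, by - ay
        length = math.hypot(ex, ey)
        nx, ny = -ey / length, ex / length       # inward normal (vertices are CCW)
        dist = (px - ax) * nx + (py - ay) * ny   # signed distance, positive = inside
        if dist < ball_r:
            # The wall is moving: the velocity of a point p under rotation omega is omega x p.
            wall_vx, wall_vy = -omega * py, omega * px
            rvx, rvy = vx - wall_vx, vy - wall_vy  # ball velocity relative to the wall
            vn = rvx * nx + rvy * ny
            if vn < 0:  # heading into the wall: reflect with some energy loss
                rvx -= (1.0 + restitution) * vn * nx
                rvy -= (1.0 + restitution) * vn * ny
            vx, vy = rvx + wall_vx, rvy + wall_vy
            # Move the ball out of the wall so it cannot tunnel through next frame.
            px += (ball_r - dist) * nx
            py += (ball_r - dist) * ny
    return (px, py), (vx, vy)

The push-out along the wall normal is what prevents the fall-through behavior I saw, and reflecting the velocity relative to the moving wall (rather than a static one) is what lets the spinning hexagon actually transfer momentum to the ball.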
And the generation speed was around 1.35 tokens per second:
llama_perf_context_print: load time = 9746.62 ms
llama_perf_context_print: prompt eval time = 3818.63 ms / 43 tokens ( 88.81 ms per token, 11.26 tokens per second)
llama_perf_context_print: eval time = 1202554.57 ms / 1618 runs ( 743.24 ms per token, 1.35 tokens per second)
llama_perf_context_print: total time = 1310292.02 ms / 1661 tokens