Sunday, July 27, 2025

Foundations of Large Language Models

Here is a summary of "Foundations of Large Language Models", generated by Gemini CLI with gemini-2.5-pro.


Here is a summary for each chapter of "Foundations of Large Language Models":


### **Chapter 1: Pre-training**


Chapter 1 introduces pre-training as the foundational step for modern Large Language Models. It distinguishes between unsupervised, supervised, and the now-dominant self-supervised learning paradigms. Self-supervised pre-training allows models to learn from vast amounts of unlabeled text by creating their own supervision signals. The chapter details the primary self-supervised objectives, categorizing them by model architecture: decoder-only (e.g., causal language modeling), encoder-only (e.g., masked language modeling), and encoder-decoder (e.g., denoising).
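As a rough illustration of the masked language modeling objective, here is a minimal Python sketch, assuming a toy whitespace tokenizer, a 15% mask rate, and a made-up [MASK] symbol:

import random

def make_mlm_example(tokens, mask_token="[MASK]", mask_prob=0.15, seed=1):
    # Randomly replace tokens with a mask symbol; the original tokens become labels.
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)       # the model is trained to recover this token
        else:
            inputs.append(tok)
            labels.append(None)      # no loss is computed at unmasked positions
    return inputs, labels

print(make_mlm_example("the cat sat on the mat and purred".split()))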


A significant portion of the chapter uses BERT as a case study for an encoder-only model. It explains BERT's architecture and its two pre-training tasks: Masked Language Modeling (MLM), where the model predicts randomly masked tokens, and Next Sentence Prediction (NSP), which teaches the model to understand sentence relationships. The chapter also covers the evolution of BERT, discussing improvements through more data (RoBERTa), increased scale, efficiency enhancements, and multilingual capabilities. Finally, it outlines how these powerful pre-trained models are adapted for downstream tasks through methods like fine-tuning, where the model's weights are further adjusted on a smaller, task-specific dataset, or through prompting, which is explored in later chapters.
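To illustrate the adaptation step, here is a toy NumPy sketch that trains only a small classification head on top of frozen, made-up "encoder" features; full fine-tuning would also update the encoder weights, but the recipe of attaching a task-specific head is the same:

import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for sentence embeddings from a frozen pre-trained encoder
# (in practice, e.g. BERT's [CLS] vector), with binary task labels.
features = rng.normal(size=(64, 16))
labels = (features[:, 0] > 0).astype(float)

# Task-specific head: a logistic-regression layer trained by gradient descent.
w, b, lr = np.zeros(16), 0.0, 0.1
for _ in range(200):
    probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    grad = probs - labels                      # gradient of the cross-entropy loss
    w -= lr * features.T @ grad / len(labels)
    b -= lr * grad.mean()

print("training accuracy:", ((probs > 0.5) == labels).mean())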


### **Chapter 2: Generative Models**


Chapter 2 focuses on generative models, the class of LLMs like GPT that are designed to produce text. It traces the evolution from traditional n-gram models to the sophisticated neural network architectures of today. The core of modern generative LLMs, the decoder-only Transformer, is explained in detail. This architecture processes a sequence of tokens and predicts the next token in an auto-regressive fashion, generating text one token at a time. The chapter discusses the immense challenges of training these models at scale, which requires vast computational resources and distributed systems to manage the model parameters and data.
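A toy sketch of the auto-regressive loop, using a hand-built bigram table in place of a real Transformer; the point is only that text is produced one token at a time by repeatedly predicting a next token and appending it:

# Next-token counts standing in for a neural language model.
bigram_counts = {
    "the": {"cat": 3, "dog": 1},
    "cat": {"sat": 2, "ran": 1},
    "sat": {"on": 2},
    "on":  {"the": 2},
}

def generate(prompt, max_new_tokens=6):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        candidates = bigram_counts.get(tokens[-1])
        if not candidates:
            break                              # no known continuation
        tokens.append(max(candidates, key=candidates.get))  # greedy next-token choice
    return " ".join(tokens)

print(generate("the"))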


A key concept covered is "scaling laws," which describe the predictable relationship between a model's performance and increases in model size, dataset size, and computational budget. These laws have driven the trend toward building ever-larger models. The chapter also addresses a critical challenge: long-sequence modeling. The quadratic complexity of self-attention makes processing long texts computationally expensive. To overcome this, the chapter explores various techniques, including efficient attention mechanisms (e.g., sparse and linear attention), Key-Value (KV) caching for memory optimization, and advanced positional embeddings like RoPE that help models generalize to longer contexts than they saw during training.
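As a rough sketch of rotary position embeddings (RoPE), here is one common formulation in NumPy; real implementations differ in how dimension pairs are laid out and fuse this into the attention computation:

import numpy as np

def rope(x, base=10000.0):
    # Rotate pairs of dimensions by an angle proportional to the token position,
    # so relative offsets between positions are encoded in the dot products.
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)       # per-pair rotation frequency
    angles = np.outer(np.arange(seq_len), freqs)    # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).normal(size=(8, 16))   # 8 positions, 16 dimensions
print(rope(q).shape)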


### **Chapter 3: Prompting**


Chapter 3 delves into prompting, the method of guiding an LLM's behavior by providing it with a specific input, or "prompt." This technique is central to interacting with modern LLMs and has given rise to the field of prompt engineering. The chapter introduces the fundamental concept of in-context learning (ICL), where the model learns to perform a task from examples provided directly in the prompt, without any weight updates. This is demonstrated through zero-shot, one-shot, and few-shot learning paradigms.
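A hypothetical few-shot prompt for sentiment classification; the task is specified entirely by the instruction and the in-context examples, with no weight updates:

few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It stopped working after two days.
Sentiment: Negative

Review: The keyboard feels great but the speakers are tinny.
Sentiment:"""

print(few_shot_prompt)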


The chapter outlines several strategies for effective prompt design, such as providing clear instructions, specifying the desired format, and assigning a role to the model. It then explores advanced techniques that significantly enhance LLM reasoning. The most prominent of these is Chain of Thought (CoT) prompting, which encourages the model to generate a step-by-step reasoning process before giving a final answer, dramatically improving performance on complex tasks. Other advanced methods discussed include problem decomposition, self-refinement (where the model critiques and improves its own output), and the use of external tools and retrieval-augmented generation (RAG) to incorporate external knowledge. Finally, the chapter touches on methods for automating prompt creation, such as learning "soft prompts."
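A hypothetical chain-of-thought prompt; the worked example demonstrates intermediate reasoning steps, and the trailing "Let's think step by step" nudges the model to reason before answering:

cot_prompt = """Q: A library has 45 books and receives 3 boxes of 12 books each. How many books does it have now?
A: Each box has 12 books, so 3 boxes contain 3 * 12 = 36 books. Adding the original 45 books gives 45 + 36 = 81. The answer is 81.

Q: A train travels 60 km per hour for 2.5 hours. How far does it travel?
A: Let's think step by step."""

print(cot_prompt)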


### **Chapter 4: Alignment**


Chapter 4 addresses the critical process of alignment, which ensures that an LLM's behavior aligns with human values and intentions, making it helpful, harmless, and honest. This process goes beyond simple task performance and is crucial for the safe deployment of LLMs. The chapter outlines two primary methodologies for achieving alignment after the initial pre-training phase.


The first method is **Instruction Alignment**, achieved through Supervised Fine-Tuning (SFT). In SFT, the pre-trained model is further trained on a high-quality dataset of curated instruction-response pairs, teaching it to follow directions effectively. The second, more complex method is **Human Preference Alignment**, most commonly implemented via Reinforcement Learning from Human Feedback (RLHF). RLHF involves a multi-step process:

1. An initial model is used to generate multiple responses to a prompt.
2. Humans rank these responses based on preference.
3. A separate "reward model" is trained on this human preference data to predict which outputs humans would prefer.
4. The reward model is then used to fine-tune the original LLM with reinforcement learning algorithms like PPO, optimizing the model to generate outputs that maximize the predicted reward.

The chapter also introduces Direct Preference Optimization (DPO) as a more direct and less complex alternative to PPO-based RLHF.
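A minimal sketch of the per-pair DPO objective, using made-up sequence log-probabilities under the current policy and the frozen reference model:

import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # The policy is pushed to widen the log-probability margin of the chosen
    # response over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))   # -log sigmoid(beta * margin)

print(dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-14.0))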


### **Chapter 5: Inference**


Chapter 5 focuses on inference, the process of using a trained LLM to generate predictions. The core of LLM inference is divided into a two-phase framework: **Prefilling and Decoding**. In the prefilling phase, the input prompt is processed in a highly parallelized, single pass to compute the initial Key-Value (KV) cache. This phase is compute-bound. The subsequent decoding phase is an auto-regressive, token-by-token generation process that uses this KV cache. This phase is memory-bound due to the large memory footprint of the cache and the sequential nature of the generation.
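A toy NumPy sketch of the prefill/decode split for a single attention head with made-up weights; real engines keep a cache per layer and per head, but the idea is the same:

import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d))

# Prefill: process the whole prompt in one parallel pass and build the KV cache.
prompt_embeddings = rng.normal(size=(5, d))          # embeddings of 5 prompt tokens
k_cache = prompt_embeddings @ Wk
v_cache = prompt_embeddings @ Wv

# Decode: each new token computes only its own key/value and appends it to the
# cache, instead of recomputing keys and values for the whole sequence.
for step in range(3):
    new_token = rng.normal(size=(1, d))              # embedding of the latest token
    k_cache = np.vstack([k_cache, new_token @ Wk])
    v_cache = np.vstack([v_cache, new_token @ Wv])
    scores = np.exp((new_token @ Wq) @ k_cache.T / np.sqrt(d))
    context = (scores / scores.sum()) @ v_cache      # attend over every cached position
    print("step", step, "cache length", len(k_cache))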


The chapter details the various search strategies, or decoding algorithms, used to select the next token at each step. These include deterministic methods like greedy search and beam search, as well as stochastic sampling methods like top-k and top-p (nucleus) sampling, which introduce diversity into the output. To improve efficiency, the chapter covers advanced batching techniques like continuous batching and PagedAttention, which optimize GPU utilization by dynamically managing requests of varying lengths. It also explains speculative decoding, a method that uses a smaller, faster "draft" model to generate candidate tokens that are then verified in a single pass by the larger, more powerful model, significantly accelerating the decoding process.
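A minimal sketch of top-p (nucleus) sampling over a toy next-token distribution:

import numpy as np

def top_p_sample(probs, p, rng):
    # Keep the smallest set of tokens whose cumulative probability reaches p,
    # renormalize within that set, and sample from it.
    order = np.argsort(probs)[::-1]                   # token ids sorted by probability
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    nucleus = order[:cutoff]
    return rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum())

vocab_probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])   # toy distribution over 5 tokens
print(top_p_sample(vocab_probs, p=0.9, rng=np.random.default_rng(0)))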

Monday, June 2, 2025

Trying out Qwen3 30B model

Running the Qwen3 30B Q8 model locally with llama.cpp is really responsive. Even on my Ryzen 5600G's integrated graphics, it runs at almost 8 tokens per second.

Since my machine has 96GB of memory (with 70GB allocatable for graphics), I could even leave it running in the background all the time. With a context size of 40960 and flash attention enabled, it only takes about 36GB of GPU memory when fully loaded to the GPU.


./llama-cli -m ../../models/Qwen3-30B-A3B-Q8_0.gguf --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 40960 -n 32768 --no-context-shift


The only downside, as usual for a model from Alibaba in China, is that it refuses to answer "sensitive questions"...

Wednesday, April 2, 2025

Running llama.cpp locally to test code suggestions

 

Yesterday I was playing with the Qwen 2.5 70B model to generate some Python code but it didn't work.

So today I tried running the llama.cpp server to see if using the Continue plugin with Visual Studio Code would help fix the code.

To make it more responsive, I ran the server with a 7B model:

./llama-server -m ../../models/qwen2.5-coder-7b-instruct-q8_0.gguf -c 4096 --host 0.0.0.0 -ngl 99

After pointing Continue to the local llama.cpp server and asking it to fix the code, it replied with some convincing suggestions.



However, it still didn't work. So I followed up again...



And third time is the charm? Nope.




Tuesday, April 1, 2025

Trying out code generation with Qwen 2.5 32B

I was playing with some of the smaller LLM models (Llama 2 7B Q4, Llama 3.1 8B Q8, etc.) and found that they hallucinate a lot (as expected).

So I decided to add more RAM to my PC to run a bigger model. My computer is a Ryzen 5 5600G with integrated graphics. Although it definitely lacks GPU computation power, one benefit of an AMD iGPU is that it can access more system memory as needed via GTT. On Linux, by default it can allocate up to 50% of system memory to the GPU (more if the amdgpu.gttsize kernel parameter is specified).



So with 96GB of system memory, I was able to run Qwen Coder 2.5 32B to try out some code generation.

./llama-cli -m ../../models/qwen2.5-coder-32b-instruct-q8_0.gguf -c 16384 -e -ngl 99

Here is a prompt I found somewhere on the internet:

Write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically

The result was much better than with Qwen 2.5 14B, but the generated code was still wrong. It didn't get the physics right: the ball simply fell through the hexagon.



And the LLM output speed was around 1.35 tokens per second.

llama_perf_sampler_print:    sampling time =     182.90 ms /  1662 runs   (    0.11 ms per token,  9087.18 tokens per second)
llama_perf_context_print:        load time =    9746.62 ms
llama_perf_context_print: prompt eval time =    3818.63 ms /    43 tokens (   88.81 ms per token,    11.26 tokens per second)
llama_perf_context_print:        eval time = 1202554.57 ms /  1618 runs   (  743.24 ms per token,     1.35 tokens per second)
llama_perf_context_print:       total time = 1310292.02 ms /  1661 tokens


Thursday, March 27, 2025

Building llama.cpp with vulkan on openSUSE



When trying to run llama.cpp locally, I found that the instructions for building the Docker image with Vulkan acceleration don't work on my openSUSE Tumbleweed machine.

Instead, I needed to build and run the client directly on my host machine.

First, make sure both the "vulkan-devel" and "shaderc" packages are installed.

Next, build it with Vulkan enabled:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DGGML_VULKAN=on -DCMAKE_BUILD_TYPE=Release
make

The client should detect and use the GPU via the Vulkan library.

[~/work/llama.cpp/build/bin] $ ./llama-cli -m ../../models/Meta-Llama-3.1-8B-Instruct-Q5_K_L.gguf -p "Building a website can be done in 10 simple steps:" -n 600 -e -ngl 99  

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none
build: 4967 (f17a3bb4) with cc (SUSE Linux) 14.2.1 20250220 [revision 9ffecde121af883b60bbe60d00425036bc873048] for x86_64-suse-linux
main: llama backend init
......





Friday, December 27, 2024

Trying out vector embeddings

newsSum is a Google App Engine application that bundles articles from different news sources. To try out ML embeddings, I decided to add a suggestions service.


High-level idea



  • There will be no changes to the backend of "newssum". The suggestion service "newssum-sug" will be implemented as a separate service
  • The frontend of "newssum" will check if the suggestion service "newssum-sug" is available. If so, it will allow the user to expand an article and query the suggestion service for additional information to display


Implementation of the suggestion service

  • Technically, "newssum-sug" could gather suggestions from any source (e.g. Google search results, a YouTube video, etc.). But for now, it will process articles from selected "newssum" sources. So there will be scheduled tasks to collect articles from "newssum" and prepare them for searching.
  • Vector embeddings will be used to find similar articles. A machine learning model turns a news headline into a vector of numbers. When a query comes in, an embedding is also generated from that query. By comparing the distance between vectors, we can find articles related to the query (see the sketch after this list).
  • The embeddings generated during batch processing are stored in a vector database. The database will also provide the mechanism for searching vectors by distance.
  • Since "newssum" is for current news only, embeddings will only be kept for 2 days.
  • The suggestion service can also be used for free-text search. But for now, the frontend only uses it for article suggestions.
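
Here is a toy Python sketch of the vector-search idea with made-up embeddings; the actual service uses a real embedding model and a vector database, but the distance comparison works the same way:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional "embeddings" standing in for a real model's output;
# real sentence embeddings typically have hundreds of dimensions.
headlines = {
    "City council approves new transit plan": np.array([0.9, 0.1, 0.0, 0.2]),
    "Local team wins championship final":     np.array([0.1, 0.8, 0.3, 0.0]),
    "Bus routes expanded under transit plan": np.array([0.8, 0.2, 0.1, 0.3]),
}
query = np.array([0.85, 0.15, 0.05, 0.25])            # embedding of the user's query

# Rank headlines by similarity to the query, most related first.
for headline, vec in sorted(headlines.items(), key=lambda kv: -cosine_similarity(query, kv[1])):
    print(round(cosine_similarity(query, vec), 3), headline)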

While "newssum" is open source, the "newssum-sug" service is still under development in closed source. But the basic functionality has been integrated and available on the demo site.

Tuesday, December 24, 2024

A machine learning model to identify music albums from photos

 




I was looking for a home automation project to select and play a specific music album from streaming services. There are similar ideas using NFC tags: prepare some NFC tags with album/movie cover art on them, and putting a tag on the reader triggers playback of that album/movie. While that brings the joy of handling and selecting a physical collection, it costs money and time to prepare those NFC tags, and I wanted to avoid that.

Since we now have all these machine learning models and classifiers, I thought I could just train a model to look at a webcam photo of a record / CD and tell me the Spotify link to play that album.

BTW, I know Microsoft Copilot (or maybe OpenAI too) can do it without any special training, but I didn't want to pay extra for that and just wanted to host the model on my own machines.

I imagine it will be something like this:

I put an album in front of a webcam...


... and the model will tell me the Spotify URL to pass on to the music streamer


Long story short, my model can identify my music collection with 98% correctness (more on that later). If you are interested in the technical details and the scripts used to train the model, they are available on GitHub: https://github.com/kitsook/AlbumSpotter

But eventually I didn't integrate this into my home automation, which is kind of related to the correctness. When I get a new CD / vinyl record, I always add it to my collection on Spotify, so I can just get the cover art from Spotify to train my model. But then I discovered there are at least two problems that affect the correctness:

  • there are many editions of the same album. e.g. I could have a physical CD of the standard edition but my Spotify collection has the extended edition with a different song list
  • nowadays artists tend to release an album with several "special" cover arts. My physical copy could look totally different from the one on Spotify

That means I will need to clean up the data for a more accurate result. As procrastination kicks in, I am stopping the project at the machine learning model; the home automation part will be a future project.


Monday, September 23, 2024

OpenSUSE Tumbleweed - another bug


 


Tumbleweed used to be great over the past eight years or so that I have used it on my main desktop. But recently it is just bug after bug with every update.

Here is another one after updating today: the KDE screen locker crashed with a segfault and failed to unlock the screen. Seems to be PAM library related.




Saturday, June 15, 2024

Google cloud AppEngine deployment issue with cache



Notes to self: when deploying Google Cloud applications, if you encounter the following "no such object" error, it is because the deployment process is looking for a cache that doesn't exist.

"build": ERROR: failed to initialize analyzer: getting previous image: getting config file for image

To solve it, deploy with the "--no-cache" parameter.

Friday, February 23, 2024

Segfault with bluez and A2DP


Currently (as of Feb 23, 2024), openSUSE Tumbleweed with bluez 5.71-2.4 segfaults whenever I try to connect my Sony WH-H800 headphones (but is fine with others).

After some digging, it seems to be caused by a recent bug. Waiting for Tumbleweed to include the patch.

[Fri Feb 23 16:23:24 2024] bluetoothd[9971]: segfault at 5600bf0628b5 ip 00005605de5743c1 sp 00007ffee8f18f70 error 4 in bluetoothd[5605de552000+d6000] likely on CPU 7 (core 1, socket 0)
[Fri Feb 23 16:23:24 2024] Code: 41 83 2c 24 01 0f 85 1b ff ff ff 4c 89 e7 e8 96 f4 fd ff e9 0e ff ff ff 90 41 55 41 54 55 53 48 83 ec 08 48 8b 2a 48 8b 7a 08 <48> 8b 45 20 4c 8b ad 88 00 00 00 4c 8b 20 48 85 ff 74 19 c7 47 08

$ sudo coredumpctl info 9971
          PID: 9971 (bluetoothd)
          UID: 0 (root)
          GID: 0 (root)
       Signal: 11 (SEGV)
    Timestamp: Fri 2024-02-23 16:23:24 PST (4h 2min ago)
 Command Line: /usr/libexec/bluetooth/bluetoothd
   Executable: /usr/libexec/bluetooth/bluetoothd
Control Group: /system.slice/bluetooth.service
         Unit: bluetooth.service
        Slice: system.slice
      Boot ID: 153713284dde4d7cba57f31e2956690d
   Machine ID: 5c42528e25094a3cb1af7e2c43a85357
     Hostname: linux-lct7
      Storage: /var/lib/systemd/coredump/core.bluetoothd.0.153713284dde4d7cba57f31e2956690d.9971.1708734204000000.zst (present)
 Size on Disk: 138.1K
      Message: Process 9971 (bluetoothd) of user 0 dumped core.
                
               Stack trace of thread 9971:
               #0  0x00005605de5743c1 n/a (bluetoothd + 0x463c1)
               #1  0x00005605de55f4d0 n/a (bluetoothd + 0x314d0)
               #2  0x00005605de55f5a8 n/a (bluetoothd + 0x315a8)
               #3  0x00005605de569767 n/a (bluetoothd + 0x3b767)
               #4  0x00007fc0b22daf30 n/a (libglib-2.0.so.0 + 0x5bf30)
               #5  0x00007fc0b22dcb58 n/a (libglib-2.0.so.0 + 0x5db58)
               #6  0x00007fc0b22dd42f g_main_loop_run (libglib-2.0.so.0 + 0x5e42f)
               #7  0x00005605de5555dc n/a (bluetoothd + 0x275dc)
               #8  0x00007fc0b1e2a1f0 __libc_start_call_main (libc.so.6 + 0x2a1f0)
               #9  0x00007fc0b1e2a2b9 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2a2b9)
               #10 0x00005605de5566b5 n/a (bluetoothd + 0x286b5)
               ELF object binary architecture: AMD x86-64