Sunday, July 27, 2025

Foundations of Large Language Models

Here is a summary of Foundations of Large Language Models, generated by Gemini CLI with gemini-2.5-pro.


Here is a summary for each chapter of "Foundations of Large Language Models":


### **Chapter 1: Pre-training**


Chapter 1 introduces pre-training as the foundational step for modern Large Language Models. It distinguishes between unsupervised, supervised, and the now-dominant self-supervised learning paradigms. Self-supervised pre-training allows models to learn from vast amounts of unlabeled text by creating their own supervision signals. The chapter details the primary self-supervised objectives, categorizing them by model architecture: decoder-only (e.g., causal language modeling), encoder-only (e.g., masked language modeling), and encoder-decoder (e.g., denoising).
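As a rough illustration of the decoder-only objective, here is a minimal sketch (not from the book) of causal language modeling as next-token prediction: the targets are the inputs shifted by one position, and the loss is cross-entropy over the vocabulary. The random logits stand in for whatever decoder-only network is being trained.

```python
import torch
import torch.nn.functional as F

token_ids = torch.tensor([[5, 21, 7, 93, 2]])        # a toy tokenized sentence
inputs, targets = token_ids[:, :-1], token_ids[:, 1:] # predict token t+1 from tokens 1..t

# A real model maps `inputs` to logits of shape (batch, seq_len, vocab_size);
# random logits are used here only to make the sketch runnable.
vocab_size = 100
logits = torch.randn(1, inputs.size(1), vocab_size)

# Next-token prediction loss: cross-entropy between logits and shifted targets.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```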


A significant portion of the chapter uses BERT as a case study for an encoder-only model. It explains BERT's architecture and its two pre-training tasks: Masked Language Modeling (MLM), where the model predicts randomly masked tokens, and Next Sentence Prediction (NSP), which teaches the model to understand sentence relationships. The chapter also covers the evolution of BERT, discussing improvements through more data (RoBERTa), increased scale, efficiency enhancements, and multilingual capabilities. Finally, it outlines how these powerful pre-trained models are adapted for downstream tasks through methods like fine-tuning, where the model's weights are further adjusted on a smaller, task-specific dataset, or through prompting, which is explored in later chapters.
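The masking procedure behind MLM is easy to sketch. The function below is an illustrative approximation of BERT's published recipe (select roughly 15% of positions; of those, replace 80% with a [MASK] token, 10% with a random token, and leave 10% unchanged); the `mask_id` argument and tensor shapes are assumptions for the example, and -100 is PyTorch's conventional ignore index for positions excluded from the loss.

```python
import torch

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    """Illustrative BERT-style masking over a LongTensor of token ids."""
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape) < mlm_prob
    labels[~selected] = -100                      # only selected positions contribute to the loss

    replace_mask = selected & (torch.rand(token_ids.shape) < 0.8)
    token_ids[replace_mask] = mask_id             # 80% of selected: [MASK]

    random_mask = selected & ~replace_mask & (torch.rand(token_ids.shape) < 0.5)
    token_ids[random_mask] = torch.randint(vocab_size, (int(random_mask.sum()),))  # 10%: random token
    return token_ids, labels                      # remaining 10%: left unchanged
```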


### **Chapter 2: Generative Models**


Chapter 2 focuses on generative models, the class of LLMs like GPT that are designed to produce text. It traces the evolution from traditional n-gram models to the sophisticated neural network architectures of today. The core of modern generative LLMs, the decoder-only Transformer, is explained in detail. This architecture processes a sequence of tokens and predicts the next token in an auto-regressive fashion, generating text one token at a time. The chapter discusses the immense challenges of training these models at scale, which requires vast computational resources and distributed systems to manage the model parameters and data.
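A minimal sketch of that auto-regressive loop, assuming a `model` that maps a batch of token ids to per-position logits, looks like this (greedy selection is used here for simplicity; the sampling strategies are covered in Chapter 5):

```python
import torch

@torch.no_grad()
def generate_greedy(model, prompt_ids, max_new_tokens=32, eos_id=None):
    """Feed the sequence, take the most likely next token, append it, repeat.
    `model` is assumed to map (1, seq_len) token ids to (1, seq_len, vocab) logits."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)                       # (1, seq_len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)   # append the chosen token
        if eos_id is not None and next_id.item() == eos_id:
            break
    return ids
```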


A key concept covered is "scaling laws," which describe the predictable relationship between a model's performance and increases in model size, dataset size, and computational budget. These laws have driven the trend toward building ever-larger models. The chapter also addresses a critical challenge: long-sequence modeling. The quadratic complexity of self-attention makes processing long texts computationally expensive. To overcome this, the chapter explores various techniques, including efficient attention mechanisms (e.g., sparse and linear attention), Key-Value (KV) caching for memory optimization, and advanced positional embeddings like RoPE that help models generalize to longer contexts than they saw during training.
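To make the last point concrete, here is a minimal, single-sequence RoPE sketch (interleaved-pair convention, default base 10000), not taken from the book; real implementations apply this rotation to the query and key projections inside each attention layer.

```python
import torch

def rotary_embed(x, base=10000.0):
    """Rotate pairs of feature dimensions of `x` (shape: seq_len, dim, dim even)
    by a position-dependent angle, so relative positions are encoded in dot products."""
    seq_len, dim = x.shape
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)   # (dim/2,)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq    # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                               # interleaved pairs
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).reshape(seq_len, dim)

q_rotated = rotary_embed(torch.randn(128, 64))   # e.g. queries for a 128-token sequence
```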


### **Chapter 3: Prompting**


Chapter 3 delves into prompting, the method of guiding an LLM's behavior by providing it with a specific input, or "prompt." This technique is central to interacting with modern LLMs and has given rise to the field of prompt engineering. The chapter introduces the fundamental concept of in-context learning (ICL), where the model learns to perform a task from examples provided directly in the prompt, without any weight updates. This is demonstrated through zero-shot, one-shot, and few-shot learning paradigms.
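A few-shot prompt makes the idea concrete. The example below is a hypothetical sentiment-classification prompt: the task is specified entirely by the in-context examples, and the model is expected to continue the pattern without any weight updates.

```python
# Illustrative few-shot prompt for in-context learning.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It stopped working after a week and support never replied.
Sentiment: Negative

Review: Setup took five minutes and everything just worked.
Sentiment:"""
# Sending this prompt to an LLM should yield "Positive" as the continuation.
```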


The chapter outlines several strategies for effective prompt design, such as providing clear instructions, specifying the desired format, and assigning a role to the model. It then explores advanced techniques that significantly enhance LLM reasoning. The most prominent of these is Chain of Thought (CoT) prompting, which encourages the model to generate a step-by-step reasoning process before giving a final answer, dramatically improving performance on complex tasks. Other advanced methods discussed include problem decomposition, self-refinement (where the model critiques and improves its own output), and the use of external tools and retrieval-augmented generation (RAG) to incorporate external knowledge. Finally, the chapter touches on methods for automating prompt creation, such as learning "soft prompts."
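A zero-shot CoT prompt can be as simple as appending a reasoning cue to the question, as in this illustrative example (the arithmetic word problem and the cue phrasing are common choices in the CoT literature, not taken from the book):

```python
# Illustrative zero-shot Chain-of-Thought prompt.
cot_prompt = (
    "Q: A cafeteria had 23 apples. It used 20 to make lunch and bought 6 more. "
    "How many apples does it have now?\n"
    "A: Let's think step by step."
)
# A typical completion first reasons (23 - 20 = 3, then 3 + 6 = 9)
# and only then states the final answer: 9.
```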


### **Chapter 4: Alignment**


Chapter 4 addresses the critical process of alignment, which ensures that an LLM's behavior aligns with human values and intentions, making it helpful, harmless, and honest. This process goes beyond simple task performance and is crucial for the safe deployment of LLMs. The chapter outlines two primary methodologies for achieving alignment after the initial pre-training phase.


The first method is **Instruction Alignment**, achieved through Supervised Fine-Tuning (SFT). In SFT, the pre-trained model is further trained on a high-quality dataset of curated instruction-response pairs, teaching it to follow directions effectively. The second, more complex method is **Human Preference Alignment**, most commonly implemented via Reinforcement Learning from Human Feedback (RLHF). RLHF involves a multi-step process:

1. An initial model is used to generate multiple responses to a prompt.
2. Humans rank these responses based on preference.
3. A separate "reward model" is trained on this human preference data to predict which outputs humans would prefer.
4. The reward model is then used to fine-tune the original LLM with reinforcement learning algorithms like PPO, optimizing the model to generate outputs that maximize the predicted reward.

The chapter also introduces Direct Preference Optimization (DPO) as a more direct and less complex alternative to PPO-based RLHF.
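For a sense of how DPO sidesteps the reward-model-plus-RL pipeline, here is a sketch of its per-pair loss, assuming the summed log-probabilities of the chosen (w) and rejected (l) responses have already been computed under the trainable policy and a frozen reference model (`beta` is the usual temperature-like hyperparameter; the numeric values below are placeholders).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO objective on one preference pair, expressed via implicit rewards."""
    chosen_reward = beta * (policy_logp_w - ref_logp_w)
    rejected_reward = beta * (policy_logp_l - ref_logp_l)
    # Push the chosen response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_reward - rejected_reward)

loss = dpo_loss(torch.tensor(-12.3), torch.tensor(-15.1),
                torch.tensor(-13.0), torch.tensor(-14.8))
```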


### **Chapter 5: Inference**


Chapter 5 focuses on inference, the process of using a trained LLM to generate output. LLM inference follows a two-phase framework: **Prefilling and Decoding**. In the prefilling phase, the input prompt is processed in a single, highly parallelized pass to compute the initial Key-Value (KV) cache; this phase is compute-bound. The subsequent decoding phase is an auto-regressive, token-by-token generation process that reuses the KV cache; it is memory-bound, due to the large memory footprint of the cache and the sequential nature of generation.
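A toy single-head attention sketch (not from the book) shows why the split matters: prefill computes keys and values for the whole prompt in one parallel pass, while each decode step only projects the newest token and appends to the cache instead of recomputing everything.

```python
import torch

d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))   # toy projection weights

def prefill(prompt_hidden):
    """Prefill: one parallel pass over the prompt, shape (prompt_len, d),
    returning its keys and values as the initial cache."""
    return prompt_hidden @ W_k, prompt_hidden @ W_v

def decode_step(new_hidden, k_cache, v_cache):
    """Decode: project only the newest token (1, d), extend the cache,
    and attend over the cached sequence."""
    q = new_hidden @ W_q
    k_cache = torch.cat([k_cache, new_hidden @ W_k], dim=0)
    v_cache = torch.cat([v_cache, new_hidden @ W_v], dim=0)
    scores = torch.softmax(q @ k_cache.T / d**0.5, dim=-1)
    return scores @ v_cache, k_cache, v_cache

k, v = prefill(torch.randn(10, d))                   # 10-token prompt
out, k, v = decode_step(torch.randn(1, d), k, v)     # one generated token
```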


The chapter details the various search strategies, or decoding algorithms, used to select the next token at each step. These include deterministic methods like greedy search and beam search, as well as stochastic sampling methods like top-k and top-p (nucleus) sampling, which introduce diversity into the output. To improve efficiency, the chapter covers advanced batching techniques like continuous batching and PagedAttention, which optimize GPU utilization by dynamically managing requests of varying lengths. It also explains speculative decoding, a method that uses a smaller, faster "draft" model to generate candidate tokens that are then verified in a single pass by the larger, more powerful model, significantly accelerating decoding.
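As one concrete example of a stochastic strategy, here is an illustrative top-p (nucleus) sampling routine over a single vector of next-token logits; the temperature parameter and the "keep at least one token" guard are common implementation choices rather than details from the book.

```python
import torch

def sample_top_p(logits, p=0.9, temperature=1.0):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    then sample the next token from that renormalized set."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens once the cumulative mass before them already exceeds p
    # (the first token is always kept).
    cutoff = (cumulative - sorted_probs) >= p
    sorted_probs[cutoff] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice]

next_token = sample_top_p(torch.randn(100))   # toy 100-token vocabulary
```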