Fine tuning foundation models
2025-01-23 — 2025-07-23
The alignment of large language models (LLMs) with human values and intentions is a critical problem in contemporary AI research. I don’t truly believe that alignment is well-posed in the abstract, but for the moment let’s be satisfied with steering a model’s behaviour to better match a user’s intent or a set of specified principles, such as being “helpful, harmless, and truthful” (Ouyang et al. 2022).
Pre-training on vast text corpora makes these models swole with wordy power, but notoriously that power may not do what you wish (steer into a very 4chan part of the parameter space, for example, and it is not going to produce great bedtime stories for kids). More formally, the pre-training objective is misaligned with what an average user wants. Fine-tuning with human feedback, particularly through methods like Reinforcement Learning from Human Feedback (RLHF), is the archetypal way to fix this.
I’m no expert in RLHF, but I’m keeping some notes.
1 Formalizing
A pre-trained autoregressive language model is a parameterized probability distribution \(p_\theta\) over a vocabulary \(\mathcal{V}\). Given a sequence of tokens (a prompt) \(x = (x_1, \dots, x_k)\), the model generates a continuation \(y = (y_1, \dots, y_m)\) by sequentially sampling tokens: \[ p_\theta(y \mid x) = \prod_{i=1}^{m} p_\theta(y_i \mid x, y_1, \dots, y_{i-1}) \] The parameters \(\theta\) are learned during a pre-training phase, typically by maximizing the log-likelihood of a massive text corpus \(\mathcal{D}_{\text{pretrain}}\): \[ \max_\theta \sum_{x \in \mathcal{D}_{\text{pretrain}}} \log p_\theta(x) \]

This objective trains the model to be an excellent predictor of the “average” text on the internet. However, this is not the same as being helpful or following instructions: some text on the internet is helpful with respect to what came before it, and some of it is other stuff, e.g. arseholes shouting at one another. Being in a chat session is not the typical origin of text on the internet, but it is what the model is supposed to do. A model trained this way might produce outputs that are untruthful, toxic, or simply bizarre. We want to take a model that is great at sampling from all possible sentences and instead have it interact usefully with a user, where “useful” means in practice something like “they will pay for it”.
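To make the factorisation concrete, here is a minimal sketch of scoring a continuation \(y\) under \(p_\theta(y \mid x)\), assuming the Hugging Face transformers library and the small public gpt2 checkpoint (any causal LM would do; the prompt and continuation are made up):

```python
# Score a continuation y under p_theta(y | x) for a pre-trained causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Once upon a time, a helpful model"
continuation = " answered the user's question."

x = tok(prompt, return_tensors="pt").input_ids        # (1, k)
y = tok(continuation, return_tensors="pt").input_ids  # (1, m)
ids = torch.cat([x, y], dim=1)                        # full sequence, (1, k + m)

with torch.no_grad():
    logits = model(ids).logits                        # (1, k + m, |V|)

# The logit at position t predicts token t + 1, so shift by one and keep
# only the positions that predict continuation tokens.
logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)
token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
log_p_y_given_x = token_lp[:, x.shape[1] - 1:].sum()  # sum_i log p(y_i | x, y_<i)
print(log_p_y_given_x.item())
```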
The goal of fine-tuning is to adapt the pre-trained model parameters, which we’ll call \(\theta_{\text{SFT}}\) (for Supervised Fine-Tuned, a common starting point), into a new set of parameters \(\theta_{\text{aligned}}\) that produces more desirable outputs. The challenge is that “desirability” is not defined by a simple likelihood objective but by latent human preferences, which are hard to elicit (cue economists sighing). RLHF is a framework that solves this by learning a reward function directly from those preferences and then throwing reinforcement learning into the mix.
2 RLHF origins
The core idea is to translate human preferences between two model outputs, a preferred \(y_w\) and a rejected \(y_l\), both sampled from \(p_{\theta_{\text{SFT}}}(y \mid x)\), into a reward signal for RL.
- Learning from Human Preferences: The paper by Christiano et al. (2017) is the direct intellectual ancestor of RLHF for LLMs. It established the core loop of learning a reward model from human pairwise comparisons and optimizing a policy against it, albeit in the context of Atari games and simulated robotics.
- Application to Summarization: Stiennon et al. (2020) provides a clear, focused application of these ideas to text summarization, detailing the data collection and training process. Maybe a nice concrete example?
- Scaling to General Instructions (InstructGPT): The InstructGPT paper (Ouyang et al. (2022)) is a seminal work demonstrating how RLHF can be scaled to train models to follow a wide range of instructions. Pay close attention to their three-step process:
- Supervised Fine-Tuning (SFT): Collecting a dataset of human-written demonstrations and fine-tuning the base LLM.
- Reward Model (RM) Training: Training a model to predict which of two model outputs a human would prefer. The reward model learns a function \(r_\phi(x, y)\) that assigns a scalar score to an output.
- Reinforcement Learning (PPO): The core RLHF step. The SFT model is fine-tuned using Proximal Policy Optimization (Schulman et al. 2017) to maximize the reward from the RM, with a KL-divergence penalty to prevent the policy from straying too far from the initial SFT model. The objective is roughly \(\mathbb{E}_{y \sim \pi_{\text{RL}}}[r_\phi(x, y)] - \beta\, \text{KL}(\pi_{\text{RL}}(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x))\); both this and the RM loss are sketched in code after this list.
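Here is a toy sketch of the step-2 and step-3 objectives. The tensors are placeholders standing in for real model outputs, and the names are mine rather than anything from the InstructGPT codebase; real implementations apply the KL penalty per token and leave the actual optimization to PPO.

```python
# Toy version of the RM pairwise loss and the KL-regularised reward.
import torch
import torch.nn.functional as F

# --- Step 2: reward-model loss on a batch of preference pairs ------------
r_w = torch.randn(4)  # r_phi(x, y_w) for four prompts (placeholder)
r_l = torch.randn(4)  # r_phi(x, y_l)
rm_loss = -F.logsigmoid(r_w - r_l).mean()  # Bradley-Terry / pairwise logistic loss

# --- Step 3: KL-regularised reward maximised by the RL policy ------------
logp_rl = torch.randn(4, 16)   # log pi_RL(y_i | x, y_<i) per token (placeholder)
logp_sft = torch.randn(4, 16)  # log pi_SFT(y_i | x, y_<i) per token (placeholder)
r = torch.randn(4)             # r_phi(x, y) for the sampled responses (placeholder)
beta = 0.1

kl = (logp_rl - logp_sft).sum(dim=-1)  # single-sample KL estimate per sequence
objective = (r - beta * kl).mean()     # PPO then (approximately) maximises this
```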
3 Recent fancy things
- Principled RLHF: Zhu, Jordan, and Jiao (2023) provides a theoretical framework for RLHF, discussing the use of preference models like Bradley-Terry-Luce (BTL) for handling pairwise and K-wise comparisons.
- Direct Preference Optimization (DPO): The RL step in RLHF can be complex and unstable. Rafailov et al. (2023) introduces Direct Preference Optimization (DPO), a famously elegant way to bypass the need for an explicit reward model and RL. DPO shows that the constrained reward maximization objective of PPO can be optimized directly with a simple binary classification loss on preference data (sketched after this list). This went very viral and I keep meaning to read it properly.
- Pitfalls in Implementation: Tang and Munos (2025) discusses common pitfalls in implementing the KL divergence gradient estimation used in RL, which is a key component of PPO-based fine-tuning; see the estimator sketch below.
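As promised, a rough sketch of the DPO loss on placeholder log-probabilities; sequence-level log-probs under the policy and a frozen reference model are assumed to be precomputed, and the variable names are mine:

```python
# DPO: logistic regression on implicit rewards beta * log(pi / pi_ref).
import torch
import torch.nn.functional as F

beta = 0.1
policy_w, policy_l = torch.randn(8), torch.randn(8)  # log pi_theta(y_w|x), log pi_theta(y_l|x)
ref_w, ref_l = torch.randn(8), torch.randn(8)        # same under the frozen reference model

margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
dpo_loss = -F.logsigmoid(margin).mean()
```

The whole apparatus of reward model plus PPO collapses into a binary classification loss on log-probability ratios, which is why everyone got so excited.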
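And the estimator sketch mentioned above: the KL penalty has to be estimated from samples, and the choice of estimator matters. Tang and Munos are mostly concerned with the gradient rather than the value, but for orientation, here are the three single-sample estimators of \(\text{KL}(\pi_{\text{RL}} \,\|\, \pi_{\text{SFT}})\) in common circulation (the usual k1/k2/k3 naming; the log-probs are placeholders):

```python
# Single-sample estimators of KL(pi_RL || pi_SFT) from samples y ~ pi_RL.
import torch

logp_rl = torch.randn(1024)   # log pi_RL(y | x) for sampled responses (placeholder)
logp_sft = torch.randn(1024)  # log pi_SFT(y | x) for the same responses (placeholder)

logr = logp_sft - logp_rl   # log of the ratio pi_SFT / pi_RL
k1 = -logr                  # unbiased, high variance, can go negative
k2 = 0.5 * logr ** 2        # biased, but low variance
k3 = logr.exp() - 1 - logr  # unbiased and always non-negative

print(k1.mean(), k2.mean(), k3.mean())
```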
4 Weird stuff I don’t know where to file yet
- SFT Memorizes, RL Generalizes?: Why do we do both SFT and RL? Chu et al. (2025) is a comparative study suggesting that SFT may lead to memorization of the training data formats, while RL fosters better generalization to unseen rules and visual domains.
- Application to Big-A Alignment: The ultimate goal of alignment is to create ethically robust AI systems. Plasencia (2024) gives an overview of these challenges. A simple generalisation of RLHF is Constitutional AI (Bai et al. 2022), which proposes using an AI model itself to provide the preference labels, guided by a human-written “constitution” or set of principles. This reduces the reliance on large-scale human labeling for every decision.
5 Bonus time: RL papers I should probably read
- Proximal Policy Optimization (PPO): The original paper by Schulman et al. (2017) is the source of the RL algorithm used in almost all major RLHF works, including InstructGPT. Understanding why PPO is preferred (due to its stability and sample efficiency compared to vanilla policy gradients) is crucial.
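For reference, the heart of PPO is the clipped surrogate objective; a sketch on placeholder tensors (in a real implementation the advantages would come from a learned value function, e.g. via GAE):

```python
# PPO's clipped surrogate loss (to be minimised).
import torch

eps = 0.2
logp_new = torch.randn(64)    # log pi_theta(a | s) under the current policy (placeholder)
logp_old = torch.randn(64)    # log-probs recorded when the actions were sampled (placeholder)
advantages = torch.randn(64)  # advantage estimates (placeholder)

ratio = (logp_new - logp_old).exp()
surrogate = torch.min(ratio * advantages,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
ppo_loss = -surrogate.mean()  # gradient ascent on the surrogate = descent on this
```

The clipping is what buys the stability mentioned above: updates that would push the policy ratio outside \([1-\epsilon, 1+\epsilon]\) earn no extra credit, so there is little incentive for destructively large steps.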