Fine tuning foundation models

2025-01-23 — 2025-07-23

Wherein the tuning of vast language models is framed as a human‑feedback loop, in which pairwise preference data are used to train a reward model and PPO fine‑tuning is constrained by a KL penalty

adaptive

agents

approximation

bandit problems

Bayes

control

generative

incentive mechanisms

language

learning

machine learning

meta learning

networks

neural nets

NLP

optimization

stochastic processes

stringology

time series

utility

The alignment of large language models (LLMs) with human values and intentions is a critical problem in contemporary AI research. I don’t truly believe that alignment is well-posed in the abstract, but for the moment let’s be satisfied with steering a model’s behaviour to better match a user’s intent or a set of specified principles, such as being “helpful, harmless, and truthful” (Ouyang et al. 2022).

Pre-training on vast text corpora makes these models swole with wordy power, but that power often won’t do what we want (if we get into a very 4chan-y part of the parameter space, for example, it isn’t going to produce great bedtime stories for kids). More formally, the pre-training objective is misaligned with what an average user wants. Fine-tuning with human feedback, particularly through methods like Reinforcement Learning from Human Feedback (RLHF), is the archetypal way to fix this.

I’m no expert in RLHF, but I’m keeping some notes.

1 Formalizing

A pre-trained autoregressive language model defines a parameterized probability distribution \(p_\theta\) over sequences of tokens drawn from a vocabulary \(\mathcal {V}\). Given a sequence of tokens (a prompt) \(x = (x_1, \dots, x_k)\), the model generates a continuation \(y = (y_1, \dots, y_m)\) by sequentially sampling tokens: \[ p_\theta(y \mid x) = \prod_{i=1}^{m} p_\theta(y_i \mid x, y_1, \dots, y_{i-1}) \] We learn the parameters \(\theta\) during a pre-training phase, typically by maximizing the log-likelihood of a massive text corpus \(\mathcal {D}_{\text {pretrain}}\): \[ \max_\theta \sum_{x \in \mathcal {D}_{\text {pretrain}}} \log p_\theta (x) \]

This objective trains the model to be an excellent predictor of the “average” text on the internet. However, that’s not the same as being helpful or following instructions, since some internet text is helpful in context, and some of it is other stuff, e.g. arseholes shouting at one another. Chat sessions aren’t the typical origin of text on the internet, but chatting is what the model is supposed to do. A model trained this way might produce outputs that are untruthful, toxic, or simply bizarre. We want to take a model that’s great at sampling from all possible sentences, and instead make it interact usefully with a user, where “useful” means in practice something like “the user might pay cash money for this interaction.”
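To make the chain-rule factorization concrete, here is a minimal sketch. It assumes a made-up toy model (`TOY_MODEL`, a hypothetical lookup table in which the next token depends only on the previous one) rather than a real LLM, which would condition on the whole prefix. The point is only that the log-probability of a continuation is a sum of per-token conditional log-probabilities.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the
# previous token. A real LLM conditions on the entire prefix.
TOY_MODEL = {
    "<s>": {"the": 0.6, "cat": 0.3, "sat": 0.1},
    "the": {"the": 0.1, "cat": 0.7, "sat": 0.2},
    "cat": {"the": 0.2, "cat": 0.1, "sat": 0.7},
    "sat": {"the": 0.5, "cat": 0.3, "sat": 0.2},
}


def next_token_probs(prefix):
    """Return p(. | prefix) for the toy model (it only looks at the last token)."""
    return TOY_MODEL[prefix[-1]]


def sequence_log_prob(prompt, continuation):
    """log p(y | x) = sum_i log p(y_i | x, y_1, ..., y_{i-1})."""
    prefix = list(prompt)
    total = 0.0
    for token in continuation:
        total += math.log(next_token_probs(prefix)[token])
        prefix.append(token)  # the sampled token becomes part of the context
    return total


# Prompt x = ("<s>", "the"), continuation y = ("cat", "sat").
log_p = sequence_log_prob(["<s>", "the"], ["cat", "sat"])
print(math.exp(log_p))  # product of the two conditionals, 0.7 * 0.7
```

Pre-training then amounts to adjusting the table (or, in a real model, the network weights) to make such log-probabilities large on the training corpus.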

The goal of fine-tuning is to adapt the model’s parameters, starting from \(\theta_{\text{SFT}}\) (the Supervised Fine-Tuned model, a common starting point), to a new set of parameters \(\theta_{\text {aligned}}\) that produces more desirable outputs. The challenge is that “desirability” is not defined by a simple likelihood objective but by latent human preferences, which are hard to elicit (cue economists sighing). RLHF is a framework for solving this problem: we learn a reward function directly from these preferences, then throw reinforcement learning into the mix to optimize the model against it.

2 RLHF origins

The core idea is to translate human preferences between two model outputs, \((y_w, y_l) \sim p_{\theta_{\text{SFT}}}(y \mid x)\), where \(y_w\) is the output the labeller preferred and \(y_l\) the one they rejected, into a reward signal for RL.
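One common way to make this translation precise is the Bradley-Terry model. This is an assumption here, since the paragraph above does not commit to a particular preference model. Writing \(r_\phi(x, y)\) for a learned scalar reward and \(\mathcal {D}_{\text {pref}}\) for a dataset of labelled comparisons (a name introduced only for this sketch), the probability that a labeller prefers \(y_w\) over \(y_l\) is modelled as \[ P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \] and the reward parameters are fit by minimizing the negative log-likelihood of the observed preferences: \[ \mathcal {L}(\phi) = -\mathbb {E}_{(x, y_w, y_l) \sim \mathcal {D}_{\text {pref}}} \left[ \log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) \right] \] For example, if the reward model scores the preferred output one unit higher than the rejected one, the modelled preference probability is \(\sigma(1) \approx 0.73\). Only reward differences matter, so \(r_\phi\) is identified only up to an additive constant per prompt.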

Learning from Human Preferences: The paper by Christiano et al. (2017) is the direct intellectual ancestor of RLHF for LLMs. It established the core loop of learning a reward model from human pairwise comparisons and optimizing a policy against it, albeit in the context of Atari games and simulated robotics.
Application to Summarization: Stiennon et al. (2020) provide a clear, focused application of these ideas to text summarization, detailing the data collection and training process. Maybe a nice concrete example?
Scaling to General Instructions (InstructGPT): The InstructGPT paper (Ouyang et al. 2022) is a seminal work demonstrating how RLHF can be scaled to train models to follow a wide range of instructions. Pay close attention to their three-step process:

Supervised Fine-Tuning (SFT): Collecting a dataset of human-written demonstrations and fine-tuning the base LLM.
Reward Model (RM) Training: Training a model to predict which of two model outputs a human would prefer. The reward model learns a function \(r_\phi(x, y)\) that assigns a scalar score to an output.
Reinforcement Learning (PPO): The core RLHF step. The SFT model is fine-tuned using Proximal Policy Optimization (Schulman et al. 2017) to maximize the reward from the RM, with a KL-divergence penalty to prevent the policy from straying too far from the initial SFT model. The objective is roughly: \(\mathbb {E}_{y \sim \pi_{\text {RL}}(\cdot \mid x)}[r_\phi(x, y)] - \beta \, \text {KL}(\pi_{\text {RL}}(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x))\).
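The KL-penalized objective in the third step can be evaluated exactly in a toy setting. The sketch below assumes a single prompt with just three candidate responses, with made-up probabilities and rewards, so that the expectation and the KL term are finite sums. It compares three policies: staying at the SFT model, greedily putting all mass on the highest-reward response, and the exponentially tilted policy proportional to \(\pi_{\text{SFT}}(y \mid x) \exp(r(x, y)/\beta)\), which is the known closed-form maximizer of this objective.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalized_objective(pi_rl, pi_sft, rewards, beta):
    """E_{y ~ pi_rl}[r(x, y)] - beta * KL(pi_rl || pi_sft) for one fixed prompt."""
    expected_reward = sum(p * r for p, r in zip(pi_rl, rewards))
    return expected_reward - beta * kl_divergence(pi_rl, pi_sft)


# Three hypothetical candidate responses to one prompt (illustrative numbers).
pi_sft = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

# Policy 1: do nothing, keep the SFT policy (zero KL penalty).
obj_sft = penalized_objective(pi_sft, pi_sft, rewards, beta)

# Policy 2: put all probability on the highest-reward response.
greedy = [0.0, 0.0, 1.0]
obj_greedy = penalized_objective(greedy, pi_sft, rewards, beta)

# Policy 3: tilt the SFT policy, weights proportional to pi_sft * exp(r / beta).
weights = [p * math.exp(r / beta) for p, r in zip(pi_sft, rewards)]
tilted = [w / sum(weights) for w in weights]
obj_tilted = penalized_objective(tilted, pi_sft, rewards, beta)

print(obj_sft, obj_greedy, obj_tilted)
```

The tilted policy scores highest: it shifts mass toward high-reward responses without collapsing onto one of them. A larger \(\beta\) keeps it closer to the SFT policy, and a smaller one moves it toward the greedy policy. PPO is a way of approximately reaching this kind of solution when the response space is far too large to enumerate.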

3 Recent fancy things

Principled RLHF: Zhu, Jordan, and Jiao (2023) provide a theoretical framework for RLHF, analysing reward learning under the Bradley-Terry-Luce (BTL) model for pairwise comparisons and the Plackett-Luce model for K-wise comparisons.
Direct Preference Optimization (DPO): The RL step in RLHF can be complex and unstable. Rafailov et al. (2023) introduce Direct Preference Optimization (DPO), a famously elegant way to bypass the need for an explicit reward model and RL. DPO shows that the KL-constrained reward maximization objective that RLHF optimizes with PPO can instead be optimized directly with a simple binary classification loss on preference data. This went very viral and I keep meaning to read it properly.
Pitfalls in Implementation: Tang and Munos (2025) discuss common pitfalls in implementing the KL divergence gradient estimation used in RL, which is a key component of PPO-based fine-tuning.
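As a rough illustration of the “simple binary classification loss” in the DPO item above, here is a sketch of the per-example DPO loss as given in Rafailov et al. (2023). It assumes we already have sequence log-probabilities of the preferred and rejected responses under both the policy being trained and the frozen reference (SFT) model. The function and argument names are hypothetical, and the numbers are invented.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss.

    logp_w, logp_l: log-probabilities of the preferred / rejected response
        under the policy being trained.
    ref_logp_w, ref_logp_l: the same quantities under the frozen reference model.
    beta: strength of the implicit KL constraint.
    """
    # The implicit reward of a response is beta * (log pi - log pi_ref);
    # the loss is a logistic loss on the difference of implicit rewards.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


# When the policy equals the reference, the margin is zero and the loss is log 2.
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Raising the preferred response's log-probability and lowering the rejected
# one's (relative to the reference) reduces the loss.
loss_after = dpo_loss(-4.0, -8.0, -5.0, -7.0)

print(loss_at_init, loss_after)
```

No reward model appears anywhere: the log-ratio between policy and reference plays that role, which is the sense in which the language model is “secretly a reward model”.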

4 Weird stuff I don’t know where to file yet

SFT Memorizes, RL Generalizes?: Why do we do both SFT and RL? Chu et al. (2025) is a comparative study, suggesting that SFT may lead to memorization of the training-data formats, while RL fosters better generalization to unseen rules and visual domains.
Application to Big-A Alignment: The ultimate goal of alignment is to create ethically robust AI systems. Plasencia (2024) gives an overview of these challenges. A simple generalization of RLHF is Constitutional AI (Bai et al. 2022), which proposes using an AI model itself to provide the preference labels, guided by a human-written “constitution” or set of principles. This reduces the reliance on large-scale human labeling for every decision.

5 Bonus time: RL papers I should probably read

Proximal Policy Optimization (PPO): The original paper by Schulman et al. (2017) is the source of the RL algorithm used in almost all major RLHF works, including InstructGPT. Understanding why PPO is preferred (due to its stability and sample efficiency compared to vanilla policy gradients) is crucial.
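The stability comes largely from PPO’s clipped surrogate objective. A minimal per-sample sketch follows, assuming the probability ratio \(\pi_{\text{new}}(a \mid s) / \pi_{\text{old}}(a \mid s)\) and an advantage estimate are already computed. The function name is hypothetical; the default \(\epsilon = 0.2\) is the value used in Schulman et al. (2017).

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate for one sample (to be maximized).

    ratio: pi_new(a | s) / pi_old(a | s)
    advantage: estimated advantage of the action taken
    epsilon: clipping range
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum makes the objective a pessimistic bound: the policy
    # gains nothing from pushing the ratio outside [1 - eps, 1 + eps] in the
    # direction that would improve the unclipped objective.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, ratio already large: the gain is capped at (1 + eps) * advantage.
capped_gain = clipped_surrogate(1.5, 1.0)

# Bad action, ratio already small: no extra credit for shrinking it further.
capped_penalty = clipped_surrogate(0.5, -1.0)

# Bad action made more likely: not clipped, so the full penalty is felt.
full_penalty = clipped_surrogate(1.5, -1.0)

print(capped_gain, capped_penalty, full_penalty)
```

Once a sample hits the clip, its gradient with respect to the policy vanishes, which limits how far a single batch can move the policy. A vanilla policy gradient has no such brake.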

6 References

Bai, Kadavath, Kundu, et al. 2022. “Constitutional AI: Harmlessness from AI Feedback.”

Christiano, Leike, Brown, et al. 2017. “Deep Reinforcement Learning from Human Preferences.” In Advances in Neural Information Processing Systems.

Chu, Zhai, Yang, et al. 2025. “SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-Training.”

Ouyang, Wu, Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.”

Plasencia. 2024. “Reinforcement Learning From Human Feedback For Ethically Robust AI Decision-Making.”

Rafailov, Sharma, Mitchell, et al. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.”

Renda, Frankle, and Carbin. 2020. “Comparing Rewinding and Fine-Tuning in Neural Network Pruning.” arXiv:2003.02389 [Cs, Stat].

Schulman, Wolski, Dhariwal, et al. 2017. “Proximal Policy Optimization Algorithms.”

Shenfeld, Pari, and Agrawal. 2025. “RL’s Razor: Why Online Reinforcement Learning Forgets Less.”

Stiennon, Ouyang, Wu, et al. 2020. “Learning to Summarize with Human Feedback.” In Advances in Neural Information Processing Systems.

Tang and Munos. 2025. “On a Few Pitfalls in KL Divergence Gradient Estimation for RL.”

Zhu, Jordan, and Jiao. 2023. “Principled Reinforcement Learning with Human Feedback from Pairwise or K-Wise Comparisons.” In Proceedings of the 40th International Conference on Machine Learning.

Ziegler, Stiennon, Wu, et al. 2019. “Fine-Tuning Language Models from Human Preferences.” In arXiv.org.