Fine tuning foundation models

2025-01-23 — 2026-06-13

Wherein Language Models Are Aligned With Human Intent Through Reinforcement Learning and Low-Rank Adaptation, and Wherein Refusal Is Found to Be Mediated by a Single Direction in the Residual Stream.


The alignment of large language models (LLMs) with human values and intentions is a critical problem in contemporary AI research. I don’t truly believe that alignment is well-posed in the abstract, but for the moment let’s be satisfied with steering a model’s behaviour to better match a user’s intent or a set of specified principles, such as being “helpful, harmless, and truthful” (Ouyang et al. 2022).

Pre-training on vast text corpora makes these models swole with wordy power, but that power often won’t do what we want (if we get into a very 4chan-y part of the parameter space, for example, it isn’t going to produce great bedtime stories for kids). More formally, the pre-training objective is misaligned with what an average user wants. Fine-tuning with human feedback, particularly through methods like Reinforcement Learning from Human Feedback (RLHF), is the archetypal way to fix this.

I’m no expert in RLHF, but I’m keeping some notes.

1 Formalizing

A pre-trained autoregressive language model is a parameterized probability distribution \(p_\theta\) over sequences of tokens drawn from a vocabulary \(\mathcal{V}\). Given a sequence of tokens (a prompt) \(x = (x_1, \dots, x_k)\), the model generates a continuation \(y = (y_1, \dots, y_m)\) by sequentially sampling tokens: \[ p_\theta(y \mid x) = \prod_{i=1}^{m} p_\theta(y_i \mid x, y_1, \dots, y_{i-1}) \] We learn the parameters \(\theta\) during a pre-training phase, typically by maximizing the log-likelihood of a massive text corpus \(\mathcal{D}_{\text{pretrain}}\): \[ \max_\theta \sum_{x \in \mathcal{D}_{\text{pretrain}}} \log p_\theta(x) \] This objective trains the model to be an excellent predictor of the “average” text on the internet. However, that’s not the same as being helpful or following instructions, since some internet text is helpful in context, and some of it is other stuff, e.g. arseholes shouting at one another. Chat sessions aren’t the typical origin of text on the internet, but chatting is what the model is supposed to do. A model trained this way might produce outputs that are untruthful, toxic, or simply bizarre. We want to take a model that’s great at sampling from all possible sentences, and instead make it interact usefully with a user, where “useful” means in practice something like “the user might pay cash money for this interaction.”
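To make the factorization concrete, here is a minimal sketch, assuming a made-up bigram table in place of a real network. The table `BIGRAM` and the helper names are hypothetical; a real language model conditions on the whole prefix rather than only the last token. The point is just that the log-probability of a continuation is the sum of per-token conditional log-probabilities.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the
# previous token. A real LM conditions on the entire prefix.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
}


def next_token_prob(prefix, token):
    """p(token | prefix) under the toy model (only the last token matters)."""
    return BIGRAM.get(prefix[-1], {}).get(token, 0.0)


def continuation_log_prob(prompt, continuation):
    """log p(y | x) = sum_i log p(y_i | x, y_1, ..., y_{i-1})."""
    prefix = list(prompt)
    total = 0.0
    for token in continuation:
        p = next_token_prob(prefix, token)
        if p == 0.0:
            return float("-inf")  # impossible continuation under the toy model
        total += math.log(p)
        prefix.append(token)
    return total


log_p = continuation_log_prob(["<s>", "the"], ["cat", "sat"])
print(math.exp(log_p))
```

Pre-training maximizes exactly this kind of sum, with the whole document playing the role of the continuation.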

The goal of fine-tuning is to adapt the model’s starting parameters, which we’ll call \(\theta_{\text{SFT}}\) (for Supervised Fine-Tuned, a common starting point), to a new set of parameters \(\theta_{\text{aligned}}\) that produces more desirable outputs. The challenge is that “desirability” is not defined by a simple likelihood objective but by latent human preferences, which are hard to elicit (cue economists sighing). RLHF is a framework for solving this problem: learn a reward function directly from these preferences, then optimize the model against it with reinforcement learning.

2 Parameter-efficient fine-tuning

Everything above adapts the parameters \(\theta\) toward a new objective, but says nothing about how many of them we touch — which in practice is very important, because these models are huge and messing with them is costly. We introduce some terminology.

The network we start from — pretrained, general-purpose — is the base model (equivalently the backbone, or foundation model), and its weights written to disk are a checkpoint. The thorough way to specialise it is a full fine-tune: keep training every weight and save a fresh checkpoint. That works, but the artefact is as big as the original model (“huge”, we said) — and the run wants serious hardware. Also, done badly we run into problems of catastrophic forgetting.

Parameter-efficient fine-tuning (PEFT) is the cheaper alternative: freeze the base weights and train only a small set of new parameters bolted on top. The result is a small adapter file rather than a whole checkpoint, and the training fits on far less GPU.

LoRA — low-rank adaptation (Hu et al. 2021) — is the PEFT method that “won” for most purposes.1 The observation is that the update a fine-tune would make to a big weight matrix is itself close to low-rank, so instead of learning a dense update we learn two skinny matrices whose product approximates it, inject that into the layer, and leave the base frozen. The higher the matrix rank, the more capacity at the cost of a bigger file. At inference we either run the adapter beside the base or merge it in, so one base checkpoint can be used for a whole library of swappable adapters.
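A minimal sketch of the low-rank trick, in plain Python to avoid dependencies. The sizes, the scaling `alpha / rank`, and all variable names are illustrative assumptions rather than anyone's reference implementation. It checks that running the adapter beside the base gives the same output as merging it in, and counts trainable parameters for a hypothetical 4096×4096 layer.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: dense update vs. the two skinny factors."""
    return d_out * d_in, rank * (d_out + d_in)


random.seed(0)
d_out, d_in, rank, alpha = 6, 8, 2, 4.0  # illustrative sizes only
scale = alpha / rank

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen base
A = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]   # trainable
B = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(d_out)]  # trainable
# In LoRA proper, B starts at zero so training begins at the base model;
# it is random here only so the demonstration is non-trivial.

x = [random.gauss(0, 1) for _ in range(d_in)]

# Adapter run beside the base: W x + scale * B (A x)
side = [w + scale * u for w, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Adapter merged in: (W + scale * B A) x
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]
merged = matvec(W_merged, x)

dense, low_rank = lora_param_counts(4096, 4096, 8)
print(dense, low_rank, dense // low_rank)
```

With these assumed sizes the rank-8 adapter trains 256 times fewer parameters than a dense update of the same matrix, which is where the small file comes from.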

LoRA is not the only PEFT method, even where the community talks as though it were.

  • Textual inversion trains a single new embedding — one token vector — and touches no network weights at all.
  • Adapters and prefix- or prompt-tuning splice small trainable modules, or learned pseudo-tokens, into each layer.
  • DoRA, LoKr and other descendants reparameterize the low-rank update for a little more fidelity per trained parameter.
  • QLoRA runs LoRA over a quantized frozen base, which is what makes it feasible to train a big model on a single consumer card.
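A back-of-envelope sketch of why quantizing the frozen base matters. The 7-billion-parameter model is a hypothetical example, and the function counts weight storage only: it ignores activations, gradients, optimizer state for the adapter, and the small overhead of quantization constants.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # hypothetical 7B-parameter base model
fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit base weights
four_bit_gb = weight_memory_gb(n_params, 4)  # 4-bit quantized base weights
print(fp16_gb, four_bit_gb)
```

Under these assumptions the base drops from 14 GB to 3.5 GB, which is the difference between not fitting and fitting on many consumer cards before any training overhead is counted.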

I suspect that these methods are close to being a full transfer-learning solution.

This how-much-we-touch axis is independent of the how-we-update question that the rest of the page considers. We can run a full-weight SFT, a LoRA-based DPO, etc.

3 RLHF origins

The core idea is to translate human preferences between two model outputs, \((y_w, y_l) \sim p_{\theta_{\text{SFT}}}(y \mid x)\), into a reward signal for RL.

  • Learning from Human Preferences: The paper by Christiano et al. (2017) is the direct intellectual ancestor of RLHF for LLMs. It established the core loop of learning a reward model from human pairwise comparisons and optimizing a policy against it, albeit in the context of Atari games and simulated robotics.
  • Application to Summarization: Stiennon et al. (2020) applies these ideas to text summarization, detailing the data collection and training process. Maybe a nice concrete example?
  • Scaling to General Instructions (InstructGPT): The InstructGPT paper (Ouyang et al. 2022) is a seminal work demonstrating how RLHF trains models to follow a wide range of instructions. The training flow is a three-step process:
  1. Supervised Fine-Tuning (SFT): Collecting a dataset of human-written demonstrations and fine-tuning the base LLM.
  2. Reward Model (RM) Training: Training a model to predict which of two model outputs a human would prefer. The reward model learns a function \(r_\phi(x, y)\) that assigns a scalar score to an output.
  3. Reinforcement Learning (PPO): The core RLHF step. The SFT model is fine-tuned using Proximal Policy Optimization (Schulman et al. 2017) to maximize the reward from the RM, with a KL-divergence penalty to prevent the policy from straying too far from the initial SFT model. The objective is roughly: \(\mathbb{E}_{y \sim \pi_{\text{RL}}}[r_\phi(x, y)] - \beta\, \text{KL}(\pi_{\text{RL}}(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x))\).
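Steps 2 and 3 can be sketched per sample as below. This is a minimal illustration, assuming scalar rewards and sequence log-probabilities are already computed; the function names are mine, not from any library. The reward model is trained with the pairwise logistic (Bradley-Terry) loss, and the RL step sees the reward minus a scaled log-ratio, whose expectation under the policy is the KL-penalized objective.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(r_chosen, r_rejected):
    """Pairwise loss for one comparison: -log sigmoid(r(x, y_w) - r(x, y_l))."""
    return -log_sigmoid(r_chosen - r_rejected)


def kl_shaped_reward(reward, logp_policy, logp_sft, beta):
    """Single-sample estimate of r(x, y) - beta * KL(pi_RL || pi_SFT), y ~ pi_RL.

    logp_policy and logp_sft are log pi_RL(y | x) and log pi_SFT(y | x).
    """
    return reward - beta * (logp_policy - logp_sft)


# The loss is small when the reward model already ranks the pair correctly.
print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))
# A sample the policy likes more than the SFT model does is taxed.
print(kl_shaped_reward(1.0, logp_policy=-2.0, logp_sft=-3.0, beta=0.1))
```

The coefficient `beta` sets how expensive it is for the policy to drift from the SFT model; the value 0.1 here is only a placeholder.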

4 Recent fancy things

  • Principled RLHF: Zhu, Jordan, and Jiao (2023) provides a theoretical framework for RLHF, discussing the use of preference models like Bradley-Terry-Luce (BTL) for handling pairwise and K-wise comparisons.
  • Direct Preference Optimization (DPO): The RL step in RLHF can be complex and unstable. Rafailov et al. (2023) introduces Direct Preference Optimization (DPO), a famously elegant way to bypass the need for an explicit reward model and RL. DPO shows that the constrained reward maximization objective of PPO can be optimized directly with a simple binary classification loss on preference data. This went very viral and I keep meaning to read it properly.
  • Pitfalls in Implementation: Tang and Munos (2025) discusses common pitfalls in implementing the KL divergence gradient estimation used in RL, which is a key component of PPO-based fine-tuning.
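For my own reference, the DPO loss on a single preference pair can be sketched as follows. It assumes the four sequence log-probabilities (policy and frozen reference, for the preferred and dispreferred responses) are already computed; the argument names and the default `beta` are placeholders.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair (y_w preferred over y_l).

    Each argument is a log-probability of a full response given the prompt:
    logp_* under the policy being trained, ref_logp_* under the reference.
    """
    implicit_reward_w = beta * (logp_w - ref_logp_w)
    implicit_reward_l = beta * (logp_l - ref_logp_l)
    return -log_sigmoid(implicit_reward_w - implicit_reward_l)


# Policy identical to the reference: the loss sits at log 2.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy has shifted mass toward the preferred response: the loss drops.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

This is the "simple binary classification loss" in question: the scaled log-ratios against the reference act as implicit rewards, so no separate reward model or sampling loop appears.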

5 Abliteration

Everything above is about installing a behaviour. Abliteration is the opposite — removing a behaviour installed by the alignment pass. Amazingly this is sometimes possible without retraining. The word is a portmanteau of ablation and obliteration, coined by FailSpy; the usual target is the model’s capacity to refuse forbidden instructions.

The result that makes it work is Arditi et al. (2024). Across multiple open chat models, they find that refusal is mediated by a single direction in the residual stream — a one-dimensional subspace. Erase that direction from the activations and the model stops refusing harmful instructions; add it back artificially and it refuses harmless ones. The safety fine-tuning (at least the refusal) collapses, to a first approximation, onto one vector.

The recipe to remove this is embarrassingly cheap:

  1. We run the model over a batch of harmful prompts and a batch of harmless ones, caching the residual-stream activations.
  2. We take the difference of the two means at each layer; normalized, that gives a candidate refusal direction per layer, from which Arditi et al. (2024) select a single best one.
  3. We ablate it — either at inference time, projecting each component’s output onto the direction and subtracting it at every layer and token, or permanently, by orthogonalizing the weight matrices that write to the residual stream so they can no longer express it.

Even the permanent version is a static weight edit: no gradients needed, no optimizer, no dataset beyond a few hundred contrastive prompts. Maxime Labonne’s walk-through calls it “fine-tuning without retraining”.
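The recipe above can be sketched in a few lines of plain Python. The three-dimensional "activations" and the matrix `W` are made-up stand-ins for cached residual-stream activations and a real weight matrix, and the helper names are hypothetical. The sketch covers the difference of means, the inference-time projection, and the weight orthogonalization \(W' = W - \hat{r}\hat{r}^\top W\).

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def refusal_direction(harmful_acts, harmless_acts):
    """Difference of the two class means, normalized to unit length."""
    diff = [h - b for h, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]


def ablate(activation, direction):
    """Inference-time ablation: x - (x . r) r."""
    coeff = dot(activation, direction)
    return [a - coeff * r for a, r in zip(activation, direction)]


def orthogonalize(weight, direction):
    """Permanent edit: W' = W - r r^T W, where rows of W index residual dims."""
    r_t_w = [dot(direction, col) for col in zip(*weight)]  # r^T W
    return [[w - r_i * p for w, p in zip(row, r_t_w)] for row, r_i in zip(weight, direction)]


# Made-up activations; real ones are cached from forward passes.
harmful = [[2.0, 1.0, 0.5], [4.0, 1.0, 0.5]]
harmless = [[0.0, 1.0, -3.5], [0.0, 1.0, -3.5]]
r_hat = refusal_direction(harmful, harmless)

x_clean = ablate([5.0, 2.0, 1.0], r_hat)

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # hypothetical matrix writing to the stream
W_abl = orthogonalize(W, r_hat)
out = [dot(row, [0.7, -1.3]) for row in W_abl]  # W' v for an arbitrary input v
print(r_hat, dot(x_clean, r_hat), dot(out, r_hat))
```

After either edit, the component along `r_hat` is zero up to floating-point error, whatever the input. That is the sense in which the weights "can no longer express" the direction.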

It is not free. Ablation “dents” the model, i.e. benchmarks drop a little. To recover performance we do need to (re-)train. People heal the result with a light preference pass (e.g. a touch of DPO), recovering most of the loss without re-installing the refusals.

Abliteration techniques generalize past censorship: any behaviour that lives along a direction can be removed or amplified the same way.2 Refusal is just the direction most people are interested in.

A wider lesson is the brittleness of this refusal mechanism as safety — the alignment holds up in ordinary use, but mechanistically it is a thin layer that white-box access plus a little interpretability dissolves away.

This has a particular following around the open-weights models from Chinese labs. Those ship with (from a western perspective) particularly frustrating refusals: not only are they subject to the usual safety alignment, but they also have a politically-motivated layer that Chinese regulation requires. TC260-003 sets out specific handling for questions about China’s political system and history, so the models deflect on Tiananmen, Taiwan, Xinjiang and other political flashpoints. Most of the global audience downloading Qwen or DeepSeek for their (excellent, free) general capability does not share that second set of constraints, and would prefer to strip political refusals. deccp is a proof-of-concept that un-censors Qwen on exactly the CCP-sensitive prompts, and huihui-ai publishes abliterated DeepSeek-R1 and Qwen variants by the dozen. The -abliterated suffix has become a small fixture of the DIY-on-open-weights scene.

Interesting tools:

6 Weird stuff I don’t know where to file yet

  • SFT Memorizes, RL Generalizes?: Why do we do both SFT and RL? Chu et al. (2025) is a comparative study, suggesting that SFT may lead to memorization of the training-data formats, while RL fosters better generalization to unseen rules and visual domains.
  • Application to Big-A Alignment: The ultimate goal of alignment is to create ethically robust AI systems. Plasencia (2024) gives an overview of these challenges. A simple generalization of RLHF is Constitutional AI (Bai et al. 2022), which proposes using an AI model itself to provide the preference labels, guided by a human-written “constitution” or set of principles. This reduces the reliance on large-scale human labelling for every decision.

7 Bonus time: RL papers I should probably read

  1. Proximal Policy Optimization (PPO): The original paper by Schulman et al. (2017) is the source of the RL algorithm used in almost all major RLHF works, including InstructGPT. Understanding why PPO is preferred (“due to its stability and sample efficiency compared to vanilla policy gradients”) would be good for my soul.
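As a reminder of what the algorithm optimizes, here is a minimal per-sample sketch of PPO's clipped surrogate objective. It assumes the advantage estimate and the two log-probabilities are given; `clip_eps=0.2` is a commonly used value, taken here as an assumption.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate for one sample: min(r * A, clip(r, 1-eps, 1+eps) * A).

    r is the probability ratio pi_new(a | s) / pi_old(a | s).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 2 with a positive advantage: the gain is capped at (1 + eps) * A.
print(ppo_clip_objective(math.log(2.0), 0.0, advantage=1.0))
# Ratio 0.5 with a negative advantage: the pessimistic (clipped) term wins.
print(ppo_clip_objective(math.log(0.5), 0.0, advantage=-1.0))
```

The clipping removes the incentive to move the policy far from the one that generated the data in a single update, which is the usual explanation for the stability.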

8 Practice

Fine-tuning assistance:

Unsloth is the DIY end of that same spectrum: a LoRA/QLoRA toolkit with custom CUDA kernels (“2× faster training, 70% less VRAM” against vanilla transformers), whose Unsloth Studio bolts on a local no-code browser UI that trains, exports GGUF or safetensors, and serves the result on an OpenAI-compatible /v1 endpoint. It is CUDA-first, so Apple Silicon Mac support is still in progress; the danbot writeup weighs it against the managed services for a paid run.

9 References

Arditi, Obeso, Syed, et al. 2024. “Refusal in Language Models Is Mediated by a Single Direction.”
Bai, Kadavath, Kundu, et al. 2022. “Constitutional AI: Harmlessness from AI Feedback.”
Christiano, Leike, Brown, et al. 2017. “Deep Reinforcement Learning from Human Preferences.” In Advances in Neural Information Processing Systems.
Chu, Zhai, Yang, et al. 2025. “SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-Training.”
Hu, Shen, Wallis, et al. 2021. “LoRA: Low-Rank Adaptation of Large Language Models.”
Ouyang, Wu, Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.”
Plasencia. 2024. “Reinforcement Learning from Human Feedback for Ethically Robust AI Decision-Making.”
Rafailov, Sharma, Mitchell, et al. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.”
Renda, Frankle, and Carbin. 2020. “Comparing Rewinding and Fine-Tuning in Neural Network Pruning.” arXiv:2003.02389 [cs, stat].
Schulman, Wolski, Dhariwal, et al. 2017. “Proximal Policy Optimization Algorithms.”
Shenfeld, Pari, and Agrawal. 2025. “RL’s Razor: Why Online Reinforcement Learning Forgets Less.”
Stiennon, Ouyang, Wu, et al. 2020. “Learning to Summarize with Human Feedback.” In Advances in Neural Information Processing Systems.
Tang and Munos. 2025. “On a Few Pitfalls in KL Divergence Gradient Estimation for RL.”
Zhu, Jordan, and Jiao. 2023. “Principled Reinforcement Learning with Human Feedback from Pairwise or K-Wise Comparisons.” In Proceedings of the 40th International Conference on Machine Learning.
Ziegler, Stiennon, Wu, et al. 2019. “Fine-Tuning Language Models from Human Preferences.” arXiv.

Footnotes

  1. The image-generation community lives and breathes these: a CivitAI “LoRA” is this trick applied to a diffusion backbone.↩︎

  2. FailSpy’s MopeyMule is an abliterated model steered into permanent melancholy.↩︎