Aligning AI systems
Practical approaches to domesticating wild models. RLHF, Constitutional AI, etc.
2025-01-19 — 2025-10-28
Wherein the notion of alignment is considered as a systemic property; the human corpus, pretraining, RLHF and deployment are treated as a single epistemic ecosystem; and institutional and economic dimensions are noted.
Placeholder.
Notes on how to implement alignment in AI systems. This is necessarily a fuzzy concept: both “alignment” and “AI” are fuzzy. We need to make peace with the frustrations of this fuzziness and move on.
1 Fine-tuning to do nice stuff
Think RLHF, Constitutional AI, etc. I’m not greatly convinced these are the right way to go, but they’re interesting.
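To make the RLHF recipe slightly more concrete, here is a minimal sketch of its reward-modelling step: fitting a scalar reward to pairwise human preferences with a Bradley–Terry loss. The toy MLP, the random stand-in “embeddings” and the hyperparameters are all placeholder assumptions of mine; a real pipeline would score the language model’s own hidden states and follow this with a KL-regularised policy-optimisation stage (e.g. PPO).

```python
# Minimal reward-model sketch for RLHF (illustrative only).
# Assumes pairwise preference data: for each pair, the first response
# was the one preferred by the human labeller.
import torch
import torch.nn as nn

torch.manual_seed(0)

EMBED_DIM = 64   # stand-in for a transformer's pooled hidden state
N_PAIRS = 256    # toy dataset size

# Toy "embeddings" of (chosen, rejected) responses. In practice these
# would come from the language model being fine-tuned.
chosen = torch.randn(N_PAIRS, EMBED_DIM)
rejected = torch.randn(N_PAIRS, EMBED_DIM)

# Scalar reward head: r(x) in R.
reward_model = nn.Sequential(
    nn.Linear(EMBED_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(200):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    # Bradley-Terry / logistic preference loss: -log sigma(r_chosen - r_rejected)
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The fitted reward would then drive a KL-regularised policy update,
# roughly: maximise E[r(x)] - beta * KL(policy || pretrained model).
```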
2 Classifying models as unaligned
I’m only familiar with mechanistic interpretability at the moment; I’m sure there’s other work.
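As a toy picture of what “classifying models as unaligned” might look like in practice, here is a linear-probe sketch in the interpretability-adjacent style: fit a logistic-regression probe on hidden activations labelled as coming from aligned versus misaligned behaviour. The synthetic activations, the planted “misalignment direction” and the labels are all assumptions invented for illustration; a real probe would use activations captured from a transformer on labelled prompts.

```python
# Illustrative linear probe: classify hidden activations as coming from
# "aligned" vs "misaligned" behaviour. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
D, N = 128, 2000                      # activation dim, number of examples

# Pretend "misaligned" runs shift the activations along a fixed direction.
direction = rng.normal(size=D)
labels = rng.integers(0, 2, size=N)   # 1 = misaligned, 0 = aligned
acts = rng.normal(size=(N, D)) + np.outer(labels, direction)

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.25, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```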
3 Emergent value systems
Emergent Values (Mazeika et al. 2025); cf. Chiu, Jiang, and Choi (2024), and consider computational morality.
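As a rough illustration of the elicitation idea behind emergent value systems, here is a toy sketch that recovers scalar utilities from pairwise preferences. The outcomes, the simulated “model preferences” and the simple Bradley–Terry fit (standing in for whatever probabilistic preference model the cited work actually uses) are all assumptions for illustration, not a reproduction of Mazeika et al.’s method.

```python
# Toy utility elicitation from pairwise preferences (illustrative only).
# In practice the preferences would come from repeatedly asking an LLM
# "which outcome do you prefer?" rather than from a simulator.
import itertools
import numpy as np
import torch

outcomes = ["outcome A", "outcome B", "outcome C", "outcome D"]
true_u = np.array([2.0, 0.5, -1.0, 0.0])   # hidden "values" we try to recover

rng = np.random.default_rng(0)
pairs, prefs = [], []
for i, j in itertools.combinations(range(len(outcomes)), 2):
    # Simulate noisy preference answers for each pair.
    p_i = 1 / (1 + np.exp(-(true_u[i] - true_u[j])))
    for _ in range(20):
        pairs.append((i, j))
        prefs.append(bool(rng.random() < p_i))   # True if i preferred over j

# Fit one utility per outcome with a Bradley-Terry (logistic) model.
u = torch.zeros(len(outcomes), requires_grad=True)
opt = torch.optim.Adam([u], lr=0.05)
idx_i = torch.tensor([p[0] for p in pairs])
idx_j = torch.tensor([p[1] for p in pairs])
y = torch.tensor(prefs, dtype=torch.float32)

for _ in range(500):
    logits = u[idx_i] - u[idx_j]
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Utilities are identified only up to an additive constant.
print({o: round(float(v - u.mean()), 2) for o, v in zip(outcomes, u)})
```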
4 Contrarian corner
I was asked this in a job application recently:
Explain your strongest disagreement with other alignment thinkers. Consider only your own inside-view understanding, and don’t defer to others’ expertise.
My answer follows:
My most common disagreement is that I think “alignment”, in the sense that we wish to “solve” it, is not meaningfully an attribute of an algorithm in isolation. There is an obvious, and trivial, sense in which this is true: if YOUR algorithm is perfectly aligned with your intent to kill me, then it is probably not aligned with my intentions, which most likely involve me staying alive. But I think this points not just to alignment-with-“something”/“someone” being the thing to certify, but rather to the need to think about incentive compatibility at the systems scale.
It is hard to imagine, in the era of pre-trained foundation models, that “aligning” an algorithm with “me” is even well-posed. What would it mean to align an intelligent encyclopaedia of all human opinions, knowledge and feelings with my feelings? Should we even? We can try to band-aid that up with coherent extrapolated volition, or some notion of what I would want if I were more than I am. That feels very hard to articulate and also very ill-posed.
My idiosyncratic hypothesis is that some of the tension could be resolved by changing the unit of analysis. Can the humans generating the pretraining corpus, the algorithm pre-trained upon it, the humans generating the RLHF signals that further tune it, and the inference in deployment all be regarded as a single epistemic system? Can that system be “aligned” to its own well-being more meaningfully than its constituent parts can? I think so. I cannot prove it, and I am open to being proven wrong. But I think it might be possible, and fruitful, and ultimately correct, to start from that level of analysis. I think we, jointly with our machines, might be best thought of as thermodynamic epistemic ecosystems, consuming energy and information. All the standard metrics of ecosystem health might analogously apply to us (energy flux, disturbance stability, species diversity, food web depth…). Moreover, I think that if we don’t find some method of analysis that is in some way systemic, one that solves “broad” alignment, then we attain nothing. Infinitely precise tools to “align” an algorithm, in some narrow sense, with an arbitrary goal will almost surely eventually be used to achieve an arbitrarily bad goal.
That’s a long way of saying that I think alignment is in part institutional, and deeply economic. That’s not to say the vastly greater speed, complexity and scale of AI systems don’t change the game in major ways.
5 Incoming
- Center for the Alignment of AI Alignment Centers is super legit and very important.
- Illustrating Reinforcement Learning from Human Feedback (RLHF)
- AI Control: Improving Safety Despite Intentional Subversion — AI Alignment Forum
- Stuart Russell on Making Artificial Intelligence Compatible with Humans, an interview on various themes in his book (Russell 2019)
- When should we worry about AI power-seeking?
- How do we solve the alignment problem?
- What is it to solve the alignment problem?
- OpenAI’s CBRN tests seem unclear.
