Aligning AI systems
Practical approaches to domesticating wild models. RLHF, Constitutional AI, etc.
2025-01-19 — 2025-03-03
Wherein practical methods such as reinforcement learning from human feedback and mechanistic interpretability are surveyed, computational morality is noted, and the inherent fuzziness of alignment is acknowledged.
Notes on how to implement alignment in AI systems. This is necessarily a fuzzy undertaking, because alignment is fuzzy and AI is fuzzy. We need to make peace with the frustrations of this fuzziness and move on.
1 Fine-tuning to do nice stuff
Think RLHF, Constitutional AI, etc. I’m not greatly persuaded that these are the right way to go, but they are interesting.
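To make the shape of these methods concrete, here is a minimal, hedged sketch of the reward-modelling step that sits at the heart of RLHF: score pairs of completions and push the preferred one’s reward above the rejected one’s with a Bradley-Terry style loss. Everything here (`TinyRewardModel`, the random “token ids”) is invented for illustration; a real pipeline would put the reward head on a pretrained language model and follow up with a separate policy-optimisation step against the learned reward.

```python
# Toy sketch of RLHF's reward-modelling step (names and data are hypothetical).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Stand-in for a reward head on top of a language model."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids):                 # (batch, seq_len)
        h = self.embed(token_ids).mean(dim=1)     # crude pooling over tokens
        return self.head(h).squeeze(-1)           # one scalar reward per sequence

def preference_loss(model, chosen, rejected):
    """Bradley-Terry loss: prefer higher reward for the chosen completion."""
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

# Hypothetical usage on random token ids standing in for labelled comparisons.
model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen = torch.randint(0, 1000, (8, 16))
rejected = torch.randint(0, 1000, (8, 16))
loss = preference_loss(model, chosen, rejected)
loss.backward()
opt.step()
```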
2 Classifying models as unaligned
I’m familiar only with mechanistic interpretability at the moment; I’m sure there is other stuff.
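Mechanistic interpretability proper means reverse-engineering circuits inside the network; the sketch below shows only the simplest adjacent diagnostic, a linear probe trained on hidden activations to flag some behaviour of interest. The activations and labels are synthetic placeholders; in practice they would be hooked out of a real model on curated prompts.

```python
# Toy linear probe on (synthetic) activations; data and "direction" are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 128
activations = rng.normal(size=(n, d))   # stand-in for residual-stream activations
direction = rng.normal(size=d)          # pretend "behaviour of interest" direction
labels = (activations @ direction + rng.normal(scale=2.0, size=n)) > 0

X_train, X_test, y_train, y_test = train_test_split(activations, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```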
3 Emergent value systems
See Emergent Values (Mazeika et al. 2025); cf. Chiu, Jiang, and Choi (2024), and consider computational morality.
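The flavour of analysis in this line of work is roughly: elicit many forced-choice preferences from a model, fit a scalar utility per outcome, and then study the structure of the recovered utilities. Below is a hedged toy version with simulated choices and a Bradley-Terry style fit; it is the general idea, not the actual procedure from those papers.

```python
# Toy utility-elicitation sketch: fit scalar utilities to simulated pairwise choices.
import numpy as np
import torch
import torch.nn.functional as F

n_outcomes, n_choices = 10, 5000
rng = np.random.default_rng(1)
true_u = rng.normal(size=n_outcomes)                      # hidden "true" utilities
pairs = rng.integers(0, n_outcomes, size=(n_choices, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]                 # drop self-comparisons
p_first = 1 / (1 + np.exp(-(true_u[pairs[:, 0]] - true_u[pairs[:, 1]])))
chose_first = rng.random(len(pairs)) < p_first            # simulated forced choices

# Fit one scalar utility per outcome so that u[i] - u[j] explains the choices.
u = torch.zeros(n_outcomes, requires_grad=True)
opt = torch.optim.Adam([u], lr=0.05)
pairs_t = torch.as_tensor(pairs)
y = torch.as_tensor(chose_first, dtype=torch.float32)
for _ in range(500):
    opt.zero_grad()
    logits = u[pairs_t[:, 0]] - u[pairs_t[:, 1]]
    loss = F.binary_cross_entropy_with_logits(logits, y)
    loss.backward()
    opt.step()

# Recovered utilities should correlate with the generating ones (up to shift and scale).
print(np.corrcoef(u.detach().numpy(), true_u)[0, 1])
```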
4 Incoming
- Center for the Alignment of AI Alignment Centers is super legit and very important
- Illustrating Reinforcement Learning from Human Feedback (RLHF)
- AI Control: Improving Safety Despite Intentional Subversion — AI Alignment Forum
- Stuart Russell on Making Artificial Intelligence Compatible with Humans, an interview on various themes in his book (Russell 2019)
- When should we worry about AI power-seeking?
- How do we solve the alignment problem?
- What is it to solve the alignment problem?
- OpenAI’s CBRN tests seem unclear