Aligning AI systems
Practical approaches to domesticating wild models: RLHF, Constitutional AI, etc.
2025-01-19 — 2025-03-03
adversarial
AI safety
classification
communicating
feature construction
game theory
high d
language
machine learning
metrics
mind
NLP
Placeholder.
Notes on how to implement alignment in AI systems. This is necessarily a fuzzy concept, because alignment is fuzzy and AI is fuzzy. We need to make peace with the frustrations of this fuzziness and move on.
1 Fine-tuning to do nice stuff
Think RLHF, Constitutional AI, etc. I’m not greatly persuaded that these are the right way to go, but they are interesting.
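To make the RLHF idea concrete, here is a minimal sketch of just the reward-modelling step, assuming a Bradley–Terry preference model over paired responses. The names (`RewardModel`, `preference_loss`) and the toy embeddings are illustrative, not from any particular library or paper.

```python
# Hedged sketch: Bradley–Terry reward-model loss, the preference-learning
# core of RLHF. Embeddings stand in for pooled LLM hidden states.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Scores a (prompt, response) embedding with a single scalar reward."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)


def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the human-preferred response wins."""
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()


# Toy usage with random "embeddings" of preferred vs. dispreferred responses.
model = RewardModel(embed_dim=16)
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
```

The fitted reward model would then supply the reward signal for a policy-gradient fine-tuning stage, which is where the rest of the RLHF machinery lives.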
2 Classifying models as unaligned
I’m familiar only with mechanistic interpretability at the moment; I’m sure there is other stuff.
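One interpretability-flavoured approach, sketched here under heavy assumptions: fit a linear probe on hidden activations to flag responses associated with undesired behaviour. The activations and labels below are synthetic stand-ins, not output from any real model or evaluation.

```python
# Hedged sketch: a linear probe over (fake) residual-stream activations,
# labelling responses judged "misaligned" by some external evaluation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# One row per response; in practice these would be activations at a chosen layer.
activations = rng.normal(size=(200, 64))
# 1 = response judged misaligned, 0 = judged fine (synthetic labels here).
labels = rng.integers(0, 2, size=200)

probe = LogisticRegression(max_iter=1000).fit(activations, labels)
print("train accuracy:", probe.score(activations, labels))
```

Whether such probes pick up a genuine "misalignment direction" or a superficial correlate is exactly the kind of question mechanistic interpretability is meant to answer.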
3 Emergent value systems
Emergent Values (Mazeika et al. 2025); cf. Chiu, Jiang, and Choi (2024), and consider computational morality.
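The gist of the utility-elicitation idea in that line of work, as I understand it: elicit many pairwise preferences over outcomes from a model, then fit a scalar utility per outcome. A hedged sketch below, using a Bradley–Terry fit; the outcome names and preference counts are invented for illustration.

```python
# Hedged sketch: recover scalar utilities from pairwise preference counts.
import numpy as np
from scipy.optimize import minimize

outcomes = ["outcome_a", "outcome_b", "outcome_c"]
# wins[i, j] = number of times outcome i was preferred over outcome j
# (in practice these counts would be elicited from a model).
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])


def neg_log_likelihood(u):
    # Bradley–Terry: P(i preferred over j) = sigmoid(u_i - u_j)
    diff = u[:, None] - u[None, :]
    return -(wins * np.log(1.0 / (1.0 + np.exp(-diff)) + 1e-12)).sum()


result = minimize(neg_log_likelihood, np.zeros(len(outcomes)))
# Utilities are identifiable only up to an additive constant, so centre them.
for name, utility in zip(outcomes, result.x - result.x.mean()):
    print(f"{name}: {utility:.2f}")
```

How coherent and stable such fitted utilities are across prompts and framings is the empirical question those papers investigate.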
4 Incoming
- Illustrating Reinforcement Learning from Human Feedback (RLHF)
- AI Control: Improving Safety Despite Intentional Subversion — AI Alignment Forum
- Stuart Russell on Making Artificial Intelligence Compatible with Humans, an interview on various themes in his book (Russell 2019)
- When should we worry about AI power-seeking?
- How do we solve the alignment problem?
- What is it to solve the alignment problem?
- OpenAI’s CBRN tests seem unclear
5 References
Aguirre, Dempsey, Surden, et al. 2020. “AI Loyalty: A New Paradigm for Aligning Stakeholder Interests.” IEEE Transactions on Technology and Society.
Bai, Kadavath, Kundu, et al. 2022. “Constitutional AI: Harmlessness from AI Feedback.”
Barez, Fu, Prabhu, et al. 2025. “Open Problems in Machine Unlearning for AI Safety.”
Burns, Izmailov, Kirchner, et al. 2023. “Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision.”
Chiu, Jiang, and Choi. 2024. “DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life.”
Christiano, Shlegeris, and Amodei. 2018. “Supervising Strong Learners by Amplifying Weak Experts.”
Conitzer, and Oesterheld. 2023. “Foundations of Cooperative AI.” Proceedings of the AAAI Conference on Artificial Intelligence.
Duque, Aghajohari, Cooijmans, et al. 2024. “Advantage Alignment Algorithms.”
Greenblatt, Denison, Wright, et al. 2024. “Alignment Faking in Large Language Models.”
Greenblatt, Shlegeris, Sachan, et al. 2024. “AI Control: Improving Safety Despite Intentional Subversion.”
Irving, Christiano, and Amodei. 2018. “AI Safety via Debate.”
Khan, Hughes, Valentine, et al. 2024. “Debating with More Persuasive LLMs Leads to More Truthful Answers.”
Laidlaw, Bronstein, Guo, et al. 2025. “AssistanceZero: Scalably Solving Assistance Games.” In Workshop on Bidirectional Human↔AI Alignment.
Mazeika, Yin, Tamirisa, et al. 2025. “Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs.”
Meulemans, Kobayashi, Oswald, et al. 2024. “Multi-Agent Cooperation Through Learning-Aware Policy Gradients.”
Ngo, Chan, and Mindermann. 2024. “The Alignment Problem from a Deep Learning Perspective.”
Russell. 2019. Human Compatible: Artificial Intelligence and the Problem of Control.
Stray, Vendrov, Nixon, et al. 2021. “What Are You Optimizing for? Aligning Recommender Systems with Human Values.”
Taylor, Yudkowsky, LaVictoire, et al. 2020. “Alignment for Advanced Machine Learning Systems.” In Ethics of Artificial Intelligence.
Zhuang, and Hadfield-Menell. 2021. “Consequences of Misaligned AI.”
Zou, Wang, Kolter, et al. 2023. “Universal and Transferable Adversarial Attacks on Aligned Language Models.”