Aligning AI systems

Practical approaches to domesticating wild models: RLHF, Constitutional AI, etc.

2025-01-19 — 2025-10-28

Wherein the notion of alignment is considered systemic, the human corpus, pretraining, RLHF and deployment are treated as a single epistemic ecosystem, and institutional and economic dimensions are noted.

adversarial
AI safety
classification
communicating
feature construction
game theory
high d
language
machine learning
metrics
mind
NLP

Notes on how to implement alignment in AI systems. This is necessarily a fuzzy topic: both “alignment” and “AI” are fuzzy concepts. We need to make peace with the frustrations of this fuzziness and move on.

1 Fine-tuning to do nice stuff

Think RLHF, Constitutional AI, etc. I’m not greatly convinced these are the right way to go, but they’re interesting.
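
To pin down at least one of these, here is a minimal sketch of the preference-modelling step that underlies RLHF-style fine-tuning: a Bradley-Terry reward model trained on pairwise human preferences. Everything here (the toy response embeddings, the tiny MLP scorer) is a hypothetical stand-in for illustration, not any particular lab’s pipeline.

```python
# Minimal Bradley-Terry reward-model sketch: the preference-learning core of RLHF.
# All names and shapes are illustrative; real pipelines score with a pretrained LM.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (toy) response representation to a scalar reward."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: maximise log sigma(r_chosen - r_rejected) over labelled pairs.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical training step on synthetic "preference pairs".
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen, rejected = torch.randn(32, 64), torch.randn(32, 64)  # stand-ins for response embeddings
loss = preference_loss(model(chosen), model(rejected))
opt.zero_grad()
loss.backward()
opt.step()
```

The fine-tuned policy is then optimised against such a reward model (or, in DPO-style variants, against the preference data directly), which is exactly where the question of whose preferences those are becomes pressing.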

2 Classifying models as unaligned

I’m only familiar with mechanistic interpretability at the moment; I’m sure there’s other work.
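
For a flavour of what such classification might look like, here is a toy linear-probe sketch: fit a simple classifier on cached model activations to flag a behaviour of interest. The activations and labels below are simulated stand-ins; real interpretability work would extract them from an actual model.

```python
# Toy linear-probe sketch: classify "unaligned" behaviour from cached activations.
# Activations and labels are simulated stand-ins for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 512
acts = rng.normal(size=(n, d))                        # stand-in for residual-stream activations
direction = rng.normal(size=d)                        # pretend "misalignment direction"
labels = (acts @ direction + rng.normal(size=n)) > 0  # synthetic behaviour labels

X_train, X_test, y_train, y_test = train_test_split(acts, labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.2f}")
```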

3 Emergent value systems

Emergent Values (Mazeika et al. 2025); cf. Chiu, Jiang, and Choi (2024), and consider computational morality.
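
Mazeika et al. elicit pairwise preferences from models and fit utility functions to them. A minimal, hypothetical version of that fitting step, a Bradley-Terry-style maximum-likelihood recovery of scalar utilities from pairwise choices, might look like the sketch below; all the data are simulated and this is not their implementation.

```python
# Toy utility-fitting sketch: recover scalar utilities over outcomes from pairwise choices,
# in the spirit of the preference-elicitation step in Mazeika et al. (2025). Data simulated.
import numpy as np
from scipy.optimize import minimize

n_outcomes, n_pairs = 10, 2000
rng = np.random.default_rng(1)
true_u = rng.normal(size=n_outcomes)                       # hidden "true" utilities
pairs = rng.integers(0, n_outcomes, size=(n_pairs, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]
# Simulated choices: outcome a preferred to b with probability sigma(u_a - u_b).
p = 1 / (1 + np.exp(-(true_u[pairs[:, 0]] - true_u[pairs[:, 1]])))
choices = rng.random(len(pairs)) < p

def neg_log_lik(u):
    diff = u[pairs[:, 0]] - u[pairs[:, 1]]
    log_p = -np.logaddexp(0, -diff)    # log sigma(diff)
    log_1mp = -np.logaddexp(0, diff)   # log sigma(-diff)
    return -np.where(choices, log_p, log_1mp).sum()

fit = minimize(neg_log_lik, np.zeros(n_outcomes), method="L-BFGS-B")
# Utilities are identified only up to an additive constant, so compare by correlation.
print("correlation with true utilities:", np.corrcoef(fit.x, true_u)[0, 1].round(2))
```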

4 Contrarian corner

I was asked this in a job application recently:

Explain your strongest disagreement with other alignment thinkers. Consider only your own inside-view understanding, and don’t defer to others’ expertise.

My answer follows:

My most common disagreement is that I think “alignment”, in the sense that we wish to “solve” it, is not meaningfully an attribute of an algorithm in isolation. There is an obvious and trivial sense in which this is true: if YOUR algorithm is perfectly aligned with your intent to kill me, then it is probably not aligned with my intentions, which most likely involve me being alive. But I think this points not just to the problem of certifying alignment with “something”/“someone”, but rather to the fact that we want to think about incentive compatibility at the systems scale.

It is hard to imagine, in the era of pre-trained foundation models, that “aligning” an algorithm with “me” is even well-posed. What would it mean to align an intelligent encyclopaedia of all human opinions, knowledge and feelings with my feelings? Should we even? We can try to band-aid that up with coherent extrapolated volition, or some notion of what I would want if I were more than I am. This feels very hard to articulate and also very ill-posed.

My idiosyncratic hypothesis is that some of the tension could be resolved by changing the unit of analysis. Can the system of humans generating the pretraining corpus, the algorithm pre-trained upon it, the humans generating the RLHF signals that further tune it, and the inference process in deployment all be regarded as a single epistemic system? Can that system be “aligned” to its own well-being more meaningfully than the constituent parts can? I think so. I cannot prove it, and I am open to being proven wrong. But I think it might be possible, and fruitful, and ultimately correct, to start from that level of analysis. I think we, jointly with our machines, might be best thought of as thermodynamic epistemic ecosystems, consuming energy and information. All the standard metrics of ecosystem health might analogously apply to us (energy flux, disturbance stability, species diversity, food web depth…). Moreover, I think that if we don’t find some method of analysis that is in some way systemic, one that solves “broad” alignment, then we attain nothing. Infinitely precise tools to “align” an algorithm in some narrow sense with an arbitrary goal will almost surely eventually be used to achieve an arbitrarily bad goal.
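
To make that list of ecosystem metrics a little less hand-wavy, here is a toy illustration of one of them, Shannon diversity, computed over a completely invented breakdown of where an epistemic system’s information flux comes from. The categories and numbers are made up for illustration; I am not claiming these are the right quantities to track.

```python
# Illustrative only: a Shannon-diversity "health" metric over hypothetical sources of
# information flux in a joint human+model epistemic system. All numbers are invented.
import numpy as np

flux = {  # arbitrary units of information contributed per source
    "human-authored text": 55.0,
    "model-generated text": 30.0,
    "human feedback signals": 10.0,
    "deployment telemetry": 5.0,
}
p = np.array(list(flux.values()))
p = p / p.sum()
shannon = -(p * np.log(p)).sum()  # higher = contributions spread more evenly across sources
print(f"source diversity (nats): {shannon:.2f} of max {np.log(len(p)):.2f}")
```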

That’s a long way of saying that I think alignment is in part institutional, and deeply economic. That’s not to say the vastly greater speed, complexity and scale of AI systems don’t change the game in major ways.

5 Incoming

6 References

Aguirre, Dempsey, Surden, et al. 2020. “AI Loyalty: A New Paradigm for Aligning Stakeholder Interests.” IEEE Transactions on Technology and Society.
Bai, Kadavath, Kundu, et al. 2022. “Constitutional AI: Harmlessness from AI Feedback.”
Barez, Fu, Prabhu, et al. 2025. “Open Problems in Machine Unlearning for AI Safety.”
Burns, Izmailov, Kirchner, et al. 2023. “Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision.”
Chiu, Jiang, and Choi. 2024. “DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life.”
Christiano, Shlegeris, and Amodei. 2018. “Supervising Strong Learners by Amplifying Weak Experts.”
Conitzer, and Oesterheld. 2023. “Foundations of Cooperative AI.” Proceedings of the AAAI Conference on Artificial Intelligence.
Duque, Aghajohari, Cooijmans, et al. 2025. “Advantage Alignment Algorithms.”
Greenblatt, Denison, Wright, et al. 2024. “Alignment Faking in Large Language Models.”
Greenblatt, Shlegeris, Sachan, et al. 2024. “AI Control: Improving Safety Despite Intentional Subversion.”
Irving, Christiano, and Amodei. 2018. “AI Safety via Debate.”
Khan, Hughes, Valentine, et al. 2024. “Debating with More Persuasive LLMs Leads to More Truthful Answers.”
Laidlaw, Bronstein, Guo, et al. 2025. “AssistanceZero: Scalably Solving Assistance Games.” In Workshop on Bidirectional Human↔︎AI Alignment.
Mazeika, Yin, Tamirisa, et al. 2025. “Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs.”
Meulemans, Kobayashi, Oswald, et al. 2024. “Multi-Agent Cooperation Through Learning-Aware Policy Gradients.”
Ngo, Chan, and Mindermann. 2024. “The Alignment Problem from a Deep Learning Perspective.”
Russell. 2019. Human Compatible: Artificial Intelligence and the Problem of Control.
Stray, Vendrov, Nixon, et al. 2021. “What Are You Optimizing for? Aligning Recommender Systems with Human Values.”
Taylor, Yudkowsky, LaVictoire, et al. 2020. “Alignment for Advanced Machine Learning Systems.” In Ethics of Artificial Intelligence.
Zhuang, and Hadfield-Menell. 2021. “Consequences of Misaligned AI.”
Zou, Wang, Kolter, et al. 2023. “Universal and Transferable Adversarial Attacks on Aligned Language Models.”