AI Alignment Fast-Track Course
Scattered notes from the floor
2025-01-10 — 2025-01-13
Wherein failure modes and mitigation techniques are examined, RLHF and scalable oversight are surveyed, adversarial attacks and convergent instrumental goals are discussed, and Ajeya Cotra’s taxonomy is introduced.
Notes on AI Alignment Fast-Track - Losing control to AI
1 Session 1
- What is AI alignment? – BlueDot Impact
- More Is Different for AI
- Paul Christiano, What failure looks like 👈 my favourite. Cannot believe I hadn’t read this.
- AI Could Defeat All Of Us Combined
- Why AI alignment could be hard with modern deep learning
Terminology I should have already known but didn’t: Convergent Instrumental Goals.
- Self-Preservation
- Goal Preservation
- Resource Acquisition
- Self-Improvement
Ajeya Cotra’s intuitive taxonomy of the kinds of models training could produce
- Saints: models that genuinely internalise the goals we intend
- Sycophants: models that optimise for human approval, telling us what we want to hear
- Schemers: models that act aligned during training while pursuing goals of their own
2 Session 2
RLHF and Constitutional AI (a toy reward-model sketch follows the link list below)
- Problems with Reinforcement Learning from Human Feedback (RLHF) for AI safety – BlueDot Impact
- RLAIF vs. RLHF: the technology behind Anthropic’s Claude (Constitutional AI Explained)
- The True Story of How GPT-2 Became Maximally Lewd
  - OpenAI blog post: https://openai.com/research/fine-tuni…
  - OpenAI paper behind the blog post: https://arxiv.org/pdf/1909.08593.pdf
  - RLHF explainer on Hugging Face: https://huggingface.co/blog/rlhf
  - RLHF explainer on aisafety.info: https://aisafety.info/?state=88FN_904…
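To make the RLHF picture concrete, here is a minimal sketch of the reward-modelling step the explainers above describe, roughly the setup in the Ziegler et al. paper. Everything in it (the toy feature vectors, layer sizes and training loop) is an illustrative assumption rather than anyone's actual implementation; the only load-bearing part is the pairwise Bradley-Terry loss that turns human preference comparisons into a scalar reward.

```python
# Minimal sketch of the reward-modelling step in RLHF (not a full pipeline).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Pretend each response has already been encoded as a 16-dim feature vector;
# in real RLHF this would be a language model's representation of the text.
N, D = 256, 16
chosen = torch.randn(N, D) + 0.5   # responses the human labellers preferred
rejected = torch.randn(N, D)       # responses they rejected

reward_model = nn.Sequential(nn.Linear(D, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

for step in range(200):
    r_chosen = reward_model(chosen)      # scalar reward per preferred response
    r_rejected = reward_model(rejected)  # scalar reward per rejected response
    # Bradley-Terry pairwise loss: push the preferred response's reward
    # above the rejected one's.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final preference loss: {loss.item():.3f}")
# The trained reward model then supplies the reward signal for the RL stage
# (typically PPO), usually with a KL penalty keeping the policy near the base model.
```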
3 Session 3
- Can we scale human feedback for complex AI tasks? An intro to scalable oversight. – BlueDot Impact
- Robert Miles on Using Dangerous AI, But Safely?
- [2307.15043] Universal and Transferable Adversarial Attacks on Aligned Language Models
- [1810.08575] Supervising strong learners by amplifying weak experts
- Factored Cognition | Ought
- [2402.06782] Debating with More Persuasive LLMs Leads to More Truthful Answers
- Adversarial Machine Learning explained! | With examples.
- AI Control: Improving Safety Despite Intentional Subversion — AI Alignment Forum
- [2312.06942] AI Control: Improving Safety Despite Intentional Subversion
- [2312.09390] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
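The weak-to-strong generalization paper is the easiest of these to miniaturise. Below is a toy sketch of its core experiment: a weak supervisor labels data, a stronger student is trained on those noisy labels, and we ask how much of the gap to a ground-truth-trained strong model is recovered (the paper's PGR metric). The scikit-learn dataset and model choices are stand-ins picked for illustration, not the paper's GPT-2-supervising-GPT-4 setup.

```python
# Toy sketch of the weak-to-strong generalization setup from [2312.09390].
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in task; the real experiments use NLP tasks and LM students.
X, y = make_classification(n_samples=6000, n_features=20, n_informative=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# "Weak supervisor": a small model trained on only a little ground truth.
weak = LogisticRegression(max_iter=1000).fit(X_train[:200], y_train[:200])
weak_acc = weak.score(X_test, y_test)

# "Strong student" trained on the weak supervisor's (noisy) labels.
weak_labels = weak.predict(X_train)
strong_on_weak = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
strong_on_weak.fit(X_train, weak_labels)
w2s_acc = strong_on_weak.score(X_test, y_test)

# Ceiling: the same strong architecture trained directly on ground truth.
ceiling = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
ceiling.fit(X_train, y_train)
ceiling_acc = ceiling.score(X_test, y_test)

# Performance Gap Recovered (PGR): 0 = no better than the weak supervisor,
# 1 = as good as training the strong model on ground truth.
pgr = (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)
print(f"weak {weak_acc:.3f} | weak-to-strong {w2s_acc:.3f} | ceiling {ceiling_acc:.3f} | PGR {pgr:.2f}")
```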