AI Alignment Fast-Track Course

Scattered notes from the floor

2025-01-10 — 2025-01-13

Wherein failure modes and mitigation techniques are examined, RLHF and scalable oversight are surveyed, adversarial attacks and convergent instrumental goals are discussed, and Ajeya Cotra’s taxonomy is introduced.

adversarial
AI safety
economics
faster pussycat
innovation
language
machine learning
mind
neural nets
NLP
security
tail risk
technology

Notes on BlueDot’s AI Alignment Fast-Track course. I took this course not so much to inform myself about AI safety, which I have been tracking long enough to have a passing grasp of, but because taking courses like this is a legible signal of commitment to the topic for a certain type of institution.

1 Session 1

Terminology I should have already known but didn’t: Convergent Instrumental Goals.

  • Self-Preservation
  • Goal Preservation
  • Resource Acquisition
  • Self-Improvement

Ajeya Cotra’s intuitive taxonomy of different failure modes:

  • Saints
  • Sycophants
  • Schemers
Figure 1: Saints, sycophants and schemers.

2 Session 2

RLHF and Constitutional AI (Bai, Kadavath, Kundu, et al. 2022).
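
The core of the RLHF recipe is fitting a reward model to human preference data before any policy optimisation happens. As a memory aid, here is a minimal sketch of that preference-modelling step, assuming a Bradley–Terry pairwise loss over chosen/rejected completions; variable names and shapes are illustrative rather than taken from any particular codebase.

```python
# Minimal sketch of the pairwise reward-model objective used in RLHF.
# Assumes scalar rewards have already been computed for each completion.
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the preferred completion outranks the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: scalar rewards for a batch of four preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 0.8, -0.1])
r_rejected = torch.tensor([0.4, 0.5, -0.2, -0.9])
print(reward_model_loss(r_chosen, r_rejected))  # smaller when chosen scores exceed rejected ones
```

The fitted reward model then supplies the training signal for the policy-optimisation stage (typically PPO with a KL penalty back to the reference model).

Constitutional AI (Bai et al. 2022) replaces much of the human labelling with model self-critique against a list of principles. A hedged sketch of the supervised critique-and-revise stage, where `generate` is a hypothetical stand-in for a call to the language model:

```python
from typing import Callable, List

def cai_revise(prompt: str, constitution: List[str], generate: Callable[[str], str]) -> str:
    """Draft a response, then iterate critique -> revision against each principle."""
    response = generate(prompt)
    for principle in constitution:
        critique = generate(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            "Critique the response according to the principle:"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response so it addresses the critique:"
        )
    return response

# The revised responses become supervised fine-tuning targets; in the subsequent RLAIF
# stage, the model's own constitution-guided preference judgements replace human labels.
```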

3 Session 3

4 Misc things learned

5 References

Bai, Kadavath, Kundu, et al. 2022. “Constitutional AI: Harmlessness from AI Feedback.”
Barez, Fu, Prabhu, et al. 2025. “Open Problems in Machine Unlearning for AI Safety.”
Burns, Izmailov, Kirchner, et al. 2023. “Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision.”
Christiano, Shlegeris, and Amodei. 2018. “Supervising Strong Learners by Amplifying Weak Experts.”
Cloud, Goldman-Wetzler, Wybitul, et al. 2024. “Gradient Routing: Masking Gradients to Localize Computation in Neural Networks.”
Everitt, Carey, Langlois, et al. 2021. “Agent Incentives: A Causal Perspective.” In Proceedings of the AAAI Conference on Artificial Intelligence.
Everitt, Kumar, Krakovna, et al. 2019. “Modeling AGI Safety Frameworks with Causal Influence Diagrams.”
Greenblatt, Shlegeris, Sachan, et al. 2024. “AI Control: Improving Safety Despite Intentional Subversion.”
Hammond, Fox, Everitt, et al. 2023. “Reasoning about Causality in Games.” Artificial Intelligence.
Hubinger, Jermyn, Treutlein, et al. 2023. “Conditioning Predictive Models: Risks and Strategies.”
Irving, Christiano, and Amodei. 2018. “AI Safety via Debate.”
Khan, Hughes, Valentine, et al. 2024. “Debating with More Persuasive LLMs Leads to More Truthful Answers.”
Leech, Garfinkel, Yagudin, et al. 2024. “Ten Hard Problems in Artificial Intelligence We Must Get Right.”
Richens, and Everitt. 2024. “Robust Agents Learn Causal World Models.”
Wang, Variengien, Conmy, et al. 2022. “Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small.”
Ward, MacDermott, Belardinelli, et al. 2024. “The Reasons That Agents Act: Intention and Instrumental Goals.”
Zou, Wang, Kolter, et al. 2023. “Universal and Transferable Adversarial Attacks on Aligned Language Models.”