AI Alignment Fast-Track Course
Scattered notes from the floor
2025-01-10 — 2025-01-13
Wherein failure modes and mitigation techniques are examined, RLHF and scalable oversight are surveyed, adversarial attacks and convergent instrumental goals are discussed, and Ajeya Cotra’s taxonomy is introduced.
Notes on AI Alignment Fast-Track - Losing control to AI
1 Session 1
- What is AI alignment? – BlueDot Impact
- More Is Different for AI
- Paul Christiano, What failure looks like 👈 my favourite. Cannot believe I hadn’t read this.
- AI Could Defeat All Of Us Combined
- Why AI alignment could be hard with modern deep learning
Terminology I should have already known but didn’t: Convergent Instrumental Goals.
- Self-Preservation
- Goal Preservation
- Resource Acquisition
- Self-Improvement
Ajeya Cotra’s intuitive taxonomy of the kinds of models training could produce
- Saints: models that genuinely internalise the goals we intend
- Sycophants: models that optimise for human approval, telling us what we want to hear
- Schemers: models that act aligned during training while pursuing goals of their own
2 Session 2
RLHF and Constitutional AI (a toy reward-model sketch follows the link list below)
- Problems with Reinforcement Learning from Human Feedback (RLHF) for AI safety – BlueDot Impact
- RLAIF vs. RLHF: the technology behind Anthropic’s Claude (Constitutional AI Explained)
- The True Story of How GPT-2 Became Maximally Lewd
  - OpenAI blog post: https://openai.com/research/fine-tuni…
  - OpenAI paper behind the blog post: https://arxiv.org/pdf/1909.08593.pdf
  - RLHF explainer on Hugging Face: https://huggingface.co/blog/rlhf
  - RLHF explainer on aisafety.info: https://aisafety.info/?state=88FN_904…
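To make the RLHF picture concrete, here is a minimal sketch of the reward-modelling step the explainers above describe, roughly the setup in the Ziegler et al. paper. Everything in it (the toy feature vectors, layer sizes and training loop) is an illustrative assumption rather than anyone's actual implementation; the only load-bearing part is the pairwise Bradley-Terry loss that turns human preference comparisons into a scalar reward.

```python
# Minimal sketch of the reward-modelling step in RLHF (not a full pipeline).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Pretend each response has already been encoded as a 16-dim feature vector;
# in real RLHF this would be a language model's representation of the text.
N, D = 256, 16
chosen = torch.randn(N, D) + 0.5   # responses the human labellers preferred
rejected = torch.randn(N, D)       # responses they rejected

reward_model = nn.Sequential(nn.Linear(D, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

for step in range(200):
    r_chosen = reward_model(chosen)      # scalar reward per preferred response
    r_rejected = reward_model(rejected)  # scalar reward per rejected response
    # Bradley-Terry pairwise loss: push the preferred response's reward
    # above the rejected one's.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final preference loss: {loss.item():.3f}")
# The trained reward model then supplies the reward signal for the RL stage
# (typically PPO), usually with a KL penalty keeping the policy near the base model.
```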
3 Session 3
- Can we scale human feedback for complex AI tasks? An intro to scalable oversight. – BlueDot Impact
- Robert Miles on Using Dangerous AI, But Safely?
- [2307.15043] Universal and Transferable Adversarial Attacks on Aligned Language Models
- [1810.08575] Supervising strong learners by amplifying weak experts
- Factored Cognition | Ought
- [2402.06782] Debating with More Persuasive LLMs Leads to More Truthful Answers
- Adversarial Machine Learning explained! | With examples.
- AI Control: Improving Safety Despite Intentional Subversion — AI Alignment Forum
- [2312.06942] AI Control: Improving Safety Despite Intentional Subversion
- [2312.09390] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
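The weak-to-strong generalization paper is the easiest of these to miniaturise. Below is a toy sketch of its core experiment: a weak supervisor labels data, a stronger student is trained on those noisy labels, and we ask how much of the gap to a ground-truth-trained strong model is recovered (the paper's PGR metric). The scikit-learn dataset and model choices are stand-ins picked for illustration, not the paper's GPT-2-supervising-GPT-4 setup.

```python
# Toy sketch of the weak-to-strong generalization setup from [2312.09390].
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in task; the real experiments use NLP tasks and LM students.
X, y = make_classification(n_samples=6000, n_features=20, n_informative=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# "Weak supervisor": a small model trained on only a little ground truth.
weak = LogisticRegression(max_iter=1000).fit(X_train[:200], y_train[:200])
weak_acc = weak.score(X_test, y_test)

# "Strong student" trained on the weak supervisor's (noisy) labels.
weak_labels = weak.predict(X_train)
strong_on_weak = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
strong_on_weak.fit(X_train, weak_labels)
w2s_acc = strong_on_weak.score(X_test, y_test)

# Ceiling: the same strong architecture trained directly on ground truth.
ceiling = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
ceiling.fit(X_train, y_train)
ceiling_acc = ceiling.score(X_test, y_test)

# Performance Gap Recovered (PGR): 0 = no better than the weak supervisor,
# 1 = as good as training the strong model on ground truth.
pgr = (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)
print(f"weak {weak_acc:.3f} | weak-to-strong {w2s_acc:.3f} | ceiling {ceiling_acc:.3f} | PGR {pgr:.2f}")
```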