Mechanistic interpretability
August 29, 2024 — May 19, 2025
Understanding complicated AI models in terms of how they work: we attempt to reverse-engineer the machine that is the AI to see how it produced the output it did. See also developmental interpretability, which looks at how neural networks evolve and develop capabilities during training.
1 Finding circuits
e.g. Wang et al. (2022), who reverse-engineer the circuit GPT-2 small uses for indirect object identification. A toy sketch of the basic activation-patching move behind such work follows.
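A minimal, hypothetical sketch of activation patching, the basic move behind circuit-finding of this kind (not Wang et al.'s full path-patching methodology): run the model on a "clean" and a "corrupted" input, splice the clean activation of one component into the corrupted run, and see how much of the clean behaviour is restored. The two-layer MLP below is a stand-in for a transformer component; nothing here is taken from the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in for a trained network; in real work this would be the language model.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
clean, corrupted = torch.randn(1, 8), torch.randn(1, 8)

# 1. Cache the activation of the component of interest on the clean input.
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach()

handle = model[1].register_forward_hook(save_hook)
clean_out = model(clean)
handle.remove()

# 2. Run on the corrupted input, unmodified, as a baseline.
corrupted_out = model(corrupted)

# 3. Re-run on the corrupted input, splicing in the cached clean activation.
def patch_hook(module, inputs, output):
    return cache["act"]  # returning a value from a forward hook replaces the output

handle = model[1].register_forward_hook(patch_hook)
patched_out = model(corrupted)
handle.remove()

# If the patch moves the corrupted output towards the clean one, the patched
# component carries information the behaviour depends on.
print(clean_out, corrupted_out, patched_out)
```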
2 Disentanglement and monosemanticity
A placeholder for notes on one hyped way of explaining models, especially large language models: sparse autoencoders, which are popular as an AI Safety technology. A minimal code sketch follows the link list below.
- Interesting critique of the whole area: Heap et al. (2025). What even is the null model for a sparse interpretation?
- Toy Models of Superposition
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- God Help Us, Let’s Try To Understand The Paper On AI Monosemanticity
- An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability | Adam Karvonen
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- Excursions into Sparse Autoencoders: What is monosemanticity?
- Intro to Superposition & Sparse Autoencoders (Colab exercises)
- Lewingtonpitsos, LLM Sparse Autoencoder Embeddings can be used to train NLP Classifiers
- Neel Nanda, An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
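A minimal sketch of the sparse-autoencoder idea, assuming only that we have harvested some hidden activations from a model: reconstruct them through an overcomplete ReLU bottleneck with an L1 sparsity penalty, so that each latent hopefully learns one interpretable "feature". Dimensions, hyperparameters and the toy training loop are illustrative, not taken from any of the papers above.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        latents = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(latents)             # reconstructed activations
        return recon, latents


def sae_loss(recon, acts, latents, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty encouraging few active features.
    return ((recon - acts) ** 2).mean() + l1_coeff * latents.abs().mean()


# Toy training loop on random "activations"; in practice `acts` would be
# residual-stream or MLP activations harvested from a language model.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for _ in range(100):
    acts = torch.randn(256, 512)
    recon, latents = sae(acts)
    loss = sae_loss(recon, acts, latents)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The hope is that the overcomplete, sparsely-activating latents are more monosemantic than the raw activations, so each latent can be inspected and labelled individually.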
3 Via causal abstraction
See causal abstraction for a different (?) approach to interpretability and disentanglement.
4 Incoming
The Misguided Quest for Mechanistic AI Interpretability
With the disclaimer that I am not actually a mechinterp guy, this piece aligns broadly with my intuitions: I am generally skeptical that we can extract the right structure post hoc from a model trained with no interpretability constraints. That said, I suspect that the tools mechinterp develops are still useful.
Multimodal Neurons in Artificial Neural Networks / Distill version
Tracing the thoughts of a large language model (Anthropic), who write:
Today, we’re sharing two new papers that represent progress on the development of the “microscope”, and the application of it to see new “AI biology”. In the first paper, we extend our prior work locating interpretable concepts (“features”) inside a model to link those concepts together into computational “circuits”, revealing parts of the pathway that transforms the words that go into Claude into the words that come out. In the second, we look inside Claude 3.5 Haiku, performing deep studies of simple tasks representative of ten crucial model behaviors, including the three described above. Our method sheds light on a part of what happens when Claude responds to these prompts, which is enough to see solid evidence that: