Mechanistic interpretability
2024-08-29 — 2025-05-19
Understanding complicated AI models in terms of “how they work”, in the sense that we attempt to reverse-engineer the machine that is the AI and see what internal computations produced the output it did. See developmental interpretability, which instead looks at how neural networks evolve and develop capabilities during training.
1 Finding circuits
e.g. Wang et al. (2022)
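The workhorse behind circuit-finding papers like that one is activation patching: run the model on a “clean” and a “corrupted” input, splice an internal activation from the clean run into the corrupted run, and see how much of the clean behaviour it restores. A minimal sketch with a toy PyTorch model (the model, the layer choice and the metric here are mine, purely illustrative, not anyone’s published setup):

```python
# Minimal activation-patching sketch on a toy stand-in model.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 2-layer MLP standing in for a transformer component.
model = nn.Sequential(
    nn.Linear(16, 32),  # "layer 0": the site we will patch
    nn.ReLU(),
    nn.Linear(32, 4),   # "layer 1": produces the logits
)

clean_x, corrupt_x = torch.randn(1, 16), torch.randn(1, 16)

# 1. Run on the clean input and cache the layer-0 activation.
cache = {}
handle = model[0].register_forward_hook(lambda m, i, o: cache.update(h=o.detach()))
clean_logits = model(clean_x)
handle.remove()

# 2. Run on the corrupted input, but overwrite layer 0 with the cached clean activation.
handle = model[0].register_forward_hook(lambda m, i, o: cache["h"])
patched_logits = model(corrupt_x)
handle.remove()

corrupt_logits = model(corrupt_x)

# 3. If patching this site restores the clean prediction, the site matters for the behaviour.
target = clean_logits.argmax(dim=-1)
print("clean  :", clean_logits[0, target].item())
print("corrupt:", corrupt_logits[0, target].item())
print("patched:", patched_logits[0, target].item())
```

In this toy the patch restores the clean output exactly, because the patched layer determines everything downstream; in a real transformer you patch one head or MLP among many and look for partial restoration.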
2 Disentanglement and monosemanticity
Here’s a placeholder to talk about one hyped way of explaining models, especially large language models: sparse autoencoders (SAEs), which decompose a model’s internal activations into a larger dictionary of sparsely-active and (hopefully) interpretable features. This is popular as an AI safety technology.
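For my own reference, a minimal sketch of the recipe as I understand it, assuming you have already harvested activations from some layer of a language model; the dimensions, penalty and training loop are illustrative, not any particular published setup. The idea: an overcomplete linear encoder with a ReLU, a linear decoder, and an L1 penalty pushing the feature activations towards sparsity.

```python
# Minimal sparse-autoencoder sketch; dimensions and hyperparameters are illustrative.
import torch
import torch.nn as nn

d_model, d_hidden = 512, 4096   # overcomplete: many more candidate features than dimensions
l1_coeff = 1e-3                 # weight on the sparsity penalty

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))  # non-negative, hopefully sparse feature activations
        return self.decoder(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# In practice `acts` comes from a hooked layer of the language model; random data as a stand-in.
acts = torch.randn(4096, d_model)

for step in range(100):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The hope is that the learned dictionary directions are more monosemantic than the raw neurons; whether that hope is justified is exactly what critiques like the one below poke at.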
An interesting critique of the whole area: Heap et al. (2025). What even is the null model for a sparse interpretation?
Sparse Crosscoders for Cross-Layer Features and Model Diffing
It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn’t always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don’t? Why do some models and tasks have many of these clean neurons, while they’re vanishingly rare in others?
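The usual story for why neurons fail to line up with features is superposition: a layer can store many more sparse features than it has neurons by assigning them nearly orthogonal directions, at the price of a little interference. A toy numerical illustration (dimensions picked arbitrarily by me):

```python
# Toy superposition demo: pack many nearly-orthogonal "feature directions" into few neurons.
import torch

torch.manual_seed(0)
d_neurons, n_features = 64, 512   # far more features than basis directions

# Random unit vectors as candidate feature directions.
W = torch.randn(n_features, d_neurons)
W = W / W.norm(dim=1, keepdim=True)

# Interference between distinct features (cosine similarities, diagonal zeroed out).
overlaps = W @ W.T - torch.eye(n_features)
print("max |cosine| between distinct features:", overlaps.abs().max().item())
print("mean |cosine|:", overlaps.abs().mean().item())
# Typical interference scales like 1/sqrt(d_neurons), so sparse combinations of the 512
# features can be stored in 64 dimensions, but no single neuron tracks a single feature.
```

Sparse autoencoders are then an attempt to recover those overlapping directions as an explicit, larger dictionary.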
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
God Help Us, Let’s Try To Understand The Paper On AI Monosemanticity
An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability | Adam Karvonen
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Excursions into Sparse Autoencoders: What is monosemanticity?
Intro to Superposition & Sparse Autoencoders (Colab exercises)
Lewingtonpitsos, LLM Sparse Autoencoder Embeddings can be used to train NLP Classifiers
Neel Nanda, An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
3 Via causal abstraction
See causal abstraction for a different (?) approach to interpretability and disentanglement.
4 Incoming
The Misguided Quest for Mechanistic AI Interpretability
With the disclaimer that I am not actually a mechinterp guy, this piece aligns broadly with my intuitions: I am generally skeptical that we can extract the right structure post hoc from a model that was trained with no regard for interpretability. That said, I suspect that the tools mechinterp develops are still useful.
Multimodal Neurons in Artificial Neural Networks / Distill version
Tracing the thoughts of a large language model (Anthropic):
Today, we’re sharing two new papers that represent progress on the development of the “microscope”, and the application of it to see new “AI biology”. In the first paper, we extend our prior work locating interpretable concepts (“features”) inside a model to link those concepts together into computational “circuits”, revealing parts of the pathway that transforms the words that go into Claude into the words that come out. In the second, we look inside Claude 3.5 Haiku, performing deep studies of simple tasks representative of ten crucial model behaviours, including the three described above. Our method sheds light on a part of what happens when Claude responds to these prompts, which is enough to see solid evidence that…