Mechanistic interpretability
2024-08-29 — 2025-05-19
Understanding complicated AI models in terms of “how they work”, in the sense that we attempt to reverse-engineer the machine that is the AI and see what internal computations produced the output it did. See developmental interpretability, which instead looks at how neural networks evolve and develop capabilities during training.
1 Finding circuits
e.g. Wang et al. (2022)
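The workhorse behind circuit-finding papers like that one is activation patching: run the model on a “clean” and a “corrupted” input, splice an internal activation from the clean run into the corrupted run, and see how much of the clean behaviour it restores. A minimal sketch with a toy PyTorch model (the model, the layer choice and the metric here are mine, purely illustrative, not anyone’s published setup):

```python
# Minimal activation-patching sketch on a toy stand-in model.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 2-layer MLP standing in for a transformer component.
model = nn.Sequential(
    nn.Linear(16, 32),  # "layer 0": the site we will patch
    nn.ReLU(),
    nn.Linear(32, 4),   # "layer 1": produces the logits
)

clean_x, corrupt_x = torch.randn(1, 16), torch.randn(1, 16)

# 1. Run on the clean input and cache the layer-0 activation.
cache = {}
handle = model[0].register_forward_hook(lambda m, i, o: cache.update(h=o.detach()))
clean_logits = model(clean_x)
handle.remove()

# 2. Run on the corrupted input, but overwrite layer 0 with the cached clean activation.
handle = model[0].register_forward_hook(lambda m, i, o: cache["h"])
patched_logits = model(corrupt_x)
handle.remove()

corrupt_logits = model(corrupt_x)

# 3. If patching this site restores the clean prediction, the site matters for the behaviour.
target = clean_logits.argmax(dim=-1)
print("clean  :", clean_logits[0, target].item())
print("corrupt:", corrupt_logits[0, target].item())
print("patched:", patched_logits[0, target].item())
```

In this toy the patch restores the clean output exactly, because the patched layer determines everything downstream; in a real transformer you patch one head or MLP among many and look for partial restoration.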
2 Disentanglement and monosemanticity
Here’s a placeholder to talk about one hyped way of explaining models, especially large language models: sparse autoencoders (SAEs), which decompose a model’s internal activations into a larger dictionary of sparsely-active and (hopefully) interpretable features. This is popular as an AI safety technology.
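For my own reference, a minimal sketch of the recipe as I understand it, assuming you have already harvested activations from some layer of a language model; the dimensions, penalty and training loop are illustrative, not any particular published setup. The idea: an overcomplete linear encoder with a ReLU, a linear decoder, and an L1 penalty pushing the feature activations towards sparsity.

```python
# Minimal sparse-autoencoder sketch; dimensions and hyperparameters are illustrative.
import torch
import torch.nn as nn

d_model, d_hidden = 512, 4096   # overcomplete: many more candidate features than dimensions
l1_coeff = 1e-3                 # weight on the sparsity penalty

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))  # non-negative, hopefully sparse feature activations
        return self.decoder(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# In practice `acts` comes from a hooked layer of the language model; random data as a stand-in.
acts = torch.randn(4096, d_model)

for step in range(100):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The hope is that the learned dictionary directions are more monosemantic than the raw neurons; whether that hope is justified is exactly what critiques like the one below poke at.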
An interesting critique of the whole area: Heap et al. (2025). What even is the null model for a sparse interpretation?
Sparse Crosscoders for Cross-Layer Features and Model Diffing
It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn’t always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don’t? Why do some models and tasks have many of these clean neurons, while they’re vanishingly rare in others?
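The usual story for why neurons fail to line up with features is superposition: a layer can store many more sparse features than it has neurons by assigning them nearly orthogonal directions, at the price of a little interference. A toy numerical illustration (dimensions picked arbitrarily by me):

```python
# Toy superposition demo: pack many nearly-orthogonal "feature directions" into few neurons.
import torch

torch.manual_seed(0)
d_neurons, n_features = 64, 512   # far more features than basis directions

# Random unit vectors as candidate feature directions.
W = torch.randn(n_features, d_neurons)
W = W / W.norm(dim=1, keepdim=True)

# Interference between distinct features (cosine similarities, diagonal zeroed out).
overlaps = W @ W.T - torch.eye(n_features)
print("max |cosine| between distinct features:", overlaps.abs().max().item())
print("mean |cosine|:", overlaps.abs().mean().item())
# Typical interference scales like 1/sqrt(d_neurons), so sparse combinations of the 512
# features can be stored in 64 dimensions, but no single neuron tracks a single feature.
```

Sparse autoencoders are then an attempt to recover those overlapping directions as an explicit, larger dictionary.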
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
God Help Us, Let’s Try To Understand The Paper On AI Monosemanticity
An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability | Adam Karvonen
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Excursions into Sparse Autoencoders: What is monosemanticity?
Intro to Superposition & Sparse Autoencoders (Colab exercises)
Lewingtonpitsos, LLM Sparse Autoencoder Embeddings can be used to train NLP Classifiers
Neel Nanda, An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
3 Via causal abstraction
See causal abstraction for a different (?) approach to interpretability and disentanglement.
4 Incoming
The Misguided Quest for Mechanistic AI Interpretability
With the disclaimer that I am not actually a mechinterp guy, this piece aligns broadly with my intuitions: I am generally skeptical that we can extract the right structure post hoc from a model that was trained with no regard for interpretability. That said, I suspect that the tools mechinterp develops are still useful.
Multimodal Neurons in Artificial Neural Networks / Distill version
Tracing the thoughts of a large language model (Anthropic):
Today, we’re sharing two new papers that represent progress on the development of the “microscope”, and the application of it to see new “AI biology”. In the first paper, we extend our prior work locating interpretable concepts (“features”) inside a model to link those concepts together into computational “circuits”, revealing parts of the pathway that transforms the words that go into Claude into the words that come out. In the second, we look inside Claude 3.5 Haiku, performing deep studies of simple tasks representative of ten crucial model behaviours, including the three described above. Our method sheds light on a part of what happens when Claude responds to these prompts, which is enough to see solid evidence that…