Model interpretation and explanation
Colorising black boxes; mechanistic interpretability
2016-09-01 — 2025-08-04
The meeting point of differential privacy, accountability, interpretability, the tank detection story, and clever horses in machine learning: can I explain why my model made this prediction?
Closely related: am I explaining the model so I can see if it is fair?
There is much work; I understand little of it at the moment, but I keep needing to refer to papers, so this notebook exists.
1 Impossibility results
One trick that “explainable” models must perform is being simpler than the model they explain, or they would be just as incomprehensible. Is that a real thing to worry about? What is the actual trade-off? Can we sketch a Pareto frontier of interpretability and accuracy?
- Cassie Kozyrkov, Explainable AI won’t deliver. Here’s why.
- Wolters Kluwer, Peeking into the Black Box: A Design Perspective on Comprehensible AI, Part 1
- Rudin (2019) argues that for high-stakes decisions we should prefer inherently interpretable models to post-hoc explanations of black boxes, and that the accuracy cost is often smaller than assumed.
2 Lottery tickets
Are the hypothesized tiny lottery ticket networks useful for interpretation? (Frankle and Carbin 2019; Hayou et al. 2020; Schotthöfer et al. 2022)
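For concreteness, here is a minimal sketch of one round of lottery-ticket-style magnitude pruning with rewinding, assuming a generic PyTorch model and some training loop `train()` you already have; it is illustrative only, not the full Frankle and Carbin protocol.

```python
import copy
import torch
import torch.nn as nn

def magnitude_mask(model: nn.Module, sparsity: float) -> dict:
    """Boolean masks keeping the largest-magnitude weights of each weight matrix."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:          # skip biases, norm parameters, etc.
            continue
        k = int(sparsity * p.numel())
        threshold = p.abs().flatten().kthvalue(k).values if k > 0 else -1.0
        masks[name] = p.abs() > threshold
    return masks

def rewind_and_mask(model: nn.Module, init_state: dict, masks: dict) -> None:
    """Reset surviving weights to their initial values; zero out the pruned ones."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.copy_(init_state[name] * masks[name] if name in masks else init_state[name])

# Usage sketch (train() is whatever training loop you already have):
# model = MyNet()
# init_state = copy.deepcopy(model.state_dict())
# train(model)                               # train the dense network
# masks = magnitude_mask(model, sparsity=0.8)
# rewind_and_mask(model, init_state, masks)  # candidate "winning ticket"
# train(model)                               # retrain the sparse subnetwork
#                                            # (re-apply the masks after each step
#                                            #  to keep pruned weights at zero)
```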
3 Coincidences and computation
A computational no-coincidence principle (Christiano, Neyman, and Xu 2022) is another way of approaching interpretability. What if we think about good neural network performance, in the sense of the network learning a representation more compact than the one implied by memorisation of the data, in the language of computational complexity? The principle suggests that such non-coincidental structure should be “efficiently verifiable” in some sense from a compact proof. This feels attractive, intuitively, but holy blazes I suspect the search for such proofs will be punishing.
4 Influence functions
If we think about models as interpolators of memorised training data, then the idea of tracing a prediction back to individual training examples via influence functions becomes powerful.
See Tracing Model Outputs to the Training Data (Grosse et al. 2023).
Integrated gradients seem to be in this family. See Ancona et al. (2017); Sundararajan, Taly, and Yan (2017). The Captum implementation seems neat.
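A hedged sketch of that Captum usage on a toy classifier (the model, input, and baseline below are stand-ins): integrated gradients attribute each input dimension by integrating the gradient of the target output along a straight path from a reference baseline to the input.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))  # toy classifier
model.eval()

x = torch.randn(1, 10)            # input to explain
baseline = torch.zeros_like(x)    # "absence of signal" reference point

ig = IntegratedGradients(model)
attributions, delta = ig.attribute(
    x,
    baselines=baseline,
    target=0,                       # class score being explained
    n_steps=64,                     # resolution of the path-integral approximation
    return_convergence_delta=True,  # completeness gap; should be near zero
)
print(attributions.shape, delta)    # per-feature attributions for that prediction
```

The completeness property (attributions summing to the difference between the model output at the input and at the baseline) is what makes the convergence delta a useful sanity check.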
5 Shapley values
Shapley values are a fair-allocation technique from cooperative game theory which turns out to be applicable to explanation. They are computationally intractable in general, but there are some fashionable approximations in the form of SHAP values.
Not sure what else happens here, but see (Ghorbani and Zou 2019; Hama, Mase, and Owen 2022; Scott M. Lundberg et al. 2020; Scott M. Lundberg and Lee 2017) for applications to explanation of both data and features.
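A hedged toy example of SHAP feature attributions (the data and model below are placeholders): the Shapley value of a feature is its average marginal contribution over all coalitions of the other features, which TreeSHAP computes exactly for tree ensembles rather than by sampling.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)        # exact Shapley values for tree ensembles
shap_values = explainer.shap_values(X[:10])  # one row of attributions per prediction
print(shap_values.shape)                     # (10, 5): per-sample, per-feature

# Local accuracy: attributions plus the base value should recover each prediction.
recon = shap_values.sum(axis=1) + explainer.expected_value
print(np.allclose(recon, model.predict(X[:10])))
```

For models without tree structure, sampling-based approximations such as shap.KernelExplainer play the same role at higher cost.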
6 Linear explanations
The LIME lineage: a neat method that uses penalised regression to build local surrogate explanations of individual predictions (Ribeiro, Singh, and Guestrin 2016). See their blog post.
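To make the “penalised regression, locally” idea concrete, here is a from-scratch sketch in that spirit (not the packaged LIME implementation, which also handles discretisation, categorical features, images, text, and so on): perturb one instance, query the black box, weight the samples by proximity, and read the explanation off the coefficients of a weighted Lasso surrogate.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lime_style_explanation(predict_fn, x, scale=0.5, n_samples=2000, alpha=0.01, seed=0):
    """Local sparse linear surrogate for a black-box prediction at x."""
    rng = np.random.default_rng(seed)
    Z = x + scale * rng.normal(size=(n_samples, x.shape[0]))  # perturb the instance
    y = predict_fn(Z)                                         # query the black box
    d2 = ((Z - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * scale**2 * x.shape[0]))             # proximity kernel
    surrogate = Lasso(alpha=alpha).fit(Z - x, y, sample_weight=w)
    return surrogate.coef_                                    # local feature attributions

# e.g. for any fitted sklearn regressor `model` and a data row `x`:
# coefs = lime_style_explanation(model.predict, x)
```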
7 When do neurons mean something?
Sparse autoencoders etc. See mechinterp.
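For orientation, a minimal sketch of the sparse-autoencoder recipe as applied to a model’s internal activations; the widths, the L1 coefficient, and the training setup here are placeholder choices rather than any particular published recipe.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_hidden=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # overcomplete dictionary
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))  # non-negative feature activations
        recon = self.decoder(codes)
        return recon, codes

def sae_loss(recon, acts, codes, l1_coef=1e-3):
    # Reconstruction error plus an L1 penalty that drives most codes to zero,
    # so that each surviving code can (hopefully) be read as a feature.
    return ((recon - acts) ** 2).mean() + l1_coef * codes.abs().mean()

# Usage: collect activations `acts` of shape (batch, d_model) from the model under
# study, then minimise sae_loss over batches with any optimiser, e.g. Adam.
```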
8 By ablation
Because of its ubiquity in the ML literature, ablation has become a de facto admissible form of explanation. I am not a fan of how this is typically done, which is to say in the absence of causal awareness. I bet we could do better though. See ablation studies.
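For reference, a hedged sketch of the usual move: silence one component with a forward hook and report how much a metric drops. The model and the `evaluate` function are placeholders, and none of the causal caveats above are addressed.

```python
import torch
import torch.nn as nn

def ablated_score(model, module, evaluate):
    """Evaluate `model` with `module`'s output replaced by zeros."""
    handle = module.register_forward_hook(lambda m, inp, out: torch.zeros_like(out))
    try:
        return evaluate(model)
    finally:
        handle.remove()

# Attribute "importance" to each top-level block by the drop it causes when silenced:
# baseline = evaluate(model)
# for name, module in model.named_children():
#     print(name, baseline - ablated_score(model, module, evaluate))
```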
9 Incoming
Saphra, Interpretability Creationism
[…] Stochastic Gradient Descent is not literally biological evolution, but post-hoc analysis in machine learning has a lot in common with scientific approaches in biology, and likewise often requires an understanding of the origin of model behaviour. Therefore, the following holds whether looking at parasitic brooding behaviour or at the inner representations of a neural network: if we do not consider how a system develops, it is difficult to distinguish a pleasing story from a useful analysis. In this piece, I will discuss the tendency towards “interpretability creationism” – interpretability methods that only look at the final state of the model and ignore its evolution over the course of training—and propose a focus on the training process to supplement interpretability research.
Good idea, or too ad hominem?
Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).
Christoph Molnar, Interpretable Machine Learning: A Guide for Making Black Box Models Explainable
George Hosu, A Parable Of Explainability
Connection to Gödel: Mathematical paradoxes demonstrate the limits of AI (Colbrook, Antun, and Hansen 2022; Heaven 2019)
The deep dream “activation maximisation” images could sort of be classified as a type of model explanation, e.g. Multifaceted neuron visualization (Nguyen, Yosinski, and Clune 2016)
Belatedly I notice that the Data Skeptic podcast did a whole season on interpretability.
How explainable artificial intelligence can help humans innovate
Are Model Explanations Useful in Practice? Rethinking How to Support Human-ML Interactions.
Existing XAI methods are not useful for decision-making. Presenting humans with popular, general-purpose XAI methods does not improve their performance on real-world use cases that motivated the development of these methods. Our negative findings align with those of contemporaneous works.
Neuronpedia is “an open platform for interpretability research. Explore, steer, and experiment on AI models.”