Model interpretation and explanation
Colorising black boxes; mechanistic interpretability
2016-09-01 — 2025-08-04
The meeting point of differential privacy, accountability, interpretability, the tank detection story, and clever horses in machine learning: can I explain why my model made this prediction?
Closely related: am I explaining the model so I can see if it is fair?
There is much work; I understand little of it at the moment, but I keep needing to refer to papers, so this notebook exists.
1 Impossibility results
One trick that “explainable” models must perform is being simpler than the model they explain, or they would be just as incomprehensible. Is that a real thing to worry about? What is the actual trade-off? Can we sketch a Pareto frontier of interpretability and accuracy?
- Cassie Kozyrkov, Explainable AI won’t deliver. Here’s why.
- Wolters Kluwer, Peeking into the Black Box: A Design Perspective on Comprehensible AI, Part 1
- Rudin (2019) argues that for high-stakes decisions we should prefer inherently interpretable models to post-hoc explanations of black boxes, and that the accuracy cost is often smaller than assumed.
2 Lottery tickets
Are the hypothesized tiny lottery ticket networks useful for interpretation? (Frankle and Carbin 2019; Hayou et al. 2020; Schotthöfer et al. 2022)
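For concreteness, here is a minimal sketch of one round of lottery-ticket-style magnitude pruning with rewinding, assuming a generic PyTorch model and some training loop `train()` you already have; it is illustrative only, not the full Frankle and Carbin protocol.

```python
import copy
import torch
import torch.nn as nn

def magnitude_mask(model: nn.Module, sparsity: float) -> dict:
    """Boolean masks keeping the largest-magnitude weights of each weight matrix."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:          # skip biases, norm parameters, etc.
            continue
        k = int(sparsity * p.numel())
        threshold = p.abs().flatten().kthvalue(k).values if k > 0 else -1.0
        masks[name] = p.abs() > threshold
    return masks

def rewind_and_mask(model: nn.Module, init_state: dict, masks: dict) -> None:
    """Reset surviving weights to their initial values; zero out the pruned ones."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.copy_(init_state[name] * masks[name] if name in masks else init_state[name])

# Usage sketch (train() is whatever training loop you already have):
# model = MyNet()
# init_state = copy.deepcopy(model.state_dict())
# train(model)                               # train the dense network
# masks = magnitude_mask(model, sparsity=0.8)
# rewind_and_mask(model, init_state, masks)  # candidate "winning ticket"
# train(model)                               # retrain the sparse subnetwork
#                                            # (re-apply the masks after each step
#                                            #  to keep pruned weights at zero)
```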
3 Coincidences and computation
A computational no-coincidence principle (Christiano, Neyman, and Xu 2022) is another way of approaching interpretability. What if we think about good neural network performance, in the sense of the network learning a representation more compact than the one implied by memorisation of the data, in the language of computational complexity? The principle suggests that such non-coincidental structure should be “efficiently verifiable” in some sense from a compact proof. This feels attractive, intuitively, but holy blazes I suspect the search for such proofs will be punishing.
4 Influence functions
If we think about models as interpolators of memorised training data, then the idea of tracing a prediction back to individual training examples via influence functions becomes powerful.
See Tracing Model Outputs to the Training Data (Grosse et al. 2023).
Integrated gradients seem to be in this family. See Ancona et al. (2017); Sundararajan, Taly, and Yan (2017). The Captum implementation seems neat.
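A hedged sketch of that Captum usage on a toy classifier (the model, input, and baseline below are stand-ins): integrated gradients attribute each input dimension by integrating the gradient of the target output along a straight path from a reference baseline to the input.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))  # toy classifier
model.eval()

x = torch.randn(1, 10)            # input to explain
baseline = torch.zeros_like(x)    # "absence of signal" reference point

ig = IntegratedGradients(model)
attributions, delta = ig.attribute(
    x,
    baselines=baseline,
    target=0,                       # class score being explained
    n_steps=64,                     # resolution of the path-integral approximation
    return_convergence_delta=True,  # completeness gap; should be near zero
)
print(attributions.shape, delta)    # per-feature attributions for that prediction
```

The completeness property (attributions summing to the difference between the model output at the input and at the baseline) is what makes the convergence delta a useful sanity check.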
5 Shapley values
Shapley values are a fair-allocation technique from cooperative game theory which turns out to be applicable to explanation. They are computationally intractable in general, but there are some fashionable approximations in the form of SHAP values.
Not sure what else happens here, but see (Ghorbani and Zou 2019; Hama, Mase, and Owen 2022; Scott M. Lundberg et al. 2020; Scott M. Lundberg and Lee 2017) for applications to explanation of both data and features.
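A hedged toy example of SHAP feature attributions (the data and model below are placeholders): the Shapley value of a feature is its average marginal contribution over all coalitions of the other features, which TreeSHAP computes exactly for tree ensembles rather than by sampling.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)        # exact Shapley values for tree ensembles
shap_values = explainer.shap_values(X[:10])  # one row of attributions per prediction
print(shap_values.shape)                     # (10, 5): per-sample, per-feature

# Local accuracy: attributions plus the base value should recover each prediction.
recon = shap_values.sum(axis=1) + explainer.expected_value
print(np.allclose(recon, model.predict(X[:10])))
```

For models without tree structure, sampling-based approximations such as shap.KernelExplainer play the same role at higher cost.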
6 Linear explanations
The LIME lineage: a neat method that uses penalised regression to build local surrogate explanations of individual predictions (Ribeiro, Singh, and Guestrin 2016). See their blog post.
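To make the “penalised regression, locally” idea concrete, here is a from-scratch sketch in that spirit (not the packaged LIME implementation, which also handles discretisation, categorical features, images, text, and so on): perturb one instance, query the black box, weight the samples by proximity, and read the explanation off the coefficients of a weighted Lasso surrogate.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lime_style_explanation(predict_fn, x, scale=0.5, n_samples=2000, alpha=0.01, seed=0):
    """Local sparse linear surrogate for a black-box prediction at x."""
    rng = np.random.default_rng(seed)
    Z = x + scale * rng.normal(size=(n_samples, x.shape[0]))  # perturb the instance
    y = predict_fn(Z)                                         # query the black box
    d2 = ((Z - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * scale**2 * x.shape[0]))             # proximity kernel
    surrogate = Lasso(alpha=alpha).fit(Z - x, y, sample_weight=w)
    return surrogate.coef_                                    # local feature attributions

# e.g. for any fitted sklearn regressor `model` and a data row `x`:
# coefs = lime_style_explanation(model.predict, x)
```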
7 When do neurons mean something?
Sparse autoencoders etc. See mechinterp.
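For orientation, a minimal sketch of the sparse-autoencoder recipe as applied to a model’s internal activations; the widths, the L1 coefficient, and the training setup here are placeholder choices rather than any particular published recipe.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_hidden=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # overcomplete dictionary
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))  # non-negative feature activations
        recon = self.decoder(codes)
        return recon, codes

def sae_loss(recon, acts, codes, l1_coef=1e-3):
    # Reconstruction error plus an L1 penalty that drives most codes to zero,
    # so that each surviving code can (hopefully) be read as a feature.
    return ((recon - acts) ** 2).mean() + l1_coef * codes.abs().mean()

# Usage: collect activations `acts` of shape (batch, d_model) from the model under
# study, then minimise sae_loss over batches with any optimiser, e.g. Adam.
```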
8 By ablation
Because of its ubiquity in the ML literature, ablation has become a de facto admissible form of explanation. I am not a fan of how this is typically done, which is to say in the absence of causal awareness. I bet we could do better though. See ablation studies.
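For reference, a hedged sketch of the usual move: silence one component with a forward hook and report how much a metric drops. The model and the `evaluate` function are placeholders, and none of the causal caveats above are addressed.

```python
import torch
import torch.nn as nn

def ablated_score(model, module, evaluate):
    """Evaluate `model` with `module`'s output replaced by zeros."""
    handle = module.register_forward_hook(lambda m, inp, out: torch.zeros_like(out))
    try:
        return evaluate(model)
    finally:
        handle.remove()

# Attribute "importance" to each top-level block by the drop it causes when silenced:
# baseline = evaluate(model)
# for name, module in model.named_children():
#     print(name, baseline - ablated_score(model, module, evaluate))
```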
9 Incoming
Saphra, Interpretability Creationism
[…] Stochastic Gradient Descent is not literally biological evolution, but post-hoc analysis in machine learning has a lot in common with scientific approaches in biology, and likewise often requires an understanding of the origin of model behaviour. Therefore, the following holds whether looking at parasitic brooding behaviour or at the inner representations of a neural network: if we do not consider how a system develops, it is difficult to distinguish a pleasing story from a useful analysis. In this piece, I will discuss the tendency towards “interpretability creationism” – interpretability methods that only look at the final state of the model and ignore its evolution over the course of training—and propose a focus on the training process to supplement interpretability research.
Good idea, or too ad hominem?
Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).
Christoph Molnar, Interpretable Machine Learning: A Guide for Making Black Box Models Explainable
George Hosu, A Parable Of Explainability
Connection to Gödel: Mathematical paradoxes demonstrate the limits of AI (Colbrook, Antun, and Hansen 2022; Heaven 2019)
The deep dream “activation maximisation” images could sort of be classified as a type of model explanation, e.g. Multifaceted neuron visualization (Nguyen, Yosinski, and Clune 2016)
Belatedly I notice that the Data Skeptic podcast did a whole season on interpretability.
How explainable artificial intelligence can help humans innovate
Are Model Explanations Useful in Practice? Rethinking How to Support Human-ML Interactions.
Existing XAI methods are not useful for decision-making. Presenting humans with popular, general-purpose XAI methods does not improve their performance on real-world use cases that motivated the development of these methods. Our negative findings align with those of contemporaneous works.
Neuronpedia is “an open platform for interpretability research. Explore, steer, and experiment on AI models.”