Model interpretation and explanation

Colorizing black boxes

September 1, 2016 — December 22, 2023

adversarial
game theory
hierarchical models
machine learning
sparser than thou

The meeting point of differential privacy, accountability, interpretability, the tank detection story, and clever horses in machine learning.

Closely related: am I explaining the model so I can see if it is fair?

There is much work; I understand little of it at the moment, but I keep needing to refer to papers, so this notebook exists.

1 Impossibility results

2 Integrated gradients

Ancona et al. (2017); Sundararajan, Taly, and Yan (2017). The Captum implementation seems neat.
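To fix ideas, here is a minimal sketch using Captum's `IntegratedGradients`; the two-layer toy model, the all-zeros baseline, and all hyperparameters are my own placeholder choices, not anything from those papers.

```python
# Minimal integrated-gradients sketch with Captum; the toy model and
# all-zeros baseline are illustrative placeholders.
import torch
from captum.attr import IntegratedGradients

model = torch.nn.Sequential(
    torch.nn.Linear(4, 16), torch.nn.ReLU(), torch.nn.Linear(16, 3)
)
model.eval()

x = torch.randn(8, 4)              # batch of inputs to explain
baseline = torch.zeros_like(x)     # reference point for the path integral

ig = IntegratedGradients(model)
# Attributions approximate the path integral of gradients from baseline to x
# for the class-1 logit; delta measures the violation of the completeness
# axiom (attributions should sum to f(x) - f(baseline)).
attributions, delta = ig.attribute(
    x, baselines=baseline, target=1, n_steps=64, return_convergence_delta=True
)
print(attributions.shape, delta.abs().max())
```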

3 When do neurons mean something?

  • Toy Models of Superposition

    It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn’t always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don’t? Why do some models and tasks have many of these clean neurons, while they’re vanishingly rare in others?
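To make that concrete, here is a minimal toy-model-of-superposition sketch in PyTorch, assuming a tied-weight linear bottleneck with a ReLU readout; the architecture, hyperparameters, and the interference diagnostic at the end are my illustrative choices, not the exact setup of that article.

```python
# Minimal sketch of a "toy model of superposition": n_features sparse features
# squeezed through a d_hidden < n_features bottleneck with a tied-weight ReLU
# readout. Names and hyperparameters are illustrative placeholders.
import torch

torch.manual_seed(0)
n_features, d_hidden, batch = 20, 5, 1024
p_active = 0.05  # features are sparse: each is active with small probability

W = torch.nn.Parameter(0.1 * torch.randn(n_features, d_hidden))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(2000):
    mask = (torch.rand(batch, n_features) < p_active).float()
    x = mask * torch.rand(batch, n_features)     # sparse nonnegative features
    x_hat = torch.relu(x @ W @ W.T + b)          # encode to d_hidden, decode, ReLU
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# If the model stores features "in superposition", the learned feature
# directions (rows of W) are not orthogonal, so each hidden unit ends up
# responding to several features rather than exactly one.
with torch.no_grad():
    gram = W @ W.T
    gram.fill_diagonal_(0.0)
    print(f"loss {loss.item():.4f}, max feature interference {gram.abs().max():.3f}")
```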

4 Influence functions

If we think about models as interpolators of memorised training data, then looking at the influence of individual training examples on a given prediction becomes a powerful idea.

Figure 2: From the Twitter summary of Grosse et al. (2023)
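As a toy illustration of the quantity involved, here is a sketch for a small logistic regression using the classic approximation I(z, z_test) ≈ −∇L(z_test)ᵀ H⁻¹ ∇L(z). Grosse et al. (2023) make this workable for LLMs with EK-FAC approximations of the Hessian; the exact solve below only makes sense at toy scale, and all names and hyperparameters are placeholders.

```python
# Influence-function sketch for a toy ridge-regularised logistic regression:
# influence of upweighting training point z on the loss at z_test is
# approximately -grad L(z_test)^T H^{-1} grad L(z).
import torch

torch.manual_seed(0)
n, d = 200, 5
X = torch.randn(n, d)
w_true = torch.randn(d)
y = (X @ w_true + 0.5 * torch.randn(n) > 0).float()

def loss_fn(w, X_, y_):
    return torch.nn.functional.binary_cross_entropy_with_logits(X_ @ w, y_)

# Fit the model; the small ridge penalty keeps the Hessian invertible.
w = torch.zeros(d, requires_grad=True)
opt = torch.optim.LBFGS([w], max_iter=200)
def closure():
    opt.zero_grad()
    l = loss_fn(w, X, y) + 1e-3 * w.pow(2).sum()
    l.backward()
    return l
opt.step(closure)

# Hessian of the regularised training loss at the fitted parameters.
H = torch.autograd.functional.hessian(
    lambda w_: loss_fn(w_, X, y) + 1e-3 * w_.pow(2).sum(), w.detach()
)

x_test, y_test = X[:1], y[:1]  # stand-in for a held-out test point
g_test = torch.autograd.grad(loss_fn(w, x_test, y_test), w)[0].detach()

# Negative influence means the training point is "helpful" for this prediction.
influence = torch.empty(n)
for i in range(n):
    g_i = torch.autograd.grad(loss_fn(w, X[i:i+1], y[i:i+1]), w)[0].detach()
    influence[i] = -(g_test @ torch.linalg.solve(H, g_i))

print("most helpful training points:", influence.argsort()[:3].tolist())
```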

5 Shapley values

Shapley values come from cooperative game theory, where they give a “fair” division of a coalition’s payoff among its players; that fairness machinery turns out to be applicable to explanation.

Not sure what else happens here, but see (Ghorbani and Zou 2019; Hama, Mase, and Owen 2022; Scott M. Lundberg et al. 2020; Scott M. Lundberg and Lee 2017) for applications to explaining both data and features.
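As a reminder of what the machinery actually computes, here is a brute-force sketch of exact Shapley attributions for a toy predictor, using the common convention of switching “absent” features to a baseline value. Real SHAP-style tooling uses cleverer approximations, since this enumeration is exponential in the number of features; the toy predictor and baseline are my own illustrative choices.

```python
# Exact Shapley-value feature attribution by enumerating all coalitions.
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """phi[i] averages the marginal contribution of feature i over all coalitions."""
    n = len(x)
    phi = [0.0] * n

    def v(S):
        # Value of coalition S: predict with features outside S set to baseline.
        z = [x[j] if j in S else baseline[j] for j in range(n)]
        return predict(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                S = set(S)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += weight * (v(S | {i}) - v(S))
    return phi

# Toy model with an interaction term, so credit must be shared between features 1 and 2.
predict = lambda z: 2.0 * z[0] + z[1] * z[2]
x, baseline = [1.0, 1.0, 3.0], [0.0, 0.0, 0.0]
phi = shapley_values(predict, x, baseline)
print(phi)                                        # per-feature attributions
print(sum(phi), predict(x) - predict(baseline))   # efficiency axiom: these match
```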

6 By ablation

Not a fan. But see ablation studies.

7 Incoming

8 References

Aggarwal, and Yu. 2008. “A General Survey of Privacy-Preserving Data Mining Models and Algorithms.” In Privacy-Preserving Data Mining. Advances in Database Systems 34.
Alain, and Bengio. 2016. “Understanding Intermediate Layers Using Linear Classifier Probes.” arXiv:1610.01644 [Cs, Stat].
Ancona, Ceolini, Öztireli, et al. 2017. “Towards Better Understanding of Gradient-Based Attribution Methods for Deep Neural Networks.”
Barocas, and Selbst. 2016. “Big Data’s Disparate Impact.” SSRN Scholarly Paper ID 2477899.
Black, Koepke, Kim, et al. 2023. “Less Discriminatory Algorithms.” SSRN Scholarly Paper.
Bowyer, King, and Scheirer. 2020. “The Criminality From Face Illusion.” arXiv:2006.03895 [Cs].
Burrell. 2016. “How the Machine ‘Thinks’: Understanding Opacity in Machine Learning Algorithms.” Big Data & Society.
Chipman, and Gu. 2005. “Interpretable Dimension Reduction.” Journal of Applied Statistics.
Colbrook, Antun, and Hansen. 2022. “The Difficulty of Computing Stable and Accurate Neural Networks: On the Barriers of Deep Learning and Smale’s 18th Problem.” Proceedings of the National Academy of Sciences.
Din, Karidi, Choshen, et al. 2023. “Jump to Conclusions: Short-Cutting Transformers With Linear Transformations.”
Dwork, Hardt, Pitassi, et al. 2012. “Fairness Through Awareness.” In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. ITCS ’12.
Feldman, Friedler, Moeller, et al. 2015. “Certifying and Removing Disparate Impact.” In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’15.
Garcez, and Lamb. 2020. “Neurosymbolic AI: The 3rd Wave.”
Ghorbani, and Zou. 2019. “Data Shapley: Equitable Valuation of Data for Machine Learning.”
Grosse, Bae, Anil, et al. 2023. “Studying Large Language Model Generalization with Influence Functions.”
Hama, Mase, and Owen. 2022. “Model Free Shapley Values for High Dimensional Data.”
Hardt, Price, and Srebro. 2016. “Equality of Opportunity in Supervised Learning.” In Advances in Neural Information Processing Systems.
Heaven. 2019. “Why Deep-Learning AIs Are so Easy to Fool.” Nature.
Hidalgo, Orghian, Albo Canals, et al. 2021. How Humans Judge Machines.
Karimi, Muandet, Kornblith, et al. 2022. “On the Relationship Between Explanation and Prediction: A Causal View.”
Kilbertus, Rojas Carulla, Parascandolo, et al. 2017. “Avoiding Discrimination Through Causal Reasoning.” In Advances in Neural Information Processing Systems 30.
Kleinberg, Mullainathan, and Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.”
Lash, Lin, Street, et al. 2016. “Generalized Inverse Classification.” arXiv:1610.01675 [Cs, Stat].
Lipton. 2016. “The Mythos of Model Interpretability.” In arXiv:1606.03490 [Cs, Stat].
Lombrozo. 2006. “The Structure and Function of Explanations.” Trends in Cognitive Sciences.
Lombrozo, and Liquin. 2023. “Explanation Is Effective Because It Is Selective.” Current Directions in Psychological Science.
Lombrozo, and Vasilyeva. 2017. “Causal Explanation.” In The Oxford Handbook of Causal Reasoning. Oxford Library of Psychology.
Lundberg, Scott M., Erion, Chen, et al. 2020. “From Local Explanations to Global Understanding with Explainable AI for Trees.” Nature Machine Intelligence.
Lundberg, Scott M., and Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Advances in Neural Information Processing Systems.
Miconi. 2017. “The Impossibility of ‘Fairness’: A Generalized Impossibility Result for Decisions.”
Moosavi-Dezfooli, Fawzi, Fawzi, et al. 2016. “Universal Adversarial Perturbations.” In arXiv:1610.08401 [Cs, Stat].
Nanda, Chan, Lieberum, et al. 2023. “Progress Measures for Grokking via Mechanistic Interpretability.”
Nguyen, Yosinski, and Clune. 2016. “Multifaceted Feature Visualization: Uncovering the Different Types of Features Learned By Each Neuron in Deep Neural Networks.” arXiv Preprint arXiv:1602.03616.
Power, Burda, Edwards, et al. 2022. “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.”
Ribeiro, Singh, and Guestrin. 2016. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’16.
Rudin. 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” Nature Machine Intelligence.
Sundararajan, Taly, and Yan. 2017. “Axiomatic Attribution for Deep Networks.”
Sweeney. 2013. “Discrimination in Online Ad Delivery.” Queue.
Wisdom, Powers, Pitton, et al. 2016. “Interpretable Recurrent Neural Networks Using Sequential Sparse Recovery.” In Advances in Neural Information Processing Systems 29.
Wu, and Zhang. 2016. “Automated Inference on Criminality Using Face Images.” arXiv:1611.04135 [Cs].
Zemel, Wu, Swersky, et al. 2013. “Learning Fair Representations.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13).