Model interpretation and explanation

Colorizing black boxes

This is the meeting point of differential privacy, accountability, interpretability, the tank detection story, and clever horses in machine learning.

Closely related: am I explaining the model so I can see if it is fair?

There is much work; I understand little of it at the moment, but I keep needing to refer to papers.


Integrated gradients

Ancona et al. (2017); Sundararajan, Taly, and Yan (2017). The Captum implementation seems neat.
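To make the idea concrete, here is a minimal sketch of integrated gradients in plain Python. It is not the Captum implementation. The toy logistic model, its weights `w` and `b`, the all-zeros baseline and the `steps` count are all assumptions chosen for illustration. The attribution for input `i` is `(x_i - baseline_i)` times the average gradient along the straight path from the baseline to `x`, approximated here with a midpoint Riemann sum.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy differentiable model (an assumption): logistic regression output."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def model_grad(x, w, b):
    """Analytic gradient of the model output with respect to the input x."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate integrated gradients with a midpoint Riemann sum."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_grad(point, w, b)
        avg_grad = [a + g / steps for a, g in zip(avg_grad, grad)]
    # Scale the path-averaged gradient by the displacement from the baseline.
    return [(xi - bi) * a for xi, bi, a in zip(x, baseline, avg_grad)]


# Hypothetical weights, input and baseline, chosen only for the example.
w, b = [2.0, -1.0, 0.5], 0.1
x, baseline = [1.0, 2.0, -1.0], [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, b)

# Completeness axiom: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(
    sum(attributions) - (model(x, w, b) - model(baseline, w, b))
)
```

The `completeness_gap` check is a cheap sanity test: if it is not close to zero, the step count is probably too low for the model at hand. For real networks the gradient would come from automatic differentiation rather than a hand-written `model_grad`; Captum exposes this method as `captum.attr.IntegratedGradients`.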

When do neurons mean something?

  • Toy Models of Superposition

    It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn’t always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don’t? Why do some models and tasks have many of these clean neurons, while they’re vanishingly rare in others?
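One way to see how neurons can fail to align with features is a hand-built toy, sketched below. It is not taken from the paper's experiments. The assumed setting is that three features are packed into a two-neuron hidden layer, with each feature given its own direction at 120° spacing. The names `directions`, `hidden` and `responds` are hypothetical.

```python
import math

n_features, n_neurons = 3, 2

# Assumed embedding: feature i points at angle 2*pi*i/3 in the 2-D hidden space.
directions = [
    (math.cos(2 * math.pi * i / n_features), math.sin(2 * math.pi * i / n_features))
    for i in range(n_features)
]


def hidden(features):
    """Hidden activations: each neuron sums the features' components along its axis."""
    return [
        sum(f * d[j] for f, d in zip(features, directions))
        for j in range(n_neurons)
    ]


# Which features move each neuron? A "clean" neuron would list exactly one.
responds = [
    [i for i in range(n_features) if abs(directions[i][j]) > 1e-9]
    for j in range(n_neurons)
]

# Distinct feature directions are not orthogonal, so features interfere.
overlap = sum(a * b for a, b in zip(directions[0], directions[1]))
```

In this toy, every neuron responds to more than one feature, and any two feature directions overlap (the dot product is about -0.5). Each feature is still a well-defined direction in the hidden space, but no single neuron reads it out cleanly.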



Aggarwal, Charu C., and Philip S. Yu. 2008. “A General Survey of Privacy-Preserving Data Mining Models and Algorithms.” In Privacy-Preserving Data Mining, edited by Charu C. Aggarwal and Philip S. Yu, 11–52. Advances in Database Systems 34. Springer US.
Alain, Guillaume, and Yoshua Bengio. 2016. “Understanding Intermediate Layers Using Linear Classifier Probes.” arXiv:1610.01644 [Cs, Stat], October.
Ancona, Marco, Enea Ceolini, Cengiz Öztireli, and Markus Gross. 2017. “Towards Better Understanding of Gradient-Based Attribution Methods for Deep Neural Networks,” November.
Barocas, Solon, and Andrew D. Selbst. 2016. “Big Data’s Disparate Impact.” SSRN Scholarly Paper ID 2477899. Rochester, NY: Social Science Research Network.
Bowyer, Kevin W., Michael King, and Walter Scheirer. 2020. “The Criminality From Face Illusion.” arXiv:2006.03895 [Cs], June.
Burrell, Jenna. 2016. “How the Machine ‘Thinks’: Understanding Opacity in Machine Learning Algorithms.” Big Data & Society 3 (1): 2053951715622512.
Chipman, Hugh A., and Hong Gu. 2005. “Interpretable Dimension Reduction.” Journal of Applied Statistics 32 (9): 969–87.
Colbrook, Matthew J., Vegard Antun, and Anders C. Hansen. 2022. “The Difficulty of Computing Stable and Accurate Neural Networks: On the Barriers of Deep Learning and Smale’s 18th Problem.” Proceedings of the National Academy of Sciences 119 (12): e2107151119.
Din, Alexander Yom, Taelin Karidi, Leshem Choshen, and Mor Geva. 2023. “Jump to Conclusions: Short-Cutting Transformers With Linear Transformations.”
Dwork, Cynthia, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. “Fairness Through Awareness.” In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, 214–26. ITCS ’12. New York, NY, USA: ACM.
Feldman, Michael, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. 2015. “Certifying and Removing Disparate Impact.” In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 259–68. KDD ’15. New York, NY, USA: ACM.
Garcez, Artur d’Avila, and Luis C. Lamb. 2020. “Neurosymbolic AI: The 3rd Wave.” arXiv.
Hardt, Moritz, Eric Price, and Nati Srebro. 2016. “Equality of Opportunity in Supervised Learning.” In Advances in Neural Information Processing Systems, 3315–23.
Heaven, Douglas. 2019. “Why Deep-Learning AIs Are so Easy to Fool.” Nature 574 (7777): 163–66.
Hidalgo, César A., Diana Orghian, Jordi Albo Canals, Filipa de Almeida, and Natalia Martín Cantero. 2021. How Humans Judge Machines. Cambridge, Massachusetts: The MIT Press.
Karimi, Amir-Hossein, Krikamol Muandet, Simon Kornblith, Bernhard Schölkopf, and Been Kim. 2022. “On the Relationship Between Explanation and Prediction: A Causal View.” arXiv.
Kilbertus, Niki, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Schölkopf. 2017. “Avoiding Discrimination Through Causal Reasoning.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 656–66. Curran Associates, Inc.
Kleinberg, Jon, Sendhil Mullainathan, and Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores,” September.
Lash, Michael T., Qihang Lin, W. Nick Street, Jennifer G. Robinson, and Jeffrey Ohlmann. 2016. “Generalized Inverse Classification.” arXiv:1610.01675 [Cs, Stat], October.
Lipton, Zachary C. 2016. “The Mythos of Model Interpretability.” In arXiv:1606.03490 [Cs, Stat].
Lombrozo, Tania. 2006. “The Structure and Function of Explanations.” Trends in Cognitive Sciences 10 (10): 464–70.
Lombrozo, Tania, and Emily G. Liquin. 2023. “Explanation Is Effective Because It Is Selective.” Current Directions in Psychological Science, March, 09637214231156106.
Lombrozo, Tania, and Nadya Vasilyeva. 2017. “Causal Explanation.” In The Oxford Handbook of Causal Reasoning, 415–32. Oxford Library of Psychology. New York, NY, US: Oxford University Press.
Lundberg, Scott M, and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc.
Miconi, Thomas. 2017. “The Impossibility of ‘Fairness’: A Generalized Impossibility Result for Decisions,” July.
Moosavi-Dezfooli, Seyed-Mohsen, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. 2016. “Universal Adversarial Perturbations.” In arXiv:1610.08401 [Cs, Stat].
Nanda, Neel, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. “Progress Measures for Grokking via Mechanistic Interpretability.” arXiv.
Nguyen, Anh, Jason Yosinski, and Jeff Clune. 2016. “Multifaceted Feature Visualization: Uncovering the Different Types of Features Learned By Each Neuron in Deep Neural Networks.” arXiv Preprint arXiv:1602.03616.
Power, Alethea, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. 2022. “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.” arXiv.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–44. KDD ’16. New York, NY, USA: ACM.
Rudin, Cynthia. 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” Nature Machine Intelligence 1 (5): 206–15.
Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. 2017. “Axiomatic Attribution for Deep Networks,” March.
Sweeney, Latanya. 2013. “Discrimination in Online Ad Delivery.” Queue 11 (3): 10:10–29.
Wisdom, Scott, Thomas Powers, James Pitton, and Les Atlas. 2016. “Interpretable Recurrent Neural Networks Using Sequential Sparse Recovery.” In Advances in Neural Information Processing Systems 29.
Wu, Xiaolin, and Xi Zhang. 2016. “Automated Inference on Criminality Using Face Images.” arXiv:1611.04135 [Cs], November.
Zemel, Rich, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. 2013. “Learning Fair Representations.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 325–33.
