The meeting point of differential privacy, accountability, interpretability, the tank detection story, and clever horses in machine learning.
Closely related: am I explaining the model so I can see if it is fair?
There is a lot of work in this area; I understand little of it at the moment, but I keep needing to refer back to the papers.
Impossibility
Integrated gradients
Ancona et al. (2017); Sundararajan, Taly, and Yan (2017). The Captum implementation seems neat.
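For my own reference, a minimal sketch of the computation in plain PyTorch (the function name, the Riemann-sum step count, and the assumption that the model returns class logits are mine, not from the papers): average the gradients along a straight path from a baseline to the input, then scale by the input minus the baseline.

```python
import torch

def integrated_gradients(model, x, baseline, target, steps=50):
    """Approximate integrated gradients (Sundararajan, Taly, and Yan 2017)
    by a Riemann sum along the straight path from `baseline` to `x`."""
    grads = []
    for alpha in torch.linspace(0.0, 1.0, steps + 1)[1:]:
        # A point on the path, detached so it is a fresh leaf we can differentiate at.
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        score = model(point.unsqueeze(0))[0, target]   # logit of the target class
        grad, = torch.autograd.grad(score, point)
        grads.append(grad)
    avg_grad = torch.stack(grads).mean(dim=0)
    # Attribution: (input - baseline) times the path-averaged gradient.
    return (x - baseline) * avg_grad
```

Captum's `IntegratedGradients` wraps the same idea with batching, baseline handling and a convergence check, which is why I tend to reach for it instead.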
When do neurons mean something?
It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?
Incoming
Are Model Explanations Useful in Practice? Rethinking How to Support Human-ML Interactions.
Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).
George Hosu, A Parable Of Explainability
Connection to Gödel: mathematical paradoxes demonstrate the limits of AI (Colbrook, Antun, and Hansen 2022; Heaven 2019).
Frequently I need the link to LIME, a neat method that fits a penalised regression locally around a single prediction to explain it (Ribeiro, Singh, and Guestrin 2016). See their blog post.
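As a memory aid, a rough sketch of the LIME recipe under simplifying assumptions (Gaussian perturbations and a ridge surrogate rather than the paper's interpretable binary features and Lasso; `predict_fn` and the kernel width are placeholders of mine): perturb the instance, query the black box, weight samples by proximity, and read the weighted surrogate's coefficients as the local explanation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(predict_fn, x, n_samples=5000, kernel_width=0.75, alpha=1.0):
    """LIME-style local surrogate: perturb x, weight samples by proximity,
    and fit a penalised linear model to the black-box outputs."""
    rng = np.random.default_rng(0)
    Z = x + rng.normal(scale=1.0, size=(n_samples, x.shape[0]))
    y = predict_fn(Z)                                   # black-box predictions, shape (n_samples,)
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)  # exponential proximity kernel
    surrogate = Ridge(alpha=alpha)
    surrogate.fit(Z - x, y, sample_weight=weights)
    return surrogate.coef_                              # local feature attributions
```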
A cousin of LIME, built on Shapley values, is SHAP; its TreeSHAP variant computes the attributions exactly and quickly for tree ensembles.
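A hedged usage sketch of the shap library on a toy tree ensemble (the model and data here are placeholders, not anything from this post):

```python
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # exact Shapley values, fast for trees
shap_values = explainer.shap_values(X)  # one attribution per feature per sample
```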
The deep dream “activation maximisation” images could sort of be classified as a type of model explanation, e.g. multifaceted neuron visualization (Nguyen, Yosinski, and Clune 2016).
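A crude sketch of the gradient-ascent loop behind such images, assuming a PyTorch image model and a user-supplied `layer_score` callable (both hypothetical); real feature-visualisation pipelines add regularisers such as jitter, blurring and frequency penalties to keep the images legible.

```python
import torch

def activation_maximisation(layer_score, steps=200, lr=0.05):
    """Gradient-ascend a random image so as to maximise `layer_score(x)`,
    e.g. the mean activation of one chosen neuron or channel."""
    x = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -layer_score(x)   # negate: the optimiser minimises
        loss.backward()
        opt.step()
    return x.detach()
```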
Belatedly I notice that the Data Skeptic podcast did a whole season on interpretability.
How explainable artificial intelligence can help humans innovate
Explainability trade-offs
- Cassie Kozyrkov, Explainable AI won’t deliver. Here’s why.
- Wolters Kluwer, Peeking into the black box: a design perspective on comprehensible AI, part 1
- Rudin (2019) argues the opposite.