Draft

Explainability of human ethical algorithms

October 10, 2023

bounded compute
collective knowledge
cooperation
culture
ethics
gene
incentive mechanisms
institutions
machine learning
neuron
sparser than thou
utility
wonk

Can we explain the ethical algorithms that humans run, in the same way that we attempt to explain black-box ML algorithms?

I’m thinking of things like influence functions, which identify the training examples a model’s output is reasoning from, as in Grosse et al. (2023); explanations in terms of features would also be interesting.

Figure 2: From the Twitter summary of Grosse et al. (2023)
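To make the analogy concrete, here is the classic influence-function recipe that work like Grosse et al. (2023) scales up: the influence of a training example on a test prediction is a gradient–inverse-Hessian–gradient product. A minimal sketch on a toy logistic regression, assuming JAX; all names are illustrative, and the explicit Hessian solve stands in for the scalable approximations (e.g. EK-FAC in Grosse et al. 2023) used at LLM scale.

```python
# Sketch: influence(z_i, z_test) = -∇L(z_test)ᵀ H⁻¹ ∇L(z_i),
# with H the Hessian of the training loss at the fitted parameters.
import jax
import jax.numpy as jnp

def loss(theta, x, y):
    # Per-example negative log-likelihood for logistic regression.
    logit = jnp.dot(theta, x)
    return jnp.logaddexp(0.0, logit) - y * logit

def total_loss(theta, X, Y):
    return jnp.mean(jax.vmap(loss, in_axes=(None, 0, 0))(theta, X, Y))

def influence_scores(theta, X, Y, x_test, y_test, damping=1e-4):
    # s = H⁻¹ ∇L(z_test); the influence of training point i is -sᵀ ∇L(z_i).
    # Damping keeps the solve well-posed when H is near-singular.
    H = jax.hessian(total_loss)(theta, X, Y)
    g_test = jax.grad(loss)(theta, x_test, y_test)
    s = jnp.linalg.solve(H + damping * jnp.eye(H.shape[0]), g_test)
    g_train = jax.vmap(jax.grad(loss), in_axes=(None, 0, 0))(theta, X, Y)
    return -g_train @ s  # one score per training example

# Fit on toy data, then rank training points by influence on one test point.
key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (200, 3))
Y = (X @ jnp.array([1.5, -2.0, 0.5]) > 0).astype(jnp.float32)
theta = jnp.zeros(3)
for _ in range(500):  # plain gradient descent; any optimiser would do
    theta -= 0.5 * jax.grad(total_loss)(theta, X, Y)
scores = influence_scores(theta, X, Y, X[0], Y[0])
top = jnp.argsort(scores)[-5:]  # the five most influential training examples
```

At LLM scale nobody forms H explicitly; Grosse et al. (2023) approximate the inverse-Hessian-vector product with EK-FAC, but the quantity being approximated is the one above. The question here is whether anything analogous can be computed, or even defined, for the ethical judgments humans emit.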

1 Disgust

An interesting starting point. See disgust, and the literature relating disgust sensitivity to political orientation (Inbar, Pizarro, and Bloom 2009; Inbar et al. 2012; Feinberg et al. 2014).

2 Theory of moral sentiments

e.g. Haidt (2013); on the economic side, Moral Sentiments and Material Interests (2006).

3 References

DellaPosta, Shi, and Macy. 2015. Why Do Liberals Drink Lattes? American Journal of Sociology.
Feinberg, Antonenko, Willer, et al. 2014. Gut Check: Reappraisal of Disgust Helps Explain Liberal–Conservative Differences on Issues of Purity. Emotion.
Grosse, Bae, Anil, et al. 2023. Studying Large Language Model Generalization with Influence Functions.
Haidt. 2013. The Righteous Mind: Why Good People Are Divided by Politics and Religion.
Inbar, Pizarro, and Bloom. 2009. Conservatives Are More Easily Disgusted Than Liberals. Cognition and Emotion.
Inbar, Pizarro, Iyer, et al. 2012. Disgust Sensitivity, Political Conservatism, and Voting. Social Psychological and Personality Science.
Moral Sentiments and Material Interests: The Foundations of Cooperation in Economic Life. 2006.
Storr. 2021. The Status Game: On Human Life and How to Play It.