⚠️ TODO: I play fast and loose with language about subgroups and interaction terms here. We can define each in terms of the other often, but they are not quite the same thing. Maybe this would benefit from me making that clearer.

Estimating interaction effects is hard, but it is probably the important thing to do in any complex and/or human system. So how do we optimally trade off answering the most specific questions against the rapidly growing expense and difficulty of experiments large enough to detect them? There is also the rapidly growing number of possible interactions as problems grow.
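To get a feel for how fast the number of possible interactions grows, here is a minimal sketch that counts the distinct product terms among `p` factors. The function name `n_interactions` is my own, and I assume we only count products of distinct factors (no powers).

```python
from math import comb


def n_interactions(p: int, max_order: int) -> int:
    """Count interaction terms of order 2..max_order among p distinct factors."""
    return sum(comb(p, k) for k in range(2, max_order + 1))


for p in (5, 10, 20):
    pairwise = n_interactions(p, 2)
    up_to_three_way = n_interactions(p, 3)
    all_orders = n_interactions(p, p)  # equals 2**p - p - 1
    print(p, pairwise, up_to_three_way, all_orders)
```

Pairwise terms grow quadratically in `p`, but once all orders are allowed the count is exponential, which is the sense in which the experiment size needed to detect everything explodes.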

There is a connection with problematic methodology here: the need for specificity can manifest through researcher degrees of freedom, i.e. choosing which interactions to model *post hoc*.

That is, the world is probably built of hierarchical models but we do not always have the right data to identify them, or enough of it when we do.

Lots of ill-connected notes ATM.

## Review of limits of heterogeneous treatment effects literature

Data requirements, false discovery. If we want to learn interaction effects from observational studies then we need heroic amounts of data, to eliminate confounders and estimate the explosion of possible terms. Does this mean that by attempting to operate this way we are implicitly demanding a surveillance state?

## Subgroup identification

Classic experimental practice tries to estimate an effect, then either

- faces a thicket of onerous multiple testing challenges to do model selection to work out who it applies to, or
- applies for new funding to identify relevant subgroups with new data in a new experiment.
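The multiple-testing problem in the first option can be made concrete with a toy simulation. This is a sketch under assumptions of my own choosing: disjoint subgroups, unit-variance Gaussian outcomes, a known-variance z-test at the 5% level, and no true effect in anyone. The function names are hypothetical.

```python
import random
from statistics import fmean


def any_false_positive(n_subgroups, n_per_arm, rng, z_crit=1.96):
    """Simulate one experiment with no true effect; True if any subgroup test rejects."""
    for _ in range(n_subgroups):
        treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        # z-test with known unit variance: SE of a difference in means is sqrt(2/n)
        z = (fmean(treated) - fmean(control)) / (2.0 / n_per_arm) ** 0.5
        if abs(z) > z_crit:
            return True
    return False


def familywise_error_rate(n_subgroups, n_per_arm=20, n_sims=500, seed=0):
    """Fraction of null experiments that report at least one 'significant' subgroup."""
    rng = random.Random(seed)
    hits = sum(any_false_positive(n_subgroups, n_per_arm, rng) for _ in range(n_sims))
    return hits / n_sims


for k in (1, 5, 10, 20):
    # simulated rate next to the independent-tests formula 1 - 0.95**k
    print(k, familywise_error_rate(k), round(1 - 0.95**k, 3))
```

With independent tests the chance of at least one spurious subgroup is `1 - 0.95**k`, so uncorrected subgroup hunting finds "effects" in pure noise surprisingly often.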

Can we estimate subgroups and effects simultaneously? How bad is our degrees-of-freedom situation in this case? Not clear, and I could not see an easy answer skimming the references (Foster, Taylor, and Ruberg 2011; Imai and Ratkovic 2013; Lipkovich, Dmitrienko, and B 2017; Su et al. 2009).

## Conditional average treatment effect

Working out how to condition on stuff is the bread and butter of causal inference, and there are a bunch of ways to analyse it there.
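For a binary treatment, the conditional average treatment effect is usually written as τ(x) = E[Y(1) − Y(0) | X = x]. In a randomized experiment with a discrete covariate, the simplest estimator is a difference in means within each stratum. The following is a minimal sketch on simulated data; the data-generating process and the function names are hypothetical, chosen only to illustrate the idea.

```python
import random
from statistics import fmean


def simulate_rct(n, rng):
    """Hypothetical RCT: binary covariate x, randomized treatment t, outcome y.

    The true effect is 2.0 when x == 1 and 0.0 when x == 0, so the ATE is 1.0.
    """
    rows = []
    for _ in range(n):
        x = int(rng.random() < 0.5)
        t = int(rng.random() < 0.5)
        effect = 2.0 if x == 1 else 0.0
        y = 1.0 + effect * t + rng.gauss(0.0, 1.0)
        rows.append((x, t, y))
    return rows


def diff_in_means(rows):
    """Treated-minus-control difference in mean outcome."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return fmean(treated) - fmean(control)


def cate_by_stratum(rows):
    """Estimate the CATE separately within each level of the binary covariate."""
    return {x: diff_in_means([r for r in rows if r[0] == x]) for x in (0, 1)}


rows = simulate_rct(4000, random.Random(0))
print("ATE estimate:", diff_in_means(rows))
print("CATE estimates:", cate_by_stratum(rows))
```

Here the average effect describes nobody in particular: half the population has no effect and the other half has twice the average. Stratification only scales to a handful of discrete covariates, which is where the fancier methods come in.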

## As transferability

If we know what interactions our model has, then we are closer to learning the correct conditioning set. See external validity.

## Ontological context

- Science in a High-Dimensional World
- The “It’s really complicated and sad” theory of obesity.
- Interactions are probably always present; they just might be small. See Gwern’s Everything Is Correlated for a roundup on this theme.

## Scientific context

Over at social psychology, I’ve wondered about Peter Dorman’s comment:

> the fixation on finding average effects when the structure of effect differences is what we ought to be interested in.

See Slime Mold Time Mold, Reality is Very Weird and You Need to be Prepared for That

But as we see from the history of scurvy, sometimes splitting is the right answer! In fact, there were meaningful differences in different kinds of citrus, and meaningful differences in different animals. Making a splitting argument to save a theory — “maybe our supplier switched to a different kind of citrus, we should check that out” — is a reasonable thing to do, especially if the theory was relatively successful up to that point.

Splitting is perfectly fair game, at least to an extent — doing it a few times is just prudent, though if you have gone down a dozen rabbitholes with no luck, then maybe it is time to start digging elsewhere.

There is much commentary from Andrew Gelman et al. on this theme, e.g. You need 16 times the sample size to estimate an interaction than to estimate a main effect (Gelman, Hill, and Vehtari 2021 ch 16.4).
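As I understand the argument behind the factor of 16, it has two parts: in a balanced design the standard error of an interaction is twice that of the main effect, and the interaction is assumed to be half the size of the main effect. A minimal sketch of that arithmetic follows; the function names are mine, and the half-size assumption is exactly that, an assumption.

```python
from math import sqrt


def se_main_effect(sigma, n):
    """SE of a difference in means between two arms of n/2 units each: 2*sigma/sqrt(n)."""
    return sigma * sqrt(1 / (n / 2) + 1 / (n / 2))


def se_interaction(sigma, n):
    """SE of the difference between treatment effects in two equal subgroups of n/2 units."""
    se_subgroup_effect = se_main_effect(sigma, n / 2)
    # two independent subgroup estimates: variances add, giving 4*sigma/sqrt(n)
    return sqrt(se_subgroup_effect**2 + se_subgroup_effect**2)


def sample_size_ratio(relative_size):
    """Factor by which n must grow so that an interaction of the given size,
    relative to the main effect, has the same signal-to-noise ratio as the main effect."""
    se_ratio = 2.0  # se_interaction / se_main_effect in the balanced design above
    return (se_ratio / relative_size) ** 2


print(se_interaction(1.0, 1000) / se_main_effect(1.0, 1000))
print(sample_size_ratio(1.0), sample_size_ratio(0.5))
```

So an interaction as large as the main effect needs 4 times the data, and one half as large needs 16 times. If interactions are typically smaller still, the situation gets worse quadratically.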

C&C Epstein Barr and the Cause of Cause

Miller (2013) writes about basic data hygiene in this light for data journalists etc.

## Incoming

Kernel tricks for detecting 2-way interactions: Agrawal et al. (2019); Agrawal and Broderick (2021). See Tamara Broderick present this.

The Big Data Paradox in Clinical Practice (Msaouel 2022)

> The big data paradox is a real-world phenomenon whereby as the number of patients enrolled in a study increases, the probability that the confidence intervals from that study will include the truth decreases. This occurs in both observational and experimental studies, including randomized clinical trials, and should always be considered when clinicians are interpreting research data. Furthermore, as data quantity continues to increase in today’s era of big data, the paradox is becoming more pernicious. Herein, I consider three mechanisms that underlie this paradox, as well as three potential strategies to mitigate it: (1) improving data quality; (2) anticipating and modeling patient heterogeneity; (3) including the systematic error, not just the variance, in the estimation of error intervals.
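One way to see how coverage can fall as sample size grows is to take the systematic-error mechanism in isolation. This is my own toy setup, not the paper's: we estimate a mean from Gaussian data with a fixed bias that does not shrink with `n`, and build the usual known-variance interval. The coverage is then available in closed form.

```python
from statistics import NormalDist


def ci_coverage(n, bias, sigma=1.0, level=0.95):
    """Probability that a nominal `level` interval for a mean covers the truth
    when the estimator carries a fixed systematic bias."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(0.5 + level / 2)
    # the bias measured in units of the standard error sigma/sqrt(n)
    shift = bias * n**0.5 / sigma
    return std_normal.cdf(z - shift) - std_normal.cdf(-z - shift)


for n in (10, 100, 1000, 10000):
    print(n, round(ci_coverage(n, bias=0.1), 4))
```

The interval narrows around the wrong value, so a bias that is negligible at small `n` eventually dominates and coverage heads to zero. With `bias=0` the coverage stays at the nominal level for every `n`.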

## References

*arXiv:2106.12408 [Stat]*, October.

*Proceedings of the 36th International Conference on Machine Learning*, 141–50. PMLR.

*Annals of Statistics* 47 (2): 1148–78.

*Philosophy of Science* 83 (1): 60–81.

*arXiv:2011.07051 [Econ, Stat]*, November.

*Statistics in Medicine* 30 (24): 10.1002/sim.4322.

*Regression and other stories*. Cambridge, UK: Cambridge University Press.

*Behavioral and Brain Sciences* 45.

*The Annals of Applied Statistics* 7 (1): 443–70.

*The Econometrics Journal* 24 (1): 134–61.

*Statistics in Medicine* 36 (1): 136–96.

*Mathematical Models of Social Evolution: A Guide for the Perplexed*. University Of Chicago Press.

*The Chicago Guide to Writing about Multivariate Analysis*. Second edition. Chicago Guides to Writing, Editing, and Publishing. Chicago: University of Chicago Press.

*Cancer Investigation* 40 (7): 1–10.

*Social Epistemology* 33 (1): 23–41.

*Journal of Machine Learning Research*. Vol. 10. Rochester, NY.

## Social context

## Is this what intersectionality means?

A real question. If we are concerned with inequality, then there is an implied graphical model which produces different outcomes depending on who is being modeled, and these will have implications for fairness.

It turns out people have engaged meaningfully with this. Bright, Malinsky, and Thompson (2016) suggest some testable models.

## The advice you found is probably not for you

Every pundit has a model of what the typical member of the public thinks, and directs their advice accordingly. For many reasons, the pundit’s model is likely to be wrong. The readers of any given pundit are a self-selecting sample, and the pundit’s intuitive model of society is distorted. Even if they surveyed their readership, it would be hard to use that survey to learn anything reliable about it.

So all advice like “People should do more X” is suspect, because it rests on the author’s assumption that the readers are in class A, when they could easily be in class B, who maybe should do *less* X. That might be because X does not work for class B people in general, or because class B people are likely to have done too much X already and need to lay off it for a while. See adverse advice selection.