External validity and dataset shift

Also: transfer learning, learning under covariate shift, transferable learning, domain adaptation, etc.


This Maori gentleman (name unspecified) from the 1800s demonstrates an artful transfer learning from the western fashion domain

One could read Sebastian Ruder’s NN-style introduction to “transfer learning”. NN people like to think about this in a particular way, which I like because of the diversity of out-of-the-box ideas it invites, and which I dislike because it is sloppy.

For me it seems natural to consider learning well-factored causal graphical models containing the necessary interaction effects as the platonic ideal here and everything else is just an approximation to that.

Although maybe I should be thinking about feedback effects also: if everyone uses my algorithms, does this change the environment in which my algorithm is operating? For traffic routing algorithms, for example, the answer is clearly yes.

The reason this is a hot topic in neural nets, I suspect, is that it is convenient for massive, low-human-effort neural networks to ignore graphical structure and still get predictively good results from regressions on observational data, and this leads us into strife when the situation changes. Recovering causal consistency in a black-box model is even more tedious than in a classical one. Also, it fits the social conventions of neural network research to reinvent methods to fix such problems without reference to previous conventions, for better and worse.
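A minimal simulation sketch of that failure mode, assuming a made-up linear structural model in which X1 causes Y and Y causes X2. The model, the variable names and every parameter here are hypothetical and chosen only for illustration; nothing is taken from the papers cited below. A regression that ignores the graph happily leans on X2, the effect of Y, and pays for it when the mechanism producing X2 changes.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate(n, x2_noise_sd):
    """Hypothetical causal chain X1 -> Y -> X2; only X2's mechanism varies."""
    x1 = rng.normal(size=n)
    y = 2.0 * x1 + rng.normal(size=n)          # Y depends causally on X1
    x2 = y + x2_noise_sd * rng.normal(size=n)  # X2 is an effect of Y
    return np.column_stack([x1, x2]), y


def fit_ols(X, y):
    """Least-squares coefficients (no intercept; all variables have mean zero)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def mse(X, y, coef):
    """Mean squared prediction error of a linear model."""
    return float(np.mean((y - X @ coef) ** 2))


# Source domain: X2 is a nearly noiseless copy of Y.
X_src, y_src = simulate(5000, x2_noise_sd=0.1)
# Target domain: the mechanism generating X2 has become much noisier.
X_tgt, y_tgt = simulate(5000, x2_noise_sd=3.0)

coef_all = fit_ols(X_src, y_src)            # uses X1 and X2
coef_causal = fit_ols(X_src[:, :1], y_src)  # uses only the causal parent X1

mse_all_src = mse(X_src, y_src, coef_all)
mse_all_tgt = mse(X_tgt, y_tgt, coef_all)
mse_causal_src = mse(X_src[:, :1], y_src, coef_causal)
mse_causal_tgt = mse(X_tgt[:, :1], y_tgt, coef_causal)

print(f"all features:  source MSE {mse_all_src:.3f}, target MSE {mse_all_tgt:.3f}")
print(f"causal parent: source MSE {mse_causal_src:.3f}, target MSE {mse_causal_tgt:.3f}")
```

Under these assumptions the all-features regression should win on the source domain and lose badly on the target, while the regression on the causal parent alone has roughly the same error in both.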

I am often confused by how surprised we are about the difficulties of transferring models between domains and how continual is the flow of new publications on this theme; e.g. Google AI Blog: How Underspecification Presents Challenges for Machine Learning.

One thing that the machine learning setup gives us is an additional emphasis. External validity, the statistical framing, asks whether the model we have learnt is still useful on new data. The transfer learning setup invites us to consider whether we can transfer some of the computational effort from learning on one dataset to learning on a new dataset and, if so, how much. Maybe that is a useful insight?
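One crude way to read "transferring computational effort" is warm-starting: initialise the optimiser on the new dataset at the solution found on the old one and count how many steps are saved. The sketch below assumes a toy linear regression fitted by plain gradient descent, with hypothetical coefficients and a mild, made-up shift between source and target. It is one possible reading of the idea, not a method from the references.

```python
import numpy as np

rng = np.random.default_rng(1)


def make_data(n, coef):
    """Hypothetical linear-Gaussian dataset with the given true coefficients."""
    X = rng.normal(size=(n, coef.size))
    y = X @ coef + 0.1 * rng.normal(size=n)
    return X, y


def gradient_descent(X, y, w0, lr=0.1, tol=1e-3, max_steps=10_000):
    """Minimise mean squared error; return the weights and the steps used."""
    w = w0.copy()
    for step in range(max_steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return w, step
        w -= lr * grad
    return w, max_steps


coef_src = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
coef_tgt = coef_src + 0.05 * rng.normal(size=coef_src.size)  # a mild shift

X_src, y_src = make_data(2000, coef_src)
X_tgt, y_tgt = make_data(2000, coef_tgt)

# Effort already spent on the source dataset.
w_src, _ = gradient_descent(X_src, y_src, np.zeros(5))

# Learning on the target dataset, from scratch versus from the source solution.
w_cold, cold_steps = gradient_descent(X_tgt, y_tgt, np.zeros(5))
w_warm, warm_steps = gradient_descent(X_tgt, y_tgt, w_src)

print(f"cold start: {cold_steps} steps, warm start: {warm_steps} steps")
```

Both runs end at essentially the same weights; the warm start just gets there in fewer steps, and the saving shrinks as the shift between the two datasets grows.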

This also connects to semi-supervised learning and fairness, as argued in (Schölkopf et al. 2012; Schölkopf 2019).

From a different angle again, but possibly with the same underlying idea, we could argue that interaction effects are probably what we want to learn.

Standard graphical models

We can just try some basic graphical model technology and see how far we get. If the right independences are enforced, presumably we are doing something not too far from learning a transferable model? Or, if we work out that the necessary parameters are not identifiable, then we discover that we cannot in fact learn a transferable model, right? (But maybe we can learn a somewhat transferable model?) I guess the key weakness is that graphical models will miss some types of transferability, specifically independences that hold only for particular values of the nodes, so this approach might be less powerful.
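To spell out that last caveat: a standard graph encodes independences that hold for every value of the conditioning variables, whereas a context-specific independence holds only for some values. As a hedged illustration, with hypothetical variables X, Y and a binary context Z that are not part of the discussion above:

```latex
% Ordinary conditional independence, which a DAG can represent:
\[
Y \perp X \mid Z
\quad\Longleftrightarrow\quad
p(y \mid x, z) = p(y \mid z) \quad \text{for all } x, y, z.
\]

% Context-specific independence, holding only in the context Z = 0:
\[
p(y \mid x, Z = 0) = p(y \mid Z = 0) \quad \text{for all } x, y,
\qquad \text{while } p(y \mid x, Z = 1) \text{ may still depend on } x.
\]
```

A plain DAG over X, Y and Z has to draw the edge from X to Y in the second case, so the invariance available in the context Z = 0 is invisible in the graph.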

External validity in policy

See also anthropic principles, science for policy.

I have lots of ideas about policy for the world, and I think that some of the ideas are good because of some mix of scientific research and personal experience.[1] So let us suppose that I am broadly sympathetic to some policy instrument (state ownership of power utilities? diversity quotas in hiring? etc.) because I have seen it work in the past. The question is, how universally should I be in favour of that policy? How do I find out which circumstances make these policy instruments achieve my desired outcomes?

Here is one that arose in my workplace recently. Presumably a diversity quota requiring a certain percentage of the workforce be, say, women, would be pointless in a society with perfect gender equality, and ineffectual in a society which has failed to train any women at all with the required skills. Most societies will not be at either of those extremes, but what is the range of gender inequity where hiring quotas would be a useful policy intervention? What other predictors will change their effectiveness? This policy is not a good idea in and of itself, but rather in a particular context. In my observation, debates commonly bury that essential context.

Rather than reaching for universal policy prescriptions, it is worth asking how context-specific a policy is, and constantly checking whether it applies here.



salad is a library to easily setup experiments using the current state-of-the art techniques in domain adaptation. It features several of recent approaches, with the goal of being able to run fair comparisons between algorithms and transfer them to real-world use cases.


WILDS: A Benchmark of in-the-Wild Distribution Shifts

To facilitate the development of ML models that are robust to real-world distribution shifts, our ICML 2021 paper presents WILDS, a curated benchmark of 10 datasets that reflect natural distribution shifts arising from different cameras, hospitals, molecular scaffolds, experiments, demographics, countries, time periods, users, and codebases.


“Meta” (a.k.a. transfer?) learning in pytorch. I’m not actually sure. TBC.


Arjovsky, Martin, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. 2020. “Invariant Risk Minimization.” arXiv:1907.02893 [Cs, Stat], March.
Bareinboim, Elias. 2014. “Generalizability in Causal Inference: Theory and Algorithms.”
Bareinboim, Elias, and Judea Pearl. 2016. “Causal Inference and the Data-Fusion Problem.” Proceedings of the National Academy of Sciences 113 (27): 7345–52.
Bongers, Stephan, Patrick Forré, Jonas Peters, Bernhard Schölkopf, and Joris M. Mooij. 2020. “Foundations of Structural Causal Models with Cycles and Latent Variables.” arXiv:1611.06221 [Cs, Stat], October.
Bühlmann, Peter. 2020. “Invariance, Causality and Robustness.” Statistical Science 35 (3): 404–26.
Christiansen, Rune, Niklas Pfister, Martin Emil Jakobsen, Nicola Gnecco, and Jonas Peters. 2020. “A Causal Framework for Distribution Generalization,” June.
D’Amour, Alexander, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, et al. 2020. “Underspecification Presents Challenges for Credibility in Modern Machine Learning.” arXiv:2011.03395 [Cs, Stat], November.
Deaton, Angus, and Nancy Cartwright. 2016. “Understanding and Misunderstanding Randomized Controlled Trials.” Working Paper 22595. National Bureau of Economic Research.
Fernández-Loría, Carlos, and Foster Provost. 2021. “Causal Decision Making and Causal Effect Estimation Are Not the Same… and Why It Matters.” arXiv:2104.04103 [Cs, Stat], September.
Gigerenzer, Gerd. n.d. “We Need to Think More about How We Conduct Research.” Behavioral and Brain Sciences 45.
Hoffimann, Júlio, Maciel Zortea, Breno de Carvalho, and Bianca Zadrozny. 2021. “Geostatistical Learning: Challenges and Opportunities.” Frontiers in Applied Mathematics and Statistics 7.
Kilbertus, Niki, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Schölkopf. 2017. “Avoiding Discrimination Through Causal Reasoning.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 656–66. Curran Associates, Inc.
Koh, Pang Wei, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, et al. 2021. “WILDS: A Benchmark of in-the-Wild Distribution Shifts.” arXiv:2012.07421 [Cs], July.
Kulinski, Sean, and David I. Inouye. 2022. “Towards Explaining Distribution Shifts.” arXiv.
Künzel, Sören R., Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. 2019. “Metalearners for Estimating Heterogeneous Treatment Effects Using Machine Learning.” Proceedings of the National Academy of Sciences 116 (10): 4156–65.
Meinshausen, Nicolai. 2018. “Causality from a Distributional Robustness Point of View.” In 2018 IEEE Data Science Workshop (DSW), 6–10.
Olteanu, Alexandra, Carlos Castillo, Fernando Diaz, and Emre Kıcıman. 2019. “Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries.” Frontiers in Big Data 2.
Pearl, Judea, and Elias Bareinboim. 2014. “External Validity: From Do-Calculus to Transportability Across Populations.” Statistical Science 29 (4): 579–95.
Peters, Jonas, Dominik Janzing, and Bernhard Schölkopf. 2017. Elements of Causal Inference: Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. Cambridge, Massachusetts: The MIT Press.
Quiñonero-Candela, Joaquin. 2009. Dataset Shift in Machine Learning. Cambridge, Mass.: MIT Press.
Ramchandran, Maya, and Rajarshi Mukherjee. 2021. “On Ensembling Vs Merging: Least Squares and Random Forests Under Covariate Shift.” arXiv:2106.02589 [Math, Stat], June.
Rothenhäusler, Dominik, Nicolai Meinshausen, Peter Bühlmann, and Jonas Peters. 2020. “Anchor Regression: Heterogeneous Data Meets Causality.” arXiv:1801.06229 [Stat], May.
Rubenstein, Paul K., Stephan Bongers, Bernhard Schölkopf, and Joris M. Mooij. 2018. “From Deterministic ODEs to Dynamic Structural Causal Models.” In Uncertainty in Artificial Intelligence.
Runge, Jakob, Sebastian Bathiany, Erik Bollt, Gustau Camps-Valls, Dim Coumou, Ethan Deyle, Clark Glymour, et al. 2019. “Inferring Causation from Time Series in Earth System Sciences.” Nature Communications 10 (1): 2553.
Schölkopf, Bernhard. 2019. “Causality for Machine Learning.” arXiv:1911.10500 [Cs, Stat], December.
Schölkopf, Bernhard, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. 2012. “On Causal and Anticausal Learning.” In ICML 2012.
Schölkopf, Bernhard, David W. Hogg, Dun Wang, Daniel Foreman-Mackey, Dominik Janzing, Carl-Johann Simon-Gabriel, and Jonas Peters. 2015. “Removing Systematic Errors for Exoplanet Search via Latent Causes.” arXiv:1505.03036 [Astro-Ph, Stat], May.
Schram, Arthur. 2005. “Artificiality: The Tension Between Internal and External Validity in Economic Experiments.” Journal of Economic Methodology 12 (2): 225–37.
Shi, Claudia, David M. Blei, and Victor Veitch. 2019. “Adapting Neural Networks for the Estimation of Treatment Effects.” In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2507–17. Red Hook, NY, USA: Curran Associates Inc.
Subbaswamy, Adarsh, Peter Schulam, and Suchi Saria. 2019. “Preventing Failures Due to Dataset Shift: Learning Predictive Models That Transport.” In The 22nd International Conference on Artificial Intelligence and Statistics, 3118–27. PMLR.
Tibshirani, Ryan J, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. 2019. “Conformal Prediction Under Covariate Shift.” In Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc.
Veitch, Victor, and Anisha Zaveri. 2020. “Sense and Sensitivity Analysis: Simple Post-Hoc Analysis of Bias Due to Unobserved Confounding,” March.
Verma, Sahil, John Dickerson, and Keegan Hines. 2020. “Counterfactual Explanations for Machine Learning: A Review.” In, 22.
Wang, Yixin, and Michael I. Jordan. 2021. “Desiderata for Representation Learning: A Causal Perspective.” arXiv:2109.03795 [Cs, Stat], September.

  1. Realistically, I copied some ideas from my acquaintances, but maybe even those ideas have the same sort of empirical basis. Let us optimistically assume so for now 🤞.
