Learning under distribution shift

Also transfer learning, learning under covariate shift, transferable learning, domain adaptation, etc.

This Maori gentleman (name unspecified) from the 1800s demonstrates an artful transfer learning from the western fashion domain

I am often confused by how surprised we are by the difficulty of transferring models between domains, and by how steady the flow of new publications on this theme is; e.g. Google AI Blog: How Underspecification Presents Challenges for Machine Learning.

NN people like to think about this in a particular way, which I like because of the diversity of out-of-the-box ideas it invites, and which I dislike because it is sloppy. One could read Sebastian Ruder’s NN-style introduction to “transfer learning”.

This also connects to semi-supervised learning and fairness, argue Schölkopf et al. (2012) and Schölkopf (2022).

Possibly this is the same underlying idea: we could argue that interaction effects are probably what we want to learn.

What is transfer learning actually?

Not sure. Everyone I talk to seems to have a different idea, and each also thinks that their idea is canonical.

We need a taxonomy here.

Data fusion

Elias Bareinboim’s framing (Bareinboim and Pearl 2016, 2013, 2014; Pearl and Bareinboim 2014).

Standard graphical models

Just do causal inference.

Invariant risk minimisation

A trick from Arjovsky et al. (2020). Emin Orhan summarises the method plus several negative results about IRM:

Take invariant risk minimization (IRM), one of the more popular domain generalization methods proposed recently. IRM considers a classification problem that takes place in multiple domains or environments, \(e_1, e_2, \ldots, e_E\) (in an image classification setting, these could be natural images, drawings, paintings, computer-rendered images etc.). We decompose the learning problem into learning a feature backbone \(\Phi\) (a featurizer), and a linear readout \(\beta\) on top of it. Intuitively, in our classifier, we only want to make use of features that are invariant across different environments (for instance, the shapes of objects in our image classification example), and not features that vary from environment to environment (for example, the local textures of objects). This is because the invariant features are more likely to generalize to a new environment. We could, of course, do the old, boring empirical risk minimization (ERM), your grandmother's dumb method. This would simply lump the training data from all environments into one single giant training set and minimize the loss on that, with the hope that whatever features are more or less invariant across the environments will automatically emerge out of this optimization. Mathematically, ERM in this setting corresponds to solving the following well-known optimization problem (assuming the same amount of training data from each domain): \(\min_{\Phi, \beta} \frac{1}{E} \sum_e \mathfrak{R}^e(\Phi, \beta)\), where \(\mathfrak{R}^e\) is the empirical risk in environment \(e\). IRM proposes something much more complicated instead: why don't we learn a featurizer with the same optimal linear readout on top of it in every environment? The hope is that in this way, the extractor will only learn the invariant features, because the non-invariant features will change from environment to environment and can't be decoded optimally using the same fixed readout.
The IRM objective thus involves a difficult bi-level optimization problem…
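To make the contrast concrete, here is a minimal sketch of the pooled ERM objective next to the penalised IRMv1 surrogate of Arjovsky et al. (2020), which replaces the bi-level problem with the squared gradient of each environment's risk with respect to a dummy scalar readout \(w\), evaluated at \(w = 1\). Everything else is an assumption made for illustration: the toy data generator `make_env`, the two-dimensional linear featurizer `phi` with the readout folded in, the squared-error loss, and the penalty weight `lam` are all hypothetical and not taken from the quoted post.

```python
import random

def make_env(n, spurious_scale, seed):
    """Hypothetical toy environment: x_inv causes y, while x_spu is caused
    by y with a strength (spurious_scale) that differs between environments."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x_inv = rng.gauss(0.0, 1.0)
        y = x_inv + rng.gauss(0.0, 0.5)
        x_spu = spurious_scale * y + rng.gauss(0.0, 0.5)
        rows.append((x_inv, x_spu, y))
    return rows

def feature(phi, x_inv, x_spu):
    """Scalar linear featurizer Phi(x) = phi . x (an illustrative assumption)."""
    return phi[0] * x_inv + phi[1] * x_spu

def risk(phi, env, w=1.0):
    """Empirical squared-error risk of the prediction w * Phi(x) in one environment."""
    return sum((w * feature(phi, xi, xs) - y) ** 2 for xi, xs, y in env) / len(env)

def best_readout(phi, env):
    """Least-squares optimal scalar readout on top of Phi in one environment."""
    num = sum(feature(phi, xi, xs) * y for xi, xs, y in env)
    den = sum(feature(phi, xi, xs) ** 2 for xi, xs, y in env)
    return num / den

def irm_penalty(phi, env):
    """IRMv1 penalty: squared derivative of the risk w.r.t. w, evaluated at w = 1."""
    grad = sum(
        2.0 * (feature(phi, xi, xs) - y) * feature(phi, xi, xs) for xi, xs, y in env
    ) / len(env)
    return grad ** 2

def erm_objective(phi, envs):
    """Pooled ERM: average risk over environments of equal size."""
    return sum(risk(phi, env) for env in envs) / len(envs)

def irm_objective(phi, envs, lam=10.0):
    """IRMv1 surrogate: average of risk plus lam times the invariance penalty."""
    return sum(risk(phi, env) + lam * irm_penalty(phi, env) for env in envs) / len(envs)

envs = [make_env(2000, 1.0, seed=0), make_env(2000, 2.0, seed=1)]
invariant_phi = (1.0, 0.0)  # uses only the causal feature
spurious_phi = (0.0, 1.0)   # uses only the environment-dependent feature

for name, phi in [("invariant", invariant_phi), ("spurious", spurious_phi)]:
    readouts = [round(best_readout(phi, env), 2) for env in envs]
    print(name, "optimal readout per environment:", readouts)
    print(name, "ERM objective:", round(erm_objective(phi, envs), 3))
    print(name, "IRMv1 objective:", round(irm_objective(phi, envs), 3))
```

In this toy setup the invariant featurizer has nearly the same optimal readout in both environments, so its penalty is close to zero, while the spurious featurizer needs a different readout in each environment and is penalised accordingly. This only illustrates the quantities involved; it does not optimise over featurizers, and it says nothing about the negative results mentioned above.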

Justification for batch normalization

Ioffe and Szegedy (2015) justify batch normalization as reducing “internal covariate shift”. I should probably note some of the literature about that.



salad is a library to easily set up experiments using the current state-of-the-art techniques in domain adaptation. It features several recent approaches, with the goal of being able to run fair comparisons between algorithms and transfer them to real-world use cases.


WILDS: A Benchmark of in-the-Wild Distribution Shifts

To facilitate the development of ML models that are robust to real-world distribution shifts, our ICML 2021 paper presents WILDS, a curated benchmark of 10 datasets that reflect natural distribution shifts arising from different cameras, hospitals, molecular scaffolds, experiments, demographics, countries, time periods, users, and codebases.


Arjovsky, Martin, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. 2020. “Invariant Risk Minimization.” arXiv.
Bareinboim, Elias, and Judea Pearl. 2013. “A General Algorithm for Deciding Transportability of Experimental Results.” Journal of Causal Inference 1 (1): 107–34.
———. 2014. “Transportability from Multiple Environments with Limited Experiments: Completeness Results.” In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, 280–88. NIPS’14. Cambridge, MA, USA: MIT Press.
———. 2016. “Causal Inference and the Data-Fusion Problem.” Proceedings of the National Academy of Sciences 113 (27): 7345–52.
Besserve, Michel, Arash Mehrjou, Rémy Sun, and Bernhard Schölkopf. 2019. “Counterfactuals Uncover the Modular Structure of Deep Generative Models.” In arXiv:1812.03253 [Cs, Stat].
Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” arXiv.
Kaddour, Jean, Aengus Lynch, Qi Liu, Matt J. Kusner, and Ricardo Silva. 2022. “Causal Machine Learning: A Survey and Open Problems.” arXiv.
Koh, Pang Wei, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, et al. 2021. “WILDS: A Benchmark of in-the-Wild Distribution Shifts.” arXiv:2012.07421 [Cs], July.
Kosoy, Eliza, David M. Chan, Adrian Liu, Jasmine Collins, Bryanna Kaufmann, Sandy Han Huang, Jessica B. Hamrick, John Canny, Nan Rosemary Ke, and Alison Gopnik. 2022. “Towards Understanding How Machines Can Learn Causal Overhypotheses.” arXiv.
Kulinski, Sean, and David I. Inouye. 2022. “Towards Explaining Distribution Shifts.” arXiv.
Lattimore, Finnian Rachel. 2017. “Learning How to Act: Making Good Decisions with Machine Learning.”
Pearl, Judea, and Elias Bareinboim. 2014. “External Validity: From Do-Calculus to Transportability Across Populations.” Statistical Science 29 (4): 579–95.
Quiñonero-Candela, Joaquin. 2009. Dataset Shift in Machine Learning. Cambridge, Mass.: MIT Press.
Ramchandran, Maya, and Rajarshi Mukherjee. 2021. “On Ensembling Vs Merging: Least Squares and Random Forests Under Covariate Shift.” arXiv:2106.02589 [Math, Stat], June.
Rothenhäusler, Dominik, Nicolai Meinshausen, Peter Bühlmann, and Jonas Peters. 2020. “Anchor Regression: Heterogeneous Data Meets Causality.” arXiv:1801.06229 [Stat], May.
Schölkopf, Bernhard. 2022. “Causality for Machine Learning.” In Probabilistic and Causal Inference: The Works of Judea Pearl, 1st ed., 36:765–804. New York, NY, USA: Association for Computing Machinery.
Schölkopf, Bernhard, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. 2012. “On Causal and Anticausal Learning.” In ICML 2012.
Schölkopf, Bernhard, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. 2021. “Toward Causal Representation Learning.” Proceedings of the IEEE 109 (5): 612–34.
Subbaswamy, Adarsh, Peter Schulam, and Suchi Saria. 2019. “Preventing Failures Due to Dataset Shift: Learning Predictive Models That Transport.” In The 22nd International Conference on Artificial Intelligence and Statistics, 3118–27. PMLR.
Tibshirani, Ryan J, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. 2019. “Conformal Prediction Under Covariate Shift.” In Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc.
Ventola, Fabrizio, Steven Braun, Zhongjie Yu, Martin Mundt, and Kristian Kersting. 2023. “Probabilistic Circuits That Know What They Don’t Know.” arXiv.org.
