Learning under distribution shift

Also known as: transfer learning, covariate shift, transferable learning, domain adaptation, transportability, etc.

Predictive models can be trained on independent and identically distributed data without much fuss. Sometimes, though, our data is not identically distributed but is drawn from several different distributions. Say I am training a model which predicts customer behaviour, and I have customers in Australia and customers in India. Can I nonetheless train a model which works well on all of the data?

If we are using a parametric hierarchical model, we can pool data in the normal way and learn interaction effects.
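A minimal sketch of what "pooling with interaction effects" means here, using made-up customer data and ordinary least squares (the region names and coefficients are illustrative assumptions, not anything from a real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy customer data from two regions with different slopes.
n = 200
region = rng.integers(0, 2, n)            # 0 = Australia, 1 = India
x = rng.normal(size=n)
true_slope = np.where(region == 0, 1.0, 2.5)
y = 0.5 + true_slope * x + 0.1 * rng.normal(size=n)

# Pooled design matrix: intercept, x, region indicator, region:x interaction.
X = np.column_stack([np.ones(n), x, region, region * x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# beta[1] recovers the Australia slope; beta[1] + beta[3] the India slope.
print(beta)
```

The point is that a single pooled fit serves both populations at once, with the interaction term absorbing the between-region difference.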

If we are doing Neural Network Stuff though, it is not really clear how to do that. We might be vexed, and then surprised, and then write an article about it. If we are a typical researcher, that article might be blind to the prior art in statistics. e.g. Google AI Blog: How Underspecification Presents Challenges for Machine Learning, or Sebastian Ruder’s NN-style introduction to “transfer learning”.

I'm saying this in a snarky way, but there can be virtue in reinventing things with fresh eyes. Transfer learning, domain adaptation and the like all arise in the NN framing; sometimes the methods overlap with statistical classics and sometimes they extend the repertoire.

Here we will investigate as many of them as I have time for.

What is transfer learning or domain adaptation actually?

Everyone I talk to seems to have a different notion, and also to think that their idea is canonical.

We need a taxonomy. How about this one? In thuml/A-Roadmap-for-Transfer-Learning, Junguang Jiang, Yang Shu, Jianmin Wang and Mingsheng Long propose the following taxonomy of transfer methods (Jiang et al. 2022):

They handball to zhaoxin94/awesome-domain-adaptation for a finer domain adaptation taxonomy.

One survey paper not enough? Want a better taxonomy? Here are survey papers harvested from the above links:

(Csurka 2017; Gulrajani and Lopez-Paz 2020; Jiang et al. 2022; Kouw and Loog 2019; Ouali, Hudelot, and Tami 2020; Pan and Yang 2010; Patel et al. 2015; Sun, Shi, and Wu 2015; Tan et al. 2018; M. Wang and Deng 2018; Wilson and Cook 2020; Yuchen Zhang et al. 2019; J. Zhang et al. 2019; L. Zhang and Gao 2020; S. Zhao et al. 2020; Zhuang et al. 2020).

Transfer learning also connects to semi-supervised learning and fairness, argue Schölkopf et al. (2012) and Schölkopf (2022).

Graphical models

To my mind the most straightforward approach: simply do causal inference in a hierarchical model which encodes all the causal constraints. All the tools of graphical modeling are still well-posed. It is easy to explain in a Bayesian framework in particular. I think this is what is referred to in Elias Bareinboim’s data fusion framing (Bareinboim and Pearl 2016, 2013, 2014; Pearl and Bareinboim 2014). In this case we can use standard statistical tooling, such as HMC, to sample from some posterior under various interventions, e.g. a shift in some parameter of the population distribution.
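A toy illustration of the transportability idea: if the mechanism \(P(y\mid z)\) is assumed stable across environments while the covariate distribution \(P(z)\) shifts, the target-population outcome rate follows from the transport formula \(P^*(y) = \sum_z P(y\mid z)\,P^*(z)\). The specific numbers here are invented for the sketch:

```python
import numpy as np

# Source environment: P(z) and P(y=1|z) over a binary covariate z.
p_z_source = np.array([0.7, 0.3])
p_y1_given_z = np.array([0.2, 0.8])   # assumed-stable (invariant) mechanism

# Target environment shifts the covariate distribution only.
p_z_target = np.array([0.4, 0.6])

# Transport formula: P*(y=1) = sum_z P(y=1|z) P*(z).
p_y1_source = p_y1_given_z @ p_z_source   # 0.2*0.7 + 0.8*0.3 = 0.38
p_y1_target = p_y1_given_z @ p_z_target   # 0.2*0.4 + 0.8*0.6 = 0.56
print(p_y1_source, p_y1_target)
```

In a full data-fusion treatment the graph tells us *which* conditionals are transportable; here that is simply assumed.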

The hairy part is that this breaks down for neural networks: there is a million-dimensional nuisance parameter we need to integrate out, namely the neural weights. For reasons of size alone that is frequently impractical, with the computation cost blowing out.

Some other works that look related: (Gong et al. 2018; Moraffah et al. 2019; Yue et al. 2021; Xu, Wang, and Ni 2022; Rothenhäusler et al. 2020).

A graphical model approach has many things to recommend it if it works, though: we do not need to worry about missing values (they may also be inferred), and we can estimate intervention distributions, etc.


The LLM approach is out of scope for my current investigation, but very much in the news.

Sample weighting

If the proportions of the various sub-populations have changed, we can use stratified sampling to estimate the quantity of interest over the new population.
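A minimal sketch of the reweighting, with invented stratum proportions: importance weights of the form (new proportion) / (training proportion) re-target a sample mean to the shifted population.

```python
import numpy as np

rng = np.random.default_rng(1)

# Outcome depends on stratum; stratum proportions change between
# the training and deployment populations.
strata = np.array([0, 1])
p_train = np.array([0.8, 0.2])   # proportions when the data was collected
p_new = np.array([0.5, 0.5])     # proportions in the new population

s = rng.choice(strata, size=10_000, p=p_train)
y = np.where(s == 0, 1.0, 3.0) + 0.1 * rng.normal(size=s.size)

# The naive mean reflects the old mix; importance weights p_new/p_train
# re-target the estimate to the new population.
w = (p_new / p_train)[s]
naive = y.mean()
reweighted = np.average(y, weights=w)
print(naive, reweighted)   # ≈ 1.4 vs ≈ 2.0
```

The same trick, applied per-example with density ratios rather than per-stratum proportions, is the classic importance-weighting fix for covariate shift.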

Bi-level / adversarial

OK, all that graphical model stuff failed to scale to my problem of interest; what next? As noted in Yuchen Zhang et al. (2019), many domain adaptation strategies can be framed as bi-level optimisation problems of minimax type, so that presumably corresponds to Domain Adversarial Learning. I think invariant risk minimisation can probably be put in this minimax framework too, but “learning invariants” is somehow conceptually separate.

Update: Yes, Ahuja et al. (2020) are helpful in giving us some semblance of taxonomy:

The standard risk minimization paradigm of machine learning is brittle when operating in environments whose test distributions are different from the training distribution due to spurious correlations. Training on data from many environments and finding invariant predictors reduces the effect of spurious features by concentrating models on features that have a causal relationship with the outcome. In this work, we pose such invariant risk minimization as finding the Nash equilibrium of an ensemble game among several environments. By doing so, we develop a simple training algorithm that uses best response dynamics and, in our experiments, yields similar or better empirical accuracy with much lower variance than the challenging bi-level optimization problem of Arjovsky et al. (2020). One key theoretical contribution is showing that the set of Nash equilibria for the proposed game are equivalent to the set of invariant predictors for any finite number of environments, even with nonlinear classifiers and transformations. As a result, our method also retains the generalization guarantees to a large set of environments shown in Arjovsky et al. (2020). The proposed algorithm adds to the collection of successful game-theoretic machine learning algorithms such as generative adversarial networks.

I’m a little confused that people seem to describe the Arjovsky et al. (2020) method as bi-level optimisation; the paper discusses a bi-level objective, but goes on to implement an approximation which seems to be a basic single-level regularised optimisation. I am missing something, either in the original paper or in the detractors.

I will inspect IBM/OoD: Repository for theory and methods for Out-of-Distribution (OoD) generalization.

Semi-supervised learning

See Semi-Supervised Learning.

Source and target empirical risks

What does this heading even mean? I had some idea, but I have forgotten it, I confess (Ben-David et al. 2006; Ben-David et al. 2010; Blitzer et al. 2007; Mansour, Mohri, and Rostamizadeh 2009).

Learning invariants

I am not sure if the various sub-methods in this category are in fact distinct. H. Zhao et al. (2019) devise necessary conditions for invariant representation learning to work. Possibly this is a special case/particular framing of what I called “bi-level” optimisation, above.

Regularising features towards invariance

DAN (Long et al. 2015)

Recent studies reveal that a deep neural network can learn transferable features which generalize well to novel tasks for domain adaptation. However, as deep features eventually transition from general to specific along the network, the feature transferability drops significantly in higher layers with increasing domain discrepancy. Hence, it is important to formally reduce the dataset bias and enhance the transferability in task-specific layers. In this paper, we propose a new Deep Adaptation Network (DAN) architecture, which generalizes deep convolutional neural network to the domain adaptation scenario. In DAN, hidden representations of all task-specific layers are embedded in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched. The domain discrepancy is further reduced using an optimal multi-kernel selection method for mean embedding matching. DAN can learn transferable features with statistical guarantees, and can scale linearly by unbiased estimate of kernel embedding. Extensive empirical evidence shows that the proposed architecture yields state-of-the-art image classification error rates on standard domain adaptation benchmarks.
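The core statistical machinery in DAN is the kernel mean-embedding discrepancy (MMD) between source and target feature distributions. A hedged, self-contained sketch of the (biased, single-kernel) MMD estimator — the real method uses multi-kernel selection and plugs this in as a training penalty on hidden layers:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gram matrix of the RBF kernel k(x, y) = exp(-gamma ||x - y||^2)."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(x, y, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2 * rbf_kernel(x, y, gamma).mean())

rng = np.random.default_rng(2)
src = rng.normal(0.0, 1.0, size=(500, 2))
tgt_near = rng.normal(0.0, 1.0, size=(500, 2))   # same distribution
tgt_far = rng.normal(2.0, 1.0, size=(500, 2))    # shifted domain

print(mmd2(src, tgt_near))   # near zero: distributions match
print(mmd2(src, tgt_far))    # clearly positive: domains differ
```

Minimising such a penalty on intermediate representations pushes the source and target feature distributions together, which is the sense in which the learned features become "transferable".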

Invariant risk minimisation

A trick from Arjovsky et al. (2020). Ermin Orhan summarises the method plus several negative results (Gulrajani and Lopez-Paz 2020; Rosenfeld, Ravikumar, and Risteski 2020) about IRM:

Take invariant risk minimization (IRM), one of the more popular domain generalization methods proposed recently. IRM considers a classification problem that takes place in multiple domains or environments, \(e_1, e_2, \ldots, e_E\) (in an image classification setting, these could be natural images, drawings, paintings, computer-rendered images etc.). We decompose the learning problem into learning a feature backbone \(\Phi\) (a featurizer), and a linear readout \(\beta\) on top of it. Intuitively, in our classifier, we only want to make use of features that are invariant across different environments (for instance, the shapes of objects in our image classification example), and not features that vary from environment to environment (for example, the local textures of objects). This is because the invariant features are more likely to generalize to a new environment. We could, of course, do the old, boring empirical risk minimization (ERM), your grandmother’s dumb method. This would simply lump the training data from all environments into one single giant training set and minimize the loss on that, with the hope that whatever features are more or less invariant across the environments will automatically emerge out of this optimization. Mathematically, ERM in this setting corresponds to solving the following well-known optimization problem (assuming the same amount of training data from each domain): \(\min _{\Phi, \beta} \frac{1}{E} \sum_c \mathfrak {R}^c(\Phi, \hat{\beta})\), where \(\mathfrak {R}^c\) is the empirical risk in environment \(e\). IRM proposes something much more complicated instead: why don’t we learn a featurizer with the same optimal linear readout on top of it in every environment? The hope is that in this way, the extractor will only learn the invariant features, because the non-invariant features will change from environment to environment and can’t be decoded optimally using the same fixed readout. 
The IRM objective thus involves a difficult bi-level optimization problem…

Does it though? The general IRM objective is difficult, but there is a simple approximation in the paper, IRMv1, which is claimed to be easier. Either way, though, the critiques of (Gulrajani and Lopez-Paz 2020; Rosenfeld, Ravikumar, and Risteski 2020) are useful.
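To make the IRMv1 approximation concrete, here is a sketch of the penalty for a linear model under squared loss, on synthetic data where one feature is invariant and one is spurious (the two-environment construction is invented for illustration). The penalty is the squared gradient, with respect to a scalar "dummy classifier" \(w\) at \(w = 1\), of each environment's risk:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_env(n, spurious_sign):
    x1 = rng.normal(size=n)                            # invariant feature
    y = x1 + 0.1 * rng.normal(size=n)
    x2 = spurious_sign * y + 0.1 * rng.normal(size=n)  # spurious feature:
    return np.column_stack([x1, x2]), y                # its sign flips

envs = [make_env(1000, +1.0), make_env(1000, -1.0)]

def irmv1_penalty(beta):
    """IRMv1-style penalty: squared gradient of each environment's risk
    with respect to a scalar classifier w, evaluated at w = 1."""
    total = 0.0
    for X, y in envs:
        f = X @ beta
        # d/dw of mean((w*f - y)^2) at w = 1  ==  2 * mean((f - y) * f)
        grad_w = 2.0 * np.mean((f - y) * f)
        total += grad_w ** 2
    return total

beta_invariant = np.array([1.0, 0.0])   # uses only the stable feature
beta_spurious = np.array([0.0, 1.0])    # uses only the unstable feature
print(irmv1_penalty(beta_invariant))    # near zero
print(irmv1_penalty(beta_spurious))     # large
```

Training then minimises pooled empirical risk plus a multiple of this penalty, which is indeed a single-level regularised problem rather than a bi-level one.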

Interesting variants:

(Ahuja et al. 2022, 2020; Shah et al. 2021)


Conformal learning + distributional shift.
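Tibshirani et al. (2019) show that split conformal prediction survives covariate shift if the calibration residuals are reweighted by the likelihood ratio between target and source covariate distributions. A hedged numpy sketch, with the likelihood ratio assumed known (in practice it must be estimated):

```python
import numpy as np

rng = np.random.default_rng(4)

# Calibration covariates from the source distribution, with residuals
# that grow with |x| (so a covariate shift changes the right band width).
x_cal = rng.normal(0.0, 1.0, size=2000)
resid = np.abs(0.2 * rng.normal(size=x_cal.size) + 0.2 * x_cal**2)

def lik_ratio(x, shift=1.0):
    # Target covariates assumed N(shift, 1); source are N(0, 1).
    return np.exp(shift * x - shift**2 / 2)

def weighted_quantile(values, weights, q):
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)
    return v[np.searchsorted(cdf, q)]

alpha = 0.1
w = lik_ratio(x_cal)
# Width of the shift-corrected conformal band vs the unweighted one.
q_weighted = weighted_quantile(resid, w, 1 - alpha)
q_plain = np.quantile(resid, 1 - alpha)
print(q_plain, q_weighted)
```

Because the target up-weights large \(|x|\), where residuals are bigger, the weighted quantile (hence the prediction band) is wider; the unweighted band would under-cover after the shift.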

This Maori gentleman (name unspecified) from the 1800s demonstrates artful transfer learning from the western fashion domain. Or maybe that is style transfer; I forget.

Justification for batch normalization

Apparently a thing? Should probably note some of the literature about that.
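One connection I believe is in that literature: keeping a trained network's weights but recomputing the normalization statistics on target-domain data (sometimes called AdaBN). A toy numpy sketch of the idea, with made-up domain shifts:

```python
import numpy as np

rng = np.random.default_rng(5)

# A feature whose mean and scale shift between source and target domains.
src = rng.normal(0.0, 1.0, size=(5000, 3))
tgt = rng.normal(2.0, 0.5, size=(5000, 3))

def normalize(x, mean, var, eps=1e-5):
    return (x - mean) / np.sqrt(var + eps)

# Normalizing target data with *source* statistics leaves it shifted...
src_stats = src.mean(0), src.var(0)
misaligned = normalize(tgt, *src_stats)

# ...while swapping in target-domain statistics re-centres it:
# keep the weights, recompute the normalization statistics.
tgt_stats = tgt.mean(0), tgt.var(0)
aligned = normalize(tgt, *tgt_stats)

print(np.abs(misaligned.mean(0)).max())   # ≈ 2: domain shift survives
print(np.abs(aligned.mean(0)).max())      # ≈ 0: recalibrated
```

In an actual batch-normalized network this amounts to refreshing the running mean/variance buffers on target data while freezing everything else.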



TLlib (Jiang et al. 2022) is an open-source and well-documented library for Transfer Learning. It is based on pure PyTorch with high performance and friendly API. Our code is pythonic, and the design is consistent with torchvision. You can easily develop new algorithms, or readily apply existing algorithms.

Our API is divided by methods, which include:

  • domain alignment methods (tllib.alignment)
  • domain translation methods (tllib.translation)
  • self-training methods (tllib.self_training)
  • regularization methods (tllib.regularization)
  • data reweighting/resampling methods (tllib.reweight)
  • model ranking/selection methods (tllib.ranking)
  • normalization-based methods (tllib.normalization)


facebookresearch/DomainBed: DomainBed is a suite to test domain generalization algorithms

DomainBed is a PyTorch suite containing benchmark datasets and algorithms for domain generalization, as introduced in Gulrajani and Lopez-Paz (2020)


salad is a library to easily setup experiments using the current state-of-the art techniques in domain adaptation. It features several of recent approaches, with the goal of being able to run fair comparisons between algorithms and transfer them to real-world use cases.


WILDS: A Benchmark of in-the-Wild Distribution Shifts

To facilitate the development of ML models that are robust to real-world distribution shifts, our ICML 2021 paper presents WILDS, a curated benchmark of 10 datasets that reflect natural distribution shifts arising from different cameras, hospitals, molecular scaffolds, experiments, demographics, countries, time periods, users, and codebases.


Ahuja, Kartik, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. 2022. Invariance Principle Meets Information Bottleneck for Out-of-Distribution Generalization.” arXiv.
Ahuja, Kartik, Karthikeyan Shanmugam, Kush R. Varshney, and Amit Dhurandhar. 2020. Invariant Risk Minimization Games.” arXiv.
Arjovsky, Martin, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. 2020. Invariant Risk Minimization.” arXiv.
Bareinboim, Elias, and Judea Pearl. 2013. A General Algorithm for Deciding Transportability of Experimental Results.” Journal of Causal Inference 1 (1): 107–34.
———. 2014. “Transportability from Multiple Environments with Limited Experiments: Completeness Results.” In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, 280–88. NIPS’14. Cambridge, MA, USA: MIT Press.
———. 2016. Causal Inference and the Data-Fusion Problem.” Proceedings of the National Academy of Sciences 113 (27): 7345–52.
Ben-David, Shai, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A Theory of Learning from Different Domains.” Machine Learning 79 (1-2): 151–75.
Ben-David, Shai, John Blitzer, Koby Crammer, and Fernando Pereira. 2006. Analysis of Representations for Domain Adaptation.” In Advances in Neural Information Processing Systems. Vol. 19. MIT Press.
Besserve, Michel, Arash Mehrjou, Rémy Sun, and Bernhard Schölkopf. 2019. Counterfactuals Uncover the Modular Structure of Deep Generative Models.” In arXiv:1812.03253 [Cs, Stat].
Blitzer, John, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. 2007. Learning Bounds for Domain Adaptation.” In Advances in Neural Information Processing Systems. Vol. 20. Curran Associates, Inc.
Chapelle, Olivier, Bernhard Schölkopf, and Alexander Zien, eds. 2010. Semi-Supervised Learning. 1st MIT Press pbk. ed. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press.
Chu, Xu, Yujie Jin, Wenwu Zhu, Yasha Wang, Xin Wang, Shanghang Zhang, and Hong Mei. 2022. DNA: Domain Generalization with Diversified Neural Averaging.” In Proceedings of the 39th International Conference on Machine Learning, 4010–34. PMLR.
Csurka, Gabriela. 2017. Domain Adaptation for Visual Applications: A Comprehensive Survey.” arXiv.
Dumoulin, Vincent, Ethan Perez, Nathan Schucher, Florian Strub, Harm de Vries, Aaron Courville, and Yoshua Bengio. 2018. Feature-Wise Transformations.” Distill 3 (7): e11.
Ganin, Yaroslav, and Victor Lempitsky. 2015. Unsupervised Domain Adaptation by Backpropagation.” In Proceedings of the 32nd International Conference on Machine Learning, 1180–89. PMLR.
Gong, Mingming, Kun Zhang, Biwei Huang, Clark Glymour, Dacheng Tao, and Kayhan Batmanghelich. 2018. Causal Generative Domain Adaptation Networks.” arXiv.
Gulrajani, Ishaan, and David Lopez-Paz. 2020. In Search of Lost Domain Generalization.” In.
Henzi, Alexander, Xinwei Shen, Michael Law, and Peter Bühlmann. 2023. Invariant Probabilistic Prediction.” arXiv.
Ioffe, Sergey, and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” arXiv.
Jiang, Junguang, Yang Shu, Jianmin Wang, and Mingsheng Long. 2022. Transferability in Deep Learning: A Survey.” arXiv.
Kaddour, Jean, Aengus Lynch, Qi Liu, Matt J. Kusner, and Ricardo Silva. 2022. Causal Machine Learning: A Survey and Open Problems.” arXiv.
Koh, Pang Wei, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, et al. 2021. WILDS: A Benchmark of in-the-Wild Distribution Shifts.” arXiv:2012.07421 [Cs], July.
Kosoy, Eliza, David M. Chan, Adrian Liu, Jasmine Collins, Bryanna Kaufmann, Sandy Han Huang, Jessica B. Hamrick, John Canny, Nan Rosemary Ke, and Alison Gopnik. 2022. Towards Understanding How Machines Can Learn Causal Overhypotheses.” arXiv.
Kouw, Wouter M., and Marco Loog. 2019. An Introduction to Domain Adaptation and Transfer Learning.” arXiv.
Kulinski, Sean, and David I. Inouye. 2022. Towards Explaining Distribution Shifts.” arXiv.
Kuroki, Seiichi, Nontawat Charoenphakdee, Han Bao, Junya Honda, Issei Sato, and Masashi Sugiyama. 2018. Unsupervised Domain Adaptation Based on Source-Guided Discrepancy.” In. arXiv.
Lagemann, Kai, Christian Lagemann, Bernd Taschler, and Sach Mukherjee. 2023. Deep Learning of Causal Structures in High Dimensions Under Data Limitations.” Nature Machine Intelligence, October, 1–11.
Lattimore, Finnian Rachel. 2017. Learning How to Act: Making Good Decisions with Machine Learning.”
Li, Haoliang, Sinno Jialin Pan, Shiqi Wang, and Alex C. Kot. 2018. Domain Generalization with Adversarial Feature Learning.” In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5400–5409. Salt Lake City, UT: IEEE.
Long, Mingsheng, Yue Cao, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan. 2019. Transferable Representation Learning with Deep Adaptation Networks.” IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (12): 3071–85.
Long, Mingsheng, Yue Cao, Jianmin Wang, and Michael Jordan. 2015. Learning Transferable Features with Deep Adaptation Networks.” In Proceedings of the 32nd International Conference on Machine Learning, 97–105. PMLR.
Long, Mingsheng, Han Zhu, Jianmin Wang, and Michael I. Jordan. 2017. Deep Transfer Learning with Joint Adaptation Networks.” In Proceedings of the 34th International Conference on Machine Learning, 2208–17. PMLR.
Mansour, Yishay, Mehryar Mohri, and Afshin Rostamizadeh. 2009. Domain Adaptation: Learning Bounds and Algorithms.” In. arXiv.
Moraffah, Raha, Kai Shu, Adrienne Raglin, and Huan Liu. 2019. Deep Causal Representation Learning for Unsupervised Domain Adaptation.” arXiv.
Ouali, Yassine, Céline Hudelot, and Myriam Tami. 2020. An Overview of Deep Semi-Supervised Learning.” arXiv.
Pan, Sinno Jialin, and Qiang Yang. 2010. A Survey on Transfer Learning.” IEEE Transactions on Knowledge and Data Engineering 22 (10): 1345–59.
Patel, Vishal M, Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. 2015. Visual Domain Adaptation: A Survey of Recent Advances.” IEEE Signal Processing Magazine 32 (3): 53–69.
Pearl, Judea, and Elias Bareinboim. 2014. External Validity: From Do-Calculus to Transportability Across Populations.” Statistical Science 29 (4): 579–95.
Perez, Ethan, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. 2017. FiLM: Visual Reasoning with a General Conditioning Layer.” arXiv.
Peters, Jonas, Peter Bühlmann, and Nicolai Meinshausen. 2016. Causal Inference by Using Invariant Prediction: Identification and Confidence Intervals.” Journal of the Royal Statistical Society Series B: Statistical Methodology 78 (5): 947–1012.
Quiñonero-Candela, Joaquin. 2009. Dataset Shift in Machine Learning. Cambridge, Mass.: MIT Press.
Ramchandran, Maya, and Rajarshi Mukherjee. 2021. On Ensembling Vs Merging: Least Squares and Random Forests Under Covariate Shift.” arXiv:2106.02589 [Math, Stat], June.
Rosenfeld, Elan, Pradeep Kumar Ravikumar, and Andrej Risteski. 2020. The Risks of Invariant Risk Minimization.” In.
Rothenhäusler, Dominik, Nicolai Meinshausen, Peter Bühlmann, and Jonas Peters. 2020. Anchor Regression: Heterogeneous Data Meets Causality.” arXiv:1801.06229 [Stat], May.
Schölkopf, Bernhard. 2022. Causality for Machine Learning.” In Probabilistic and Causal Inference: The Works of Judea Pearl, 1st ed., 36:765–804. New York, NY, USA: Association for Computing Machinery.
Schölkopf, Bernhard, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. 2012. On Causal and Anticausal Learning.” In ICML 2012.
Schölkopf, Bernhard, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. 2021. Toward Causal Representation Learning.” Proceedings of the IEEE 109 (5): 612–34.
Shah, Abhin, Kartik Ahuja, Karthikeyan Shanmugam, Dennis Wei, Kush Varshney, and Amit Dhurandhar. 2021. Treatment Effect Estimation Using Invariant Risk Minimization.” arXiv.
Simchoni, Giora, and Saharon Rosset. 2023. Integrating Random Effects in Deep Neural Networks.” arXiv.
Subbaswamy, Adarsh, Peter Schulam, and Suchi Saria. 2019. Preventing Failures Due to Dataset Shift: Learning Predictive Models That Transport.” In The 22nd International Conference on Artificial Intelligence and Statistics, 3118–27. PMLR.
Sun, Shiliang, Honglei Shi, and Yuanbin Wu. 2015. A Survey of Multi-Source Domain Adaptation.” Information Fusion 24 (July): 84–92.
Tan, Chuanqi, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. 2018. A Survey on Deep Transfer Learning.” arXiv.
Tibshirani, Ryan J, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. 2019. Conformal Prediction Under Covariate Shift.” In Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc.
Ventola, Fabrizio, Steven Braun, Zhongjie Yu, Martin Mundt, and Kristian Kersting. 2023. Probabilistic Circuits That Know What They Don’t Know.” arXiv.org.
Wang, Jindong, and Yiqiang Chen. 2023. Introduction to Transfer Learning: Algorithms and Practice. Machine Learning: Foundations, Methodologies, and Applications. Singapore: Springer Nature.
Wang, Mei, and Weihong Deng. 2018. Deep Visual Domain Adaptation: A Survey.” Neurocomputing 312 (October): 135–53.
Wilson, Garrett, and Diane J. Cook. 2020. A Survey of Unsupervised Deep Domain Adaptation.” arXiv.
Xu, Minghao, Hang Wang, and Bingbing Ni. 2022. Graphical Modeling for Multi-Source Domain Adaptation.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1.
Yang, Qiang, Yu Zhang, Wenyuan Dai, and Sinno Jialin Pan. 2020. Transfer Learning. 1st ed. Cambridge: Cambridge University Press.
Yao, Yuling, Gregor Pirš, Aki Vehtari, and Andrew Gelman. 2022. Bayesian Hierarchical Stacking: Some Models Are (Somewhere) Useful.” Bayesian Analysis 17 (4): 1043–71.
Yue, Zhongqi, Qianru Sun, Xian-Sheng Hua, and Hanwang Zhang. 2021. Transporting Causal Mechanisms for Unsupervised Domain Adaptation.” In, 8599–8608.
Zellinger, Werner, Bernhard A. Moser, and Susanne Saminger-Platz. 2021. On Generalization in Moment-Based Domain Adaptation.” Annals of Mathematics and Artificial Intelligence 89 (3): 333–69.
Zhang, Jing, Wanqing Li, Philip Ogunbona, and Dong Xu. 2019. Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective.” arXiv.
Zhang, Lei, and Xinbo Gao. 2020. Transfer Adaptation Learning: A Decade Survey.” arXiv.
Zhang, Yabin, Bin Deng, Hui Tang, Lei Zhang, and Kui Jia. 2020. Unsupervised Multi-Class Domain Adaptation: Theory, Algorithms, and Practice.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1.
Zhang, Yuchen, Tianle Liu, Mingsheng Long, and Michael Jordan. 2019. Bridging Theory and Algorithm for Domain Adaptation.” In Proceedings of the 36th International Conference on Machine Learning, 7404–13. PMLR.
Zhao, Han, Remi Tachet des Combes, Kun Zhang, and Geoffrey J. Gordon. 2019. On Learning Invariant Representation for Domain Adaptation.” arXiv.
Zhao, Sicheng, Xiangyu Yue, Shanghang Zhang, Bo Li, Han Zhao, Bichen Wu, Ravi Krishna, et al. 2020. A Review of Single-Source Deep Unsupervised Visual Domain Adaptation.” arXiv.
Zhuang, Fuzhen, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. 2020. A Comprehensive Survey on Transfer Learning.” arXiv.
