I am often puzzled by how surprised we are at the difficulty of transferring models between domains, and by how continual the flow of new publications on this theme is; e.g. the Google AI Blog post *How Underspecification Presents Challenges for Machine Learning*.

NN people like to think about this in a particular way, which I like for the diversity of out-of-the-box ideas it invites, and which I dislike because it is sloppy. For an NN-style introduction, one could read Sebastian Ruder’s overview of “transfer learning”.

This also connects to semi-supervised learning and to fairness (Schölkopf et al. 2012; Schölkopf 2022).

Possibly expressing the same underlying idea, one could argue that interaction effects are what we actually want to learn.

## What is *transfer learning* actually?

Not sure. Everyone I talk to seems to have a different idea, and each also thinks that their idea is the canonical one.

We need a taxonomy here.

## Data fusion

Elias Bareinboim’s framing (Bareinboim and Pearl 2016, 2013, 2014; Pearl and Bareinboim 2014).

## Standard graphical models

Just do causal inference.

## Invariant risk minimisation

A trick from Arjovsky et al. (2020). Ermin Orhan summarises the method plus several negative results about IRM:

> Take invariant risk minimization (IRM), one of the more popular domain generalization methods proposed recently. IRM considers a classification problem that takes place in multiple domains or environments, \(e_1, e_2, \ldots, e_E\) (in an image classification setting, these could be natural images, drawings, paintings, computer-rendered images, etc.). We decompose the learning problem into learning a feature backbone \(\Phi\) (a featurizer) and a linear readout \(\beta\) on top of it. Intuitively, we only want our classifier to use features that are invariant across the different environments (for instance, the shapes of objects in our image classification example), and not features that vary from environment to environment (for example, the local textures of objects). This is because the invariant features are more likely to generalize to a new environment.
>
> We could, of course, do the old, boring empirical risk minimization (ERM), your grandmother’s dumb method. This would simply lump the training data from all environments into one single giant training set and minimize the loss on that, with the hope that whatever features are more or less invariant across the environments will automatically emerge out of this optimization. Mathematically, ERM in this setting corresponds to solving the following well-known optimization problem (assuming the same amount of training data from each domain):
>
> \[\min_{\Phi, \beta} \frac{1}{E} \sum_{e} \mathfrak{R}^e(\Phi, \beta),\]
>
> where \(\mathfrak{R}^e\) is the empirical risk in environment \(e\). IRM proposes something much more complicated instead: why don’t we learn a featurizer such that the *same* linear readout on top of it is optimal in every environment? The hope is that in this way, the featurizer will learn only the invariant features, because the non-invariant features change from environment to environment and can’t be decoded optimally using the same fixed readout. The IRM objective thus involves a difficult bi-level optimization problem…
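To make the penalty idea above concrete: the paper’s practical relaxation (IRMv1) sidesteps the bi-level problem by fixing the readout at a dummy scalar \(w = 1\) and penalising the squared gradient of each environment’s risk with respect to \(w\); if the same readout is optimal everywhere, that gradient is zero. Here is a toy sketch for a scalar model output under squared loss — my own simplification using a closed-form gradient, not the authors’ implementation, and `irmv1_objective` is a hypothetical name:

```python
def irmv1_objective(envs, lam=1.0):
    """Toy IRMv1 objective for scalar predictions under squared loss.

    envs: list of (preds, targets) pairs, one pair per environment.
    lam:  penalty weight trading off risk against invariance.
    """
    total = 0.0
    for preds, targets in envs:
        n = len(preds)
        # Per-environment empirical risk: mean squared error.
        risk = sum((p - t) ** 2 for p, t in zip(preds, targets)) / n
        # Gradient of the risk w.r.t. the dummy readout scale w at w = 1:
        # d/dw mean((w*p - t)^2) = mean(2 * (p - t) * p).
        grad_w = sum(2.0 * (p - t) * p for p, t in zip(preds, targets)) / n
        # Penalise environments where rescaling the readout would help,
        # i.e. where the shared readout is not already optimal.
        total += risk + lam * grad_w ** 2
    return total
```

A predictor that is already calibrated in every environment incurs zero penalty, while one that would benefit from a per-environment rescaling is penalised, which is exactly the invariance pressure described in the quote.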

## Justification for batch normalization

Should probably note some of the literature about that.

## Tools

### Salad

> salad is a library to easily set up experiments using current state-of-the-art techniques in domain adaptation. It features several recent approaches, with the goal of enabling fair comparisons between algorithms and transferring them to real-world use cases.

### WILDS

WILDS: A Benchmark of in-the-Wild Distribution Shifts

> To facilitate the development of ML models that are robust to real-world distribution shifts, our ICML 2021 paper presents WILDS, a curated benchmark of 10 datasets that reflect natural distribution shifts arising from different cameras, hospitals, molecular scaffolds, experiments, demographics, countries, time periods, users, and codebases.

## References

Bareinboim, Elias, and Judea Pearl. 2013. “A General Algorithm for Deciding Transportability of Experimental Results.” *Journal of Causal Inference* 1 (1): 107–34.

Bareinboim, Elias, and Judea Pearl. 2014. “Transportability from Multiple Environments with Limited Experiments: Completeness Results.” In *Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1*, 280–88. NIPS’14. Cambridge, MA, USA: MIT Press.

Bareinboim, Elias, and Judea Pearl. 2016. “Causal Inference and the Data-Fusion Problem.” *Proceedings of the National Academy of Sciences* 113 (27): 7345–52.

*arXiv:1812.03253 [Cs, Stat]*.

Koh, Pang Wei, Shiori Sagawa, Henrik Marklund, et al. 2021. “WILDS: A Benchmark of in-the-Wild Distribution Shifts.” *arXiv:2012.07421 [Cs]*, July.

Pearl, Judea, and Elias Bareinboim. 2014. “External Validity: From Do-Calculus to Transportability Across Populations.” *Statistical Science* 29 (4): 579–95.

Quiñonero-Candela, Joaquin, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence, eds. 2009. *Dataset Shift in Machine Learning*. Cambridge, Mass.: MIT Press.

*arXiv:2106.02589 [Math, Stat]*, June.

*arXiv:1801.06229 [Stat]*, May.

Schölkopf, Bernhard. 2022. “Causality for Machine Learning.” In *Probabilistic and Causal Inference: The Works of Judea Pearl*, 1st ed., 36:765–804. New York, NY, USA: Association for Computing Machinery.

Schölkopf, Bernhard, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. 2012. “On Causal and Anticausal Learning.” In *ICML 2012*.

Schölkopf, Bernhard, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. 2021. “Toward Causal Representation Learning.” *Proceedings of the IEEE* 109 (5): 612–34.

*The 22nd International Conference on Artificial Intelligence and Statistics*, 3118–27. PMLR.

*Advances in Neural Information Processing Systems*. Vol. 32. Curran Associates, Inc.
