# Ensembling neural nets

Monte Carlo

December 14, 2020 — November 25, 2021

One of the practical forms of Bayesian inference for massively parameterised networks: model averaging.

## 1 Explicit ensembles

Train a collection of networks and calculate the empirical mean and variance of their predictions to estimate the mean and variance of the posterior predictive (He, Lakshminarayanan, and Teh 2020; Huang et al. 2016; Lakshminarayanan, Pritzel, and Blundell 2017; Wen, Tran, and Ba 2020; Xie, Xu, and Chuang 2013). This is neat, and on one hand we might think there is nothing special to do here, since it’s already more or less classical model ensembling, as near as I can tell. But in practice there are lots of tricks needed to make this go in a neural network context, in particular because a single model is already supposed to be so big that it strains the GPU; having *many* such models is presumably ridiculous. So we need tricks to make the extra ensemble members cheap, and there are various such tricks. BatchEnsemble is one (Wen, Tran, and Ba 2020).
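To make the "empirical means and variances" step concrete, here is a minimal sketch. It assumes each ensemble member is a cheap stand-in for a network (ridge regression on random tanh features, one random seed per member); the names `fit_random_feature_model` and `ensemble_predict` are made up for illustration and do not come from any of the cited papers.

```python
import numpy as np

def fit_random_feature_model(x, y, seed, n_features=50, ridge=1e-2):
    """Fit one ensemble member: ridge regression on random tanh features.

    Each seed draws a different random hidden layer, standing in for a
    separately initialised and trained network (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(1, n_features))
    b = rng.normal(size=n_features)
    phi = np.tanh(x[:, None] * w + b)  # (n_points, n_features)
    coef = np.linalg.solve(phi.T @ phi + ridge * np.eye(n_features), phi.T @ y)

    def predict(x_new):
        return np.tanh(x_new[:, None] * w + b) @ coef

    return predict

def ensemble_predict(members, x_new):
    """Empirical mean and variance of the members' predictions."""
    preds = np.stack([m(x_new) for m in members])  # (n_members, n_points)
    return preds.mean(axis=0), preds.var(axis=0)

# Toy 1-d regression data.
rng = np.random.default_rng(0)
x_train = rng.uniform(-2.0, 2.0, size=40)
y_train = np.sin(2.0 * x_train) + 0.1 * rng.normal(size=40)

members = [fit_random_feature_model(x_train, y_train, seed=s) for s in range(10)]
x_test = np.linspace(-4.0, 4.0, 9)
mean, var = ensemble_predict(members, x_test)
```

Note that `var` here measures only disagreement between members; it does not include observation noise, which would have to be modelled separately.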

Cute: Justin Domke, in The human regression ensemble, creates ensembles of curves that he drew through datapoints on a PDF and gets pretty good results.

## 2 Dropout

Dropout is an *implicit* ensembling method. Or maybe *the* implicit ensembling method; I am not aware of others. Recommended reading: Foong et al. (2019); Gal, Hron, and Kendall (2017); Kingma, Salimans, and Welling (2015).

Dropout is a popular kind of noise layer which randomly zeroes out some coefficients in the net when training (and optionally while predicting). A coarse resemblance to random forests etc. is pretty immediate, and indeed you can just use those instead. Here, however, we are trying to average over *strong* learners, not weak learners.
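A minimal sketch of what "optionally while predicting" can look like: keep the dropout masks random at prediction time and average over several stochastic forward passes. This assumes a tiny one-hidden-layer ReLU network with made-up, untrained weights; zeroing a hidden unit is equivalent to zeroing that unit's outgoing weights. All function names are hypothetical.

```python
import numpy as np

def dropout(h, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = rng.random(h.shape) >= rate
    return h * keep / (1.0 - rate)

def forward(x, params, rate, rng, stochastic):
    """One forward pass; dropout is applied to the hidden layer if `stochastic`."""
    w1, b1, w2, b2 = params
    h = np.maximum(x @ w1 + b1, 0.0)  # ReLU hidden layer
    if stochastic:
        h = dropout(h, rate, rng)
    return h @ w2 + b2

def mc_dropout_predict(x, params, rate, rng, n_samples=100):
    """Mean and variance over repeated forward passes with fresh dropout masks."""
    samples = np.stack(
        [forward(x, params, rate, rng, stochastic=True) for _ in range(n_samples)]
    )
    return samples.mean(axis=0), samples.var(axis=0)

rng = np.random.default_rng(0)
# Hypothetical untrained weights, only here to show the shapes involved.
params = (
    rng.normal(size=(3, 16)), np.zeros(16),
    rng.normal(size=(16, 1)), np.zeros(1),
)
x = rng.normal(size=(5, 3))
mean, var = mc_dropout_predict(x, params, rate=0.5, rng=rng)
```

Each random mask picks out a different sub-network, so the loop in `mc_dropout_predict` is averaging over an implicit ensemble that shares all its weights.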

The key insight here is that dropout can be rationalized, apparently, as model averaging, and thence as a kind of implicit probabilistic learning, because in the limit it approaches a certain deep Gaussian process (Kingma, Salimans, and Welling 2015; Gal and Ghahramani 2016b, 2015). Leveraging this argument, some papers claim to approximate Bayesian inference by randomizing dropout (M. Kasim et al. 2019; M. F. Kasim et al. 2020).

AFAICT the current consensus seems to be that the highly cited and very simple model of Gal and Ghahramani (2015) is flawed, and that the rather more onerous approach of Kingma, Salimans, and Welling (2015) is how you would use dropout as a more reasonable posterior. So much was said in a seminar, but I have not really used either paper in practice, so I cannot comment.

## 3 Alternate model combinations

Should we stop weighting hypotheses and start “stacking”? See Yao et al. (2018). (Also, how is that different?)

> The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions and regularization to get more stability.
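As I understand it, with the log score as the scoring rule, stacking chooses weights on the simplex that maximise the summed log of the weighted mixture of leave-one-out predictive densities. Here is a rough sketch under strong assumptions: the leave-one-out densities `loo_dens[i, k]` (model `k` evaluated at held-out point `i`) are taken as given rather than estimated by Pareto smoothed importance sampling, the numbers are made up, and the EM-style fixed-point iteration is my choice of optimiser, not necessarily the one in the paper.

```python
import numpy as np

def stacking_weights(loo_dens, n_iter=500):
    """Simplex weights maximising sum_i log(sum_k w_k * loo_dens[i, k]).

    Uses the EM update for mixture weights with fixed components, which
    never decreases the objective (an optimiser chosen for this sketch).
    """
    n, k = loo_dens.shape
    w = np.full(k, 1.0 / k)  # start from uniform weights
    for _ in range(n_iter):
        mix = loo_dens * w                             # (n, k)
        resp = mix / mix.sum(axis=1, keepdims=True)    # responsibilities
        w = resp.mean(axis=0)
    return w

def log_score(loo_dens, w):
    """Summed log predictive density of the weighted mixture."""
    return np.log(loo_dens @ w).sum()

rng = np.random.default_rng(0)
# Made-up positive "leave-one-out densities" for 3 candidate models at 200 points.
loo_dens = rng.gamma(shape=2.0, scale=np.array([0.5, 1.0, 0.2]), size=(200, 3))

w = stacking_weights(loo_dens)
uniform = np.full(3, 1.0 / 3.0)
improvement = log_score(loo_dens, w) - log_score(loo_dens, uniform)
```

The difference from Bayesian model averaging is that the weights are chosen to make the *combined* predictive distribution score well on held-out data, rather than being posterior probabilities that each model is the true one.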

## 4 Distilling

So apparently you can train a model to emulate an ensemble of similar models? Great terminology here; Hinton, Vinyals, and Dean (2015) refer to *distilling* of *dark knowledge*.
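A minimal sketch of the idea, assuming a classification setting: soften each teacher's logits with a temperature, average the resulting class probabilities over the ensemble, and score the student by its cross-entropy against those soft targets. Averaging probabilities arithmetically is an assumption of this sketch, and the function names are invented for illustration.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def ensemble_soft_targets(teacher_logits, temperature):
    """Average tempered class probabilities over ensemble members.

    teacher_logits has shape (n_members, n_examples, n_classes).
    """
    return np.exp(log_softmax(teacher_logits, temperature)).mean(axis=0)

def distillation_loss(student_logits, soft_targets, temperature):
    """Mean cross-entropy between soft targets and the tempered student."""
    log_q = log_softmax(student_logits, temperature)
    return -(soft_targets * log_q).sum(axis=-1).mean()

rng = np.random.default_rng(0)
# Made-up logits: 4 teachers, 8 examples, 3 classes.
teacher_logits = rng.normal(size=(4, 8, 3))
student_logits = rng.normal(size=(8, 3))

targets = ensemble_soft_targets(teacher_logits, temperature=2.0)
loss = distillation_loss(student_logits, targets, temperature=2.0)
```

In a real training loop `loss` would be minimised with respect to the student's parameters, typically alongside the usual loss on the hard labels.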

See Bubeck on this: Three mysteries in deep learning: Ensemble, knowledge distillation, and self-distillation.

## 5 Via NTK

How does this work? He, Lakshminarayanan, and Teh (2020).

## 6 Questions

These methods generally focus on the posterior *predictive*. How do I find posteriors for parameter values in my model without including them in my predictive loss explicitly? If many of my parameters are not interpretable, I am naturally tempted to fit some by maximum likelihood, take them as given, then update posteriors over the remainder, but this does not look like a principled inference procedure.

## 7 Cascades

Google AI Blog: Model Ensembles Are Faster Than You Think (Wang et al. 2021).
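My rough understanding of a cascade is that it runs the ensemble members one at a time, cheapest first, and stops as soon as the prediction looks confident enough. The sketch below assumes each model returns a vector of class probabilities, that confidence is the largest averaged probability, and that the predictions seen so far are simply averaged; these are illustrative assumptions, not a description of the exact procedure in Wang et al. (2021).

```python
import numpy as np

def cascade_predict(models, x, threshold=0.9):
    """Evaluate models in order, exiting early once the average is confident.

    `models` is a non-empty list of callables mapping x to class probabilities.
    Returns (predicted class, number of models actually evaluated).
    """
    running_sum = None
    for n_used, model in enumerate(models, start=1):
        probs = model(x)
        running_sum = probs if running_sum is None else running_sum + probs
        avg = running_sum / n_used
        if avg.max() >= threshold:
            break  # confident enough; skip the remaining, costlier models
    return int(avg.argmax()), n_used

# Hypothetical stand-ins for a cheap model and an expensive model.
def cheap(x):
    return np.array([0.6, 0.4])

def big(x):
    return np.array([0.95, 0.05])

easy = cascade_predict([cheap, big], x=None, threshold=0.5)  # stops after the cheap model
hard = cascade_predict([cheap, big], x=None, threshold=0.9)  # needs both models
```

The saving comes from inputs like `easy`, where the expensive model never runs at all.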

## 8 References

*arXiv:2110.11216 [Cs, Math, Stat]*.

*Mathematics of Computation*.

*The Journal of Machine Learning Research*.

*arXiv:2012.07244 [Cs]*.

*arXiv:2106.14806 [Cs, Stat]*.

*2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*.

*4th Workshop on Bayesian Deep Learning (NeurIPS 2019)*.

*Proceedings of the 33rd International Conference on Machine Learning (ICML-16)*.

*arXiv:1512.05287 [Stat]*.

*4th International Conference on Learning Representations (ICLR) Workshop Track*.

*arXiv:1506.02157 [Stat]*.

*arXiv:1705.07832 [Stat]*.

*arXiv:1805.08034 [Cs, Math]*.

*Advances in Neural Information Processing Systems*.

*arXiv:1503.02531 [Cs, Stat]*.

*Computer Vision – ECCV 2016*. Lecture Notes in Computer Science.

*arXiv:2001.08055 [Physics, Stat]*.

*Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2*. NIPS’15.

*Inverse Problems*.

*Proceedings of the 31st International Conference on Neural Information Processing Systems*. NIPS’17.

*Bayesian Analysis*.

*JMLR*.

*Proceedings of ICML*.

*IEEE Transactions on Neural Networks*.

*Third Workshop on Bayesian Deep Learning (NeurIPS 2018), Montréal, Canada.*

*arXiv:2105.14594 [Cs, Stat]*.

*arXiv:2012.01988 [Cs]*.

*ICLR*.

*arXiv:2102.10472 [Cs]*.

*arXiv:1306.2759 [Cs, Stat]*.

*Bayesian Analysis*.