Ensembling neural nets

Ensembling is one of the practical forms of approximate Bayesian inference for massively parameterised networks.

Explicit ensembles

Train a collection of networks and use their empirical mean and variance to estimate the posterior predictive (He, Lakshminarayanan, and Teh 2020; Huang et al. 2016; Lakshminarayanan, Pritzel, and Blundell 2017; Wen, Tran, and Ba 2020; Xie, Xu, and Chuang 2013). This is neat, and on one hand we might think there is nothing special to do here, since as near as I can tell it is already more or less classical model ensembling. In practice, though, various tricks are needed to make this work in a neural-network context, in particular because a single model is already supposed to be so big that it strains the GPU; training and storing many such models is presumably ridiculous. BatchEnsemble (Wen, Tran, and Ba 2020) is one such trick.
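A minimal sketch of that recipe in PyTorch, with a placeholder architecture and toy data of my own invention rather than anything from the cited papers: train K independently initialised networks on the same data, then use the empirical mean and variance of their predictions as a crude posterior predictive.

```python
import torch
from torch import nn

def make_net(in_dim=8, hidden=64):
    # each ensemble member gets its own random initialisation
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

def train_member(net, X, y, epochs=200, lr=1e-2):
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.mse_loss(net(X), y).backward()
        opt.step()
    return net

@torch.no_grad()
def ensemble_predict(members, X_test):
    preds = torch.stack([net(X_test) for net in members])  # (K, N, 1)
    return preds.mean(dim=0), preds.var(dim=0)  # predictive mean, between-member variance

# toy regression problem
X = torch.randn(256, 8)
y = X[:, :1].sin() + 0.1 * torch.randn(256, 1)
members = [train_member(make_net(), X, y) for _ in range(5)]
mean, var = ensemble_predict(members, torch.randn(10, 8))
```

The between-member variance is only a proxy for epistemic uncertainty; the papers above differ mainly in how they justify and calibrate it.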

Cute: in The human regression ensemble, Justin Domke draws curves by hand through datapoints on a PDF, treats the drawings as an ensemble, and gets pretty good results.

Dropout

Dropout is an implicit ensembling method. Or maybe the implicit ensembling method; I am not aware of others. Recommended reading: Foong et al. (2019); Gal, Hron, and Kendall (2017); Kingma, Salimans, and Welling (2015).

Dropout is a popular kind of noise layer which randomly zeroes out some coefficients in the net during training (and optionally at prediction time). A coarse resemblance to random forests etc. is pretty immediate, and indeed you can just use those instead. Here, however, we are trying to average over strong learners, not weak learners.

The key insight here is that dropout can apparently be rationalised as model averaging, and thence as a kind of implicit probabilistic learning, because in a certain limit it approaches a deep Gaussian process (Kingma, Salimans, and Welling 2015; Gal and Ghahramani 2016b, 2015). Leveraging this argument, some papers claim to approximate Bayesian inference by keeping dropout active at prediction time and averaging over the resulting random predictions (M. Kasim et al. 2019; M. F. Kasim et al. 2020).
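The MC-dropout recipe is simple enough to sketch (this is my own minimal PyTorch illustration with arbitrary architecture and dropout-rate choices, not the authors' code): leave the dropout layers stochastic at prediction time, run many forward passes, and report the empirical mean and variance.

```python
import torch
from torch import nn

class DropoutNet(nn.Module):
    def __init__(self, in_dim=8, hidden=64, p=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.body(x)

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=100):
    model.train()  # keep the dropout masks stochastic at prediction time
    samples = torch.stack([model(x) for _ in range(n_samples)])  # (S, N, 1)
    model.eval()
    # empirical mean and variance over the stochastic forward passes
    return samples.mean(dim=0), samples.var(dim=0)

model = DropoutNet()
mean, var = mc_dropout_predict(model, torch.randn(10, 8))
```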

AFAICT the current consensus seems to be that the highly cited and very simple model of Gal and Ghahramani (2015) is flawed, and that the rather more onerous approach of Kingma, Salimans, and Welling (2015) is how you would use dropout to get a more reasonable posterior. So much was said in a seminar, but I have not really used either paper in practice so I cannot comment.

Alternate model combinations

Should we stop weighting hypotheses and start “stacking” them instead (Yao et al. 2018)? (Also, how is stacking different from averaging?)

The widely recommended procedure of Bayesian model averaging is flawed in the M-open setting in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions and regularization to get more stability.
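For intuition, here is a toy version of the stacking step itself (my own simplification; the actual Yao et al. procedure uses Pareto-smoothed importance sampling to obtain the leave-one-out densities cheaply): given each candidate model's log predictive density at held-out points, choose simplex weights that maximise the log score of the mixture.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def stack_weights(lpd):
    """lpd: (N, K) array of held-out log predictive densities,
    lpd[i, k] = log p_k(y_i | x_i). Returns simplex weights of shape (K,)."""
    N, K = lpd.shape

    def neg_log_score(z):
        w = np.exp(z - z.max())
        w /= w.sum()  # softmax parameterisation keeps the weights on the simplex
        # log density of the stacked mixture at each held-out point
        return -logsumexp(lpd + np.log(w), axis=1).sum()

    res = minimize(neg_log_score, np.zeros(K), method="Nelder-Mead")
    w = np.exp(res.x - res.x.max())
    return w / w.sum()

# e.g. three candidate models scored on 200 held-out points
weights = stack_weights(np.random.randn(200, 3) - 1.0)
```

Unlike Bayesian model averaging, the weights are chosen for predictive performance of the mixture rather than posterior model probability, which is the point of the M-open critique above.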

Distilling

So apparently you can train a single model to emulate an ensemble of similar models? Great terminology here: Hinton, Vinyals, and Dean (2015) refer to distilling “dark knowledge”.

See Bubeck on this: Three mysteries in deep learning: Ensemble, knowledge distillation, and self-distillation.
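A bare-bones sketch of what such a training step might look like (placeholder models and an arbitrary temperature of my choosing; Hinton, Vinyals, and Dean 2015 also mix in a hard-label loss term which I omit): the student is fitted to the temperature-softened average of the teachers' predictive distributions rather than to the hard labels.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teachers, x, optimizer, T=3.0):
    """One gradient step fitting the student to the ensemble's softened predictions."""
    with torch.no_grad():
        # average the teachers' class probabilities at temperature T
        teacher_probs = torch.stack(
            [F.softmax(teacher(x) / T, dim=-1) for teacher in teachers]
        ).mean(dim=0)
    student_log_probs = F.log_softmax(student(x) / T, dim=-1)
    # KL divergence between the softened teacher mixture and the student;
    # the T**2 factor keeps gradient scales comparable across temperatures
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```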

Via NTK

How does this work? He, Lakshminarayanan, and Teh (2020).

Questions

These methods generally focus on the posterior predictive. How do I find posteriors for parameter values in my model without including them explicitly in my predictive loss? If many of my parameters are not interpretable, I am naturally tempted to fit some by maximum likelihood, take them as given, and then update posteriors over the remainder, but this does not look like a principled inference procedure.

References

Alquier, Pierre. 2021. “User-Friendly Introduction to PAC-Bayes Bounds.” arXiv:2110.11216 [cs, Math, Stat], October. http://arxiv.org/abs/2110.11216.
Chipman, Hugh A, Edward I George, and Robert E McCulloch. 2006. “Bayesian Ensemble Learning.” In, 8.
Clarke, Bertrand. 2003. “Comparing Bayes Model Averaging and Stacking When Model Approximation Error Cannot Be Ignored.” The Journal of Machine Learning Research 4: 683–712. https://doi.org/10.1162/153244304773936090.
Foong, Andrew Y K, David R Burt, Yingzhen Li, and Richard E Turner. 2019. “Pathologies of Factorised Gaussian and MC Dropout Posteriors in Bayesian Neural Networks.” In 4th Workshop on Bayesian Deep Learning (NeurIPS 2019), 17.
Gal, Yarin, and Zoubin Ghahramani. 2015. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In Proceedings of the 33rd International Conference on Machine Learning (ICML-16). http://arxiv.org/abs/1506.02142.
———. 2016a. “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.” In arXiv:1512.05287 [stat]. http://arxiv.org/abs/1512.05287.
———. 2016b. “Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference.” In 4th International Conference on Learning Representations (ICLR) Workshop Track. http://arxiv.org/abs/1506.02158.
———. 2016c. “Dropout as a Bayesian Approximation: Appendix.” arXiv:1506.02157 [stat], May. http://arxiv.org/abs/1506.02157.
Gal, Yarin, Jiri Hron, and Alex Kendall. 2017. “Concrete Dropout.” arXiv:1705.07832 [stat], May. http://arxiv.org/abs/1705.07832.
He, Bobby, Balaji Lakshminarayanan, and Yee Whye Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems. Vol. 33. https://proceedings.neurips.cc//paper_files/paper/2020/hash/0b1ec366924b26fc98fa7b71a9c249cf-Abstract.html.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv:1503.02531 [cs, Stat], March. http://arxiv.org/abs/1503.02531.
Huang, Gao, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. 2016. “Deep Networks with Stochastic Depth.” In Computer Vision – ECCV 2016, edited by Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, 646–61. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-46493-0_39.
Kasim, M. F., D. Watson-Parris, L. Deaconu, S. Oliver, P. Hatfield, D. H. Froula, G. Gregori, et al. 2020. “Up to Two Billion Times Acceleration of Scientific Simulations with Deep Neural Architecture Search.” arXiv:2001.08055 [physics, Stat], January. http://arxiv.org/abs/2001.08055.
Kasim, Muhammad, J Topp-Mugglestone, P Hatfield, D H Froula, G Gregori, M Jarvis, E Viezzer, and Sam Vinko. 2019. “A Million Times Speed up in Parameters Retrieval with Deep Learning.” In, 5.
Kingma, Diederik P., Tim Salimans, and Max Welling. 2015. “Variational Dropout and the Local Reparameterization Trick.” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, 2575–83. NIPS’15. Cambridge, MA, USA: MIT Press. http://arxiv.org/abs/1506.02557.
Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2017. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” In Proceedings of the 31st International Conference on Neural Information Processing Systems, 6405–16. NIPS’17. Red Hook, NY, USA: Curran Associates Inc. http://arxiv.org/abs/1612.01474.
Le, Tri, and Bertrand Clarke. 2017. “A Bayes Interpretation of Stacking for \(\mathcal{M}\)-Complete and \(\mathcal{M}\)-Open Settings.” Bayesian Analysis 12 (3): 807–29. https://doi.org/10.1214/16-BA1023.
Mandt, Stephan, Matthew D. Hoffman, and David M. Blei. 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” JMLR, April. http://arxiv.org/abs/1704.04289.
Minka, Thomas P. 2002. “Bayesian Model Averaging Is Not Model Combination.” http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.362.4927&rep=rep1&type=pdf.
Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” In Proceedings of ICML. http://arxiv.org/abs/1701.05369.
Pearce, Tim, Mohamed Zaki, and Andy Neely. 2018. “Bayesian Neural Network Ensembles.” Third Workshop on Bayesian Deep Learning (NeurIPS 2018), Montréal, Canada., 5.
Wen, Yeming, Dustin Tran, and Jimmy Ba. 2020. “BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning.” In ICLR. http://arxiv.org/abs/2002.06715.
Wortsman, Mitchell, Maxwell Horton, Carlos Guestrin, Ali Farhadi, and Mohammad Rastegari. 2021. “Learning Neural Network Subspaces.” arXiv:2102.10472 [cs], February. http://arxiv.org/abs/2102.10472.
Xie, Jingjing, Bing Xu, and Zhang Chuang. 2013. “Horizontal and Vertical Ensemble with Deep Representation for Classification.” arXiv:1306.2759 [cs, Stat], June. http://arxiv.org/abs/1306.2759.
Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2018. “Using Stacking to Average Bayesian Predictive Distributions.” Bayesian Analysis 13 (3): 917–1007. https://doi.org/10.1214/17-BA1091.
