Model averaging

On keeping many incorrect hypotheses and using them all as one goodish one

A mere placeholder. For now, see Ensemble learning on Wikipedia.

Train a bunch of different models and use them all. This is fashionable in machine learning competitions in the form of blending, stacking or staging, but it is also popular in classic frequentist inference as model averaging or bagging, and in Bayesian inference in the form of, e.g., posterior predictive distributions, which are easy to interpret as weighted ensembles, especially when computed by MCMC.

I’ve seen the idea pop up in disconnected areas recently. Specifically: a Bayesian heuristic for dropout in neural nets, AIC for frequentist model averaging, neural net ensembles, boosting/bagging, and optimal time series prediction in a statistical learning context.
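To make one of those concrete: in AIC-based frequentist model averaging, the recipe I have usually seen (I believe it is the one in Buckland, Burnham, and Augustin 1997) weights each model in proportion to exp(-Δ/2), where Δ is its AIC gap to the best model. Here is a minimal sketch of that; the AIC scores and point predictions are made-up numbers, and the function names are my own.

```python
import math

def akaike_weights(aic_values):
    """Turn a list of AIC scores into normalised model weights."""
    best = min(aic_values)
    # Relative likelihood of each model: exp(-delta / 2),
    # where delta is that model's AIC gap to the best-scoring model.
    rel = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]

def averaged_prediction(predictions, weights):
    """Weighted average of the individual models' point predictions."""
    return sum(w * p for w, p in zip(weights, predictions))

aics = [100.0, 102.0, 110.0]  # hypothetical AIC scores for three fitted models
preds = [1.0, 2.0, 5.0]       # hypothetical point predictions from those models

weights = akaike_weights(aics)
print(weights)
print(averaged_prediction(preds, weights))
```

Subtracting the best AIC before exponentiating changes nothing after normalisation; it just keeps the exponentials from underflowing when the raw scores are large.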

This vexingly incomplete article points out that something like model averaging might work for any convex loss thanks to Jensen’s inequality. I am most used to it with Kullback-Leibler (KL) loss.
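My reading of the Jensen argument: if the loss is convex in the prediction, then the loss of the averaged prediction is no greater than the average of the individual models' losses. A quick numerical check with squared loss follows; the predictions and target are made up, and squared loss is just a convenient convex example, not necessarily the one that article had in mind.

```python
def squared_loss(pred, target):
    """Squared error, which is convex in the prediction."""
    return (pred - target) ** 2

preds = [0.5, 1.5, 4.0]  # hypothetical predictions from three models
target = 2.5             # hypothetical true value

# Equal-weight ensemble prediction.
ensemble_pred = sum(preds) / len(preds)

loss_of_average = squared_loss(ensemble_pred, target)
average_of_losses = sum(squared_loss(p, target) for p in preds) / len(preds)

# For squared loss the gap between the two is exactly the spread
# (variance) of the individual predictions around the ensemble prediction.
spread = sum((p - ensemble_pred) ** 2 for p in preds) / len(preds)

print(loss_of_average, average_of_losses, spread)
```

Note what this does and does not say: the ensemble beats the average member, not necessarily the best member.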

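Two articles (Clarke 2003; Minka 2002) point out that model averaging and model combination are not the same thing, and that the difference is acute in the \(\mathcal{M}\)-open setting, where the true data-generating process is not among the candidate models.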
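A toy illustration of how I understand that distinction (my own construction, not an example from either paper): take two fixed-parameter Bernoulli models, neither of which is correct, put a uniform prior over them, and feed them idealised data. The Bayesian model averaging weights pile onto the single least-bad model as the sample grows, while a fixed combination of the two can predict better than either. All the numbers below are hypothetical.

```python
import math

def bernoulli_loglik(p, n_ones, n_zeros):
    """Log-likelihood of a fixed-parameter Bernoulli(p) model for the observed counts."""
    return n_ones * math.log(p) + n_zeros * math.log(1.0 - p)

def bma_weights(ps, n_ones, n_zeros):
    """Posterior model probabilities under a uniform prior over the candidate models."""
    logliks = [bernoulli_loglik(p, n_ones, n_zeros) for p in ps]
    top = max(logliks)  # subtract the max before exponentiating, for numerical stability
    unnorm = [math.exp(ll - top) for ll in logliks]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def expected_log_score(p_pred, p_true):
    """Expected log predictive density of predicting p_pred when the truth is p_true."""
    return p_true * math.log(p_pred) + (1.0 - p_true) * math.log(1.0 - p_pred)

candidate_ps = [0.3, 0.9]  # hypothetical candidate models; neither is right
p_true = 0.6               # hypothetical truth, outside the candidate set

for n in (10, 100, 1000):
    n_ones = round(p_true * n)  # idealised data: exactly 60% ones
    weights = bma_weights(candidate_ps, n_ones, n - n_ones)
    p_bma = sum(w * p for w, p in zip(weights, candidate_ps))
    print(n, [round(w, 4) for w in weights], round(expected_log_score(p_bma, p_true), 4))

# A fixed 50/50 combination of the same two models happens to recover p_true here.
p_combo = 0.5 * candidate_ps[0] + 0.5 * candidate_ps[1]
print("combination", round(expected_log_score(p_combo, p_true), 4))
```

The 50/50 combination weights are chosen by hand to make the point; in practice they would have to be estimated, for instance by stacking.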


Alquier, Pierre. 2021. “User-Friendly Introduction to PAC-Bayes Bounds.” arXiv:2110.11216 [Cs, Math, Stat], October.
Bates, J. M., and C. W. J. Granger. 1969. “The Combination of Forecasts.” Journal of the Operational Research Society 20 (4): 451–68.
Bishop, Christopher M., and Markus Svensen. 2012. “Bayesian Hierarchical Mixtures of Experts.” arXiv:1212.2447 [Cs, Stat], October.
Breiman, Leo. 1996. “Stacked Regressions.” Machine Learning 24 (1): 49–64.
Buckland, S. T., K. P. Burnham, and N. H. Augustin. 1997. “Model Selection: An Integral Part of Inference.” Biometrics 53 (2): 603–18.
Claeskens, Gerda, and Nils Lid Hjort. 2008. Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge; New York: Cambridge University Press.
Clarke, Bertrand. 2003. “Comparing Bayes Model Averaging and Stacking When Model Approximation Error Cannot Be Ignored.” The Journal of Machine Learning Research 4: 683–712.
Clyde, Merlise, and Edward I. George. 2004. “Model Uncertainty.” Statistical Science 19 (1): 81–94.
Fragoso, Tiago M., and Francisco Louzada Neto. 2015. “Bayesian Model Averaging: A Systematic Review and Conceptual Classification.” arXiv:1509.08864 [Stat], September.
Gammerman, Alexander, and Vladimir Vovk. 2007. “Hedging Predictions in Machine Learning.” The Computer Journal 50 (2): 151–63.
Hansen, Bruce E. 2007. “Least Squares Model Averaging.” Econometrica 75 (4): 1175–89.
He, Bobby, Balaji Lakshminarayanan, and Yee Whye Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems. Vol. 33.
Hinne, Max, Quentin Frederik Gronau, Don van den Bergh, and Eric-Jan Wagenmakers. 2019. “A Conceptual Introduction to Bayesian Model Averaging.” Preprint. PsyArXiv.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv:1503.02531 [Cs, Stat], March.
Hjort, Nils Lid, and Gerda Claeskens. 2003. “Frequentist Model Average Estimators.” Journal of the American Statistical Association 98 (464): 879–99.
Hoeting, Jennifer A., David Madigan, Adrian E. Raftery, and Chris T. Volinsky. 1999. “Bayesian Model Averaging: A Tutorial.” Statistical Science 14 (4): 382–417.
Hu, Feifang, and James V. Zidek. 2002. “The Weighted Likelihood.” The Canadian Journal of Statistics / La Revue Canadienne de Statistique 30 (3): 347–71.
Laan, Mark J. van der, Eric C. Polley, and Alan E. Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1).
Lawless, J. F., and Marc Fredette. 2005. “Frequentist Prediction Intervals and Predictive Distributions.” Biometrika 92 (3): 529–42.
Le, Tri M., and Bertrand S. Clarke. 2022. “Model Averaging Is Asymptotically Better Than Model Selection For Prediction.” Journal of Machine Learning Research 23 (33): 1–53.
Le, Tri, and Bertrand Clarke. 2017. “A Bayes Interpretation of Stacking for \(\mathcal{M}\)-Complete and \(\mathcal{M}\)-Open Settings.” Bayesian Analysis 12 (3): 807–29.
Leung, G., and A. R. Barron. 2006. “Information Theory and Mixing Least-Squares Regressions.” IEEE Transactions on Information Theory 52 (8): 3396–3410.
Minka, Thomas P. 2002. “Bayesian Model Averaging Is Not Model Combination.”
Phillips, Robert F. 1987. “Composite Forecasting: An Integrated Approach and Optimality Reconsidered.” Journal of Business & Economic Statistics 5 (3): 389–95.
Piironen, Juho, and Aki Vehtari. 2017. “Comparison of Bayesian Predictive Methods for Model Selection.” Statistics and Computing 27 (3): 711–35.
Polley, Eric, and Mark van der Laan. 2010. “Super Learner In Prediction.” U.C. Berkeley Division of Biostatistics Working Paper Series, May.
Ramchandran, Maya, and Rajarshi Mukherjee. 2021. “On Ensembling Vs Merging: Least Squares and Random Forests Under Covariate Shift.” arXiv:2106.02589 [Math, Stat], June.
Shen, Xiaotong, and Hsin-Cheng Huang. 2006. “Optimal Model Assessment, Selection, and Combination.” Journal of the American Statistical Association 101 (474): 554–68.
Shwartz-Ziv, Ravid, and Amitai Armon. 2021. “Tabular Data: Deep Learning Is Not All You Need.” arXiv:2106.03253 [Cs], June.
Wang, Haiying, Xinyu Zhang, and Guohua Zou. 2009. “Frequentist Model Averaging Estimation: A Review.” Journal of Systems Science and Complexity 22 (4): 732.
Wang, Xiaofang, Dan Kondratyuk, Eric Christiansen, Kris M. Kitani, Yair Alon, and Elad Eban. 2021. “Wisdom of Committees: An Overlooked Approach To Faster and More Accurate Models.” arXiv:2012.01988 [Cs], October.
Waterhouse, Steve, David MacKay, and Anthony Robinson. 1995. “Bayesian Methods for Mixtures of Experts.” In Advances in Neural Information Processing Systems, 8:7. MIT Press.
Wolpert, David H. 1992. “Stacked Generalization.” Neural Networks 5 (2): 241–59.
Zhang, Tianfang, Rasmus Bokrantz, and Jimmy Olsson. 2021. “A Similarity-Based Bayesian Mixture-of-Experts Model.” arXiv:2012.02130 [Cs, Stat], July.
Zhang, Xinyu, and Hua Liang. 2011. “Focused Information Criterion and Model Averaging for Generalized Additive Partial Linear Models.” The Annals of Statistics 39 (1): 174–200.
