Model averaging

On keeping many incorrect hypotheses and using them all as one goodish one

A mere placeholder. For now, see Ensemble learning on Wikipedia.

Train a bunch of different models and use them all. This is fashionable in machine learning competitions in the form of blending, stacking or staging, but it is also popular in classic frequentist inference as model averaging or bagging, and in Bayesian inference in the form of, e.g., posterior predictive distributions, which are easy to interpret as weighted ensembles, especially when computed by MCMC methods.
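To make the "posterior predictive as ensemble" reading concrete, here is a minimal sketch. The toy model (normal observations with known unit variance, a wide normal prior on the mean) is my assumption for illustration, and exact conjugate posterior draws stand in for real MCMC output. Each draw defines one ensemble member, and the predictive density is their equal-weight average.

```python
import numpy as np

rng = np.random.default_rng(2)

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Toy model (assumed for illustration): y_i ~ N(theta, 1), theta ~ N(0, 10^2).
data = rng.normal(loc=1.5, scale=1.0, size=20)
prior_var, noise_var = 100.0, 1.0
post_var = 1.0 / (1.0 / prior_var + len(data) / noise_var)
post_mean = post_var * data.sum() / noise_var

# Stand-in for MCMC output: exact draws from the conjugate posterior.
theta_draws = rng.normal(post_mean, np.sqrt(post_var), size=20_000)

# Each draw is one ensemble member predicting N(theta_s, 1);
# the posterior predictive density is their equal-weight average.
y_new = 2.0
ensemble_density = normal_pdf(y_new, theta_draws, noise_var).mean()

# Closed-form predictive density for this conjugate model, for comparison.
exact_density = normal_pdf(y_new, post_mean, post_var + noise_var)

print(ensemble_density, exact_density)
```

With a sampler that produces weighted or correlated draws, the average would be weighted accordingly, but the ensemble interpretation is the same.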

I’ve seen the idea pop up in disconnected areas recently. Specifically: a Bayesian heuristic for dropout in neural nets, AIC for frequentist model averaging, neural net ensembles, boosting/bagging, and optimal time series prediction in a statistical learning context.

This vexingly incomplete article points out that something like model averaging might work for any convex loss thanks to Jensen’s inequality: the loss of the averaged prediction is at most the average of the individual models’ losses. I am most used to it with K-L loss.
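A quick numerical check of the Jensen's inequality argument, under assumptions of my own choosing: equal weights, synthetic predictions, and two convex losses (squared error and log loss). For weights \(w_k \ge 0\) summing to one and a loss \(L\) convex in the prediction, \(L(\sum_k w_k f_k) \le \sum_k w_k L(f_k)\) holds pointwise, so it also holds on average over the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: K mediocre "models", each a noisy copy of the target.
n_obs, n_models = 200, 5
y = rng.normal(size=n_obs)
preds = y + rng.normal(scale=1.0, size=(n_models, n_obs))  # shape (K, n)
weights = np.full(n_models, 1.0 / n_models)  # equal weights, an assumption

def squared_loss(pred, target):
    return np.mean((pred - target) ** 2)

# Squared error: loss of the averaged prediction vs. average of the losses.
loss_of_average = squared_loss(weights @ preds, y)
average_of_losses = float(weights @ np.array([squared_loss(p, y) for p in preds]))

# Log loss: probs[k, i] is the probability model k gave to the outcome
# that actually occurred for observation i (synthetic numbers here).
probs = rng.uniform(0.05, 0.95, size=(n_models, n_obs))
log_loss_of_mixture = -np.mean(np.log(weights @ probs))
average_log_loss = -np.mean(weights @ np.log(probs))

print(loss_of_average, average_of_losses)
print(log_loss_of_mixture, average_log_loss)
```

Note what this does and does not show: the averaged model is never worse than the weighted average of its members, but it can still be worse than the single best member.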

Two articles (Clarke 2003; Minka 2002) point out that model averaging and model combination are not the same thing, and that the difference is acute in the M-open setting, where the true data-generating process is not among the candidate models.


Bates, J. M., and C. W. J. Granger. 1969. “The Combination of Forecasts.” Journal of the Operational Research Society 20 (4): 451–68.
Breiman, Leo. 1996. “Stacked Regressions.” Machine Learning 24 (1): 49–64.
Buckland, S. T., K. P. Burnham, and N. H. Augustin. 1997. “Model Selection: An Integral Part of Inference.” Biometrics 53 (2): 603–18.
Claeskens, Gerda, and Nils Lid Hjort. 2008. Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge; New York: Cambridge University Press.
Clarke, Bertrand. 2003. “Comparing Bayes Model Averaging and Stacking When Model Approximation Error Cannot Be Ignored.” The Journal of Machine Learning Research 4: 683–712.
Clyde, Merlise, and Edward I. George. 2004. “Model Uncertainty.” Statistical Science 19 (1): 81–94.
Fragoso, Tiago M., and Francisco Louzada Neto. 2015. “Bayesian Model Averaging: A Systematic Review and Conceptual Classification.” arXiv:1509.08864 [stat], September.
Hansen, Bruce E. 2007. “Least Squares Model Averaging.” Econometrica 75 (4): 1175–89.
He, Bobby, Balaji Lakshminarayanan, and Yee Whye Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems. Vol. 33.
Hinne, Max, Quentin Frederik Gronau, Don van den Bergh, and Eric-Jan Wagenmakers. 2019. “A Conceptual Introduction to Bayesian Model Averaging.” Preprint. PsyArXiv.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv:1503.02531 [cs, Stat], March.
Hjort, Nils Lid, and Gerda Claeskens. 2003. “Frequentist Model Average Estimators.” Journal of the American Statistical Association 98 (464): 879–99.
Hoeting, Jennifer A., David Madigan, Adrian E. Raftery, and Chris T. Volinsky. 1999. “Bayesian Model Averaging: A Tutorial.” Statistical Science 14 (4): 382–417.
Hu, Feifang, and James V. Zidek. 2002. “The Weighted Likelihood.” The Canadian Journal of Statistics / La Revue Canadienne de Statistique 30 (3): 347–71.
Laan, Mark J. van der, Eric C. Polley, and Alan E. Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1).
Lawless, J. F., and Marc Fredette. 2005. “Frequentist Prediction Intervals and Predictive Distributions.” Biometrika 92 (3): 529–42.
Le, Tri, and Bertrand Clarke. 2017. “A Bayes Interpretation of Stacking for \(\mathcal{M}\)-Complete and \(\mathcal{M}\)-Open Settings.” Bayesian Analysis 12 (3): 807–29.
Leung, G., and A.R. Barron. 2006. “Information Theory and Mixing Least-Squares Regressions.” IEEE Transactions on Information Theory 52 (8): 3396–3410.
Minka, Thomas P. 2002. “Bayesian Model Averaging Is Not Model Combination.”
Phillips, Robert F. 1987. “Composite Forecasting: An Integrated Approach and Optimality Reconsidered.” Journal of Business & Economic Statistics 5 (3): 389–95.
Piironen, Juho, and Aki Vehtari. 2017. “Comparison of Bayesian Predictive Methods for Model Selection.” Statistics and Computing 27 (3): 711–35.
Polley, Eric, and Mark van der Laan. 2010. “Super Learner In Prediction.” U.C. Berkeley Division of Biostatistics Working Paper Series, May.
Shen, Xiaotong, and Hsin-Cheng Huang. 2006. “Optimal Model Assessment, Selection, and Combination.” Journal of the American Statistical Association 101 (474): 554–68.
Shwartz-Ziv, Ravid, and Amitai Armon. 2021. “Tabular Data: Deep Learning Is Not All You Need.” arXiv:2106.03253 [cs], June.
Wang, Haiying, Xinyu Zhang, and Guohua Zou. 2009. “Frequentist Model Averaging Estimation: A Review.” Journal of Systems Science and Complexity 22 (4): 732.
Wolpert, David H. 1992. “Stacked Generalization.” Neural Networks 5 (2): 241–59.
Zhang, Xinyu, and Hua Liang. 2011. “Focused Information Criterion and Model Averaging for Generalized Additive Partial Linear Models.” The Annals of Statistics 39 (1): 174–200.
