Model averaging

On keeping many incorrect hypotheses and using them all as one goodish one

Ensemble methods. A mere placeholder to remind me to create a model averaging notebook, since I’ve seen the idea pop up in several disconnected areas recently: specifically, as a Bayesian heuristic for dropout in neural nets, as AIC-based weighting in frequentist model averaging, and, in a statistical learning context, for optimal time series prediction.

Relationship to Bayesian posterior predictive distributions?
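
One standard way to write the connection down, as a sketch rather than an answer to the question: in Bayesian model averaging over candidate models $M_1, \dots, M_K$ (notation assumed here, not taken from any particular reference below), the posterior predictive distribution for a new observation $\tilde{y}$ given data $D$ is a mixture of the per-model posterior predictives, weighted by posterior model probabilities:

$$
p(\tilde{y} \mid D) = \sum_{k=1}^{K} p(\tilde{y} \mid M_k, D)\, p(M_k \mid D),
\qquad
p(M_k \mid D) = \frac{p(D \mid M_k)\, p(M_k)}{\sum_{j=1}^{K} p(D \mid M_j)\, p(M_j)}.
$$

So the averaging weights are the posterior model probabilities, and the marginal likelihood $p(D \mid M_k)$ is what rewards or penalizes each model.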

This seems to not be quite the same thing as bagging, in that when you take a

Model weights are often in terms of degrees-of-freedom penalties. It would probably be an instructive exercise for me to work out why for myself.
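
As a starting point for that exercise, here is a minimal sketch of one common weighting scheme, Akaike weights, where each model's weight is proportional to $\exp(-\Delta_k/2)$ and $\Delta_k$ is its AIC minus the smallest AIC. The function names and the three fitted models are hypothetical, made up purely for illustration.

```python
import math

def aic(log_likelihood, n_params):
    # AIC = -2 log L + 2k; the 2k term is the degrees-of-freedom penalty.
    return -2.0 * log_likelihood + 2.0 * n_params

def akaike_weights(aic_values):
    # Weight is proportional to exp(-delta / 2), with delta measured from the best model.
    # Subtracting the minimum first also guards against numerical underflow.
    best = min(aic_values)
    raw = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(raw)
    return [r / total for r in raw]

def average_predictions(predictions, weights):
    # Weighted average of the candidate models' point predictions.
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical fits: (maximized log-likelihood, number of parameters, point prediction).
fits = [(-120.0, 2, 1.9), (-119.5, 3, 2.1), (-119.4, 6, 2.6)]
aics = [aic(ll, k) for ll, k, _ in fits]
weights = akaike_weights(aics)
combined = average_predictions([p for _, _, p in fits], weights)
print(weights, combined)
```

Note how the third model has the best raw log-likelihood but the smallest weight, because its extra parameters cost more in penalty than they gain in fit.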

Hinton, Vinyals, and Dean (2015) have some work on distilling an ensemble of averaged models into a single model, by training that single model to match the ensemble’s softened output probabilities. This is apparently called “dark knowledge”.
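
A rough sketch of the distillation idea, under my own simplifying assumptions: soft targets are taken as the arithmetic mean of the ensemble members' temperature-softened class probabilities, and the student is scored by cross-entropy against them. All function names and logits below are hypothetical, and this shows only the loss computation, not a training loop.

```python
import math

def softmax(logits, temperature=1.0):
    # Higher temperature flattens the distribution, exposing relative
    # probabilities of the non-argmax classes.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(ensemble_logits, temperature):
    # Average the softened predictive distributions of the ensemble members.
    probs = [softmax(logits, temperature) for logits in ensemble_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    return [sum(p[i] for p in probs) / n_models for i in range(n_classes)]

def distillation_loss(student_logits, soft_targets, temperature):
    # Cross-entropy between the soft targets and the student's softened output.
    q = softmax(student_logits, temperature)
    return -sum(t * math.log(qi) for t, qi in zip(soft_targets, q))

# Hypothetical logits from three ensemble members for one input, three classes.
ensemble_logits = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3], [2.2, 0.5, 0.4]]
T = 3.0
targets = ensemble_soft_targets(ensemble_logits, T)
loss = distillation_loss([2.0, 1.0, 0.2], targets, T)
print(targets, loss)
```

Minimizing this loss over many inputs pushes the single student model toward the averaged behaviour of the ensemble, which is the sense in which one model "includes" several.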


Bates, J. M., and C. W. J. Granger. 1969. “The Combination of Forecasts.” Journal of the Operational Research Society 20 (4): 451–68.
Buckland, S. T., K. P. Burnham, and N. H. Augustin. 1997. “Model Selection: An Integral Part of Inference.” Biometrics 53 (2): 603–18.
Claeskens, Gerda, and Nils Lid Hjort. 2008. Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge; New York: Cambridge University Press.
Clyde, Merlise, and Edward I. George. 2004. “Model Uncertainty.” Statistical Science 19 (1): 81–94.
Fragoso, Tiago M., and Francisco Louzada Neto. 2015. “Bayesian Model Averaging: A Systematic Review and Conceptual Classification.” September 29, 2015.
Hansen, Bruce E. 2007. “Least Squares Model Averaging.” Econometrica 75 (4): 1175–89.
He, Bobby, Balaji Lakshminarayanan, and Yee Whye Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems. Vol. 33.
Hinne, Max, Quentin Frederik Gronau, Don van den Bergh, and Eric-Jan Wagenmakers. 2019. “A Conceptual Introduction to Bayesian Model Averaging.” Preprint. PsyArXiv.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” March 9, 2015.
Hjort, Nils Lid, and Gerda Claeskens. 2003. “Frequentist Model Average Estimators.” Journal of the American Statistical Association 98 (464): 879–99.
Hoeting, Jennifer A., David Madigan, Adrian E. Raftery, and Chris T. Volinsky. 1999. “Bayesian Model Averaging: A Tutorial.” Statistical Science 14 (4): 382–417.
Hu, Feifang, and James V. Zidek. 2002. “The Weighted Likelihood.” The Canadian Journal of Statistics / La Revue Canadienne de Statistique 30 (3): 347–71.
Laan, Mark J. van der, Eric C Polley, and Alan E. Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1).
Lawless, J. F., and Marc Fredette. 2005. “Frequentist Prediction Intervals and Predictive Distributions.” Biometrika 92 (3): 529–42.
Le, Tri, and Bertrand Clarke. 2017. “A Bayes Interpretation of Stacking for $\mathcal{M}$-Complete and $\mathcal{M}$-Open Settings.” Bayesian Analysis 12 (3): 807–29.
Leung, G., and A. R. Barron. 2006. “Information Theory and Mixing Least-Squares Regressions.” IEEE Transactions on Information Theory 52 (8): 3396–3410.
Phillips, Robert F. 1987. “Composite Forecasting: An Integrated Approach and Optimality Reconsidered.” Journal of Business & Economic Statistics 5 (3): 389–95.
Piironen, Juho, and Aki Vehtari. 2017. “Comparison of Bayesian Predictive Methods for Model Selection.” Statistics and Computing 27 (3): 711–35.
Polley, Eric, and Mark van der Laan. 2010. “Super Learner In Prediction.” U.C. Berkeley Division of Biostatistics Working Paper Series, May.
Shen, Xiaotong, and Hsin-Cheng Huang. 2006. “Optimal Model Assessment, Selection, and Combination.” Journal of the American Statistical Association 101 (474): 554–68.
Wang, Haiying, Xinyu Zhang, and Guohua Zou. 2009. “Frequentist Model Averaging Estimation: A Review.” Journal of Systems Science and Complexity 22 (4): 732.
Zhang, Xinyu, and Hua Liang. 2011. “Focused Information Criterion and Model Averaging for Generalized Additive Partial Linear Models.” The Annals of Statistics 39 (1): 174–200.