Model averaging

On keeping many incorrect hypotheses and using them all as one goodish one


Train a bunch of different models and use them all. Fashionable in the form of blending, stacking, or staging in machine learning competitions, but also popular in classical inference.
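For concreteness, here is a minimal stacking sketch, assuming scikit-learn is available; the base learners and toy data are arbitrary illustrative choices, not anything canonical from the references:

```python
# Minimal Wolpert-style stacked generalization sketch (illustrative only).
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Base learners produce out-of-fold predictions; a meta-learner (here RidgeCV)
# learns how to weight and combine them.
stack = StackingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("tree", DecisionTreeRegressor(max_depth=5, random_state=0)),
        ("knn", KNeighborsRegressor(n_neighbors=10)),
    ],
    final_estimator=RidgeCV(),
    cv=5,
)
print(stack.fit(X, y).score(X, y))  # in-sample R^2, just to show it runs
```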

A mere placeholder. For now, see Ensemble learning on Wikipedia. I’ve seen the idea pop up in disconnected areas recently. Specifically: as a Bayesian heuristic for dropout in neural nets, in AIC-based frequentist model averaging, in neural net ensembles, in boosting/bagging, and in a statistical learning context for optimal time series prediction.
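As a toy version of the simplest cases: uniform averaging just takes the mean prediction, and the AIC flavour swaps in Akaike weights $w_i \propto \exp(-\Delta_i/2)$, roughly in the spirit of Buckland et al. (1997). A minimal sketch, assuming scikit-learn and NumPy; the models, data, and AIC scores are hypothetical placeholders:

```python
# Model averaging sketch: uniform weights vs. AIC weights (illustrative only).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [
    LinearRegression(),
    DecisionTreeRegressor(max_depth=5, random_state=0),
    KNeighborsRegressor(n_neighbors=10),
]
# Columns of `preds` are the members' test-set predictions.
preds = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in models])

# Uniform model average: every member gets the same vote.
print("average MSE:", mean_squared_error(y_te, preds.mean(axis=1)))

# AIC weights: w_i = exp(-delta_i/2) / sum_j exp(-delta_j/2),
# where delta_i = AIC_i - min_j AIC_j.
aics = np.array([1012.3, 1009.8, 1015.1])  # hypothetical AIC scores
w = np.exp(-0.5 * (aics - aics.min()))
w /= w.sum()
print("AIC-weighted MSE:", mean_squared_error(y_te, preds @ w))
```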

This vexingly incomplete article points out that something like model averaging might work for any convex loss, thanks to Jensen’s inequality. I am most used to it with Kullback–Leibler (KL) loss.
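To spell out the Jensen step: if the loss $\ell(\cdot, y)$ is convex in the prediction, the loss of a convex combination of predictors is bounded by the same combination of their losses,

$$
\ell\Bigl(\sum_k w_k f_k(x),\, y\Bigr) \;\le\; \sum_k w_k\, \ell\bigl(f_k(x), y\bigr),
\qquad w_k \ge 0,\ \sum_k w_k = 1,
$$

so the averaged predictor is, pointwise, never worse than the weighted-average performance of the ensemble members. KL divergence is jointly convex in its arguments, so the familiar KL case falls out as a special instance.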

References

Bates, J. M., and C. W. J. Granger. 1969. “The Combination of Forecasts.” Journal of the Operational Research Society 20 (4): 451–68. https://doi.org/10.1057/jors.1969.103.
Breiman, Leo. 1996. “Stacked Regressions.” Machine Learning 24 (1): 49–64. https://doi.org/10.1007/BF00117832.
Buckland, S. T., K. P. Burnham, and N. H. Augustin. 1997. “Model Selection: An Integral Part of Inference.” Biometrics 53 (2): 603–18. https://doi.org/10.2307/2533961.
Claeskens, Gerda, and Nils Lid Hjort. 2008. Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge; New York: Cambridge University Press.
Clyde, Merlise, and Edward I. George. 2004. “Model Uncertainty.” Statistical Science 19 (1): 81–94. https://doi.org/10.1214/088342304000000035.
Fragoso, Tiago M., and Francisco Louzada Neto. 2015. “Bayesian Model Averaging: A Systematic Review and Conceptual Classification.” September 29, 2015. http://arxiv.org/abs/1509.08864.
Hansen, Bruce E. 2007. “Least Squares Model Averaging.” Econometrica 75 (4): 1175–89. https://doi.org/10.1111/j.1468-0262.2007.00785.x.
He, Bobby, Balaji Lakshminarayanan, and Yee Whye Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems. Vol. 33. https://proceedings.neurips.cc//paper_files/paper/2020/hash/0b1ec366924b26fc98fa7b71a9c249cf-Abstract.html.
Hinne, Max, Quentin Frederik Gronau, Don van den Bergh, and Eric-Jan Wagenmakers. 2019. “A Conceptual Introduction to Bayesian Model Averaging.” Preprint. PsyArXiv. https://doi.org/10.31234/osf.io/wgb64.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” March 9, 2015. http://arxiv.org/abs/1503.02531.
Hjort, Nils Lid, and Gerda Claeskens. 2003. “Frequentist Model Average Estimators.” Journal of the American Statistical Association 98 (464): 879–99. https://doi.org/10.1198/016214503000000828.
Hoeting, Jennifer A., David Madigan, Adrian E. Raftery, and Chris T. Volinsky. 1999. “Bayesian Model Averaging: A Tutorial.” Statistical Science 14 (4): 382–417. https://doi.org/10.1214/ss/1009212519.
Hu, Feifang, and James V. Zidek. 2002. “The Weighted Likelihood.” The Canadian Journal of Statistics / La Revue Canadienne de Statistique 30 (3): 347–71. https://doi.org/10.2307/3316141.
Laan, Mark J. van der, Eric C. Polley, and Alan E. Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1). https://doi.org/10.2202/1544-6115.1309.
Lawless, J. F., and Marc Fredette. 2005. “Frequentist Prediction Intervals and Predictive Distributions.” Biometrika 92 (3): 529–42. https://doi.org/10.1093/biomet/92.3.529.
Le, Tri, and Bertrand Clarke. 2017. “A Bayes Interpretation of Stacking for $\mathcal{M}$-Complete and $\mathcal{M}$-Open Settings.” Bayesian Analysis 12 (3): 807–29. https://doi.org/10.1214/16-BA1023.
Leung, G., and A. R. Barron. 2006. “Information Theory and Mixing Least-Squares Regressions.” IEEE Transactions on Information Theory 52 (8): 3396–3410. https://doi.org/10.1109/TIT.2006.878172.
Phillips, Robert F. 1987. “Composite Forecasting: An Integrated Approach and Optimality Reconsidered.” Journal of Business & Economic Statistics 5 (3): 389–95. https://doi.org/10.1080/07350015.1987.10509603.
Piironen, Juho, and Aki Vehtari. 2017. “Comparison of Bayesian Predictive Methods for Model Selection.” Statistics and Computing 27 (3): 711–35. https://doi.org/10.1007/s11222-016-9649-y.
Polley, Eric, and Mark van der Laan. 2010. “Super Learner In Prediction.” U.C. Berkeley Division of Biostatistics Working Paper Series, May. https://biostats.bepress.com/ucbbiostat/paper266.
Shen, Xiaotong, and Hsin-Cheng Huang. 2006. “Optimal Model Assessment, Selection, and Combination.” Journal of the American Statistical Association 101 (474): 554–68. https://doi.org/10.1198/016214505000001078.
Wang, Haiying, Xinyu Zhang, and Guohua Zou. 2009. “Frequentist Model Averaging Estimation: A Review.” Journal of Systems Science and Complexity 22 (4): 732. https://doi.org/10.1007/s11424-009-9198-y.
Wolpert, David H. 1992. “Stacked Generalization.” Neural Networks 5 (2): 241–59. https://doi.org/10.1016/S0893-6080(05)80023-1.
Zhang, Xinyu, and Hua Liang. 2011. “Focused Information Criterion and Model Averaging for Generalized Additive Partial Linear Models.” The Annals of Statistics 39 (1): 174–200. https://doi.org/10.1214/10-AOS832.
