Model averaging

On keeping many incorrect hypotheses and using them all as one goodish one

Witness the Feejee mermaid

A mere placeholder. For now, see Ensemble learning on Wikipedia.

Train a bunch of different models and use them all. This is fashionable in machine learning competitions in the form of blending, stacking or staging, but it is also popular in classic frequentist inference as model averaging or bagging, and in Bayesian inference as, e.g., posterior predictives, which, especially with MCMC methods, are easy to interpret as weighted ensembles.

I’ve seen the idea pop up in disconnected areas recently. Specifically: a Bayesian heuristic for dropout in neural nets, AIC for frequentist model averaging, neural net ensembles, boosting/bagging, and, in a statistical learning context, optimal time series prediction.
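To make the AIC item concrete, here is a minimal sketch of one common weighting scheme, the "Akaike weights", which I believe is the flavour discussed by Buckland, Burnham, and Augustin (1997): each model gets weight proportional to exp(-ΔAIC/2). The AIC scores and predictions below are made-up numbers, and the function name is my own.

```python
import math

def akaike_weights(aic_values):
    """Weights proportional to exp(-0.5 * (AIC_k - AIC_min)), normalised to sum to 1."""
    best = min(aic_values)  # subtracting the minimum avoids underflow
    raw = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical AIC scores and point predictions from three candidate models.
aic = [100.0, 102.0, 110.0]
preds = [1.0, 1.5, 3.0]

weights = akaike_weights(aic)
# The model-averaged prediction is the weighted mean of the candidates.
averaged = sum(w * p for w, p in zip(weights, preds))
```

Note that the weights depend only on AIC differences, so a model two AIC units worse than the best gets exp(-1) times its weight.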

This vexingly incomplete article points out that something like model averaging might work for any convex loss thanks to Jensen’s inequality.
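To spell out the Jensen step in my own notation (the symbols are my assumptions, not taken from the article): suppose we have predictors $f_1, \dots, f_K$, weights $w_k \ge 0$ with $\sum_k w_k = 1$, and a loss $\ell(\hat{y}, y)$ that is convex in its first argument. Then, pointwise,

$$
\ell\Big(\sum_{k=1}^{K} w_k f_k(x),\, y\Big) \;\le\; \sum_{k=1}^{K} w_k\, \ell\big(f_k(x),\, y\big).
$$

Taking expectations over $(x, y)$ preserves the inequality, so the risk of the averaged predictor is no worse than the weighted average of the individual risks. This does not say the average beats the best single model, only that it beats picking a model at random according to $w$.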

Two articles (Clarke 2003; Minka 2002) point out that model averaging and combination are not the same and the difference is acute in the M-open setting.


Alternate fun branding: “super learning”. Not actually model averaging, but looks pretty similar if you squint.

Breiman (1996); Clarke (2003); T. Le and Clarke (2017); Naimi and Balzer (2018); Ting and Witten (1999); Wolpert (1992); Yao et al. (2022); Y. Zhang et al. (2022)
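A toy sketch of the stacking idea, under loud simplifying assumptions: two base models only (a constant and a straight line), leave-one-out rather than K-fold predictions, squared-error loss, and weights restricted to a convex combination. Breiman (1996) argues for non-negativity constraints on the weights; the sum-to-one restriction here is my simplification so that the weight has a closed form. All function names are hypothetical.

```python
import random

def fit_mean(xs, ys):
    """Base model 1: ignore x and predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Base model 2: ordinary least-squares line y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def loo_predictions(fit, xs, ys):
    """Leave-one-out predictions: refit without point i, then predict it."""
    out = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        out.append(model(xs[i]))
    return out

def stack_two(p1, p2, ys):
    """Weight w in [0, 1] minimising sum (y - (w * p1 + (1 - w) * p2))^2."""
    num = sum((y - b) * (a - b) for a, b, y in zip(p1, p2, ys))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    if den == 0.0:
        return 0.5  # identical predictions, so the weight is arbitrary
    # The objective is a 1-D convex quadratic, so clipping gives the constrained optimum.
    return min(1.0, max(0.0, num / den))

# Synthetic data: a weak linear trend plus noise.
random.seed(0)
xs = [i / 10 for i in range(30)]
ys = [0.5 * x + random.gauss(0.0, 1.0) for x in xs]

# Level 1: held-out predictions from each base model.
p_mean = loo_predictions(fit_mean, xs, ys)
p_line = loo_predictions(fit_line, xs, ys)

# Level 2: choose the combination weight on the held-out predictions.
w = stack_two(p_mean, p_line, ys)

# Final predictor: refit base models on all data, combine with the stacked weight.
mean_model, line_model = fit_mean(xs, ys), fit_line(xs, ys)

def stacked(x):
    return w * mean_model(x) + (1.0 - w) * line_model(x)
```

The point of fitting the weight on held-out predictions rather than in-sample fits is that in-sample error would always favour the more flexible model.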

Bayesian stacking

As above, but Bayesian. This motivates a suggestive invocation of M-open machinery (Clarke 2003; Clyde and Iversen 2013; Hoeting et al. 1999; T. Le and Clarke 2017; T. M. Le and Clarke 2022; Minka 2002; Naimi and Balzer 2018; Polley 2010; Ting and Witten 1999; Wolpert 1992; Yao et al. 2022, 2018).
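A rough sketch of stacking predictive distributions in the spirit of Yao et al. (2018): choose simplex weights that maximise the summed log of the weighted predictive density at each held-out point. Assumptions, loudly: the three candidate predictive distributions are fixed Gaussians with no fitted parameters (so the leave-one-out density is just the density), the data are synthetic, and the optimiser is a simple EM-style fixed-point update that I have swapped in for convenience, not whatever the paper uses. All names are hypothetical.

```python
import math
import random

def normal_pdf(y, mu, sigma):
    """Density of a Normal(mu, sigma^2) at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_score(weights, dens):
    """Sum over data points of the log weighted predictive density."""
    return sum(math.log(sum(w * p for w, p in zip(weights, row))) for row in dens)

def stacking_weights(dens, n_iter=500):
    """Maximise the log score over the simplex by EM-style updates.

    dens[i][k] is the predictive density of candidate k at held-out point i.
    Each update keeps the weights on the simplex and never decreases the score.
    """
    k = len(dens[0])
    w = [1.0 / k] * k  # start from uniform weights
    for _ in range(n_iter):
        new = [0.0] * k
        for row in dens:
            mix = sum(wj * pj for wj, pj in zip(w, row))
            for j in range(k):
                new[j] += w[j] * row[j] / mix  # responsibility of candidate j
        w = [v / len(dens) for v in new]
    return w

# Synthetic data from a two-component mixture that no single candidate matches.
random.seed(1)
ys = [random.gauss(0.0, 1.0) if random.random() < 0.7 else random.gauss(3.0, 1.0)
      for _ in range(200)]

# Three fixed candidate predictive distributions, given as (mean, sd).
candidates = [(0.0, 1.0), (3.0, 1.0), (10.0, 1.0)]
dens = [[normal_pdf(y, mu, s) for mu, s in candidates] for y in ys]

w = stacking_weights(dens)
```

Unlike posterior model probabilities, nothing here asks which candidate is "true"; the weights are chosen purely for predictive performance of the combination, which is why the construction still makes sense when all candidates are wrong.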


Alquier, Pierre. 2021. “User-Friendly Introduction to PAC-Bayes Bounds.” arXiv:2110.11216 [Cs, Math, Stat], October.
Bates, J. M., and C. W. J. Granger. 1969. “The Combination of Forecasts.” Journal of the Operational Research Society 20 (4): 451–68.
Bishop, Christopher M., and Markus Svensen. 2012. “Bayesian Hierarchical Mixtures of Experts.” arXiv:1212.2447 [Cs, Stat], October.
Breiman, Leo. 1996. “Stacked Regressions.” Machine Learning 24 (1): 49–64.
Buckland, S. T., K. P. Burnham, and N. H. Augustin. 1997. “Model Selection: An Integral Part of Inference.” Biometrics 53 (2): 603–18.
Card, Dallas, Michael Zhang, and Noah A. Smith. 2019. “Deep Weighted Averaging Classifiers.” In Proceedings of the Conference on Fairness, Accountability, and Transparency, 369–78.
Claeskens, Gerda, and Nils Lid Hjort. 2008. Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge; New York: Cambridge University Press.
Clarke, Bertrand. 2003. “Comparing Bayes Model Averaging and Stacking When Model Approximation Error Cannot Be Ignored.” The Journal of Machine Learning Research 4: 683–712.
Clyde, Merlise, and Edward I. George. 2004. “Model Uncertainty.” Statistical Science 19 (1): 81–94.
Clyde, Merlise, and Edwin S. Iversen. 2013. “Bayesian Model Averaging in the M-Open Framework.” In Bayesian Theory and Applications, edited by Paul Damien, Petros Dellaportas, Nicholas G. Polson, and David A. Stephens. Oxford University Press.
Fragoso, Tiago M., and Francisco Louzada Neto. 2015. “Bayesian Model Averaging: A Systematic Review and Conceptual Classification.” arXiv:1509.08864 [Stat], September.
Gammerman, Alexander, and Vladimir Vovk. 2007. “Hedging Predictions in Machine Learning.” The Computer Journal 50 (2): 151–63.
Ganaie, M. A., Minghui Hu, A. K. Malik, M. Tanveer, and P. N. Suganthan. 2022. “Ensemble Deep Learning: A Review.” Engineering Applications of Artificial Intelligence 115 (October): 105151.
Hansen, Bruce E. 2007. “Least Squares Model Averaging.” Econometrica 75 (4): 1175–89.
He, Bobby, Balaji Lakshminarayanan, and Yee Whye Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems. Vol. 33.
Hinne, Max, Quentin Frederik Gronau, Don van den Bergh, and Eric-Jan Wagenmakers. 2019. “A Conceptual Introduction to Bayesian Model Averaging.” Preprint. PsyArXiv.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv:1503.02531 [Cs, Stat], March.
Hjort, Nils Lid, and Gerda Claeskens. 2003. “Frequentist Model Average Estimators.” Journal of the American Statistical Association 98 (464): 879–99.
Hoeting, Jennifer A., David Madigan, Adrian E. Raftery, and Chris T. Volinsky. 1999. “Bayesian Model Averaging: A Tutorial.” Statistical Science 14 (4): 382–417.
Hu, Feifang, and James V. Zidek. 2002. “The Weighted Likelihood.” The Canadian Journal of Statistics / La Revue Canadienne de Statistique 30 (3): 347–71.
Laan, Mark J. van der, Eric C. Polley, and Alan E. Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1).
Lawless, J. F., and Marc Fredette. 2005. “Frequentist Prediction Intervals and Predictive Distributions.” Biometrika 92 (3): 529–42.
Le, Tri M., and Bertrand S. Clarke. 2022. “Model Averaging Is Asymptotically Better Than Model Selection For Prediction.” Journal of Machine Learning Research 23 (33): 1–53.
Le, Tri, and Bertrand Clarke. 2017. “A Bayes Interpretation of Stacking for M-Complete and M-Open Settings.” Bayesian Analysis 12 (3): 807–29.
Leung, G., and A. R. Barron. 2006. “Information Theory and Mixing Least-Squares Regressions.” IEEE Transactions on Information Theory 52 (8): 3396–3410.
Minka, Thomas P. 2002. “Bayesian Model Averaging Is Not Model Combination.”
Naimi, Ashley I., and Laura B. Balzer. 2018. “Stacked Generalization: An Introduction to Super Learning.” European Journal of Epidemiology 33 (5): 459–64.
Phillips, Robert F. 1987. “Composite Forecasting: An Integrated Approach and Optimality Reconsidered.” Journal of Business & Economic Statistics 5 (3): 389–95.
Piironen, Juho, and Aki Vehtari. 2017. “Comparison of Bayesian Predictive Methods for Model Selection.” Statistics and Computing 27 (3): 711–35.
Polley, Eric C. 2010. “Super Learner In Prediction.” U.C. Berkeley Division of Biostatistics Working Paper Series, May.
Ramchandran, Maya, and Rajarshi Mukherjee. 2021. “On Ensembling Vs Merging: Least Squares and Random Forests Under Covariate Shift.” arXiv:2106.02589 [Math, Stat], June.
Shen, Xiaotong, and Hsin-Cheng Huang. 2006. “Optimal Model Assessment, Selection, and Combination.” Journal of the American Statistical Association 101 (474): 554–68.
Shwartz-Ziv, Ravid, and Amitai Armon. 2021. “Tabular Data: Deep Learning Is Not All You Need.” arXiv:2106.03253 [Cs], June.
Ting, K. M., and I. H. Witten. 1999. “Issues in Stacked Generalization.” Journal of Artificial Intelligence Research 10 (May): 271–89.
Wang, Haiying, Xinyu Zhang, and Guohua Zou. 2009. “Frequentist Model Averaging Estimation: A Review.” Journal of Systems Science and Complexity 22 (4): 732.
Wang, Xiaofang, Dan Kondratyuk, Eric Christiansen, Kris M. Kitani, Yair Alon, and Elad Eban. 2021. “Wisdom of Committees: An Overlooked Approach To Faster and More Accurate Models.” arXiv:2012.01988 [Cs], October.
Waterhouse, Steve, David MacKay, and Anthony Robinson. 1995. “Bayesian Methods for Mixtures of Experts.” In Advances in Neural Information Processing Systems, 8:7. MIT Press.
Wolpert, David H. 1992. “Stacked Generalization.” Neural Networks 5 (2): 241–59.
Yao, Yuling, Gregor Pirš, Aki Vehtari, and Andrew Gelman. 2022. “Bayesian Hierarchical Stacking: Some Models Are (Somewhere) Useful.” Bayesian Analysis 17 (4): 1043–71.
Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2018. “Using Stacking to Average Bayesian Predictive Distributions.” Bayesian Analysis 13 (3): 917–1007.
Zhang, Tianfang, Rasmus Bokrantz, and Jimmy Olsson. 2021. “A Similarity-Based Bayesian Mixture-of-Experts Model.” arXiv:2012.02130 [Cs, Stat], July.
Zhang, Xinyu, and Hua Liang. 2011. “Focused Information Criterion and Model Averaging for Generalized Additive Partial Linear Models.” The Annals of Statistics 39 (1): 174–200.
Zhang, Yuzhen, Jun Ma, Shunlin Liang, Xisheng Li, and Jindong Liu. 2022. “A Stacking Ensemble Algorithm for Improving the Biases of Forest Aboveground Biomass Estimations from Multiple Remotely Sensed Datasets.” GIScience & Remote Sensing 59 (1): 234–49.
