Model averaging, model stacking, model ensembling

On keeping many incorrect hypotheses and using them all as one goodish one

June 20, 2017 — July 24, 2023

Figure 1: Witness the Feejee mermaid

Train a bunch of different models and use them all. This is fashionable in machine learning competitions in the form of blending, stacking or staging, but it is also popular in classic frequentist inference as model averaging or bagging, and in Bayesian inference as e.g. posterior predictives, which are easy to interpret as weighted ensembles, especially when computed by MCMC.

I’ve seen the idea pop up in disconnected areas recently. Specifically: a Bayesian heuristic for dropout in neural nets, AIC for frequentist model averaging, neural net ensembles, boosting/bagging, and optimal time series prediction in a statistical learning context.

This vexingly incomplete article points out that something like model averaging might work for any convex loss thanks to Jensen’s inequality.
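
The Jensen argument can be spelled out in a line or two. This is my own sketch rather than a summary of the linked article; the symbols (loss \ell, member predictions f_k, weights w_k) are notation I am assuming for illustration.

```latex
% Assumptions: \ell(y, \cdot) is convex in its second argument;
% w_k \ge 0 and \sum_k w_k = 1; f_k are the member models' predictions.
\ell\Big(y, \sum_{k=1}^{K} w_k f_k(x)\Big)
  \;\le\; \sum_{k=1}^{K} w_k \, \ell\big(y, f_k(x)\big).

% For squared error the gap is explicit. With \bar f(x) = \sum_k w_k f_k(x):
\big(y - \bar f(x)\big)^2
  = \sum_{k=1}^{K} w_k \big(y - f_k(x)\big)^2
  - \sum_{k=1}^{K} w_k \big(f_k(x) - \bar f(x)\big)^2 .
```

So the averaged predictor does at least as well as the weighted average of its members' losses, and under squared error the improvement equals the weighted spread of the members around their average. Note what this does not say: the ensemble can still be worse than its single best member.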

Two articles (Clarke 2003; Minka 2002) point out that model averaging and combination are not the same and the difference is acute in the M-open setting.
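
A toy illustration of that distinction, as I understand it; this is a sketch under loud assumptions, not a reproduction of either paper. The data come from an equal mixture of N(-2, 1) and N(2, 1), while the two candidate models are the fixed densities N(-2, 1) and N(2, 1), so neither candidate is true but a combination of them is. All function names are hypothetical.

```python
import math
import random


def log_normal_pdf(x, mu):
    """Log density of N(mu, 1) at x."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2


def stable_sigmoid(z):
    """Logistic function that does not overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bma_weight(xs, mu_a=-2.0, mu_b=2.0):
    """Posterior probability of model A, assuming equal prior model probabilities."""
    log_bayes_factor = sum(
        log_normal_pdf(x, mu_a) - log_normal_pdf(x, mu_b) for x in xs
    )
    return stable_sigmoid(log_bayes_factor)


def combination_weight(xs, mu_a=-2.0, mu_b=2.0, grid_size=99):
    """Weight on model A that maximises the log score of the mixture (grid search)."""
    dens = [
        (math.exp(log_normal_pdf(x, mu_a)), math.exp(log_normal_pdf(x, mu_b)))
        for x in xs
    ]
    best_w, best_score = None, -math.inf
    for j in range(1, grid_size + 1):
        w = j / (grid_size + 1)
        score = sum(math.log(w * pa + (1 - w) * pb) for pa, pb in dens)
        if score > best_score:
            best_w, best_score = w, score
    return best_w


# Simulate from the equal mixture (an assumption of this demo).
rng = random.Random(0)
xs = [rng.gauss(-2.0 if rng.random() < 0.5 else 2.0, 1.0) for _ in range(2000)]

w_bma = bma_weight(xs)
w_comb = combination_weight(xs)
print(f"BMA weight on model A:         {w_bma:.4f}")
print(f"Combination weight on model A: {w_comb:.4f}")
```

In this setup the log Bayes factor is a random walk in the data, so the averaging weight typically ends up very close to 0 or 1, effectively selecting one wrong model, whereas the combination weight should settle near one half. Because the candidates have no free parameters, the in-sample log score here coincides with the leave-one-out log score.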

1 Mixtures of models

See mixture models.

2 Stacking

Alternate fun branding: “super learning”. Not actually model averaging, but looks pretty similar if you squint.

Breiman (1996); Clarke (2003); T. Le and Clarke (2017); Naimi and Balzer (2018); Ting and Witten (1999); Wolpert (1992); Yao et al. (2022); Y. Zhang et al. (2022)
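
A minimal sketch of the idea, assuming two deliberately crude base learners (a straight line and a 1-nearest-neighbour rule) and a single convex weight as the meta-learner. The function names and the toy data are mine, not from any of the cited papers, and real stacking implementations handle many base models and richer meta-learners.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_nearest(xs, ys):
    """1-nearest-neighbour regression: return the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def out_of_fold(fit, xs, ys, k=5):
    """Predict each point with a model trained on the folds that exclude it."""
    n = len(xs)
    preds = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, n, k):
            preds[i] = model(xs[i])
    return preds


def stack_weight(pa, pb, ys):
    """Weight w in [0, 1] minimising squared error of w * pa + (1 - w) * pb."""
    num = sum((y - b) * (a - b) for a, b, y in zip(pa, pb, ys))
    den = sum((a - b) ** 2 for a, b in zip(pa, pb))
    w = num / den if den > 0 else 0.5
    return min(1.0, max(0.0, w))


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Toy data (assumed): a noisy sine curve that neither base learner fits well.
rng = random.Random(1)
xs = [rng.uniform(0.0, 2.0) for _ in range(200)]
ys = [math.sin(3.0 * x) + rng.gauss(0.0, 0.3) for x in xs]

p_line = out_of_fold(fit_line, xs, ys)
p_nn = out_of_fold(fit_nearest, xs, ys)
w = stack_weight(p_line, p_nn, ys)
p_stack = [w * a + (1 - w) * b for a, b in zip(p_line, p_nn)]

# Final predictor: refit the base learners on all data, combine with w.
line, nn = fit_line(xs, ys), fit_nearest(xs, ys)
stacked = lambda x: w * line(x) + (1 - w) * nn(x)

print(f"weight on line: {w:.3f}")
print(f"out-of-fold MSE  line: {mse(p_line, ys):.4f}  "
      f"1-NN: {mse(p_nn, ys):.4f}  stacked: {mse(p_stack, ys):.4f}")
```

The important design choice is that the weight is fitted on out-of-fold predictions, not in-sample fits; otherwise the meta-learner would simply reward whichever base model overfits most. The weight is constrained to [0, 1], in the spirit of the non-negativity constraints in Breiman (1996).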

3 Bayesian stacking

As above, but Bayesian. Motivates suggestive invocation of M-open machinery. (Clarke 2003; Clyde and Iversen 2013; Hoeting et al. 1999; T. Le and Clarke 2017; T. M. Le and Clarke 2022; Minka 2002; Naimi and Balzer 2018; Polley 2010; Ting and Witten 1999; Wolpert 1992; Yao et al. 2022, 2018)
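
For orientation, here is how I understand the contrast between the two weighting schemes, using the log score as in Yao et al. (2018). The notation (models M_k, simplex \Delta^{K-1}, leave-one-out data y_{-i}) is assumed for illustration.

```latex
% Bayesian model averaging: weights are posterior model probabilities.
w_k^{\mathrm{BMA}} = p(M_k \mid y)
  = \frac{p(y \mid M_k)\, p(M_k)}{\sum_{j=1}^{K} p(y \mid M_j)\, p(M_j)} .

% Stacking of predictive distributions: weights maximise the
% leave-one-out log score of the combined predictive density.
\hat{w} = \operatorname*{arg\,max}_{w \in \Delta^{K-1}}
  \sum_{i=1}^{n} \log \sum_{k=1}^{K} w_k \, p(y_i \mid y_{-i}, M_k) .
```

The first targets the probability that each model is the true one, which presumes the truth is on the list; the second targets the predictive performance of the combination directly, which is why it remains meaningful in the M-open setting.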

4 Forecasting

Time series prediction? Try ensemble methods for time series.

5 References

Alquier. 2021. “User-Friendly Introduction to PAC-Bayes Bounds.” arXiv:2110.11216 [Cs, Math, Stat].
Bates, and Granger. 1969. “The Combination of Forecasts.” Journal of the Operational Research Society.
Bishop, and Svensen. 2012. “Bayesian Hierarchical Mixtures of Experts.” arXiv:1212.2447 [Cs, Stat].
Breiman. 1996. “Stacked Regressions.” Machine Learning.
Buckland, Burnham, and Augustin. 1997. “Model Selection: An Integral Part of Inference.” Biometrics.
Card, Zhang, and Smith. 2019. “Deep Weighted Averaging Classifiers.” In Proceedings of the Conference on Fairness, Accountability, and Transparency.
Claeskens, and Hjort. 2008. Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics.
Clarke. 2003. “Comparing Bayes Model Averaging and Stacking When Model Approximation Error Cannot Be Ignored.” The Journal of Machine Learning Research.
Clyde, and George. 2004. “Model Uncertainty.” Statistical Science.
Clyde, and Iversen. 2013. “Bayesian Model Averaging in the M-Open Framework.” In Bayesian Theory and Applications.
Fragoso, and Neto. 2015. “Bayesian Model Averaging: A Systematic Review and Conceptual Classification.” arXiv:1509.08864 [Stat].
Gammerman, and Vovk. 2007. “Hedging Predictions in Machine Learning.” The Computer Journal.
Ganaie, Hu, Malik, et al. 2022. “Ensemble Deep Learning: A Review.” Engineering Applications of Artificial Intelligence.
Hansen. 2007. “Least Squares Model Averaging.” Econometrica.
He, Lakshminarayanan, and Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems.
Hinne, Gronau, van den Bergh, et al. 2019. “A Conceptual Introduction to Bayesian Model Averaging.” Preprint.
Hinton, Vinyals, and Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv:1503.02531 [Cs, Stat].
Hjort, and Claeskens. 2003. “Frequentist Model Average Estimators.” Journal of the American Statistical Association.
Hoeting, Madigan, Raftery, et al. 1999. “Bayesian Model Averaging: A Tutorial.” Statistical Science.
Hu, and Zidek. 2002. “The Weighted Likelihood.” The Canadian Journal of Statistics / La Revue Canadienne de Statistique.
Lawless, and Fredette. 2005. “Frequentist Prediction Intervals and Predictive Distributions.” Biometrika.
Le, Tri, and Clarke. 2017. “A Bayes Interpretation of Stacking for M-Complete and M-Open Settings.” Bayesian Analysis.
Le, Tri M., and Clarke. 2022. “Model Averaging Is Asymptotically Better Than Model Selection For Prediction.” Journal of Machine Learning Research.
Leung, and Barron. 2006. “Information Theory and Mixing Least-Squares Regressions.” IEEE Transactions on Information Theory.
Minka. 2002. “Bayesian Model Averaging Is Not Model Combination.”
Naimi, and Balzer. 2018. “Stacked Generalization: An Introduction to Super Learning.” European Journal of Epidemiology.
Phillips. 1987. “Composite Forecasting: An Integrated Approach and Optimality Reconsidered.” Journal of Business & Economic Statistics.
Piironen, and Vehtari. 2017. “Comparison of Bayesian Predictive Methods for Model Selection.” Statistics and Computing.
Polley. 2010. “Super Learner In Prediction.” U.C. Berkeley Division of Biostatistics Working Paper Series.
Ramchandran, and Mukherjee. 2021. “On Ensembling Vs Merging: Least Squares and Random Forests Under Covariate Shift.” arXiv:2106.02589 [Math, Stat].
Shen, and Huang. 2006. “Optimal Model Assessment, Selection, and Combination.” Journal of the American Statistical Association.
Shwartz-Ziv, and Armon. 2021. “Tabular Data: Deep Learning Is Not All You Need.” arXiv:2106.03253 [Cs].
Ting, and Witten. 1999. “Issues in Stacked Generalization.” Journal of Artificial Intelligence Research.
van der Laan, Polley, and Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology.
Wang, Xiaofang, Kondratyuk, Christiansen, et al. 2021. “Wisdom of Committees: An Overlooked Approach To Faster and More Accurate Models.” arXiv:2012.01988 [Cs].
Wang, Haiying, Zhang, and Zou. 2009. “Frequentist Model Averaging Estimation: A Review.” Journal of Systems Science and Complexity.
Waterhouse, MacKay, and Robinson. 1995. “Bayesian Methods for Mixtures of Experts.” In Advances in Neural Information Processing Systems.
Wolpert. 1992. “Stacked Generalization.” Neural Networks.
Yao, Pirš, Vehtari, et al. 2022. “Bayesian Hierarchical Stacking: Some Models Are (Somewhere) Useful.” Bayesian Analysis.
Yao, Vehtari, Simpson, et al. 2018. “Using Stacking to Average Bayesian Predictive Distributions.” Bayesian Analysis.
Zhang, Tianfang, Bokrantz, and Olsson. 2021. “A Similarity-Based Bayesian Mixture-of-Experts Model.” arXiv:2012.02130 [Cs, Stat].
Zhang, Xinyu, and Liang. 2011. “Focused Information Criterion and Model Averaging for Generalized Additive Partial Linear Models.” The Annals of Statistics.
Zhang, Yuzhen, Ma, Liang, et al. 2022. “A Stacking Ensemble Algorithm for Improving the Biases of Forest Aboveground Biomass Estimations from Multiple Remotely Sensed Datasets.” GIScience & Remote Sensing.