Model mixing, model averaging for regression

Switching regression, mixture of experts

March 29, 2016 — August 27, 2024

Bayes · classification · clustering · compsci · convolution · density · information · linear algebra · nonparametric · probability · sparser than thou · statistics

Mixtures where the target is the predictor-conditional posterior density, obtained by likelihood weighting of each sub-model. Non-likelihood approaches are covered under model averaging or neural mixtures.

1 Bayesian Inference with Mixture Priors

When the prior in a Bayesian inference problem is a mixture density, the resulting posterior distribution is also a mixture density.

Assume the prior for \(\theta\) is a mixture of \(K\) densities \(p_k(\theta)\) with mixture weights \(\pi_k\), where \(\sum_{k=1}^K \pi_k = 1\):

\[ p(\theta) = \sum_{k=1}^K \pi_k p_k(\theta) \]

Using Bayes’ theorem, the posterior distribution of \(\theta\) given data \(x\) is:

\[ p(\theta | x) = \frac{p(x | \theta) p(\theta)}{p(x)} \]

Substituting the mixture prior into Bayes’ theorem gives:

\[ p(\theta | x) = \frac{p(x | \theta) \sum_{k=1}^K \pi_k p_k(\theta)}{p(x)} \]

Distributing the likelihood across the sum in the numerator gives:

\[ p(\theta | x) = \frac{\sum_{k=1}^K \pi_k p(x | \theta) p_k(\theta)}{p(x)} \]

The marginal likelihood \(p(x)\) is computed using the law of total probability:

\[ p(x) = \int p(x | \theta) p(\theta) d\theta = \sum_{k=1}^K \pi_k p(x | k) \]

where \(p(x | k)\) is defined as:

\[ p(x | k) = \int p(x | \theta) p_k(\theta) d\theta \]
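
As a concrete illustration, in a conjugate Gaussian setting (an assumption made for the example, not required by the derivation) with likelihood \(x \mid \theta \sim \mathcal{N}(\theta, \sigma^2)\), \(\sigma^2\) known, and \(k\)-th prior component \(\theta \sim \mathcal{N}(\mu_k, \tau_k^2)\), each component evidence is available in closed form,

\[ p(x | k) = \mathcal{N}(x \mid \mu_k, \sigma^2 + \tau_k^2) \]

so no numerical integration is needed.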

The final form of the posterior is:

\[ p(\theta | x) = \frac{\sum_{k=1}^K \pi_k p(x | \theta) p_k(\theta)}{\sum_{k=1}^K \pi_k p(x | k)} \]

This can be rewritten in mixture form:

\[ p(\theta | x) = \sum_{k=1}^K w_k(x) p_k(\theta | x) \]

where the posterior weights \(w_k(x)\) are:

\[ w_k(x) = \frac{\pi_k p(x | k)}{\sum_{j=1}^K \pi_j p(x | j)} \]

and \(p_k(\theta | x)\) is the component-specific posterior for \(\theta\), updated based on the \(k\)-th component of the prior:

\[ p_k(\theta | x) = \frac{p(x | \theta) p_k(\theta)}{p(x | k)} \]

The posterior distribution \(p(\theta | x)\) is a mixture of the component-specific posteriors \(p_k(\theta | x)\), with each component weighted by \(w_k(x)\). Each weight reflects how well the corresponding prior component explains the observed data \(x\), scaled by its original prior weight \(\pi_k\).

In Bayesian inference, using a mixture prior leads to a posterior that is also a mixture, effectively combining different models or beliefs about the parameters, each updated according to its relative contribution to explaining the new data.
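
Here is a minimal numerical sketch of the update, assuming the conjugate Gaussian setting above (Gaussian likelihood with known variance, Gaussian prior components); the two-component setup, numbers, and variable names are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

# Sketch: posterior for theta under a two-component Gaussian mixture prior
# and a Gaussian likelihood x | theta ~ N(theta, sigma^2), sigma known.
# All values below are illustrative.

sigma = 1.0                    # likelihood std dev (known)
pi = np.array([0.7, 0.3])      # prior mixture weights pi_k
mu = np.array([-2.0, 3.0])     # component prior means mu_k
tau = np.array([1.0, 0.5])     # component prior std devs tau_k

x = 2.2                        # a single observation

# Component evidence p(x | k) = N(x | mu_k, sigma^2 + tau_k^2) (conjugate case).
evidence = norm.pdf(x, loc=mu, scale=np.sqrt(sigma**2 + tau**2))

# Posterior weights w_k(x) proportional to pi_k * p(x | k).
w = pi * evidence
w /= w.sum()

# Component-specific posteriors p_k(theta | x) are again Gaussian.
post_var = 1.0 / (1.0 / tau**2 + 1.0 / sigma**2)
post_mean = post_var * (mu / tau**2 + x / sigma**2)

def posterior_pdf(theta):
    """Posterior density of theta: a w_k(x)-weighted mixture of Gaussians."""
    return np.sum(w * norm.pdf(theta, loc=post_mean, scale=np.sqrt(post_var)))

print("posterior weights w_k(x):", w)
print("posterior mean:", np.sum(w * post_mean))
print("posterior density at theta = 0:", posterior_pdf(0.0))
```

The weights \(w_k(x)\) shift mass toward whichever prior component explains the observation best, which is exactly the reweighting derived above.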

2 Under mis-specification

See M-open for a discussion of the M-open setting.
