# Model mixing, model averaging for regression

Switching regression, mixture of experts

March 29, 2016 — August 27, 2024

Notes on mixtures whose target is the posterior density conditional on the predictors, obtained by likelihood-weighting each sub-model. Non-likelihood approaches are covered in the notebooks on model averaging and neural mixtures.

## 1 Bayesian Inference with Mixture Priors

When dealing with Bayesian inference where the prior is a mixture density, the resulting posterior distribution will also generally be a mixture density.

Assume the prior for \(\theta\) is a mixture of \(K\) densities \(p_k(\theta)\) with mixture weights \(\pi_k\), where \(\sum_{k=1}^K \pi_k = 1\):

\[ p(\theta) = \sum_{k=1}^K \pi_k p_k(\theta) \]

Using Bayes’ theorem, the posterior distribution of \(\theta\) given data \(x\) is:

\[ p(\theta | x) = \frac{p(x | \theta) p(\theta)}{p(x)} \]

Substituting the mixture prior into Bayes’ theorem gives:

\[ p(\theta | x) = \frac{p(x | \theta) \sum_{k=1}^K \pi_k p_k(\theta)}{p(x)} \]

Since the likelihood does not depend on \(k\), it can be distributed over the sum in the numerator:

\[ p(\theta | x) = \frac{\sum_{k=1}^K \pi_k p(x | \theta) p_k(\theta)}{p(x)} \]

The marginal likelihood \(p(x)\) is computed using the law of total probability:

\[ p(x) = \int p(x | \theta) p(\theta) d\theta = \sum_{k=1}^K \pi_k p(x | k) \]

where \(p(x | k)\) is defined as:

\[ p(x | k) = \int p(x | \theta) p_k(\theta) d\theta \]
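As a concrete illustration (an assumed conjugate setting, not part of the general derivation): if the likelihood is Gaussian, \(x \mid \theta \sim \mathcal{N}(\theta, \sigma^2)\) with \(\sigma^2\) known, and each prior component is Gaussian, \(p_k(\theta) = \mathcal{N}(\theta; \mu_k, \tau_k^2)\), then the component evidence is available in closed form:

\[ p(x | k) = \int \mathcal{N}(x; \theta, \sigma^2)\, \mathcal{N}(\theta; \mu_k, \tau_k^2)\, d\theta = \mathcal{N}(x; \mu_k, \sigma^2 + \tau_k^2) \]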

The final form of the posterior is:

\[ p(\theta | x) = \frac{\sum_{k=1}^K \pi_k p(x | \theta) p_k(\theta)}{\sum_{k=1}^K \pi_k p(x | k)} \]

This rearranges into mixture form:

\[ p(\theta | x) = \sum_{k=1}^K w_k(x) p_k(\theta | x) \]

where the posterior weights \(w_k(x)\) are:

\[ w_k(x) = \frac{\pi_k p(x | k)}{\sum_{j=1}^K \pi_j p(x | j)} \]

and \(p_k(\theta | x)\) is the component-specific posterior for \(\theta\), updated based on the \(k\)-th component of the prior:

\[ p_k(\theta | x) = \frac{p(x | \theta) p_k(\theta)}{p(x | k)} \]

The posterior distribution \(p(\theta | x)\) is a mixture of the component-specific posteriors \(p_k(\theta | x)\), with each component weighted by \(w_k(x)\). These weights are updated based on the explanatory power of each component regarding the observed data \(x\), adjusted by the original prior weights \(\pi_k\).

In Bayesian inference, using a mixture prior leads to a posterior that is also a mixture, effectively combining different models or beliefs about the parameters, each updated according to its relative contribution to explaining the new data.
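The derivation above can be sketched numerically. This is a minimal example under assumed settings (a Gaussian likelihood with known variance and a two-component Gaussian mixture prior, so every quantity has a closed form); all the numbers below are illustrative, not from the source.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Assumed setup: likelihood x ~ N(theta, sigma2) with known variance,
# prior on theta is a two-component Gaussian mixture.
sigma2 = 1.0             # known likelihood variance
pis = [0.5, 0.5]         # prior mixture weights pi_k
mus = [-2.0, 3.0]        # component prior means mu_k
taus2 = [1.0, 0.5]       # component prior variances tau_k^2

x = 2.5                  # a single observation

# Component evidences p(x | k): for conjugate Gaussian components this is
# N(x; mu_k, sigma2 + tau_k^2).
evidences = [normal_pdf(x, m, sigma2 + t2) for m, t2 in zip(mus, taus2)]

# Posterior mixture weights w_k(x) proportional to pi_k * p(x | k).
unnorm = [p * e for p, e in zip(pis, evidences)]
ws = [u / sum(unnorm) for u in unnorm]

# Component-specific posteriors p_k(theta | x): standard conjugate
# Gaussian update per component.
post_vars = [1.0 / (1.0 / t2 + 1.0 / sigma2) for t2 in taus2]
post_means = [v * (m / t2 + x / sigma2)
              for v, m, t2 in zip(post_vars, mus, taus2)]

print("posterior weights:", ws)
print("component posterior means:", post_means)
```

The observation \(x = 2.5\) sits near the second component's prior mean, so \(w_2(x)\) dominates: the data reweight the components according to how well each explains \(x\), exactly as in the weight formula above.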

## 2 Under mis-specification

See the M-open notebook for discussion of what happens to these mixtures when no sub-model is correct.
