Model mixing, model averaging for regression
Switching regression, mixture of experts
March 29, 2016 — August 27, 2024
Mixtures where the target is the predictor-conditional posterior density, obtained by likelihood weighting of each sub-model. Non-likelihood approaches are discussed under model averaging and neural mixtures.
1 Bayesian Inference with Mixture Priors
In Bayesian inference with a prior that is a mixture density, the posterior is again a mixture density, with updated weights and updated components.
Assume the prior for \(\theta\) is a mixture of \(K\) densities \(p_k(\theta)\) with mixture weights \(\pi_k\), where \(\sum_{k=1}^K \pi_k = 1\):
\[ p(\theta) = \sum_{k=1}^K \pi_k p_k(\theta) \]
Using Bayes’ theorem, the posterior distribution of \(\theta\) given data \(x\) is:
\[ p(\theta | x) = \frac{p(x | \theta) p(\theta)}{p(x)} \]
Substituting the mixture prior into Bayes’ theorem gives:
\[ p(\theta | x) = \frac{p(x | \theta) \sum_{k=1}^K \pi_k p_k(\theta)}{p(x)} \]
Pulling the likelihood inside the sum in the numerator gives:
\[ p(\theta | x) = \frac{\sum_{k=1}^K \pi_k p(x | \theta) p_k(\theta)}{p(x)} \]
The marginal likelihood \(p(x)\) is computed using the law of total probability:
\[ p(x) = \int p(x | \theta) p(\theta) d\theta = \sum_{k=1}^K \pi_k p(x | k) \]
where \(p(x | k)\) is defined as:
\[ p(x | k) = \int p(x | \theta) p_k(\theta) d\theta \]
The final form of the posterior is:
\[ p(\theta | x) = \frac{\sum_{k=1}^K \pi_k p(x | \theta) p_k(\theta)}{\sum_{k=1}^K \pi_k p(x | k)} \]
This can be further simplified to a mixture form:
\[ p(\theta | x) = \sum_{k=1}^K w_k(x) p_k(\theta | x) \]
where the posterior weights \(w_k(x)\) are:
\[ w_k(x) = \frac{\pi_k p(x | k)}{\sum_{j=1}^K \pi_j p(x | j)} \]
and \(p_k(\theta | x)\) is the component-specific posterior for \(\theta\), updated based on the \(k\)-th component of the prior:
\[ p_k(\theta | x) = \frac{p(x | \theta) p_k(\theta)}{p(x | k)} \]
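As a concrete conjugate illustration (not part of the derivation above, but standard): suppose the likelihood is Gaussian with known variance, \(x \mid \theta \sim \mathcal{N}(\theta, \sigma^2)\), and each prior component is Gaussian, \(p_k(\theta) = \mathcal{N}(\theta; \mu_k, \tau_k^2)\). Then both ingredients are available in closed form:
\[ p(x | k) = \mathcal{N}(x; \mu_k, \sigma^2 + \tau_k^2), \qquad p_k(\theta | x) = \mathcal{N}\!\left(\theta;\; \frac{\tau_k^2 x + \sigma^2 \mu_k}{\sigma^2 + \tau_k^2},\; \frac{\sigma^2 \tau_k^2}{\sigma^2 + \tau_k^2}\right) \]
and the posterior weights \(w_k(x)\) follow from the formula above.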
The posterior distribution \(p(\theta | x)\) is thus a mixture of the component-specific posteriors \(p_k(\theta | x)\), with each component weighted by \(w_k(x)\). Each weight reflects how well the corresponding prior component explains the observed data \(x\), scaled by its original prior weight \(\pi_k\).
In Bayesian inference, using a mixture prior leads to a posterior that is also a mixture, effectively combining different models or beliefs about the parameters, each updated according to its relative contribution to explaining the new data.
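The same update is easy to sketch numerically. Below is a minimal Python sketch (not from the source; all names and values are illustrative) of the two-component conjugate Gaussian case above, using numpy and scipy to compute the posterior weights \(w_k(x)\), the component posteriors, and the resulting mixture density on a grid.

```python
import numpy as np
from scipy.stats import norm

# Illustrative example: Gaussian likelihood x | theta ~ N(theta, sigma^2)
# with a two-component Gaussian mixture prior on theta.
sigma = 1.0                      # known likelihood std dev
pi = np.array([0.7, 0.3])        # prior mixture weights pi_k
mu = np.array([0.0, 5.0])        # component prior means mu_k
tau = np.array([1.0, 2.0])       # component prior std devs tau_k

x = 3.2                          # a single observation

# Component evidence p(x | k) = N(x; mu_k, sigma^2 + tau_k^2) (conjugate case)
evidence = norm.pdf(x, loc=mu, scale=np.sqrt(sigma**2 + tau**2))

# Posterior mixture weights w_k(x) proportional to pi_k * p(x | k)
w = pi * evidence
w /= w.sum()

# Component-specific posteriors p_k(theta | x) are Gaussian,
# with precision-weighted means and reduced variances.
post_var = (sigma**2 * tau**2) / (sigma**2 + tau**2)
post_mean = (tau**2 * x + sigma**2 * mu) / (sigma**2 + tau**2)

# Full posterior density sum_k w_k(x) N(theta; post_mean_k, post_var_k) on a grid
theta_grid = np.linspace(-4.0, 10.0, 501)
posterior = sum(
    w_k * norm.pdf(theta_grid, loc=m_k, scale=np.sqrt(v_k))
    for w_k, m_k, v_k in zip(w, post_mean, post_var)
)

print("posterior weights w_k(x):", np.round(w, 3))
print("component posterior means:", np.round(post_mean, 3))
```

Note how the weights shift towards whichever component's predictive density \(p(x | k)\) better explains the observation, exactly as in the formula for \(w_k(x)\) above.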
2 Under mis-specification
See M-open for a discussion of model mixing when no sub-model is correctly specified, i.e. the M-open setting.