Expectation maximisation

August 17, 2014 — May 21, 2024

algebra
graphical models
hidden variables
hierarchical models
networks
optimization
probabilistic algorithms
probability
statistics

An optimisation method popular in statistics that alternates updates of the “tractable” variables with expectation (or sampling) steps for the fiddly ones. Famous in statistics for solving problems such as data imputation. Pairs well with Monte Carlo Gradients.

There are many variants of the EM family, so perhaps it is best to introduce the classic and then discuss some of the variations.

We have an experimental process that generates a random vector \((B, Y)\) according to parameter \(\theta\). We wish to estimate the parameter of interest \(\theta\) by maximum likelihood. However, we only observe i.i.d. samples \(b_i\) drawn from \(B\); the \(Y\) component is hidden. The likelihood function of the incomplete data, \(L(\theta, \{b_i\})\), is tedious or intractable to maximise. But the “complete” joint likelihood of both the observed and unobserved components, \(L(\theta, \{b_i\}, y)\), is easier to maximise.

Call \(\theta^{(k)}\) the estimate of \(\theta\) at step \(k\). Write \(\ell(\theta, b, y)\equiv\log L(\theta, b, y)\), where \(b\) collects the observations \(\{b_i\}\), because we work in log likelihoods, why not.

The following form of the algorithm works when the log-likelihood \(\ell(\theta, b, y)\) is linear in the unobserved \(y\).

At step \(k=0\) we start with an estimate of \(\theta^{(0)}\) chosen arbitrarily or by our favourite approximate method.

We attempt to improve our estimate of the parameter of interest by the following iterative algorithm:

  1. “Expectation”: Under the completed data model with joint distribution \(F(b,y,\theta^{(k)})\) we estimate \(y\) as

    \[ y^{(k)}=E_{\theta^{(k)}}[Y|b] \]

  2. “Maximisation”: Solve a (hopefully easier) maximisation problem:

    \[ \theta^{(k+1)}=\operatorname{arg max}_\theta \ell(\theta, b, y^{(k)}) \]

In the case that this log likelihood is not linear in \(y\), we are supposed to instead take

\[ \theta^{(k+1)}=\operatorname{arg max}_\theta E_{\theta^{(k)}}[\ell(\theta, b, Y)|b] \]
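To make the recipe concrete, here is a minimal sketch (my own, not from any of the references below) for the textbook case of a two-component Gaussian mixture, where the unobserved \(Y\) is the discrete component label: the E-step computes expected labels (“responsibilities”) under \(\theta^{(k)}\), and the M-step has a closed form.

```python
# Minimal EM for a 1-d mixture of two Gaussians; names and details are illustrative.
import numpy as np

def em_gmm2(b, n_iter=100, tol=1e-8):
    """Estimate (means, variances, mixing weight) of a two-component Gaussian mixture."""
    rng = np.random.default_rng(0)
    mu = rng.choice(b, size=2, replace=False)     # arbitrary-ish theta^(0)
    var = np.array([b.var(), b.var()])
    w = 0.5
    ll_old = -np.inf
    for _ in range(n_iter):
        # E-step: expected component labels ("responsibilities") given b and theta^(k)
        dens = np.stack([np.exp(-0.5 * (b - mu[k]) ** 2 / var[k])
                         / np.sqrt(2 * np.pi * var[k]) for k in range(2)])
        joint = dens * np.array([w, 1 - w])[:, None]
        resp = joint / joint.sum(axis=0)
        # M-step: maximise the expected complete-data log likelihood, in closed form here
        nk = resp.sum(axis=1)
        mu = (resp @ b) / nk
        var = np.array([(resp[k] * (b - mu[k]) ** 2).sum() / nk[k] for k in range(2)])
        w = nk[0] / len(b)
        # the observed-data log likelihood never decreases; use it to stop
        ll = np.log(joint.sum(axis=0)).sum()
        if ll - ll_old < tol:
            break
        ll_old = ll
    return mu, var, w
```

Running it on, say, `np.concatenate([rng.normal(0, 1, 500), rng.normal(3, 1, 500)])` recovers means near 0 and 3; monitoring the observed-data log likelihood is a handy sanity check, since EM should never decrease it.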

Notably we haven’t assumed that \(Y\) is continuous, and thus this algorithm works where we might otherwise have a difficult discrete optimisation problem. Do see discrete gradients for some alternative approaches to that, however.

Even if we do the right thing, EM may not converge especially well, or to the global maximum, but it can be easy and robust to get started with, and at least it doesn’t make things worse: each iteration can only increase (or leave unchanged) the observed-data likelihood.

Literature note: apparently the proofs in Dempster, Laird, and Rubin (1977) are dicey; see Wu (1983) for an improved (i.e. correct) version, or Wainwright and Jordan (2008) for an interpretation in terms of graphical models wherein the algorithm is a form of message passing.

James Coughlan’s Transparent Interpretation of the EM Algorithm connects us to the Kullback-Leibler variational inference literature. We write data \(z\), latent variable \(y\), parameter of interest \(\theta\). Then…

[…] maximizing Neal and Hinton’s joint function of \(\theta\) and a distribution on \(y\) is equivalent to maximum likelihood estimation.

The key point is to note that maximizing \(\log P(z|\theta)\) over \(\theta\) is equivalent to maximizing

\[ \log P (z|\theta)-D(\tilde{P}(y)\|P(y|z,\theta)) \]

jointly over \(\theta\) and \(\tilde{P}(y)\). […]

[…We rewrite this cost function]

\[ H(\tilde{P}) + \sum_y\tilde{P}(y)\log \{P(y|z,\theta)P(z|\theta)\}, \]

where \(H(\tilde{P})=-\sum_y\tilde{P}(y)\log\tilde{P}(y)\) is the entropy of \(\tilde{P}\). This expression is in turn equivalent to

\[ H(\tilde{P}) +\sum_y\tilde{P}(y)\log P(y,z|\theta), \]

which is the same as the function \(F(\tilde{P},\theta)\) given in Neal and Hinton. This function is maximized iteratively, where each iteration consists of two separate maximizations, one over \(\theta\) and the other over \(\tilde{P}\).
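A quick numerical check of that statement, with toy numbers of my own: for fixed \(\theta\), the free energy \(F(\tilde{P},\theta)\) is maximised by setting \(\tilde{P}(y)=P(y|z,\theta)\), at which point it equals \(\log P(z|\theta)\); any other \(\tilde{P}\) scores strictly lower.

```python
# Toy check of F(P~, theta) = H(P~) + sum_y P~(y) log P(y, z | theta); numbers are arbitrary.
import numpy as np

def free_energy(q, log_joint):
    """Neal-Hinton free energy for a discrete latent y (q must be strictly positive)."""
    q = np.asarray(q)
    return -(q * np.log(q)).sum() + (q * log_joint).sum()

p_yz = np.array([0.10, 0.25, 0.05])        # P(y, z | theta) over three latent states, fixed z
log_joint = np.log(p_yz)
posterior = p_yz / p_yz.sum()              # P(y | z, theta)

print(free_energy(posterior, log_joint))   # equals log P(z | theta) = log 0.40 ≈ -0.916
print(np.log(p_yz.sum()))
print(free_energy(np.full(3, 1 / 3), log_joint))  # any other q gives less (≈ -1.13)
```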

See also Dan Piponi, Expectation-Maximization with Less Arbitrariness:

My goal is to fill in the details of one key step in the derivation of the EM algorithm in a way that makes it inevitable rather than arbitrary.

Chen et al. (2018) looks fun:

Expectation-Maximization (EM) is a popular tool for learning latent variable models, but the vanilla batch EM does not scale to large data sets because the whole data set is needed at every E-step. Stochastic Expectation Maximization (sEM) reduces the cost of E-step by stochastic approximation. However, sEM has a slower asymptotic convergence rate than batch EM, and requires a decreasing sequence of step sizes, which is difficult to tune. In this paper, we propose a variance reduced stochastic EM (sEM-vr) algorithm inspired by variance reduced stochastic gradient descent algorithms. We show that sEM-vr has the same exponential asymptotic convergence rate as batch EM. Moreover, sEM-vr only requires a constant step size to achieve this rate, which alleviates the burden of parameter tuning. We compare sEM-vr with batch EM, sEM and other algorithms on Gaussian mixture models and probabilistic latent semantic analysis, and sEM-vr converges significantly faster than these baselines.
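For flavour, here is a sketch of plain stochastic EM for the same two-component mixture, in which a minibatch E-step feeds a stochastic-approximation update of the running sufficient statistics with a decreasing step size. This is my own illustration of the baseline that the paper improves upon, not the variance-reduced sEM-vr algorithm itself.

```python
# Illustrative stochastic EM (sEM) for a 1-d two-component Gaussian mixture.
import numpy as np

def sem_gmm2(b, n_iter=2000, batch=32, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.array([b.min(), b.max()], dtype=float)
    var = np.array([b.var(), b.var()])
    w = 0.5
    # running sufficient statistics per component: [mass, sum x, sum x^2] on a per-datum scale
    s = np.stack([np.array([w, 1 - w]),
                  np.array([w, 1 - w]) * mu,
                  np.array([w, 1 - w]) * (var + mu ** 2)], axis=1)
    for t in range(1, n_iter + 1):
        rho = t ** -0.6                    # decreasing step size, as plain sEM requires
        x = rng.choice(b, size=batch)
        # E-step on the minibatch only: responsibilities under the current parameters
        dens = np.stack([np.exp(-0.5 * (x - mu[k]) ** 2 / var[k])
                         / np.sqrt(2 * np.pi * var[k]) for k in range(2)])
        joint = dens * np.array([w, 1 - w])[:, None]
        resp = joint / joint.sum(axis=0)
        s_hat = np.stack([resp.mean(axis=1),
                          (resp * x).mean(axis=1),
                          (resp * x ** 2).mean(axis=1)], axis=1)
        # stochastic-approximation update of the statistics, then the usual closed-form M-step
        s = (1 - rho) * s + rho * s_hat
        w = s[0, 0] / s[:, 0].sum()
        mu = s[:, 1] / s[:, 0]
        var = s[:, 2] / s[:, 0] - mu ** 2
    return mu, var, w
```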

1 References

Bilmes. 1998. “A Gentle Tutorial of the EM Algorithm and Its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models.” International Computer Science Institute.
Celeux, Chauveau, and Diebolt. 1995. “On Stochastic Versions of the EM Algorithm.” Report.
Celeux, Chretien, Forbes, et al. 2001. “A Component-Wise EM Algorithm for Mixtures.” Journal of Computational and Graphical Statistics.
Celeux, Forbes, and Peyrard. 2003. “EM Procedures Using Mean Field-Like Approximations for Markov Model-Based Image Segmentation.” Pattern Recognition.
Chen, Zhu, Teh, et al. 2018. “Stochastic Expectation Maximization with Variance Reduction.” In Advances in Neural Information Processing Systems.
Delyon, Lavielle, and Moulines. 1999. “Convergence of a Stochastic Approximation Version of the EM Algorithm.” The Annals of Statistics.
Dempster, Laird, and Rubin. 1977. “Maximum Likelihood from Incomplete Data Via the EM Algorithm.” Journal of the Royal Statistical Society: Series B (Methodological).
Kuhn, and Lavielle. 2004. “Coupling a Stochastic Approximation Version of EM with an MCMC Procedure.” ESAIM: Probability and Statistics.
Lee, and Scott. 2012. “EM Algorithms for Multivariate Gaussian Mixture Models with Truncated and Censored Data.” Computational Statistics & Data Analysis.
McLachlan, Geoffrey J., and Krishnan. 2008. The EM Algorithm and Extensions.
McLachlan, Geoffrey J., Krishnan, and Ng. 2004. “The EM Algorithm.” 2004, 24.
Miyahara, Tsumura, and Sughiyama. 2016. “Relaxation of the EM Algorithm via Quantum Annealing for Gaussian Mixture Models.” arXiv:1701.03268 [cond-mat, physics:quant-ph, stat].
Navidi. 1997. “A Graphical Illustration of the EM Algorithm.” The American Statistician.
Neal, and Hinton. 1998. “A View of the EM Algorithm That Justifies Incremental, Sparse, and Other Variants.” In Learning in Graphical Models. NATO ASI Series 89.
Prescher. 2004. “A Tutorial on the Expectation-Maximization Algorithm Including Maximum-Likelihood Estimation and EM Training of Probabilistic Context-Free Grammars.” arXiv:cs/0412015.
Roche. 2011. “EM Algorithm and Variants: An Informal Tutorial.” arXiv:1105.1476 [stat].
Wainwright, and Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends® in Machine Learning.
Wei, and Tanner. 1990. “A Monte Carlo Implementation of the EM Algorithm and the Poor Man’s Data Augmentation Algorithms.” Journal of the American Statistical Association.
Wu. 1983. “On the Convergence Properties of the EM Algorithm.” The Annals of Statistics.