# Expectation maximisation

A particular optimisation method in statistics that gets you a maximum likelihood estimate despite various annoyances such as missing data.

Vague description of the algorithm:

We have an experimental process that generates a random vector $$(B, Y)$$ according to a parameter $$\theta$$. We wish to estimate the parameter of interest $$\theta$$ by maximum likelihood. However, we only observe i.i.d. samples $$b_i$$ drawn from the $$B$$ component. The likelihood function of the incomplete data, $$L(\theta, \{b_i\})$$, is tedious or intractable to maximise. But the “complete” joint likelihood of both the observed and unobserved components, $$L(\theta, \{b_i\}, y)$$, is easier to maximise. Then we are potentially in a situation where expectation maximisation can help.

Call $$\theta^{(k)}$$ the estimate of $$\theta$$ at step $$k$$. Write $$\ell(\theta, \{b_i\}, y)\equiv\log L(\theta, \{b_i\}, y)$$, because we work in log likelihoods, why not.

The following form of the algorithm works when the log-likelihood $$\ell(\theta, b, y)$$ is linear in the missing data $$y$$. (Which is equivalent to the model being in an exponential family, I believe, but I should check.)
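A quick numeric sketch of that linearity (my own toy example, with made-up numbers): for exponential lifetimes right-censored at a threshold $$c$$, the complete-data log likelihood is linear in the unobserved lifetimes, so the E-step reduces to imputing $$E[X_i \mid X_i > c] = c + 1/\lambda^{(k)}$$ for each censored observation, by memorylessness.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate exponential lifetimes (true rate 0.5), right-censored at c = 3.0.
true_rate, c = 0.5, 3.0
x = rng.exponential(1 / true_rate, 1000)
obs = np.minimum(x, c)
censored = x > c

lam = 1.0  # arbitrary initial rate estimate, lambda^(0)
for _ in range(100):
    # E-step: the complete log likelihood is linear in sum(x_i), so it
    # suffices to impute E[X_i | X_i > c] = c + 1/lam (memorylessness).
    xhat = np.where(censored, c + 1 / lam, obs)
    # M-step: closed-form MLE of an exponential rate on the completed data.
    lam = len(xhat) / xhat.sum()

print(lam)  # converges to the censored-data MLE, close to the true rate
```

The fixed point of this iteration is exactly the standard censored-exponential MLE, which is the kind of shortcut the linearity buys you.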

At time $$k=0$$ we start with an initial estimate $$\theta^{(0)}$$, chosen arbitrarily or by our favourite approximate method.

We attempt to improve our estimate of the parameter of interest by the following iterative algorithm:

1. “Expectation”: Under the completed data model with joint distribution $$F(b,y,\theta^{(k)})$$ we estimate $$y$$ as

$y^{(k)}=E_{\theta^{(k)}}[Y|b]$

2. “Maximisation”: Solve a (hopefully easier) maximisation problem:

$\theta^{(k+1)}=\operatorname{arg max}_\theta \ell(\theta, b, y^{(k)})$

In the case that this log likelihood is not linear in $$y$$, you are supposed to instead take

$\theta^{(k+1)}=\operatorname{arg max}_\theta E_{\theta^{(k)}}[\ell(\theta, b, Y)|b]$

In practice this nicety is often ignored.
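To make the two steps concrete, here is a minimal sketch of my own (a toy, not taken from any of the references) for the classic case of a two-component Gaussian mixture with unit variances: the latent $$y$$ is the vector of component labels, the E-step computes posterior membership probabilities, and the M-step maximises in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated observations b_i from a two-component Gaussian mixture.
b = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

# theta = (w, mu1, mu2): mixture weight and the two means; variances fixed at 1.
w, mu1, mu2 = 0.5, -1.0, 1.0

def log_lik(w, mu1, mu2):
    """Observed-data log likelihood l(theta, b)."""
    p1 = w * np.exp(-0.5 * (b - mu1) ** 2)
    p2 = (1 - w) * np.exp(-0.5 * (b - mu2) ** 2)
    return np.sum(np.log((p1 + p2) / np.sqrt(2 * np.pi)))

prev = -np.inf
for k in range(200):
    # E-step: responsibility r_i = E[Y_i | b_i] under theta^(k), where
    # Y_i indicates that observation i came from component 1.
    p1 = w * np.exp(-0.5 * (b - mu1) ** 2)
    p2 = (1 - w) * np.exp(-0.5 * (b - mu2) ** 2)
    r = p1 / (p1 + p2)
    # M-step: closed-form maximiser of the expected complete log likelihood.
    w = r.mean()
    mu1 = (r * b).sum() / r.sum()
    mu2 = ((1 - r) * b).sum() / (1 - r).sum()
    cur = log_lik(w, mu1, mu2)
    assert cur >= prev - 1e-9  # EM never decreases the observed-data likelihood
    if cur - prev < 1e-10:
        break
    prev = cur
```

The assertion inside the loop is the monotonicity guarantee: each iteration provably does not decrease $$\ell(\theta, b)$$, which is the sense in which EM at least doesn’t make things worse.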

Even if you do the right thing, EM may not converge especially well, or to the global maximum, but it can be easy and robust to get started with, and at least it doesn’t make things worse.

Literature note — apparently the proofs in Dempster, Laird, and Rubin (1977) are dicey; see Wu (1983) for an improved (i.e. correct) version, or Wainwright and Jordan (2008) for an interpretation in terms of graphical models wherein the algorithm is a form of message passing.

- A Transparent Interpretation of the EM Algorithm by James Coughlan makes an interesting brief point. We write data $$z$$, latent variable $$y$$, parameter of interest $$\theta$$. Then…

[…] maximizing Neal and Hinton’s joint function of $$\theta$$ and a distribution on $$y$$ is equivalent to maximum likelihood estimation.

The key point is to note that maximizing $$\log P(z|\theta)$$ over $$\theta$$ is equivalent to maximizing

$\log P (z|\theta)-D(\tilde{P}(y)\|P(y|z,\theta))$

jointly over $$\theta$$ and $$\tilde{P}(y)$$. […]

[…We rewrite this cost function]

$H(\tilde{P}) + \sum_y\tilde{P}(y)\log \{P(y|z,\theta)P(z|\theta)\},$

where $$H(\tilde{P})=-\sum_y\tilde{P}(y)\log\tilde{P}(y)$$ is the entropy of $$\tilde{P}$$. This expression is in turn equivalent to

$H(\tilde{P}) +\sum_y\tilde{P}(y)\log P(y,z|\theta),$

which is the same as the function $$F(\tilde{P},\theta)$$ given in Neal and Hinton. This function is maximized iteratively, where each iteration consists of two separate maximizations, one over $$\theta$$ and another over $$\tilde{P}$$.

My goal is to fill in the details of one key step in the derivation of the EM algorithm in a way that makes it inevitable rather than arbitrary.
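That identity is easy to check numerically. Here is a toy check of my own, for an arbitrary made-up discrete joint $$P(y,z|\theta)$$: since $$F(\tilde{P},\theta) = \log P(z|\theta) - D(\tilde{P}(y)\|P(y|z,\theta))$$, the free energy is maximised over $$\tilde{P}$$ exactly at the posterior, where it equals $$\log P(z|\theta)$$.

```python
import numpy as np

rng = np.random.default_rng(2)

# A made-up discrete joint P(y, z | theta): 4 latent states y, 3 observables z.
joint = rng.random((4, 3))
joint /= joint.sum()

z = 1                          # the observed value
p_z = joint[:, z].sum()        # marginal P(z | theta)
post = joint[:, z] / p_z       # posterior P(y | z, theta)

def F(q):
    """Neal-Hinton free energy: H(q) + sum_y q(y) log P(y, z | theta)."""
    return -(q * np.log(q)).sum() + (q * np.log(joint[:, z])).sum()

# F equals log P(z | theta) at the posterior, and is smaller everywhere else.
assert np.isclose(F(post), np.log(p_z))
for _ in range(5):
    q = rng.dirichlet(np.ones(4))
    assert F(q) <= np.log(p_z) + 1e-12
```

So the E-step (maximising over $$\tilde{P}$$) just sets $$\tilde{P}$$ to the posterior, and the M-step climbs the same objective in $$\theta$$ — which is why the joint maximisation is equivalent to maximum likelihood.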

## References

Bilmes, Jeff A. 1998. International Computer Science Institute 4 (510): 126.
Celeux, Gilles, Didier Chauveau, and Jean Diebolt. 1995. Report.
Celeux, Gilles, Stephane Chretien, Florence Forbes, and Abdallah Mkhadri. 2001. Journal of Computational and Graphical Statistics 10 (4): 697–712.
Celeux, Gilles, Florence Forbes, and Nathalie Peyrard. 2003. Pattern Recognition 36 (1): 131–44.
Delyon, Bernard, Marc Lavielle, and Eric Moulines. 1999. The Annals of Statistics 27 (1): 94–128.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society: Series B (Methodological) 39 (1): 1–22.
Kuhn, Estelle, and Marc Lavielle. 2004. ESAIM: Probability and Statistics 8 (September): 115–31.
Lee, Gyemin, and Clayton Scott. 2012. Computational Statistics & Data Analysis 56 (9): 2816–29.
McLachlan, Geoffrey J., Thriyambakam Krishnan, and See Ket Ng. 2004. 2004,24. Humboldt-Universität Berlin, Center for Applied Statistics and Economics (CASE).
McLachlan, Geoffrey J., and T. Krishnan. 2008. The EM Algorithm and Extensions. Hoboken, NJ: Wiley-Interscience.
Miyahara, Hideyuki, Koji Tsumura, and Yuki Sughiyama. 2016. In arXiv:1701.03268 [Cond-Mat, Physics:quant-Ph, Stat], 4674–79.
Navidi, William. 1997. The American Statistician 51 (1): 29–31.
Neal, Radford M., and Geoffrey E. Hinton. 1998. “A View of the EM Algorithm That Justifies Incremental, Sparse, and Other Variants.” In Learning in Graphical Models, edited by Michael I. Jordan, 355–68. NATO ASI Series 89. Springer Netherlands.
Prescher, Detlef. 2004. arXiv:cs/0412015, December.
Roche, Alexis. 2011. arXiv:1105.1476 [Stat], May.
Wainwright, Martin J., and Michael I. Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Vol. 1. Foundations and Trends® in Machine Learning. Now Publishers.
Wei, Greg C. G., and Martin A. Tanner. 1990. Journal of the American Statistical Association 85 (411): 699–704.
Wu, C. F. Jeff. 1983. “On the Convergence Properties of the EM Algorithm.” The Annals of Statistics 11 (1): 95–103.
