An optimisation method popular in statistics that alternates optimisation updates over some “tractable” variables with expectation (or sampling) steps over some fiddly ones. Famous in statistics for solving problems such as data imputation. Pairs well with Monte Carlo Gradients.
There are many variants of the EM family, so perhaps it is best to introduce the classic and then discuss some of the variations.
We have an experimental process that generates a random vector $B = (Y, Z)$, of which we observe only the component $Y$; the component $Z$ is latent (“missing”). Call $\theta$ the parameter of interest, which indexes the distribution of $B$.

The following form of the algorithm works when the log-likelihood of the complete data, $\ell(\theta; y, z) := \log p(y, z \mid \theta)$, is linear in $z$ (as in, e.g., exponential families).

At step $k$ we have a current estimate $\hat{\theta}^{(k)}$. We attempt to improve our estimate of the parameter of interest by the following iterative algorithm:

“Expectation”: Under the completed data model with joint distribution $p(y, z \mid \theta)$, we estimate the latent variables as

$$\hat{z}^{(k)} = E_{\hat{\theta}^{(k)}}[Z \mid Y = y].$$

“Maximisation”: Solve a (hopefully easier) maximisation problem:

$$\hat{\theta}^{(k+1)} = \operatorname{arg\,max}_{\theta} \, \ell(\theta; y, \hat{z}^{(k)}).$$

In the case that this log-likelihood is not linear in $z$, we take the expectation of the whole complete-data log-likelihood instead,

$$Q(\theta \mid \hat{\theta}^{(k)}) = E_{\hat{\theta}^{(k)}}\left[\ell(\theta; y, Z) \mid Y = y\right],$$

and the M-step maximises $Q$ over $\theta$.

Notably we haven’t assumed that $Z$ is discrete, or that the model has any particular structure beyond the tractability of these conditional expectations.
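For concreteness, here is a minimal sketch of the classic batch algorithm for a two-component 1-D Gaussian mixture (all names and numbers below are illustrative, not from any particular source). The E-step computes posterior responsibilities $E[Z \mid Y = y]$ under the current parameters; the M-step is a closed-form weighted maximum-likelihood update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two well-separated Gaussian components.
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

# Initial guesses for mixture weights, means, and variances.
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for step in range(200):
    # E-step: posterior responsibility of each component for each point,
    # i.e. E[Z | Y = y] under the current parameter estimate.
    dens = pi * np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    r = dens / dens.sum(axis=1, keepdims=True)

    # M-step: closed-form maximisation of the expected complete-data
    # log-likelihood (responsibility-weighted Gaussian MLEs).
    n_k = r.sum(axis=0)
    pi = n_k / len(y)
    mu = (r * y[:, None]).sum(axis=0) / n_k
    var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / n_k

print(pi.round(2), mu.round(2), var.round(2))
```

Note that each update only uses sufficient statistics of the completed data, which is exactly the linearity property the derivation above relies on.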
Even if we do the right thing, EM may not converge especially well, or to the global maximum, but it can be easy and robust to get started with, and at least it doesn’t make things worse.
Literature note — apparently the proofs in Dempster, Laird, and Rubin (1977) are dicey; see Wu (1983) for an improved (i.e. correct) version, or Wainwright and Jordan (2008) for an interpretation in terms of graphical models, wherein the algorithm is a form of message passing.
James Coughlan’s Transparent Interpretation of the EM Algorithm connects us to the Kullback-Leibler variational inference literature. We write data $y$, latent variables $z$, and a variational distribution $q(z)$ over the latents.

[…] maximising Neal and Hinton’s joint function of $\theta$ and a distribution on $z$ is equivalent to maximum likelihood estimation. The key point is to note that maximising $\log P(y \mid \theta)$ over $\theta$ is equivalent to maximising

$$\log P(y \mid \theta) - D\left(q(z) \,\middle\|\, P(z \mid y, \theta)\right)$$

jointly over $\theta$ and $q(z)$. […] […We rewrite this cost function]

$$\sum_z q(z) \log P(y, z \mid \theta) + H(q),$$

where $H(q) = -\sum_z q(z) \log q(z)$ is the entropy of $q$. This expression is in turn equivalent to

$$F(q, \theta) = E_{q}\left[\log P(y, z \mid \theta)\right] - E_{q}\left[\log q(z)\right],$$

which is the same as the function $F$ given in Neal and Hinton. This function is maximized iteratively, where each iteration consists of two separate maximisations, one over $q$ and another over $\theta$.
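The decomposition is easy to check numerically. A minimal sketch with a single discrete latent variable (the joint probabilities here are made up purely for illustration): the Neal–Hinton free energy $F(q, \theta)$ equals $\log P(y \mid \theta)$ minus the KL divergence from $q$ to the posterior, so the bound is tight exactly when $q$ is the posterior:

```python
import numpy as np

# Toy discrete model: latent z in {0, 1}, joint p(y, z | theta) for a
# single fixed observation y. These numbers are arbitrary illustrations.
p_joint = np.array([0.12, 0.28])      # p(y, z=0 | theta), p(y, z=1 | theta)
log_evidence = np.log(p_joint.sum())  # log p(y | theta)
posterior = p_joint / p_joint.sum()   # p(z | y, theta)

def free_energy(q):
    # F(q, theta) = E_q[log p(y, z | theta)] + H(q), with 0 log 0 := 0.
    safe_q = np.where(q > 0, q, 1.0)
    return (q * (np.log(p_joint) - np.log(safe_q))).sum()

# The bound is tight exactly when q is the posterior ...
assert np.isclose(free_energy(posterior), log_evidence)
# ... and strictly below log p(y | theta) for any other q.
assert free_energy(np.array([0.5, 0.5])) < log_evidence
print("F(posterior) =", free_energy(posterior), "log p(y) =", log_evidence)
```

The E-step is thus the maximisation over $q$ (setting $q$ to the posterior), and the M-step is the maximisation over $\theta$ with $q$ held fixed.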
See also Dan Piponi, Expectation-Maximisation with Less Arbitrariness:
My goal is to fill in the details of one key step in the derivation of the EM algorithm in a way that makes it inevitable rather than arbitrary.
Chen et al. (2018) looks fun:
Expectation-Maximisation (EM) is a popular tool for learning latent variable models, but the vanilla batch EM does not scale to large data sets because the whole data set is needed at every E-step. Stochastic Expectation Maximisation (sEM) reduces the cost of E-step by stochastic approximation. However, sEM has a slower asymptotic convergence rate than batch EM, and requires a decreasing sequence of step sizes, which is difficult to tune. In this paper, we propose a variance reduced stochastic EM (sEM-vr) algorithm inspired by variance reduced stochastic gradient descent algorithms. We show that sEM-vr has the same exponential asymptotic convergence rate as batch EM. Moreover, sEM-vr only requires a constant step size to achieve this rate, which alleviates the burden of parameter tuning. We compare sEM-vr with batch EM, sEM and other algorithms on Gaussian mixture models and probabilistic latent semantic analysis, and sEM-vr converges significantly faster than these baselines.
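The baseline stochastic E-step that sEM-vr improves upon can be sketched as a Robbins–Monro running average of the sufficient statistics computed on minibatches. The following is a generic illustration of that idea for the Gaussian-mixture case (not the variance-reduced algorithm of the paper, and the step-size schedule is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2.0, 1.0, 5000), rng.normal(3.0, 1.0, 5000)])

pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
# Running averages of per-point sufficient statistics:
# responsibilities, responsibility-weighted y, and weighted y^2.
s = np.zeros((3, 2))

for t in range(2000):
    batch = y[rng.integers(0, len(y), 100)]  # minibatch E-step
    dens = (pi * np.exp(-0.5 * (batch[:, None] - mu) ** 2 / var)
            / np.sqrt(2 * np.pi * var))
    r = dens / dens.sum(axis=1, keepdims=True)
    s_hat = np.stack([r.sum(0),
                      (r * batch[:, None]).sum(0),
                      (r * batch[:, None] ** 2).sum(0)]) / len(batch)

    rho = 1.0 / (t + 1) ** 0.6        # decreasing step size (Robbins-Monro)
    s = (1 - rho) * s + rho * s_hat   # stochastic approximation of E-step

    # M-step from the averaged statistics.
    pi = s[0] / s[0].sum()
    mu = s[1] / s[0]
    var = s[2] / s[0] - mu ** 2

print(pi.round(2), mu.round(2), var.round(2))
```

The decreasing `rho` schedule is exactly the tuning burden the abstract complains about; sEM-vr's contribution is that a control-variate correction lets a constant step size recover batch EM's convergence rate.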