ELBO
Evidence lower bound, variational free energy
2020-10-02 — 2025-10-30
Wherein the evidence lower bound is presented as a decomposition into an approximating KL term and an expected log‑likelihood plus entropy, and importance‑weighted sampling is indicated as the next step.
\[ \renewcommand{\Ex}{\mathbb{E}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\kl}{\operatorname{KL}} \renewcommand{\H}{\mathbb{H}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\vrv}[1]{\boldsymbol{\mathsf{#1}}} \renewcommand{\pd}{\partial} \]
On using the most convenient probability metric (i.e. KL divergence) to do variational inference.
There is nothing novel in this post. But everyone doing variational inference has to work through this once, and I’m doing that here.
TODO: connection to Bregman divergences.
Yuge Shi’s introduction is the best short intro that gets to the state of the art. A deeper intro is Matthews (2017), who did a thesis on it. Section 21.2 of Murphy (2012) is also pretty good.
We often want a variational approximation to the marginal (log-)likelihood \(\log p_{\theta}(\vv{x})\) (a.k.a. “evidence”) of some probabilistic model with observations \(\vv{x},\) unobserved latent factors \(\vv{z}\), model parameters \(\theta\) and variational parameters \(\phi\).
Here’s one derivation. The steps are elementary, although realising we should take them isn’t, IMO.
By convention, we assume everything has a density with respect to some unspecified dominating measure over \(\vv{x}\) and \(\vv{z}\), which is usually OK.
\[\begin{aligned} \log p_{\theta}(\vv{x}) &=\log (p_{\theta}(\vv{x})) \Ex_{ \vrv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[ 1 \right] \\ &=\Ex_{ \vrv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[ \log (p_{\theta}(\vv{x})) \right]\\ &=\Ex_{ \vrv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[ \log \left( \frac{ p_{\theta}(\vv{x},\vv{z}) }{ p_{\theta}(\vv{z}|\vv{x}) } \right) \right]\\ &=\Ex_{ \vrv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[ \log\left( \frac{ q_{\phi}(\vv{z}|\vv{x})p_{\theta}(\vv{x},\vv{z}) }{ p_{\theta}(\vv{z}|\vv{x})q_{\phi}(\vv{z}|\vv{x}) }\right) \right]\\ &=\Ex_{ \vrv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[\log\left(\frac{ q_{\phi}(\vv{z}|\vv{x}) }{ p_{\theta}(\vv{z}|\vv{x}) }\right) \right] \\ &\qquad+\Ex_{ \vrv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[ \log\left(\frac{ p_{\theta}(\vv{x},\vv{z}) }{ q_{\phi}(\vv{z}|\vv{x}) }\right) \right] \nonumber\\ &=\kl_{\vv{z}}( q_{\phi}(\vv{z}|\vv{x}) \| p_{\theta}(\vv{z}|\vv{x}) ) +\mathcal{L}(\vv{x},\theta,\phi) \end{aligned}\]
\[ \mathcal{L}(\vv{x},\theta,\phi):=\Ex_{ \vrv{z}\sim q_{\phi}(\vv{z}|\vv{x})}\left[\log\left(\frac{ p_{\theta}(\vv{x},\vv{z}) }{ q_{\phi}(\vv{z}|\vv{x})}\right) \right] \] is called the free energy or the Evidence Lower Bound.
We can understand this as a decomposition of the total marginal evidence into two parts that are somewhat interpretable. \(\kl_{\vv{z}}( q_{\phi}(\vv{z}|\vv{x}) \| p_{\theta}(\vv{z}|\vv{x}))\) represents the cost of approximating the exact \(\theta\) posterior over the latents with the \(\phi\) distribution. We can’t actually evaluate this in general, but for fancy enough \(q_{\phi}\) it could be small.
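Here’s a quick numerical sanity check of the decomposition, on a toy conjugate Gaussian model I just made up for the purpose (none of the numbers or symbols below come from the references): \(z\sim\mathcal{N}(0,1)\), \(x\mid z\sim\mathcal{N}(z,\sigma_x^2)\), variational posterior \(q_{\phi}(z\mid x)=\mathcal{N}(m,s^2)\). Because the model is conjugate, the evidence, the exact posterior, and the KL gap are all in closed form, so we can confirm that the Monte Carlo ELBO plus the KL term recovers the log evidence.

```python
# Sanity check of  log p(x) = KL(q || p(z|x)) + ELBO  on a toy conjugate
# Gaussian model (an illustrative sketch, not from the post):
#   z ~ N(0, 1),  x | z ~ N(z, sigma_x^2),  q(z | x) = N(m, s^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x, sigma_x = 1.3, 0.5            # one observation, known noise scale
m, s = 0.7, 0.6                  # deliberately imperfect variational parameters

# Exact quantities, available because the model is conjugate
log_evidence = norm.logpdf(x, 0.0, np.sqrt(1 + sigma_x**2))
post_mean = x / (1 + sigma_x**2)
post_var = sigma_x**2 / (1 + sigma_x**2)

# Monte Carlo ELBO:  E_q[ log p(x, z) - log q(z | x) ]
z = rng.normal(m, s, size=100_000)
log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, sigma_x)
elbo = np.mean(log_joint - norm.logpdf(z, m, s))

# Closed-form KL( N(m, s^2) || N(post_mean, post_var) )
kl_q_post = (np.log(np.sqrt(post_var) / s)
             + (s**2 + (m - post_mean)**2) / (2 * post_var) - 0.5)

print(log_evidence, elbo + kl_q_post)  # agree up to Monte Carlo error
```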
The other term, \(\mathcal{L}(\vv{x},\theta,\phi)\), represents an objective we can actually maximize, which motivates a whole bunch of technology. We can stop there and think of it as just that, or we can decompose it further. The traditional next step, if we want to decompose further, is to observe that
\[\begin{aligned} \mathcal{L}(\vv{x},\theta,\phi) &=\Ex_{ \vrv{z}\sim q_{\phi}(\vrv{z}|\vv{x})}\left[\log\left(\frac{ p_{\theta}(\vv{x},\vrv{z}) }{ q_{\phi}(\vrv{z}|\vv{x}) } \right) \right]\\ &=\Ex_{ \vrv{z}\sim q_{\phi}(\vrv{z}|\vv{x})}\left[\log\left(\frac{ p_{\theta}(\vv{x}|\vrv{z})p_{\theta}(\vrv{z}) }{ q_{\phi}(\vrv{z}|\vv{x}) } \right) \right]\\ &=\Ex_{ \vrv{z}\sim q_{\phi}(\vrv{z}|\vv{x})}\left[\log(p_{\theta}(\vv{x}|\vrv{z})) \right] -\Ex_{ \vrv{z}\sim q_{\phi}(\vrv{z}|\vv{x})}\left[\log\left(\frac{ q_{\phi}(\vrv{z}|\vv{x}) }{ p_{\theta}(\vrv{z}) } \right) \right]\\ &=\underbrace{\Ex_{ \vrv{z}\sim q_{\phi}(\vrv{z}|\vv{x})}\left[\log(p_{\theta}(\vv{x}|\vrv{z})) \right]}_{\text{Expected log likelihood}} - \underbrace{\kl_{\vrv{z}}( q_{\phi}(\vrv{z}|\vv{x}) \| p_{\theta}(\vrv{z}))}_{\text{KL of approx posterior update}} \end{aligned}\]
This suggests an intuitive reading of ELBO maximisation: we maximise the expected likelihood of the data given the latents (the reconstruction term) while penalising the approximate posterior over the latents for diverging too far from its prior. I find a marginal prior on the latents to be a weird concept here, and this formulation makes my head hurt.
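Treating \(\mathcal{L}\) in this reconstruction-minus-KL form as the objective we actually maximise is easy on the same toy Gaussian model as above (still my own illustrative example): both terms are in closed form, and because the variational family contains the exact posterior, maximising over \((m, s)\) makes the bound tight.

```python
# Maximising the ELBO in its "expected log likelihood minus KL(q || prior)" form
# for the toy Gaussian model above; all names and numbers are illustrative.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x, sigma_x = 1.3, 0.5

def neg_elbo(params):
    m, log_s = params
    s2 = np.exp(2 * log_s)
    # E_q[log p(x | z)] for q = N(m, s^2) and Gaussian likelihood
    exp_loglik = norm.logpdf(x, m, sigma_x) - s2 / (2 * sigma_x**2)
    # Closed-form KL( N(m, s^2) || N(0, 1) )
    kl_to_prior = 0.5 * (m**2 + s2 - 1 - 2 * log_s)
    return -(exp_loglik - kl_to_prior)

res = minimize(neg_elbo, x0=[0.0, 0.0])
m_opt, s2_opt = res.x[0], np.exp(2 * res.x[1])
print(m_opt, s2_opt)  # ≈ posterior mean x/(1+sigma_x^2) and variance sigma_x^2/(1+sigma_x^2)
print(-res.fun, norm.logpdf(x, 0.0, np.sqrt(1 + sigma_x**2)))  # ELBO ≈ log evidence: tight bound
```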
Yuge Shi, summarising Hoffman and Johnson (2016) and Mathieu et al. (2019), observes that we can break this down further by regarding the per-observation latents as a coding problem. Suppose we index our \(N\) independent observations by \(n\), and define the aggregate (marginal) posterior \(q(\vv{z}):=\frac{1}{N}\sum_{n=1}^{N}q(\vv{z}\mid\vv{x}_n)\). Then we can write this bad boy, averaged over the observations, as:
\[\begin{aligned} \frac{1}{N}\sum_{n=1}^{N}\mathcal{L}(\vv{x}_n,\theta,\phi) &= \underbrace{\frac{1}{N} \sum^N_{n=1} \Ex_{q(\vv{z}\mid \vv{x}_n)} \left[\log p(\vv{x}_n \mid \vv{z})\right]}_{\color{#4059AD}{\text{(1) Average reconstruction}}} - \underbrace{\left(\log N - \Ex_{q(\vv{z})}\left[\H[q(n\mid \vv{z})]\right]\right)}_{\color{#EE6C4D}{\text{(2) Index-code mutual info}}} \\ &\quad - \underbrace{\kl(q(\vv{z})\,\|\, p(\vv{z}))}_{\color{#86CD82}{\text{(3) KL between q and p}}} \end{aligned}\] Here \(q(n\mid\vv{z})\propto q(\vv{z}\mid\vv{x}_n)\) is the posterior over which observation a given code \(\vv{z}\) encodes, term (2) is the mutual information between the observation index and the code under \(q\), term (3) compares the aggregate posterior, rather than the per-observation one, to the prior, and \(\H\left[p(\vv{z}) \right] \triangleq-\Ex_{p(\vv{z})} \left[\log p(\vv{z})\right]\) is the entropy.
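As a numerical sketch of that surgery (again my own toy setup, not anyone’s reference implementation): the same Gaussian model with \(N\) observations, per-observation posteriors set to the exact conjugate ones, Monte Carlo estimates of terms (2) and (3), and a check that the three pieces reassemble into the average per-observation ELBO.

```python
# Checking the Hoffman & Johnson "ELBO surgery" split on the toy Gaussian model:
# per-datum posteriors q(z|x_n) = N(m_n, s^2), aggregate q(z) = (1/N) sum_n q(z|x_n).
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(2)
sigma_x, N = 0.5, 50
xs = rng.normal(0.0, np.sqrt(1 + sigma_x**2), size=N)   # simulated dataset
ms = xs / (1 + sigma_x**2)                               # per-datum posterior means
s2 = sigma_x**2 / (1 + sigma_x**2)                       # shared posterior variance
s = np.sqrt(s2)

# Average per-datum ELBO, closed form
exp_loglik = norm.logpdf(xs, ms, sigma_x) - s2 / (2 * sigma_x**2)
kl_each = 0.5 * (ms**2 + s2 - 1 - np.log(s2))            # KL(q_n || prior)
avg_elbo = np.mean(exp_loglik - kl_each)

# Monte Carlo estimates of the three surgery terms
M = 50_000
n_idx = rng.integers(N, size=M)                          # n ~ Uniform{1..N}
z = rng.normal(ms[n_idx], s)                             # z ~ q(z | x_n)
log_qzn = norm.logpdf(z[:, None], ms[None, :], s)        # log q(z | x_m) for every m
log_qmix = logsumexp(log_qzn, axis=1) - np.log(N)        # log aggregate posterior q(z)

avg_recon = np.mean(exp_loglik)                          # (1) average reconstruction
log_n_post = log_qzn - logsumexp(log_qzn, axis=1, keepdims=True)  # log q(n | z)
mutual_info = np.log(N) + np.mean(np.sum(np.exp(log_n_post) * log_n_post, axis=1))  # (2)
kl_aggregate = np.mean(log_qmix - norm.logpdf(z, 0.0, 1.0))       # (3) KL(q(z) || p(z))

print(avg_elbo, avg_recon - mutual_info - kl_aggregate)  # agree up to Monte Carlo error
```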
Another way to rewrite the ELBO is \[\mathcal{L}(\vv{x},\theta,\phi)=\Ex_{q_{\phi}(\vv{z} \mid \vv{x})}\left[\log p_{\theta}(\vv{z}, \vv{x})\right]+\H\left[q_{\phi}(\vv{z} \mid \vv{x})\right].\] The log joint \(\log p_{\theta}(\vv{z}, \vv{x})\) is what physics people call the “negative energy”. This version highlights that a good posterior approximation \(q_{\phi}(\vv{z} \mid \vv{x})\) must assign most of its probability mass to regions of low energy (i.e. high joint probability). At the same time, the entropy term in the ELBO prevents \(q_{\phi}(\vv{z} \mid \vv{x})\) from collapsing to an atom, unlike in, say, a MAP estimate.
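On the same toy model, this is also the easiest form to compute directly: Monte Carlo for the expected log joint, closed form for the Gaussian entropy (again just an illustrative sketch with my own made-up numbers).

```python
# The "negative energy plus entropy" form of the ELBO for the toy Gaussian model:
# E_q[log p(x, z)] by Monte Carlo, plus the closed-form entropy of q(z|x) = N(m, s^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x, sigma_x = 1.3, 0.5
m, s = 0.7, 0.6

z = rng.normal(m, s, size=100_000)
neg_energy = np.mean(norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, sigma_x))  # E_q[log p(x, z)]
entropy_q = 0.5 * np.log(2 * np.pi * np.e * s**2)                            # H[N(m, s^2)]
print(neg_energy + entropy_q)  # same ELBO value as the other estimators, up to MC error
```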
Next up, Importance weighted sampling in variational inference. Also recommended by Yuge Shi; see Adam Kosiorek’s What is wrong with VAEs?, which arrives at variational autoencoders via importance sampling.
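As a teaser for that next step, here is a sketch of the importance-weighted bound those references build on, once more on my own toy Gaussian model: averaging \(K\) importance weights inside the log gives a bound that equals the plain ELBO at \(K=1\) and tightens towards \(\log p_{\theta}(\vv{x})\) as \(K\) grows.

```python
# Importance-weighted lower bound on the toy Gaussian model:
#   L_K = E[ log (1/K) sum_k p(x, z_k) / q(z_k | x) ],  z_k ~ q(z | x) i.i.d.
# K = 1 is the plain ELBO; larger K tightens the bound towards log p(x).
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(4)
x, sigma_x = 1.3, 0.5
m, s = 0.7, 0.6

def iw_bound(K, batches=20_000):
    z = rng.normal(m, s, size=(batches, K))
    log_w = (norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, sigma_x)
             - norm.logpdf(z, m, s))
    return np.mean(logsumexp(log_w, axis=1) - np.log(K))

for K in (1, 5, 50):
    print(K, iw_bound(K))
print("log p(x):", norm.logpdf(x, 0.0, np.sqrt(1 + sigma_x**2)))
```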
Bethe free energy
Everything so far has been about the Helmholtz free energy. In graphical models we’re concerned with a related, more general(?) free energy called the Bethe free energy (J. S. Yedidia, Freeman, and Weiss 2005; Jonathan S. Yedidia, Freeman, and Weiss 2000; M. J. Wainwright and Jordan 2008). TODO: expand this section.
