Energy-based models
Inference with kinda-tractable un-normalized potentials
2021-06-07 — 2025-09-28
Wherein the EBM is framed as a learned scalar potential whose gradient, not the partition function, is required for inference, and conditioning is shown to follow by simple addition of energies.
An EBM specifies a density on \(x\in\mathcal X\) using a scalar energy \(E_\theta(x)\): \[ p_\theta(x)\;=\;\frac{1}{Z(\theta)}\,\exp\{-E_\theta(x)\},\qquad Z(\theta)=\int e^{-E_\theta(x)}\,dx. \] We parametrize the unnormalized log-density \(-E_\theta(x)\) directly and learn \(\theta\) from data. Inference and sampling require \(\nabla_x E_\theta(x)\), not \(Z(\theta)\).
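To make "we only need \(\nabla_x E\)" concrete, here is a minimal PyTorch sketch (my own illustration, not taken from any of the cited papers): the energy is an arbitrary scalar-output network, and the score comes straight out of autograd; \(Z(\theta)\) never appears.

```python
import torch
import torch.nn as nn

class Energy(nn.Module):
    """Scalar energy E_theta(x); any architecture works, we only ever need grad_x E."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)          # one energy per sample, shape (batch,)

def score(energy: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """grad_x log p_theta(x) = -grad_x E_theta(x), via autograd; Z(theta) is never needed."""
    x = x.detach().requires_grad_(True)
    (grad_x,) = torch.autograd.grad(energy(x).sum(), x)
    return -grad_x

energy = Energy(dim=2)
x = torch.randn(8, 2)
print(score(energy, x).shape)                    # torch.Size([8, 2])
```

Everything downstream (Langevin sampling, conditioning, training) only ever calls something like `score`.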
OK, that much is easy to write out. But my questions about them have always been things like “What is the USP? Is this just MCMC? How do you train them? How do they relate to graphical models? When do they work well in practice?”
Notes on that theme follow.
1 EBMs I’d seen before under a different name
Undirected graphical models are a special case of EBMs with structure.
For an undirected model on variables \(x=(x_i)_{i\in V}\) with cliques \(\mathcal C\), an MRF/CRF is written \[ p_\theta(x)\propto \exp\Big\{-E_\theta(x)\Big\},\qquad E_\theta(x)=\sum_{c\in\mathcal C}\psi_{c,\theta}(x_c). \] This is precisely the EBM parameterization. Factorization over cliques (Hammersley–Clifford) gives the usual conditional independences; the “energy” is just the negative sum of the log-potentials. Early imaging MRFs formalized this Boltzmann view and popularised Gibbs sampling (Geman and Geman 1984, 1987). The LeCun–Chopra–Hadsell tutorial recasts many classical models in this EBM form (LeCun et al. 2006).
CRFs are EBMs for \(p_\theta(y\mid x)\) with energy \(E_\theta(x,y)\) and input-dependent partition function \(Z_\theta(x)\). Training by NLL has the familiar feature-expectation mismatch gradient (data expectation minus model/marginal expectation) and uses exact marginals (e.g., forward–backward) in linear-chain CRFs (Lafferty, McCallum, and Pereira 2001). Swap “features” for “deep features” and you have modern neural EBMs.
Let \(\ell(\theta)=\tfrac1n\sum_i\log p_\theta(x^{(i)})\) be the average log-likelihood of i.i.d. data \(x^{(1)},\dots,x^{(n)}\). Then \[ \nabla_\theta \ell(\theta)= -\underbrace{\mathbb E_{\text{data}}\big[\nabla_\theta E_\theta(X)\big]}_{\text{data term}} +\underbrace{\mathbb E_{p_\theta}\big[\nabla_\theta E_\theta(X)\big]}_{\text{model term}}. \] The model term is the intractable part; this is the same “moment-matching” obstruction we see in undirected exponential families.
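The model term is just the gradient of the log-partition function, which is worth deriving once: \[ \nabla_\theta \log Z(\theta) = \frac{1}{Z(\theta)}\int \nabla_\theta\, e^{-E_\theta(x)}\,dx = \int \frac{e^{-E_\theta(x)}}{Z(\theta)}\big(-\nabla_\theta E_\theta(x)\big)\,dx = -\,\mathbb E_{p_\theta}\big[\nabla_\theta E_\theta(X)\big], \] so each datum contributes \(\nabla_\theta\log p_\theta(x^{(i)})=-\nabla_\theta E_\theta(x^{(i)})+\mathbb E_{p_\theta}[\nabla_\theta E_\theta(X)]\), and averaging over the data gives the display above.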
Classical workarounds:
- Pseudolikelihood maximises \(\prod_i p_\theta(x_i\mid x_{-i})\) and side-steps the global \(Z(\theta)\). It is consistent under standard conditions and has been a workhorse for large MRFs (Besag 1977).
- Contrastive Divergence (\(CD_k\)) estimates the gradient by replacing \(\mathbb E_{p_\theta}\) with a short MCMC chain from the data—cheap, biased, effective (G. E. Hinton 2002).
- Score-matching fits the score \(\nabla_x\log p_\theta(x)=-\nabla_x E_\theta(x)\) via the Fisher-divergence objective; denoising variants avoid second derivatives and are widespread (Hyvarinen 2007).
- Noise-Contrastive Estimation (NCE) reframes learning an unnormalised model as logistic regression of data vs. noise—consistent and simple to implement (Gutmann and Hyvärinen 2010).
These three recipes (pseudolikelihood, CD, and score matching) turn out to be tightly connected; Hyvärinen showed formal links for continuous models (Hyvarinen 2007).
2 Inference and conditioning look exactly like you expect
Because energies add, conditioning is straightforward: \[ \underbrace{-\log p_\theta(x\mid y)}_{\text{posterior energy}} =E_{\theta}(x)+\underbrace{\big(-\log p(y\mid x)\big)}_{\text{likelihood energy}} + \text{const}. \] We can do MAP by minimizing the sum, or posterior sampling with Langevin/HMC using only \(\nabla_x E\) and \(\nabla_x \log p(y\mid x)\). This is a major practical win for EBMs over time-conditioned score models.
3 Worked example
3.1 Ising/pairwise binary MRF: pseudolikelihood and CD in two lines
Let \(x_i\in\{-1,+1\}\) on a graph \(G=(V,E)\) with energy \[ E_\theta(x)=-\sum_{(i,j)\in E} J_{ij} x_i x_j - \sum_i h_i x_i. \] Conditionals are logistic: \[ p_\theta(x_i=1\mid x_{-i})=\sigma\left(2h_i + 2\sum_{j\in\partial i} J_{ij}x_j\right), \quad \sigma(u)=(1+e^{-u})^{-1}. \]
- Pseudolikelihood: fit \(\{J_{ij},h_i\}\) by logistic regressions that predict each \(x_i\) from its neighbours—no global \(Z\) (Besag 1977).
- CD(\(k\)): initialize chains at the observed \(x\), perform \(k\) Gibbs/Langevin-like flips, and update \(\theta\) using \(-\nabla_\theta E(x_{\text{data}})+\nabla_\theta E(x_{\text{neg}})\) (G. E. Hinton 2002). The same template scales to deep energies \(E_\theta\); a minimal sketch of both recipes follows below.
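A NumPy sketch of both recipes on a dense pairwise model (my own illustration; the helper names `local_field`, `gibbs_sweep`, etc. are made up, and constant factors are left to the learning rate):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def local_field(X, J, h):
    """a_s = h_s + sum_j J_sj x_j at every site; J symmetric with zero diagonal."""
    return h + X @ J

def pseudolikelihood_grad(X, J, h):
    """Gradient of the average pseudolikelihood sum_s log p(x_s | x_-s) in (J, h)."""
    r = X - np.tanh(local_field(X, J, h))     # residual x_s - E[x_s | x_-s]
    gJ = (r.T @ X + X.T @ r) / len(X)         # both conditionals i and j touch J_ij
    np.fill_diagonal(gJ, 0.0)
    return gJ, r.mean(axis=0)

def gibbs_sweep(X, J, h, rng):
    """One Gibbs sweep over all sites, using the logistic conditional above."""
    X = X.copy()
    for s in range(X.shape[1]):
        p1 = sigmoid(2.0 * local_field(X, J, h)[:, s])
        X[:, s] = np.where(rng.random(len(X)) < p1, 1.0, -1.0)
    return X

def cd_k_grad(X, J, h, k, rng):
    """CD-k gradient estimate: data sufficient statistics minus negative-sample statistics."""
    Xneg = X
    for _ in range(k):
        Xneg = gibbs_sweep(Xneg, J, h, rng)
    gJ = (X.T @ X - Xneg.T @ Xneg) / len(X)
    np.fill_diagonal(gJ, 0.0)
    return gJ, X.mean(axis=0) - Xneg.mean(axis=0)
```

Here `X` is an \((n,d)\) array of \(\pm1\) spins; either gradient is ascended with plain SGD. Pseudolikelihood needs no sampling at all, while CD-\(k\) seeds its short chains at the data batch.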
3.2 Continuous EBM for a Gaussian
Let \(E_\theta(x)=\tfrac12 (x-\mu)^2/\sigma^2 + \text{const}\). Then \(p_\theta\) equals \(N(\mu,\sigma^2)\). The MLE gradient matches moments: \[ \nabla_\mu \ell=\sigma^{-2}\Big(\mathbb E_{\text{data}}[X]-\mathbb E_{p_\theta}[X]\Big),\qquad \nabla_{\sigma^2}\ell=\tfrac12\sigma^{-4}\Big(\mathbb E_{\text{data}}[(X-\mu)^2]-\mathbb E_{p_\theta}[(X-\mu)^2]\Big). \] For general \(E_\theta\), the same “data minus model expectation” structure holds. Only the model expectation becomes intractable, so we replace it with MCMC (CD/MALA/SGLD) or SM/NCE.
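A toy check of that moment-matching gradient, exploiting the fact that the model expectation is exact for a Gaussian (for a general \(E_\theta\) it would be estimated by MCMC); the numbers and step size here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=5000)   # "true" model: N(2, 1.5^2)

mu, var = 0.0, 1.0                                 # model parameters (mu, sigma^2)
lr = 0.5
for _ in range(500):
    # "data minus model" expectations; the model terms are exact here
    grad_mu = (data.mean() - mu) / var
    grad_var = 0.5 / var ** 2 * (((data - mu) ** 2).mean() - var)
    mu += lr * grad_mu
    var += lr * grad_var

print(mu, var)   # -> roughly 2.0 and 1.5**2: the data moments, as the gradient demands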
4 Fancy EBM wins
Structured prediction.
- CRFs (linear-chain and grid CRFs) dominated sequence labelling (POS, NER) and were standard in early vision (denoising/segmentation). They are conditional EBMs with tractable marginals in special structures (Lafferty, McCallum, and Pereira 2001).
Pre-deep-diffusion generative modeling.
- RBM/DBN era: CD-trained RBMs and products-of-experts were the unsupervised pretraining workhorses and proved the practicality of learning unnormalized models (G. E. Hinton 2002).
Modern EBM revivals & hybrids.
- Implicit EBMs for images and audio have shown competitive generation and strong compositional priors (adding energies to express constraints) (Du and Mordatch 2019).
- Diffusion-recovery likelihood (DRL) learns EBMs on noised data levels and samples top-down—stabilizing classical CD training (Gao et al. 2021).
- Energy Matching (Balcerak et al. 2025) trains a single time-independent scalar potential that behaves like an OT-flow far from the data and like a Boltzmann EBM near it, closing a longstanding quality gap to diffusion/flows while retaining likelihood/energy for conditioning. It reports large FID gains over prior EBMs on CIFAR-10/ImageNet-32 and demonstrates inverse-problem conditioning and diversity via interaction energies (see the two-regime description and applications).
Compared to diffusions/flows: EBMs are simpler to compose (energies add) and operate at \(t=0\) (no time-conditioning). However, pure EBM training is harder to stabilize; diffusion/flows still dominate on turnkey sample quality and wall-clock sampling robustness. Hybrid approaches (DRL; Energy Matching) narrow the gap while keeping EBM perks (Gao et al. 2021).
5 In practice
5.1 Training objectives
- CD (with persistent replay): \(\Delta\theta \propto -\nabla_\theta E_\theta(x_{\text{data}})+\nabla_\theta E_\theta(x_{\text{neg}})\) with SGLD/MALA steps inside the negative phase (G. E. Hinton 2002; Song and Kingma 2021); a skeletal training loop follows after this list.
- Score-matching / denoising-SM for continuous data when we can backprop Hessians or avoid them via denoising (Hyvarinen 2007).
- NCE when we can pick a good \(q(x)\) (e.g., mixture of Gaussians or a coarse generator) (Gutmann and Hyvärinen 2010).
- DRL / Hybrid if pure CD is unstable (Gao et al. 2021); Energy Matching if we want a single potential that supports both long-range transport and local Boltzmann sampling (see the JKO-style objective in the PDF).
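A skeletal version of the CD-with-persistent-replay recipe above, with an SGLD-style negative phase. This is a generic sketch rather than any particular paper's setup; the buffer size, reinjection rate, step size, and (deliberately small) noise scale are all illustrative choices.

```python
import torch
import torch.nn as nn

def langevin(energy, x, n_steps=20, step=0.01, noise=0.005):
    """SGLD-style negative phase: x <- x - step * grad_x E + noise * N(0, I)."""
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        (grad_x,) = torch.autograd.grad(energy(x).sum(), x)
        x = x - step * grad_x + noise * torch.randn_like(x)
    return x.detach()

dim = 2
energy = nn.Sequential(nn.Linear(dim, 64), nn.SiLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(energy.parameters(), lr=1e-3)
buffer = torch.randn(512, dim)                       # persistent replay buffer
data = torch.randn(10_000, dim) * 0.5 + 1.0          # stand-in dataset

for _ in range(1_000):
    x_data = data[torch.randint(len(data), (64,))]
    # negatives: mostly replayed chains, occasionally fresh noise
    idx = torch.randint(len(buffer), (64,))
    x_neg = torch.where(torch.rand(64, 1) < 0.05, torch.randn(64, dim), buffer[idx])
    x_neg = langevin(energy, x_neg)
    buffer[idx] = x_neg                              # persist the chains

    # contrastive loss: push data energy down, negative-sample energy up
    loss = energy(x_data).mean() - energy(x_neg).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Descending this `loss` is exactly the \(-\nabla_\theta E(x_{\text{data}})+\nabla_\theta E(x_{\text{neg}})\) update from the CD bullet.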
5.2 Samplers
- Langevin/MALA/SGLD are the workhorses; they only need \(\nabla_x E_\theta\). SGLD mixes into training naturally; MALA/HMC help when energies are smooth (Xifara et al. 2014; Song and Kingma 2021).
5.3 Conditioning/posterior sampling/inversions
- Posterior energy: \(U(x)=E_\theta(x)+\tfrac{1}{2\sigma^2}\|y-Ax\|^2\). We run Langevin/HMC on \(U\); a sketch follows below. Interaction energies (repulsion, constraints) seem trivial to add as extra terms, at least in Balcerak et al. (2025).
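A sketch of that posterior sampler, assuming a pretrained `energy` module and a known linear operator `A` and noise level `sigma` (all hypothetical names here):

```python
import torch

def posterior_langevin(energy, A, y, sigma, x0, n_steps=500, step=1e-3):
    """Langevin on U(x) = E_theta(x) + ||y - A x||^2 / (2 sigma^2).

    `energy` returns one energy per sample; A is (m, d), y is (m,) or (batch, m),
    x0 is the (batch, d) initialisation.
    """
    x = x0.clone()
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        U = energy(x).sum() + ((y - x @ A.T) ** 2).sum() / (2 * sigma ** 2)
        (grad_x,) = torch.autograd.grad(U, x)
        x = x - step * grad_x + (2 * step) ** 0.5 * torch.randn_like(x)
    return x.detach()
```

Extra interaction or constraint energies are just more terms added into `U` before the autograd call.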
6 Cheat sheet
MLE gradient for EBMs (data–model split) \[ \nabla_\theta \ell(\theta)= -\mathbb E_{\text{data}}[\nabla_\theta E_\theta(X)]+\mathbb E_{p_\theta}[\nabla_\theta E_\theta(X)]. \]
Overdamped Langevin (Euler–Maruyama) \[ x_{t+1}=x_t-\eta\nabla_x E_\theta(x_t)+\sqrt{2\eta}\xi_t,\quad \xi_t\sim\mathcal N(0,I). \]
Score matching (Hyvärinen) \[ \mathcal L_{\text{SM}}(\theta)=\mathbb E_{p_{\text{data}}}\left[\tfrac12\big\|\nabla_x\log p_\theta(x)\big\|^2+\operatorname{tr}\big(\nabla_x^2\log p_\theta(x)\big)\right], \quad \nabla_x\log p_\theta(x)=-\nabla_x E_\theta(x). \] (Hyvarinen 2007)
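In practice the denoising variant flagged in §5.1 is what usually gets implemented; a single-noise-level sketch (the noise scale `sigma_n`, network, and optimizer settings are my illustrative choices):

```python
import torch
import torch.nn as nn

dim = 2
energy = nn.Sequential(nn.Linear(dim, 64), nn.SiLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(energy.parameters(), lr=1e-3)
sigma_n = 0.1                                       # single, fixed noise level

def model_score(x):
    """s_theta(x) = -grad_x E_theta(x), kept differentiable w.r.t. theta."""
    x = x.detach().requires_grad_(True)
    (grad_x,) = torch.autograd.grad(energy(x).sum(), x, create_graph=True)
    return -grad_x

data = torch.randn(10_000, dim) * 0.5 + 1.0         # stand-in dataset

for _ in range(1_000):
    x = data[torch.randint(len(data), (128,))]
    eps = torch.randn_like(x)
    x_noisy = x + sigma_n * eps
    # match the model score to the score of the Gaussian noising kernel, -eps / sigma_n
    loss = ((model_score(x_noisy) + eps / sigma_n) ** 2).sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```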
NCE logit \[ \text{logit}(x)=\log p_\theta(x)-\log q(x)\quad\text{with}\quad p_\theta(x)\propto e^{-E_\theta(x)}, \] where in practice \(-\log Z\) is treated as an extra free parameter \(c\), so \(\log p_\theta(x)=-E_\theta(x)+c\). (Gutmann and Hyvärinen 2010)
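And a minimal NCE sketch with a standard-normal noise \(q\) and the log-normalizer learned as the free parameter \(c\) above (the dataset and hyperparameters are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 2
energy = nn.Sequential(nn.Linear(dim, 64), nn.SiLU(), nn.Linear(64, 1))
c = nn.Parameter(torch.zeros(()))                   # learned stand-in for -log Z(theta)
opt = torch.optim.Adam(list(energy.parameters()) + [c], lr=1e-3)
q = torch.distributions.Normal(0.0, 1.0)            # noise distribution

def logits(x):
    """log p_theta(x) - log q(x), with log p_theta = -E_theta + c."""
    return (-energy(x).squeeze(-1) + c) - q.log_prob(x).sum(-1)

data = torch.randn(10_000, dim) * 0.5 + 1.0         # stand-in dataset

for _ in range(1_000):
    x_data = data[torch.randint(len(data), (128,))]
    x_noise = q.sample((128, dim))
    # logistic regression: data labelled 1, noise labelled 0
    loss = (F.binary_cross_entropy_with_logits(logits(x_data), torch.ones(128))
            + F.binary_cross_entropy_with_logits(logits(x_noise), torch.zeros(128)))
    opt.zero_grad()
    loss.backward()
    opt.step()
```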
7 FAQ
I think I can now answer some of my questions.
“EBMs are just ‘unnormalised models’, so how are they different from HMC/SGLD?” HMC/SGLD perform inference for a given unnormalised target; EBMs learn that target (the energy) from data. The hard part is the model expectation in the gradient—not the absence of \(Z\) per se. (G. E. Hinton 2002; Song and Kingma 2021)
“Aren’t score models the same as EBMs?” Vanilla score models learn a vector field \(s(x)\approx\nabla_x\log p(x)\). Unless we constrain \(s\) to be conservative (a gradient field), it may be non-integrable (no scalar potential). EBMs parameterize the scalar \(E_\theta(x)\) directly. (Bridging methods—DRL and Energy Matching—reduce the practical gap.) (Gao et al. 2021)
8 Further reading
- Foundations/tutorials: (LeCun et al. 2006; Song and Kingma 2021)
- Graphical models & CRFs: (Geman and Geman 1984, 1987; Lafferty, McCallum, and Pereira 2001; Besag 1977)
- Training methods: (G. E. Hinton 2002; Hyvarinen 2007; Gutmann and Hyvärinen 2010)
- Modern EBMs & hybrids: (Du and Mordatch 2019; Gao et al. 2021; Schröder et al. 2023)
- Sampling: (Xifara et al. 2014)
- Bridging flows and EBMs: Energy Matching (Balcerak et al. 2025) (two-regime scalar potential).