Score matching
2021-11-10 — 2025-11-07
Wherein the data’s gradient field is inferred by denoising noisy samples with Gaussian perturbations, so that the learned score is directed toward the expected clean sample and is recovered as noise vanishes.
Can we learn a distribution indirectly by learning its score function from data? This idea is now famous thanks to neural diffusion models. It is especially interesting that we can learn the score function without ever observing it directly, as in denoising.
1 Denoising
This method was used in *Generative Modeling by Estimating Gradients of the Data Distribution* (Song and Ermon 2019), and extended in denoising diffusions.
Suppose we want to fit an unnormalized model \[ p_\theta(y\mid x) \propto \exp\big(E_\theta(x,y)\big), \] where \(E_\theta\) is our “energy”. The partition function \(Z_\theta(x)\) is intractable, so we can’t do maximum likelihood directly.
However, we can match the scores (the gradients of the log density): \[ \nabla_y \log p_\theta(y\mid x) \quad\text{to}\quad \nabla_y \log p_{\text{data}}(y\mid x). \]
Minimizing the Fisher divergence \[ D_F(p_{\text{data}}\,\Vert\,p_\theta) = \tfrac12\,\mathbb E_{p_{\text{data}}} \big[\|\nabla_y\log p_\theta(y\mid x) - \nabla_y\log p_{\text{data}}(y\mid x)\|^2\big] \] makes the model’s density match the data’s up to a normalizing constant. This is what score matching (Hyvärinen 2005) does.
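As a sanity check on the definition (a toy example of my own, not from the cited papers), we can Monte-Carlo estimate the Fisher divergence between two Gaussians whose scores we know in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
v = 2.0  # model variance; data variance is 1, so the densities differ

# The score of N(0, v) is d/dy log p(y) = -y / v.
y = rng.normal(0.0, 1.0, size=200_000)  # samples from p_data = N(0, 1)
score_data = -y          # score of p_data = N(0, 1)
score_model = -y / v     # score of p_theta = N(0, 2)

# Monte Carlo Fisher divergence: ½ E_{p_data}[ (s_model - s_data)² ]
d_fisher = 0.5 * np.mean((score_model - score_data) ** 2)
print(d_fisher)  # closed form: ½ (1 - 1/v)² = 0.125
```

Note that the expectation is taken under \(p_{\text{data}}\), which is exactly why we can estimate it from samples alone once we know how to handle the unknown data score.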
Except how? We don’t know the score of the data: we don’t know \(\nabla_y\log p_{\text{data}}\); we only have samples \(y_i\).
The denoising trick (Vincent 2011) showed that we can estimate the same Fisher divergence without knowing the true score by adding a small Gaussian noise perturbation: \[ \tilde y = y + \varepsilon,\quad \varepsilon\sim \mathcal N(0,\sigma^2 I). \]
We then train the model’s score function \[ s_\theta(\tilde y\mid x) \equiv \nabla_{\tilde y}\log p_\theta(\tilde y\mid x) = \nabla_{\tilde y} E_\theta(x,\tilde y) \] to denoise the corrupted samples, using the loss \[ \mathcal L_{\text{DSM}} =\mathbb E_{(x,y),\varepsilon} \left[ \big\|s_\theta(\tilde y\mid x) + \tfrac{1}{\sigma^2}(\tilde y - y) \big\|^2 \right]. \]
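To make the loss concrete, here is a minimal sketch (my own construction, assuming 1-d standard normal data, no conditioning \(x\), and a linear score model \(s_a(\tilde y) = a\,\tilde y\)); with a linear model the DSM objective is ordinary least squares in \(a\), so we can solve it in closed form rather than by gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5  # noise scale of the Gaussian perturbation
y = rng.normal(0.0, 1.0, size=100_000)       # clean samples, p_data = N(0, 1)
eps = rng.normal(0.0, sigma, size=y.shape)
y_tilde = y + eps                            # corrupted samples

# DSM with a linear score model s_a(ỹ) = a·ỹ:
#   minimise E[ (a·ỹ + (ỹ - y)/σ²)² ]
# which is least squares of a·ỹ against the target -(ỹ - y)/σ².
target = -(y_tilde - y) / sigma**2
a_hat = np.sum(y_tilde * target) / np.sum(y_tilde**2)

# For N(0,1) data the marginal of ỹ is N(0, 1+σ²), whose score is
# -ỹ/(1+σ²), so a_hat should be near -1/(1+σ²) = -0.8.
print(a_hat)
```

The target \(-(\tilde y - y)/\sigma^2\) is just the score of the Gaussian corruption kernel, which is available because we corrupted the data ourselves.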
If we fix x and assume a perfect model, the optimal score function that minimizes this loss satisfies \[ s_\theta^*(\tilde y\mid x) = -\tfrac{1}{\sigma^2}(\tilde y - \mathbb E[y\mid \tilde y,x]). \]
That is, the score points from the noisy sample \(\tilde y\) back toward the expected clean sample \(\mathbb E[y\mid \tilde y, x]\) given that noise realization.
So the model learns a vector field that, for each slightly perturbed \(y\), points in the direction that reduces noise. In the limit \(\sigma\to0\), this field becomes the true data score \(\nabla_y\log p_{\text{data}}(y\mid x)\).
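We can watch this limit numerically (again a toy sketch of my own, assuming 1-d standard normal data and the closed-form linear DSM fit from before). The true score of \(N(0,1)\) is \(-y\), and the fitted coefficient approaches \(-1\) as \(\sigma\to0\):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=200_000)  # p_data = N(0, 1); true score is -y

for sigma in (1.0, 0.3, 0.1):
    y_tilde = y + rng.normal(0.0, sigma, size=y.shape)
    # Closed-form DSM fit of the linear score model s_a(ỹ) = a·ỹ.
    target = -(y_tilde - y) / sigma**2
    a_hat = np.sum(y_tilde * target) / np.sum(y_tilde**2)
    # a_hat tracks -1/(1+σ²), approaching -1 as σ shrinks.
    print(sigma, a_hat)
```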
2 Sliced
3 Incoming
We see suggestive connections to thermodynamics (Sohl-Dickstein et al. 2015), to score-function gradient estimators, and to Bregman divergences (Gutmann and Hirayama 2011).
