Score matching

2021-11-10 — 2025-11-07

Wherein the data’s gradient field is inferred by denoising noisy samples with Gaussian perturbations, so that the learned score is directed toward the expected clean sample and is recovered as noise vanishes.

approximation
Bayes
Bregman
generative
Monte Carlo
neural nets
optimization
probabilistic algorithms
probability
score function
statistics

Can we learn a distribution indirectly by learning its score function from data? The idea is now famous thanks to neural diffusion models.

Figure 1

This is especially interesting when we learn the score function without seeing it, as in denoising.

1 Denoising

This method was used in Generative Modeling by Estimating Gradients of the Data Distribution. It was extended in denoising diffusions.

Suppose we want to fit an unnormalized model \[ p_\theta(y\mid x) \propto \exp\!\big(E_\theta(x,y)\big), \] where \(E_\theta\) is our “energy”. The partition function \(Z_\theta(x)\) is intractable, so we can’t do maximum likelihood directly.

However, we can match the scores (the gradients of the log density): \[ \nabla_y \log p_\theta(y\mid x) \quad\text{to}\quad \nabla_y \log p_{\text{data}}(y\mid x). \]

Minimizing the Fisher divergence \[ D_F(p_{\text{data}}\,\|\,p_\theta) = \tfrac12\,\mathbb E_{p_{\text{data}}} \big[\|\nabla_y\log p_\theta(y\mid x) - \nabla_y\log p_{\text{data}}(y\mid x)\|^2\big] \] makes the model’s density match the data’s up to a normalizing constant. This is what score matching (Hyvärinen 2005) does.

But how? We don’t know the score of the data, \(\nabla_y\log p_{\text{data}}\); we only have samples \(y_i\).

The denoising trick (Vincent 2011) showed that we can estimate the same Fisher divergence without knowing the true score by adding a small Gaussian noise perturbation: \[ \tilde y = y + \varepsilon,\quad \varepsilon\sim \mathcal N(0,\sigma^2 I). \]

We then train the model’s score function \[ s_\theta(\tilde y\mid x) \;\equiv\; \nabla_{\tilde y}\log p_\theta(\tilde y\mid x) = \nabla_{\tilde y} E_\theta(x,\tilde y) \] to denoise the corrupted samples, using the loss function \[ \mathcal L_{\text{DSM}} =\mathbb E_{(x,y),\varepsilon}\! \left[ \big\|s_\theta(\tilde y\mid x) + \tfrac{1}{\sigma^2}(\tilde y - y) \big\|^2 \right]. \]
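As a toy sanity check (my addition, not from the papers cited here): for scalar Gaussian data, the DSM objective is just a least-squares regression, and the fitted linear score recovers the score of the perturbed density, \(-(\tilde y-\mu)/(s^2+\sigma^2)\). A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y ~ N(mu, s^2). After adding N(0, sigma^2) noise, the perturbed
# marginal is N(mu, s^2 + sigma^2), whose score is -(y_tilde - mu)/(s^2 + sigma^2).
mu, s, sigma = 2.0, 1.0, 0.5
y = rng.normal(mu, s, size=200_000)
y_tilde = y + rng.normal(0.0, sigma, size=y.shape)

# DSM with a linear score model s_theta(y_tilde) = a*y_tilde + b:
# minimising E||s_theta(y_tilde) + (y_tilde - y)/sigma^2||^2 is exactly
# least-squares regression of the target -(y_tilde - y)/sigma^2 on y_tilde.
target = -(y_tilde - y) / sigma**2
X = np.stack([y_tilde, np.ones_like(y_tilde)], axis=1)
(a, b), *_ = np.linalg.lstsq(X, target, rcond=None)

print(a, b)  # should approach a = -1/(s^2 + sigma^2), b = mu/(s^2 + sigma^2)
```

Note that the regression target \(-(\tilde y - y)/\sigma^2\) is very noisy for small \(\sigma\); the averaging over many samples is doing all the work.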

If we fix x and assume a perfect model, the optimal score function that minimizes this loss satisfies \[ s_\theta^*(\tilde y\mid x) = -\tfrac{1}{\sigma^2}(\tilde y - \mathbb E[y\mid \tilde y,x]). \]

That is, the score points from the noisy sample \(\tilde y\) back toward the expected clean sample \(y\) given that noise realization.

So the model learns a vector field that, for each slightly perturbed \(y\), points in the direction that reduces noise. In the limit \(\sigma\to0\), this field becomes the true data score \(\nabla_y\log p_{\text{data}}(y\mid x)\).
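A worked scalar example (my addition; drop the conditioning on \(x\) and take \(y\sim\mathcal N(\mu, s^2)\)). With \(\tilde y = y + \varepsilon\), \(\varepsilon\sim\mathcal N(0,\sigma^2)\), standard Gaussian conditioning gives \[ \mathbb E[y\mid \tilde y] = \frac{s^2\tilde y + \sigma^2\mu}{s^2+\sigma^2}, \quad\text{so}\quad s^*(\tilde y) = -\frac{1}{\sigma^2}\big(\tilde y - \mathbb E[y\mid\tilde y]\big) = -\frac{\tilde y-\mu}{s^2+\sigma^2}, \] which is exactly the score of the perturbed marginal \(\mathcal N(\mu, s^2+\sigma^2)\), and which tends to the data score \(-(y-\mu)/s^2\) as \(\sigma\to0\).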

2 Sliced

Sliced Score Matching (Y. Song et al. 2019).

3 Incoming

We see suggestive connections to thermodynamics (Sohl-Dickstein et al. 2015), to score estimators in gradient estimation, and to Bregman divergences (Gutmann and Hirayama 2011).

4 References

Bao, Chipilski, Liang, et al. 2024. “Nonlinear Ensemble Filtering with Diffusion Models: Application to the Surface Quasi-Geostrophic Dynamics.”
Bao, Zhang, and Zhang. 2024a. “An Ensemble Score Filter for Tracking High-Dimensional Nonlinear Dynamical Systems.”
———. 2024b. “A Score-Based Filter for Nonlinear Data Assimilation.” Journal of Computational Physics.
Dockhorn, Vahdat, and Kreis. 2022. “GENIE: Higher-Order Denoising Diffusion Solvers.” In.
Gutmann, and Hirayama. 2011. “Bregman Divergence as General Framework to Estimate Unnormalized Statistical Models.” In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence. UAI’11.
Holzschuh, Vegetti, and Thuerey. 2022. “Score Matching via Differentiable Physics.”
Hyvärinen. 2005. “Estimation of Non-Normalized Statistical Models by Score Matching.” The Journal of Machine Learning Research.
Lim, Kovachki, Baptista, et al. 2023. “Score-Based Diffusion Models in Function Space.”
McAllester. 2023. “On the Mathematics of Diffusion Models.”
Rozet, and Louppe. 2023a. “Score-Based Data Assimilation.”
———. 2023b. “Score-Based Data Assimilation for a Two-Layer Quasi-Geostrophic Model.”
Schröder, Ou, Lim, et al. 2023. “Energy Discrepancies: A Score-Independent Loss for Energy-Based Models.”
Sharrock, Simons, Liu, et al. 2022. “Sequential Neural Score Estimation: Likelihood-Free Inference with Conditional Score Based Diffusion Models.”
Sohl-Dickstein, Weiss, Maheswaranathan, et al. 2015. “Deep Unsupervised Learning Using Nonequilibrium Thermodynamics.”
Song, Yang, and Ermon. 2020a. “Generative Modeling by Estimating Gradients of the Data Distribution.” In Advances in Neural Information Processing Systems.
———. 2020b. “Improved Techniques for Training Score-Based Generative Models.” In Advances in Neural Information Processing Systems.
Song, Yang, Garg, Shi, et al. 2019. “Sliced Score Matching: A Scalable Approach to Density and Score Estimation.”
Song, Jiaming, Meng, and Ermon. 2021. “Denoising Diffusion Implicit Models.” arXiv:2010.02502 [cs].
Song, Yang, Sohl-Dickstein, Kingma, et al. 2022. “Score-Based Generative Modeling Through Stochastic Differential Equations.” In.
Swersky, Ranzato, Buchman, et al. 2011. “On Autoencoders and Score Matching for Energy Based Models.” In Proceedings of the 28th International Conference on Machine Learning (ICML-11).
Tran, Rossi, Milios, et al. 2021. “Model Selection for Bayesian Autoencoders.” In Advances in Neural Information Processing Systems.
Vincent. 2011. “A Connection Between Score Matching and Denoising Autoencoders.” Neural Computation.
Zhuang, Abnar, Gu, et al. 2022. “Diffusion Probabilistic Fields.” In.