The (Fisher) score function
November 14, 2024 — November 15, 2024
The score function, also known as the gradient of the log-likelihood, is a weirdly handy concept in statistics and machine learning.
We usually run into the score function in the context of maximum likelihood estimation (MLE). It quantifies the sensitivity of the log-likelihood to changes in the parameter of interest and traditionally plays a crucial role in parameter estimation, hypothesis testing, and the development of asymptotic theory. Since working with gradients of log densities sidesteps intractable normalising constants, even in high dimensions, it is a popular choice for optimisation and estimation in machine learning and statistics.
1 Definition
Let \(\mathbf{X}\) be a random vector with probability density function (pdf) \(f(\mathbf{x}; \boldsymbol{\theta})\), where \(\boldsymbol{\theta}\) is an unknown vector of parameters to be estimated.
The score function \(\mathbf{S}(\boldsymbol{\theta})\) is defined as the gradient of the natural logarithm of the likelihood function with respect to \(\boldsymbol{\theta}\):
\[ \mathbf{S}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \ln f(\mathbf{X}; \boldsymbol{\theta}). \]
For a set of independent samples \(\{\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_n\}\), the joint pdf is \(f(\mathbf{X}; \boldsymbol{\theta}) = \prod_{i=1}^n f(\mathbf{X}_i; \boldsymbol{\theta})\), and the score function becomes:
\[ \mathbf{S}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \ln f(\mathbf{X}; \boldsymbol{\theta}) = \sum_{i=1}^n \nabla_{\boldsymbol{\theta}} \ln f(\mathbf{X}_i; \boldsymbol{\theta}). \]
The score function measures how sensitive the log-likelihood function is to changes in the parameter vector \(\boldsymbol{\theta}\). It indicates the direction in which we should adjust \(\boldsymbol{\theta}\) to increase the likelihood of observing the data.
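As a concrete sketch, here is the score of an iid Gaussian location model computed by automatic differentiation (JAX purely for convenience; any autodiff framework would do). The observations, the fixed \(\sigma\), and the variable names are made up for illustration, and the analytic score \(\sum_i (x_i - \theta)/\sigma^2\) is printed alongside as a check.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

# Illustrative model: X_i ~ N(theta, sigma^2) iid, sigma known.
sigma = 2.0
x = jnp.array([1.3, -0.4, 2.1, 0.7])  # made-up observations

def log_likelihood(theta):
    # sum of ln f(x_i; theta) over the iid sample
    return jnp.sum(norm.logpdf(x, loc=theta, scale=sigma))

score = jax.grad(log_likelihood)  # S(theta) = d/dtheta log-likelihood

theta0 = 0.5
print(score(theta0))                   # autodiff score
print(jnp.sum(x - theta0) / sigma**2)  # analytic score, should match
```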
1.1 Stein score / Fisher score
The definition above is what is sometimes called the Fisher score. Some authors distinguish it from the Stein score, which is the gradient of the log density with respect to the data rather than the parameter:
\[ \mathbf{S}(\mathbf{x}) = \nabla_{\mathbf{x}} \ln f(\mathbf{x}; \boldsymbol{\theta}). \]
In the Bayesian context the distinction is not so sharp, since parameters and data are both random variables; it becomes important once we regard one of them as fixed.
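To make the distinction concrete, a minimal sketch of the two gradients in the same toy Gaussian model as above; only the argument we differentiate with respect to changes.

```python
import jax
from jax.scipy.stats import norm

sigma = 2.0

def logpdf(x, theta):
    return norm.logpdf(x, loc=theta, scale=sigma)

fisher_score = jax.grad(logpdf, argnums=1)  # d/dtheta ln f(x; theta)
stein_score = jax.grad(logpdf, argnums=0)   # d/dx     ln f(x; theta)

x0, theta0 = 1.3, 0.5
print(fisher_score(x0, theta0))  #  (x - theta) / sigma^2
print(stein_score(x0, theta0))   # -(x - theta) / sigma^2
```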
2 Handy Properties
The classic use of the score is in maximum likelihood estimation, where we set it to zero to find the maximum likelihood estimate of the parameter vector; let’s look at some fancier properties.
2.1 Zero Mean Property
Under regularity conditions (i.e., we can exchange integration and differentiation), the expected value of the score function is zero:
\[ \mathbb{E}[\mathbf{S}(\boldsymbol{\theta})] = \mathbb{E}\left[ \nabla_{\boldsymbol{\theta}} \ln f(\mathbf{X}; \boldsymbol{\theta}) \right] = \mathbf{0}. \]
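A quick Monte Carlo sanity check of the zero-mean property, sticking with the toy Gaussian model (sample size and seed are arbitrary): draw many \(\mathbf{X}\) from \(f(\cdot; \boldsymbol{\theta})\), evaluate the score at the true \(\boldsymbol{\theta}\), and average.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

sigma, theta_true = 2.0, 0.5
key = jax.random.PRNGKey(0)
xs = theta_true + sigma * jax.random.normal(key, (100_000,))  # X ~ f(.; theta_true)

# Score of a single observation, as a function of theta.
score = jax.grad(lambda theta, x: norm.logpdf(x, loc=theta, scale=sigma), argnums=0)
scores = jax.vmap(score, in_axes=(None, 0))(theta_true, xs)

print(scores.mean())  # ~ 0, up to Monte Carlo error
```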
2.2 Variance of the Score Function and Fisher Information
The variance of the score function is known as the Fisher information matrix \(\mathbf{I}(\boldsymbol{\theta})\):
\[ \mathbf{I}(\boldsymbol{\theta}) = \mathbb{E}\left[ \mathbf{S}(\boldsymbol{\theta}) \mathbf{S}(\boldsymbol{\theta})^\top \right]. \]
Fisher information quantifies the amount of information that an observable random vector \(\mathbf{X}\) carries about the unknown parameter vector \(\boldsymbol{\theta}\), and is heavily weaponised in information geometry.
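As an illustrative sketch, we can estimate \(\mathbf{I}(\boldsymbol{\theta})\) by Monte Carlo as the second moment of simulated scores. For a bivariate Gaussian location model (the \(\mathbf{\Sigma}\) and sample size below are invented) the analytic answer is \(\mathbf{\Sigma}^{-1}\).

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import multivariate_normal as mvn

# Illustrative 2-D model: X ~ N(theta, Sigma) with Sigma known.
Sigma = jnp.array([[2.0, 0.5], [0.5, 1.0]])
theta_true = jnp.array([1.0, -1.0])

key = jax.random.PRNGKey(0)
xs = jax.random.multivariate_normal(key, theta_true, Sigma, (50_000,))

score = jax.grad(lambda theta, x: mvn.logpdf(x, theta, Sigma), argnums=0)
scores = jax.vmap(score, in_axes=(None, 0))(theta_true, xs)  # (50000, 2)

# Fisher information as the second moment of the score, vs the analytic Sigma^{-1}.
print(jnp.einsum("ni,nj->ij", scores, scores) / xs.shape[0])
print(jnp.linalg.inv(Sigma))
```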
2.3 Fisher’s Identity
Fisher’s identity relates the variance of the score function (the Fisher information) to the expected second derivative (Hessian) of the log-likelihood function:
\[ \mathbf{I}(\boldsymbol{\theta}) = -\mathbb{E}\left[ \nabla_{\boldsymbol{\theta}}^2 \ln f(\mathbf{X}; \boldsymbol{\theta}) \right], \]
where \(\nabla_{\boldsymbol{\theta}}^2\) denotes the Hessian matrix with respect to \(\boldsymbol{\theta}\).
This identity connects the variability of the score function to the expected curvature of the log-likelihood; the negative Hessian evaluated at the data is the observed information, and its expectation is the expected (Fisher) information.
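We can check the identity by autodiff in the same toy Gaussian model. For a Gaussian location model the Hessian of the log-likelihood does not depend on \(\mathbf{x}\), so the expectation is trivial and \(-\nabla_{\boldsymbol{\theta}}^2 \ln f\) should equal \(\mathbf{\Sigma}^{-1}\) exactly.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import multivariate_normal as mvn

Sigma = jnp.array([[2.0, 0.5], [0.5, 1.0]])
theta = jnp.array([1.0, -1.0])
x = jnp.array([0.3, 0.7])  # any observation will do here

# Negative Hessian of the log-likelihood with respect to theta.
neg_hess = -jax.hessian(lambda t: mvn.logpdf(x, t, Sigma))(theta)

# For a Gaussian with unknown mean the Hessian is constant in x, so no
# expectation is needed: -Hessian equals Sigma^{-1}, the Fisher information.
print(neg_hess)
print(jnp.linalg.inv(Sigma))
```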
3 Conditional Score Functions
Consider random vectors \(\mathbf{X}\) and \(\mathbf{Y}\) with joint pdf \(f_{\mathbf{X}, \mathbf{Y}}(\mathbf{x}, \mathbf{y}; \boldsymbol{\theta})\). The conditional pdf of \(\mathbf{X}\) given \(\mathbf{Y} = \mathbf{y}\) is:
\[ f_{\mathbf{X} \mid \mathbf{Y}}(\mathbf{x} \mid \mathbf{y}; \boldsymbol{\theta}) = \frac{f_{\mathbf{X}, \mathbf{Y}}(\mathbf{x}, \mathbf{y}; \boldsymbol{\theta})}{f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})}, \]
where \(f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = \int f_{\mathbf{X}, \mathbf{Y}}(\mathbf{x}, \mathbf{y}; \boldsymbol{\theta}) \, d\mathbf{x}\) is the marginal pdf of \(\mathbf{Y}\).
The conditional score function \(\mathbf{S}_{\mathbf{X} \mid \mathbf{Y}}(\boldsymbol{\theta})\) is defined as:
\[ \mathbf{S}_{\mathbf{X} \mid \mathbf{Y}}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \ln f_{\mathbf{X} \mid \mathbf{Y}}(\mathbf{x} \mid \mathbf{y}; \boldsymbol{\theta}). \]
Using the properties of logarithms:
\[ \mathbf{S}_{\mathbf{X} \mid \mathbf{Y}}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \left[ \ln f_{\mathbf{X}, \mathbf{Y}}(\mathbf{x}, \mathbf{y}; \boldsymbol{\theta}) - \ln f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) \right] = \mathbf{S}_{\mathbf{X}, \mathbf{Y}}(\boldsymbol{\theta}) - \mathbf{S}_{\mathbf{Y}}(\boldsymbol{\theta}), \]
where
- \(\mathbf{S}_{\mathbf{X}, \mathbf{Y}}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \ln f_{\mathbf{X}, \mathbf{Y}}(\mathbf{x}, \mathbf{y}; \boldsymbol{\theta})\) is the joint score function.
- \(\mathbf{S}_{\mathbf{Y}}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \ln f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})\) is the marginal score function of \(\mathbf{Y}\).
We can read this expression as saying that the conditional score function measures the additional information about \(\boldsymbol{\theta}\) provided by \(\mathbf{X}\) given \(\mathbf{Y}\).
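A sketch of this decomposition for a toy hierarchical Gaussian model, \(X \sim \mathcal{N}(\theta, s_1^2)\) and \(Y \mid X \sim \mathcal{N}(X, s_2^2)\) (all numbers invented). Here the conditional density is available in closed form, so we can check that differentiating it directly agrees with the joint-minus-marginal formula.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

# Illustrative hierarchical model: X ~ N(theta, s1^2), Y | X ~ N(X, s2^2).
s1, s2 = 1.0, 0.5
x_obs, y_obs, theta0 = 0.8, 1.2, 0.3

def log_joint(theta):      # ln f_{X,Y}(x, y; theta)
    return norm.logpdf(x_obs, theta, s1) + norm.logpdf(y_obs, x_obs, s2)

def log_marginal(theta):   # ln f_Y(y; theta): Y ~ N(theta, s1^2 + s2^2)
    return norm.logpdf(y_obs, theta, jnp.sqrt(s1**2 + s2**2))

def log_conditional(theta):  # ln f_{X|Y}(x | y; theta), closed form in this model
    v = 1.0 / (1.0 / s1**2 + 1.0 / s2**2)
    m = v * (theta / s1**2 + y_obs / s2**2)
    return norm.logpdf(x_obs, m, jnp.sqrt(v))

lhs = jax.grad(log_conditional)(theta0)
rhs = jax.grad(log_joint)(theta0) - jax.grad(log_marginal)(theta0)
print(lhs, rhs)  # the two computations of the conditional score agree
```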
4 Bayes’ Rule in Score Functions
Bayes’ rule updates our beliefs about the parameter vector \(\boldsymbol{\theta}\) in light of observed data \(\mathbf{y}\). The posterior distribution of \(\boldsymbol{\theta}\) given \(\mathbf{Y} = \mathbf{y}\) is:
\[ \pi(\boldsymbol{\theta} \mid \mathbf{y}) = \frac{f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) \pi(\boldsymbol{\theta})}{m(\mathbf{y})}, \]
where
- \(\pi(\boldsymbol{\theta})\) is the prior distribution of \(\boldsymbol{\theta}\).
- \(m(\mathbf{y}) = \int f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) \pi(\boldsymbol{\theta}) \, d\boldsymbol{\theta}\) is the marginal likelihood of \(\mathbf{y}\).
The score function for the posterior distribution (posterior score function?) is:
\[ \mathbf{S}_{\text{Bayes}}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \ln \pi(\boldsymbol{\theta} \mid \mathbf{y}) = \nabla_{\boldsymbol{\theta}} \left[ \ln f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) + \ln \pi(\boldsymbol{\theta}) - \ln m(\mathbf{y}) \right]. \]
Since \(m(\mathbf{y})\) does not depend on \(\boldsymbol{\theta}\), its derivative is zero. Thus,
\[ \mathbf{S}_{\text{Bayes}}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \ln f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) + \nabla_{\boldsymbol{\theta}} \ln \pi(\boldsymbol{\theta}) = \mathbf{S}_{\mathbf{Y}}(\boldsymbol{\theta}) + \mathbf{S}_{\pi}(\boldsymbol{\theta}), \]
where \(\mathbf{S}_{\pi}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \ln \pi(\boldsymbol{\theta})\) is the score function of the prior.
5 Generating a Conditional Score Function via a Differentiable Forward Model
Consider a forward model \(\mathbf{G}: \mathbf{X} \mapsto \mathbf{Y}\), where \(\mathbf{Y}\) is observed and \(\mathbf{X}\) is unobserved. Our goal is to derive the conditional score function \(\mathbf{S}_{\mathbf{X} \mid \mathbf{Y}}(\boldsymbol{\theta})\) given \(\mathbf{Y} = \mathbf{y}\).
Assume
- \(\mathbf{X}\) is a random vector with pdf \(f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta})\).
- \(\mathbf{Y} = \mathbf{G}(\mathbf{X})\), where \(\mathbf{G}\) is a differentiable and invertible function.
- We observe \(\mathbf{Y} = \mathbf{y}\) and wish to infer information about \(\boldsymbol{\theta}\) through \(\mathbf{X}\).
The joint pdf of \(\mathbf{X}\) and \(\mathbf{Y}\) is:
\[ f_{\mathbf{X}, \mathbf{Y}}(\mathbf{x}, \mathbf{y}; \boldsymbol{\theta}) = f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta}) \delta(\mathbf{y} - \mathbf{G}(\mathbf{x})), \]
where \(\delta(\cdot)\) is the Dirac delta function generalized to vector arguments.
The marginal pdf of \(\mathbf{Y}\) is:
\[ f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = \int f_{\mathbf{X}, \mathbf{Y}}(\mathbf{x}, \mathbf{y}; \boldsymbol{\theta}) \, d\mathbf{x} = \int f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta}) \delta(\mathbf{y} - \mathbf{G}(\mathbf{x})) \, d\mathbf{x}. \]
By the sifting property of the Dirac delta function:
\[ f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = f_{\mathbf{X}}(\mathbf{G}^{-1}(\mathbf{y}); \boldsymbol{\theta}) \left| \det\left( \frac{\partial \mathbf{G}^{-1}(\mathbf{y})}{\partial \mathbf{y}} \right) \right|, \]
where \(\det(\cdot)\) denotes the determinant, and \(\frac{\partial \mathbf{G}^{-1}(\mathbf{y})}{\partial \mathbf{y}}\) is the Jacobian matrix of \(\mathbf{G}^{-1}\).
The conditional pdf of \(\mathbf{X}\) given \(\mathbf{Y} = \mathbf{y}\) is:
\[ f_{\mathbf{X} \mid \mathbf{Y}}(\mathbf{x} \mid \mathbf{y}; \boldsymbol{\theta}) = \frac{f_{\mathbf{X}, \mathbf{Y}}(\mathbf{x}, \mathbf{y}; \boldsymbol{\theta})}{f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})} = \frac{f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta}) \delta(\mathbf{y} - \mathbf{G}(\mathbf{x}))}{f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})}. \]
The conditional score function is:
\[ \mathbf{S}_{\mathbf{X} \mid \mathbf{Y}}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \ln f_{\mathbf{X} \mid \mathbf{Y}}(\mathbf{x} \mid \mathbf{y}; \boldsymbol{\theta}). \]
Substituting the expression for \(f_{\mathbf{X} \mid \mathbf{Y}}(\mathbf{x} \mid \mathbf{y}; \boldsymbol{\theta})\):
\[ \mathbf{S}_{\mathbf{X} \mid \mathbf{Y}}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \left[ \ln f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta}) + \ln \delta(\mathbf{y} - \mathbf{G}(\mathbf{x})) - \ln f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) \right]. \]
Since \(\delta(\mathbf{y} - \mathbf{G}(\mathbf{x}))\) does not depend on \(\boldsymbol{\theta}\) (given \(\mathbf{x}\) and \(\mathbf{y}\)), its derivative with respect to \(\boldsymbol{\theta}\) is zero. Therefore:
\[ \mathbf{S}_{\mathbf{X} \mid \mathbf{Y}}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \ln f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta}) - \nabla_{\boldsymbol{\theta}} \ln f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = \mathbf{S}_{\mathbf{X}}(\boldsymbol{\theta}) - \mathbf{S}_{\mathbf{Y}}(\boldsymbol{\theta}). \]
i.e. the conditional score function is the difference between the score function of \(\mathbf{X}\) and the score function of \(\mathbf{Y}\).
5.1 Example: Multivariate Normal Distribution with a Linear Transformation
Suppose the following:
- \(\mathbf{X} \sim \mathcal{N}(\boldsymbol{\theta}, \mathbf{\Sigma})\).
- \(\mathbf{Y} = \mathbf{G}(\mathbf{X}) = \mathbf{A}\mathbf{X} + \mathbf{b}\), where \(\mathbf{A}\) is a known invertible matrix and \(\mathbf{b}\) is a known vector.
- We observe \(\mathbf{Y} = \mathbf{y}\) and wish to compute \(\mathbf{S}_{\mathbf{X} \mid \mathbf{Y}}(\boldsymbol{\theta})\).
We have
\[ f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta}) = \frac{1}{(2\pi)^{n/2} |\mathbf{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\theta})^\top \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\theta}) \right). \]
Since \(\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}\),
\[ f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = \frac{1}{(2\pi)^{n/2} |\mathbf{A}\mathbf{\Sigma}\mathbf{A}^\top|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{y} - \mathbf{A}\boldsymbol{\theta} - \mathbf{b})^\top (\mathbf{A}\mathbf{\Sigma}\mathbf{A}^\top)^{-1} (\mathbf{y} - \mathbf{A}\boldsymbol{\theta} - \mathbf{b}) \right). \]
Thus, \(\mathbf{Y} \sim \mathcal{N}(\mathbf{A}\boldsymbol{\theta} + \mathbf{b}, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^\top)\). We compute the score functions:
\[ \begin{aligned} \mathbf{S}_{\mathbf{X}}(\boldsymbol{\theta}) &= \nabla_{\boldsymbol{\theta}} \ln f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\theta}) = \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\theta}),\\ \mathbf{S}_{\mathbf{Y}}(\boldsymbol{\theta}) &= \nabla_{\boldsymbol{\theta}} \ln f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = \mathbf{A}^\top (\mathbf{A}\mathbf{\Sigma}\mathbf{A}^\top)^{-1} (\mathbf{y} - \mathbf{A}\boldsymbol{\theta} - \mathbf{b}). \end{aligned} \]
Since \(\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}\), we can express \((\mathbf{y} - \mathbf{A}\boldsymbol{\theta} - \mathbf{b})\) in terms of \((\mathbf{x} - \boldsymbol{\theta})\),
\[ \mathbf{y} - \mathbf{A}\boldsymbol{\theta} - \mathbf{b} = \mathbf{A}(\mathbf{x} - \boldsymbol{\theta}). \]
Therefore
\[ \mathbf{S}_{\mathbf{Y}}(\boldsymbol{\theta}) = \mathbf{A}^\top (\mathbf{A}\mathbf{\Sigma}\mathbf{A}^\top)^{-1} \mathbf{A} (\mathbf{x} - \boldsymbol{\theta}). \]
Since \(\mathbf{A}\) is invertible, we have
\[ (\mathbf{A}\mathbf{\Sigma}\mathbf{A}^\top)^{-1} = (\mathbf{A}^\top)^{-1} \mathbf{\Sigma}^{-1} \mathbf{A}^{-1}. \]
Therefore
\[ \begin{aligned} \mathbf{S}_{\mathbf{Y}}(\boldsymbol{\theta}) &= \mathbf{A}^\top (\mathbf{A}^\top)^{-1} \mathbf{\Sigma}^{-1} \mathbf{A}^{-1} \mathbf{A} (\mathbf{x} - \boldsymbol{\theta}) = \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\theta}) = \mathbf{S}_{\mathbf{X}}(\boldsymbol{\theta}),\\ \mathbf{S}_{\mathbf{X} \mid \mathbf{Y}}(\boldsymbol{\theta}) &= \mathbf{S}_{\mathbf{X}}(\boldsymbol{\theta}) - \mathbf{S}_{\mathbf{Y}}(\boldsymbol{\theta}) = \mathbf{0}. \end{aligned} \]
In this example, the conditional score function \(\mathbf{S}_{\mathbf{X} \mid \mathbf{Y}}(\boldsymbol{\theta})\) is zero, indicating that, given \(\mathbf{Y} = \mathbf{y}\), there is no additional information about \(\boldsymbol{\theta}\) from \(\mathbf{X}\) beyond what is already provided by \(\mathbf{Y}\). We can persuade ourselves this is reasonable by noting that, because the transformation \(\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}\) is linear and invertible, observing \(\mathbf{Y}\) is equivalent to observing \(\mathbf{X}\) up to a deterministic transformation.
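A quick numerical confirmation of the cancellation, with an invented \(\mathbf{A}\), \(\mathbf{b}\), \(\mathbf{\Sigma}\), and observation; the two scores are computed by autodiff and their difference should be numerically zero.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import multivariate_normal as mvn

# Illustrative linear-Gaussian check: X ~ N(theta, Sigma), Y = A X + b.
Sigma = jnp.array([[2.0, 0.3], [0.3, 1.0]])
A = jnp.array([[1.0, 2.0], [0.0, 1.0]])   # invertible
b = jnp.array([0.5, -0.5])
theta0 = jnp.array([1.0, -1.0])
x = jnp.array([0.7, 0.2])
y = A @ x + b                              # consistent observation

score_X = jax.grad(lambda t: mvn.logpdf(x, t, Sigma))(theta0)
score_Y = jax.grad(lambda t: mvn.logpdf(y, A @ t + b, A @ Sigma @ A.T))(theta0)

print(score_X - score_Y)  # conditional score: ~ [0, 0]
```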
5.2 Nonlinear Example
To illustrate a case where the conditional score function is non-zero, consider a nonlinear transformation.
Suppose:
- \(\mathbf{X} \sim \mathcal{N}(\boldsymbol{\theta}, \mathbf{\Sigma})\).
- \(\mathbf{Y} = \mathbf{G}(\mathbf{X})\), where \(\mathbf{G}\) is a nonlinear, non-invertible (information-losing) function.
- We observe \(\mathbf{Y} = \mathbf{y}\) and wish to compute \(\mathbf{S}_{\mathbf{X} \mid \mathbf{Y}}(\boldsymbol{\theta})\).
Same as before:
\[ \mathbf{S}_{\mathbf{X}}(\boldsymbol{\theta}) = \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\theta}). \]
Computing \(\mathbf{S}_{\mathbf{Y}}(\boldsymbol{\theta})\) involves differentiating \(\ln f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})\) with respect to \(\boldsymbol{\theta}\). In the case of a nonlinear \(\mathbf{G}\), \(f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta})\) generally does not have a closed-form expression, and we may need to resort to numerical methods or approximations.
However, the conditional score function is:
\[ \mathbf{S}_{\mathbf{X} \mid \mathbf{Y}}(\boldsymbol{\theta}) = \mathbf{S}_{\mathbf{X}}(\boldsymbol{\theta}) - \mathbf{S}_{\mathbf{Y}}(\boldsymbol{\theta}). \]
This expression is generally non-zero and depends on both \(\mathbf{x}\) and \(\mathbf{y}\). It reflects the additional information about \(\boldsymbol{\theta}\) provided by \(\mathbf{X}\) beyond what is captured by \(\mathbf{Y}\).
In this nonlinear example, the conditional score function \(\mathbf{S}_{\mathbf{X} \mid \mathbf{Y}}(\boldsymbol{\theta})\) is non-zero, indicating that, given \(\mathbf{Y} = \mathbf{y}\), observing \(\mathbf{X}\) provides additional information about \(\boldsymbol{\theta}\). The essential ingredient is loss of information: because \(\mathbf{G}\) is not invertible over its domain, it compresses \(\mathbf{X}\) in some directions, and \(\mathbf{Y}\) no longer determines \(\mathbf{X}\). (For any invertible \(\mathbf{G}\) whose Jacobian does not depend on \(\boldsymbol{\theta}\), the cancellation from the previous example recurs and the conditional score vanishes.)
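To see this numerically, here is a one-dimensional sketch with the (purely illustrative) information-losing map \(G(x) = x^2\). For scalar Gaussian \(X\), the density of \(Y = X^2\) folds the two preimages together and can be written, and autodiffed, in closed form, and the resulting \(\mathbf{S}_{\mathbf{X}} - \mathbf{S}_{\mathbf{Y}}\) is visibly non-zero.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

# Illustrative 1-D information-losing forward model: Y = X^2, X ~ N(theta, sigma^2).
sigma = 1.0
x_obs = 0.8
y_obs = x_obs**2
theta0 = 0.3

def log_fX(theta):
    return norm.logpdf(x_obs, theta, sigma)

def log_fY(theta):
    # f_Y(y) = [f_X(sqrt(y)) + f_X(-sqrt(y))] / (2 sqrt(y)) for y > 0.
    r = jnp.sqrt(y_obs)
    density = (norm.pdf(r, theta, sigma) + norm.pdf(-r, theta, sigma)) / (2.0 * r)
    return jnp.log(density)

score_X = jax.grad(log_fX)(theta0)
score_Y = jax.grad(log_fY)(theta0)
print(score_X - score_Y)  # conditional score S_{X|Y}: non-zero in general
```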
6 Sampling \(\boldsymbol{\theta}\) via the Score Function
Can we find, or estimate, \(\nabla_{\boldsymbol{\theta}} \ln f(\mathbf{X}; \boldsymbol{\theta})\) and use it to sample \(\boldsymbol{\theta}\) from a target density (e.g. a posterior)?
Yes. This is the main thing I do with the score function, TBH. Hamiltonian Monte Carlo, Langevin Monte Carlo, and score diffusion are all (related) methods that use the score function to sample from a distribution.
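As a minimal sketch of the Langevin flavour: an unadjusted Langevin iteration needs only the score of the target. The target below is the conjugate Gaussian posterior from the Bayes example (all numbers invented), whose mean 1.2 and variance 0.8 we know in closed form, so we can eyeball the output; expect a little discretisation bias on top of Monte Carlo error.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

# Target (an assumption for illustration): the conjugate Gaussian posterior
# with prior N(0, 4), likelihood N(theta, 1), observation y = 1.5.
m0, t0, s, y_obs = 0.0, 2.0, 1.0, 1.5

def log_target(theta):
    return norm.logpdf(theta, m0, t0) + norm.logpdf(y_obs, theta, s)

score = jax.grad(log_target)

def langevin_step(theta, key, eps=0.05):
    # Unadjusted Langevin: drift along the score plus Gaussian noise.
    noise = jax.random.normal(key)
    return theta + 0.5 * eps * score(theta) + jnp.sqrt(eps) * noise

key = jax.random.PRNGKey(0)
theta = 0.0
samples = []
for k in jax.random.split(key, 5000):
    theta = langevin_step(theta, k)
    samples.append(theta)
samples = jnp.stack(samples[1000:])   # drop burn-in
print(samples.mean(), samples.var())  # ~ posterior mean 1.2, variance 0.8
```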
7 Other useful score function tricks
Score-function gradient estimators (a.k.a. the log-derivative trick, or REINFORCE) are a popular choice for estimating gradients of expectations in stochastic optimisation and reinforcement learning.
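For completeness, a sketch of the log-derivative trick, \(\nabla_{\boldsymbol{\theta}} \mathbb{E}_{x \sim p_{\boldsymbol{\theta}}}[f(x)] = \mathbb{E}_{x \sim p_{\boldsymbol{\theta}}}[f(x)\, \nabla_{\boldsymbol{\theta}} \ln p_{\boldsymbol{\theta}}(x)]\), on an invented toy problem where the answer is known analytically.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

# Toy problem: estimate d/dtheta E_{x ~ N(theta, 1)}[x^2] as E[f(x) * score].
# Analytic answer is 2*theta, since E[x^2] = theta^2 + 1.
theta0 = 0.7
f = lambda x: x**2

key = jax.random.PRNGKey(0)
xs = theta0 + jax.random.normal(key, (200_000,))  # x ~ N(theta0, 1)

score = jax.vmap(jax.grad(lambda theta, x: norm.logpdf(x, theta, 1.0)),
                 in_axes=(None, 0))(theta0, xs)

print(jnp.mean(f(xs) * score), 2 * theta0)  # Monte Carlo vs analytic 2*theta
```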