The score function, also known as the gradient of the log-likelihood, is a weirdly handy concept in statistics and machine learning.
We usually run into the score function in the context of maximum likelihood estimation (MLE). It quantifies the sensitivity of the likelihood function to changes in the parameter of interest, and traditionally plays a crucial role in parameter estimation, hypothesis testing, and the development of asymptotic theory. Because the score of a density does not depend on its (often intractable) normalising constant, it stays tractable even in high dimensions, which makes it a popular tool for optimisation and estimation in machine learning and statistics.
1 Definition
Let $p(\mathbf{x} \mid \boldsymbol{\theta})$ denote the likelihood of data $\mathbf{x}$ given a parameter vector $\boldsymbol{\theta}$.

The score function is the gradient of the log-likelihood with respect to the parameters:

$$s(\boldsymbol{\theta}; \mathbf{x}) = \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \boldsymbol{\theta}).$$

For a set of independent samples $\mathbf{x}_1, \dots, \mathbf{x}_n$, the log-likelihood is a sum, and so the score is too:

$$s(\boldsymbol{\theta}; \mathbf{x}_1, \dots, \mathbf{x}_n) = \sum_{i=1}^{n} \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}_i \mid \boldsymbol{\theta}).$$

The score function measures how sensitive the log-likelihood function is to changes in the parameter vector $\boldsymbol{\theta}$.
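To make the definition concrete, here is a minimal numpy sketch (my own toy example, not from the text): the score of i.i.d. Gaussian samples with respect to the mean, checked against a finite-difference derivative of the log-likelihood.

```python
import numpy as np

def log_likelihood(mu, x, sigma=1.0):
    """Log-likelihood of i.i.d. samples x under N(mu, sigma^2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - 0.5 * (x - mu) ** 2 / sigma**2)

def score(mu, x, sigma=1.0):
    """Score w.r.t. mu: for a Gaussian this is sum((x - mu) / sigma^2)."""
    return np.sum((x - mu) / sigma**2)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)

mu = 1.5
analytic = score(mu, x)
# Central finite difference as an independent check.
h = 1e-5
numeric = (log_likelihood(mu + h, x) - log_likelihood(mu - h, x)) / (2 * h)
assert abs(analytic - numeric) < 1e-4
```

Note how the score of the i.i.d. sample is just the sum of per-sample scores, as in the display above.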
1.1 Stein-score/Fisher score
We use here the definition of the score that is sometimes called the Fisher score: the gradient of the log-likelihood with respect to the *parameter*, $\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \boldsymbol{\theta})$. Some authors draw a distinction with the Stein score, which is the gradient with respect to the *data*:

$$\nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \boldsymbol{\theta}).$$

In the Bayesian context the distinction between these is not so sharp, since all quantities are random variables; it is only once you regard one quantity as "fixed" that the distinction becomes important.
2 Handy Properties
The classic use of the score function is in maximum likelihood estimation, where we set the score function to zero, $s(\hat{\boldsymbol{\theta}}; \mathbf{x}) = \mathbf{0}$, and solve for the maximum likelihood estimate $\hat{\boldsymbol{\theta}}$ of the parameter vector; but let's look at some fancier properties.
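Solving the score equation can be done by Newton's method on the score itself. A small sketch under toy assumptions of my own (Gaussian data with known variance, where the root of the score is just the sample mean):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=500)
sigma2 = 4.0  # known variance

def score(mu):
    return np.sum(x - mu) / sigma2   # s(mu) = sum(x - mu) / sigma^2

def score_derivative(mu):
    return -len(x) / sigma2          # d s / d mu, used as the Newton step

mu = 0.0
for _ in range(50):
    mu -= score(mu) / score_derivative(mu)

# For a Gaussian with known variance, the root of the score is the sample mean.
assert abs(mu - x.mean()) < 1e-8
```

Here the score is linear in `mu`, so Newton converges in a single step; in general this is the scoring algorithm.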
2.1 Zero Mean Property
Under regularity conditions (i.e. we can exchange integration and differentiation), the expected value of the score function, evaluated at the true parameter, is zero:

$$\mathbb{E}_{p(\mathbf{x} \mid \boldsymbol{\theta})}\!\left[ s(\boldsymbol{\theta}; \mathbf{x}) \right] = \int \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \boldsymbol{\theta})\; p(\mathbf{x} \mid \boldsymbol{\theta})\,\mathrm{d}\mathbf{x} = \int \nabla_{\boldsymbol{\theta}}\, p(\mathbf{x} \mid \boldsymbol{\theta})\,\mathrm{d}\mathbf{x} = \nabla_{\boldsymbol{\theta}} \int p(\mathbf{x} \mid \boldsymbol{\theta})\,\mathrm{d}\mathbf{x} = \nabla_{\boldsymbol{\theta}}\, 1 = \mathbf{0}.$$
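A quick Monte Carlo sanity check of the zero-mean property, using a univariate Gaussian as the model (my choice of toy example):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 0.5, 1.0
# Draw from the model at the *true* parameter, then average the score.
x = rng.normal(mu, sigma, size=200_000)
scores = (x - mu) / sigma**2      # per-sample score of N(mu, sigma^2) w.r.t. mu
assert abs(scores.mean()) < 0.01  # should be near zero
```

Note the property only holds when the samples come from the model at the same parameter value at which we evaluate the score.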
2.2 Variance of the Score Function and Fisher Information
The variance of the score function is known as the Fisher information matrix:

$$\mathcal{I}(\boldsymbol{\theta}) = \operatorname{Var}\!\left[ s(\boldsymbol{\theta}; \mathbf{x}) \right] = \mathbb{E}\!\left[ s(\boldsymbol{\theta}; \mathbf{x})\, s(\boldsymbol{\theta}; \mathbf{x})^{\top} \right],$$

where the second equality uses the zero-mean property above. Fisher information quantifies the amount of information that an observable random vector $\mathbf{x}$ carries about the unknown parameter vector $\boldsymbol{\theta}$.
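We can estimate the Fisher information by the sample variance of simulated scores. A sketch, again using a toy Gaussian location model where the answer is known to be $1/\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.0, 2.0
x = rng.normal(mu, sigma, size=500_000)
scores = (x - mu) / sigma**2
# Fisher information for the mean of N(mu, sigma^2) is 1/sigma^2 = 0.25.
fisher_mc = scores.var()
assert abs(fisher_mc - 1 / sigma**2) < 0.01
```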
2.3 Fisher’s Identity
Fisher's identity provides a relationship between the variance of the score function and the expected value of the second derivative (Hessian) of the log-likelihood function:

$$\mathcal{I}(\boldsymbol{\theta}) = -\mathbb{E}\!\left[ \nabla^2_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \boldsymbol{\theta}) \right],$$

where $\nabla^2_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \boldsymbol{\theta})$ is the Hessian matrix of the log-likelihood with respect to $\boldsymbol{\theta}$.

This identity connects the variability of the score function to the curvature of the log-likelihood function, linking the observed information and the expected information.
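The identity is easy to check by simulation in a model where both sides are computable. A sketch with a Gaussian location model (my choice): the per-sample Hessian with respect to the mean is the constant $-1/\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.0, 1.5
x = rng.normal(mu, sigma, size=500_000)
score = (x - mu) / sigma**2                # d/dmu log p(x | mu)
hessian = np.full_like(x, -1 / sigma**2)   # d^2/dmu^2 log p(x | mu)
lhs = np.mean(score**2)                    # E[s^2] = Var[s], since E[s] = 0
rhs = -np.mean(hessian)                    # -E[Hessian]
assert abs(lhs - rhs) < 0.01
```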
3 Conditional Score Functions
Consider random vectors $\mathbf{x}$ and $\mathbf{y}$ with joint density

$$p(\mathbf{x}, \mathbf{y} \mid \boldsymbol{\theta}) = p(\mathbf{x} \mid \mathbf{y}, \boldsymbol{\theta})\, p(\mathbf{y} \mid \boldsymbol{\theta}),$$

where $p(\mathbf{x} \mid \mathbf{y}, \boldsymbol{\theta})$ is the conditional density of $\mathbf{x}$ given $\mathbf{y}$.

The conditional score function is

$$s(\boldsymbol{\theta}; \mathbf{x} \mid \mathbf{y}) = \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \mathbf{y}, \boldsymbol{\theta}).$$

Using the properties of logarithms, $\log p(\mathbf{x} \mid \mathbf{y}, \boldsymbol{\theta}) = \log p(\mathbf{x}, \mathbf{y} \mid \boldsymbol{\theta}) - \log p(\mathbf{y} \mid \boldsymbol{\theta})$, so

$$s(\boldsymbol{\theta}; \mathbf{x} \mid \mathbf{y}) = s(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y}) - s(\boldsymbol{\theta}; \mathbf{y}),$$

where $s(\boldsymbol{\theta}; \mathbf{x}, \mathbf{y}) = \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}, \mathbf{y} \mid \boldsymbol{\theta})$ is the joint score function and $s(\boldsymbol{\theta}; \mathbf{y}) = \nabla_{\boldsymbol{\theta}} \log p(\mathbf{y} \mid \boldsymbol{\theta})$ is the marginal score function of $\mathbf{y}$.

We can read this expression as saying that the conditional score function measures the additional information about $\boldsymbol{\theta}$ carried by $\mathbf{x}$ beyond what $\mathbf{y}$ already provides.
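A tiny worked check of the decomposition, in a toy hierarchical model of my own devising where the answer is obvious: if $y \sim \mathcal{N}(\theta, 1)$ and $x \sim \mathcal{N}(y, 1)$, then given $y$, $x$ carries no extra information about $\theta$, so the conditional score should vanish.

```python
# Toy model: y ~ N(theta, 1), x ~ N(y, 1).
theta, y, x = 0.7, 1.3, 2.1   # arbitrary values

joint_score = (y - theta)      # d/dtheta [log p(y|theta) + log p(x|y)]
marginal_score = (y - theta)   # d/dtheta log p(y|theta)
conditional_score = joint_score - marginal_score
assert conditional_score == 0.0   # x adds nothing about theta beyond y
```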
4 Bayes’ Rule in Score Functions
Bayes' rule updates our beliefs about the parameter vector $\boldsymbol{\theta}$ after observing data $\mathbf{x}$:

$$p(\boldsymbol{\theta} \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\mathbf{x})},$$

where $p(\boldsymbol{\theta})$ is the prior distribution of $\boldsymbol{\theta}$ and $p(\mathbf{x}) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}$ is the marginal likelihood of $\mathbf{x}$.

The score function for the posterior distribution (posterior score function?) is:

$$\nabla_{\boldsymbol{\theta}} \log p(\boldsymbol{\theta} \mid \mathbf{x}) = \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \boldsymbol{\theta}) + \nabla_{\boldsymbol{\theta}} \log p(\boldsymbol{\theta}) - \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}).$$

Since $p(\mathbf{x})$ does not depend on $\boldsymbol{\theta}$, its gradient vanishes, and

$$\nabla_{\boldsymbol{\theta}} \log p(\boldsymbol{\theta} \mid \mathbf{x}) = s(\boldsymbol{\theta}; \mathbf{x}) + \nabla_{\boldsymbol{\theta}} \log p(\boldsymbol{\theta}),$$

where $s(\boldsymbol{\theta}; \mathbf{x}) = \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \boldsymbol{\theta})$ is the likelihood score and $\nabla_{\boldsymbol{\theta}} \log p(\boldsymbol{\theta})$ is the prior score: the posterior score is simply the likelihood score plus the prior score, with no normalising constant in sight.
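The additivity is easy to verify in a conjugate toy model of my own choosing, where the posterior score is also available in closed form: prior $\theta \sim \mathcal{N}(0, 1)$ and likelihood $x \mid \theta \sim \mathcal{N}(\theta, 1)$ give posterior $\mathcal{N}(x/2, 1/2)$.

```python
# Conjugate toy model: theta ~ N(0, 1), x | theta ~ N(theta, 1).
x, theta = 1.8, 0.3

likelihood_score = (x - theta)   # d/dtheta log p(x | theta)
prior_score = -theta             # d/dtheta log p(theta)
posterior_score = likelihood_score + prior_score

# Direct score of the closed-form posterior N(x/2, 1/2):
direct = -(theta - x / 2) / 0.5
assert abs(posterior_score - direct) < 1e-12
```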
5 Generating a Conditional Score Function via a Differentiable Forward Model
Consider a model $\mathbf{y} = f(\mathbf{x})$.

Assume:

- $\mathbf{x}$ is a random vector with pdf $p_{\mathbf{x}}(\mathbf{x})$.
- $\mathbf{y} = f(\mathbf{x})$, where $f$ is a differentiable and invertible function.
- We observe $\mathbf{y}$ and wish to infer information about $\mathbf{x}$ through $f$.
The joint pdf of $(\mathbf{x}, \mathbf{y})$ is

$$p(\mathbf{x}, \mathbf{y}) = p_{\mathbf{x}}(\mathbf{x})\,\delta(\mathbf{y} - f(\mathbf{x})),$$

where $\delta(\cdot)$ is the Dirac delta function.

The marginal pdf of $\mathbf{y}$ is

$$p_{\mathbf{y}}(\mathbf{y}) = \int p_{\mathbf{x}}(\mathbf{x})\,\delta(\mathbf{y} - f(\mathbf{x}))\,\mathrm{d}\mathbf{x}.$$

By the sifting property of the Dirac delta function:

$$p_{\mathbf{y}}(\mathbf{y}) = p_{\mathbf{x}}\!\left(f^{-1}(\mathbf{y})\right)\left|\det J_{f^{-1}}(\mathbf{y})\right|,$$

where $J_{f^{-1}}(\mathbf{y})$ is the Jacobian matrix of $f^{-1}$ evaluated at $\mathbf{y}$.
The conditional pdf of $\mathbf{x}$ given $\mathbf{y}$ is

$$p(\mathbf{x} \mid \mathbf{y}) = \frac{p(\mathbf{x}, \mathbf{y})}{p_{\mathbf{y}}(\mathbf{y})} = \frac{p_{\mathbf{x}}(\mathbf{x})\,\delta(\mathbf{y} - f(\mathbf{x}))}{p_{\mathbf{y}}(\mathbf{y})},$$

which, since $f$ is invertible, collapses to a point mass at $\mathbf{x} = f^{-1}(\mathbf{y})$; all the usable distributional information sits in $p_{\mathbf{y}}$.

The conditional score function is:

$$\nabla_{\mathbf{y}} \log p_{\mathbf{y}}(\mathbf{y}) - J_{f^{-1}}(\mathbf{y})^{\top}\, \nabla_{\mathbf{x}} \log p_{\mathbf{x}}(\mathbf{x})\Big|_{\mathbf{x} = f^{-1}(\mathbf{y})}.$$

Substituting the expression for $p_{\mathbf{y}}(\mathbf{y})$ and applying the chain rule:

$$\nabla_{\mathbf{y}} \log p_{\mathbf{y}}(\mathbf{y}) = J_{f^{-1}}(\mathbf{y})^{\top}\, \nabla_{\mathbf{x}} \log p_{\mathbf{x}}(\mathbf{x})\Big|_{\mathbf{x} = f^{-1}(\mathbf{y})} + \nabla_{\mathbf{y}} \log\left|\det J_{f^{-1}}(\mathbf{y})\right|.$$

Since the first term is just the score of $\mathbf{x}$ transported back through $f^{-1}$,

$$\nabla_{\mathbf{y}} \log p_{\mathbf{y}}(\mathbf{y}) - J_{f^{-1}}(\mathbf{y})^{\top}\, \nabla_{\mathbf{x}} \log p_{\mathbf{x}}\!\left(f^{-1}(\mathbf{y})\right) = \nabla_{\mathbf{y}} \log\left|\det J_{f^{-1}}(\mathbf{y})\right|,$$

i.e. the conditional score function is the difference between the score function of $\mathbf{y}$ and the back-transformed score function of $\mathbf{x}$, and it equals the gradient of the log-determinant of the Jacobian.
5.1 Example: Multivariate Normal Distribution with a Linear Transformation
Suppose the following:

- $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
- $\mathbf{y} = A\mathbf{x} + \mathbf{b}$, where $A$ is a known invertible matrix and $\mathbf{b}$ is a known vector.
- We observe $\mathbf{y}$ and wish to compute the conditional score function.
We have $f^{-1}(\mathbf{y}) = A^{-1}(\mathbf{y} - \mathbf{b})$, so $J_{f^{-1}}(\mathbf{y}) = A^{-1}$.

Since $\mathbf{x}$ is Gaussian and the transformation is affine, $\mathbf{y}$ is also Gaussian. Thus,

$$\mathbf{y} \sim \mathcal{N}\!\left(A\boldsymbol{\mu} + \mathbf{b},\; A\boldsymbol{\Sigma}A^{\top}\right).$$

Since the score of a Gaussian is linear in its argument,

$$\nabla_{\mathbf{y}} \log p_{\mathbf{y}}(\mathbf{y}) = -\left(A\boldsymbol{\Sigma}A^{\top}\right)^{-1}(\mathbf{y} - A\boldsymbol{\mu} - \mathbf{b}).$$

Therefore, using $\nabla_{\mathbf{x}} \log p_{\mathbf{x}}(\mathbf{x}) = -\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})$,

$$A^{-\top}\, \nabla_{\mathbf{x}} \log p_{\mathbf{x}}(\mathbf{x})\Big|_{\mathbf{x} = A^{-1}(\mathbf{y} - \mathbf{b})} = -A^{-\top}\boldsymbol{\Sigma}^{-1}A^{-1}(\mathbf{y} - A\boldsymbol{\mu} - \mathbf{b}) = -\left(A\boldsymbol{\Sigma}A^{\top}\right)^{-1}(\mathbf{y} - A\boldsymbol{\mu} - \mathbf{b}).$$

Since $J_{f^{-1}} = A^{-1}$ is constant, $\nabla_{\mathbf{y}} \log\left|\det A^{-1}\right| = \mathbf{0}$. Therefore

$$\nabla_{\mathbf{y}} \log p_{\mathbf{y}}(\mathbf{y}) - A^{-\top}\, \nabla_{\mathbf{x}} \log p_{\mathbf{x}}\!\left(f^{-1}(\mathbf{y})\right) = \mathbf{0}.$$

In this example, the conditional score function vanishes: the score of $\mathbf{y}$ is exactly the back-transformed score of $\mathbf{x}$, because the Jacobian of an affine map is constant.
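The linear-Gaussian calculation above can be verified numerically. A small numpy sketch (dimensions and the particular random matrices are my own choices):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 3
mu = rng.normal(size=d)
L = rng.normal(size=(d, d))
Sigma = L @ L.T + d * np.eye(d)              # SPD covariance
A = rng.normal(size=(d, d)) + d * np.eye(d)  # invertible (w.p. 1)
b = rng.normal(size=d)
y = rng.normal(size=d)                       # arbitrary observation point

# Score of y ~ N(A mu + b, A Sigma A^T):
Sy = A @ Sigma @ A.T
score_y = -np.linalg.solve(Sy, y - A @ mu - b)

# Back-transformed score of x, evaluated at x = A^{-1}(y - b):
x = np.linalg.solve(A, y - b)
score_x = -np.linalg.solve(Sigma, x - mu)
transported = np.linalg.solve(A.T, score_x)  # A^{-T} score_x

assert np.allclose(score_y, transported)     # difference is zero, as derived
```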
5.2 Nonlinear Example
To illustrate a case where the conditional score function is non-zero, consider a nonlinear transformation.
Suppose:

- $\mathbf{x}$ is a random vector with pdf $p_{\mathbf{x}}(\mathbf{x})$.
- $\mathbf{y} = f(\mathbf{x})$, where $f$ is a nonlinear (but still differentiable and invertible) function.
- We observe $\mathbf{y}$ and wish to compute the conditional score function.
Same as before:

$$p_{\mathbf{y}}(\mathbf{y}) = p_{\mathbf{x}}\!\left(f^{-1}(\mathbf{y})\right)\left|\det J_{f^{-1}}(\mathbf{y})\right|.$$

Computing the score of $\mathbf{y}$:

$$\nabla_{\mathbf{y}} \log p_{\mathbf{y}}(\mathbf{y}) = J_{f^{-1}}(\mathbf{y})^{\top}\, \nabla_{\mathbf{x}} \log p_{\mathbf{x}}\!\left(f^{-1}(\mathbf{y})\right) + \nabla_{\mathbf{y}} \log\left|\det J_{f^{-1}}(\mathbf{y})\right|.$$

However, the conditional score function is:

$$\nabla_{\mathbf{y}} \log p_{\mathbf{y}}(\mathbf{y}) - J_{f^{-1}}(\mathbf{y})^{\top}\, \nabla_{\mathbf{x}} \log p_{\mathbf{x}}\!\left(f^{-1}(\mathbf{y})\right) = \nabla_{\mathbf{y}} \log\left|\det J_{f^{-1}}(\mathbf{y})\right|.$$

This expression is generally non-zero and depends on both $\mathbf{y}$ and the local geometry of $f$, since the Jacobian $J_{f^{-1}}(\mathbf{y})$ now varies with $\mathbf{y}$. In this nonlinear example, the conditional score function captures the extra distortion that the nonlinearity introduces on top of the transported score of $\mathbf{x}$.
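To see the non-zero Jacobian term concretely, here is a sketch with a hypothetical scalar forward model of my own choosing: $x \sim \mathcal{N}(0, 1)$ and $y = \sinh(x)$, so $f^{-1}(y) = \operatorname{arcsinh}(y)$ and $|J_{f^{-1}}(y)| = 1/\sqrt{1 + y^2}$. The score of $y$ is computed by finite differences and compared to the transported prior score plus the log-determinant gradient.

```python
import numpy as np

# Hypothetical forward model: x ~ N(0, 1), y = sinh(x).
def log_py(y):
    x = np.arcsinh(y)
    # log p_y(y) = log N(x; 0, 1) + log |dx/dy|, with dx/dy = 1/sqrt(1 + y^2)
    return -0.5 * np.log(2 * np.pi) - 0.5 * x**2 - 0.5 * np.log1p(y**2)

y = 0.8
h = 1e-6
score_y = (log_py(y + h) - log_py(y - h)) / (2 * h)  # finite-difference score

# Back-transformed prior score: J^T * (-x), with J = dx/dy.
x = np.arcsinh(y)
transported = (-x) / np.sqrt(1 + y**2)

# The difference should be the log-determinant gradient: -y / (1 + y^2).
logdet_grad = -y / (1 + y**2)
assert abs((score_y - transported) - logdet_grad) < 1e-5
```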
6 Sampling via the Score Function
Can we draw samples from a distribution when all we can find, or estimate, is its score function?

Yes. This is the main thing I do with the score function, TBH. Hamiltonian Monte Carlo, Langevin Monte Carlo, and score-based diffusion models are all (related) methods that use the score function to sample from a distribution.
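As a minimal illustration of score-based sampling, here is a sketch of the unadjusted Langevin algorithm targeting a standard normal (my choice of target; no Metropolis correction, so the stationary distribution is only approximate):

```python
import numpy as np

rng = np.random.default_rng(6)

def score(x):
    """Score of the standard normal target: d/dx log p(x) = -x."""
    return -x

# Unadjusted Langevin: x <- x + (eps/2) * score(x) + sqrt(eps) * noise
eps = 0.05
x = np.zeros(10_000)   # many independent chains in parallel
for _ in range(2_000):
    x += 0.5 * eps * score(x) + np.sqrt(eps) * rng.normal(size=x.shape)

# Samples should have roughly zero mean and unit standard deviation
# (up to O(eps) discretisation bias).
assert abs(x.mean()) < 0.05
assert abs(x.std() - 1.0) < 0.05
```

Note that only the score was needed: the normalising constant of the target never appears in the update.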
7 Other useful score function tricks
Score-function gradient estimators (a.k.a. REINFORCE, or the likelihood-ratio trick) are a popular choice for estimating gradients of expectations in stochastic optimisation and reinforcement learning, because they require only samples and the score of the sampling distribution, not a differentiable sample path.
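A sketch of the score-function gradient estimator on a toy problem of my own choosing: estimate $\nabla_{\theta}\, \mathbb{E}_{x \sim \mathcal{N}(\theta, 1)}[x^2]$, whose true value is $2\theta$, using $\nabla_{\theta}\, \mathbb{E}[f(x)] = \mathbb{E}[f(x)\, \nabla_{\theta} \log p(x \mid \theta)]$.

```python
import numpy as np

rng = np.random.default_rng(7)

theta = 1.0
x = rng.normal(theta, 1.0, size=1_000_000)
# Score of N(theta, 1) w.r.t. theta is (x - theta).
grad_est = np.mean(x**2 * (x - theta))

# Analytic answer: E[x^2] = theta^2 + 1, so the gradient is 2 * theta = 2.
assert abs(grad_est - 2.0) < 0.05
```

The estimator is unbiased but often high-variance; in practice a baseline (control variate) is subtracted from `x**2` to tame it.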