Monte Carlo gradient estimation
September 30, 2020 — July 2, 2024
Taking gradients through integrals/expectations using randomness, i.e. can I estimate this?
\[ g[f] := \frac{\partial}{\partial \theta} \mathbb{E}_{\mathsf{x}\sim p(\mathsf{x};\theta)}[f(\mathsf{x})] \]
A concept with similar name but which is not the same is Stochastic Gradient MCMC, which uses stochastic gradients to sample from a target posterior distribution. Some similar tools and concepts pop up in both applications.
1 Score function estimator
A.k.a. REINFORCE (all-caps, for some reason?) A generic method that works on lots of things, including discrete variables; notoriously high variance if done naïvely. Credited to (Williams 1992), but surely it must be older than that?
\[ \bgin{aligned} \hat{g}_{\text{REINFORCE}}(f) &= \frac{\partial}{\partial \theta} f(\mathsf{x}) \mathbb{E}_{\mathsf{x}\sim p(\mathsf{x};\theta)} \log p(\mathsf{x};\theta)\\ &= f(\mathsf{x}) \mathbb{E}_{\mathsf{x}\sim p(\mathsf{x};\theta)} \frac{\partial}{\partial \theta}\log p(\mathsf{x};\theta) \end{aligned} \]
I am pretty sure this was called a “score function estimator” in my statistics degree.
For unifying overviews, see (Mohamed et al. 2020; Schulman et al. 2015; van Krieken, Tomczak, and Teije 2021) and the Storchastic docs.
- Shakir Mohamed, Log Derivative Trick
- Syed Ashar Javed, REINFORCE vs Reparameterization Trick
It is annoyingly hard to find a simple example of this method online, despite its simplicity; all the code example I can see always wrap it up with reinforcement learning or some other unnecessarily-specific complexity.
Laurence Davies and I knocked together this demo, in which we try to find the parameters which minimises a difference between the categorical distribution we sample from, and some target distribution.
import torch
# True target distribution probabilities
true_probs = torch.tensor([0.1, 0.6, 0.3])
# Optimization parameters
n_batch = 1000
n_iter = 3000
lr = 0.01
def f(x):
"""
The target function, a likelihood for a categorical distribution with
the given probabilities.
The minus sign is important, since this algorithm _minimises_
"""
return -torch.distributions.Multinomial(
total_count=1, probs=true_probs).log_prob(x)
# Set the seed for reproducibility
torch.manual_seed(42)
# Initialize the parameter estimates
theta_hat = torch.nn.Parameter(torch.tensor([0., 0., 0.]))
optimizer = torch.optim.Adam([theta_hat], lr=lr)
for epoch in range(n_iter):
optimizer.zero_grad()
# Sample from the estimated distribution
x_sample = torch.distributions.Multinomial(
1, logits=theta_hat).sample((n_batch,))
# exaluate log density at the sample points
log_p_theta_x = torch.distributions.Multinomial(
1, logits=theta_hat).log_prob(x_sample)
# Evaluate the target function at the sample points
f_hat = f(x_sample)
# Compute the gradient of the log density with respect to parameters
# The `grad_outputs` should multiply the `f_hat` with the gradient directly
grad_log_p_theta_x = torch.autograd.grad(
outputs=log_p_theta_x, inputs=theta_hat,
grad_outputs=torch.ones_like(log_p_theta_x),
create_graph=True)[0]
# The final gradients are weighted over the sample points
final_gradients = (
f_hat.detach().unsqueeze(1)
* grad_log_p_theta_x
).mean(dim=0)
theta_hat.grad = final_gradients
optimizer.step()
if epoch % 100 == 0:
print(f"Epoch {epoch}, Estimated Probs:"
f"{torch.softmax(theta_hat, dim=0).detach().numpy()}")
# Display the final estimated probabilities
estimated_final_probs = torch.softmax(theta_hat, dim=0)
print("Final Estimated Probabilities: "
f" {estimated_final_probs.detach().numpy()}"
f" (True Probabilities: {true_probs.detach().numpy()}")
Note that the batch size there is very large. If we set it to be smaller, the variance of the estimator is too high to be useful; that is the notorious problem of this estimator.
Classically we would address this with a diminishing learning rate as per SGD theory, but I have lazily not done that here.
1.1 Rao-Blackwellization
Rao-Blackwellization (Casella and Robert 1996) seems like a natural trick to think about for gradient like the above estimators to reduce their variance. How would it work? Liu et al. (2019) is a contemporary example; I have a vague feeling that I saw something similar in Reuven Y. Rubinstein and Kroese (2016). TODO: follow up.
2 Reparameterization trick
Define some base distribution \(\mathsf{e}\sim p(\mathsf{e})\) such that \(f(\mathsf{x};\theta) \simeq f(T(\mathsf{e});\theta)\) for some transform \(T\). Then \[\begin{aligned} \hat{g}_{\text{reparam}}(f) &= \frac{\partial}{\partial \theta} \mathbb{E}_{\mathsf{e}\sim p(\mathsf{e})}[f(\mathsf{x})] \\ &= \mathbb{E}_{\mathsf{e}\sim p(\mathsf{e})}\left[\frac{\partial f}{\partial T}\frac{\partial T}{\partial \theta}(\mathsf{e};\theta)\right]. \end{aligned}\]
Less general but better-behaved than the score-function/REINFORCE estimator.
See reparameterization trick for more about that.
3 Parametric
I can imagine that our observed rv \({\mathsf{x}}\in \mathbb{R}\) is generated via lookups from its iCDF $F(;) $ with parameter \(\theta\): \[\mathsf{x} = F^{-1}(\mathsf{u};\theta) \] where \(\mathsf{u}\sim\operatorname{Uniform}(0,1)\). Each realization corresponds to a choice of \(u_i\sim \mathsf{u}\) independently. How can I get the derivative of such a map?
Maybe I generated my original variable not by the icdf method but by simulating some variable \({\mathsf{z}}\sim F(\cdot; \theta).\) In which case I may as well have generated those \(\mathsf{u}_i\) by taking \(\mathsf{u}_i=F(\mathsf{z}_i;\theta)\) for some \(\mathsf{z} \sim F(\cdot;\theta)\) and I am conceptually generating my RV by fixing \(z_i\sim\mathsf{z}_i\) and taking \(\phi := F^{-1}(F(z_i;\theta);\tau).\) So to find the effect of my perturbation what I actually need is
\[\begin{aligned} \left.\frac{\partial}{\partial \tau} F^{-1}(F(z;\theta);\tau)\right|_{\tau=\theta}\\ \end{aligned}\]
Does this do what we want? Kinda. So suppose that the parameters in question are something boring, such as the location parameter of a location-scale distribution, i.e. \(F(\cdot;\theta)=F(\cdot-\theta;0).\) Then \(F^{-1}(\cdot;\theta)=F^{-1}(\cdot;0)+\theta\) and thus
\[\begin{aligned} \left.\frac{\partial}{\partial \tau} F^{-1}(F(z;\theta);\tau)\right|_{\tau=\theta} &=\left.\frac{\partial}{\partial \tau} F^{-1}(F(z-\theta;0);0)+\tau\right|_{\tau=\theta}\\ &=\left.\frac{\partial}{\partial \tau}\left(z-\theta+\tau\right)\right|_{\tau=\theta}\\ &=1\\ \end{aligned}\]
OK grand that came out simple enough.
TBC
4 “Measure-valued”
TBD (Mohamed et al. 2020; Rosca et al. 2019).
5 Tooling
van Krieken, Tomczak, and Teije (2021) supplies us with a large library of pytorch tools for stochastic gradient estimation purposes, under the rubric Storchastic. (Source.). See also Deepmind’s mc_gradients.
6 Optimising Monte Carlo
Let us say I need to differentiate through a monte carlo algorithm to alter its parameters while holding the PRNG fixed. See Tuning MC.