Monte Carlo gradient estimation

September 30, 2020 — July 2, 2024

Tags: Bayes, calculus, density, estimator, distribution, Monte Carlo, probabilistic algorithms, probability, risk, uncertainty

Taking gradients through integrals/expectations using randomness, i.e. can I estimate this?

\[ g[f] := \frac{\partial}{\partial \theta} \mathbb{E}_{\mathsf{x}\sim p(\mathsf{x};\theta)}[f(\mathsf{x})] \]

A concept with a similar name, but which is not the same thing, is Stochastic Gradient MCMC, which uses stochastic gradients to sample from a target posterior distribution. Some similar tools and concepts pop up in both applications.

1 Score function estimator

A.k.a. REINFORCE (all-caps, for some reason?). A generic method that works on lots of things, including discrete variables, but notoriously high-variance if done naïvely. Credited to (Williams 1992), but surely it must be older than that?

\[ \begin{aligned} \hat{g}_{\text{REINFORCE}}(f) &= \frac{\partial}{\partial \theta} \int f(x)\, p(x;\theta)\,\mathrm{d} x\\ &= \int f(x)\, \frac{\partial}{\partial \theta} p(x;\theta)\,\mathrm{d} x\\ &= \int f(x)\, p(x;\theta)\, \frac{\partial}{\partial \theta}\log p(x;\theta)\,\mathrm{d} x\\ &= \mathbb{E}_{\mathsf{x}\sim p(\mathsf{x};\theta)}\left[f(\mathsf{x})\, \frac{\partial}{\partial \theta}\log p(\mathsf{x};\theta)\right], \end{aligned} \]

using the log-derivative identity \(\frac{\partial}{\partial \theta} p = p \frac{\partial}{\partial \theta}\log p\).

I am pretty sure this was called a “score function estimator” in my statistics degree.

For unifying overviews, see (Mohamed et al. 2020; Schulman et al. 2015; van Krieken, Tomczak, and Teije 2021) and the Storchastic docs.

It is annoyingly hard to find a simple example of this method online, despite its simplicity; all the code examples I can find wrap it up with reinforcement learning or some other unnecessarily specific complexity.

Laurence Davies and I knocked together this demo, in which we try to find the parameters that minimise the difference between the categorical distribution we sample from and some target distribution.

import torch

# True target distribution probabilities
true_probs = torch.tensor([0.1, 0.6, 0.3])

# Optimization parameters
n_batch = 1000
n_iter = 3000
lr = 0.01

def f(x):
    """
    The target function: the negative log-likelihood of x under a
    categorical distribution with the true probabilities.
    The minus sign is important, since this algorithm _minimises_.
    """
    return -torch.distributions.Multinomial(
        total_count=1, probs=true_probs).log_prob(x)


# Set the seed for reproducibility
torch.manual_seed(42)

# Initialize the parameter estimates
theta_hat = torch.nn.Parameter(torch.tensor([0., 0., 0.]))
optimizer = torch.optim.Adam([theta_hat], lr=lr)

for epoch in range(n_iter):
    optimizer.zero_grad()
    # Sample from the estimated distribution
    x_sample = torch.distributions.Multinomial(
        1, logits=theta_hat).sample((n_batch,))
    # Evaluate the log density at the sample points
    log_p_theta_x = torch.distributions.Multinomial(
        1, logits=theta_hat).log_prob(x_sample)
    # Evaluate the target function at the sample points
    f_hat = f(x_sample)

    # Compute the REINFORCE gradient in a single vector-Jacobian product.
    # Passing `f_hat` as `grad_outputs` weights each sample's score
    # d/dtheta log p(x_i; theta) by f(x_i), so this returns
    # sum_i f(x_i) * d/dtheta log p(x_i; theta); dividing by n_batch
    # averages over the batch.
    final_gradients = torch.autograd.grad(
        outputs=log_p_theta_x, inputs=theta_hat,
        grad_outputs=f_hat.detach())[0] / n_batch
    theta_hat.grad = final_gradients

    optimizer.step()

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Estimated Probs:"
          f"{torch.softmax(theta_hat, dim=0).detach().numpy()}")

# Display the final estimated probabilities
estimated_final_probs = torch.softmax(theta_hat, dim=0)
print("Final Estimated Probabilities: "
  f" {estimated_final_probs.detach().numpy()}"
  f" (True Probabilities: {true_probs.detach().numpy()}")

Note that the batch size there is very large. If we set it to be smaller, the variance of the estimator is too high to be useful; that is the notorious problem of this estimator.

Classically we would address this with a diminishing learning rate as per SGD theory, but I have lazily not done that here.
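A common next step, sketched here rather than done properly in the demo, is a baseline: subtract a constant \(b\) from \(f\) before weighting the scores. Because \(\mathbb{E}\left[\frac{\partial}{\partial \theta}\log p(\mathsf{x};\theta)\right]=0\), any \(b\) that does not depend on the current sample leaves the estimator unbiased, while a good choice of \(b\) shrinks its variance. Swapped into the gradient computation in the loop above:

    # Baseline variant (sketch): the batch mean of f_hat is a simple
    # choice of b. Strictly speaking, computing b from the same batch
    # introduces a small bias; a leave-one-out mean would avoid that.
    baseline = f_hat.detach().mean()
    final_gradients = torch.autograd.grad(
        outputs=log_p_theta_x, inputs=theta_hat,
        grad_outputs=(f_hat.detach() - baseline))[0] / n_batch
    theta_hat.grad = final_gradients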

1.1 Rao-Blackwellization

Rao-Blackwellization (Casella and Robert 1996) seems like a natural trick for reducing the variance of gradient estimators like the ones above. How would it work? Liu et al. (2019) is a contemporary example; I have a vague feeling that I saw something similar in Reuven Y. Rubinstein and Kroese (2016). TODO: follow up.
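For reference, the generic pattern is textbook material rather than anything specific to those papers: condition an unbiased estimator \(\hat{g}\) on some statistic \(\mathsf{s}\) of the sample. The conditional expectation has the same mean and, by the law of total variance, no more variance:

\[\hat{g}_{\text{RB}} := \mathbb{E}[\hat{g}\mid \mathsf{s}],\qquad \mathbb{E}[\hat{g}_{\text{RB}}] = \mathbb{E}[\hat{g}],\qquad \operatorname{Var}(\hat{g}_{\text{RB}}) = \operatorname{Var}(\hat{g}) - \mathbb{E}[\operatorname{Var}(\hat{g}\mid \mathsf{s})] \le \operatorname{Var}(\hat{g}).\]

The hard part is picking an \(\mathsf{s}\) for which that conditional expectation is actually computable.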

2 Reparameterization trick

Define some base distribution \(\mathsf{e}\sim p(\mathsf{e})\) and a transform \(T\) such that \(T(\mathsf{e};\theta)\sim p(\mathsf{x};\theta)\), i.e. pushing the base variable through \(T\) recovers the target distribution. Then \[\begin{aligned} \hat{g}_{\text{reparam}}(f) &= \frac{\partial}{\partial \theta} \mathbb{E}_{\mathsf{e}\sim p(\mathsf{e})}[f(T(\mathsf{e};\theta))] \\ &= \mathbb{E}_{\mathsf{e}\sim p(\mathsf{e})}\left[\frac{\partial f}{\partial T}\frac{\partial T}{\partial \theta}(\mathsf{e};\theta)\right]. \end{aligned}\]

Less general but better-behaved than the score-function/REINFORCE estimator.

See reparameterization trick for more about that.
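To make it concrete, here is a toy sketch of my own in the same PyTorch style as the demo above (not taken from that page): fit a Gaussian \(q=\mathcal{N}(\mu,\sigma)\) to a target distribution by minimising a Monte Carlo estimate of \(\operatorname{KL}(q\|p)\), with gradients flowing through the samples via the transform \(T(\mathsf{e};\theta)=\mu+\sigma\mathsf{e}\):

import torch

torch.manual_seed(42)
target = torch.distributions.Normal(2.0, 0.5)

mu = torch.nn.Parameter(torch.tensor(0.0))
log_sigma = torch.nn.Parameter(torch.tensor(0.0))
optimizer = torch.optim.Adam([mu, log_sigma], lr=0.05)

for epoch in range(2000):
    optimizer.zero_grad()
    e = torch.randn(64)                # base noise, free of parameters
    x = mu + torch.exp(log_sigma) * e  # x = T(e; theta), differentiable in theta
    q = torch.distributions.Normal(mu, torch.exp(log_sigma))
    # Monte Carlo estimate of KL(q || target); gradients flow through T
    loss = (q.log_prob(x) - target.log_prob(x)).mean()
    loss.backward()
    optimizer.step()

# Should approach mu ~ 2.0, sigma ~ 0.5
print(mu.item(), torch.exp(log_sigma).item())

Note there is no score-weighting here; the sample x itself carries the dependence on the parameters, which is the whole trick.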

3 Parametric

I can imagine that our observed rv \({\mathsf{x}}\in \mathbb{R}\) is generated via lookups from its inverse CDF \(F^{-1}(\cdot;\theta)\) with parameter \(\theta\): \[\mathsf{x} = F^{-1}(\mathsf{u};\theta) \] where \(\mathsf{u}\sim\operatorname{Uniform}(0,1)\). Each realization corresponds to an independent draw \(u_i\sim \mathsf{u}\). How can I get the derivative of such a map?

Maybe I generated my original variable not by the inverse-CDF method but by simulating some variable \({\mathsf{z}}\sim F(\cdot; \theta).\) In that case I may as well have generated those \(\mathsf{u}_i\) by taking \(\mathsf{u}_i=F(\mathsf{z}_i;\theta)\), so conceptually I am generating my RV by fixing a realization \(z_i\) of \(\mathsf{z}\) and taking \(\mathsf{x} := F^{-1}(F(z_i;\theta);\tau)\) for a perturbed parameter \(\tau\). So to find the effect of my perturbation what I actually need is

\[\left.\frac{\partial}{\partial \tau} F^{-1}(F(z;\theta);\tau)\right|_{\tau=\theta}.\]

Does this do what we want? Kinda. Suppose that the parameter in question is something boring, such as the location parameter of a location-scale family, i.e. \(F(\cdot;\theta)=F(\cdot-\theta;0).\) Then \(F^{-1}(\cdot;\theta)=F^{-1}(\cdot;0)+\theta\) and thus

\[\begin{aligned} \left.\frac{\partial}{\partial \tau} F^{-1}(F(z;\theta);\tau)\right|_{\tau=\theta} &=\left.\frac{\partial}{\partial \tau} F^{-1}(F(z-\theta;0);0)+\tau\right|_{\tau=\theta}\\ &=\left.\frac{\partial}{\partial \tau}\left(z-\theta+\tau\right)\right|_{\tau=\theta}\\ &=1\\ \end{aligned}\]

OK grand that came out simple enough.
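Here is a quick numeric sanity check of that calculation with autograd (my addition, using a unit-variance Gaussian location family):

import torch

# For a location family, d/d tau of F^{-1}(F(z; theta); tau) at
# tau = theta should be exactly 1.
torch.manual_seed(0)
theta = torch.tensor(1.5)
z = torch.distributions.Normal(theta, 1.0).sample()

tau = theta.clone().requires_grad_(True)
u = torch.distributions.Normal(theta, 1.0).cdf(z)  # u = F(z; theta)
x = torch.distributions.Normal(tau, 1.0).icdf(u)   # x = F^{-1}(u; tau)
x.backward()
print(tau.grad)  # tensor(1.)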

TBC

4 “Measure-valued”

TBD (Mohamed et al. 2020; Rosca et al. 2019).

5 Tooling

van Krieken, Tomczak, and Teije (2021) supply us with a large library of PyTorch tools for stochastic gradient estimation purposes, under the rubric Storchastic (source). See also DeepMind's mc_gradients.

6 Optimising Monte Carlo

Let us say I need to differentiate through a Monte Carlo algorithm to alter its parameters while holding the PRNG fixed. See Tuning MC.

7 References

Ahn, Korattikara, and Welling. 2012. “Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring.” In Proceedings of the 29th International Conference on Machine Learning. ICML’12.
Arya, Schauer, Schäfer, et al. 2022. “Automatic Differentiation of Programs with Discrete Randomness.” In.
Blundell, Cornebise, Kavukcuoglu, et al. 2015. “Weight Uncertainty in Neural Networks.” In Proceedings of the 32nd International Conference on Machine Learning - Volume 37. ICML’15.
Casella, and Robert. 1996. “Rao-Blackwellisation of Sampling Schemes.” Biometrika.
Fu. 2005. “Stochastic Gradient Estimation.”
Glasserman, and Ho. 1991. Gradient Estimation Via Perturbation Analysis.
Grathwohl, Choi, Wu, et al. 2018. “Backpropagation Through the Void: Optimizing Control Variates for Black-Box Gradient Estimation.” In Proceedings of ICLR.
Grathwohl, Swersky, Hashemi, et al. 2021. “Oops I Took A Gradient: Scalable Sampling for Discrete Distributions.”
Hyvärinen. 2005. “Estimation of Non-Normalized Statistical Models by Score Matching.” The Journal of Machine Learning Research.
Jang, Gu, and Poole. 2017. “Categorical Reparameterization with Gumbel-Softmax.”
Kingma, and Welling. 2022. “Auto-Encoding Variational Bayes.”
Kool, van Hoof, and Welling. 2019. “Estimating Gradients for Discrete Random Variables by Sampling Without Replacement.” In.
Kool, van Hoof, and Welling. 2019. “Buy 4 Reinforce Samples, Get a Baseline for Free!”
Liang, Norouzi, Berant, et al. 2018. “Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing.”
Liu, Regier, Tripuraneni, et al. 2019. “Rao-Blackwellized Stochastic Gradients for Discrete Distributions.” In.
Maddison, Mnih, and Teh. 2017. “The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables.” In.
Mnih, and Gregor. 2014. “Neural Variational Inference and Learning in Belief Networks.” In Proceedings of The 31st International Conference on Machine Learning. ICML’14.
Mnih, and Rezende. 2016. “Variational Inference for Monte Carlo Objectives.”
Mohamed, Rosca, Figurnov, et al. 2020. “Monte Carlo Gradient Estimation in Machine Learning.” Journal of Machine Learning Research.
Oktay, McGreivy, Aduol, et al. 2020. “Randomized Automatic Differentiation.” arXiv:2007.10412 [cs, stat].
Prillo, and Eisenschlos. 2020. “SoftSort: A Continuous Relaxation for the Argsort Operator.”
Ranganath, Gerrish, and Blei. 2013. “Black Box Variational Inference.” arXiv:1401.0118 [cs, stat].
Richter, Boustati, Nüsken, et al. 2020. “VarGrad: A Low-Variance Gradient Estimator for Variational Inference.”
Rosca, Figurnov, Mohamed, et al. 2019. “Measure-Valued Derivatives for Approximate Bayesian Inference.” In NeurIPS Workshop on Approximate Bayesian Inference.
Rubinstein, Reuven Y., and Kroese. 2004. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning.
Rubinstein, Reuven Y., and Kroese. 2016. Simulation and the Monte Carlo Method. Wiley Series in Probability and Statistics.
Schulman, Heess, Weber, et al. 2015. “Gradient Estimation Using Stochastic Computation Graphs.” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. NIPS’15.
Shi, Sun, and Zhu. 2018. “A Spectral Approach to Gradient Estimation for Implicit Distributions.” In.
Stoker. 1986. “Consistent Estimation of Scaled Coefficients.” Econometrica.
Tucker, Mnih, Maddison, et al. 2017. “REBAR: Low-Variance, Unbiased Gradient Estimates for Discrete Latent Variable Models.” In Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17.
van Krieken, Tomczak, and Teije. 2021. “Storchastic: A Framework for General Stochastic Automatic Differentiation.” arXiv:2104.00428 [cs, stat].
Walder, Roussel, Nock, et al. 2019. “New Tricks for Estimating Gradients of Expectations.” arXiv:1901.11311 [cs, stat].
Weber, Heess, Buesing, et al. 2019. “Credit Assignment Techniques in Stochastic Computation Graphs.”
Williams. 1992. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning.” Machine Learning.
Yin, and Zhou. 2018. “ARM: Augment-REINFORCE-Merge Gradient for Stochastic Binary Networks.”