# Monte Carlo gradient estimation

September 30, 2020 — May 22, 2024

Bayes
calculus
density
estimator distribution
Monte Carlo
probabilistic algorithms
probability
risk
uncertainty

Taking gradients through integrals/expectations using randomness, i.e. can I estimate this?

$g[f] := \frac{\partial}{\partial \theta} \mathbb{E}_{\mathsf{x}\sim p(\mathsf{x};\theta)}[f(\mathsf{x})]$

A concept with similar name but which is not the same is Stochastic Gradient MCMC, which uses stochastic gradients to sample from a target posterior distribution. Some similar tools and concepts pop up in both applications.

## 1 Score function estimator

A.k.a. REINFORCE (all-caps, for some reason?) A generic method that works on lots of things, including discrete variables; notoriously high variance if done naïvely. Credited to , but surely it must be older than that?

$\hat{g}_{\text{REINFORCE}}(f) = f(\mathsf{x})\frac{\partial}{\partial \theta} \mathbb{E}_{\mathsf{x}\sim p(\mathsf{x};\theta)} \log p(\mathsf{x};\theta)$

I am pretty sure this was called a “score function estimator” in my statistics degreee.

For unifying overviews see and the Storchastic docs.

### 1.1 Rao-Blackwellization

Rao-Blackwellization seems like a natural trick gradient estimators. How would it work? Liu et al. (2019) is a contemporary example; I have a vague feeling that I saw something similar in Reuven Y. Rubinstein and Kroese (2016). TODO: follow up.

## 2 Reparameterization trick

Define some base distribution $$\mathsf{e}\sim p(\mathsf{e})$$ such that $$f(\mathsf{x};\theta) \simeq f(T(\mathsf{e});\theta)$$ for some transform $$T$$. Then \begin{aligned} \hat{g}_{\text{reparam}}(f) &= \frac{\partial}{\partial \theta} \mathbb{E}_{\mathsf{e}\sim p(\mathsf{e})}[f(\mathsf{x})] \\ &= \mathbb{E}_{\mathsf{e}\sim p(\mathsf{e})}\left[\frac{\partial f}{\partial T}\frac{\partial T}{\partial \theta}(\mathsf{e};\theta)\right]. \end{aligned}

Less general but better-behaved than the score-function/REINFORCE estimator.

See reparameterization trick for more about that.

## 3 Parametric

I can imagine that our observed rv $${\mathsf{x}}\in \mathbb{R}$$ is generated via lookups from its iCDF $F(;)$ with parameter $$\theta$$: $\mathsf{x} = F^{-1}(\mathsf{u};\theta)$ where $$\mathsf{u}\sim\operatorname{Uniform}(0,1)$$. Each realization corresponds to a choice of $$u_i\sim \mathsf{u}$$ independently. How can I get the derivative of such a map?

Maybe I generated my original variable not by the icdf method but by simulating some variable $${\mathsf{z}}\sim F(\cdot; \theta).$$ In which case I may as well have generated those $$\mathsf{u}_i$$ by taking $$\mathsf{u}_i=F(\mathsf{z}_i;\theta)$$ for some $$\mathsf{z} \sim F(\cdot;\theta)$$ and I am conceptually generating my RV by fixing $$z_i\sim\mathsf{z}_i$$ and taking $$\phi := F^{-1}(F(z_i;\theta);\tau).$$ So to find the effect of my perturbation what I actually need is

\begin{aligned} \left.\frac{\partial}{\partial \tau} F^{-1}(F(z;\theta);\tau)\right|_{\tau=\theta}\\ \end{aligned}

Does this do what we want? Kinda. So suppose that the parameters in question are something boring, such as the location parameter of a location-scale distribution, i.e. $$F(\cdot;\theta)=F(\cdot-\theta;0).$$ Then $$F^{-1}(\cdot;\theta)=F^{-1}(\cdot;0)+\theta$$ and thus

\begin{aligned} \left.\frac{\partial}{\partial \tau} F^{-1}(F(z;\theta);\tau)\right|_{\tau=\theta} &=\left.\frac{\partial}{\partial \tau} F^{-1}(F(z-\theta;0);0)+\tau\right|_{\tau=\theta}\\ &=\left.\frac{\partial}{\partial \tau}\left(z-\theta+\tau\right)\right|_{\tau=\theta}\\ &=1\\ \end{aligned}

OK grand that came out simple enough.

TBC

## 5 Tooling

van Krieken, Tomczak, and Teije (2021) supplies us with a large library of pytorch tools for stochastic gradient estimation purposes, under the rubric Storchastic. (Source.). See also Deepmind’s mc_gradients.

## 6 Optimising Monte Carlo

Let us say I need to differentiate through a monte carlo algorithm to alter its parameters while holding the PRNG fixed. See Tuning MC.

## 7 References

Ahn, Korattikara, and Welling. 2012. In Proceedings of the 29th International Coference on International Conference on Machine Learning. ICML’12.
Arya, Schauer, Schäfer, et al. 2022. In.
Blundell, Cornebise, Kavukcuoglu, et al. 2015. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37. ICML’15.
Casella, and Robert. 1996. Biometrika.
Fu. 2005.
Glasserman, and Ho. 1991. Gradient Estimation Via Perturbation Analysis.
Grathwohl, Choi, Wu, et al. 2018. In Proceedings of ICLR.
Grathwohl, Swersky, Hashemi, et al. 2021.
Hyvärinen. 2005. The Journal of Machine Learning Research.
Liu, Regier, Tripuraneni, et al. 2019. In.
Mnih, and Gregor. 2014. In Proceedings of The 31st International Conference on Machine Learning. ICML’14.
Mohamed, Rosca, Figurnov, et al. 2020. Journal of Machine Learning Research.
Oktay, McGreivy, Aduol, et al. 2020. arXiv:2007.10412 [Cs, Stat].
Prillo, and Eisenschlos. 2020.
Ranganath, Gerrish, and Blei. 2013. arXiv:1401.0118 [Cs, Stat].
Richter, Boustati, Nüsken, et al. 2020.
Rosca, Figurnov, Mohamed, et al. 2019. “Measure–Valued Derivatives for Approximate Bayesian Inference.” In NeurIPS Workshop on Approximate Bayesian Inference.
Rubinstein, Reuven Y, and Kroese. 2004. The Cross-Entropy Method a Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning.
Rubinstein, Reuven Y., and Kroese. 2016. Simulation and the Monte Carlo Method. Wiley series in probability and statistics.
Schulman, Heess, Weber, et al. 2015. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. NIPS’15.
Shi, Sun, and Zhu. 2018. In.
Stoker. 1986. Econometrica.
Tucker, Mnih, Maddison, et al. 2017. In Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17.
van Krieken, Tomczak, and Teije. 2021. In arXiv:2104.00428 [Cs, Stat].
Walder, Roussel, Nock, et al. 2019. arXiv:1901.11311 [Cs, Stat].
Williams. 1992. Machine Learning.