Taking gradients through integrals using randomness.
A similarly named but distinct topic is Stochastic Gradient MCMC, which *uses* stochastic gradients to sample from a target posterior distribution; similar tools and concepts pop up in both settings.

## Score function estimator

A.k.a. REINFORCE (all-caps because it began life as an acronym). Could do with a decent intro. TBD.

A very generic method that works on many objectives, including ones involving discrete variables; however, it is notoriously high-variance if applied naïvely. The key identity is the log-derivative trick, \(\nabla_\theta \mathbb{E}_{\mathsf{x}\sim p(\cdot;\theta)}[f(\mathsf{x})] = \mathbb{E}_{\mathsf{x}\sim p(\cdot;\theta)}\left[f(\mathsf{x})\nabla_\theta \log p(\mathsf{x};\theta)\right],\) which requires only that we can sample \(\mathsf{x}\) and evaluate the score \(\nabla_\theta \log p\).
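A minimal numpy sketch (my own toy example, not from the cited papers): estimate \(\nabla_\theta \mathbb{E}_{\mathsf{x}\sim\mathcal{N}(\theta,1)}[\mathsf{x}^2]\), whose true value is \(2\theta\), using \(f(x)\nabla_\theta\log p(x;\theta)\), with and without a baseline:

```python
import numpy as np

# Score-function (REINFORCE) estimator for d/dtheta E[f(x)],
# with x ~ N(theta, 1) and f(x) = x**2.
# Since E[x**2] = theta**2 + 1, the true gradient is 2 * theta.
# The estimator uses d/dtheta E[f(x)] = E[f(x) * d/dtheta log p(x; theta)],
# and for N(theta, 1) the score is d/dtheta log p = (x - theta).

rng = np.random.default_rng(0)
theta = 1.5
n = 200_000

x = rng.normal(theta, 1.0, size=n)
score = x - theta                        # score function of the Gaussian
grad_est = np.mean(x**2 * score)         # naive, high-variance estimate

# A constant baseline b leaves the estimator (essentially) unbiased,
# because E[b * score] = 0, but can reduce variance substantially.
# (Estimating b from the same samples adds a small O(1/n) bias;
# in practice one would use independent samples or a running average.)
b = np.mean(x**2)
grad_est_baseline = np.mean((x**2 - b) * score)

print(grad_est, grad_est_baseline, 2 * theta)
```

Even in this toy problem the baseline visibly tightens the estimate; for more structured variance reduction see the overviews cited below.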

For unifying overviews see (Mohamed et al. 2020; Schulman et al. 2015; van Krieken, Tomczak, and Teije 2021) and the Storchastic docs.

- Shakir Mohamed, Log Derivative Trick
- Syed Ashar Javed, REINFORCE vs Reparameterization Trick

### Rao-Blackwellization

Rao-Blackwellization (Casella and Robert 1996) seems like a natural trick for gradient estimators. How would it work? Liu et al. (2019) is a contemporary example; I have a vague feeling that I saw something similar in Reuven Y. Rubinstein and Kroese (2016). TODO: follow up.
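The basic mechanism, in a hedged toy sketch (mine, not specific to any of the cited papers): replace an estimator by its conditional expectation given a coarser statistic, which by the law of total variance can only reduce variance.

```python
import numpy as np

# Rao-Blackwellization as variance reduction by conditioning.
# Estimate E[Y] where K ~ Poisson(lam) and Y | K ~ N(K, 1), so E[Y] = lam.
# The Rao-Blackwellized estimator replaces Y with E[Y | K] = K,
# integrating out the Gaussian noise analytically.

rng = np.random.default_rng(1)
lam = 4.0
n = 100_000

k = rng.poisson(lam, size=n)
y = rng.normal(k, 1.0)

crude = y.mean()        # uses the noisy Y samples directly: Var(Y) = lam + 1
rao_black = k.mean()    # conditions the extra noise away:   Var(K) = lam

print(crude, rao_black)
```

In a gradient estimator the conditioning variable would be, say, a subset of the latent variables, but the variance accounting is the same.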

## Parametric

I can imagine that our observed rv \({\mathsf{x}}\in \mathbb{R}\) is generated by inverse-CDF sampling from its CDF \(F(\cdot;\theta)\) with parameter \(\theta\): \[\mathsf{x} = F^{-1}(\mathsf{u};\theta) \] where \(\mathsf{u}\sim\operatorname{Uniform}(0,1)\). Each realization \(x_i\) corresponds to an independent draw \(u_i\) of \(\mathsf{u}\). How can I get the derivative of such a map with respect to \(\theta\)?
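For a concrete case (my example, chosen for its closed forms): an exponential with rate \(\theta\) has \(F^{-1}(u;\theta)=-\log(1-u)/\theta\), and holding \(u\) fixed, \(\partial x/\partial\theta = \log(1-u)/\theta^2 = -x/\theta\).

```python
import numpy as np

# Pathwise derivative through inverse-CDF sampling, for an
# Exponential(rate=theta), where F^{-1}(u; theta) = -log(1 - u) / theta.
# Holding u fixed, dx/dtheta = log(1 - u) / theta**2 = -x / theta.

rng = np.random.default_rng(2)
theta = 2.0
u = rng.uniform(size=5)

x = -np.log1p(-u) / theta        # x = F^{-1}(u; theta)
analytic = -x / theta            # closed-form pathwise derivative

# Finite-difference check with the SAME u held fixed.
h = 1e-6
x_h = -np.log1p(-u) / (theta + h)
numeric = (x_h - x) / h

print(np.max(np.abs(analytic - numeric)))
```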

Maybe I generated my original variable not by the inverse-CDF method but by simulating some variable \({\mathsf{z}}\sim F(\cdot; \theta)\) directly. In that case I may as well have generated those \(\mathsf{u}_i\) by taking \(\mathsf{u}_i=F(\mathsf{z}_i;\theta)\), and I am conceptually generating my RV by fixing a realization \(z_i\) of \(\mathsf{z}\) and taking \(\mathsf{x} := F^{-1}(F(z_i;\theta);\tau)\), evaluated at \(\tau=\theta\). So to find the effect of a perturbation of the parameter, what I actually need is

\[\left.\frac{\partial}{\partial \tau} F^{-1}(F(z;\theta);\tau)\right|_{\tau=\theta}\]

Does this do what we want? Kinda. Suppose the parameter in question is something boring, such as the location parameter of a location-scale family, i.e. \(F(\cdot;\theta)=F(\cdot-\theta;0).\) Then \(F^{-1}(\cdot;\theta)=F^{-1}(\cdot;0)+\theta\) and thus

\[\begin{aligned} \left.\frac{\partial}{\partial \tau} F^{-1}(F(z;\theta);\tau)\right|_{\tau=\theta} &=\left.\frac{\partial}{\partial \tau} F^{-1}(F(z-\theta;0);0)+\tau\right|_{\tau=\theta}\\ &=\left.\frac{\partial}{\partial \tau}\left(z-\theta+\tau\right)\right|_{\tau=\theta}\\ &=1\\ \end{aligned}\]

OK grand, that came out simple enough: a location shift moves every sample one-for-one.
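The same calculation can be checked numerically. A sketch using the logistic distribution as the location family, since its CDF and inverse CDF have closed forms (my choice of family; any location family would do):

```python
import numpy as np

# Numerical check of the location-family calculation above:
#   F(z; theta)      = 1 / (1 + exp(-(z - theta)))    (logistic CDF)
#   F^{-1}(u; theta) = theta + log(u / (1 - u))        (logistic iCDF)
# Push a fixed z through u = F(z; theta), back through F^{-1}(u; tau),
# and finite-difference in tau at tau = theta; the derivative should be 1.

def cdf(z, theta):
    return 1.0 / (1.0 + np.exp(-(z - theta)))

def icdf(u, theta):
    return theta + np.log(u / (1.0 - u))

theta = 0.7
z = 1.3                     # a fixed realization
u = cdf(z, theta)

h = 1e-6
deriv = (icdf(u, theta + h) - icdf(u, theta - h)) / (2 * h)
print(deriv)                # close to 1, as the calculation predicts
```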

TBC

## “Measure-valued”

TBD (Mohamed et al. 2020; Rosca et al. 2019).

## Tooling

van Krieken, Tomczak, and Teije (2021) supply a large library of PyTorch tools for stochastic gradient estimation, under the rubric Storchastic (source). See also DeepMind’s mc_gradients.

## Reparameterization trick

TBD.

## Optimising Monte Carlo

Let us say I need to differentiate through a Monte Carlo algorithm to alter its parameters while holding the PRNG fixed. See Tuning MC.
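Holding the PRNG fixed amounts to common random numbers: reuse the same noise draws at every parameter value, so the Monte Carlo objective becomes a smooth deterministic function of the parameters. A sketch with a toy objective of my own choosing:

```python
import numpy as np

# Differentiating a Monte Carlo estimate with the PRNG held fixed.
# Objective: J(theta) = E[(x - 3)**2], x ~ N(theta, 1), estimated with
# reparameterized samples x = theta + eps. Reusing the SAME eps at every
# theta makes J smooth, so finite differences (or autodiff) behave.

rng = np.random.default_rng(3)
eps = rng.normal(size=50_000)      # frozen noise, shared across evaluations

def J(theta):
    x = theta + eps                # reparameterized samples
    return np.mean((x - 3.0) ** 2)

theta = 1.0
h = 1e-5
grad = (J(theta + h) - J(theta - h)) / (2 * h)

print(grad)                        # true gradient: 2 * (theta - 3) = -4
```

Resampling fresh noise at each evaluation would instead make the finite difference dominated by Monte Carlo noise.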

## References

*Proceedings of the 29th International Conference on International Conference on Machine Learning*, 1771–78. ICML’12. Madison, WI, USA: Omnipress.

*Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37*, 1613–22. ICML’15. Lille, France: JMLR.org.

*Biometrika* 83 (1): 81–94.

*Gradient Estimation Via Perturbation Analysis*. Springer Science & Business Media.

*The Journal of Machine Learning Research* 6 (December): 695–709.

*arXiv:2104.00428 [Cs, Stat]*.

*Journal of Machine Learning Research* 21 (132): 1–62.

*arXiv:2007.10412 [Cs, Stat]*, July.

*arXiv:1401.0118 [Cs, Stat]*, December.

*NeurIPS Workshop on Approximate Bayesian Inference*.

*Simulation and the Monte Carlo Method*. 3rd edition. Wiley series in probability and statistics. Hoboken, New Jersey: Wiley.

*The Cross-Entropy Method a Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning*. New York, NY: Springer New York.

*Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2*, 3528–36. NIPS’15. Cambridge, MA, USA: MIT Press.

*Econometrica*54 (6): 1461–81.

*arXiv:1901.11311 [Cs, Stat]*, June.
