Quasi-gradients of discrete parameters
2022-12-20 — 2025-07-25
Notes on taking gradients through things that look like they have no gradients in the standard sense because their arguments are discrete. There are a lot of loosely related concepts in here that may not reflect an actual common theme. TBC.
1 Let’s all worry about the Straight-through Estimator
When training neural networks with quantized weights or activations, we introduce a non-differentiable projection
\[ P:\mathbb R^r\to X \]
(e.g. \(P(x)=\mathrm{sign}(x)\) for binary networks). A common “trick” is to ignore the true Jacobian \(\partial P/\partial x\) and simply back-propagate gradients as if \(P\) were the identity:
\[ \tilde x_{k+1} = \tilde x_k - \eta\nabla f\bigl(P(\tilde x_k)\bigr), \quad x_{k+1}=P(\tilde x_{k+1}). \]
This is the Straight-Through Estimator (STE) (Bengio, Léonard, and Courville 2013): a crude but surprisingly effective heuristic for avoiding vanishing gradients through hard quantization. People use this a lot in practice, but then fret about it. There is a miniature industry devoted to fixing it.
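A minimal numpy sketch of the update above, on a deliberately easy toy problem (a linear loss over binary codes, whose discrete minimiser is \(\mathrm{sign}(c)\)); the loss, its gradient, and all constants here are stand-ins:

```python
import numpy as np

# Toy problem: minimise f(x) = -<c, x> over binary codes x in {-1, +1}^r;
# the discrete minimiser is sign(c). c and f are stand-ins for a real loss.
rng = np.random.default_rng(0)
r = 5
c = rng.normal(size=r)

grad_f = lambda x: -c          # gradient of f (constant here, for simplicity)
P = np.sign                    # the non-differentiable projection

x_tilde = rng.normal(size=r)   # latent "full-precision" weights
eta = 0.1
for _ in range(500):
    g = grad_f(P(x_tilde))           # forward pass sees the quantized weights...
    x_tilde = x_tilde - eta * g      # ...backward pass pretends P is the identity
print(P(x_tilde), np.sign(c))        # both equal sign(c)
```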
1.1 Mirror Descent to STE
Ajanthan et al. (2021) try to justify STE as a mirror descent trick. Mirror Descent, to recap, generalizes gradient descent to settings where the feasible set \(X\subset\mathbb R^r\) is not naturally Euclidean. It relies on:
A mirror map \(\Phi:C\to\mathbb R\), a strictly convex, differentiable function whose gradient \(\nabla\Phi\) maps the primal space \(X\) into the dual space \(\mathbb R^r\).
The associated Bregman divergence
\[ D_\Phi(p,q)=\Phi(p)-\Phi(q)-\langle\nabla\Phi(q),\,p-q\rangle. \]
Starting from \(x_0\in X\), each iteration does
\[ \nabla\Phi(y_{k+1}) \;=\;\nabla\Phi(x_k)-\eta\,g_k,\quad x_{k+1} = \arg\min_{x\in X}\;D_\Phi\bigl(x,y_{k+1}\bigr), \]
where \(g_k\in\partial f(x_k)\). Equivalently,
\[ x_{k+1} =\arg\min_{x\in X}\;\langle\eta\,g_k,\,x\rangle + D_\Phi(x,x_k). \]
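For concreteness, here is the textbook instance (not specific to the quantization story): the negative-entropy mirror map on the probability simplex, for which the update above reduces to multiplicative weights; the linear loss \(c\) is a placeholder:

```python
import numpy as np

# Mirror descent with the negative-entropy mirror map Phi(x) = sum_i x_i log x_i
# on the simplex. nabla Phi(x) = 1 + log(x); the Bregman projection back onto
# the simplex is plain normalisation, so the update is multiplicative weights.
def md_step(x, g, eta):
    y = x * np.exp(-eta * g)   # dual-space step: log y = log x - eta * g
    return y / y.sum()         # Bregman projection onto the simplex

# Minimise a linear loss f(x) = <c, x> over the simplex; the optimum puts
# all mass on argmin(c).
c = np.array([0.3, 0.1, 0.6])
x = np.ones(3) / 3
for _ in range(200):
    x = md_step(x, c, eta=0.5)
print(x.round(3))              # concentrates on index 1
```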
Ajanthan et al. show that any strictly monotone projection \(P\) yields a valid mirror map
\[ \Phi(x) \;=\;\int_{x_0}^x P^{-1}(y)\,dy, \]
because \(\nabla\Phi(x)=P^{-1}(x)\) and \(\Phi\) is strictly convex. With this choice:
MD in the dual space amounts to storing auxiliary variables \(\tilde x=P^{-1}(x)\).
The MD update \(\nabla\Phi(y_{k+1})=\nabla\Phi(x_k)-\eta g_k\) becomes simply
\[ \tilde x_{k+1} = \tilde x_k - \eta\,g_k, \]
and mapping back to the primal:
\[ x_{k+1}=P(\tilde x_{k+1}). \]
This is exactly the STE procedure.
In this view, STE is nothing other than a numerically stable implementation of MD under the mirror map induced by the projection \(P\). Mirror Descent supplies the missing theoretical foundation for why STE works so well in practice: it is simply gradient descent in a dual space tailored to the geometry of quantization.
This is kinda wild. It is not clear at all to me that mirror descent should have generalised to discrete spaces, which shows I am thinking about it wrong.
We will be violating a lot of the mirror descent assumptions in the NN setting (surely some kind of convexity violation). I wonder if you could recover mirror descent theory in the vicinity of an optimum or something? This dual space perspective looks handy.
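A quick numerical sanity check of the claimed equivalence, using \(P=\tanh\) as a smooth, strictly monotone stand-in for a hard quantizer; scipy solves the MD argmin directly and we compare it against the STE update on the latent variable:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# For P = tanh we have P^{-1} = arctanh, so the induced mirror map is
# Phi(x) = int arctanh = x*arctanh(x) + 0.5*log(1 - x^2), nabla Phi = arctanh.
Phi = lambda x: x * np.arctanh(x) + 0.5 * np.log1p(-x**2)
dPhi = np.arctanh
bregman = lambda p, q: Phi(p) - Phi(q) - dPhi(q) * (p - q)

x_k, g, eta = 0.3, 0.8, 0.1    # current iterate, gradient, step size (placeholders)

# MD update, solved numerically: argmin_x <eta g, x> + D_Phi(x, x_k).
obj = lambda x: eta * g * x + bregman(x, x_k)
x_md = minimize_scalar(obj, bounds=(-0.999, 0.999), method="bounded").x

# STE update on the latent variable x_tilde = P^{-1}(x_k).
x_ste = np.tanh(np.arctanh(x_k) - eta * g)

print(x_md, x_ste)             # agree to solver tolerance
```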
1.2 STE as a Bayes procedure
Meng, Bachmann, and Khan (2020) is another attempt at justification; it devises a Bayesian update rule corresponding to the STE as used in NN quantization.
2 Stochastic gradients via REINFORCE
The classic generic REINFORCE/score-function method for estimating gradients of expectations of functions of random variables handles discrete random variables as a special case. In practice, extra tricks (baselines, control variates) are needed to keep the variance manageable in the discrete case; see e.g. (Grathwohl et al. 2018; Liu et al. 2019; Mnih and Gregor 2014; Tucker et al. 2017).
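A minimal sketch of the score-function estimator for a Bernoulli variable, where the exact gradient \(f(1)-f(0)\) is available for comparison; the function \(f\) and the baseline choice are stand-ins:

```python
import numpy as np

# Score-function (REINFORCE) estimate of d/d theta of E_{b~Bernoulli(theta)} f(b);
# the exact value is f(1) - f(0).
rng = np.random.default_rng(0)
f = lambda b: (b - 0.7) ** 2
theta = 0.4

b = (rng.random(100_000) < theta).astype(float)   # samples in {0., 1.}
score = b / theta - (1 - b) / (1 - theta)         # d/d theta of log p(b; theta)
baseline = f(b).mean()                            # crude baseline for variance
                                                  # reduction (same-sample
                                                  # estimate, slightly biased)
est = ((f(b) - baseline) * score).mean()
print(est, f(1) - f(0))                           # both close to -0.4
```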
3 Gumbel-(soft)max
A.k.a. the concrete distribution. See Gumbel-max.
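A minimal numpy sketch of both the exact Gumbel-max sample and its softmax relaxation; the logits and temperature are placeholders:

```python
import numpy as np

# Gumbel-max: argmax(logits + Gumbel noise) is an exact categorical sample.
# Gumbel-softmax / Concrete: replace the argmax with a temperature-tau softmax
# to get a differentiable relaxation living on the simplex.
rng = np.random.default_rng(0)
logits = np.log(np.array([0.1, 0.6, 0.3]))

def sample_gumbel(shape):
    u = rng.random(shape)
    return -np.log(-np.log(u))           # standard Gumbel noise

g = sample_gumbel(logits.shape)
hard = np.argmax(logits + g)             # exact discrete sample

tau = 0.5                                # temperature; -> one-hot as tau -> 0
z = np.exp((logits + g) / tau)
soft = z / z.sum()                       # relaxed, "almost one-hot" sample
print(hard, soft.round(3))
```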
4 Gradients of other weird things
Differentiable sorting? See, e.g. Grover et al. (2018) and Prillo and Eisenschlos (2020).
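A rough sketch of the SoftSort relaxation, as I read Prillo and Eisenschlos (2020); the details (descending-sort convention, absolute-difference kernel) are my reconstruction of the paper, so treat them as assumptions:

```python
import numpy as np

# SoftSort: a row-stochastic matrix that approaches the permutation matrix
# sorting s (in descending order) as the temperature tau goes to 0.
def softsort(s, tau=0.1):
    s_sorted = np.sort(s)[::-1]                        # descending
    pairwise = -np.abs(s_sorted[:, None] - s[None, :]) / tau
    e = np.exp(pairwise - pairwise.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)            # row-wise softmax

s = np.array([0.3, 1.2, -0.5])
P_hat = softsort(s, tau=0.05)
print(P_hat.round(2))    # ~ permutation matrix: rows pick out 1.2, 0.3, -0.5
print(P_hat @ s)         # ~ s sorted in descending order
```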
5 Avoiding the need for gradients
Famously, Expectation Maximization can handle some of the same optimisation problems as gradient-based methods, but without needing gradients. There are presumably more variants.
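For instance, a minimal EM loop for a two-component Gaussian mixture: the discrete component assignments are marginalized out in the E-step rather than differentiated through; all specifics here (unit variances, equal weights, means only) are illustrative simplifications:

```python
import numpy as np

# EM for a 1-d mixture of two unit-variance Gaussians with equal weights;
# only the means are unknown. No gradients with respect to the discrete
# assignments are ever needed.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

mu = np.array([-1.0, 1.0])                       # initial guesses
for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    log_lik = -0.5 * (x[:, None] - mu[None, :]) ** 2
    r = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: closed-form re-estimate of the means
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
print(mu.round(2))                               # ~ [-2, 3]
```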
6 Other methods
What even are Grathwohl et al. (2021) and Zhang, Liu, and Liu (2022)? I think they work for quantised continuous variables, or possibly ordinal variables?