Gumbel (soft) max tricks

Concrete distribution, relaxed categorical etc

2017-02-20 — 2022-04-01

classification

metrics

probability

statistics

The family of Gumbel tricks is useful for two things: sampling from categorical distributions (and their continuous relatives on the simplex), and learning models that contain categorical variables, by reparameterisation.

1 Gumbel Trick basic

Francis Bach on Gumbel tricks has his characteristically out-of-the-simplex perspective.
Chris J. Maddison on Gumbel Machinery
Laurent Dinh, Gumbel-Max Trick Inference
The Gumbel-Max Trick for Discrete Distributions
Tim Vieira, Gumbel-max trick
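
As a rough illustration of the basic trick these posts describe: perturb the (possibly unnormalised) log-weights with independent standard Gumbel noise and take the argmax, and the winning index is an exact categorical sample. The sketch below uses only the Python standard library. The function names `sample_gumbel` and `gumbel_max_sample`, the example probabilities, and the arbitrary offset of 5.0 are all made up for illustration.

```python
import math
import random
from collections import Counter


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) variate via the inverse CDF."""
    u = rng.random()
    # random() can return exactly 0.0, where log(u) is undefined; redraw.
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(log_weights, rng):
    """Return index i with probability proportional to exp(log_weights[i])."""
    perturbed = [lw + sample_gumbel(rng) for lw in log_weights]
    return max(range(len(perturbed)), key=perturbed.__getitem__)


rng = random.Random(0)
probs = [0.1, 0.6, 0.3]
# The log-weights need not be normalised: a constant offset does not
# change which index wins the argmax.
log_weights = [math.log(p) + 5.0 for p in probs]

n = 20000
counts = Counter(gumbel_max_sample(log_weights, rng) for _ in range(n))
freqs = [counts[i] / n for i in range(len(probs))]
print(freqs)  # empirical frequencies, which should be close to probs
```

Note that the normalising constant is never computed, which is much of the appeal when the weights are only known up to scale.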

2 Softmax relaxation

A.k.a. relaxed Bernoulli, relaxed categorical.

One of the co-inventors, Eric Jang, wrote a tutorial Categorical Variational Autoencoders using Gumbel-Softmax:

The main contribution of this work is a “reparameterization trick” for the categorical distribution. Well, not quite—it’s actually a re-parameterization trick for a distribution that we can smoothly deform into the categorical distribution. We use the Gumbel-Max trick, which provides an efficient way to draw samples $z$ from the Categorical distribution with class probabilities $\pi_{i}$:

$$z = \operatorname{OneHot}\left(\underset{i}{\arg\max}\, [g_{i} + \log \pi_{i}]\right)$$

argmax is not differentiable, so we simply use the softmax function as a continuous approximation of argmax:

$$y_{i} = \frac{\exp((\log(\pi_{i}) + g_{i}) / \tau)}{\sum_{j = 1}^{k} \exp((\log(\pi_{j}) + g_{j}) / \tau)} \quad \text{for } i = 1, \dots, k$$

Hence, we call this the “Gumbel-SoftMax distribution”. $\tau$ is a temperature parameter that allows us to control how closely samples from the Gumbel-Softmax distribution approximate those from the categorical distribution. As $\tau \to 0$, the softmax becomes an argmax and the Gumbel-Softmax distribution becomes the categorical distribution. During training, we let $\tau > 0$ to allow gradients past the sample, then gradually anneal the temperature $\tau$ (but not completely to 0, as the gradients would blow up).
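
To make the role of the temperature concrete, here is a minimal sketch of the relaxed sample $y$ from the quoted formula, in plain Python with no autodiff. It assumes the $g_{i}$ are independent standard Gumbel draws. The helper names (`sample_gumbel`, `softmax`, `gumbel_softmax`) and the example probabilities are hypothetical. The same noise is reused at each temperature, so the only thing that changes is how sharply the softmax concentrates on the Gumbel-max winner.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) variate via the inverse CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def softmax(xs):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(log_probs, noise, tau):
    """Relaxed sample y_i = softmax((log pi_i + g_i) / tau)."""
    return softmax([(lp + g) / tau for lp, g in zip(log_probs, noise)])


rng = random.Random(0)
probs = [0.1, 0.6, 0.3]
log_probs = [math.log(p) for p in probs]
noise = [sample_gumbel(rng) for _ in probs]

# The hard Gumbel-max sample that the relaxation approaches as tau -> 0.
hard_index = max(range(len(probs)), key=lambda i: log_probs[i] + noise[i])

relaxed = {tau: gumbel_softmax(log_probs, noise, tau) for tau in (5.0, 1.0, 0.1)}
for tau, y in relaxed.items():
    print(tau, [round(v, 3) for v in y])
```

At every temperature the largest component of $y$ sits at `hard_index`; lowering $\tau$ only pushes more mass onto it. In a real model the same expression would be written in an autodiff framework so that gradients flow through $y$ to the class probabilities.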

Emma Benjaminson, The Gumbel-Softmax Distribution takes it in small pedagogic steps.

3 Straight-through Gumbel

TBC

4 Reverse Gumbel

Laurent Dinh, Gumbel-Max Trick Inference
Chris J. Maddison, Gumbel Machinery, introduces Maddison, Tarlow, and Minka (2015).

5 References

Huijben, Kool, Paulus, et al. 2022. “A Review of the Gumbel-Max Trick and Its Extensions for Discrete Stochasticity in Machine Learning.” arXiv:2110.01515 [Cs, Stat].

Jang, Gu, and Poole. 2017. “Categorical Reparameterization with Gumbel-Softmax.”

Maddison, Mnih, and Teh. 2017. “The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables.”

Maddison, Tarlow, and Minka. 2015. “A* Sampling.”

Papandreou, and Yuille. 2011. “Perturb-and-MAP Random Fields: Using Discrete Optimization to Learn and Sample from Energy Models.” In 2011 International Conference on Computer Vision.

Paulus, Maddison, and Krause. 2020. “Rao-Blackwellizing the Straight-Through Gumbel-Softmax Gradient Estimator.”

Potapczynski, Loaiza-Ganem, and Cunningham. 2020. “Invertible Gaussian Reparameterization: Revisiting the Gumbel-Softmax.” In Advances in Neural Information Processing Systems.

Ravfogel, Svete, Snæbjarnarson, et al. 2025. “Gumbel Counterfactual Generation From Language Models.”

Shekhovtsov. 2023. “Cold Analysis of Rao-Blackwellized Straight-Through Gumbel-Softmax Gradient Estimator.” In Proceedings of the 40th International Conference on Machine Learning.

Wang, and Yin. 2020. “Relaxed Multivariate Bernoulli Distribution and Its Applications to Deep Generative Models.” In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI).