# Gumbel (soft) max tricks

Concrete distribution, relaxed categorical, etc.

February 20, 2017 — April 1, 2022

The family of Gumbel tricks is useful for sampling from categorical distributions (and relaxations of them on the simplex), and for learning models with categorical variables via reparameterisation.

## 1 Gumbel trick basics

- Francis Bach on Gumbel tricks, with his characteristically out-of-the-simplex perspective.
- Chris J. Maddison on Gumbel Machinery
- Laurent Dinh, Gumbel-Max Trick Inference
- The Gumbel-Max Trick for Discrete Distributions
- Tim Vieira, Gumbel-max trick
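The basic trick in one runnable sketch (a minimal numpy illustration; the function name is my own): perturb each log-probability with independent Gumbel(0, 1) noise and take the argmax, which yields exact categorical samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_max_sample(log_probs, rng, n):
    """Draw n categorical samples via the Gumbel-max trick:
    argmax_i (log pi_i + g_i), with g_i ~ Gumbel(0, 1) i.i.d."""
    g = rng.gumbel(size=(n, len(log_probs)))
    return np.argmax(log_probs + g, axis=1)

pi = np.array([0.5, 0.3, 0.2])
samples = gumbel_max_sample(np.log(pi), rng, 100_000)
freqs = np.bincount(samples, minlength=3) / len(samples)
# empirical frequencies closely match pi
```

Note that the trick only needs *unnormalised* log-probabilities: adding a constant to every `log_probs` entry leaves the argmax distribution unchanged.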

## 2 Softmax relaxation

A.k.a. relaxed Bernoulli, relaxed categorical.

One of the co-inventors, Eric Jang, wrote a tutorial Categorical Variational Autoencoders using Gumbel-Softmax:

> The main contribution of this work is a “reparameterization trick” for the categorical distribution. Well, not quite: it’s actually a reparameterization trick for a distribution that we can smoothly deform into the categorical distribution. We use the Gumbel-Max trick, which provides an efficient way to draw samples \(z\) from the categorical distribution with class probabilities \(\pi_{i}\):
>
> \[
> z=\operatorname{OneHot}\left(\underset{i}{\arg \max }\left[g_{i}+\log \pi_{i}\right]\right)
> \]
>
> argmax is not differentiable, so we simply use the softmax function as a continuous approximation of argmax:
>
> \[
> y_{i}=\frac{\exp \left(\left(\log \pi_{i}+g_{i}\right) / \tau\right)}{\sum_{j=1}^{k} \exp \left(\left(\log \pi_{j}+g_{j}\right) / \tau\right)} \quad \text{for } i=1, \ldots, k.
> \]
>
> Hence, we call this the “Gumbel-Softmax distribution”. \(\tau\) is a temperature parameter that allows us to control how closely samples from the Gumbel-Softmax distribution approximate those from the categorical distribution. As \(\tau \rightarrow 0\), the softmax becomes an argmax and the Gumbel-Softmax distribution becomes the categorical distribution. During training, we let \(\tau>0\) to allow gradients past the sample, then gradually anneal the temperature \(\tau\) (but not completely to 0, as the gradients would blow up).
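The relaxed sample is a point on the simplex rather than a one-hot vector. A minimal numpy sketch of one draw (function name mine; a stabilised softmax guards against overflow at low temperature):

```python
import numpy as np

rng = np.random.default_rng(1)

def gumbel_softmax_sample(log_probs, tau, rng):
    """One relaxed categorical sample: softmax((log pi + g) / tau),
    with g ~ Gumbel(0, 1). Returns a point on the probability simplex."""
    g = rng.gumbel(size=log_probs.shape)
    z = (log_probs + g) / tau
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

log_pi = np.log(np.array([0.5, 0.3, 0.2]))
y_warm = gumbel_softmax_sample(log_pi, 5.0, rng)  # high tau: smooth, spread out
y_cold = gumbel_softmax_sample(log_pi, 0.1, rng)  # low tau: close to one-hot
```

At high \(\tau\) the samples huddle near the centre of the simplex; as \(\tau \rightarrow 0\) they concentrate on the vertices, recovering (in distribution) the Gumbel-max sample.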

Emma Benjaminson’s The Gumbel-Softmax Distribution takes it in small pedagogic steps.

## 3 Straight-through Gumbel

TBC
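In the meantime, a framework-agnostic sketch of the usual straight-through pattern (my own hedged summary, not a worked-out treatment): the forward pass emits the hard one-hot sample, while the backward pass pretends the sample was the soft Gumbel-softmax vector. In an autodiff framework the bracketed difference is wrapped in a stop-gradient (`.detach()` in PyTorch, `jax.lax.stop_gradient` in JAX); plain numpy can only show the forward value.

```python
import numpy as np

def straight_through(y_soft):
    """Straight-through Gumbel: forward value is the hard one-hot,
    gradients (in an autodiff framework) would flow through y_soft.
    Pattern: y = y_soft + stop_gradient(y_hard - y_soft)."""
    y_hard = np.zeros_like(y_soft)
    y_hard[np.argmax(y_soft)] = 1.0
    # In PyTorch this would be: y_soft + (y_hard - y_soft).detach()
    return y_soft + (y_hard - y_soft)  # numerically equals y_hard

y = straight_through(np.array([0.2, 0.7, 0.1]))
# → [0., 1., 0.]
```

This gives a biased but low-variance gradient estimator; downstream computations see a genuine discrete sample while gradients still reach the logits.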

## 4 Reverse Gumbel

- Gumbel-Max Trick Inference
- Chris J. Maddison’s Gumbel Machinery introduces Maddison, Tarlow, and Minka (2015).

## 5 References

*arXiv:2110.01515 [Cs, Stat]*.

*2011 International Conference on Computer Vision*.

*Advances in Neural Information Processing Systems*.

*Proceedings of the 40th International Conference on Machine Learning*.

*Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI)*.