Categorical random variates

February 20, 2017 — January 12, 2022

classification
metrics
probability
regression
statistics
Figure 1

Distributions over categories.

1 Stick breaking tricks

Recommended reading: Machine Learning Trick of the Day (6): Tricks with Sticks, by Shakir Mohamed.

TBC.
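Until this section is filled in, here is a minimal sketch of the basic stick-breaking construction, under the assumption that we want the usual truncated version with Beta(1, α) breaks. The function name, the truncation level and the value of α are illustrative choices, not taken from the recommended reading.

```python
import numpy as np

def stick_breaking_weights(alpha, n_sticks, rng):
    """Truncated stick-breaking weights.

    v_k ~ Beta(1, alpha) is the fraction broken off the remaining stick,
    and w_k = v_k * prod_{j<k} (1 - v_j) is the length of the k-th piece.
    """
    v = rng.beta(1.0, alpha, size=n_sticks)
    # Stick length remaining before each break: 1, (1-v_1), (1-v_1)(1-v_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(0)
weights = stick_breaking_weights(alpha=2.0, n_sticks=50, rng=rng)
leftover = 1.0 - weights.sum()  # mass not yet assigned because of truncation
```

The weights are non-negative and sum to slightly less than one; the leftover mass shrinks geometrically with the truncation level, so in practice it is often lumped into the last category or simply renormalized away.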

2 Via random measures

See random measures.

3 Gumbel-max

See Gumbel-max tricks.
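For orientation, a minimal sketch of the basic trick: adding independent standard Gumbel noise to the log-probabilities and taking the argmax yields a draw from the corresponding categorical distribution. The function name and the example probabilities are illustrative assumptions.

```python
import numpy as np

def gumbel_max_sample(logits, n_samples, rng):
    """Draw categorical samples by perturbing logits with Gumbel(0, 1) noise."""
    logits = np.asarray(logits, dtype=float)
    # One independent Gumbel variate per (sample, category) pair.
    noise = rng.gumbel(size=(n_samples, logits.shape[0]))
    return np.argmax(logits + noise, axis=1)

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.6, 0.3])
samples = gumbel_max_sample(np.log(probs), n_samples=100_000, rng=rng)
# Empirical frequencies should be close to probs.
freqs = np.bincount(samples, minlength=probs.size) / samples.size
```

The logits need not be normalized; adding a constant to all of them does not change the argmax, which is part of the appeal when the normalizing constant is expensive.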

4 Pólya-Gamma augmentation

See Pólya-Gamma.

5 Softmax models

TBC

6 Multicategorical distributions

Can something belong to many categories? Then we are probably looking for Paintbox models (Broderick, Pitman, and Jordan 2013; Zhang and Paisley 2019) or some kind of multivariate Bernoulli model.

7 Dirichlet distribution

TBD. See Dirichlet distributions.

8 Dirichlet process

TBD. A distribution over an unknown number of categories. See also Gamma processes, which is how I learned to understand Dirichlet processes, insofar as I do.
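The Gamma connection is easiest to see in finite dimensions: independent Gamma variates divided by their sum form a Dirichlet vector, and the Dirichlet process can be obtained analogously by normalizing a Gamma process. A minimal sketch of the finite case follows; the function name and the concentration parameters are illustrative assumptions.

```python
import numpy as np

def dirichlet_via_gammas(alpha, n_samples, rng):
    """Sample Dirichlet(alpha) vectors by normalizing independent Gamma(alpha_k, 1) draws."""
    alpha = np.asarray(alpha, dtype=float)
    g = rng.gamma(shape=alpha, scale=1.0, size=(n_samples, alpha.shape[0]))
    return g / g.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])
draws = dirichlet_via_gammas(alpha, n_samples=50_000, rng=rng)
# The mean of Dirichlet(alpha) is alpha / alpha.sum().
empirical_mean = draws.mean(axis=0)
```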

9 Parametric distributions over non-negative integers

See count models.

10 Ordinal

Figure 2

If there is a natural ordering to the categories, then we are in a weird place. TBC.

11 Calibration

Kenneth Tay says:

In the context of binary classification, calibration refers to the process of transforming the output scores from a binary classifier to class probabilities. If we think of the classifier as a “black box” that transforms input data into a score, we can think of calibration as a post-processing step that converts the score into a probability of the observation belonging to class 1.

The scores from some classifiers can already be interpreted as probabilities (e.g. logistic regression), while the scores from some classifiers require an additional calibration step before they can be interpreted as such (e.g. support vector machines).

He recommends the tutorial by Huang et al. (2020) and its associated GitHub repository.

More general probabilistic calibration here.
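The post-processing step described in the quote above can be sketched with Platt scaling, one common calibration method, which fits a logistic curve mapping raw scores to probabilities. This is a toy example under stated assumptions: the data is synthetic, the fit uses plain gradient descent, and the function names, learning rate and iteration count are illustrative. It is not necessarily the approach taken in the tutorial above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_platt(scores, labels, n_iter=5000, lr=0.5):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log loss."""
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = sigmoid(a * scores + b)
        # Gradients of the mean log loss with respect to a and b.
        grad_a = np.mean((p - labels) * scores)
        grad_b = np.mean(p - labels)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Synthetic "black box" scores whose true class-1 probability is sigmoid(3 s - 1).
rng = np.random.default_rng(0)
scores = rng.normal(size=5000)
labels = (rng.uniform(size=5000) < sigmoid(3.0 * scores - 1.0)).astype(float)

a, b = fit_platt(scores, labels)
calibrated = sigmoid(a * scores + b)  # calibrated class-1 probabilities
```

In real use the calibration map should be fitted on held-out data rather than on the data used to train the classifier; otherwise the calibrated probabilities tend to be overconfident.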

12 Hierarchical

TBD

Figure 3

13 References

Agresti. 2007. An Introduction to Categorical Data Analysis.
Arya, Schauer, Schäfer, et al. 2022. “Automatic Differentiation of Programs with Discrete Randomness.” In.
Broderick, Pitman, and Jordan. 2013. “Feature Allocations, Probability Functions, and Paintboxes.” Bayesian Analysis.
Connor, and Mosimann. 1969. “Concepts of Independence for Proportions with a Generalization of the Dirichlet Distribution.” Journal of the American Statistical Association.
Ferguson. 1974. “Prior Distributions on Spaces of Probability Measures.” The Annals of Statistics.
Frigyik, Kapila, and Gupta. 2010. “Introduction to the Dirichlet Distribution and Related Processes.”
Grathwohl, Swersky, Hashemi, et al. 2021. “Oops I Took A Gradient: Scalable Sampling for Discrete Distributions.”
Gregor, Danihelka, Mnih, et al. 2014. “Deep AutoRegressive Networks.” In Proceedings of the 31st International Conference on Machine Learning.
Hjort. 1990. “Nonparametric Bayes Estimators Based on Beta Processes in Models for Life History Data.” The Annals of Statistics.
Huang, Li, Macheret, et al. 2020. “A Tutorial on Calibration Measurements and Calibration Models for Clinical Prediction Models.” Journal of the American Medical Informatics Association: JAMIA.
Huijben, Kool, Paulus, et al. 2022. “A Review of the Gumbel-Max Trick and Its Extensions for Discrete Stochasticity in Machine Learning.” arXiv:2110.01515 [Cs, Stat].
Ishwaran, and Zarepour. 2002. “Exact and Approximate Sum Representations for the Dirichlet Process.” Canadian Journal of Statistics.
Jang, Gu, and Poole. 2017. “Categorical Reparameterization with Gumbel-Softmax.” arXiv:1611.01144 [Cs, Stat].
Lau, and Cripps. 2022. “Thinned Completely Random Measures with Applications in Competing Risks Models.” Bernoulli.
Lin. 2016. “On The Dirichlet Distribution.”
Maddison, Mnih, and Teh. 2017. “The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables.”
Papandreou, and Yuille. 2011. “Perturb-and-MAP Random Fields: Using Discrete Optimization to Learn and Sample from Energy Models.” In 2011 International Conference on Computer Vision.
Polson, Scott, and Windle. 2013. “Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables.” Journal of the American Statistical Association.
Rao, and Teh. 2009. “Spatial Normalized Gamma Processes.” In Proceedings of the 22nd International Conference on Neural Information Processing Systems. NIPS’09.
Roychowdhury, and Kulis. 2015. “Gamma Processes, Stick-Breaking, and Variational Inference.” In Artificial Intelligence and Statistics.
Shah, Knowles, and Ghahramani. 2015. “An Empirical Study of Stochastic Variational Algorithms for the Beta Bernoulli Process.” arXiv:1506.08180 [Cs, Stat].
Shekhovtsov. 2023. “Cold Analysis of Rao-Blackwellized Straight-Through Gumbel-Softmax Gradient Estimator.” In Proceedings of the 40th International Conference on Machine Learning.
Teh, Görür, and Ghahramani. 2007. “Stick-Breaking Construction for the Indian Buffet Process.” In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics.
Thibaux, and Jordan. 2007. “Hierarchical Beta Processes and the Indian Buffet Process.” In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics.
Wang, and Yin. 2020. “Relaxed Multivariate Bernoulli Distribution and Its Applications to Deep Generative Models.” In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI).
Xuan, Lu, Zhang, et al. 2015. “Nonparametric Relational Topic Models Through Dependent Gamma Processes.” arXiv:1503.08542 [Cs, Stat].
Zhang, and Paisley. 2019. “Random Function Priors for Correlation Modeling.” In International Conference on Machine Learning.