# Models for count data

May 15, 2015 — October 15, 2024

I have data and/or predictions made up of non-negative integers \({\mathbb N }\cup\{0\}\). What probability distributions can I use to model it?

All the distributions I discuss here have support unbounded above. Bounded distributions (e.g. vanilla Binomial) are for some other time. The exception is the Bernoulli RV, a.k.a. the biased coin, which is simple enough to sneak in.

For some related ideas in a time series context, see

- count time series
- or, a special case, Galton-Watson processes

Much of this material is in (Johnson, Kemp, and Kotz 2005), if you’d like a convenient 1-stop reference.

## 1 Connection to point processes

Count models arise as the marginal distributions of the number of *events* in a point process. For example, the Poisson distribution arises as the marginal distribution of the number of particles in a Poisson point process. There is some mileage that can be gotten from exploiting this connection. TBD.

## 2 In information retrieval

Should I also raid the document topic model literature for use of count data? Surely those TF-IDF models are implicitly count data? See string bags and semantics. compare with Steyvers and Tenenbaum’s semantic network model (Steyvers and Tenenbaum 2005).

## 3 Poisson

The Poisson is reminiscent of the Gaussian for count data, in terms of the number of places that it pops up, and the vast number of things that have a limiting Poisson distribution.

Conveniently, it has only one parameter, which means we don’t need to justify how we chose any other parameters, saving valuable time and thought. OTOH, that means it is not very flexible.

- Spelled
- \(\operatorname{Poisson}(\lambda)\)
- Pmf
- \(\mathbb{P}(k;\lambda)={\frac {\lambda ^{k}}{k!}}e^{{-\lambda }}\)
- Mean
- \(\lambda\)
- Variance
- \(\lambda\)
- Pgf
- \(G(s;\lambda)=\exp(\lambda (s-1))\)

## 4 Negative Binomial

Seems to originate in Greenwood and Yule (1920). A generic count data model which, unlike the Poisson, has two parameters allowing some decoupling between mean and dispersion. This makes one feel better about using it than a Poisson, since 2 parameters seem like a good number. Has a traditional rationale in terms of drawing balls from urns and such. The chief virtue of the Negative Binomial is that it is both *flexible* and *uncontroversial*. The chief vice is that it is numerically tedious to work with in practice.

- Spelled
- \(\operatorname{NB}(p,r)\)
- Pmf
- \(\mathbb{P}(k;p,r) = {k+r-1 \choose k}\cdot (1-p)^{r}p^{k} = \frac{\Gamma(k+r)}{k!\,\Gamma(r)} (1-p)^rp^k\)
- Mean
- \({\frac {pr}{1-p}}\)
- Variance
- \({\frac {pr}{(1-p)^{2}}}\)
- Pgf
- \(G(s;p,r)=\left({\frac {1-p}{1-ps}}\right)^{{\!r}}{\text{ for }}|s|<{\frac 1p}\)

It includes Geometric as special cases when \(r=1\), and the Gamma and Poisson as limiting cases. More precisely, the NB is a limiting case of the Pólya distribution. for example, in the large-k limit, it approximates the Gamma distribution, and, when the mean is held constant, in the large \(r\) limit, it approaches Poisson. For fixed *r* it is an exponential family.

### 4.1 Numerical issues

All these log-gamma differences are numerically unstable if calculated naïvely, so we need to use different numerical methods depending upon the combination of \(k,\lambda\) and \(\alpha\) values in play, so we need branching logic.

The Lagrangian Poisson-Poisson process is easier to fit, in that the log Pmf is numerically tractable for all parameter combinations even with large count values, and it is typically no less natural for common data sets.

### 4.2 Mean/dispersion parameterization (Pólya)

Commonly, where the \(r\) parameter is not required to be a non-negative integer, we call it a *Pólya* model, and use it for *over-dispersed* data i.e. data that looks like a Poisson process if we are drunk, but whose variance is too big in comparison to its mean for soberer minds.

To see how that works we reparameterise the model in terms of a “location” parameter \(\lambda\) and “dispersion”/scale-ish parameter \(\alpha\), such that we can rewrite it.

- Spelled
- \(\text{Pólya}(\lambda,\alpha)\)
- Pmf
- \[ \mathbb{P}(k;\lambda,\alpha) = \frac{ \Gamma\left( k+\frac{1}{\alpha} \right) }{ \Gamma(k+1)\Gamma\left( \frac{1}{\alpha} \right) } \left( \frac{ \lambda }{ \lambda+\frac{1}{\alpha} }\right)^k\left( 1+\lambda\alpha \right)^{-\frac{1}{\alpha} } \]
- Mean
- \(\lambda\)
- Variance
- \(\lambda + \lambda^2\alpha\)
- Pgf
- \(G(s;\lambda,\alpha)=(\alpha (\lambda -\lambda s)+1)^{-1/\alpha }\)

The log Pmf is thus

\[ \mathbb{P}(k;\lambda,\alpha) =\begin{array}{l} \log\Gamma\left(k+\frac{1}{\alpha}\right)\\- \log\Gamma(k+1)\\- \log\Gamma\left( \frac{1}{\alpha} \right)\\+k\left( \log\lambda-\log\left( \lambda+\frac{1}{\alpha} \right) \right)\\- \frac{\log(1+\lambda\alpha)}{\alpha} \end{array}\]

## 5 Geometric

A discrete analogue of the exponential, i.e. the probability distribution of the number X of Bernoulli trials before the first success, supported on the set \(\{0, 1, 2, 3, …\}\)

- Spelled
- \(\text{Geom}(p)\)
- Pmf
- \(\mathbb{P}(k;p) = (1-p)^k\,p\!\)
- Pgf
- \(G(s;p)=\frac{p}{1-s+sp}.\)
- Mean
- \(\frac {1-p}{p}\)
- Variance
- \(\frac {1-p}{p^{2}}\)

Note that \({\operatorname{Geom}}(p)={\operatorname{NB}}(1,\,1-p).\,\).

### 5.1 Mean parameterization

We can parameterise the Geometric distribution in terms of the mean \(\lambda=\frac {1-p}{p}\Rightarrow p=\frac{1}{\lambda+1}\)

- Spelled
- \(\operatorname{MGeom}(\lambda)\)
- Pmf
- \(\mathbb{P}(k;\lambda) = \left(\frac{\lambda}{\lambda+1}\right)^k\frac{1}{\lambda+1}\)
- Pgf
- \(G(s;\lambda)=\frac{\lambda+1}{s\lambda+1}.\)
- Mean
- \(\lambda\)
- Variance
- \(\lambda^2+\lambda\)

## 6 Lagrangian distributions

An interesting family. Models the number of events in a cascade process. Handled under cascade models. Some models on this page a special cases of cascade models.

## 7 Discrete Stable

Another generalization of Poisson, with a power-law-ish tail.

By analogy with the continuous-stable distribution, a “stable”-like family for count data. In particular, the Discrete Stable is Stable in the sense that it is a limiting distribution for sums of count random variables, analogous to the continuous stable family for real-valued RVs.

No (convenient) closed form for the Pmf in the general case, so it can be tedious to use as a likelihood (in practice we approximate). The Pgf is simple, so that’s something.

- Spelled
- \(\operatorname{DS}(\nu,a)\)
- Pmf
- \(\mathbb{P}(k;\nu,a)= \left.\frac{1}{k!} \frac{d^kG(s;\nu,a)}{ds^k}\right|_{s=0}.\) (which is simply the formula for extracting the Pmf from any Pgf.)
- Pgf
- \(G(s;\nu,a)=\exp(-a (1-s)^\nu).\)
- Mean
- \(\operatorname{E}(X) = G'(1^-) = a \nu e^{-a}.\)
- Variance
- \(\operatorname{Var}(X)=G''(1^-) + \operatorname{E}(X) - \operatorname{E}^2(X).\)

Here, \(a>0\) is a scale parameter and \(0<\nu\leq 1\) a dispersion parameter describing in particular a power-law tail such that when \(\nu< 1\),

\[ \lim_{k \to \infty}\mathbb{P}(k;\nu,a) \simeq \frac{1}{k^{\nu+1}}. \]

Question: the Pgf formulation implies this is a non-negative distribution. Does it follow that there are no symmetric discrete RVs?

John P. Nolan (2001) and J. P. Nolan (1997) give approximate ML-estimators of the parameter. Wai Ha Lee (2010) does some legwork:

This thesis considers the interplay between the continuous and discrete properties of random stochastic processes. It is shown that the special cases of the one-sided Lévy-stable distributions can be connected to the class of discrete-stable distributions through a doubly-stochastic Poisson transform. This facilitates the creation of a one-sided stable process for which the N-fold statistics can be factorised explicitly. […] Using the same Poisson transform interrelationship, an exact method for generating discrete-stable variates is found. It has already been shown that discrete-stable distributions occur in the crossing statistics of continuous processes whose autocorrelation exhibits fractal properties. The statistical properties of a nonlinear filter analogue of a phase-screen model are calculated, and the level crossings of the intensity analysed. […] The asymptotic properties of the inter-event density of the process are found to be accurately approximated by a function of the Fano factor and the mean of the crossings alone.

That sounds gruelling. We can probably read the shorter article version though (W. H. Lee, Hopcraft, and Jakeman 2008).

## 8 Zipf/Zeta models

The discrete version of the basic power-law models.

While we are here, the plainest explanation of the relation of Zipf’s law to Pareto distribution that I know is Lada Adamic’s Zipf, Power-laws, and Pareto - a ranking tutorial.

- Spelled
- \(\operatorname{Zipf}(s)\)
- Pmf
- \(\mathbb{P}(k;s)={\frac {1/k^{s}}{\zeta (s)}}\)
- Mean
- \({\frac {\zeta (s-1)}{\zeta (s)}}~{\text{ for }}~s>2\)
- Variance
- \({\frac {\zeta (s)\zeta (s-2)-\zeta (s-1)^{2}}{\zeta (s)^{2}}}~{\text{ for }}~s>3\)

Zipf’s law has unbounded support.

Zipf is a *heavy tailed* distribution, in that it does not necessarily have finite higher moments.

cf John D. Cook’s intro.

In the bounded case, it becomes the Zipf—Mandelbrot law, which is too fiddly for me to discuss here unless I turn out to really need it, which would likely be for ranking statistics.

## 9 Yule-Simon

- Spelled
- \(\operatorname{YS}(\rho)\)
- Pmf
- \(\mathbb{P}(k;\rho)=\rho \,{\mathrm {B}}(k,\rho +1)\)
- Mean
- \({\frac {\rho }{\rho -1}},\, \rho > 1\)
- Variance
- \({\frac {\rho^2}{(\rho-1)^2\;(\rho -2)}}, \rho > 2\)

where B is the beta function.

See also the two-parameter version, which replaces the beta function with an incomplete beta function, giving Pmf \(\mathbb{P}(k;\rho,\alpha )={\frac {\rho }{1-\alpha ^{{\rho }}}}\;{\mathrm {B}}_{{1-\alpha }}(k,\rho +1).\)

YS is also a *heavy tailed* distribution, in that it does not necessarily have finite higher moments. Zipf’s law-ish in the tail.

## 10 Conway-Maxwell-Poisson

Exponential family count model with free variance parameter. See (Chatla and Shmueli 2016).

## 11 Decomposability properties

For background, see decomposability.

### 11.1 Stability

By analogy with the continuous case we may construct a stability equation for count RVs:

\[ X(a) \triangleq W^{1/\alpha}\odot X(b), \]

\(\odot\) is Steutel and van Harn’s *discrete multiplication operator*, which I won’t define exhaustively because there are variously complex formulations of it, and I don’t care enough to wrangle them. In the simplest case, which AFAICT is all we need \(\odot\) gives us a *binomial thinning* of the left operand by the right

\[ A\odot B \sim \mathrm{Binom}(A,B) \]

### 11.2 Self-divisibility

Poisson RVs are self-divisible, in the sense that

\[\begin{aligned} X_1&\sim \operatorname{Poisson}(\lambda_1),\\ X_2&\sim \operatorname{Poisson}(\lambda_2),\\ X_1&\perp X_2\\ &\Rightarrow \\ X_1+X_2&\sim \operatorname{Poisson}(\lambda_1+\lambda_2). \end{aligned}\]

Pólya RVs likewise are self-divisible:

\[\begin{aligned} X_1&\sim \operatorname{Pólya}(\lambda_1, \alpha_1),\,\\ X_2&\sim \operatorname{Pólya}(\lambda_1, \alpha_1),\,\\ X_1&\perp X_2 &\Rightarrow\\ X_1+X_2&\sim \operatorname{Pólya}\left(\lambda_1+\lambda_2,\, \frac{\alpha_1\lambda_1^2+\alpha_2\lambda_2^2}{(\lambda_1+\lambda_2)^2}\right)\\ \operatorname{E}(X_1+X_2)&=\operatorname{E}(X_1)+\operatorname{E}(X_2)\\ \operatorname{Var}(X_1+X_2)&=\operatorname{Var}(X_1)+\operatorname{Var}(X_2) \end{aligned}\]

So, AFAICS, are GPDs, in \(\lambda\).

## 12 Quantize a continuous distribution

If we have some continuous distribution \(p(x)\), we can quantize it to a count distribution \(p(k)\) where \(k=\lfloor x\rfloor\). If the integral \(\int_k^{k+1} p(x)dx\) is tractable, then the Pmf is easy

## 13 References

*The Canadian Journal of Statistics / La Revue Canadienne de Statistique*.

*arXiv:1304.3741 [Math]*.

*Physical Review E*.

*arXiv:1504.05229 [Cs, Math, Stat]*.

*arXiv:1610.08244 [Stat]*.

*Generalized Poisson Distributions*.

*Communications in Statistics - Theory and Methods*.

*Lagrangian Probability Distributions*.

*Statistics*.

*Technometrics*.

*SIAM Journal on Applied Mathematics*.

*Communications in Statistics*.

*A Modern Course on Statistical Distributions in Scientific Work*. NATO Advanced Study Institutes Series 17.

*Communications in Statistics - Theory and Methods*.

*American Journal of Mathematical and Management Sciences*.

*Biometrika*.

*Biometrika*.

*Communications in Statistics - Simulation and Computation*.

*Journal of Applied Probability*.

*Communications in Statistics - Simulation and Computation*.

*The European Physical Journal B - Condensed Matter and Complex Systems*.

*Journal of the Royal Statistical Society*.

*Biometrika*.

*Communications in Statistics - Theory and Methods*.

*SIAM Journal on Applied Mathematics*.

*Univariate Discrete Distributions*.

*Annual Review of Statistics and Its Application*.

*Physical Review E*.

*Journal of Probability and Statistical Science*.

*PLoS ONE*.

*Aequationes Mathematicae*.

*Sankhyā: The Indian Journal of Statistics, Series A (1961-2002)*.

*Communications in Statistics. Stochastic Models*.

*Lévy Processes*.

*Microsurveys in Discrete Probability*. DIMACS Series in Discrete Mathematics and Theoretical Computer Science.

*Biometrics*.

*Advances In Neural Information Processing Systems*.

*Journal of the Royal Statistical Society. Series C (Applied Statistics)*.

*Biostatistics*.

*Journal of Applied Probability*.

*Statistics & Probability Letters*.

*The Annals of Probability*.

*Cognitive Science*.

*Biometrika*.

*Journal of Statistical Planning and Inference*.

*Statistica Neerlandica*.

*Non-Linear Time Series*.

*Stochastic Processes and Their Applications*.

*Zeitschrift Für Wahrscheinlichkeitstheorie Und Verwandte Gebiete*.

*Journal of Multivariate Analysis*.

*arXiv:2105.14591 [Math]*.