I have data and/or predictions made up of non-negative integers \({\mathbb N }\cup\{0\}\). What probability distributions can I use to model it?

All the distributions I discuss here have support unbounded above. Bounded distributions (e.g. vanilla Binomial) are for some other time. The exception is the Bernoulli RV, a.k.a. the biased coin, which is simple enough to sneak in.

For the details of these models in a time series context, see

- count time series
- or, a special case, Galton-Watson processes

Should I also raid the document topic model literature for this. Surely they are implicitly count data? See string bags compare with Steyvers and Tenenbaum’s semantic network model (Steyvers and Tenenbaum 2005).

A lot of this material is in (Johnson, Kemp, and Kotz 2005), if you’d like a convenient 1-stop reference.

## Poisson

The Poisson is reminiscent of the Gaussian for count data, in terms of the number of places that it pops up, and the vast number of things that have a limiting Poisson distribution.

Conveniently, it has only one parameter, which means you don’t need to justify how you chose any other parameters, saving valuable time and thought. It’s useful as a “null model”, in that the number of particles in realisation of a point process without interaction will be Poisson-distributed, conditional upon the mean measure. Conversely, non-Poisson residuals are evidence that your model has failed to take out some kind of interaction or hidden variable.

- Spelled
- \(\operatorname{Poisson}(\lambda)\)
- Pmf
- \(\mathbb{P}(k;\lambda)={\frac {\lambda ^{k}}{k!}}e^{{-\lambda }}\)
- Mean
- \(\lambda\)
- Variance
- \(\lambda\)
- Pgf
- \(G(s;\lambda)=\exp(\lambda (s-1))\)

## Negative Binomial

Nearly a century old (Greenwood and Yule 1920).
A generic count data model which, unlike the Poisson, has both location-like
and scale-like parameters, instead of only one parameter.
This makes one feel better about using it than a Poisson, since 2 parameters seems like a good number.
Has a traditional rationale in terms of drawing balls from urns and such,
which is of little interest here.
The key virtue here is that it is both *flexible* and *uncontroversial*.

It includes Geometric as special cases when \(r=1\),
and the Gamma and Poisson as limiting cases.
More precisely, the Poisson is a *limiting* case of the Polya distribution.
for example, in the large-k limit, it
approximates the Gamma distribution,
and, when the mean is held constant,
in the large \(r\) limit, it approaches Poisson.
For fixed *r* it is an exponential family

For all that, it’s still contrived, this model, and tedious to fit.

- Spelled
- \(\operatorname{NB}(p,r)\)
- Pmf
- \(\mathbb{P}(k;p,r) = {k+r-1 \choose k}\cdot (1-p)^{r}p^{k} = \frac{\Gamma(k+r)}{k!\,\Gamma(r)} (1-p)^rp^k\)
- Mean
- \({\frac {pr}{1-p}}\)
- Variance
- \({\frac {pr}{(1-p)^{2}}}\)
- Pgf
- \(G(s;p,r)=\left({\frac {1-p}{1-ps}}\right)^{{\!r}}{\text{ for }}|s|<{\frac 1p}\)

### Mean/dispersion parameterisation (Polya)

Commonly, where the \(r\) parameter is not required to be a non-negative
integer, we call it a *Polya* model, and use it for *over-dispersed* data
i.e. data that looks like a Poisson process if we are drunk, but whose variance
is too big in comparison to its mean for soberer minds.

To see how that works we will reparameterise the model in terms of a “location” parameter \(\lambda\) and “dispersion”/scale-ish parameter \(\alpha\), such that we can rewrite it.

- Spelled
- \(\text{Polya}(\lambda,\alpha)\)
- Pmf
- \(\mathbb{P}(k;\lambda,\alpha) = \frac{ \Gamma\left( k+\frac{1}{\alpha} \right) }{ \Gamma(k+1)\Gamma\left( \frac{1}{\alpha} \right) } \left( \frac{ \lambda }{ \lambda+\frac{1}{\alpha} }\right)^k\left( 1+\lambda\alpha \right)^{-\frac{1}{\alpha} }\)
- Mean
- \(\lambda\)
- Variance
- \(\lambda + \lambda^2\alpha\)
- Pgf
- \(G(s;\lambda,\alpha)=(\alpha (\lambda -\lambda s)+1)^{-1/\alpha }\)

The log Pmf is thus

\[ \mathbb{P}(k;\lambda,\alpha) =\begin{array}{l} \log\Gamma\left(k+\frac{1}{\alpha}\right)\\- \log\Gamma(k+1)\\- \log\Gamma\left( \frac{1}{\alpha} \right)\\+k\left( \log\lambda-\log\left( \lambda+\frac{1}{\alpha} \right) \right)\\- \frac{\log(1+\lambda\alpha)}{\alpha} \end{array}\]

All these log-gamma differences will be numerically unstable if calculated naïvely, so we need to use different numerical methods depending upon the combination of \(k,\lambda\) and \(\alpha\) values in play. Aieee! Tedious!

To bother with this nonsense, one would need a strong reason to believe the negative binomial was the true model; otherwise, for example, the Lagrangian Poisson-Paisson is easier to fit, in that the log Pmf is numerically tractable for all parameter combinations even with large count values, and it is typically no less natural for common data sets.

## Geometric

A discrete analogue of the exponential, i.e. the probability distribution of the number X of Bernoulli trials before the first success, supported on the set \(\{0, 1, 2, 3, …\}\)

- Spelled
- \(\text{Geom}(p)\)
- Pmf
- \(\mathbb{P}(k;p) = (1-p)^k\,p\!\)
- Pgf
- \(G(s;p)=\frac{p}{1-s+sp}.\)
- Mean
- \(\frac {1-p}{p}\)
- Variance
- \(\frac {1-p}{p^{2}}\)

Note that \({\operatorname{Geom}}(p)={\operatorname{NB}}(1,\,1-p).\,\).

### Mean parameterisation

We can parameterise this in terms of the mean \(\lambda=\frac {1-p}{p}\Rightarrow p=\frac{1}{\lambda+1}\)

- Spelled
- \(\operatorname{MGeom}(\lambda)\)
- Pmf
- \(\mathbb{P}(k;\lambda) = \left(\frac{\lambda}{\lambda+1}\right)^k\frac{1}{\lambda+1}\)
- Pgf
- \(G(s;\lambda)=\frac{\lambda+1}{s\lambda+1}.\)
- Mean
- \(\lambda\)
- Variance
- \(\lambda^2+\lambda\)

## Lagrangian distributions

Another clade of distribution.
Here we induce them via the pgf, although we
generate the distribution from a *function* this time, or rather, two functions.
Sometimes the forms of the mass function are explicit and easy.
Others, not so much.
This family includes various others on this page; I will check which some day.
For now, let’s get to the interesting new ones.
They are unified to my mind by modelling cluster processes,
i.e. If you have a given initial population and a given offspring distribution
for some population of… things… a Lagrangian distribution gives you a model for
the size of the total population.

See (P. C. Consul 1988; P. C. Consul and Famoye 2006) for a deep dive on this.

There are many possible interpretations for this; I will choose the interpretation in terms of cascade sizes of branching processes. It’s interesting to me because (P. C. Consul and Shoukri 1988; P. C. Consul and Famoye 2006 Ch 6.2) the total cascade size of a subcritical branching process has a “delta Lagrangian” or “general Lagrangian” distribution, depending on whether the cluster has a deterministic or random starting population.

We will define offspring distribution of such a branching process as \(G\sim G_Y(\eta, \alpha)\), with \(EG:=\eta\lt 1\).

Let’s get specific.

### Poisson-Poisson Lagrangian

See (P. C. Consul and Famoye 2006 Ch 9.3). Also known as the Generalised Poisson, although there are many things called that.

- Spelled
- \(\operatorname{GPD}(\mu,\eta)\)
- Pmf
- \(\mathbb{P}(X=x;\mu,\eta)=\frac{\mu(\mu+ \eta x)^{x-1}}{x!e^{\mu+x\eta}}\)
- Mean
- \(\frac{\mu}{1-\eta}\)
- Variance
- \(\frac{\mu}{(1-\eta)^3}\)

Returning to the cascade interpretation: Suppose we have

- an
*initial population*is distributed \(\operatorname{Poisson}(\mu\)) - and everyone in the population has a number of
*offspring*distributed \(\operatorname{Poisson}(\eta\)).

Then the total population is distributed as \(\operatorname{GPD}(\mu, \eta)\).

Notice that this can produce long tails, in the sense that it can have a large variance with finite mean, but not heavy tails, in the sense of the variance becoming infinite while retaining a finite mean; both variance and expectation go to infinity together.

Here, I implemented the GPD for you in python. There are versions for R, presumably. A quick search turned up RMKDiscrete and LaplacesDemon.

### Delta Lagrangian distributions

🏗

### General Lagrangian distribution

A larger family of Lagrangian distributions (the largest?) family is summarised in (P. Consul and Shenton 1972), in an unintuitively abstract way.

One parameter: a differentiable (infinitely?) function, not necessarily a pgf, \(g: [0,1]\rightarrow \mathbb{R}\) such that \(g(0)\neq 0\text{ and } g(1)=1\). Now we define a pgf \(\psi(s)\) implicitly as the smallest root of the Lagrange transformation \(z=sg(z)\). The paradigmatic example of such a function is \(g:z\mapsto 1−p+pz\); let’s check this fella out.

🏗

- Spelled
- ?
- Pmf
- ?
- Mean
- ?
- Variance
- ?

## Discrete Stable

Another generalisation of Poisson, with features such as a power-law tail.

By analogy with the continuous-stable distribution, a “stable” family for count data.

In particular, this is stable in the sense that it is a limiting distribution for sums of count random variables, analogous to the continuous stable family for real-valued RVs.

No (convenient) closed form for the Pmf in the general case, but the Pgf is simple, so that’s something.

- Spelled
- \(\operatorname{DS}(\nu,a)\)
- Pmf
- \(\mathbb{P}(k;\nu,a)= \left.\frac{1}{k!} \frac{d^kG(s;\nu,a)}{ds^k}\right|_{s=0}.\) (which is simply the formula for extracting the Pmf from any Pgf.)
- Pgf
- \(G(s;\nu,a)=\exp(-a (1-s)^\nu).\)
- Mean
- \(\operatorname{E}(X) = G'(1^-) = a \nu e^{-a}.\)
- Variance
- \(\operatorname{Var}(X)=G''(1^-) + \operatorname{E}(X) - \operatorname{E}^2(X).\)

Here, \(a>0\) is a scale parameter and \(0\lt\nu\leq 1\) a dispersion parameter describing in particular a power-law tail such that when \(\nu\lt 1\),

\[ \lim_{k \to \infty}\mathbb{P}(k;\nu,a) \simeq \frac{1}{k^{\nu+1}}. \]

Question: the Pgf formulation implies this is a non-negative distribution. Does that mean that symmetric discrete RVs cannot be stable? Possibly-negative ones?

(John P. Nolan 2001; J. P. Nolan 1997) give approximate ML-estimators of the parameter. (Wai Ha Lee 2010) does some legwork:

This thesis considers the interplay between the continuous and discrete properties of random stochastic processes. It is shown that the special cases of the one-sided Lévy-stable distributions can be connected to the class of discrete-stable distributions through a doubly-stochastic Poisson transform. This facilitates the creation of a one-sided stable process for which the N-fold statistics can be factorised explicitly. […] Using the same Poisson transform interrelationship, an exact method for generating discrete-stable variates is found. It has already been shown that discrete-stable distributions occur in the crossing statistics of continuous processes whose autocorrelation exhibits fractal properties. The statistical properties of a nonlinear filter analogue of a phase-screen model are calculated, and the level crossings of the intensity analysed. […] The asymptotic properties of the inter-event density of the process are found to be accurately approximated by a function of the Fano factor and the mean of the crossings alone.

Brrr. That sounds gruelling. You can probably just read the shorter article version though (W. H. Lee, Hopcraft, and Jakeman 2008).

## Zipf/Zeta models

The discrete version of the basic power-law models.

While we are here, the plainest explanation of the relation of Zips to Pareto distribution that I know is Lada Adamic’s Zipf, Power-laws, and Pareto - a ranking tutorial.

- Spelled
- \(\operatorname{Zipf}(s)\)
- Pmf
- \(\mathbb{P}(k;s)={\frac {1/k^{s}}{\zeta (s)}}\)
- Mean
- \({\frac {\zeta (s-1)}{\zeta (s)}}~{\text{ for }}~s>2\)
- Variance
- \({\frac {\zeta (s)\zeta (s-2)-\zeta (s-1)^{2}}{\zeta (s)^{2}}}~{\text{ for }}~s>3\)

This has unbounded support. In the bounded case, it becomes the Zipf—Mandelbrot law, which is too fiddly for me to discuss here unless I turn out to really need it, which would likely be for ranking statistics.

This is a *heavy tailed* distribution, in that it does not necessarily have finite higher moments.

## Yule-Simon

- Spelled
- \(\operatorname{YS}(\rho)\)
- Pmf
- \(\mathbb{P}(k;\rho)=\rho \,{\mathrm {B}}(k,\rho +1),\,\)
- Mean
- \({\frac {\rho }{\rho -1}},\, \rho \gt 1\)
- Variance
- \({\frac {\rho^2}{(\rho-1)^2\;(\rho -2)}}\,,\, \rho \gt 2\)

where B is the beta function.

Zipf law in the tail. See also the two-parameter version, which replaces the beta function with an incomplete beta function, giving Pmf \(\mathbb{P}(k;\rho,\alpha )={\frac {\rho }{1-\alpha ^{{\rho }}}}\;{\mathrm {B}}_{{1-\alpha }}(k,\rho +1),\,\)

This is also a *heavy tailed* distribution, in that it does not necessarily have finite higher moments.

I’m bored with this one too.

## Conway-Maxwell-Poisson

Exponential family count model with free variance parameter. See (Chatla and Shmueli 2016).

## Decomposability properties

For background, see decomposability.

### Stability

By analogy with the continuous case we may construct a stability equation for count RVs:

\[ X(a) \triangleq W^{1/\alpha}\odot X(b), \]

\(\odot\) here is Steutel and van Harn’s *discrete multiplication operator*,
which I won’t define here exhaustively because there are variously
complex formulations of it, and I don’t care enough to wrangle them.
In the simplest case it gives us a *binomial thinning* of the left operand by
the right

\[ A\odot B \sim \mathrm{Binom}(A,B) \]

### Self-divisibility

Poisson RVs are self-divisible, in the sense that

\[\begin{aligned} X_1&\sim \operatorname{Poisson}(\lambda_1),\\ X_2&\sim \operatorname{Poisson}(\lambda_2),\\ X_1&\perp X_2\\ &\Rightarrow \\ X_1+X_2&\sim \operatorname{Poisson}(\lambda_1+\lambda_2). \end{aligned}\]

Polya RVs likewise are self-divisible:

\[\begin{aligned} X_1&\sim \operatorname{Polya}(\lambda_1, \alpha_1),\,\\ X_2&\sim \operatorname{Polya}(\lambda_1, \alpha_1),\,\\ X_1&\perp X_2 &\Rightarrow\\ X_1+X_2&\sim \operatorname{Polya}\left(\lambda_1+\lambda_2,\, \frac{\alpha_1\lambda_1^2+\alpha_2\lambda_2^2}{(\lambda_1+\lambda_2)^2}\right)\\ \operatorname{E}(X_1+X_2)&=\operatorname{E}(X_1)+\operatorname{E}(X_2)\\ \operatorname{Var}(X_1+X_2)&=\operatorname{Var}(X_1)+\operatorname{Var}(X_2) \end{aligned}\]

So, AFAICS, are GPDs, in \(\lambda\).

## References

*The Canadian Journal of Statistics / La Revue Canadienne de Statistique*28 (4): 783–98. https://doi.org/10.2307/3315916.

*Physical Review E*88 (3): 032124. https://doi.org/10.1103/PhysRevE.88.032124.

*Generalized Poisson Distributions*. New York: CRC Press.

*Communications in Statistics - Theory and Methods*21 (1): 89–109. https://doi.org/10.1080/03610929208830766.

*Lagrangian Probability Distributions*. Boston: Birkhäuser. https://books.google.com/books/about/Lagrangian_Probability_Distributions.html?id=UGbkIwT63qsC.

*Statistics*20 (3): 407–15. https://doi.org/10.1080/02331888908802188.

*Technometrics*15 (4): 791–99. https://doi.org/10.1080/00401706.1973.10489112.

*Communications in Statistics*2 (3): 263–72. https://doi.org/10.1080/03610927308827073.

*A Modern Course on Statistical Distributions in Scientific Work*, edited by G. P. Patil, S. Kotz, and J. K. Ord, 41–57. NATO Advanced Study Institutes Series 17. Springer Netherlands. https://doi.org/10.1007/978-94-010-1842-5_5.

*Communications in Statistics - Theory and Methods*13 (12): 1533–47. https://doi.org/10.1080/03610928408828776.

*American Journal of Mathematical and Management Sciences*8 (1-2): 181–202. https://doi.org/10.1080/01966324.1988.10737237.

*SIAM Journal on Applied Mathematics*23 (2): 239–48. https://doi.org/10.1137/0123026.

*Biometrika*70 (1): 269–74. https://doi.org/10.1093/biomet/70.1.269.

*Biometrika*96 (4): 781–92. https://doi.org/10.1093/biomet/asp057.

*Communications in Statistics - Simulation and Computation*38 (9): 2004–17. https://doi.org/10.1080/03610910903202089.

*Communications in Statistics - Simulation and Computation*21 (1): 173–88. https://doi.org/10.1080/03610919208813013.

*The European Physical Journal B - Condensed Matter and Complex Systems*41 (2): 255–58. https://doi.org/10.1140/epjb/e2004-00316-5.

*Journal of the Royal Statistical Society*83 (2): 255–79. https://doi.org/10.2307/2341080.

*Stochastic Processes and Their Applications*45 (2): 209–30. https://doi.org/10.1016/0304-4149(93)90070-K.

*Zeitschrift für Wahrscheinlichkeitstheorie Und Verwandte Gebiete*61 (1): 97–118. https://doi.org/10.1007/BF00537228.

*Communications in Statistics - Theory and Methods*45 (3): 712–21. https://doi.org/10.1080/03610926.2013.835414.

*SIAM Journal on Applied Mathematics*44 (4): 854–68. https://doi.org/10.1137/0144061.

*Univariate Discrete Distributions*. 3rd ed. Hoboken, N.J: Wiley.

*Physical Review E*77 (1): 011109. https://doi.org/10.1103/PhysRevE.77.011109.

*Journal of Probability and Statistical Science*8 (1): 113–23.

*PLoS ONE*2 (2): e180. https://doi.org/10.1371/journal.pone.0000180.

*Aequationes Mathematicae*49 (1): 57–85. https://doi.org/10.1007/BF01827929.

*Sankhyā: The Indian Journal of Statistics, Series A (1961-2002)*27 (2/4): 249–58. http://www.jstor.org/stable/25049369.

*Communications in Statistics. Stochastic Models*13 (4): 759–74. https://doi.org/10.1080/15326349708807450.

*Lévy Processes*, edited by Ole E. Barndorff-Nielsen, Sidney I. Resnick, and Thomas Mikosch, 379–400. Birkhäuser Boston. https://doi.org/10.1007/978-1-4612-0197-7_17.

*Biometrics*61 (1): 179–85. https://doi.org/10.1111/j.0006-341X.2005.030833.x.

*Advances In Neural Information Processing Systems*, 5006–14. http://papers.nips.cc/paper/6082-poisson-gamma-dynamical-systems.

*Journal of the Royal Statistical Society. Series C (Applied Statistics)*54 (1): 127–42. http://www.ics.uci.edu/ johnsong/papers/Shmueli.

*Biostatistics*, edited by Ian B. MacNeill, Gary J. Umphrey, Allan Donner, and V. Krishna Jandhyala, 259–68. Dordrecht: Springer Netherlands. http://link.springer.com/10.1007/978-94-009-4794-8_15.

*Journal of Applied Probability*31: 185–97. https://doi.org/10.2307/3214956.

*Statistics & Probability Letters*79 (14): 1608–14. https://doi.org/10.1016/j.spl.2009.03.030.

*The Annals of Probability*7 (5): 893–99. https://doi.org/10.1214/aop/1176994950.

*Cognitive Science*29 (1): 41–78. https://doi.org/10.1207/s15516709cog2901_3.

*Journal of Statistical Planning and Inference*136 (9): 3173–86. https://doi.org/10.1016/j.jspi.2004.12.008.

*Statistica Neerlandica*54 (3): 374–76. https://doi.org/10.1111/1467-9574.00147.

*Non-Linear Time Series*, 199–244. Springer International Publishing.

*Journal of Multivariate Analysis*87 (2): 356–69. https://doi.org/10.1016/S0047-259X(03)00020-4.