(Discrete-measure)-valued stochastic processes

October 10, 2019 — May 4, 2022

machine learning
stochastic processes

\[ \renewcommand{\var}{\operatorname{Var}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\pd}{\partial} \renewcommand{\bb}[1]{\mathbb{#1}} \renewcommand{\bf}[1]{\mathbf{#1}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\mm}[1]{\mathrm{#1}} \renewcommand{\cc}[1]{\mathcal{#1}} \renewcommand{\oo}[1]{\operatorname{#1}} \renewcommand{\gvn}{\mid} \renewcommand{\II}{\mathbb{I}} \]

Stochastic processes indexed by time whose state is a discrete (possibly only countable) measure. Or, to be starker: evolving categorical distributions. Popular in, for example, mathematical models of alleles in biological evolution.

Population genetics keywords, in approximate order of generality and chronology: Fisher-Wright diffusion, Moran process, Viot-Fleming process. Obviously there are many other processes matching this broad description.

For continuous state see measure processes.

1 Discrete univariate state

In the simplest case, imagine that we wish to construct a process that is a distribution over \(\{0,1\},\) e.g. for some process \(\{P_t\}_t, P_t\in[0,1] \forall t\) we could imagine that this provides a probability for Bernoulli observation likelihood. We have a lot of candidates for \(P_t\). Here are some examples:

  1. Take two Gamma processes, \(G_t^1, G_t^2\), we could take \(P_t:=\frac{G_t^1}{G_t^1+ G_t^2}\). This would be a marginally Beta distribution.
  2. Take any process \(\{B_t\}_t\) which has instantaneous values \(B_t\in\mathbb{R},\) e.g. a Wiener process or Ornstein-Uhlenbeck process. Now \(\sigma(B_t)\) is the desired process, for \(\sigma:\mathbb{R}\to[0,1].\)

Both these could be generalised in various ways to be Dirichlet distributions of a desired dimensionality.

Option (1) is the most interesting to me for various reasons. We can come back to (2) later with Shakir Mohammed’s classic post on Tricks with Sticks.

Let us look at a basic case, cribbing shamelessly from the Gamma processes notebook:

# generate a stationary thinned autoregressive Gamma series
gamp = function(T, alpha, lambda, rho) {
  g = rgamma(1, alpha, rate=lambda)
  b = rbeta(T, alpha*rho, alpha*(1-rho))
  zeta = rgamma(T, alpha*(1-rho), rate=lambda)
  gs = numeric(T)
  for (i in 1:T) {
    g = b[i] * g + zeta[i]
    gs[i] = g
betap = function(T, alpha, lambda, rho) {
    g1 = gamp(T, head(alpha,1), head(lambda,1), head(rho,1))
    g2 = gamp(T, tail(alpha,1), tail(lambda,1), tail(rho,1))
    g1 / (g1+g2)
T = 10000
ts = 1:T
plot(ts, betap(T, 1.0, 0.1, 0.999),
    type = "l", col = 2,
    ylim = c(0, 1), ylab="", xlab = "time")
lines(ts, betap(T, 10, 1.0, 0.999),
    type = "l", col = 3)
lines(ts, betap(T, 100, 10.0, 0.999),
    type = "l", col = 4)
legend("bottomright", c("lambda=0.1", "lambda=1", "lambda=10"),
    lty = 1, col = 2:4)

Some stationary Beta processes

Some stationary Beta processes

Note that we are not actually required to keep the parameters the same for both latent Gamma processes; we can have different \(\alpha\) and introduce a bias away from \(\mathbb{E}P_t=0.5\):

plot(ts, betap(T, c(1, 1), 1.0, 0.999),
    type = "l", col = 2,
    ylim = c(0, 1), ylab="", xlab = "time")
lines(ts, betap(T, c(1, 10), 1.0, 0.999),
    type = "l", col = 3)
lines(ts, betap(T, c(10, 1), 1.0, 0.999),
    type = "l", col = 4)
legend("bottomright", c("alpha1=1,alpha2=1", "alpha1=1,alpha2=10", "alpha1=10,alpha2=1"),
    lty = 1, col = 2:4)

Some stationary Beta processes with asymmetric latents

Some stationary Beta processes with asymmetric latents

If the \(\lambda\) parameter is not the same between them, then it is not longer marginally Beta, but it is still in the correct \([0,1]\) range to constitute a valid probability. We can get some interesting bias and variance structures this way:

plot(ts, betap(T, c(1, 1), c(1,1), 0.999),
    type = "l", col = 2,
    ylim = c(0, 1), ylab="", xlab = "time")
lines(ts, betap(T, c(1,3), c(1, 1/3), 0.999),
    type = "l", col = 3)
lines(ts, betap(T, c(3, 1), c(1/3, 1), 0.999),
    type = "l", col = 4)
legend("bottomright", c("lambda1=1,lambda2=1","lambda1=1,lambda2=3", "lambda1=3,lambda2=1"),
    lty = 1, col = 2:4)

Some stationary processes on the unit interval

Some stationary processes on the unit interval

You get the idea.

It would be nice if we had a formulation for which it were easy to marginalise out the implied latent processes and various other statistical operations of use. Can we get there using Gamma-Dirichlet relations?

One obvious recipe that might work in discrete time is

  1. Given \(P_{t-1}\sim\operatorname{Beta}(\alpha,K-\alpha),\) draw some \(G_t\sim\operatorname{Gamma}(K, \lambda)\). Now \(G_t^1:=P_{t-1}G_t\sim \operatorname{Gamma}(\alpha, \lambda)\) and \(G_t^1:=(1-P_{t-1})G_t\sim \operatorname{Gamma}(K-\alpha, \lambda)\) independently (TODO: check this).
  2. We can perturb either or both of those Gamma variates to another one by thinning and addition, as in thinned autoregressive Gamma processes with some correlation param \(\rho\), giving us \(G_t^1{}',G_t^2{}'\).
  3. Finally, \(P_t=\frac{G_t^1{}'}{G_t^1{}'+G_t^2{}'}\sim\operatorname{Beta}(\alpha,K-\alpha),\) i.e. it has the same distribution.

This circuitous method has the virtue that there are no dependencies on unobserved latent dependent time series, and the process thus depends only on the previous timestep, i.e. it is \(\{P_{t-1}\}\)-measurable, whereas process (1) is \(\{G_{t-1}^1,G_{t-1}^2\}\)-measurable. However, that generation process is weird and lengthy.

Exercise: What is the density of \(P_t|P_{t-1}\) in this formulation?

2 Discrete multivariate state

How do we generalise the discrete model to multivariate observations? Imagine our observations are \(d\)-dimensional lattice-structured. This looks contrived to the statistician, but is bread-and-butter to computer scientists, who deal with binary vectors all the time. In the maximally general case relationships between variables can be… arbitrary of course.

But there are models that assume some structure and I reckon some of those might be useful for inference.

Classics include

  1. Discrete state graphical models which can exploit knowledge of relationships, and which have a well-developed and reliable belief propagation theory.
  2. Spin glasses-type stochastic physics models, which impose an undirected graphical structure on the variates. Classically this is a lattice structure (the Ising model) but I many other structures can be found now, including random structures. Spin glasses can be computationally impractical, as with many undirected graphical model problems. Physicists have spent a lot of sweat and blood developing rather complicated vocabulary for these models, which is similar but not quite the same as the vcabulary that statisticians use in inference over such models.
  3. Network models, stochastic block models etc, would probably fit naturally in here.

I would like something that uses spatial adjacency in a non-parametric way and which has convenient inference. Are there models in this class, which make the “right” assumptions to be tractable for use in inference?

2.1 Discrete stick breaking process

We can break up a discrete vector of length \(d\) into regions of equal mass, and then use a stick-breaking process to sample those regions; the stick breaking distribution in this case would naturally be, say, Binomial. We can also imagine Indian Buffet/Chinese restaurant style constructions. In fact, maybe that would be even easier.

Does this get us anywhere? What dependency could we naturally introduce here? One candidate would be to randomly allocate a small number of cells in each partition to random other partitions at each timestep.

3 References

Asselah, Ferrari, and Groisman. 2011. Quasistationary Distributions and Fleming-Viot Processes in Finite Spaces.” Journal of Applied Probability.
Donnelly, and Kurtz. 1996. A Countable Representation of the Fleming-Viot Measure-Valued Diffusion.” The Annals of Probability.
Dunson, and Park. 2008. Kernel Stick-Breaking Processes.” Biometrika.
Ethier, and Griffiths. 1993. The Transition Function of a Fleming-Viot Process.” The Annals of Probability.
Ethier, and Kurtz. 1993. Fleming–Viot Processes in Population Genetics.” SIAM Journal on Control and Optimization.
Fleming, and Viot. 1979. Some Measure-Valued Markov Processes in Population Genetics Theory.” Indiana University Mathematics Journal.
Griffin, and Steel. 2006. Order-Based Dependent Dirichlet Processes.” Journal of the American Statistical Association.
Ishwaran, and James. 2001. Gibbs Sampling Methods for Stick-Breaking Priors.” Journal of the American Statistical Association.
Konno, and Shiga. 1988. Stochastic Partial Differential Equations for Some Measure-Valued Diffusions.” Probability Theory and Related Fields.
Marzen, and Crutchfield. 2020. Inference, Prediction, and Entropy-Rate Estimation of Continuous-Time, Discrete-Event Processes.” arXiv:2005.03750 [Cond-Mat, Physics:nlin, Stat].
Moran. 1958. Random Processes in Genetics.” Mathematical Proceedings of the Cambridge Philosophical Society.
Nowak. 2006. Evolutionary Dynamics: Exploring the Equations of Life.