# The interpretation of RV densities as point process intensities and vice versa

Point process of observations ↔︎ observation of a point process

September 13, 2016 — September 24, 2022

density
nonparametric
point processes
probability
spatial
statistics
time series

Estimating a density by treating the observations drawn from it as a point process. In one dimension this gives us the particularly lovely trick of survival analysis, but the relations are much more general.

Consider the problem of estimating the common density $$f(x)dx=dF(x)$$ of indexed i.i.d. random variables $$\{X_i\}_{1\leq i\leq n}\in \mathbb{R}^d$$ from $$n$$ realisations of those variables, $$\{x_i\}_{i\leq n},$$ where $$F:\mathbb{R}^d\rightarrow[0,1]$$ is a (cumulative) distribution function. We assume the distribution is absolutely continuous with respect to the Lebesgue measure $$\mu$$, i.e. $$\mu(A)=0\Rightarrow P(X_i\in A)=0$$. This implies that $$P(X_i=X_j)=0\text{ for }i\neq j$$ (the observations are almost surely distinct) and that the density exists as a standard function (i.e. we do not need to consider generalised functions such as Schwartz distributions to handle atoms in $$F$$ etc.)

Here we parameterise the density with some finite dimensional parameter vector $$\theta,$$ i.e. $$f(x;\theta),$$ whose value completely characterises the density; estimating the density then reduces to estimating $$\theta.$$

In the method of maximum likelihood estimation, we seek to maximise the likelihood of the data. That is, we choose a parameter estimate $$\hat{\theta}$$ to satisfy \begin{aligned} \hat{\theta} &:=\operatorname{arg max}_\theta\prod_i f(x_i;\theta)\\ &=\operatorname{arg max}_\theta\sum_i \log f(x_i;\theta) \end{aligned}
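
For concreteness, here is a minimal numerical sketch of that recipe, assuming (arbitrarily) a Gaussian model family and using a generic optimiser; the data and the parameterisation are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: n i.i.d. draws from a density we pretend not to know.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1000)

def negloglik(theta):
    """Negative log-likelihood of a Gaussian model f(.; mu, sigma)."""
    mu, log_sigma = theta        # parameterise sigma on the log scale...
    sigma = np.exp(log_sigma)    # ...so the optimisation is unconstrained
    z = (x - mu) / sigma
    return np.sum(0.5 * z**2 + np.log(sigma) + 0.5 * np.log(2 * np.pi))

theta_hat = minimize(negloglik, x0=np.zeros(2)).x
print(theta_hat[0], np.exp(theta_hat[1]))   # should be close to (2.0, 0.5)
```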

## 1 Basis function method for density

Let’s consider the case where we try to estimate this function by constructing it from some given basis of $$p$$ functions $$\phi_j: \mathbb{R}^d\rightarrow[0,\infty),$$ so that

$f(x)=\sum_{j\leq p}w_j\phi_j(x)$

and $$\theta\equiv\{w_j\}_{j\leq p}.$$ We keep this simple by requiring $$\int\phi_j(x)dx=1,$$ so that they are all valid densities. Then the requirement that $$\int f(x)dx=1$$ will imply that $$\sum_j w_j=1,$$ i.e. we are taking a convex combination of these basis densities.
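
One concrete choice (purely for illustration) is a basis of Gaussian bumps on a fixed grid; the centres and bandwidth below are arbitrary:

```python
import numpy as np
from scipy.stats import norm

# A basis of p Gaussian bumps on a grid; each phi_j integrates to one,
# so any convex combination of them is again a valid density.
centres = np.linspace(0.0, 10.0, 8)    # p = 8, an arbitrary choice
bandwidth = 1.0

def phi(x):
    """All p basis densities evaluated at the points x; shape (len(x), p)."""
    return norm.pdf(np.asarray(x)[:, None], loc=centres[None, :], scale=bandwidth)

def f(x, w):
    """Mixture density f(x) = sum_j w_j phi_j(x), for weights w on the simplex."""
    return phi(x) @ w

w = np.full(8, 1.0 / 8)                # uniform convex weights
print(f(np.array([2.5, 5.0]), w))
```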

Then the maximum likelihood estimator can be written \begin{aligned} \hat{\theta} &=\operatorname{arg max}_{\{w_j\}}f(\{x_i\};\{w_j\})\\ &=\operatorname{arg max}_{\{w_j\}}\sum_i \log \sum_{j\leq p}w_j\phi_j(x_i) \end{aligned}

A moment’s thought reveals that this objective has no finite maximiser as written, since it is strictly increasing in each $$w_j$$. However, we are missing a constraint: to be a well-defined probability density, $$f$$ must integrate to unity, i.e. $\int f(x;\{w_j\})dx = 1,$ and therefore \begin{aligned} \int \sum_{j\leq p}w_j\phi_j(x)dx&=1\\ \sum_{j\leq p}w_j\int\phi_j(x)dx&=1\\ \sum_{j\leq p}w_j &=1. \end{aligned}

By messing around with Lagrange multipliers to enforce that constraint (the multiplier turns out to be $$n$$) we find the stationarity condition $\sum_i \frac{\phi_k(x_i)}{\sum_{j\leq p} w_j\phi_j(x_i)} = n \quad\text{for each } k \text{ with } w_k>0,$ which rearranges into the fixed-point (EM) update $\hat{w}_k = \frac{1}{n}\sum_i \frac{\hat{w}_k\phi_k(x_i)}{\sum_{j\leq p} \hat{w}_j\phi_j(x_i)}.$ There is no closed form in general; in the special case of basis densities with disjoint supports of equal volume this collapses to the histogram-style estimate $\hat{w}_k = \frac{\sum_i \phi_k(x_i)}{\sum_i \sum_j \phi_j(x_i)}.$
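
A minimal sketch of that fixed-point iteration, reusing the Gaussian-bump basis idea from above; the data, basis, and iteration count are invented for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(3.0, 0.7, 400), rng.normal(7.0, 1.0, 600)])
centres = np.linspace(0.0, 10.0, 8)
Phi = norm.pdf(x[:, None], loc=centres[None, :], scale=1.0)  # (n, p) basis matrix

# Fixed-point (EM) iteration:
#   w_k <- (1/n) sum_i  w_k phi_k(x_i) / sum_j w_j phi_j(x_i)
w = np.full(Phi.shape[1], 1.0 / Phi.shape[1])
for _ in range(500):
    resp = Phi * w                           # unnormalised responsibilities (n, p)
    resp /= resp.sum(axis=1, keepdims=True)
    w = resp.mean(axis=0)                    # stays on the simplex automatically

print(w.round(3), w.sum())                   # the weights still sum to 1
```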

## 2 Intensities

Consider the problem of estimating the intensity $$\lambda$$ of a simple, non-interacting inhomogeneous point process on some compact $$W\subset\mathbb{R}^d$$ from a realisation $$\{x_i\}_{i\leq n}$$; here the counting function $$N(B)$$ counts the number of points that fall in a set $$B\subseteq W$$.

The intensity is (in the simple non-interacting case; see Daley and Vere-Jones (2003) for other cases) a function $$\lambda:W\rightarrow [0,\infty)$$ such that, for any box $$B\subseteq W$$, $N(B)\sim\operatorname{Poisson}(\Lambda(B))$ where $\Lambda(B):=\int_B\lambda(x)dx$ and, for any disjoint boxes $$A$$ and $$B$$, $$N(A)\perp N(B).$$
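
To make the object concrete, here is one standard way to simulate such a process on an interval, Lewis-Shedler thinning; the particular intensity function is an arbitrary example:

```python
import numpy as np

def thin_poisson(lam, lam_max, W=(0.0, 10.0), rng=None):
    """Simulate an inhomogeneous Poisson process on the interval W by
    Lewis-Shedler thinning: propose from a homogeneous process at rate
    lam_max, keep each candidate x with probability lam(x) / lam_max."""
    rng = rng or np.random.default_rng()
    a, b = W
    n = rng.poisson(lam_max * (b - a))            # homogeneous total count
    candidates = rng.uniform(a, b, size=n)
    keep = rng.uniform(size=n) < lam(candidates) / lam_max
    return np.sort(candidates[keep])

# Example: a sinusoidally modulated intensity, bounded above by 20.
points = thin_poisson(lambda x: 10.0 * (1.0 + np.sin(x)), lam_max=20.0)
print(len(points))   # E N(W) = 100 + 10(1 - cos 10) ≈ 118
```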

After some argumentation about intensities we can find a likelihood for the observed point pattern: $f(\{x_i\};\tau)= \left(\prod_i \lambda(x_i;\tau)\right)\exp\left(-\int_W\lambda(x;\tau)dx\right).$

Say that we wish to find the inhomogeneous intensity function by the method of maximum likelihood. We allow the intensity function to be described by a parameter vector $$\tau,$$ which we write $$\lambda(x;\tau)$$, and we once again construct an estimate: \begin{aligned} \hat{\tau}&:=\operatorname{arg max}_\tau\log f(\{x_i\};\tau)\\ &=\operatorname{arg max}_\tau\log \left(\left(\prod_i\lambda(x_i;\tau)\right) \exp\left(-\int_W\lambda(x;\tau) dx\right)\right)\\ &=\operatorname{arg max}_\tau\sum_i\log \lambda(x_i;\tau)-\int_W\lambda(x;\tau)dx\\ &=\operatorname{arg max}_\tau\sum_i\log \lambda(x_i;\tau)-\Lambda(W). \end{aligned}
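
A direct numerical version of this program, assuming (for illustration) a log-linear intensity family $$\lambda(x;\tau)=\exp(\tau_0+\tau_1 x)$$ on an interval, with crude quadrature standing in for $$\Lambda(W)$$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
W = (0.0, 10.0)

# Simulate from lambda(x) = exp(1 + 0.2 x) by thinning; true tau = (1.0, 0.2).
lam_true = lambda t: np.exp(1.0 + 0.2 * t)
lam_max = lam_true(W[1])
m = rng.poisson(lam_max * (W[1] - W[0]))
cand = rng.uniform(W[0], W[1], size=m)
x = cand[rng.uniform(size=m) < lam_true(cand) / lam_max]

grid = np.linspace(W[0], W[1], 2001)   # crude quadrature grid for Lambda(W)
dx = grid[1] - grid[0]

def negloglik(tau):
    lam_x = np.exp(tau[0] + tau[1] * x)                    # lambda at the points
    Lam_W = np.sum(np.exp(tau[0] + tau[1] * grid)) * dx    # ≈ Lambda(W)
    return -(np.sum(np.log(lam_x)) - Lam_W)

tau_hat = minimize(negloglik, x0=np.zeros(2)).x
print(tau_hat)   # should be roughly (1.0, 0.2), up to sampling noise
```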

## 3 Basis function method for intensity

Now consider the case where we assume that the intensity can be written in a $$\phi_j$$ basis as above, so that $\lambda(x)=\sum_{j\leq p}\omega_j\phi_j(x)$ with $$\tau\equiv\{\omega_j\},$$ and where each $$\phi_j$$ is supported within $$W$$ so that $$\int_W\phi_j(x)dx=1.$$ Then our estimate may be written \begin{aligned} \hat{\tau}&:=\operatorname{arg max}_{\{\omega_j\}}f\left(\{x_i\};\{\omega_j\} \right)\\ &=\operatorname{arg max}_{\{\omega_j\}}\sum_i\log \lambda(x_i;\tau)-\Lambda(W)\\ &=\operatorname{arg max}_{\{\omega_j\}}\sum_i\log \sum_{j\leq p}\omega_j\phi_j(x_i)-\int_W\sum_{j\leq p}\omega_j\phi_j(x)dx\\ &=\operatorname{arg max}_{\{\omega_j\}}\sum_i\log \sum_{j\leq p}\omega_j\phi_j(x_i)-\sum_{j\leq p}\omega_j\int_W \phi_j(x)dx \\ &=\operatorname{arg max}_{\{\omega_j\}}\sum_i\left(\log \sum_{j\leq p}\omega_j\phi_j(x_i)\right)-\sum_{j\leq p}\omega_j \end{aligned} We have a similar log-likelihood to the density estimation case.

Under the constraint that the expected count matches the observed count, $$\mathbb{E}\,\hat{N}(W)=\hat{\Lambda}(W)=n,$$ we have $$\sum_j\omega_j=n$$ and therefore $\hat{\tau} =\operatorname{arg max}_{\{\omega_j\}}\sum_i\left(\log \sum_{j\leq p}\omega_j\phi_j(x_i)\right)-n.$

Note that if we consider the points as a scaled density we find the maximiser is the same as that obtained by considering the points as an inhomogeneous spatial point pattern, up to an additive constant in the objective: the weights are related by $$\omega_j\equiv nw_j.$$
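
We can check this equivalence numerically: fit the density weights by the EM fixed point of Section 1, fit the intensity weights by directly maximising the objective of Section 3, and compare. All settings here are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(3.0, 0.7, 300), rng.normal(7.0, 1.0, 700)])
n = len(x)
Phi = norm.pdf(x[:, None], loc=np.linspace(0.0, 10.0, 8)[None, :], scale=1.0)

# Density weights: EM fixed point on the simplex, as in Section 1.
w = np.full(8, 1.0 / 8)
for _ in range(2000):
    resp = Phi * w
    resp /= resp.sum(axis=1, keepdims=True)
    w = resp.mean(axis=0)

# Intensity weights: maximise sum_i log(Phi omega) - sum_j omega_j directly,
# parameterised on the log scale to keep omega_j > 0.
def negloglik(log_omega):
    omega = np.exp(log_omega)
    return -(np.sum(np.log(Phi @ omega)) - omega.sum())

omega = np.exp(minimize(negloglik, x0=np.log(np.full(8, n / 8))).x)

print((n * w).round(2))
print(omega.round(2))   # the two weight vectors should approximately coincide
```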

## 4 Count regression

From the other direction, we can formulate density estimation as a count regression; for “nice” distributions this is the same as estimating the correct Poisson intensity for every given small region of the state space. 🏗
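
One version of this idea, sometimes called the Poisson trick (Lindsey-style), is to bin the data and fit the bin counts by Poisson regression on a low-order polynomial log-intensity. A rough sketch, with invented data and an arbitrary quadratic model (which happens to be able to represent a Gaussian exactly):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.normal(5.0, 1.5, size=2000)

# Bin the data: counts y_k over equal-width bins with midpoints t_k.
edges = np.linspace(0.0, 10.0, 41)
y, _ = np.histogram(x, bins=edges)
t = 0.5 * (edges[:-1] + edges[1:])
width = edges[1] - edges[0]

# Poisson regression: y_k ~ Poisson(width * exp(X_k beta)),
# with a quadratic log-intensity in t.
X = np.stack([np.ones_like(t), t, t**2], axis=1)

def negloglik(beta):
    log_mu = np.log(width) + X @ beta
    return np.sum(np.exp(log_mu) - y * log_mu)

beta = minimize(negloglik, x0=np.array([0.0, 0.0, -0.1])).x
f_hat = np.exp(X @ beta)
f_hat /= np.sum(f_hat) * width   # renormalise into an (approximate) density
```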

## 5 Probability over boxes

Consider a box $$B\subset \mathbb{R}^d$$. The probability of any one $$X_i$$ falling within that box is $P(X_i\in B)=E\left(\mathbb{I}\{X_i\in B\}\right)=\int_B dF(x).$

We know that the expected number of the $$X_i$$ to fall within that box is $$n$$ times the probability of any one falling in that box, i.e. $E\left(\sum_{i\leq n}\mathbb{I}\{X_i\in B\}\right)=n\int_B dF(x).$ If we moreover Poissonise, i.e. randomise the sample size so that $$n\sim\operatorname{Poisson}(\Lambda(W)),$$ the box counts become exactly Poisson, $P(N(B)=k)=\frac{\exp(-\Lambda(B))\Lambda(B)^k}{k!},$ with $$\Lambda(B)=\Lambda(W)\int_B dF(x).$$ …Where was I going with this? Something to do with linear point process estimation perhaps? 🏗
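
Sanity-checking the Poissonisation claim by simulation (a Poisson distribution has equal mean and variance, which gives a quick diagnostic; all settings arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
total = 50.0                      # Lambda(W): expected total number of points
B = (0.0, 0.5)                    # a box; F is Uniform(0, 1), so P(X in B) = 0.5

counts = []
for _ in range(20000):
    n = rng.poisson(total)        # Poissonised sample size
    x = rng.uniform(0.0, 1.0, n)  # i.i.d. draws from F
    counts.append(np.sum((B[0] <= x) & (x < B[1])))

counts = np.asarray(counts)
# N(B) should be Poisson with mean Lambda(B) = 50 * 0.5 = 25.
print(counts.mean(), counts.var())   # both ≈ 25 for a Poisson distribution
```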

## 6 Score function versus hazard function

The score function and log-hazard rates are similar beasts. We can exploit that in other ways perhaps, e.g. in a Langevin dynamics algorithm? But will we gain something useful from that? Does *Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation* leverage something like that?
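
In one dimension the link is explicit. Writing $$S:=1-F$$ for the survival function and $$h:=f/S$$ for the hazard rate, we have $$\frac{d}{dt}\log S(t)=-h(t),$$ and hence \begin{aligned} \log f(t) &= \log h(t) + \log S(t)\\ \frac{d}{dt}\log f(t) &= \frac{d}{dt}\log h(t) - h(t), \end{aligned} i.e. the (Stein) score is the derivative of the log-hazard minus the hazard itself.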

## 7 Interacting point processes

Interacting point processes have intensities too, which may also be re-interpreted as densities. What kind of relations between the RVs would be implied by this “dynamically evolving” density? Clearly they are not i.i.d. But is this useful somewhere?

## 8 References

Andersen, Borgan, Gill, et al. 1997. Statistical Models Based on Counting Processes. Springer Series in Statistics.
Anderson, and Richardson. 1979. Technometrics.
Barron, and Sheu. 1991. The Annals of Statistics.
Berman, and Diggle. 1989. Journal of the Royal Statistical Society. Series B (Methodological).
Brown, Cai, and Zhou. 2010. The Annals of Statistics.
Castellan. 2003. IEEE Transactions on Information Theory.
Cox. 1965. Journal of the Royal Statistical Society: Series B (Methodological).
Cunningham, Shenoy, and Sahani. 2008. In Proceedings of the 25th International Conference on Machine Learning. ICML ’08.
Daley, and Vere-Jones. 2003. An Introduction to the Theory of Point Processes.
———. 2008. An Introduction to the Theory of Point Processes. Probability and Its Applications.
Efromovich. 1996. The Annals of Statistics.
———. 2007. The Annals of Statistics.
Eilers, and Marx. 1996. Statistical Science.
Ellis. 1991. Stochastic Processes and Their Applications.
Giesecke, Kakavand, and Mousavi. 2008. In Simulation Conference, 2008. WSC 2008. Winter.
Gu. 1993. Journal of the American Statistical Association.
Heigold, Schlüter, and Ney. 2007. In Eighth Annual Conference of the International Speech Communication Association.
Hinton, Deng, Yu, et al. 2012. IEEE Signal Processing Magazine.
Kooperberg, and Stone. 1991. Computational Statistics & Data Analysis.
———. 1992. Journal of Computational and Graphical Statistics.
Leung, and Barron. 2006. IEEE Transactions on Information Theory.
Marteau-Ferey, Bach, and Rudi. 2020. In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS ’20.
Miller, Cole, Louppe, et al. 2020.
Møller, and Waagepetersen. 2003. Statistical Inference and Simulation for Spatial Point Processes.
Norets. 2010. The Annals of Statistics.
Panaretos, and Zemel. 2016. The Annals of Statistics.
Papangelou. 1974. Zeitschrift Für Wahrscheinlichkeitstheorie Und Verwandte Gebiete.
Rásonyi, and Tikosi. 2022. Statistics & Probability Letters.
Reynaud-Bouret. 2003. Probability Theory and Related Fields.
Saul, and Lee. 2001. In Advances in Neural Information Processing Systems.
Schoenberg. 2005. Journal of Statistical Planning and Inference.
Sha, and Saul. 2006. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.
Sugiyama, Takeuchi, Suzuki, et al. 2010. In International Conference on Artificial Intelligence and Statistics.
Tsuchida, Ong, and Sejdinovic. 2023.
Tüske, Tahir, Schlüter, et al. 2015. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
van Lieshout. 2011. Methodology and Computing in Applied Probability.
Willett, and Nowak. 2007. IEEE Transactions on Information Theory.