# The interpretation of RV densities as point process intensities and vice versa

## Point process of observations ↔ observation of a point process

We can estimate a density by treating the observations drawn from it as a point process. In one dimension this gives us the particularly lovely trick of survival analysis, but the relations are much more general.

Consider the problem of estimating the common density $$f(x)dx=dF(x)$$ of indexed i.i.d. random variables $$\{X_i\}_{1\leq i\leq n}\in \mathbb{R}^d$$ from $$n$$ realisations of those variables, $$\{x_i\}_{i\leq n},$$ where $$F:\mathbb{R}^d\rightarrow[0,1]$$ is a (cumulative) distribution. We assume the distribution is absolutely continuous with respect to the Lebesgue measure $$\mu$$, i.e. $$\mu(A)=0\Rightarrow P(X_i\in A)=0$$. This implies that $$P(X_i=X_j)=0\text{ for }i\neq j$$ and that the density exists as a standard function (i.e. we do not need to consider generalised functions such as distributions to handle atoms in $$F$$, etc.).

Here we parameterise the density with some finite dimensional parameter vector $$\theta,$$ i.e. $$f(x;\theta),$$ whose value completely characterises the density; the problem of estimating the density is then the same as the one of estimating $$\theta.$$

In the method of maximum likelihood estimation, we seek to maximise the likelihood of the data. That is, we choose a parameter estimate $$\hat{\theta}$$ to satisfy \begin{aligned} \hat{\theta} &:=\operatorname{arg max}_\theta\prod_i f(x_i;\theta)\\ &=\operatorname{arg max}_\theta\sum_i \log f(x_i;\theta). \end{aligned}
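To make this concrete, here is a minimal numerical sketch of that recipe; the exponential family member and the crude grid search are chosen purely for illustration. The grid maximiser of the summed log density should land on the closed-form MLE $$\hat{\theta}=1/\bar{x}$$.

```python
import numpy as np

# A minimal sketch of maximum likelihood estimation: for an assumed
# exponential density f(x; theta) = theta * exp(-theta * x), the sum of
# log f(x_i; theta) is maximised at theta_hat = 1 / mean(x).
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)  # true theta = 0.5

def log_likelihood(theta, x):
    return np.sum(np.log(theta) - theta * x)

thetas = np.linspace(0.1, 2.0, 400)
theta_hat = thetas[np.argmax([log_likelihood(t, x) for t in thetas])]
print(theta_hat)  # close to 1 / x.mean() ≈ 0.5
```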

## Basis function method for density

Let’s consider the case where we try to estimate this function by constructing it from some given basis of $$p$$ functions $$\phi_j: \mathbb{R}^d\rightarrow[0,\infty),$$ so that

$f(x)=\sum_{j\leq p}w_j\phi_j(x)$

and $$\theta\equiv\{w_j\}_{j\leq p}.$$ We keep this simple by requiring $$\int\phi_j(x)dx=1,$$ so that they are all valid densities. Then the requirement that $$\int f(x)dx=1$$ will imply that $$\sum_j w_j=1,$$ i.e. we are taking a convex combination of these basis densities.

Then the maximum likelihood estimator can be written \begin{aligned} \hat{\theta} &=\operatorname{arg max}_{\{w_i\}}f(\{x_i\};\{w_i\})\\ &=\operatorname{arg max}_{\{w_i\}}\sum_i \log \sum_{j\leq p}w_j\phi_j(x_i) \end{aligned}

A moment’s thought reveals that this equation has no finite optimum, since it is strictly increasing in each $$w_j$$. However, we are missing a constraint, which is that to be a well-defined probability density, it must integrate to unity, i.e. $\int f(\{x\};\{w_i\})dx = 1$ and therefore \begin{aligned} \int \sum_{j\leq p}w_j\phi_j(x)dx&=1\\ \sum_{j\leq p}w_j\int\phi_j(x)dx&=1\\ \sum_{j\leq p}w_j &=1. \end{aligned}

By messing around with Lagrange multipliers to enforce that constraint we eventually find $\hat{w}_k = \frac{\sum_i \phi_k(x_i)}{\sum_i \sum_j \phi_j(x_i)}.$
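That closed form is cheap to compute. A sketch, choosing Gaussian bump densities as the basis purely for illustration (the derivation above does not prescribe a basis):

```python
import numpy as np

# Sketch of the closed-form weights w_hat_k = sum_i phi_k(x_i) / sum_i sum_j phi_j(x_i),
# using p Gaussian "bump" densities phi_j as the (illustrative) basis.
rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=0.5, size=500)   # observations
centres = np.linspace(-2, 4, 7)                # basis locations

def phi(x, c, s=0.5):
    """Gaussian density with mean c and scale s, evaluated at x."""
    return np.exp(-0.5 * ((x - c) / s) ** 2) / (s * np.sqrt(2 * np.pi))

Phi = phi(x[:, None], centres[None, :])   # n x p matrix of phi_j(x_i)
w_hat = Phi.sum(axis=0) / Phi.sum()       # the estimator above
print(w_hat.round(3))                     # a convex combination: sums to 1
```

As required, the weights are non-negative and sum to one, so the fitted $$f$$ is a genuine density.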

## Intensities

Consider the problem of estimating the intensity $$\lambda$$ of a simple, non-interacting inhomogeneous point process on some compact $$W\subset\mathbb{R}^d$$ from a realisation $$\{x_i\}_{i\leq n}$$, where the counting function $$N(B)$$ gives the number of points that fall in a set $$B\subset W$$.

The intensity is (in the simple non-interacting case — see Daley and Vere-Jones (2003) for other cases) a function $$\lambda:\mathbb{R}^d\rightarrow [0,\infty)$$ such that, for any box $$B\subset W$$, $N(B)\sim\operatorname{Poisson}(\Lambda(B))$ where $\Lambda(B):=\int_B\lambda(x)dx$ and, for any disjoint boxes $$A$$ and $$B$$, $$N(A)\perp N(B).$$
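We can check that characterisation by simulation, e.g. generating such a process by thinning (in the Lewis–Shedler style) and comparing the average $$N(B)$$ against $$\Lambda(B)$$. The particular intensity here is arbitrary:

```python
import numpy as np

# Sketch: simulate an inhomogeneous Poisson process on W = [0, 1] by thinning
# a homogeneous one (intensity lambda(x) = 100 x, bounded by lam_max = 100),
# then check E[N(B)] against Lambda(B) for B = [0, 0.5].
rng = np.random.default_rng(6)
lam_max = 100.0

def lam(x):
    return 100.0 * x

def simulate():
    m = rng.poisson(lam_max)                  # homogeneous proposal count
    x = rng.uniform(0.0, 1.0, size=m)         # proposal locations
    keep = rng.uniform(0.0, lam_max, size=m) < lam(x)  # keep w.p. lam/lam_max
    return x[keep]

counts = [(simulate() < 0.5).sum() for _ in range(4000)]
print(np.mean(counts))  # ≈ Lambda([0, 0.5]) = integral of 100x over [0, 0.5] = 12.5
```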

After some argumentation about intensities we can find a likelihood for the observed point pattern: $f(\{x_i\};\tau)= \left(\prod_i \lambda(x_i;\tau)\right)\exp\left(-\int_W\lambda(x;\tau)dx\right).$

Say that we wish to find the inhomogeneous intensity function by the method of maximum likelihood. We allow the intensity function to be described by a parameter vector $$\tau,$$ which we write $$\lambda(x;\tau)$$, and we once again construct an estimate: \begin{aligned} \hat{\tau}&:=\operatorname{arg max}_\tau\log f(\{x_i\};\tau)\\ &=\operatorname{arg max}_\tau\log \left(\left(\prod_i\lambda(x_i;\tau)\right) \exp\left(-\int_W\lambda(x;\tau) dx\right)\right)\\ &=\operatorname{arg max}_\tau\sum_i\log \lambda(x_i;\tau)-\int_W\lambda(x;\tau)dx\\ &=\operatorname{arg max}_\tau\sum_i\log \lambda(x_i;\tau)-\Lambda(W). \end{aligned}
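A quick sanity check of this objective: for a homogeneous intensity $$\lambda(x;\tau)=\tau$$ on $$W=[0,1]$$ we have $$\Lambda(W)=\tau$$, and the maximiser should be $$\hat{\tau}=N(W)/|W|$$, i.e. the observed count. A sketch (grid search for illustration):

```python
import numpy as np

# Sketch: maximise sum_i log lambda(x_i; tau) - Lambda(W) for a homogeneous
# intensity lambda(x; tau) = tau on W = [0, 1], so Lambda(W) = tau.
# The maximiser should be tau_hat = n, the observed point count.
rng = np.random.default_rng(2)
n = rng.poisson(50)
points = rng.uniform(0, 1, size=n)

def log_lik(tau, points):
    # sum_i log lambda(x_i; tau) - Lambda(W)
    return len(points) * np.log(tau) - tau

taus = np.linspace(1, 100, 2000)
tau_hat = taus[np.argmax([log_lik(t, points) for t in taus])]
print(tau_hat, n)  # tau_hat ≈ n
```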

## Basis function method for intensity

Now consider the case where we assume that the intensity can be written in a $$\phi_j$$ basis as above, so that $\lambda(x)=\sum_{j\leq p}\omega_j\phi_j(x)$ with $$\tau\equiv\{\omega_j\}$$ (note that since each $$\phi_j$$ is a density supported on $$W$$, $$\int_W\phi_j(x)dx=1$$). Then our estimate may be written \begin{aligned} \hat{\tau}&:=\operatorname{arg max}_{\{\omega_j\}}f\left(\{x_i\};\{\omega_j\} \right)\\ &=\operatorname{arg max}_{\{\omega_j\}}\sum_i\log \lambda(x_i;\tau)-\Lambda(W)\\ &=\operatorname{arg max}_{\{\omega_j\}}\sum_i\log \sum_{j\leq p}\omega_j\phi_j(x_i)-\int_W\sum_{j\leq p}\omega_j\phi_j(x)dx\\ &=\operatorname{arg max}_{\{\omega_j\}}\sum_i\log \sum_{j\leq p}\omega_j\phi_j(x_i)-\sum_{j\leq p}\omega_j\int_W \phi_j(x)dx \\ &=\operatorname{arg max}_{\{\omega_j\}}\sum_i\left(\log \sum_{j\leq p}\omega_j\phi_j(x_i)\right)-\sum_{j\leq p}\omega_j. \end{aligned} We have a similar log-likelihood to the density estimation case.

Under the constraint that the expected total count matches the observed count, $$E(N(W))=n$$, we have $$\sum_j\omega_j=n$$ and therefore $\hat{\tau} =\operatorname{arg max}_{\{\omega_j\}}\sum_i\left(\log \sum_{j\leq p}\omega_j\phi_j(x_i)\right)-n.$

Note that if we consider the points as realisations of a density scaled by $$n$$, the optimum is the same as the one obtained by considering the points as an inhomogeneous spatial point pattern, up to a constant offset of $$n$$ in the objective, i.e. $$\omega_j\equiv nw_j.$$
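The correspondence $$\omega_j\equiv nw_j$$ is mechanical to verify numerically; a sketch, reusing the closed-form density weights from earlier with an arbitrary Gaussian basis:

```python
import numpy as np

# Sketch checking omega_j = n * w_j: the density weights from the earlier
# closed form, rescaled by n, satisfy the intensity constraint sum_j omega_j = n.
# (The Gaussian basis is illustrative, not prescribed.)
rng = np.random.default_rng(3)
x = rng.normal(size=200)
centres = np.array([-2.0, 0.0, 2.0])

def phi(x, c, s=1.0):
    return np.exp(-0.5 * ((x - c) / s) ** 2) / (s * np.sqrt(2 * np.pi))

Phi = phi(x[:, None], centres[None, :])
w_hat = Phi.sum(axis=0) / Phi.sum()     # density weights, sum to 1
omega_hat = len(x) * w_hat              # intensity weights, sum to n
print(omega_hat.sum())                  # = n = 200
```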

## Count regression

From the other direction, we can formulate density estimation as a count regression; for “nice” distributions this will be the same as estimating the correct Poisson intensity for every given small region of the state space. 🏗
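The simplest instance of that idea is binning: treat each bin count as Poisson with mean $$n f(x)\cdot\text{binwidth}$$; the per-bin intensity MLE is count/binwidth, and dividing by $$n$$ recovers a (histogram) density estimate. A sketch:

```python
import numpy as np

# Sketch of density estimation as count regression: bin the data, treat each
# bin count as Poisson with mean n * f(x) * binwidth; the per-bin MLE of the
# intensity is count / binwidth, and dividing by n gives a density estimate.
rng = np.random.default_rng(4)
x = rng.normal(size=5_000)
counts, edges = np.histogram(x, bins=40, range=(-4, 4))
binwidth = edges[1] - edges[0]
lam_hat = counts / binwidth              # per-bin Poisson intensity estimate
f_hat = lam_hat / len(x)                 # the corresponding density estimate
print(np.sum(f_hat * binwidth))          # integrates to (at most) 1
```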

## Probability over boxes

Consider a box $$B\subset \mathbb{R}^d$$. The probability of any one $$X_i$$ falling within that box is $P(X_i\in B)=E\left(\mathbb{I}\{X_i\in B\}\right)=\int_B dF(x).$

We know that the expected number of $$X_i$$ to fall within that box is $$n$$ times the probability of any one falling in that box, i.e. $E\left(\sum_{i\leq n}\mathbb{I}\{X_i\in B\}\right)=n\int_B dF(x)$ and, if we additionally randomise the total count Poisson-wise, $P(N(B)=k)=\frac{\exp(-\Lambda(B))\Lambda(B)^k}{k!}.$ …Where was I going with this? Something to do with linear point process estimation perhaps? 🏗
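A quick Monte Carlo check of that expectation identity, for $$B=[0,1)$$ and standard normal $$X_i$$:

```python
import numpy as np
from math import erf, sqrt

# Monte Carlo check of E[N(B)] = n * P(X in B) for B = [0, 1) and
# X ~ N(0, 1), averaged over many independent replications.
rng = np.random.default_rng(5)
n, reps = 100, 5_000
samples = rng.normal(size=(reps, n))
counts = ((samples >= 0) & (samples < 1)).sum(axis=1)
p_B = 0.5 * erf(1 / sqrt(2))          # P(0 <= X < 1) for standard normal
print(counts.mean(), n * p_B)         # both ≈ 34.1
```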

## Score function versus hazard function

The score function and log-hazard rates are similar beasts. We can exploit that in other ways perhaps, e.g. in a Langevin dynamics algorithm? But will we gain something useful from that? Does Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation leverage something like that?

## Interacting point processes

Interacting point processes have intensities too, which may also be re-interpreted as densities. What kind of relations between the RVs are implied by such a “dynamically evolving” density? Clearly not i.i.d. But useful somewhere?
