## Densities

Consider the problem of estimating the common density \(f\), where \(f(x)dx=dF(x)\), of i.i.d. random variables \(\{X_i\}_{1\leq i\leq n}\) taking values in \(\mathbb{R}^d\), from \(n\) realisations of those variables, \(\{x_i\}_{i\leq n}\), where \(F:\mathbb{R}^d\rightarrow[0,1]\) is the (cumulative) distribution function. We assume the distribution is absolutely continuous with respect to the Lebesgue measure \(\mu\), i.e. \(\mu(A)=0\Rightarrow P(X_i\in A)=0\). This implies that \(P(X_i=X_j)=0\text{ for }i\neq j\) and that the density exists as a standard function (i.e. we do not need to consider generalised functions such as distributions to handle atoms in \(F\) etc.).

Here we will parameterise the density with some finite dimensional parameter vector \(\theta,\) i.e. \(f(x;\theta),\) whose value completely characterises the density; the problem of estimating the density is then the same as the one of estimating \(\theta.\)

In the method of maximum likelihood estimation we seek to maximise the likelihood of the observed data. That is, we choose a parameter estimate \(\hat{\theta}\) to satisfy

\[ \begin{aligned} \hat{\theta} &:=\operatorname{argmax}_\theta\prod_i f(x_i;\theta)\\ &=\operatorname{argmax}_\theta\sum_i \log f(x_i;\theta) \end{aligned} \]
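To make the argmax concrete, here is a minimal sketch for a one-parameter family. The exponential density \(f(x;\theta)=\theta e^{-\theta x}\), the simulated data, and the helper names `log_likelihood` and `argmax_1d` are all assumptions chosen for illustration; nothing here is specific to the method in the text. The numerical maximiser is compared against the known closed form \(n/\sum_i x_i\) for this family.

```python
import math
import random

def log_likelihood(theta, xs):
    """Sum of log f(x_i; theta) for f(x; theta) = theta * exp(-theta * x)."""
    return sum(math.log(theta) - theta * x for x in xs)

def argmax_1d(objective, lo, hi, tol=1e-8):
    """Golden-section search for the maximiser of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if objective(c) > objective(d):
            b = d          # the maximum lies in [a, d]
        else:
            a = c          # the maximum lies in [c, b]
    return 0.5 * (a + b)

# Hypothetical data: 1000 draws from an exponential with rate 2.
rng = random.Random(0)
xs = [rng.expovariate(2.0) for _ in range(1000)]

theta_hat = argmax_1d(lambda t: log_likelihood(t, xs), 1e-6, 50.0)
closed_form = len(xs) / sum(xs)   # known MLE of the exponential rate
print(theta_hat, closed_form)
```

The search works on the log-likelihood rather than the raw product, which is the form on the second line of the display above and avoids numerical underflow.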

## Basis function method for density

Let’s consider the case where we try to estimate this function by constructing it from some given basis of \(p\) functions \(\phi_j: \mathbb{R}^d\rightarrow[0,\infty),\) so that

\[f(x)=\sum_{j\leq p}w_j\phi_j(x)\]

and \(\theta\equiv\{w_j\}_{j\leq p}.\) We will keep this simple by requiring \(\int\phi_j(x)dx=1,\) so that they are all valid densities. Then the requirement that \(\int f(x)dx=1\) will imply that \(\sum_j w_j=1,\) i.e. we are taking a convex combination of these basis densities.

Then the maximum likelihood estimator can be written

\[ \begin{aligned} \hat{\theta} &=\operatorname{argmax}_{\{w_j\}}\prod_i f(x_i;\{w_j\})\\ &=\operatorname{argmax}_{\{w_j\}}\sum_i \log \sum_{j\leq p}w_j\phi_j(x_i) \end{aligned} \]

A moment’s thought reveals that this problem has no finite maximiser as stated, since the objective is increasing in each \(w_j\) and can be made arbitrarily large by scaling the weights up. However, we are missing a constraint, which is that to be a well-defined probability density, it must integrate to unity, i.e.

\[ \int f(x;\{w_j\})dx = 1 \] and therefore

\[ \begin{aligned} \int \sum_{j\leq p}w_j\phi_j(x)dx&=1\\ \sum_{j\leq p}w_j\int\phi_j(x)dx&=1\\ \sum_{j\leq p}w_j &=1\\ \end{aligned} \]

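Enforcing that constraint with a Lagrange multiplier \(\mu\), we set the derivative of \(\sum_i \log \sum_{j\leq p}w_j\phi_j(x_i)-\mu\left(\sum_{j\leq p}w_j-1\right)\) with respect to each \(w_k\) to zero, which gives \(\sum_i \phi_k(x_i)/\sum_{j\leq p} w_j\phi_j(x_i)=\mu.\) Multiplying by \(w_k\) and summing over \(k\) shows that \(\mu=n,\) so the estimate satisfies the fixed-point condition

\[ \hat{w}_k = \frac{1}{n}\sum_i \frac{\hat{w}_k\phi_k(x_i)}{\sum_{j\leq p} \hat{w}_j\phi_j(x_i)}. \]

This is implicit in \(\hat{w}\) and has no closed-form solution in general, but iterating it from strictly positive starting weights is the EM algorithm for mixture weights.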
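One way to solve the constrained problem numerically is the standard EM fixed-point iteration for mixture weights, \(w_k\leftarrow\frac{1}{n}\sum_i w_k\phi_k(x_i)/\sum_j w_j\phi_j(x_i)\), which keeps the weights non-negative and summing to one at every step. The following is a minimal sketch under stated assumptions: the Gaussian bumps used as basis densities, the simulated data, and the names `gaussian_pdf` and `fit_weights` are illustrative choices, not part of the text.

```python
import math
import random

def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_weights(xs, basis, n_iter=200):
    """EM fixed-point iteration for convex weights over fixed basis densities."""
    n, p = len(xs), len(basis)
    # phi[i][j] = phi_j(x_i); computed once because the basis is fixed.
    phi = [[b(x) for b in basis] for x in xs]
    w = [1.0 / p] * p
    for _ in range(n_iter):
        acc = [0.0] * p
        for row in phi:
            f_i = sum(wj * pj for wj, pj in zip(w, row))   # current f(x_i)
            for j in range(p):
                acc[j] += w[j] * row[j] / f_i
        w = [a / n for a in acc]
    return w

def log_likelihood(xs, basis, w):
    return sum(math.log(sum(wj * b(x) for wj, b in zip(w, basis))) for x in xs)

# Hypothetical basis: unit-variance Gaussian densities on a grid of centres.
centres = [-3.0, -1.5, 0.0, 1.5, 3.0]
basis = [lambda x, m=m: gaussian_pdf(x, m, 1.0) for m in centres]

# Hypothetical data: a two-component Gaussian mixture.
rng = random.Random(0)
xs = [rng.gauss(-1.5, 0.7) if rng.random() < 0.3 else rng.gauss(1.5, 0.7)
      for _ in range(400)]

w_hat = fit_weights(xs, basis)
print(w_hat, sum(w_hat))
```

Because the basis functions are fixed, the log-likelihood is concave in the weights, so the iteration does not have the local-optimum problems of a general mixture fit.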

## Intensities

Consider the problem of estimating the intensity \(\lambda\) of a *simple*,
non-interacting inhomogeneous point process
on some compact \(W\subset\mathbb{R}^d\) from a realisation
\(\{x_i\}_{i\leq n}\), where the counting function \(N(B)\)
counts the number of points that fall in
a set \(B\subset W\).

The *intensity* is (in the simple non-interacting case –
see (2003)
for other cases) a function
\(\lambda:\mathbb{R}^d\rightarrow [0,\infty)\) such that,
for any box \(B\subset W\),

\[N(B)\sim\operatorname{Poisson}(\Lambda(B))\] where

\[\Lambda(B):=\int_B\lambda(x)dx\] and for any disjoint boxes \(A\) and \(B\), \(N(A)\perp N(B).\)
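The defining property can be checked by simulation. The sketch below is one-dimensional and uses thinning: simulate a homogeneous process at a dominating rate, then keep each candidate point with probability \(\lambda(x)/\lambda_{\max}\). The example intensity, the window \(W=[0,1]\), the box \(B=[0,0.5)\) and the function name `simulate_poisson_process` are all assumptions made for illustration.

```python
import math
import random

def simulate_poisson_process(intensity, lam_max, t_end, rng):
    """Simulate points on [0, t_end] by thinning a homogeneous process of rate lam_max.

    Requires intensity(t) <= lam_max everywhere on [0, t_end].
    """
    points = []
    t = 0.0
    while True:
        t += rng.expovariate(lam_max)              # next candidate point
        if t > t_end:
            break
        if rng.random() * lam_max <= intensity(t):  # keep with prob. intensity / lam_max
            points.append(t)
    return points

def intensity(x):
    # Hypothetical intensity on W = [0, 1]; bounded by 400, integrates to 200 over W.
    return 200.0 * (1.0 + math.sin(2.0 * math.pi * x))

rng = random.Random(1)
n_rep = 200
counts = []
for _ in range(n_rep):
    pts = simulate_poisson_process(intensity, lam_max=400.0, t_end=1.0, rng=rng)
    counts.append(sum(1 for t in pts if t < 0.5))   # N(B) for B = [0, 0.5)

mean_count = sum(counts) / n_rep
expected = 100.0 + 200.0 / math.pi                  # Lambda(B), integrated by hand
print(mean_count, expected)
```

Averaged over replicates, the count in \(B\) should sit close to \(\Lambda(B)\), as the Poisson assumption requires.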

After some argumentation about intensities we can find a likelihood for the observed point pattern:

\[f(\{x_i\};\tau)= \left(\prod_i \lambda(x_i;\tau)\right)\exp\left(-\int_W\lambda(x;\tau)dx\right). \]

Say that we wish to find the inhomogeneous intensity function by the method of maximum likelihood. We let the intensity function be described by a parameter vector \(\tau,\) writing \(\lambda(x;\tau)\), and we once again construct an estimate:

\[ \begin{aligned} \hat{\tau}&:=\operatorname{argmax}_\tau\log f(\{x_i\};\tau)\\ &=\operatorname{argmax}_\tau\log \left(\left(\prod_i\lambda(x_i;\tau)\right) \exp\left(-\int_W\lambda(x;\tau) dx\right)\right)\\ &=\operatorname{argmax}_\tau\sum_i\log \lambda(x_i;\tau)-\int_W\lambda(x;\tau)dx\\ &=\operatorname{argmax}_\tau\sum_i\log \lambda(x_i;\tau)-\Lambda(W) \end{aligned} \]
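As a small worked instance of this objective, the sketch below evaluates \(\sum_i\log\lambda(x_i;\tau)-\int_W\lambda(x;\tau)dx\) on \(W=[0,1]\) with a midpoint-rule integral and maximises it over a grid. The one-parameter family \(\lambda(x;\tau)=\tau(1+x)\), the nine data points and the function names are hypothetical, chosen so that the answer can be checked by hand: the maximiser is \(n/\int_W(1+x)dx=9/1.5=6\).

```python
import math

def poisson_log_likelihood(points, intensity, lo, hi, n_grid=2000):
    """sum_i log lambda(x_i) - int_W lambda(x) dx on W = [lo, hi], midpoint rule."""
    h = (hi - lo) / n_grid
    integral = h * sum(intensity(lo + (k + 0.5) * h) for k in range(n_grid))
    return sum(math.log(intensity(x)) for x in points) - integral

# Hypothetical point pattern on W = [0, 1].
points = [0.12, 0.35, 0.41, 0.58, 0.66, 0.73, 0.81, 0.90, 0.97]

def make_intensity(tau):
    # Hypothetical one-parameter family: lambda(x; tau) = tau * (1 + x).
    return lambda x: tau * (1.0 + x)

# Crude grid search over tau in (0, 20]; fine for a one-dimensional illustration.
taus = [0.1 * k for k in range(1, 201)]
tau_hat = max(
    taus,
    key=lambda t: poisson_log_likelihood(points, make_intensity(t), 0.0, 1.0),
)
print(tau_hat)
```

At the maximiser the fitted total intensity \(\Lambda(W)=1.5\,\hat{\tau}\) equals the number of observed points.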

## Basis function method for intensity

Now consider the case where we assume that the intensity can be written in the \(\phi_j\) basis as above, where each \(\phi_j\) is now normalised over the window, \(\int_W\phi_j(x)dx=1,\) so that

\[\lambda(x)=\sum_{j\leq p}\omega_j\phi_j(x)\]

with \(\tau\equiv\{\omega_j\}.\) Then our estimate may be written

\[ \begin{aligned} \hat{\tau}&:=\operatorname{argmax}_{\{\omega_j\}}\log f\left(\{x_i\};\{\omega_j\} \right)\\ &=\operatorname{argmax}_{\{\omega_j\}}\sum_i\log \lambda(x_i;\tau)-\Lambda(W)\\ &=\operatorname{argmax}_{\{\omega_j\}}\sum_i\log \sum_{j\leq p}\omega_j\phi_j(x_i)-\int_W\sum_{j\leq p}\omega_j\phi_j(x)dx\\ &=\operatorname{argmax}_{\{\omega_j\}}\sum_i\log \sum_{j\leq p}\omega_j\phi_j(x_i)-\sum_{j\leq p}\omega_j\int_W \phi_j(x)dx\\ &=\operatorname{argmax}_{\{\omega_j\}}\sum_i\left(\log \sum_{j\leq p}\omega_j\phi_j(x_i)\right)-\sum_{j\leq p}\omega_j \end{aligned} \]

We have a similar log-likelihood to the density estimation case.

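Under the constraint that the expected number of points matches the observed number, \(E(N(W))=\Lambda(W)=n,\) we have \(\sum_j\omega_j=n.\) (This in fact holds automatically at the maximiser: writing \(\omega_j=cw_j\) with \(\sum_j w_j=1,\) the objective depends on \(c\) only through \(n\log c-c,\) which is maximised at \(c=n.\)) Therefore

\[ \hat{\tau} =\operatorname{argmax}_{\{\omega_j\}}\sum_i\left(\log \sum_{j\leq p}\omega_j\phi_j(x_i)\right)-n \]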

Note that if we consider the points *as a density* we find that the maximiser is the same
as the one obtained by considering the points as an inhomogeneous spatial
point pattern, up to a scale factor of \(n\), i.e. \(\omega_j\equiv nw_j.\)
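The relation \(\omega_j=nw_j\) can be checked numerically. The sketch below fits the density weights with the EM fixed-point iteration and the intensity weights with the matching multiplicative update \(\omega_k\leftarrow\omega_k\sum_i\phi_k(x_i)/\lambda(x_i)\); both solvers are assumptions, since the text does not name one. The basis is also an assumption: Bernstein polynomial (Beta) densities, chosen because they integrate to one exactly over \(W=[0,1]\).

```python
import math
import random

def bernstein_density(j, p):
    """Beta(j + 1, p - j) density on [0, 1]; integrates to 1 over W = [0, 1]."""
    coef = p * math.comb(p - 1, j)
    return lambda x: coef * x**j * (1.0 - x) ** (p - 1 - j)

def density_weights(phi, n_iter):
    """EM iteration for w_j with sum_j w_j = 1 (density view)."""
    n, p = len(phi), len(phi[0])
    w = [1.0 / p] * p
    for _ in range(n_iter):
        acc = [0.0] * p
        for row in phi:
            f_i = sum(wj * pj for wj, pj in zip(w, row))
            for j in range(p):
                acc[j] += w[j] * row[j] / f_i
        w = [a / n for a in acc]
    return w

def intensity_weights(phi, n_iter):
    """Multiplicative update for omega_j in the Poisson log-likelihood (intensity view)."""
    p = len(phi[0])
    omega = [1.0] * p
    for _ in range(n_iter):
        acc = [0.0] * p
        for row in phi:
            lam_i = sum(oj * pj for oj, pj in zip(omega, row))
            for j in range(p):
                acc[j] += omega[j] * row[j] / lam_i
        omega = acc   # no division by n: the total mass is free to move
    return omega

# Hypothetical data on W = [0, 1] and a basis of p = 6 Bernstein densities.
rng = random.Random(2)
xs = [rng.betavariate(2.0, 5.0) for _ in range(200)]
p = 6
basis = [bernstein_density(j, p) for j in range(p)]
phi = [[b(x) for b in basis] for x in xs]

n = len(xs)
w_hat = density_weights(phi, n_iter=100)
omega_hat = intensity_weights(phi, n_iter=100)
print([o / n for o in omega_hat])
print(w_hat)
```

The intensity weights sum to \(n\) after the first update without that being imposed, which matches the constraint discussed above.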

## Count regression

We can formulate density estimation as a count regression; for “nice” distributions this will be the same as estimating the correct Poisson intensity for every given small region of the state space (e.g. Gu 1993; Eilers and Marx 1996). 🏗
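The crudest version of this idea uses no smoothing at all: bin the state space, treat each bin count as Poisson with a constant rate, and take the per-bin maximum likelihood rate, which is the count divided by the bin width. The sketch below does only that, on hypothetical data; it is not an implementation of the cited spline methods, which replace the independent bins with a smooth penalised basis.

```python
import random

def histogram_intensity(xs, lo, hi, n_bins):
    """Per-bin Poisson MLE for a piecewise-constant intensity on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in xs:
        k = min(int((x - lo) / width), n_bins - 1)   # clamp the right edge
        counts[k] += 1
    rates = [c / width for c in counts]              # MLE of the rate in each bin
    return counts, rates

# Hypothetical data on [0, 1].
rng = random.Random(3)
xs = [rng.betavariate(2.0, 5.0) for _ in range(500)]

counts, rates = histogram_intensity(xs, 0.0, 1.0, n_bins=20)
density = [r / len(xs) for r in rates]               # rescale intensity to a density
print(counts)
```

Dividing the fitted rates by \(n\) recovers the ordinary histogram density estimate.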

## Probability over boxes

Consider a box \(B\subset \mathbb{R}^d\). The probability of any one \(X_i\) falling within that box is

\[P(X_i\in B)=E\left(\mathbb{I}\{X_i\in B\}\right)=\int_B dF(x).\]

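We know that the expected number of \(X_i\) to fall within that box is \(n\) times the probability of any one falling in that box, i.e.

\[E\left(\sum_{i\leq n}\mathbb{I}\{X_i\in B\}\right)=n\int_B dF(x)\] and thus, writing \(\Lambda(B):=n\int_B dF(x)\) and approximating the binomial count in a small box by a Poisson distribution with the same mean,

\[P(N(B)=k)\approx\frac{\exp(-\Lambda(B))\Lambda(B)^k}{k!}.\]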
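A quick numerical check of how the Poisson form relates to the i.i.d. setting: with \(n\) independent points the count in \(B\) is Binomial\((n,p)\) with \(p=\int_B dF(x)\), and for a small box this is close to a Poisson distribution with mean \(\Lambda(B)=np\). The values of \(n\) and \(p\) below are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def poisson_pmf(k, mean):
    return math.exp(-mean) * mean**k / math.factorial(k)

n, p_box = 1000, 0.005     # hypothetical: 1000 points, box probability 0.005
mean = n * p_box           # Lambda(B) = n * int_B dF(x)

# Largest pointwise gap between the two probability mass functions.
max_gap = max(
    abs(binomial_pmf(k, n, p_box) - poisson_pmf(k, mean)) for k in range(31)
)
print(max_gap)
```

The gap shrinks as the box probability goes to zero with \(np\) held fixed.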

…Where was I going with this? Something to do with linear point process estimation perhaps? 🏗

## Questions

- Connection to kernel-based point process methods?
- Connection to survival analysis?

## References

Andersen, Per Kragh, Ornulf Borgan, Richard D. Gill, and Niels Keiding. 1997. *Statistical Models Based on Counting Processes*. Corr. 2. print. Springer Series in Statistics. New York, NY: Springer.

Anderson, J. A., and S. C. Richardson. 1979. “Logistic Discrimination and Bias Correction in Maximum Likelihood Estimation.” *Technometrics* 21 (1): 71–78. https://doi.org/10.1080/00401706.1979.10489724.

Barron, Andrew R., and Chyong-Hwa Sheu. 1991. “Approximation of Density Functions by Sequences of Exponential Families.” *The Annals of Statistics* 19 (3): 1347–69. https://doi.org/10.1214/aos/1176348252.

Berman, Mark, and Peter Diggle. 1989. “Estimating Weighted Integrals of the Second-Order Intensity of a Spatial Point Process.” *Journal of the Royal Statistical Society. Series B (Methodological)* 51 (1): 81–92. https://publications.csiro.au/rpr/pub?list=BRO&pid=procite:d5b7ecd7-435c-4dab-9063-f1cf2fbdf4cb.

Brown, Lawrence D., T. Tony Cai, and Harrison H. Zhou. 2010. “Nonparametric Regression in Exponential Families.” *The Annals of Statistics* 38 (4): 2005–46. https://doi.org/10.1214/09-AOS762.

Castellan, G. 2003. “Density Estimation via Exponential Model Selection.” *IEEE Transactions on Information Theory* 49 (8): 2052–60. https://doi.org/10.1109/TIT.2003.814485.

Cox, D. R. 1965. “On the Estimation of the Intensity Function of a Stationary Point Process.” *Journal of the Royal Statistical Society. Series B (Methodological)* 27 (2): 332–37. http://www.jstor.org/stable/2984202.

Cunningham, John P., Krishna V. Shenoy, and Maneesh Sahani. 2008. “Fast Gaussian Process Methods for Point Process Intensity Estimation.” In *Proceedings of the 25th International Conference on Machine Learning*, 192–99. ICML ’08. New York, NY, USA: ACM Press. https://doi.org/10.1145/1390156.1390181.

Daley, Daryl J., and David Vere-Jones. 2003. *An Introduction to the Theory of Point Processes*. 2nd ed. Vol. 1. Elementary theory and methods. New York: Springer. http://ebooks.springerlink.com/UrlApi.aspx?action=summary&v=1&bookid=108085.

———. 2008. *An Introduction to the Theory of Point Processes*. 2nd ed. Vol. 2. General theory and structure. Probability and Its Applications. New York: Springer. http://link.springer.com/chapter/10.1007/978-0-387-49835-5_7.

Efromovich, Sam. 1996. “On Nonparametric Regression for IID Observations in a General Setting.” *The Annals of Statistics* 24 (3): 1126–44. https://doi.org/10.1214/aos/1032526960.

———. 2007. “Conditional Density Estimation in a Regression Setting.” *The Annals of Statistics* 35 (6): 2504–35. https://doi.org/10.1214/009053607000000253.

Eilers, Paul H. C., and Brian D. Marx. 1996. “Flexible Smoothing with B-Splines and Penalties.” *Statistical Science* 11 (2): 89–121. https://doi.org/10.1214/ss/1038425655.

Ellis, Steven P. 1991. “Density Estimation for Point Processes.” *Stochastic Processes and Their Applications* 39 (2): 345–58. https://doi.org/10.1016/0304-4149(91)90087-S.

Giesecke, K., H. Kakavand, and M. Mousavi. 2008. “Simulating Point Processes by Intensity Projection.” In *Simulation Conference, 2008. WSC 2008. Winter*, 560–68. https://doi.org/10.1109/WSC.2008.4736114.

Gu, Chong. 1993. “Smoothing Spline Density Estimation: A Dimensionless Automatic Algorithm.” *Journal of the American Statistical Association* 88 (422): 495–504. https://doi.org/10.1080/01621459.1993.10476300.

Heigold, Georg, Ralf Schlüter, and Hermann Ney. 2007. “On the Equivalence of Gaussian HMM and Gaussian HMM-Like Hidden Conditional Random Fields.” In *Eighth Annual Conference of the International Speech Communication Association*. http://www-i6.informatik.rwth-aachen.de/publications/download/282/Heigold--2007.pdf.

Hinton, G., Li Deng, Dong Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, et al. 2012. “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups.” *IEEE Signal Processing Magazine* 29 (6): 82–97. https://doi.org/10.1109/MSP.2012.2205597.

Kooperberg, Charles, and Charles J. Stone. 1992. “Logspline Density Estimation for Censored Data.” *Journal of Computational and Graphical Statistics* 1 (4): 301–28. https://doi.org/10.2307/1390786.

———. 1991. “A Study of Logspline Density Estimation.” *Computational Statistics & Data Analysis* 12 (3): 327–47. https://doi.org/10.1016/0167-9473(91)90115-I.

Leung, G., and A. R. Barron. 2006. “Information Theory and Mixing Least-Squares Regressions.” *IEEE Transactions on Information Theory* 52 (8): 3396–3410. https://doi.org/10.1109/TIT.2006.878172.

Lieshout, Marie-Colette N. M. van. 2011. “On Estimation of the Intensity Function of a Point Process.” *Methodology and Computing in Applied Probability* 14 (3): 567–78. https://doi.org/10.1007/s11009-011-9244-9.

Møller, Jesper, and Rasmus Plenge Waagepetersen. 2003. *Statistical Inference and Simulation for Spatial Point Processes*. Chapman and Hall/CRC. https://doi.org/10.1201/9780203496930.

Norets, Andriy. 2010. “Approximation of Conditional Densities by Smooth Mixtures of Regressions.” *The Annals of Statistics* 38 (3): 1733–66. https://doi.org/10.1214/09-AOS765.

Panaretos, Victor M., and Yoav Zemel. 2016. “Separation of Amplitude and Phase Variation in Point Processes.” *The Annals of Statistics* 44 (2): 771–812. https://doi.org/10.1214/15-AOS1387.

Papangelou, F. 1974. “The Conditional Intensity of General Point Processes and an Application to Line Processes.” *Zeitschrift Für Wahrscheinlichkeitstheorie Und Verwandte Gebiete* 28 (3): 207–26. https://doi.org/10.1007/BF00533242.

Reynaud-Bouret, Patricia. 2003. “Adaptive Estimation of the Intensity of Inhomogeneous Poisson Processes via Concentration Inequalities.” *Probability Theory and Related Fields* 126 (1). https://doi.org/10.1007/s00440-003-0259-1.

Saul, Lawrence K., and Daniel D. Lee. 2001. “Multiplicative Updates for Classification by Mixture Models.” In *Advances in Neural Information Processing Systems*, 897–904. http://papers.nips.cc/paper/2085-multiplicative-updates-for-classification-by-mixture-models.

Schoenberg, Frederic Paik. 2005. “Consistent Parametric Estimation of the Intensity of a Spatial–Temporal Point Process.” *Journal of Statistical Planning and Inference* 128 (1): 79–93. https://doi.org/10.1016/j.jspi.2003.09.027.

Sha, Fei, and Lawrence K. Saul. 2006a. “Large Margin Hidden Markov Models for Automatic Speech Recognition.” In *Advances in Neural Information Processing Systems*, 1249–56. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_143.pdf.

Sha, Fei, and L. K. Saul. 2006b. “Large Margin Gaussian Mixture Modeling for Phonetic Classification and Recognition.” In *2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings*, 1:I–I. https://doi.org/10.1109/ICASSP.2006.1660008.

Sugiyama, Masashi, Ichiro Takeuchi, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Daisuke Okanohara. 2010. “Conditional Density Estimation via Least-Squares Density Ratio Estimation.” In *International Conference on Artificial Intelligence and Statistics*, 781–88. http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_SugiyamaTSKHO10.pdf.

Tüske, Zoltán, Muhammad Ali Tahir, Ralf Schlüter, and Hermann Ney. 2015. “Integrating Gaussian Mixtures into Deep Neural Networks: Softmax Layer with Hidden Variables.” In *2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 4285–9. IEEE. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7178779.

Willett, R. M., and R. D. Nowak. 2007. “Multiscale Poisson Intensity and Density Estimation.” *IEEE Transactions on Information Theory* 53 (9): 3171–87. https://doi.org/10.1109/TIT.2007.903139.