# Mixture models for density estimation

March 29, 2016 — April 24, 2020

A method of semi-parametric density estimation.

We consider random variables \(\mathsf{x}\) with density make out of a mixture of other component densities.

\[ f(x) \sim \sum_{i = 1}^M w_i f(x;\theta_i). \] Here \(\sum_i w=1\) and \(f(x;\theta_i)\) are the mixture component densities, which are themselves specified by some parameter vector \(\theta_i.\) These components do not need to be in the same family, so let us assume that the \(\theta_i\) can encode some information about which family too. Equivalently, we consider the variables defined by randomly choosing from an array of mutually independent component random variables \(\mathsf{x}_i\sim f(\cdot; \theta_i)\).

\[ \mathsf{x} = \begin{cases} \vdots \\ \mathsf{x}_i: \text{ with prob. } w_i\\ \vdots \\ \end{cases} \] We might imagine there is some latent variable \(\mathsf{w}\sim \mathrm{Multinomial}(w_1,\dots,w_M)\) which we use to choose the component. This seems to be the formalism we use more commonly in model averaging.

There are many connections. Inferring mixture models is conceptually similar to clustering, and indeed many papers note the connection. Clustering via nearest-centroids is more or less mixture modelling where the mixture components all have the same shape. The classic inference method is expectation maximisation, which is of independent interest. A degenerate case, convolution kernel density estimation leads to particularly simple results. Model selection and convergence theory is a gigantic mess which I will not find time to disentangle.

## 1 Moments of a mixture

Before we go on to consider all the knotty convoluted estimation and approximation theory, let us think about the basic probabilistic properties of mixtures.

\[ \begin{aligned} E(\mathsf{x}) = {} & E[E(\mathsf{x}\mid \mathsf{w})] \\ = {} & E\left.\begin{cases} \vdots \\ E[\mathsf{x}_i] & \text{with prob. }w_i \\ \vdots \end{cases}\right\} \\ = {} & \sum_{i=1}^M w_i \mu_i\\ =: {} & \bar{\mu}. \end{aligned}\]

\[ \begin{aligned} \var(\mathsf{x}) = {} & E(\var(\mathsf{x}\mid \mathsf{w})) + \var(E(\mathsf{x} \mid \mathsf{w})) \\ = {} & E\left.\begin{cases} \vdots \\ \var[\mathsf{x}_i] & \text{with prob. }w_i \\ \vdots \end{cases}\right\} \\ & {} + \var\left.\begin{cases} \vdots \\ E[\mathsf{x}_i] & \text{with prob. }w_i \\ \vdots \end{cases} \right\} \\ = {} & \sum_{i=1}^M w_i \var(\mathsf{x}_i) + \sum_{i=1}^M w_i(\mu_i-\bar\mu)(\mu_i-\bar\mu)^T, \end{aligned}\] where \(\mu_i:=E(\mathsf{x}_i)\) and \(\bar{\mu}\) is the weighted mean over all \(\mu_i.\)

Nice. This simplicity is why you would work with these creatures, out of all the crazy ways we have of combining random variables into new random variables (e.g. reparameterization), mixtures are the ones that retain

## 2 Mixture zoo

The following categories are not mutually exclusive; In fact I’m mentioning them all to work out what exactly the intersections are.

(Battey and Liu 2013; van de Geer 1996; Assaf J. Zeevi and Meir 1997) discuss some useful results common to various convex combination estimators.

(DasGupta 2008) ch 33-24 is a high-speed no-filler all-killer summary of various convergence results and mixture types, including a connection to Donoho-Jin “Higher criticism” and nonparametric deconvolution and multiple testing.

### 2.1 “Classic Mixtures”

Finite location-scale mixtures, usually of Gaussians.

Your data are vectors \(x_j\in \mathbb{R}^d\).

Your density looks like this:

\[ f(x_j) = w_0 + \sum_{i=1}^{i=m}w_i\phi((x_j-\mu_j)^T\sigma_i^{-1}(x_j-\mu_j))/|\sigma| \]

Traditionally, \(\phi\) is given, and given as the normal density, but any “nice” elliptic density will do.

Also traditionally, \(m\) is given by magic. Fitting the parameters of this model, then, involves choosing \((\mathbf{w},\boldsymbol{\mu},\mathbf{\sigma}),\, \mathbf{w}\in\mathbb{R}^{m+1}, \boldsymbol{\mu}\in\mathbb{R}^{m\times d}, \boldsymbol{\sigma}\in\mathbb{R}^{d\times d}.\)

Why would you bother? We know that this class is dense in the space of all densities under the total variation metric (Cheney and Light 2009), which is a rationale if not a guarantee of its usefulness. For the Gaussian one can appeal to, e.g. (Battey and Liu 2013; Assaf J. Zeevi and Meir 1997) to argue that we can get “close” to any density in large classes by choosing the Gaussian for big \(m\) and introducing some strategic assumptions. Moreover, we know that it’s computationally tractable.

#### 2.1.1 Radial basis functions

A finite mixtures subtype, from the function approximation literature. When your component density has an assumed form

\[ f(x_j) = w_0 + \sum_{i=1}^{i=m}w_i\phi(h_j\|x_j-\mu_j\|)/h_j \]

Here \(h\) is a scale parameter for the density function. This corresponds to a spherical, rather than elliptic, approximating function, so has fewer parameters per component in higher dimensions than one.

#### 2.1.2 Kernel density estimators

If you have as many mixture components as data points, you have kernel density estimate. This is a semiparametric limiting case of the radial basis function model. To keep the number of parameters manageable you usually assume that the mixture components themselves all share the same scale parameters, although there are some extensions to that around which I might recall if I have need.

### 2.2 Continuous mixtures

Noticing that a classic location mixture is a convolution between a continuous density and an atomic density, the question arises whether you can convolve two more general densities. Yes, you can. Estimate a nonparametric mixing density. Now you have a nonparametric estimation problem, something like, estimate \(dP(\mu,\sigma)\). I should mention oneinteresting sub-case:

#### 2.2.1 Normal variance mixtures

Mixtures of normals where you only vary scale parameters. I clearly don’t know enough about this to write the entry; this is just a note to myself. 🏗 *z-distributions* and *generalized hyperbolic distributions* are the keywords. They include many other distributions as special cases.

### 2.3 Bayesian Dirichlet mixtures

🏗; Something about using a Dirichlet process for the… weights of the mixture components? Giving you a posterior distribution over countably finite mixture parameters? something like that?

### 2.4 Non-affine mixtures

Do the mixture components have to have location/scale parameterisations? Most models mentioned so far have assumed that, but more general models are feasible. However, estimating the scale parameters is already a PITA; what can we say about more general shape parameters? I suspect this is one way you could approach tail cutoffs in extreme value toery for example.

## 3 In Bayesian variational inference

Gaussian mixture models are the approximation densities in “classic” flavoured Bayesian variational inference, as opposed to fancy modern neural variational inference. Here you approximate a posterior with a mixture distribution; The estimation method looks reasonably similar to the frequentist case.

## 4 Estimation/selection methods

### 4.1 (Local) maximum likelihood

A classic method; There are some subtleties since the global maximum can be badly behaved; you have to mess around with local roots of the likelihood equation and thereby lose some of the lovely asymptotics of MLE methods in exponential families.

However, I am not sure which properties you lose. (Geoffrey J. McLachlan and Rathnayake 2014), for example, makes the assertion that the AIC conditions (whatever they are) don’t hold, but the BIC ones do.

### 4.2 Method of moments

Particularly popular in recent times for mixtures. 🏗

### 4.3 Minimum distance

Minimise the distance between the empirical PDF and the estimated PDF in some metric. For reasons I have not yet digested, one is probably best to do this in the Hellinger metric if one wishes convenient convergence, (Beran 1977), although kernel density estimates tend to prefer \(L_2\), as with e.g, regression smoothing problems. How on earth you numerically minimise Hellinger distance from data is something I won’t think about for now, although I admit to being curious.

### 4.4 Regression smoothing formulation

Not quite mixture models; you have a smoothness penalty on the quadratic component, a log-quadratic regression spline, then the results are “nearly” Gaussian mixture models. See (Eilers and Marx 1996).

## 5 Convergence and model selection

I would like a convenient text summarising the convergence results for mixture models under varying assumptions. There is a giant heap of research on convergence under various assumptions and metrics which needs summarising. That is beyond my capacity at the moment.

Dustin Mixon explains his paper and a few others in the field, in The Voronoi Means Conjecture

As I mentioned in a previous blog post, I have a recent paper with Soledad Villar and Rachel Ward that analyzes an SDP relaxation of the k-means problem. It turns out that the solution to the relaxation can be interpreted as a denoised version of the original dataset, sending many of the data points very close to what appear to be k-means-optimal centroids. This is similar in spirit to Dasgupta’s random projection, and we use a similar rounding scheme to estimate the optimal k-means-centroids. Using these centroid estimates as Gaussian center estimates, we are able to prove performance bounds of the form \(\mathrm{MSE}\lesssim k^2\sigma^2\) when \(\mathrm{SNR}>rsim k^2\), meaning the performance doesn’t depend on the dimension, but rather the model order.

But does it make sense that the performance should even depend on the model order?

Also van de Geer (1996), leverages the convex combination structure to get bounds relating KL and Hellinger convergence of density estimators (including kernel density estimators)

(Dasgupta 1999) probably counts, and (Mixon, Villar, and Ward 2016) introduces others, but most of these seem not to address the *approximation* problem. but the *clustering* problem. Clustering doesn’t fit perfectly with my purpose here; I don’t necessarily care about assigning things correctly to clusters; rather, I want to approximate the *overall* density well.

Mixon, Villar, and Ward (2016) has for some interesting connections to matrix concentration inequalities:

The study of theoretical guarantees for learning mixtures of Gaussians started with the work of (Dasgupta 1999). His work presented an algorithm based on random projections and showed this algorithm approximates the centers of Gaussians […] in \(R^m\) separated by […] the biggest singular value among all [covariance matrices]. After this work, several generalizations and improvements appeared. […] To date, techniques used for learning mixtures of Gaussians include expectation maximization (Dasgupta and Schulman 2007), spectral methods (Awasthi and Sheffet 2012; Kumar and Kannan 2010), projections (random and deterministic) (Dasgupta 1999; Moitra and Valiant 2010; Arora and Kannan 2001), and the method of moments (Moitra and Valiant 2010).

Every existing performance guarantee exhibits one of the following forms:

the algorithm correctly clusters all points according to Gaussian mixture component, or

the algorithm well-approximates the center of each Gaussian (a la(Dasgupta 1999).)

Results of type (1), which include (Awasthi and Sheffet 2012; Kumar and Kannan 2010; Achlioptas and Mcsherry 2007), require the minimum separation between the Gaussians centers to have a multiplicative factor of polylog N, where N is the number of points. This stems from a requirement that every point be closer to their Gaussian’s center (in some sense) than the other centers, so that the problem of cluster recovery is well-posed. Note that in the case of spherical Gaussians, the Gaussian components can be truncated to match the stochastic ball model in this regime, where the semidefinite program we present is already known to be tight with high probability (Iguchi et al. 2015; Awasthi et al. 2015). Results of type (2) tend to be specifically tailored to exploit unique properties of the Gaussians, and as such are not easily generalizable to other data models. […] For instance, if \(\mathsf{x} \sim N (\mu, \sigma^2 Im )\), then \(E(\|\mathsf{x} − \mu\|^2) = m\sigma^2\). In high dimensions, since the entries of the Gaussians are independent, concentration of measure implies that most of the points will reside in a thin shell with center μ and radius about \(\sqrt{m\sigma}\). This property allows algorithms to cluster even concentric Gaussians as long as the covariances are sufficiently different. However, algorithms that allow for no separation between the Gaussian centers require a sample complexity which is exponential in k (Moitra and Valiant 2010).

Hmm.

Here’s one lately-popular extension to the finite mixture mode: Choosing the number of mixture components adaptively, using some kind of model selection procedure, as per (Priebe and Marchette 2000) with the “sieve” — (Murata, Yoshizawa, and Amari 1994) uses an information criterion.

There are various optimality results, depending on your assumptions on mixing distributions and component distributions and true distributions and the constellation and so on.

AFAICT all assume either fixed component scales, or require a free regularisation parameter selected by cross validation.

I would like sample complexity results that can allow me to bound number of components before seeing the data. That would have been handy for a research project of mine, but I am unaware of any, and have since discontinued that project. All that remains is this a bestiary of approaches.

### 5.1 Large sample results for mixtures

I would prefer not to rely on asymptotic distributions, but sometimes they help.

(Geoffrey J. McLachlan and Rathnayake 2014) claims the AIC doesn’t work because ‘certain regularity assumptions’ necessary for it are not satisfied (which?) but that one may try the BIC, as per (A. R. Barron and Cover 1991; Andrew R. Barron et al. 2008). Contrariwise, (Massart 2007) gives explicit cases where the AIC does work in penalised density estimation, so… yeah. In any case, it’s certainly true that the derivation and calculation of the GIC is tedious.

(Ghosal and van der Vaart 2001) give MLE risk bounds under various assumptions, which I could maybe use independently.

### 5.2 Finite sample results for mixtures

In a penalised least-squares framework, see (Bertin, Pennec, and Rivoirard 2011). This generally seems much more reasonable to me, and I’m reading up on it. However, I think that basis selection is going to turn out to be nasty.

Massart (Massart 2007) explains some of the complications and gives example applications of both least-squares and ML model selection procedures for general density model selection. However, his results are restricted to mixtures of orthonormal bases, which, realistically, means histograms with bins of disjoint support, if you want your density to be non-negative. Wondering how much work it is to get overlapping support.

### 5.3 Sieve method

Argh! so many variants. What I would like for my mixture sieve

### 5.4 Akaike Information criterion

Use an Akaike-type information criterion

See (Ando, Konishi, and Imoto 2008; Andrew R. Barron et al. 2008; A. Barron, Rissanen, and Yu 1998; Murata, Yoshizawa, and Amari 1994)…

(Konishi and Kitagawa 2008) 6.1. is a compressed summary intro to general regularized basis expansion in a *regression* setting, i.e. we approximate an arbitrary function. Density approximation is more constrained, since know that our mixture must integrate to 1. Also we don’t have a separate error term; rather, we assume our components completely summarise the randomness. Usually, although not always, we further require the components be non-negative functions with non-negative weights to give us specifically a convex combination of functions. Anyway, presumably we can extract a similar result from that?

### 5.5 Quantization and coding theory

The information theoretic cousin. Non-uniform quantisation in communication theory is when you optimally distribute the density of your quantization symbols according to the density of the signal, in order to compress a signal while still reconstructing it as precisely as possible. This connection is most commonly raised in the multidimensional case, when it is “Vector quantization” or VQ to its friends. See, e.g. (Gersho and Gray 2012). This is then a coding theory problem.

From reading the literature it is not immediately apparent how, precisely, vector quantisation is related to mixture density estimation, although there is a family resemblance. In vector quantisation you do something like reduce the signal to a list of Voronoi cells and the coordinates of their centres, then code a signal to the nearest centre; Squinting right makes this look like a mixture problem. (D. D. Lee and Seung 2001, 1999) make this connection precise. Investigate.

Now, how do you choose this optimal code from measurements of the signal? THAT is the statistical question.

See quantization for more.

### 5.6 Minimum description length/BIC

Related to both the previous, in some way I do not yet understand.

Rissanen’s Minimum Description length, as applied to mixture density estimation? Putatively related to the Bayesian Information Criterion, which is purportedly an MDL measure. (?) That is not immediately obvious to me; I though BIC was just a way of maximising evidence marginal likelihood.

Andrew Barron and co-workers seem to own the statistical MDL approach to mixture estimation. See, (Andrew R. Barron 1994, 1991; A. R. Barron and Cover 1991; A. Barron, Rissanen, and Yu 1998; A. R. Barron 1993), with literature reviews in Andrew R. Barron et al. (2008). That last one constructs discretized mixture models as “two stage codes”, and achieves prediction risk bounds for finite samples using them.

## 6 Unsatisfactory thing: scale parameter selection theory

Many really good results take the scale parameter as given.

This is fine, as far as it goes, but the scale parameter selection is the reason why I might bother with this class of model, typically, otherwise this is simply a weird convex deconvolution problem, which is not so interesting. In particular, how to handle the scale parameter selection in model selection?

(Belkin, Rademacher, and Voss 2016; Shimazaki and Shinomoto 2010) make a start.

## 7 Connection to Mercer kernel methods

Connection with kernel PCA (Schölkopf et al. 1998) and metric multidimensional scaling (Williams 2001) and such are explored in kernel approximation.

## 8 Incoming

## 9 References

*Learning Theory*. Lecture Notes in Computer Science.

*J. ACM*.

*Journal of Machine Learning Research*.

*Journal of Statistical Planning and Inference*, Special Issue in Honor of Junjiro Ogawa (1915 - 2000): Design of Experiments, Multivariate Analysis and Statistical Inference,.

*Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing*. STOC ’01.

*Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science*. ITCS ’15.

*arXiv Preprint arXiv:1206.3204*.

*arXiv:1412.8690 [Cs, Math, Stat]*.

*Nonparametric Functional Estimation and Related Topics*. NATO ASI Series 335.

*IEEE Transactions on Information Theory*.

*Machine Learning*.

*IEEE Transactions on Information Theory*.

*Information Theory Workshop, 2008. ITW’08. IEEE*.

*IEEE Transactions on Information Theory*.

*Journal of Multivariate Analysis*.

*arXiv:1308.3968 [Stat]*.

*Journal of Multivariate Analysis*.

*Journal of Machine Learning Research*.

*Advances in Neural Information Processing Systems*.

*The Annals of Statistics*.

*Annales de l’Institut Henri Poincaré, Probabilités Et Statistiques*.

*ESAIM: Probability and Statistics*.

*Neural Computation*.

*Microsoft Research*.

*arXiv:1609.06764 [Stat]*.

*The Annals of Statistics*.

*A Course in Approximation Theory*.

*The Annals of Mathematical Statistics*.

*Journal of the American Statistical Association*.

*arXiv:1205.4891 [Cs]*.

*Foundations of Computer Science, 1999. 40th Annual Symposium on*.

*Asymptotic Theory of Statistics and Probability*. Springer Texts in Statistics.

*Journal of Machine Learning Research*.

*arXiv:1612.05053 [Stat]*.

*The Annals of Statistics*.

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)*.

*Statistical Science*.

*The Annals of Statistics*.

*The Annals of Statistics*.

*Annals of Statistics*.

*Vector Quantization and Signal Compression*.

*The Annals of Statistics*.

*IEEE ASSP Magazine*.

*Scandinavian Actuarial Journal*.

*The Annals of Statistics*.

*Eighth Annual Conference of the International Speech Communication Association*.

*IEEE Signal Processing Magazine*.

*Proceedings of the 4th Conference on Innovations in Theoretical Computer Science*. ITCS ’13.

*Institute of Mathematical Statistics Lecture Notes - Monograph Series*.

*arXiv:1509.07983 [Cs, Math, Stat]*.

*Advances in Water Resources*.

*2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*.

*Information Criteria and Statistical Modeling*. Springer Series in Statistics.

*2010 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS)*.

*Generalized linear models with random effects*. Monographs on statistics and applied probability 106.

*Computational Statistics & Data Analysis*.

*Nature*.

*Advances in Neural Information Processing Systems 13*.

*IEEE Transactions on Information Theory*.

*Advances in Neural Information Processing Systems 12*.

*Journal of Mathematical Imaging and Vision*.

*Proceedings of the National Conference on Artificial Intelligence*.

*Concentration Inequalities and Model Selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII - 2003*. Lecture Notes in Mathematics 1896.

*Biometrics*.

*Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery*.

*arXiv:1602.06612 [Cs, Math, Stat]*.

*2010 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS)*.

*IEEE Transactions on Neural Networks*.

*The Annals of Statistics*.

*SIAM Journal on Optimization*.

*Journal of the American Statistical Association*.

*Computational Statistics & Data Analysis*.

*Signal Processing*.

*Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016*.

*arXiv:1403.4224 [Cs]*.

*Advances in Neural Information Processing Systems*.

*ESAIM: Probability and Statistics*.

*SIAM Review*.

*IEEE Transactions on Information Theory*.

*Journal of the American Statistical Association*.

*Neural Computation*.

*Mustererkennung 1998*. Informatik Aktuell.

*Artificial Neural Networks — ICANN’97*. Lecture Notes in Computer Science.

*Advances in Neural Information Processing Systems*.

*Journal of Computational Neuroscience*.

*Machine Learning and Knowledge Discovery in Databases*.

*Advances in Neural Information Processing Systems*.

*Advances in Neural Information Processing Systems 13*.

*Statistics and Computing*.

*arXiv:1612.03450 [Cs, Math, Stat]*.

*Journal of Nonparametric Statistics*.

*ESAIM: Probability and Statistics*.

*Computational Statistics & Data Analysis*, Recent Developments in Mixture Model,.

*Weak Convergence and Empirical Processes: With Applications to Statistics*.

*Journal of Computer and System Sciences*.

*IEEE Transactions on Information Theory*.

*Advances in Neural Information Processing Systems 13*.

*The Annals of Statistics*.

*Journal of Complexity*.

*Neural Computation*.

*IEEE Transactions on Signal Processing*.

*Neural Networks: The Official Journal of the International Neural Network Society*.

*IEEE Transactions on Information Theory*.