# Mixture models for density estimation

A method of semi-parametric density estimation.

This is also conceptually close to clustering, and indeed there are lots of papers noting the connection. Clustering via nearest-centroids is more or less misture modelling where the mixture components all have the same shape.

We start from “classic” flavoured Gaussian Mixture Models, pausing for a rest stop at expectation maximisation, with a moment to muse how we can unify this with kernel density estimation, terminating at adaptive mixture sieves, wondering momentarily if tensors decompositions have anything to add. But we are not done, because there is a knotty model selection problem at the next junction.

## Connections

Connection with kernel PCA (Schölkopf et al. 1998) and metric multidimensional scaling (Williams 2001) and such are explored in kernel approximation.

## Mixture zoo

The following categories are not mutually exclusive; In fact I’m mentioning them all to work out what exactly the intersections are.

(Battey and Liu 2013; van de Geer 1996; Zeevi and Meir 1997) discuss some useful results common to various convex combination estimators.

(DasGupta 2008) ch 33 is a high-speed no-filler all-killer summary of various convergence results and mixture types, including a connection to Donoho-Jin “Higher criticism” and nonparametric deconvolution and multiple testing (ch34).

### “Classic Mixtures”

Finite location-scale mixtures.

Your data are vectors $$x_j\in \mathbb{R}^d$$.

$f(x_j) = w_0 + \sum_{i=1}^{i=m}w_i\phi((x_j-\mu_j)^T\sigma_i^{-1}(x_j-\mu_j))/|\sigma|$

Traditionally, $$\phi$$ is given, and given as the normal density, but any “nice” unimodal density will do, and we can appeal to, e.g. (Battey and Liu 2013; Zeevi and Meir 1997) to argue that we can get “close” to any density in large classes by choosing the Gaussian for big $$m$$.

Also traditionally, $$m$$ is given by magic.

Fitting the parameters of this model, then, involves choosing $$(\mathbf{w},\boldsymbol{\mu},\mathbf{\sigma}),\, \mathbf{w}\in\mathbb{R}^{m+1}, \boldsymbol{\mu}\in\mathbb{R}^{m\times d}, \boldsymbol{\sigma}\in\mathbb{R}^{d\times d}$$

Why would you bother? We know that this class is dense in the space of all densities under the total variation metric (Cheney and Light 2009), which is a rationale if not a guarantee of its usefulness. Moreover, we know that it’s computationally tractable.

• Dustin Mixon does a beautiful explanation of his paper and a few others in the field, in The Voronoi Means Conjecture

As I mentioned in a previous blog post, I have a recent paper with Soledad Villar and Rachel Ward that analyzes an SDP relaxation of the k-means problem. It turns out that the solution to the relaxation can be interpreted as a denoised version of the original dataset, sending many of the data points very close to what appear to be k-means-optimal centroids. This is similar in spirit to Dasgupta’s random projection, and we use a similar rounding scheme to estimate the optimal k-means-centroids. Using these centroid estimates as Gaussian center estimates, we are able to prove performance bounds of the form $$\mathrm{MSE}\lesssim k^2\sigma^2$$ when $$\mathrm{SNR}\gtrsim k^2$$, meaning the performance doesn’t depend on the dimension, but rather the model order.

But does it make sense that the performance should even depend on the model order?

• Also reading: (van de Geer 1996), which leverages the convex combination issue to get bounds relating KL and Hellinger convergence of density estimators (including kernel density estimators)

Finite mixtures by another name, from the function approximation literature. When your component density has an assumed form

$f(x_j) = w_0 + \sum_{i=1}^{i=m}w_i\phi(h_j\|x_j-\mu_j\|)/h_j$

Here $$h$$ is a scale parameter for the density function. This corresponds to a spherical approximating function, rather than estimating a full multidimensional bandwidth.

#### Kernel density estimators

If you have as many mixture components as data points, you have kernel density estimate. This is clearly also a finite mixture model, just a limiting case. To keep the number of parameters manageable you usually assume that the mixture components themselves all share the same scale parameters, although

### Normal Variance mixtures

Mixtures of normals where you only vary scale parameters. I clearly don’t know enough about this to write the entry; this is just a note to myself. 🏗 z-distributions and generalized hyperbolic distributions are the keywords. Various interesting properties about infinitely divisible distributions, and they include many other distributions as special cases.

### Continuous mixtures

Noticing that a classic location mixture is a convolution between a continuous density and an atomic density, the question arises whether you can convolve two more general densities. Yes, you can. Estimate a nonparametric mixing density. Now you have a nonparametric estimation problem, something like, estimate $$dP(\mu,\sigma)$$.

### Bayesian Dirichlet mixtures

🏗; Something about using a Dirichlet process for the… weights of the mixture components? Giving you a posterior distribution over countably finite mixture parameters? something like that?

### Non-affine mixtures

Do the mixture components have to have location/scale parameterisations? Most things so far have assumed that Not necessarily.

For, e.g. a tail estimate where we want to discern properties of heavy tails, this might not do what we want. Estimating the scale parameters is already a PITA; what can we say about more general shape parameters? This is probably not a theoretical issue, in that the asymptotic behaviour of the M-estimators of a Beta-mixture don’t change. Practically, however, the specific optimisation problem might get very hard. Mind you, it’s notoriously not that easy even with location-scale parameters/ Can we actually find our global maximum?

### Convergence guarantees

Surely someone has done this, since it looks like an obvious idea at the intersection of matrix concentration inequalities. Maybe it got filed in the clustering literature.

(Dasgupta 1999) probably counts, and (Mixon, Villar, and Ward 2016) introduces others, but most of these seem not to address the approximation problem. but the clustering problem. Clustering doesn’t fit perfectly with our purpose here; we don’t necessarily care about assigning things correctly to clusters; rather, we want to approximate the overall density well.

See (Mixon, Villar, and Ward 2016) for some interesting connections:

The study of theoretical guarantees for learning mixtures of Gaussians started with the work of (Dasgupta 1999). His work presented an algorithm based on random projections and showed this algorithm approximates the centers of Gaussians in $$R^m$$ separated by the biggest singular value among all [covariance matrices]. After this work, several generalizations and improvements appeared. To date, techniques used for learning mixtures of Gaussians include expectation maximization (Dasgupta and Schulman 2007), spectral methods (Awasthi and Sheffet 2012; Kumar and Kannan 2010), projections (random and deterministic) (Dasgupta 1999; Moitra and Valiant 2010; Arora and Kannan 2001), and the method of moments (Moitra and Valiant 2010).

Every existing performance guarantee exhibits one of the following forms:

• the algorithm correctly clusters all points according to Gaussian mixture component, or

• the algorithm well-approximates the center of each Gaussian (a la(Dasgupta 1999).)

Results of type (1), which include (Awasthi and Sheffet 2012; Kumar and Kannan 2010; Achlioptas and Mcsherry 2007), require the minimum separation between the Gaussians centers to have a multiplicative factor of polylog N, where N is the number of points. This stems from a requirement that every point be closer to their Gaussian’s center (in some sense) than the other centers, so that the problem of cluster recovery is well-posed. Note that in the case of spherical Gaussians, the Gaussian components can be truncated to match the stochastic ball model in this regime, where the semidefinite program we present is already known to be tight with high probability (Iguchi et al. 2015; Awasthi et al. 2015). Results of type (2) tend to be specifically tailored to exploit unique properties of the Gaussians, and as such are not easily generalizable to other data models. For instance, if $$x \sim N (\mu, \sigma^2 Im )$$, then $$E(\|x − \mu\|^2) = m\sigma^2$$. In high dimensions, since the entries of the Gaussians are independent, concentration of measure implies that most of the points will reside in a thin shell with center μ and radius about $$\sqrt{m\sigma}$$. This property allows algorithms to cluster even concentric Gaussians as long as the covariances are sufficiently different. However, algorithms that allow for no separation between the Gaussian centers require a sample complexity which is exponential in k (Moitra and Valiant 2010).

Hmm.

## Estimation/selection methods

### (Local) maximum likelihood

A classic method; There are some subtleties here since the global maximum can be badly behaved; you have to mess around with local roots of the likelihood equation and thereby lose some of the the lovely asymptotics of MLE methods in exponential families.

However, I am not sure which properties you lose. (McLachlan and Rathnayake 2014), for example, makes the assertion that the AIC conditions (whatever they are) don’t hold, but the BIC ones do hold.

### Method of moments

Particularly popular in recent times for mixtures. 🏗

### Minimum distance

Minimise the distance between the empirical PDF and the estimated PDF in some metric. For reasons I have not yet digested, one is probably best to do this in the Hellinger metric if one wishes convenient convergence, (Beran 1977), although kernel density estimates tend to prefer $$L_2$$, as with e.g, regression smoothing problems. How on earth you numerically minimise Hellinger distance from data is something I won’t think about for now, although I admit to being curious.

### Regression smoothing formulation

Not quite mixture models; you have a smoothness penalty on the quadratic component, a log-quadratic regression spline, then the results are “nearly” Gaussian mixture models. See (Eilers and Marx 1996).

## Model selection

Here’s one lately-popular extension to the finite mixture mode: Choosing the number of mixture components adaptively, using some kind of model selection procedure, as per (Priebe and Marchette 2000) with the “sieve” – (Murata, Yoshizawa, and Amari 1994) uses an information criterion.

There are various optimality results, depending on your assumptions on mixing distributions and component distributions and true distributions and the constellation and so on.

AFAICT all assume either fixed component scales, or require a free regularisation parameter selected by cross validation.

I would like sample complexity results that can allow me to bound number of components before seeing the data. That would have been handy for a research project of mine, but I am unaware of any, and have since discontinued that project. All that remains is this a bestiary of approaches.

### Large sample results for mixtures

I would prefer not to rely on asymptotic distributions, but sometimes they help.

(McLachlan and Rathnayake 2014) claims the AIC doesn’t work because ‘certain regularity assumptions’ necessary for it are not satisfied (which?) but that one may try the BIC, as per (A. R. Barron and Cover 1991; Barron et al. 2008). Contrariwise, (Massart 2007) gives explicit cases where the AIC does work in penalised density estimation, so… yeah. In any case, it’s certainly true that the derivation and calculation of the GIC is tedious.

(Ghosal and van der Vaart 2001) give MLE risk bounds under various assumptions, which I could maybe use independently.

### Finite sample results for mixtures

In a penalised least-squares framework, see (Bertin, Pennec, and Rivoirard 2011). This generally seems much more reasonable to me, and I’m reading up on it. However, I think that basis selection is going to turn out to be nasty.

Massart (Massart 2007) explains some of the complications here and gives example applications of both least-squares and ML model selection procedures for general density model selection. However, his results are restricted to mixtures of orthonormal bases, which, realistically, means histograms with bins of disjoint support, if you want your density to be non-negative. Wondering how much work it is to get overlapping support.

### Sieve method

Argh! so many variants. What I would like for my mixture sieve

(Priebe 1994; Genovese and Wasserman 2000)), Battey

### Akaike Information criterion

See (Ando, Konishi, and Imoto 2008; Barron et al. 2008; Barron, Rissanen, and Yu 1998; Murata, Yoshizawa, and Amari 1994)

(Konishi and Kitagawa 2008) 6.1. is a compressed summary intro to general regularized basis expansion in a regression setting, i.e. we approximate an arbitrary function. Density approximation is more constrained, since know that our mixture must integrate to 1. Also we don’t have a separate error term; rather, we assume our components completely summarise the randomness. Usually, although not always, we further require the components be non-negative functions with non-negative weights to give us specifically a convex combination of functions. Anyway, presumably we can extract a similar result from that?

### Quantization and coding theory

The information theoretic cousin. Non-uniform quantisation in communication theory is when you optimally dsitribute the density of your quantization symbols according to the density of the signal, in order to compress a signal while still reconstructing it as precisely as possible. This connection is most commonly raised in the multidimensional case, when it is “Vector quantization” or VQ to its friends. See, e.g. (Gersho and Gray 2012). This is then a coding theory problem.

From reading the literature it is not immediately apparent how, precisely, vector quantisation is related to mixture density estimation, although there is a family resemblance. In vector quantisation you do something like reduce the signal to a list of Voronoi cells and the coordinates of their centres, then code a signal to the nearest centre; Squinting right makes this look like a mixture problem. (Lee and Seung 2001, 1999) make this connection precise. Investigate.

Now, how do you choose this optimal code from measurements of the signal? THAT is the statistical question.

### Minimum description length/BIC

Related to both the previous previous, in some way I do not yet understand.

Rissanen’s Minimum Description length, as applied to mixture density estimation? Putatively related to the Bayesian Information Criterion, which is purportedly an MDL measure. (?) That is not immediately obvious to me; I though BIC was just a way of maximising evidence marginal likelihood.

Andrew Barron and co-workers seem to own the statistical MDL approach to mixture estimation. See, (Barron 1994, 1993; Andrew R. Barron 1991; A. R. Barron and Cover 1991; Barron, Rissanen, and Yu 1998), with literature reviews in Barron et al. (2008). That last one constructs discretized mixture models as “two stage codes”, and achieves prediction risk bounds for finite samples using them.

## Unsatisfactory thing: scale parameter selection theory

All the really good results take the scale parameter as given.

This is fine, as far as it goes, but the scale parameter selection is the reason why we are bothering with this class of model, typically, otherwise this is simply a weird convex deconvolution problem, which is not so interesting. In particular, how to handle the scale parameter selection in model selection?

(Belkin, Rademacher, and Voss 2016; Shimazaki and Shinomoto 2010) make a start.

## Miscellaney

Achlioptas, Dimitris, and Frank Mcsherry. 2007. “Fast Computation of Low-Rank Matrix Approximations.” J. ACM 54 (2). https://doi.org/10.1145/1219092.1219097.

Achlioptas, Dimitris, and Frank McSherry. 2005. “On Spectral Learning of Mixtures of Distributions.” In Learning Theory, edited by Peter Auer and Ron Meir, 458–69. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/11503415_31.

Anandkumar, Animashree, Daniel Hsu, and Sham M. Kakade. 2012. “A Method of Moments for Mixture Models and Hidden Markov Models.” In Journal of Machine Learning Research. http://www.jmlr.org/proceedings/papers/v23/anandkumar12/anandkumar12.pdf.

Ando, Tomohiro, Sadanori Konishi, and Seiya Imoto. 2008. “Nonlinear Regression Modeling via Regularized Radial Basis Function Networks.” Journal of Statistical Planning and Inference, Special Issue in Honor of Junjiro Ogawa (1915 - 2000): Design of Experiments, Multivariate Analysis and Statistical Inference, 138 (11): 3616–33. https://doi.org/10.1016/j.jspi.2005.07.014.

Arora, Sanjeev, and Ravi Kannan. 2001. “Learning Mixtures of Arbitrary Gaussians.” In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, 247–57. STOC ’01. New York, NY, USA: ACM. https://doi.org/10.1145/380752.380808.

Awasthi, Pranjal, Afonso S. Bandeira, Moses Charikar, Ravishankar Krishnaswamy, Soledad Villar, and Rachel Ward. 2015. “Relax, No Need to Round: Integrality of Clustering Formulations.” In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, 191–200. ITCS ’15. New York, NY, USA: ACM. https://doi.org/10.1145/2688073.2688116.

Awasthi, Pranjal, and Or Sheffet. 2012. “Improved Spectral-Norm Bounds for Clustering.” arXiv Preprint arXiv:1206.3204. http://www.cs.rutgers.edu/~pa336/static/pub/clustering-spectral.pdf.

Bach, Francis. 2014. “Breaking the Curse of Dimensionality with Convex Neural Networks,” December. http://arxiv.org/abs/1412.8690.

Barron, Andrew R. 1991. “Complexity Regularization with Application to Artificial Neural Networks.” In Nonparametric Functional Estimation and Related Topics, edited by George Roussas, 561–76. NATO ASI Series 335. Springer Netherlands. https://doi.org/10.1007/978-94-011-3222-0_42.

———. 1994. “Approximation and Estimation Bounds for Artificial Neural Networks.” Machine Learning 14 (1): 115–33. https://doi.org/10.1023/A:1022650905902.

Barron, Andrew R., Cong Huang, Jonathan Q. Li, and Xi Luo. 2008. “MDL, Penalized Likelihood, and Statistical Risk.” In Information Theory Workshop, 2008. ITW’08. IEEE, 247–57. IEEE. https://doi.org/10.1109/ITW.2008.4578660.

Barron, A.R. 1993. “Universal Approximation Bounds for Superpositions of a Sigmoidal Function.” IEEE Transactions on Information Theory 39 (3): 930–45. https://doi.org/10.1109/18.256500.

Barron, A. R., and T. M. Cover. 1991. “Minimum Complexity Density Estimation.” IEEE Transactions on Information Theory 37 (4): 1034–54. https://doi.org/10.1109/18.86996.

Barron, A., J. Rissanen, and Bin Yu. 1998. “The Minimum Description Length Principle in Coding and Modeling.” IEEE Transactions on Information Theory 44 (6): 2743–60. https://doi.org/10.1109/18.720554.

Battey, Heather, and Oliver Linton. 2014. “Nonparametric Estimation of Multivariate Elliptic Densities via Finite Mixture Sieves.” Journal of Multivariate Analysis 123 (January): 43–67. https://doi.org/10.1016/j.jmva.2013.08.013.

Battey, Heather, and Han Liu. 2013. “Smooth Projected Density Estimation,” August. http://arxiv.org/abs/1308.3968.

———. 2016. “Nonparametrically Filtered Parametric Density Estimation.” http://www.princeton.edu/~hbattey/SPE20160423.pdf.

Battey, Heather, and Alessio Sancetta. 2013. “Conditional Estimation for Dependent Functional Data.” Journal of Multivariate Analysis 120 (September): 1–17. https://doi.org/10.1016/j.jmva.2013.04.009.

Belkin, Mikhail, Luis Rademacher, and James Voss. 2016. “Basis Learning as an Algorithmic Primitive.” In Journal of Machine Learning Research, 446–87. http://www.jmlr.org/proceedings/papers/v49/belkin16.html.

Bengio, Yoshua, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte. 2005. “Convex Neural Networks.” In Advances in Neural Information Processing Systems, 123–30. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2005_583.pdf.

Beran, Rudolf. 1977. “Minimum Hellinger Distance Estimates for Parametric Models.” The Annals of Statistics 5 (3): 445–63. https://doi.org/10.1214/aos/1176343842.

Bertin, K., E. Le Pennec, and V. Rivoirard. 2011. “Adaptive Dantzig Density Estimation.” Annales de L’Institut Henri Poincaré, Probabilités et Statistiques 47 (1): 43–74. https://doi.org/10.1214/09-AIHP351.

Birgé, Lucien, and Yves Rozenholc. 2006. “How Many Bins Should Be Put in a Regular Histogram.” ESAIM: Probability and Statistics 10 (February): 24–45. https://doi.org/10.1051/ps:2006001.

Bishop, Chris. 1991. “Improving the Generalization Properties of Radial Basis Function Neural Networks.” Neural Computation 3 (4): 579–88. https://doi.org/10.1162/neco.1991.3.4.579.

Bishop, Christopher. 1994. “Mixture Density Networks.” Microsoft Research, January. https://www.microsoft.com/en-us/research/publication/mixture-density-networks/.

Boyd, Nicholas, Trevor Hastie, Stephen Boyd, Benjamin Recht, and Michael Jordan. 2016. “Saturating Splines and Feature Selection,” September. http://arxiv.org/abs/1609.06764.

Chen, Jiahua. 1995. “Optimal Rate of Convergence for Finite Mixture Models.” The Annals of Statistics 23 (1): 221–33. https://doi.org/10.1214/aos/1176324464.

Cheney, Elliott Ward, and William Allan Light. 2009. A Course in Approximation Theory. American Mathematical Soc. https://books.google.com.au/books?hl=en&lr=&id=II6DAwAAQBAJ&oi=fnd&pg=PA1&ots=ch9-LyxDg6&sig=jetWpIErExYvlnnSsup-5yHhso0.

Christian Bauckhage. 2015. “Lecture Notes on Data Science: Soft K-Means Clustering.” https://doi.org/10.13140/RG.2.1.3582.6643.

Cohen, A. C. 1950. “Estimating the Mean and Variance of Normal Populations from Singly Truncated and Doubly Truncated Samples.” The Annals of Mathematical Statistics 21 (4): 557–69. https://doi.org/10.1214/aoms/1177729751.

Cohen, A. C., Jr. 1949. “On Estimating the Mean and Standard Deviation of Truncated Normal Distributions.” Journal of the American Statistical Association 44 (248): 518–25. https://doi.org/10.1080/01621459.1949.10483324.

Daniely, Amit, Nati Linial, and Michael Saks. 2012. “Clustering Is Difficult Only When It Does Not Matter,” May. http://arxiv.org/abs/1205.4891.

DasGupta, Anirban. 2008. Asymptotic Theory of Statistics and Probability. Springer Texts in Statistics. New York: Springer New York. http://link.springer.com/10.1007/978-0-387-75971-5.

Dasgupta, Sanjoy. 1999. “Learning Mixtures of Gaussians.” In Foundations of Computer Science, 1999. 40th Annual Symposium on, 634–44. IEEE. https://doi.org/10.1109/SFFCS.1999.814639.

Dasgupta, Sanjoy, and Leonard Schulman. 2007. “A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians.” Journal of Machine Learning Research 8 (Feb): 203–26. http://www.jmlr.org/papers/v8/dasgupta07a.html.

Dehaene, Guillaume P. 2016. “Expectation Propagation Performs a Smoothed Gradient Descent,” December. http://arxiv.org/abs/1612.05053.

Donoho, David L., Iain M. Johnstone, Gerard Kerkyacharian, and Dominique Picard. 1996. “Density Estimation by Wavelet Thresholding.” The Annals of Statistics 24 (2): 508–39. https://doi.org/10.1214/aos/1032894451.

Dunson, David B., Natesh Pillai, and Ju-Hyun Park. 2007. “Bayesian Density Regression.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69 (2): 163–83. https://doi.org/10.1111/j.1467-9868.2007.00582.x.

Eilers, Paul H. C., and Brian D. Marx. 1996. “Flexible Smoothing with B-Splines and Penalties.” Statistical Science 11 (2): 89–121. https://doi.org/10.1214/ss/1038425655.

Fan, Jianqing. 1991. “On the Optimal Rates of Convergence for Nonparametric Deconvolution Problems.” The Annals of Statistics 19 (3): 1257–72. https://doi.org/10.1214/aos/1176348248.

Geer, Sara van de. 1996. “Rates of Convergence for the Maximum Likelihood Estimator in Mixture Models.” Journal of Nonparametric Statistics 6 (4): 293–310. https://doi.org/10.1080/10485259608832677.

———. 1997. “Asymptotic Normality in Mixture Models.” ESAIM: Probability and Statistics 1: 17–33. http://archive.numdam.org/article/PS_1997__1__17_0.pdf.

———. 2003. “Asymptotic Theory for Maximum Likelihood in Nonparametric Mixture Models.” Computational Statistics & Data Analysis, Recent Developments in Mixture Model, 41 (3–4): 453–64. https://doi.org/10.1016/S0167-9473(02)00188-3.

Geman, Stuart, and Chii-Ruey Hwang. 1982. “Nonparametric Maximum Likelihood Estimation by the Method of Sieves.” The Annals of Statistics 10 (2): 401–14. https://doi.org/10.1214/aos/1176345782.

Genovese, Christopher R., and Larry Wasserman. 2000. “Rates of Convergence for the Gaussian Mixture Sieve.” Annals of Statistics, 1105–27.

Gersho, Allen, and Robert M. Gray. 2012. Vector Quantization and Signal Compression. Springer Science & Business Media.

Ghosal, Subhashis, and Aad W. van der Vaart. 2001. “Entropies and Rates of Convergence for Maximum Likelihood and Bayes Estimation for Mixtures of Normal Densities.” The Annals of Statistics 29 (5): 1233–63. https://doi.org/10.1214/aos/1013203452.

Gray, R. 1984. “Vector Quantization.” IEEE ASSP Magazine 1 (2): 4–29. https://doi.org/10.1109/MASSP.1984.1162229.

Hald, A. 1949. “Maximum Likelihood Estimation of the Parameters of a Normal Distribution Which Is Truncated at a Known Point.” Scandinavian Actuarial Journal 1949 (1): 119–34. https://doi.org/10.1080/03461238.1949.10419767.

Hall, Peter. 1987. “On Kullback-Leibler Loss and Density Estimation.” The Annals of Statistics 15 (4): 1491–1519. https://doi.org/10.1214/aos/1176350606.

Heigold, Georg, Ralf Schlüter, and Hermann Ney. 2007. “On the Equivalence of Gaussian HMM and Gaussian HMM-Like Hidden Conditional Random Fields.” In Eighth Annual Conference of the International Speech Communication Association. http://www-i6.informatik.rwth-aachen.de/publications/download/282/Heigold--2007.pdf.

Hinton, G., Li Deng, Dong Yu, G.E. Dahl, A. Mohamed, N. Jaitly, A. Senior, et al. 2012. “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups.” IEEE Signal Processing Magazine 29 (6): 82–97. https://doi.org/10.1109/MSP.2012.2205597.

Hsu, Daniel, and Sham M. Kakade. 2013. “Learning Mixtures of Spherical Gaussians: Moment Methods and Spectral Decompositions.” In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, 11–20. ITCS ’13. New York, NY, USA: ACM. https://doi.org/10.1145/2422436.2422439.

Huang, Cong, G. L. H. Cheang, and Andrew R. Barron. 2008. “Risk of Penalized Least Squares, Greedy Selection and L1 Penalization for Flexible Function Libraries.” http://www.stat.yale.edu/~arb4/publications_files/RiskGreedySelectionAndL1penalization.pdf.

Ibragimov, I. 2001. “Estimation of Analytic Functions.” In Institute of Mathematical Statistics Lecture Notes - Monograph Series, 359–83. Beachwood, OH: Institute of Mathematical Statistics. http://projecteuclid.org/euclid.lnms/1215090078.

Iguchi, Takayuki, Dustin G. Mixon, Jesse Peterson, and Soledad Villar. 2015. “Probably Certifiably Correct K-Means Clustering,” September. http://arxiv.org/abs/1509.07983.

Jawitz, James W. 2004. “Moments of Truncated Continuous Univariate Distributions.” Advances in Water Resources 27 (3): 269–81. https://doi.org/10.1016/j.advwatres.2003.12.002.

Keriven, Nicolas, Anthony Bourrier, Rémi Gribonval, and Patrick Pérez. 2016. “Sketching for Large-Scale Learning of Mixture Models.” In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6190–4. https://doi.org/10.1109/ICASSP.2016.7472867.

Konishi, Sadanori, and G. Kitagawa. 2008. Information Criteria and Statistical Modeling. Springer Series in Statistics. New York: Springer.

Kumar, A., and R. Kannan. 2010. “Clustering with Spectral Norm and the K-Means Algorithm.” In 2010 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 299–308. https://doi.org/10.1109/FOCS.2010.35.

Lee, Daniel D., and H. Sebastian Seung. 1999. “Learning the Parts of Objects by Non-Negative Matrix Factorization.” Nature 401 (6755): 788–91. https://doi.org/10.1038/44565.

———. 2001. “Algorithms for Non-Negative Matrix Factorization.” In Advances in Neural Information Processing Systems 13, edited by T. K. Leen, T. G. Dietterich, and V. Tresp, 556–62. MIT Press. http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf.

Lee, Gyemin, and Clayton Scott. 2012. “EM Algorithms for Multivariate Gaussian Mixture Models with Truncated and Censored Data.” Computational Statistics & Data Analysis 56 (9): 2816–29. https://doi.org/10.1016/j.csda.2012.03.003.

Lee, Youngjo., John A. Nelder, and Yudi Pawitan. 2006. Generalized Linear Models with Random Effects. Monographs on Statistics and Applied Probability 106. Boca Raton, FL: Chapman & Hall/CRC.

Leung, G., and A.R. Barron. 2006. “Information Theory and Mixing Least-Squares Regressions.” IEEE Transactions on Information Theory 52 (8): 3396–3410. https://doi.org/10.1109/TIT.2006.878172.

Li, Dawei, Lihong Xu, and Erik Goodman. 2012. “On-Line EM Variants for Multivariate Normal Mixture Model in Background Learning and Moving Foreground Detection.” Journal of Mathematical Imaging and Vision 48 (1): 114–33. https://doi.org/10.1007/s10851-012-0403-6.

Li, Haifeng, Keshu Zhang, and Tao Jiang. 2005. “The Regularized EM Algorithm.” In Proceedings of the National Conference on Artificial Intelligence, 20:807. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999. http://www.aaai.org/Papers/AAAI/2005/AAAI05-127.pdf.

Li, Jonathan Q., and Andrew R. Barron. 2000. “Mixture Density Estimation.” In Advances in Neural Information Processing Systems 12, edited by S. A. Solla, T. K. Leen, and K. Müller, 279–85. MIT Press. http://papers.nips.cc/paper/1673-mixture-density-estimation.pdf.

Massart, Pascal. 2007. Concentration Inequalities and Model Selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII - 2003. Lecture Notes in Mathematics 1896. Berlin ; New York: Springer-Verlag. http://www.cmap.polytechnique.fr/~merlet/articles/probas_massart_stf03.pdf.

McLachlan, Geoffrey J., and Suren Rathnayake. 2014. “On the Number of Components in a Gaussian Mixture Model.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4 (5): 341–55. https://doi.org/10.1002/widm.1135.

McLachlan, G. J., and P. N. Jones. 1988. “Fitting Mixture Models to Grouped and Truncated Data via the EM Algorithm.” Biometrics 44 (2): 571–78. https://doi.org/10.2307/2531869.

Mixon, Dustin G., Soledad Villar, and Rachel Ward. 2016. “Clustering Subgaussian Mixtures by Semidefinite Programming,” February. http://arxiv.org/abs/1602.06612.

Moitra, A., and G. Valiant. 2010. “Settling the Polynomial Learnability of Mixtures of Gaussians.” In 2010 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 93–102. https://doi.org/10.1109/FOCS.2010.15.

Murata, N., S. Yoshizawa, and S. Amari. 1994. “Network Information Criterion-Determining the Number of Hidden Units for an Artificial Neural Network Model.” IEEE Transactions on Neural Networks 5 (6): 865–72. https://doi.org/10.1109/72.329683.

Norets, Andriy. 2010. “Approximation of Conditional Densities by Smooth Mixtures of Regressions.” The Annals of Statistics 38 (3): 1733–66. https://doi.org/10.1214/09-AOS765.

Orr, Mark JL. 1996. “Introduction to Radial Basis Function Networks.” Technical Report, Center for Cognitive Science, University of Edinburgh. http://twyu2.synology.me/htdocs/class_2008_1/nn/Slides/Introduction%20to%20Radial%20Basis%20Function%20Networks%20(1996).pdf.

Peng, J., and Y. Wei. 2007. “Approximating K‐means‐type Clustering via Semidefinite Programming.” SIAM Journal on Optimization 18 (1): 186–205. https://doi.org/10.1137/050641983.

Priebe, Carey E. 1994. “Adaptive Mixtures.” Journal of the American Statistical Association 89 (427): 796–806. https://doi.org/10.1080/01621459.1994.10476813.

Priebe, Carey E., and David J. Marchette. 2000. “Alternating Kernel and Mixture Density Estimates.” Computational Statistics & Data Analysis 35 (1): 43–65. https://doi.org/10.1016/S0167-9473(00)00003-7.

Qian, Shie, and Dapang Chen. 1994. “Signal Representation Using Adaptive Normalized Gaussian Functions.” Signal Processing 36 (1): 1–11. https://doi.org/10.1016/0165-1684(94)90174-0.

Que, Qichao, and Mikhail Belkin. 2016. “Back to the Future: Radial Basis Function Networks Revisited.” In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016. http://www.jmlr.org/proceedings/papers/v51/que16-supp.pdf.

Rabusseau, Guillaume, and François Denis. 2014. “Learning Negative Mixture Models by Tensor Decompositions,” March. http://arxiv.org/abs/1403.4224.

Raghunathan, Aditi, Ravishankar Krishnaswamy, and Prateek Jain. 2017. “Learning Mixture of Gaussians with Streaming Data.” In Advances in Neural Information Processing Systems, 6608–17. http://arxiv.org/abs/1707.02391.

Rakhlin, Alexander, Dmitry Panchenko, and Sayan Mukherjee. 2005. “Risk Bounds for Mixture Density Estimation.” ESAIM: Probability and Statistics 9 (June): 220–29. https://doi.org/10.1051/ps:2005011.

Redner, R., and H. Walker. 1984. “Mixture Densities, Maximum Likelihood and the EM Algorithm.” SIAM Review 26 (2): 195–239. https://doi.org/10.1137/1026034.

Rissanen, J. 1984. “Universal Coding, Information, Prediction, and Estimation.” IEEE Transactions on Information Theory 30 (4): 629–36. https://doi.org/10.1109/TIT.1984.1056936.

Roeder, Kathryn, and Larry Wasserman. 1997. “Practical Bayesian Density Estimation Using Mixtures of Normals.” Journal of the American Statistical Association 92 (439): 894–902. https://doi.org/10.1080/01621459.1997.10474044.

Ruan, Lingyan, Ming Yuan, and Hui Zou. 2011. “Regularized Parameter Estimation in High-Dimensional Gaussian Mixture Models.” Neural Computation 23 (6): 1605–22. https://doi.org/10.1162/NECO_a_00128.

Schölkopf, Bernhard, Phil Knirsch, Alex Smola, and Chris Burges. 1998. “Fast Approximation of Support Vector Kernel Expansions, and an Interpretation of Clustering as Approximation in Feature Spaces.” In Mustererkennung 1998, edited by Paul Levi, Michael Schanz, Rolf-Jürgen Ahlers, and Franz May, 125–32. Informatik Aktuell. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-72282-0_12.

Schölkopf, Bernhard, Alexander Smola, and Klaus-Robert Müller. 1997. “Kernel Principal Component Analysis.” In Artificial Neural Networks — ICANN’97, edited by Wulfram Gerstner, Alain Germond, Martin Hasler, and Jean-Daniel Nicoud, 583–88. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/BFb0020217.

Sha, Fei, and Lawrence K. Saul. 2006. “Large Margin Hidden Markov Models for Automatic Speech Recognition.” In Advances in Neural Information Processing Systems, 1249–56. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_143.pdf.

Shimazaki, Hideaki, and Shigeru Shinomoto. 2010. “Kernel Bandwidth Optimization in Spike Rate Estimation.” Journal of Computational Neuroscience 29 (1-2): 171–82. https://doi.org/10.1007/s10827-009-0180-4.

Singh, Ajit P., and Geoffrey J. Gordon. 2008. “A Unified View of Matrix Factorization Models.” In Machine Learning and Knowledge Discovery in Databases, 358–73. Springer. https://www.select.cs.cmu.edu/publications/paperdir/ecml2008-singh-gordon.pdf.

Tipping, Michael E. 2000. “The Relevance Vector Machine.” In Advances in Neural Information Processing Systems, 652–58. MIT Press. http://papers.nips.cc/paper/1719-the-relevance-vector-machine.

———. 2001. “Sparse Bayesian Learning and the Relevance Vector Machine.” The Journal of Machine Learning Research 1: 211–44. https://doi.org/10.1162/15324430152748236.

Tipping, Michael E., and Cambridge Cb Nh. 2001. “Sparse Kernel Principal Component Analysis.” In Advances in Neural Information Processing Systems 13, 633–39. MIT Press. http://papers.nips.cc/paper/1791-sparse-kernel-principal-component-analysis.pdf.

Toulis, Panos, and Edoardo M. Airoldi. 2015. “Scalable Estimation Strategies Based on Stochastic Approximations: Classical Results and New Insights.” Statistics and Computing 25 (4): 781–95. https://doi.org/10.1007/s11222-015-9560-y.

Tschannen, Michael, and Helmut Bölcskei. 2016. “Noisy Subspace Clustering via Matching Pursuits,” December. http://arxiv.org/abs/1612.03450.

Vaart, AW van der, and Jon Wellner. 1996. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media.

Vempala, Santosh, and Grant Wang. 2004. “A Spectral Algorithm for Learning Mixture Models.” Journal of Computer and System Sciences 68 (4): 841–60. https://doi.org/10.1016/j.jcss.2003.11.008.

Weidmann, Claudio, and Martin Vetterli. 2012. “Rate Distortion Behavior of Sparse Sources.” IEEE Transactions on Information Theory 58 (8): 4969–92. https://doi.org/10.1109/TIT.2012.2201335.

Williams, Christopher K. I. 2001. “On a Connection Between Kernel PCA and Metric Multidimensional Scaling.” In Advances in Neural Information Processing Systems 13, edited by T. K. Leen, T. G. Dietterich, and V. Tresp, 46:675–81. MIT Press. https://doi.org/10.1023/A:1012485807823.

Wong, Wing Hung, and Xiaotong Shen. 1995. “Probability Inequalities for Likelihood Ratios and Convergence Rates of Sieve MLES.” The Annals of Statistics 23 (2): 339–62. https://doi.org/10.1214/aos/1176324524.

Wu, Qiang, Yiming Ying, and Ding-Xuan Zhou. 2007. “Multi-Kernel Regularized Classifiers.” Journal of Complexity 23 (1): 108–34. https://doi.org/10.1016/j.jco.2006.06.007.

Xu, Jian-Wu, A.R.C. Paiva, Il Park, and J.C. Principe. 2008. “A Reproducing Kernel Hilbert Space Framework for Information-Theoretic Learning.” IEEE Transactions on Signal Processing 56 (12): 5891–5902. https://doi.org/10.1109/TSP.2008.2005085.

Xu, Lei, and Michael I. Jordan. 1996. “On Convergence Properties of the EM Algorithm for Gaussian Mixtures.” Neural Computation 8 (1): 129–51. https://doi.org/10.1162/neco.1996.8.1.129.

Zeevi, A. J., R. Meir, and V. Maiorov. 1998. “Error Bounds for Functional Approximation and Estimation Using Mixtures of Experts.” IEEE Transactions on Information Theory 44 (3): 1010–25. https://doi.org/10.1109/18.669150.

Zeevi, Assaf J., and Ronny Meir. 1997. “Density Estimation Through Convex Combinations of Densities: Approximation and Estimation Bounds.” Neural Networks: The Official Journal of the International Neural Network Society 10 (1): 99–109. https://doi.org/10.1016/S0893-6080(96)00037-8.