Data dimensionality reduction

Wherein I teach myself, amongst other things, feature selection, how sparse PCA works, and where to file multidimensional scaling

March 23, 2015 – September 11, 2020

Figure 1


I will restructure learning on manifolds and dimensionality reduction into a more useful distinction.

You have lots of predictors in your regression model! Too many predictors! You wish there were fewer predictors! Maybe then it would be faster, or at least more compact. Can you throw some out, or summarise them in some sense? This connects with the notion of similarity as seen in kernel tricks, with learning an index, with inducing a differential metric, and with matrix factorisations, random features, random projections and high-dimensional statistics. Ultimately, this is always (at least implicitly) learning a manifold. A good dimension reduction can produce a nearly sufficient statistic for indirect inference.

1 Bayes

Throwing out data in a classical Bayes context is a subtle matter, but it can be done. See Bayesian model selection.

2 Learning a summary statistic

See learning summary statistics. As seen in approximate Bayes. Note this is not at all the same thing as discarding predictors; rather it is about learning a useful statistic to make inferences over some more intractable ones.

3 Feature selection

Figure 2

Deciding whether to include or discard predictors. This one is old and has been included in regression models for a long time. Model selection is a classic one, and regularised sparse model selection is the surprisingly effective recent evolution. But it continues! FOCI is an application of an interesting new independence test (Azadkia and Chatterjee 2019) that is very much en vogue despite being in an area that we all thought was thoroughly mined out.
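The regularised sparse flavour can be sketched in a few lines: an ℓ₁-penalised least-squares fit by iterative soft-thresholding (ISTA). This is a toy illustration, not a recommendation; the penalty level and the synthetic data are arbitrary choices.

```python
import numpy as np

def lasso_ista(X, y, alpha=5.0, n_iter=1000):
    """Minimise 0.5 * ||X b - y||^2 + alpha * ||b||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta - step * X.T @ (X @ beta - y)                          # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)   # soft-threshold
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# only predictors 0 and 3 actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=100)
beta = lasso_ista(X, y)
selected = np.flatnonzero(np.abs(beta) > 1e-8)  # the predictors the lasso keeps
```

The point being that the ℓ₁ penalty sets the irrelevant coefficients to exactly zero, which is the "selection" part; a plain ridge penalty would merely shrink them.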

4 PCA and cousins

A classic. Has a nice probabilistic interpretation “for free” via the Karhunen-Loève theorem. Matrix factorisations are the family that contains such methods. 🏗
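For concreteness, plain PCA is little more than an SVD of the centred data matrix (a minimal sketch; the synthetic low-rank data is illustrative):

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)                  # centre the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                   # low-dimensional coordinates
    explained = s[:k] ** 2 / np.sum(s ** 2)  # variance ratio per component
    return scores, Vt[:k], explained

rng = np.random.default_rng(1)
# data that is essentially 2-dimensional, embedded noisily in 5 dimensions
latent = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
X = latent + 0.01 * rng.normal(size=(200, 5))
scores, components, explained = pca(X, 2)
```

Two components capture nearly all the variance here, which is exactly the situation where dimension reduction is a free lunch.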

There are various extensions such as additive component analysis:

We propose Additive Component Analysis (ACA), a novel nonlinear extension of PCA. Inspired by multivariate nonparametric regression with additive models, ACA fits a smooth manifold to data by learning an explicit mapping from a low-dimensional latent space to the input space, which trivially enables applications like denoising.

More interesting to me is Exponential Family PCA, which is a generalisation of PCA to non-Gaussian distributions (and I presume to non-additive relations). How does this even work? (Collins, Dasgupta, and Schapire 2001; Jun Li and Dacheng Tao 2013; Liu, Dobriban, and Singer 2017; Mohamed, Ghahramani, and Heller 2008).

5 Learning a distance metric

A related notion is to learn a simpler way of quantifying, in some sense, how similar two datapoints are. This usually involves learning an embedding in some low-dimensional ambient space as a by-product.

5.1 UMAP

Uniform manifold approximation and projection for dimension reduction (McInnes, Healy, and Melville 2018). Apparently super hot right now (HT James Nichols). Nikolay Oskolkov's introduction is neat. John Baez discusses the category-theoretic underpinning.

5.2 For indexing my database

See learnable indexes.

5.3 Locality preserving projections

Figure 3

Try to preserve the nearness of points that are connected on some (weight) graph, by minimising

\[\sum_{i,j}(y_i-y_j)^2 w_{i,j}\]

So we seek an optimal projection vector.

(requirement for sparse similarity matrix?)
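A sketch of the spirit of that objective (strictly, this is the nonparametric cousin, a Laplacian-eigenmaps-style embedding, rather than LPP's linear projection; the k-nearest-neighbour weight graph is an arbitrary choice):

```python
import numpy as np

def laplacian_embedding(X, dim=1, k=5):
    """Embed points so graph-neighbours stay close: minimising
    sum_ij w_ij (y_i - y_j)^2 leads to bottom eigenvectors of the graph Laplacian."""
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # symmetric k-nearest-neighbour adjacency as the weight graph
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:k + 1]:   # skip self at index 0
            W[i, j] = W[j, i] = 1.0
    D = np.diag(W.sum(axis=1))
    L = D - W
    # generalised eigenproblem L y = lambda D y, via the normalised Laplacian
    Dinv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    vals, vecs = np.linalg.eigh(Dinv_sqrt @ L @ Dinv_sqrt)
    # discard the trivial constant eigenvector, keep the next `dim`
    return Dinv_sqrt @ vecs[:, 1:dim + 1]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, size=(20, 3)),
               rng.normal(5, 0.1, size=(20, 3))])  # two tight clusters
y = laplacian_embedding(X, dim=1)
```

On two well-separated clusters the embedding is constant within each cluster, which is the "preserve nearness of connected points" objective taken to its extreme.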

5.4 Diffusion maps

This manifold-learning technique seemed fashionable for a while (Ronald R. Coifman and Lafon 2006; R. R. Coifman et al. 2005).
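As I understand the recipe: build a Gaussian kernel, normalise it into a Markov transition matrix, and use its leading non-trivial eigenvectors, scaled by powers of the eigenvalues, as coordinates. A sketch, with the bandwidth `eps` and diffusion time `t` chosen arbitrarily:

```python
import numpy as np

def diffusion_map(X, dim=2, eps=1.0, t=1):
    """Diffusion map coordinates: right eigenvectors of the diffusion
    operator, scaled by the t-th power of their eigenvalues."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / eps)                    # Gaussian kernel
    q = K.sum(axis=1)
    P = K / q[:, None]                       # row-normalise: Markov transition matrix
    # conjugate to a symmetric matrix so we can use eigh
    S = np.diag(np.sqrt(q)) @ P @ np.diag(1.0 / np.sqrt(q))
    vals, vecs = np.linalg.eigh(S)
    idx = np.argsort(vals)[::-1]             # largest eigenvalues first
    vals, vecs = vals[idx], vecs[:, idx]
    psi = vecs / np.sqrt(q)[:, None]         # recover right eigenvectors of P
    # drop the trivial top eigenvector, scale the rest by lambda^t
    return psi[:, 1:dim + 1] * vals[1:dim + 1] ** t

rng = np.random.default_rng(3)
theta = rng.uniform(0, 2 * np.pi, size=100)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.01 * rng.normal(size=(100, 2))
coords = diffusion_map(X, dim=2, eps=0.5)
```

Euclidean distance between the resulting coordinates approximates the diffusion distance on the data graph, which is the selling point of the construction.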

Mikhail Belkin connects this to the graph laplacian literature.

5.5 As manifold learning

Same thing, with some different emphases and history, over at manifold learning.

5.6 Multidimensional scaling

Figure 5
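Classical (Torgerson) MDS, at least, is easy to state: double-centre the squared distance matrix and eigendecompose (a minimal sketch):

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: embed points so Euclidean distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]      # top-k eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, k=2)
# for genuinely Euclidean D, distances are recovered up to rotation/reflection
D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
```

For genuinely Euclidean distance matrices this coincides with PCA on the underlying points; the interesting (metric/non-metric) variants arise when D is not Euclidean.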


5.7 Random projection

See random embeddings.
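The Johnson-Lindenstrauss flavour is almost a one-liner: multiply by a scaled random Gaussian matrix, and pairwise distances survive approximately (a sketch; the dimensions here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 50, 1000, 300                   # n points, ambient dim d, target dim k
X = rng.normal(size=(n, d))
R = rng.normal(size=(d, k)) / np.sqrt(k)  # scaled Gaussian projection matrix
Y = X @ R                                 # the entire "learning" step

def pdists(Z):
    """All pairwise Euclidean distances."""
    return np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)

# how badly were the pairwise distances distorted?
orig, proj = pdists(X), pdists(Y)
mask = ~np.eye(n, dtype=bool)
ratios = proj[mask] / orig[mask]
```

The distortion concentrates around 1 at rate roughly \(1/\sqrt{k}\), independent of the ambient dimension, which is the Johnson-Lindenstrauss magic.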

5.8 Stochastic neighbour embedding and other visualisation-oriented methods

These methods are designed to make high-dimensional data sets look comprehensible in a low-dimensional representation.

Probabilistically preserving closeness. The best-known of these techniques is the famous t-SNE, although as far as I understand it has been largely superseded by UMAP.

My colleague Ben Harwood advises:

Instead of reducing and visualising higher dimensional data with t-SNE or PCA, here are three relatively recent non-linear dimension reduction techniques that are designed for visualising high dimensional data in 2D or 3D: Trimap, LargeVis, and UMAP.

Trimap and LargeVis are learned mappings that I would expect to be more representative of the original data than what t-SNE provides. UMAP assumes connectedness of the manifold, so it's probably less suitable for data that contains distinct clusters, but otherwise it is still a great option.

6 Autoencoder and word2vec

The “nonlinear PCA” interpretation of word2vec, which I just heard about from Junbin Gao.

\[L(x, x') = \|x-x'\|^2=\|x-\sigma'(U\sigma(W^\top x+b) + b')\|^2\]
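A sketch of that single-hidden-layer autoencoder loss in code (the shapes, the random initialisation, and the choice of tanh for both nonlinearities are arbitrary here):

```python
import numpy as np

def sigma(z):
    return np.tanh(z)  # an arbitrary choice of nonlinearity

rng = np.random.default_rng(6)
d, h = 10, 3                          # input dim, bottleneck dim
W = 0.1 * rng.normal(size=(d, h))     # encoder weights
b = np.zeros(h)                       # encoder bias
U = 0.1 * rng.normal(size=(d, h))     # decoder weights
b2 = np.zeros(d)                      # decoder bias (b' in the equation)

def reconstruct(x):
    code = sigma(W.T @ x + b)         # encode down to the bottleneck
    return sigma(U @ code + b2)       # decode back to input space

def loss(x):
    return np.sum((x - reconstruct(x)) ** 2)  # squared reconstruction error

x = rng.normal(size=d)
```

Training would then minimise this loss over (W, b, U, b2); with linear activations and tied weights, the optimum recovers the PCA subspace, hence the "nonlinear PCA" reading.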


7 For models rather than data

What if I don't want to reduce the order of the data, but rather of a model? See model order reduction.

8 Incoming

9 References

Azadkia, and Chatterjee. 2019. “A Simple Measure of Conditional Dependence.” arXiv:1910.12327 [Cs, Math, Stat].
Bach, and Jordan. 2002. “Kernel Independent Component Analysis.” Journal of Machine Learning Research.
Charpentier, Mussard, and Ouraga. 2021. “Principal Component Analysis: A Generalized Gini Approach.” European Journal of Operational Research.
Coifman, Ronald R., and Lafon. 2006. “Diffusion Maps.” Applied and Computational Harmonic Analysis, Special Issue: Diffusion Maps and Wavelets.
Coifman, R. R., Lafon, Lee, et al. 2005. “Geometric Diffusions as a Tool for Harmonic Analysis and Structure Definition of Data: Diffusion Maps.” Proceedings of the National Academy of Sciences.
Collins, Dasgupta, and Schapire. 2001. “A Generalization of Principal Components Analysis to the Exponential Family.” In Advances in Neural Information Processing Systems.
Cook. 2018. “Principal Components, Sufficient Dimension Reduction, and Envelopes.” Annual Review of Statistics and Its Application.
de Castro, and Dorigo. 2019. “INFERNO: Inference-Aware Neural Optimisation.” Computer Physics Communications.
Dwibedi, Aytar, Tompson, et al. 2019. “Temporal Cycle-Consistency Learning.”
Gerber, Pospisil, Navandar, et al. 2020. “Low-Cost Scalable Discretization, Prediction, and Feature Selection for Complex Systems.” Science Advances.
Globerson, and Roweis. 2006. “Metric Learning by Collapsing Classes.” In Advances in Neural Information Processing Systems. NIPS’05.
Gordon. 2002. “Generalized² Linear² Models.” In Proceedings of the 15th International Conference on Neural Information Processing Systems. NIPS’02.
Goroshin, Bruna, Tompson, et al. 2014. “Unsupervised Learning of Spatiotemporally Coherent Metrics.” arXiv:1412.6056 [Cs].
Hadsell, Chopra, and LeCun. 2006. “Dimensionality Reduction by Learning an Invariant Mapping.” In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
Hinton, Geoffrey, and Roweis. 2002. “Stochastic Neighbor Embedding.” In Proceedings of the 15th International Conference on Neural Information Processing Systems. NIPS’02.
Hinton, Geoffrey E., and Salakhutdinov. 2006. “Reducing the Dimensionality of Data with Neural Networks.” Science.
Hyvärinen, and Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks.
Jun Li, and Dacheng Tao. 2013. “Simple Exponential Family PCA.” IEEE Transactions on Neural Networks and Learning Systems.
Kandanaarachchi, and Hyndman. 2021. “Dimension Reduction for Outlier Detection Using DOBIN.” Journal of Computational and Graphical Statistics.
Kim, and Klabjan. 2020. “A Simple and Fast Algorithm for L1-Norm Kernel PCA.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Lawrence. 2005. “Probabilistic Non-Linear Principal Component Analysis with Gaussian Process Latent Variable Models.” Journal of Machine Learning Research.
Liu, Dobriban, and Singer. 2017. “\(e\)PCA: High Dimensional Exponential Family PCA.”
Lopez-Paz, Sra, Smola, et al. 2014. “Randomized Nonlinear Component Analysis.” arXiv:1402.0119 [Cs, Stat].
McInnes, Healy, and Melville. 2018. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” arXiv:1802.03426 [Cs, Stat].
Mohamed, Ghahramani, and Heller. 2008. “Bayesian Exponential Family PCA.” In Advances in Neural Information Processing Systems.
Murdock, and Torre. 2017. “Additive Component Analysis.” In Conference on Computer Vision and Pattern Recognition (CVPR).
Oymak, and Tropp. 2015. “Universality Laws for Randomized Dimension Reduction, with Applications.” arXiv:1511.09433 [Cs, Math, Stat].
Peluffo-Ordóñez, Lee, and Verleysen. 2014. “Short Review of Dimensionality Reduction Methods Based on Stochastic Neighbour Embedding.” In Advances in Self-Organizing Maps and Learning Vector Quantization.
Rohe, and Zeng. 2020. “Vintage Factor Analysis with Varimax Performs Statistical Inference.” arXiv:2004.05387 [Math, Stat].
Roy, Gordon, and Thrun. 2005. “Finding Approximate POMDP Solutions Through Belief Compression.” Journal of Artificial Intelligence Research.
Salakhutdinov, and Hinton. 2007. “Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure.” In PMLR.
Smola, Williamson, Mika, et al. 1999. “Regularized Principal Manifolds.” In Computational Learning Theory. Lecture Notes in Computer Science 1572.
Sohn, and Lee. 2012. “Learning Invariant Representations with Local Transformations.” In Proceedings of the 29th International Conference on Machine Learning (ICML-12).
Sorzano, Vargas, and Montano. 2014. “A Survey of Dimensionality Reduction Techniques.” arXiv:1403.2877 [Cs, q-Bio, Stat].
van der Maaten, and Hinton. 2008. “Visualizing Data Using t-SNE.” Journal of Machine Learning Research.
Wang, Hu, Gao, et al. 2017. “Locality Preserving Projections for Grassmann Manifold.” In Proceedings of IJCAI 2017.
Wasserman. 2018. “Topological Data Analysis.” Annual Review of Statistics and Its Application.
Weinberger, Dasgupta, Langford, et al. 2009. “Feature Hashing for Large Scale Multitask Learning.” In Proceedings of the 26th Annual International Conference on Machine Learning. ICML ’09.