- The classics
- Approximate decompositions
- Tutorials
- Non-negative matrix factorisations
- Why is approximate factorisation ever useful?
- As regression
- Sketching
- \(\mathcal{H}\)-matrix methods
- Randomized methods
- Connections to kernel learning
- Bayesian
- Lanczos
- Solving \((\mathrm{Z} \mathrm{Z}^{\top}+\sigma^2\mathrm{I})^{-1}\mathrm{B}\)
- As an optimisation problem
- Implementations
- References
The classics
The big six exact matrix decompositions are (Stewart 2000): Cholesky decomposition; pivoted LU decomposition; QR decomposition; spectral decomposition; Schur decomposition; and singular value decomposition.
See Nick Higham's summary of those.
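They are all one call away in SciPy. A minimal sketch (dimensions arbitrary; the symmetric positive-definite \(S\) is there for the factorisations that want one):

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
S = A @ A.T + 5 * np.eye(5)        # symmetric positive definite

C = linalg.cholesky(S, lower=True)  # Cholesky: S = C C^T
P, L, U = linalg.lu(A)              # pivoted LU: A = P L U
Q, R = linalg.qr(A)                 # QR: A = Q R
w, V = linalg.eigh(S)               # spectral: S = V diag(w) V^T
T, Z = linalg.schur(A)              # Schur: A = Z T Z^H
U2, s, Vt = linalg.svd(A)           # SVD: A = U2 diag(s) Vt
```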
Approximate decompositions
Mastered QR and LU decompositions? There are now so many ways of factorising matrices that there are not enough acronyms in the alphabet to hold them, especially if we suspect our matrix is sparse, or could be made sparse because of some underlying constraint, or probably could, if squinted at in the right fashion, be seen as a graph transition matrix, or Laplacian, or noisy transform of some smooth object, or at least would be close to sparse if we chose the right metric, or…
A big matrix is close to, in some sense, the (tensor/matrix) product (or sum, orโฆ) of some matrices that are in some way simple (small-rank, small dimension, sparse), possibly with additional constraints. Can we find those simple matrices?
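For the most vanilla version of that question, approximating a matrix by one of rank \(k\) in Frobenius or spectral norm, the Eckart–Young theorem says the truncated SVD is optimal. A minimal numpy sketch:

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A in Frobenius/spectral norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 100))  # exactly rank 8
A_noisy = A + 0.01 * rng.standard_normal(A.shape)
print(np.linalg.norm(A_noisy - best_rank_k(A_noisy, 8)))  # small residual
```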
Ethan Epperly's introduction to Low-rank Matrices puts many ideas clearly.
Here's an example: Godec, a decomposition into low-rank and sparse components which, loosely speaking, combines multidimensional factorisation and outlier detection.
There are so many more of these things, depending on your preferred choice of loss function, free variables and such.
Keywords: Matrix sketching, low-rank approximation, traditional dimensionality reduction.
Matrix concentration inequalities turn out to be a useful tool to prove that a given matrix decomposition is not too bad in a PAC sense.
I would like to learn more about
- sparse or low-rank matrix approximation as clustering for density estimation, which is how I imagine high-dimensional mixture models would need to work, and thereby also
- Mercer kernel approximation.
- Connection to manifold learning is also probably worth examining.
Igor Carron's Matrix Factorization Jungle classifies the following problems as matrix-factorisation type.
- Kernel Factorizations
  - …
- Spectral clustering
  - \(A = DX\) with unknown \(D\) and \(X\); solve for sparse \(X\) with \(X_{i} = 0\) or \(1\)
- K-Means / K-Median clustering
  - \(A = DX\) with unknown \(D\) and \(X\); solve for \(XX^{\top} = I\) with \(X_{i} = 0\) or \(1\)
- Subspace clustering
  - \(A = AX\) with unknown \(X\); solve for sparse/other conditions on \(X\)
- Graph Matching
  - \(A = XBX^{\top}\) with unknown \(X\), \(B\); solve for \(B\) and for \(X\) as a permutation
- NMF (see the sketch after this list)
  - \(A = DX\) with unknown \(D\) and \(X\); solve for the elements of \(D\), \(X\) positive
- Generalized Matrix Factorization
  - \(W \odot L \approx W \odot UV'\) with \(W\) a known mask, \(U\), \(V\) unknown; solve for \(U\), \(V\) and \(L\) of lowest possible rank
- Matrix Completion
  - \(A = H \odot L\) with \(H\) a known mask, \(L\) unknown; solve for \(L\) of lowest possible rank
- Stable Principal Component Pursuit (SPCP) / Noisy Robust PCA
  - \(A = L + S + N\) with \(L\), \(S\), \(N\) unknown; solve for \(L\) low rank, \(S\) sparse, \(N\) noise
- Robust PCA
  - \(A = L + S\) with \(L\), \(S\) unknown; solve for \(L\) low rank, \(S\) sparse
- Sparse PCA
  - \(A = DX\) with unknown \(D\) and \(X\); solve for sparse \(D\)
- Dictionary Learning
  - \(A = DX\) with unknown \(D\) and \(X\); solve for sparse \(X\)
- Archetypal Analysis
  - \(A = DX\) with unknown \(D\) and \(X\); solve for \(D = AB\) with \(D\), \(B\) positive
- Matrix Compressive Sensing (MCS)
  - find a rank-\(r\) matrix \(L\) such that \(A(L) \approx b\) or \(A(L+S) = b\)
- Multiple Measurement Vector (MMV)
  - \(Y = AX\) with unknown \(X\) and the rows of \(X\) sparse
- Compressed sensing
  - \(Y = AX\) with unknown \(X\) and the rows of \(X\) sparse; \(X\) has only one column
- Blind Source Separation (BSS)
  - \(Y = AX\) with unknown \(A\) and \(X\), and statistical independence between the columns of \(X\) or subspaces of the columns of \(X\)
- Partial and Online SVD/PCA
  - …
- Tensor Decomposition
  - Many, many options; see tensor decompositions for some tractable ones.
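To make one row of that list concrete, here is the NMF problem (\(A = DX\) with the elements of \(D\) and \(X\) positive) handed to scikit-learn; a sketch only, with the rank and solver settings chosen arbitrarily:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = np.abs(rng.standard_normal((200, 30)))   # nonnegative data matrix

model = NMF(n_components=5, init="nndsvda", max_iter=500)
D = model.fit_transform(A)                   # nonnegative "dictionary"
X = model.components_                        # nonnegative activations
print(np.linalg.norm(A - D @ X) / np.linalg.norm(A))  # relative reconstruction error
```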
Truncated Classic PCA is clearly also an example, but is excluded from the list for some reason. Boringness? The fact it's a special case of Sparse PCA?
I also add
- Square root
- \(Y = X^{\top}X\) for \(Y\in\mathbb{R}^{N\times N}, X\in\mathbb{R}^{n\times N}\), with (typically) \(n<N\).
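One way to obtain such a factor from a symmetric positive semi-definite \(Y\) is to truncate its eigendecomposition and absorb the square roots of the retained eigenvalues; a sketch, assuming the discarded eigenvalues are negligible:

```python
import numpy as np

def lowrank_sqrt(Y, n):
    """Return X (n x N) with X.T @ X ~= Y, for symmetric PSD Y."""
    evals, evecs = np.linalg.eigh(Y)            # eigenvalues in ascending order
    top = np.argsort(evals)[-n:]                # keep the n largest
    return np.sqrt(np.clip(evals[top], 0, None))[:, None] * evecs[:, top].T

rng = np.random.default_rng(0)
Z = rng.standard_normal((100, 10))
Y = Z @ Z.T                                     # rank-10 PSD matrix, 100 x 100
X = lowrank_sqrt(Y, 10)
print(np.linalg.norm(Y - X.T @ X) / np.linalg.norm(Y))  # ~ machine precision
```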
See also learning on manifolds, compressed sensing, optimisation, random linear algebra and clustering, penalised regression…
Tutorials
- Data mining seminar: Matrix sketching
- Kumar and Schneider have a literature survey on low rank approximation of matrices (Kumar and Schneider 2016)
- Preconditioning tutorial by Erica Klarreich
- Andrew McGregor's ICML Tutorial Streaming, sampling, sketching
- more at signals and graph.
- Another one that makes the link to clustering is Chris Ding's Principal Component Analysis and Matrix Factorizations for Learning
- Igor Carron's Advanced Matrix Factorization Jungle.
Non-negative matrix factorisations
Why is approximate factorisation ever useful?
For certain types of data matrix, here is a suggestive observation: Udell and Townsend (2019) ask "Why Are Big Data Matrices Approximately Low Rank?"
Matrices of (approximate) low rank are pervasive in data science, appearing in movie preferences, text documents, survey data, medical records, and genomics. While there is a vast literature on how to exploit low rank structure in these datasets, there is less attention paid to explaining why the low rank structure appears in the first place. Here, we explain the effectiveness of low rank models in data science by considering a simple generative model for these matrices: we suppose that each row or column is associated to a (possibly high dimensional) bounded latent variable, and entries of the matrix are generated by applying a piecewise analytic function to these latent variables. These matrices are in general full rank. However, we show that we can approximate every entry of an \(m\times n\) matrix drawn from this model to within a fixed absolute error by a low rank matrix whose rank grows as \(\mathcal{O}(\log(m+n))\). Hence any sufficiently large matrix from such a latent variable model can be approximated, up to a small entrywise error, by a low rank matrix.
Ethan Epperly argues from a function approximation perspective (e.g.) that we can deduce this property from smoothness of functions.
Saul (2023) connects non-negative matrix factorisation to geometric algebra and linear algebra via deep learning and kernels. That sounds like fun.
As regression
Total Least Squares (a.k.a. orthogonal distance regression, or error-in-variables least-squares linear regression) is a low-rank matrix approximation that minimises the Frobenius-norm distance from the data matrix. Who knew?
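A sketch of that connection: the TLS solution falls out of the SVD of the augmented data matrix \([A\;\;b]\) (the data and noise level below are made up for illustration):

```python
import numpy as np

def tls(A, b):
    """Total least squares solution of A x ~ b via the SVD of [A b]."""
    n = A.shape[1]
    _, _, Vt = np.linalg.svd(np.column_stack([A, b]))
    v = Vt[-1]              # right singular vector of the smallest singular value
    return -v[:n] / v[n]    # assumes v[n] != 0

rng = np.random.default_rng(0)
x_true = np.array([2.0, -1.0])
A = rng.standard_normal((200, 2))
b = A @ x_true
A_noisy = A + 0.05 * rng.standard_normal(A.shape)  # noise in the regressors too
b_noisy = b + 0.05 * rng.standard_normal(b.shape)
print(tls(A_noisy, b_noisy))                       # close to [2, -1]
```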
Various other dimensionality reduction techniques can be put in a regression framing, notably Exponential-family PCA.
Sketching
"Sketching" is a common term to describe a certain type of low-rank factorisation, although I am not sure which types.
(Martinsson 2016) mentions CUR and interpolative decompositions. What are those, now?
\(\mathcal{H}\)-matrix methods
It seems like low-rank matrix factorisation could be related to \(\mathcal{H}\)-matrix methods, but I do not know enough to say more.
See hmatrix.org for one lab's backgrounder and their implementation, h2lib, and hlibpro for a black-box closed-source one.
Randomized methods
Rather than find an optimal solution, why not just choose a random one which might be good enough? There are indeed randomised versions.
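A minimal sketch of the flavour, in the spirit of the Halko-Martinsson-Tropp randomised range finder (the target rank and oversampling below are arbitrary choices):

```python
import numpy as np

def randomized_svd(A, k, oversample=10):
    """Approximate rank-k SVD via a Gaussian sketch of the range of A."""
    rng = np.random.default_rng(0)
    Omega = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)      # orthonormal basis for an approximate range
    B = Q.T @ A                         # small (k + oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 20)) @ rng.standard_normal((20, 2000))  # low rank
U, s, Vt = randomized_svd(A, 20)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))  # tiny
```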
Connections to kernel learning
See (Grosse et al. 2012) for a mind-melting compositional matrix factorization diagram, constructing a search over hierarchical kernel decompositions that also turn out to have some matrix factorisation interpretations.
Exploiting compositionality to explore a large space of model structures
Bayesian
Nakajima and Sugiyama (2012):
Mnih and Salakhutdinov (2008) proposed a Bayesian maximum a posteriori (MAP) method based on the Gaussian noise model and Gaussian priors on the decomposed matrices. This method actually corresponds to minimizing the squared-loss with the trace-norm penalty (Srebro, Rennie, and Jaakkola 2004). Recently, the variational Bayesian (VB) approach (Attias 1999) has been applied to MF (Lim and Teh 2007; Raiko, Ilin, and Karhunen 2007), which we refer to as VBMF. The VBMF method was shown to perform very well in experiments. However, its good performance was not completely understood beyond its experimental success.
TODO: insert further developments here. Possibly Brouwer's thesis (Brouwer 2017) or Shakir Mohamed's (Mohamed 2011) would be a good start, or Benjamin Drave's tutorial, Probabilistic Matrix Factorization, and Xinghao Ding, Lihan He, and Carin (2011).
I am currently sitting in a seminar by He Zhao on Bayesian matrix factorisation, wherein he is building up this tool for discrete data, which is an interesting case. He starts from M. Zhou et al. (2012) and builds up to Zhao et al. (2018), introducing some hierarchical descriptions along the way. His methods seem to be sampling-based rather than variational (?).
Generalized² Linear² models (Gordon 2002) unify nonlinear matrix factorisations with Generalized Linear Models. I had not heard of that until recently; I wonder how common it is?
Lanczos
Lanczos decomposition is a handy approximation for matrices which are cheap to multiply, because of some structure, but expensive to store. It can also be used to approximately invert them cheaply.
I learnt this trick from Gaussian process literature in the context of Lanczos Variance Estimates (LOVE) (Pleiss et al. 2018), although I believe it exists elsewhere.
Given some rank \(k\) and an arbitrary starting vector \(\boldsymbol{b}\), the Lanczos algorithm iteratively approximates \(\mathrm{K} \in\mathbb{R}^{n \times n}\) by a low-rank factorisation \(\mathrm{K}\approx \mathrm{Q} \mathrm{T} \mathrm{Q}^{\top}\), where \(\mathrm{T} \in \mathbb{R}^{k \times k}\) is tridiagonal and \(\mathrm{Q} \in \mathbb{R}^{n \times k}\) has orthonormal columns. Crucially, we do not need to form \(\mathrm{K}\) to evaluate matrix-vector products \(\mathrm{K}\boldsymbol{b}\) for an arbitrary vector \(\boldsymbol{b}\). Moreover, given the Lanczos factors \(\mathrm{Q},\mathrm{T}\) we may estimate \[\begin{align*} \mathrm{K}^{-1}\boldsymbol{c}\approx \mathrm{Q}\mathrm{T}^{-1}\mathrm{Q}^{\top}\boldsymbol{c} \end{align*}\] even for \(\boldsymbol{c}\neq\boldsymbol{b}\). Say we wish to calculate \(\left(\mathrm{Z} \mathrm{Z}^{\top}+\sigma^2 \mathrm{I}\right)^{-1}\mathrm{B}\), with \(\mathrm{Z}\in\mathbb{R}^{D\times N}\) and \(N\ll D\).
We approximate the solution to this linear system using the partial Lanczos decomposition starting with probe vector \(\boldsymbol{b}=\overline{\mathrm{B}}\) and \(\mathrm{K}=\left(\mathrm{Z} \mathrm{Z}^{\top}+\sigma^2 \mathrm{I}\right)\).
This requires \(k\) matrix-vector products of the form \[\begin{align*} \underbrace{\left(\underbrace{\mathrm{Z} \mathrm{Z}^{\top}}_{\mathcal{O}(ND^2)}+\sigma^2 \mathrm{I}\right)\boldsymbol{b}}_{\mathcal{O}(D^2)} =\underbrace{\mathrm{Z} \underbrace{(\mathrm{Z}^{\top}\boldsymbol{b})}_{\mathcal{O}(ND)}}_{\mathcal{O}(ND)} +\sigma^2 \boldsymbol{b}. \end{align*}\] Using the latter representation, the required matrix-vector product may be found with a time complexity of \(\mathcal{O}(ND)\). Space complexity is also \(\mathcal{O}(ND)\). The output of the Lanczos decomposition is \(\mathrm{Q},\mathrm{T}\) such that \(\left(\mathrm{Z}\mathrm{Z}^{\top} +\sigma^2 \mathrm{I}\right)\boldsymbol{b}\approx \mathrm{Q} \mathrm{T} \mathrm{Q}^{\top}\boldsymbol{b}\). Then the solution to the inverse-matrix-vector product may be approximated by \(\left(\mathrm{Z} \mathrm{Z}^{\top} +\sigma^2 \mathrm{I}\right)^{-1} \mathrm{B}\approx \mathrm{Q}\mathrm{T}^{-1}\mathrm{Q}^{\top}\mathrm{B}\), requiring the solution in \(\mathrm{X}\) of the much smaller linear system \(\mathrm{X}\mathrm{T}=\mathrm{Q}\). Exploiting the positive-definiteness of \(\mathrm{T}\) we may use the Cholesky decomposition \(\mathrm{T}=\mathrm{L}\mathrm{L}^{\top}\) for a constant-factor speedup over solving an arbitrary linear system. The time cost of the solution is \(\mathcal{O}(Dk^3)\), for an overall cost of the matrix inversion of \(\mathcal{O}(NDk+Dk^3)\).
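A toy transcription of that procedure for a single right-hand side, with a hand-rolled Lanczos iteration (no reorthogonalisation, so treat it as a sketch rather than a robust solver; all dimensions below are made up):

```python
import numpy as np

def lanczos(matvec, b, k):
    """k-step Lanczos on a symmetric operator given only as a matvec.
    Returns Q (n x k, approximately orthonormal columns) and tridiagonal T (k x k)."""
    n = b.shape[0]
    Q = np.zeros((n, k))
    alpha, beta = np.zeros(k), np.zeros(k)
    q, q_prev = b / np.linalg.norm(b), np.zeros(n)
    for j in range(k):
        Q[:, j] = q
        w = matvec(q) - (beta[j - 1] * q_prev if j > 0 else 0.0)
        alpha[j] = q @ w
        w -= alpha[j] * q
        beta[j] = np.linalg.norm(w)
        q_prev, q = q, w / beta[j]
    T = np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)
    return Q, T

rng = np.random.default_rng(0)
D, N, sigma2, k = 500, 100, 50.0, 30
Z = rng.standard_normal((D, N))
matvec = lambda v: Z @ (Z.T @ v) + sigma2 * v     # never forms the D x D matrix
c = rng.standard_normal(D)
Q, T = lanczos(matvec, c, k)
x_lanczos = Q @ np.linalg.solve(T, Q.T @ c)       # ~ (Z Z^T + sigma2 I)^{-1} c
x_exact = np.linalg.solve(Z @ Z.T + sigma2 * np.eye(D), c)
print(np.linalg.norm(x_lanczos - x_exact) / np.linalg.norm(x_exact))  # small
```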
Solving \((\mathrm{Z} \mathrm{Z}^{\top}+\sigma^2\mathrm{I})^{-1}\mathrm{B}\)
This trick is very useful but surprisingly rarely mentioned in the classic textbooks.
Let us suppose we have a nearly-low-rank decomposition of some matrix \(\mathrm{K}=\mathrm{Z} \mathrm{Z}^{\top}+\sigma^2\mathrm{I}\) and wish to solve \(\mathrm{K}\mathrm{X}=\mathrm{B}\) for \(\mathrm{X}\), with \(\mathrm{K}\in\mathbb{R}^{D\times D },\mathrm{Z}\in\mathbb{R}^{D\times N },\mathrm{B},\mathrm{X}\in\mathbb{R}^{D\times M}\) and \(N\ll D\). Further, we would like to solve this efficiently, in the sense that we
- exploit the computational efficiency of the low rank structure of \(\mathrm{K}\) so that it costs less than \(\mathcal{O}(D^3M)\) to compute \(\mathrm{K}^{-1}\mathrm{B}\).
- avoid ever forming the explicit inverse matrix \(\mathrm{K}^{-1}\), which requires storage \(\mathcal{O}(D^2)\).
This is possible using the following recipe. Applying the Woodbury identity, \[\begin{align*} \mathrm{K}^{-1}=\sigma^{-2}\mathrm{I}-\sigma^{-4} \mathrm{Z}\left(\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\top} \mathrm{Z}\right)^{-1} \mathrm{Z}^{\top}. \end{align*}\] We compute the lower Cholesky decomposition \(\mathrm{L} \mathrm{L}^{\top}=\left(\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\top} \mathrm{Z}\right)^{-1}\) at a cost of \(\mathcal{O}(N^3)\), and define \(\mathrm{R}=\sigma^{-2}\mathrm{Z} \mathrm{L}\). We use this to discover \[ \mathrm{K}^{-1}=\sigma^{-2}\mathrm{I}-\mathrm{R} \mathrm{R}^{\top}, \] and we may thus compute the solution by matrix multiplication \[\begin{aligned} \mathrm{K}^{-1}\mathrm{B} &=\left(\mathrm{Z} \mathrm{Z}^{\top}+\sigma^2 \mathrm{I}\right)^{-1}\mathrm{B}\\ &=\left(\sigma^{-2}\mathrm{I}-\mathrm{R} \mathrm{R}^{\top}\right)\mathrm{B}\\ &=\underbrace{\sigma^{-2}\mathrm{B}}_{D \times M} - \underbrace{\mathrm{R}}_{D\times N} \underbrace{\mathrm{R}^{\top}\mathrm{B}}_{N\times M} \end{aligned}\]
The solution of the linear system is available at a cost which looks something like \(\mathcal{O}\left(N^2 D + NDM +N^3\right)\) (hmm, should check that). Generalising from \(\sigma^2\mathrm{I}\) to an arbitrary diagonal matrix is easy.
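A numpy transcription of that recipe (a sketch: it forms the small \(N\times N\) inverse explicitly for clarity, where a careful implementation would solve triangular systems instead; dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, M, sigma2 = 2000, 50, 10, 0.5
Z = rng.standard_normal((D, N))
B = rng.standard_normal((D, M))

inner = np.eye(N) + Z.T @ Z / sigma2            # small N x N matrix
L = np.linalg.cholesky(np.linalg.inv(inner))    # L L^T = (I + sigma^{-2} Z^T Z)^{-1}
R = Z @ L / sigma2                              # D x N
X = B / sigma2 - R @ (R.T @ B)                  # = (Z Z^T + sigma^2 I)^{-1} B

# check against the dense solve (only feasible for moderate D)
X_dense = np.linalg.solve(Z @ Z.T + sigma2 * np.eye(D), B)
print(np.abs(X - X_dense).max())
```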
TODO: discuss what happens if the low-rank matrix is not positive-definite.
As an optimisation problem
There are some generalised optimisation problems which look useful for this class of problem, e.g. Bhardwaj, Klep, and Magron (2021):
Polynomial optimization problems (POP) are prevalent in many areas of modern science and engineering. The goal of POP is to minimize a given polynomial over a set defined by finitely many polynomial inequalities, a semialgebraic set. This problem is well known to be NP-hard, and has motivated research for more practical methods to obtain approximate solutions with high accuracy. […]
One can naturally extend the ideas of positivity and sums of squares to the noncommutative (nc) setting by replacing the commutative variables \(z_1, \dots , z_n\) with noncommuting letters \(x_1, \dots , x_n\). The extension to the noncommutative setting is an inevitable consequence of the many areas of science which regularly optimize functions with noncommuting variables, such as matrices or operators. For instance in control theory, matrix completion, quantum information theory, or quantum chemistry.
Implementations
"Enough theory! Plug some algorithms into my data!"
In pytorch, various operations are made easier with cornellius-gp/linear_operator.
NMF Toolbox (MATLAB and Python):
Nonnegative matrix factorization (NMF) is a family of methods widely used for information retrieval across domains including text, images, and audio. Within music processing, NMF has been used for tasks such as transcription, source separation, and structure analysis. Prior work has shown that initialization and constrained update rules can drastically improve the chances of NMF converging to a musically meaningful solution. Along these lines we present the NMF toolbox, containing MATLAB and Python implementations of conceptually distinct NMF variants; in particular, this paper gives an overview for two algorithms. The first variant, called nonnegative matrix factor deconvolution (NMFD), extends the original NMF algorithm to the convolutive case, enforcing the temporal order of spectral templates. The second variant, called diagonal NMF, supports the development of sparse diagonal structures in the activation matrix. Our toolbox contains several demo applications and code examples to illustrate its potential and functionality. By providing MATLAB and Python code on a documentation website under a GNU-GPL license, as well as including illustrative examples, our aim is to foster research and education in the field of music processing.
Vowpal Wabbit factors matrices, e.g. for recommender systems. It seems the --qr version is more favoured.
HPC for MATLAB, R, Python, C++: libpmf:
LIBPMF implements the CCD++ algorithm, which aims to solve large-scale matrix factorization problems such as the low-rank factorization problems for recommender systems.
Spams (C++/MATLAB/python) includes some matrix factorisations in its sparse approx toolbox. (see optimisation)
scikit-learn (Python) does a few matrix factorisations in its inimitable batteries-in-the-kitchen-sink way.
…is a Python library for nonnegative matrix factorization. It includes implementations of several factorization methods, initialization approaches, and quality scoring. Both dense and sparse matrix representation are supported.
Tapkee (C++). Pro-tip: even without coding C++, tapkee does a long list of dimensionality reductions from the CLI.
- PCA and randomized PCA
- Kernel PCA (kPCA)
- Random projection
- Factor analysis
tensorly supports some interesting tensor decompositions.