Nearly-low-rank Hermitian matrices

a.k.a. perturbations of the identity, low-rank-plus-diagonal matrices

August 5, 2014 — October 28, 2024

feature construction
functional analysis
high d
linear algebra
networks
probability
signal processing
sparser than thou
statistics

Assumed audience:

People with undergrad linear algebra

Figure 1

A decomposition that some matrix might possess that ends up being practically useful: K=σ2I+aZZ where KRN×N, ZRN×D with DN and a{1,1}. For compactness, we call matrices of this form nearly-low-rank, as a sibling of the actually low rank matrices. I write Z for the conjugate transpose of Z and use that here rather than the transpose because sometimes I want to think of Z as a complex matrix, and last I checked, most stuff here generalised to that case easily. Such lowish-rank matrices are clearly Hermitian and thus arise in, e.g. covariance estimation.

After many arguments with reviewers I have discovered that you should not use this name, because nearly low rank to some people means most of the eigenvalues are close to zero and cannot abide any other meaning to that phrase. I would parse that as meaning they think near means near in the nuclear norm; there is nothing particularly canonical about that interpretation, IMO, but I will use any bloody name if it saves me time arguing with reviewers. Or even a contorted acronym, how about that?

So, if you are one such person, who finds nearly low rank to be a frustration, imagine that I called them Perturbation Enhanced Diagonal Adjusted Numerical Tensors; this might help.

This matrix structure is a workhorse. We might get such low-rank decompositions from matrix factorisation or from some prior structure in the problem; To pick one example, Ensemble filters have covariance matrices with such a form. In many applications, σ2 is chosen so that the entire matrix is positive definite; whether we need this or not depends upon the application; usually we do.

Sometimes we admit a more general version, where the diagonal is allowed to be other-than-constant, so that K=V+aZZ where V=diag(s) for sRD.

1 Factorisations

1.1 Eigendecomposition

Nakatsukasa () observes that, for general YZ,

The nonzero eigenvalues of ZY are equal to those of YZ: an identity that holds as long as the products are square, even when Y,Z are rectangular. This fact naturally suggests an efficient algorithm for computing eigenvalues and eigenvectors of a low-rank matrix K=ZY with Y,ZCD×N,DN: form the small N×N matrix ZY and find its eigenvalues and eigenvectors. For nonzero eigenvalues, the eigenvectors are related by ZYv=λvYZw=λw with w=Yv[…]

Cool. Our low-rank matrix K has additional special structure Y=Z, i.e. symmetry. We use that the nonzero eigenvalues of ZZ are equal to those of ZZ. So, to compute eigenvalues and eigenvectors of a low-rank matrix X=ZZ: form the small N×N matrix ZZ and find its eigenvalues and eigenvectors. For nonzero eigenvalues, the eigenvectors are related by ZZv=λvZZw=λw with w=Zv.

A classic piece of lore is cheap eigendecomposition of K=ZZ+σ2I by exploiting the low-rank structure and SVD. First, we calculate the SVD of Z to obtain Z=USV, where URD×N and VRN×N are orthogonal and SRN×N is diagonal. Then we may write K=ZZ+σ2I=USVVSU+σ2I=US2U+σ2I Thus the top N eigenvalues of K are σ2+sn2, and the corresponding eigenvectors are un. The remaining eigenvalues are σ2, and the corresponding eigenvectors are an arbitrary subset in the complement of the U eigenvectors.

1.2 Square roots

One useful trick, at matrix square roots. Note that it is not necessarily closed, i.e. it does not generate a new low rank matrix.

1.3 Cholesky

Louis Tiao, in Efficient Cholesky decomposition of low-rank updates summarises Seeger ().

1.4 Cholesky decomposition of a harmonic mean

This one pops up from time to time too. Suppose C,Z1,Z2RD×N and ND. We wish to find CC=((Z1Z1)1+(Z2Z2)1)1=Z1Z1(Z1Z1+Z2Z2)1Z2Z2. where that last line is the Woodbury identity. Can we do that in a low-rank way?

2 Inverting

Specifically, solving X=KY for X with YRD×M and, in particular, solving it efficiently, in the sense that we

  1. exploit the computational efficiency of the low-rank structure of K so that it costs less than O(D3M) to compute K1Y.
  2. avoid forming the explicit inverse matrix K1 which requires storage O(D2).

The workhorse tool for these is the Woodbury identity, which has many variants, but often the basic version does what we want: Assuming both A and B are invertible, (A+CBC)1=A1A1C(B1+CA1C)1CA1. See Ken Tay’s intro to Woodbury identities for more about those. For now, we note that they give us an efficient way of calculating matrix inverses.

There is a connection to the eigendecomposition clearly; we may also think about inversion by operating on the eigenvalues.

2.1 Classic

I have no idea who invented this; it seems to be part of the folklore now, but also it was not in my undergraduate degree.

Assume a=1. Applying the Woodbury identity, K1=σ2Iσ4Z(I+σ2ZZ)1Z. Computing the lower Cholesky decomposition LL=(I+σ2ZZ)1 at a cost of O(N3), and defining R=σ2ZL we can write the inverse compactly as K1=σ2IRR. Cool, so we have an explicit form for the inverse, which is still a nearly-low-rank operator. We may solve for X by matrix multiplication, K1Y=(σ2I+ZZ)1Y=(σ2IRR)Y=σ2YD×MRD×NRYN×M The solution of the linear system is available at a cost which looks something like O(N2D+NDM+N3) (hmm, should check that).

The price of this efficient inversion is that pre-multiplying by the nearly-low-rank inverse is not as numerically stable as classic matrix solution methods; but for many purposes, this price is acceptable.

What else can we say now? Well, if K is positive definite, then so is K1, i.e. both have positive eigenvalues. Suppose the eigenvalues of K in descending order are λjK,j{1,,D}. Note that for j{N+1,,D} we have λjK=σ2. That is, all λjKσ2. We can deduce that any non-zero eigenvalues λjZ of Z have the form λjZ=λjKσ2.

The eigenvalues λjK1 of K1 in descending order are λjK1=(λD+1jK)1. Further, λjK1=σ2 for j{1,,DN}. The remaining eigenvalues are sandwiched in between σ2 and 0, as we’d expect with a matrix inverse, from which it follows that the eigenvalues of RR are all negative, specifically j,λjR[σ2,0], so the eigenvalues of R are in the range λjR[0,σ1].

Note that the inverse is also a nearly-low-rank matrix, but with a=1. We can invert that guy also. Applying the Woodbury identity again, (K1)1=(σ2IRR)1=σ2I+σ4R(I+σ2RR)1R Once again we compute the Cholesky decomposition MM=(I+σ2RR)1 at a cost of O(N3), and define Z=σ2RM. We write the recovered inverse as K1=σ2I+ZZ.

In principle, Z=Z if the Cholesky decomposition is unique, which is true if (I+σ2RR) resp. (I+σ2ZZ) is positive definite. In practice, none of this is terribly numerically stable, so I wouldn’t depend upon that in computational practice.

Terminology: For some reason, these (I+σ2RR) and (I+σ2ZZ) are referred to as capacitance matrices.

If such a capacitance is merely positive semidefinite, then we need to do some more work to make it unique. And if it is indefinite (presumably because of numerical stability problems), then we are potentially in trouble because the square root is not real. We can still have complex-valued solutions if we want to go there.

2.2 General diagonals

OK, how about if we admit general diagonal matrices V? Then KV1=V1V1Z(I+ZV1Z)1ZV1. Now we need the Cholesky decomposition of LL=(I+ZV1Z)1, and define R as before; the new low-rank inverse is KV1=V1RR.

2.3 Centred

Suppose KZC=ZCZ+σ2I where C is a centring matrix. Then we can no longer use the Woodbury identity directly because C is rank deficient, but a variant Woodbury identity (; ) applies, to wit: (V+ZCZ)1=V1V1ZC(I+ZV1ZC)1ZV1 from which KZC1=(ZCZ+σ2I)1=σ2Iσ4ZC(I+σ2ZZC)1Z=σ2Iσ4Z(C+σ2CZZC)1Z We recover a different form for the R factor, namely RC:=σ2Zchol((C+σ2CZZC)1)=σ2Zchol((C+σ2ZZ)1). If Z is already centred I think we get RC=σ2Zchol((C+σ2ZZ)1). There are a lot of alternate Woodbury identities for alternate setups (; , ; ).

3 Determinant

The matrix determinant lemma tells us:

Suppose A is an invertible n×n matrix and U,V are n×m matrices. Then det(A+UV)=det(Im+VA1U)det(A).

In the special case A=In this is the Weinstein-Aronszajn identity. Given additionally an invertible m-by- m matrix W, the relationship can also be expressed as det(A+UWV)=det(W1+VA1U)det(W)det(A).

Clearly, det(V+ZZ)=det(ID+ZV1Z)det(V)

4 Products

4.1 Primal

Specifically, (YY+σ2I)(ZZ+σ2I). Are low rank products cheap?

(YY+σ2I)(ZZ+σ2I)=YYZZ+σ2YY+σ2ZZ+σ4I=Y(YZ)Z+σ2YY+σ2ZZ+σ4I which is still a sum of low-rank approximations. At this point it might be natural to consider a tensor decomposition.

4.2 Inverse

Suppose the low-rank inverse factors of Y and X are, respectively, R and C. Then we have

(YY+σ2I)1(ZZ+σ2I)1=(σ2IRR)(σ2ICC)=σ4Iσ4RRσ4CC+σ4R(RC)C

Once again, cheap to evaluate, but not so obviously nice.

5 Distances

5.1 Frobenius

Suppose we want to measure the Frobenius distance between KU,σ2 and KR,γ2. We recall that we might expect things to be nice if they are exactly low rank because, e.g. UUF2=tr(UUUU)=UUF2 How does it come out as a distance between two nearly-low-rank matrices? The answer may be found without forming the full matrices. For compactness, we define δ2=σ2γ2. UU+σ2IRR+γ2IF2=UURR+δ2IF2=UU+iRiR+δ2IF2=[UiR][UiR]+δ2IF2=[UiR][UiR]+δ2I,[UiR][UiR]+δ2IF=[UiR][UiR],[UiR][UiR]F+δ2I,δ2IF+2Re([UiR][UiR],δ2IF)=Tr([UiR][UiR][UiR][UiR])+δ4D+2δ2ReTr([UiR][UiR])=Tr((UURR)(UURR))+δ4D+2δ2Tr(UURRdagger)=Tr(UUUU)2Tr(UURR)+Tr(RRRR)+δ4D+2δ2(Tr(UU)Tr(RR))=Tr(UUUU)2Tr(URRU)+Tr(RRRR)+δ4D+2δ2(Tr(UU)Tr(RR))=UUF22URF2+RRF2+δ4D+2δ2(UF2RF2)

6 Stochastic approximation

TBC

7 Tools

There is support for some of the simplifications mentioned for PyTorch’s linear algebra in cornellius-gp/linear_operator.

8 Incoming

Discuss in terms of perturbation theory? (; )

Bamieh () in particular is compact and clear.

9 References

Akimoto. 2017. Fast Eigen Decomposition for Low-Rank Matrix Approximation.”
Ameli, and Shadden. 2023. A Singular Woodbury and Pseudo-Determinant Matrix Identities and Application to Gaussian Process Regression.” Applied Mathematics and Computation.
Babacan, Luessi, Molina, et al. 2012. Sparse Bayesian Methods for Low-Rank Matrix Estimation.” IEEE Transactions on Signal Processing.
Bach, Francis R. 2013. Sharp Analysis of Low-Rank Kernel Matrix Approximations. In COLT.
Bach, C, Ceglia, Song, et al. 2019. Randomized Low-Rank Approximation Methods for Projection-Based Model Order Reduction of Large Nonlinear Dynamical Problems.” International Journal for Numerical Methods in Engineering.
Bamieh. 2022. A Tutorial on Matrix Perturbation Theory (Using Compact Matrix Notation).”
Barbier, Macris, and Miolane. 2017. The Layered Structure of Tensor Estimation and Its Mutual Information.” arXiv:1709.10368 [Cond-Mat, Physics:math-Ph].
Bauckhage. 2015. K-Means Clustering Is Matrix Factorization.” arXiv:1512.07548 [Stat].
Berry, Browne, Langville, et al. 2007. Algorithms and Applications for Approximate Nonnegative Matrix Factorization.” Computational Statistics & Data Analysis.
Brand. 2002. Incremental Singular Value Decomposition of Uncertain Data with Missing Values.” In Computer Vision — ECCV 2002.
———. 2006. Fast Low-Rank Modifications of the Thin Singular Value Decomposition.” Linear Algebra and Its Applications, Special Issue on Large Scale Linear and Nonlinear Eigenvalue Problems,.
Chen, and Chi. 2018. Harnessing Structures in Big Data via Guaranteed Low-Rank Matrix Estimation: Recent Theory and Fast Algorithms via Convex and Nonconvex Optimization.” IEEE Signal Processing Magazine.
Chi, Lu, and Chen. 2019. Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview.” IEEE Transactions on Signal Processing.
Cichocki, Lee, Oseledets, et al. 2016. Low-Rank Tensor Networks for Dimensionality Reduction and Large-Scale Optimization Problems: Perspectives and Challenges PART 1.” arXiv:1609.00893 [Cs].
Dahleh, Dahleh, and Verghese. 1990. Matrix Perturbations.” In Lectures on Dynamic Systems and Control.
Drineas, and Mahoney. 2005. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning.” Journal of Machine Learning Research.
Fasi, Higham, and Liu. 2023. Computing the Square Root of a Low-Rank Perturbation of the Scaled Identity Matrix.” SIAM Journal on Matrix Analysis and Applications.
Flammia, Gross, Liu, et al. 2012. Quantum Tomography via Compressed Sensing: Error Bounds, Sample Complexity, and Efficient Estimators.” New Journal of Physics.
Ghashami, Liberty, Phillips, et al. 2015. Frequent Directions : Simple and Deterministic Matrix Sketching.” arXiv:1501.01711 [Cs].
Gordon. 2002. Generalized² Linear² Models.” In Proceedings of the 15th International Conference on Neural Information Processing Systems. NIPS’02.
Gross, D. 2011. Recovering Low-Rank Matrices From Few Coefficients in Any Basis.” IEEE Transactions on Information Theory.
Gross, David, Liu, Flammia, et al. 2010. Quantum State Tomography via Compressed Sensing.” Physical Review Letters.
Hager. 1989. Updating the Inverse of a Matrix.” SIAM Review.
Halko, Martinsson, and Tropp. 2010. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.”
Harbrecht, Peters, and Schneider. 2012. On the Low-Rank Approximation by the Pivoted Cholesky Decomposition.” Applied Numerical Mathematics, Third Chilean Workshop on Numerical Analysis of Partial Differential Equations (WONAPDE 2010),.
Harville. 1976. Extension of the Gauss-Markov Theorem to Include the Estimation of Random Effects.” The Annals of Statistics.
———. 1977. Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems.” Journal of the American Statistical Association.
Hastie, Mazumder, Lee, et al. 2015. Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.” In Journal of Machine Learning Research.
Henderson, and Searle. 1981. On Deriving the Inverse of a Sum of Matrices.” SIAM Review.
Hoaglin, and Welsch. 1978. The Hat Matrix in Regression and ANOVA.” The American Statistician.
Kala, and Klaczyński. 1994. Generalized Inverses of a Sum of Matrices.” Sankhyā: The Indian Journal of Statistics, Series A (1961-2002).
Kannan. 2016. Scalable and Distributed Constrained Low Rank Approximations.”
Kannan, Ballard, and Park. 2016. A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization.” In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP ’16.
Kim, and Park. 2008. Nonnegative Matrix Factorization Based on Alternating Nonnegativity Constrained Least Squares and Active Set Method.” SIAM Journal on Matrix Analysis and Applications.
Kumar, and Shneider. 2016. Literature Survey on Low Rank Approximation of Matrices.” arXiv:1606.06511 [Cs, Math].
Liberty, Woolfe, Martinsson, et al. 2007. Randomized Algorithms for the Low-Rank Approximation of Matrices.” Proceedings of the National Academy of Sciences.
Lim, and Teh. 2007. “Variational Bayesian Approach to Movie Rating Prediction.” In Proceedings of KDD Cup and Workshop.
Lin. 2016. A Review on Low-Rank Models in Signal and Data Analysis.”
Liu, and Tao. 2015. On the Performance of Manhattan Nonnegative Matrix Factorization.” IEEE Transactions on Neural Networks and Learning Systems.
Lu. 2022. A Rigorous Introduction to Linear Models.”
Mahoney. 2010. Randomized Algorithms for Matrices and Data.
Martinsson. 2016. Randomized Methods for Matrix Computations and Analysis of High Dimensional Data.” arXiv:1607.01649 [Math].
Miller. 1981. On the Inverse of the Sum of Matrices.” Mathematics Magazine.
Minka. 2000. Old and new matrix algebra useful for statistics.
Nakajima, and Sugiyama. 2012. “Theoretical Analysis of Bayesian Matrix Factorization.” Journal of Machine Learning Research.
Nakatsukasa. 2019. The Low-Rank Eigenvalue Problem.”
Nowak, and Litvinenko. 2013. Kriging and Spatial Design Accelerated by Orders of Magnitude: Combining Low-Rank Covariance Approximations with FFT-Techniques.” Mathematical Geosciences.
Petersen, and Pedersen. 2012. The Matrix Cookbook.”
Rellich. 1954. Perturbation theory of eigenvalue problems.
Rokhlin, Szlam, and Tygert. 2009. A Randomized Algorithm for Principal Component Analysis.” SIAM J. Matrix Anal. Appl.
Saad. 2003. Iterative Methods for Sparse Linear Systems: Second Edition.
Salakhutdinov, and Mnih. 2008. Bayesian Probabilistic Matrix Factorization Using Markov Chain Monte Carlo.” In Proceedings of the 25th International Conference on Machine Learning. ICML ’08.
Saul. 2023. A Geometrical Connection Between Sparse and Low-Rank Matrices and Its Application to Manifold Learning.” Transactions on Machine Learning Research.
Seeger, ed. 2004. Low Rank Updates for the Cholesky Decomposition.
Seshadhri, Sharma, Stolman, et al. 2020. The Impossibility of Low-Rank Representations for Triangle-Rich Complex Networks.” Proceedings of the National Academy of Sciences.
Shi, Zheng, and Yang. 2017. Survey on Probabilistic Models of Low-Rank Matrix Factorizations.” Entropy.
Srebro, Rennie, and Jaakkola. 2004. Maximum-Margin Matrix Factorization.” In Advances in Neural Information Processing Systems. NIPS’04.
Sundin. 2016. “Bayesian Methods for Sparse and Low-Rank Matrix Problems.”
Tropp, Yurtsever, Udell, et al. 2016. Randomized Single-View Algorithms for Low-Rank Matrix Approximation.” arXiv:1609.00048 [Cs, Math, Stat].
———, et al. 2017. Practical Sketching Algorithms for Low-Rank Matrix Approximation.” SIAM Journal on Matrix Analysis and Applications.
Tufts, and Kumaresan. 1982. Estimation of Frequencies of Multiple Sinusoids: Making Linear Prediction Perform Like Maximum Likelihood.” Proceedings of the IEEE.
Türkmen. 2015. A Review of Nonnegative Matrix Factorization Methods for Clustering.” arXiv:1507.03194 [Cs, Stat].
Udell, and Townsend. 2019. Why Are Big Data Matrices Approximately Low Rank? SIAM Journal on Mathematics of Data Science.
Wilkinson, Andersen, Reiss, et al. 2019. End-to-End Probabilistic Inference for Nonstationary Audio Analysis.” arXiv:1901.11436 [Cs, Eess, Stat].
Woodruff. 2014. Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends in Theoretical Computer Science 1.0.
Woolfe, Liberty, Rokhlin, et al. 2008. A Fast Randomized Algorithm for the Approximation of Matrices.” Applied and Computational Harmonic Analysis.
Xinghao Ding, Lihan He, and Carin. 2011. Bayesian Robust Principal Component Analysis.” IEEE Transactions on Image Processing.
Yang, Fang, Duan, et al. 2018. Fast Low-Rank Bayesian Matrix Completion with Hierarchical Gaussian Prior Models.” IEEE Transactions on Signal Processing.
Yin, Gao, and Lin. 2016. Laplacian Regularized Low-Rank Representation and Its Applications.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Yu, Hsiang-Fu, Hsieh, Si, et al. 2012. Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems.” In IEEE International Conference of Data Mining.
———, et al. 2014. Parallel Matrix Factorization for Recommender Systems.” Knowledge and Information Systems.
Yu, Chenhan D., March, and Biros. 2017. An NlogN Parallel Fast Direct Solver for Kernel Matrices.” In arXiv:1701.02324 [Cs].
Zhang, Kai, Liu, Zhang, et al. 2017. Randomization or Condensation?: Linear-Cost Matrix Sketching Via Cascaded Compression Sampling.” In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’17.
Zhang, Xiao, Wang, and Gu. 2017. Stochastic Variance-Reduced Gradient Descent for Low-Rank Matrix Recovery from Linear Measurements.” arXiv:1701.00481 [Stat].
Zhou, and Tao. 2011. GoDec: Randomized Low-Rank & Sparse Matrix Decomposition in Noisy Case.” In Proceedings of the 28th International Conference on International Conference on Machine Learning. ICML’11.
———. 2012. Multi-Label Subspace Ensemble.” Journal of Machine Learning Research.