Matrix square roots

Whitening, preconditioning, etc.

August 5, 2014 — May 13, 2023

feature construction
functional analysis
high d
linear algebra
networks
probability
signal processing
sparser than thou
statistics

Assumed audience:

People with undergrad linear algebra

Figure 1: Interpretations and tricks for matrix square roots.

There are two kinds of things in circulation referred to as matrix square roots of a Hermitian matrix $M$:

  1. $X$ such that $XX = M$, and
  2. $X$ such that $XX^H = M$, when $M$ is positive semi-definite.

Sometimes either will do; other times we care about which. Sometimes we want a particular structure, e.g. lower triangular as in the Cholesky decomposition.
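To make the distinction concrete, here is a minimal NumPy sketch (mine, not from any of the cited references) contrasting the two notions for a positive-definite $M$: the principal Hermitian root via an eigendecomposition, and a triangular Cholesky-type factor.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
M = B @ B.T + 5 * np.eye(5)  # Hermitian (here real symmetric) positive definite

# Notion 1: the principal square root, X @ X == M, itself Hermitian PSD.
w, Q = np.linalg.eigh(M)
X_sym = (Q * np.sqrt(w)) @ Q.T

# Notion 2: a factor X with X @ X^H == M, e.g. the lower-triangular Cholesky factor.
X_chol = np.linalg.cholesky(M)

print(np.allclose(X_sym @ X_sym, M))      # True
print(np.allclose(X_chol @ X_chol.T, M))  # True
```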

1 Almost low-rank roots

Square roots of nearly-low-rank matrices are not themselves nearly low rank in general, but there are efficient algorithms for computing them, at least in special cases. Fasi, Higham, and Liu (2023):

We consider the problem of computing the square root of a perturbation of the scaled identity matrix, $A = \alpha I_n + UV^H$, where $U$ and $V$ are $n \times k$ matrices with $k \le n$. This problem arises in various applications, including computer vision and optimization methods for machine learning. We derive a new formula for the $p$th root of $A$ that involves a weighted sum of powers of the $p$th root of the $k \times k$ matrix $\alpha I_k + V^H U$. This formula is particularly attractive for the square root, since the sum has just one term when $p = 2$. We also derive a new class of Newton iterations for computing the square root that exploit the low-rank structure.

Their method works for low-rank-plus-scaled-identity matrices with no eigenvalues on the closed negative real axis.

Theorem: Let $U, V \in \mathbb{C}^{n \times k}$ with $k \le n$ and assume that $V^H U$ is nonsingular. Let $f$ be defined on the spectrum of $A = \alpha I_n + UV^H$, and if $k = n$ let $f$ be defined at $\alpha$. Then
$$f(A) = f(\alpha) I_n + U (V^H U)^{-1} \bigl( f(\alpha I_k + V^H U) - f(\alpha) I_k \bigr) V^H.$$
The theorem says two things: that $f(A)$, like $A$, is a perturbation of rank at most $k$ of the identity matrix, and that $f(A)$ can be computed by evaluating $f$ and the inverse at two $k \times k$ matrices.
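As a sanity check, here is a small NumPy/SciPy sketch of that formula (my own illustration, with $f = \exp$ standing in for a generic matrix function), compared against a dense computation:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
n, k, alpha = 40, 3, 2.0
U = rng.standard_normal((n, k)) / np.sqrt(n)
V = rng.standard_normal((n, k)) / np.sqrt(n)
A = alpha * np.eye(n) + U @ V.T

# f(A) via the rank-k formula, here with f = exp; only k x k matrices get exponentiated and inverted.
S = alpha * np.eye(k) + V.T @ U
core = np.linalg.solve(V.T @ U, expm(S) - np.exp(alpha) * np.eye(k))
fA = np.exp(alpha) * np.eye(n) + U @ core @ V.T

print(np.allclose(fA, expm(A)))  # True, up to roundoff
```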

I would call that formula a generalized Woodbury formula, and I think it is pretty cool, which tells you something about my current obsession profile. Anyway, they use it to discover the following:

Let $U, V \in \mathbb{C}^{n \times k}$ with $k \le n$ have full rank and let the matrix $A = \alpha I_n + UV^H$ have no eigenvalues on $\mathbb{R}^-$. Then for any integer $p \ge 1$,
$$A^{1/p} = \alpha^{1/p} I_n + U \left( \sum_{i=0}^{p-1} \alpha^{i/p} \,(\alpha I_k + V^H U)^{(p-i-1)/p} \right)^{-1} V^H$$

and in particular

Let $U, V \in \mathbb{C}^{n \times k}$ with $k \le n$ have full rank and let the matrix $A = \alpha I_n + UV^H$ have no eigenvalues on $\mathbb{R}^-$. Then
$$A^{1/2} = \alpha^{1/2} I_n + U \left( (\alpha I_k + V^H U)^{1/2} + \alpha^{1/2} I_k \right)^{-1} V^H.$$
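A quick numerical check of the square-root formula against `scipy.linalg.sqrtm` (again my own sketch, with dimensions chosen arbitrarily):

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
n, k, alpha = 50, 4, 3.0
U = rng.standard_normal((n, k)) / np.sqrt(n)
V = rng.standard_normal((n, k)) / np.sqrt(n)
A = alpha * np.eye(n) + U @ V.T  # eigenvalues cluster near alpha > 0, away from the negative real axis

# Per the corollary: only a k x k square root and a k x k solve are needed.
S = sqrtm(alpha * np.eye(k) + V.T @ U) + np.sqrt(alpha) * np.eye(k)
A_half = np.sqrt(alpha) * np.eye(n) + U @ np.linalg.solve(S, V.T)

print(np.allclose(A_half @ A_half, A))  # True
print(np.allclose(A_half, sqrtm(A)))    # matches the principal square root
```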

They also derive an explicit iteration for computing it, namely the (scaled) Denman-Beavers iteration:

The (scaled) DB iteration is
$$X_{i+1} = \tfrac{1}{2}\left(\mu_i X_i + \mu_i^{-1} Y_i^{-1}\right), \quad X_0 = A,$$
$$Y_{i+1} = \tfrac{1}{2}\left(\mu_i Y_i + \mu_i^{-1} X_i^{-1}\right), \quad Y_0 = I,$$
where the positive scaling parameter $\mu_i \in \mathbb{R}$ can be used to accelerate the convergence of the method in its initial steps. The choice $\mu_i = 1$ yields the unscaled DB method, for which $X_i$ and $Y_i$ converge quadratically to $A^{1/2}$ and $A^{-1/2}$, respectively.

… for $i \ge 0$ the iterates $X_i$ and $Y_i$ can be written in the form
$$X_i = \beta_i I_n + U B_i V^H, \quad \beta_i \in \mathbb{C},\ B_i \in \mathbb{C}^{k \times k},$$
$$Y_i = \gamma_i I_n + U C_i V^H, \quad \gamma_i \in \mathbb{C},\ C_i \in \mathbb{C}^{k \times k}.$$
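For reference, a naive dense implementation of the unscaled iteration ($\mu_i = 1$), without the low-rank $B_i, C_i$ parameterisation, looks like this (a sketch of the textbook iteration, not the paper's structured variant):

```python
import numpy as np

def db_sqrt(A, iters=25):
    """Unscaled Denman-Beavers iteration (mu_i = 1): X_i -> A^(1/2), Y_i -> A^(-1/2)."""
    X, Y = A.copy(), np.eye(A.shape[0])
    for _ in range(iters):
        Xn = 0.5 * (X + np.linalg.inv(Y))
        Yn = 0.5 * (Y + np.linalg.inv(X))
        X, Y = Xn, Yn
    return X, Y

rng = np.random.default_rng(3)
B = rng.standard_normal((6, 6))
A = B @ B.T + np.eye(6)  # symmetric positive definite, so no eigenvalues on the negative real axis

X, Y = db_sqrt(A)
print(np.allclose(X @ X, A))              # X ~ A^(1/2)
print(np.allclose(Y @ A @ Y, np.eye(6)))  # Y ~ A^(-1/2)
```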

This low-rank parameterisation gets us a computational speed-up, although of a rather complicated kind. Its constant factor is favourable compared to the naive approach, and its scaling with $n$ is also somewhat favourable: less than cubic although more than quadratic, depending on the convergence rate of the iteration, which in turn depends both on the problem and on the selection of the scaling parameters $\mu_i$. They give a recipe for that selection, but it gets kinda complicated and engineering-y.

Anyway, let us suppose $U = V = Z$ and replay that. Then
$$A^{1/2} = \alpha^{1/2} I_n + Z \left( (\alpha I_k + Z^H Z)^{1/2} + \alpha^{1/2} I_k \right)^{-1} Z^H.$$

The Denman-Beavers step becomes
$$X_{i+1} = \tfrac{1}{2}\left(\mu_i X_i + \mu_i^{-1} Y_i^{-1}\right), \quad X_0 = A,$$
$$Y_{i+1} = \tfrac{1}{2}\left(\mu_i Y_i + \mu_i^{-1} X_i^{-1}\right), \quad Y_0 = I.$$

This looks useful, although note that it gives us a full-size square root.
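That said, we never need to materialise the $n \times n$ root to use it: given $Z$ and the $k \times k$ factor, we can apply $A^{1/2}$ to a vector in $O(nk)$ per product, which is usually the operation we actually want, e.g. for sampling or whitening. A sketch (my own, following the formula above):

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(4)
n, k, alpha = 200, 5, 1.5
Z = rng.standard_normal((n, k))

# Precompute only the k x k piece; A = alpha*I + Z Z^H is never formed below, nor is its root.
S = sqrtm(alpha * np.eye(k) + Z.T @ Z) + np.sqrt(alpha) * np.eye(k)
S_inv = np.linalg.inv(S)

def sqrt_A_matvec(x):
    """A^(1/2) @ x in O(nk) per product, using the low-rank representation of the root."""
    return np.sqrt(alpha) * x + Z @ (S_inv @ (Z.T @ x))

# Dense check (only for the test).
A = alpha * np.eye(n) + Z @ Z.T
x = rng.standard_normal(n)
print(np.allclose(sqrt_A_matvec(x), sqrtm(A) @ x))  # True
```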

2 Diagonal-plus-low-rank Cholesky

Louis Tiao, in Efficient Cholesky decomposition of low-rank updates, summarises Seeger (2004).
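The underlying trick is to update an existing Cholesky factor after a low-rank change rather than refactorising from scratch. As a flavour of it, here is a sketch of the classical rank-1 update recurrence (the textbook `cholupdate`-style algorithm, not necessarily Seeger's exact formulation); a rank-$k$ update is just $k$ of these in sequence.

```python
import numpy as np

def chol_rank1_update(L, v):
    """Given lower-triangular L with L @ L.T = A, return the Cholesky factor of
    A + v @ v.T in O(n^2), via the classical rank-1 update recurrence."""
    L, v = L.copy(), v.copy()
    n = L.shape[0]
    for i in range(n):
        r = np.hypot(L[i, i], v[i])
        c, s = r / L[i, i], v[i] / L[i, i]
        L[i, i] = r
        L[i + 1:, i] = (L[i + 1:, i] + s * v[i + 1:]) / c
        v[i + 1:] = c * v[i + 1:] - s * L[i + 1:, i]
    return L

rng = np.random.default_rng(5)
B = rng.standard_normal((6, 6))
A = B @ B.T + np.eye(6)
L = np.linalg.cholesky(A)
v = rng.standard_normal(6)

L_new = chol_rank1_update(L, v)
print(np.allclose(L_new @ L_new.T, A + np.outer(v, v)))  # True
```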

3 References

Del Moral, and Niclas. 2018. “A Taylor Expansion of the Square Root Matrix Functional.”
Dolcetti, and Pertici. 2020. “Real Square Roots of Matrices: Differential Properties in Semi-Simple, Symmetric and Orthogonal Cases.”
Fasi, Higham, and Liu. 2023. “Computing the Square Root of a Low-Rank Perturbation of the Scaled Identity Matrix.” SIAM Journal on Matrix Analysis and Applications.
Kessy, Lewin, and Strimmer. 2018. “Optimal Whitening and Decorrelation.” The American Statistician.
Minka. 2000. Old and new matrix algebra useful for statistics.
Petersen, and Pedersen. 2012. “The Matrix Cookbook.”
Pleiss, Jankowiak, Eriksson, et al. 2020. “Fast Matrix Square Roots with Applications to Gaussian Processes and Bayesian Optimization.” Advances in Neural Information Processing Systems.
Saad. 2003. Iterative Methods for Sparse Linear Systems: Second Edition.
Seeger. 2004. Low Rank Updates for the Cholesky Decomposition.
Song, Sebe, and Wang. 2022. “Fast Differentiable Matrix Square Root.” In International Conference on Learning Representations.