Matrix square roots

Whitening, preconditioning etc



Assumed audience:

People with undergrad linear algebra

Interpretations and tricks for matrix square roots. TBC

Low-rank-plus-diagonal roots

Low-rank perturbations of the (scaled) identity do not, in general, have low-rank square roots, but there exist efficient algorithms for finding them at least. Fasi, Higham, and Liu (2023):

We consider the problem of computing the square root of a perturbation of the scaled identity matrix, \(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\), where \(\mathrm{U}\) and \(\mathrm{V}\) are \(n \times k\) matrices with \(k \leq n\). This problem arises in various applications, including computer vision and optimization methods for machine learning. We derive a new formula for the \(p\) th root of \(\mathrm{A}\) that involves a weighted sum of powers of the \(p\) th root of the \(k \times k\) matrix \(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\). This formula is particularly attractive for the square root, since the sum has just one term when \(p=2\). We also derive a new class of Newton iterations for computing the square root that exploit the low-rank structure.

Their method works for low-rank perturbations of a scaled identity matrix with no eigenvalues on the negative real axis.

Theorem: Let \(\mathrm{U}, \mathrm{V} \in \mathbb{C}^{n \times k}\) with \(k \leq n\) and assume that \(\mathrm{V}^{H} \mathrm{U}\) is nonsingular. Let \(f\) be defined on the spectrum of \(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\), and if \(k=n\) let \(f\) be defined at \(\alpha\). Then \[\quad f(\mathrm{A})=f(\alpha) \mathrm{I}_n+\mathrm{U}\left(\mathrm{V}^{H} \mathrm{U}\right)^{-1}\left(f\left(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\right)-f(\alpha) \mathrm{I}_k\right) \mathrm{V}^{H}.\] The theorem says two things: that \(f(\mathrm{A})\), like \(\mathrm{A}\), is a perturbation of rank at most \(k\) of the identity matrix and that \(f(\mathrm{A})\) can be computed by evaluating \(f\) and the inverse at two \(k \times k\) matrices.
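Here is a quick numerical check of that formula, a sketch of my own assuming NumPy/SciPy, taking \(f=\exp\) (via `scipy.linalg.expm`) and a random rank-3 perturbation:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n, k, alpha = 50, 3, 1.5
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k)) / n   # keep the perturbation mild
A = alpha * np.eye(n) + U @ V.T

# Dense reference: f(A) with f = exp.
dense = expm(A)

# The theorem's formula: only k x k matrices see a matrix function or an inverse.
VHU = V.T @ U
small = expm(alpha * np.eye(k) + VHU) - np.exp(alpha) * np.eye(k)
low_rank = np.exp(alpha) * np.eye(n) + U @ np.linalg.solve(VHU, small) @ V.T

print(np.linalg.norm(dense - low_rank) / np.linalg.norm(dense))  # tiny
```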

I would call that theorem a generalized Woodbury formula, and I think it is pretty cool, which tells you something about my current obsession profile. Anyway, they use it to derive the following:

Let \(\mathrm{U}, \mathrm{V} \in \mathbb{C}^{n \times k}\) with \(k \leq n\) have full rank and let the matrix \(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\) have no eigenvalues on \(\mathbb{R}^{-}\). Then for any integer \(p \geq 1\), \[ \mathrm{A}^{1 / p}=\alpha^{1 / p} \mathrm{I}_n+\mathrm{U}\left(\sum_{i=0}^{p-1} \alpha^{i / p} \cdot\left(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\right)^{(p-i-1) / p}\right)^{-1} \mathrm{V}^{H} \]

and in particular

Let \(\mathrm{U}, \mathrm{V} \in \mathbb{C}^{n \times k}\) with \(k \leq n\) have full rank and let the matrix \(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\) have no eigenvalues on \(\mathbb{R}^{-}\). Then \[ \mathrm{A}^{1 / 2}=\alpha^{1 / 2} \mathrm{I}_n+\mathrm{U}\left(\left(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\right)^{1 / 2}+\alpha^{1 / 2} \mathrm{I}_k\right)^{-1} \mathrm{V}^{H}. \]
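As a concrete sketch (my own, assuming NumPy/SciPy, not the paper's code), here is that square-root formula implemented literally and verified by squaring the result; the general \(p\)th-root formula above can be coded the same way, taking \(p\)th roots of the \(k \times k\) block instead.

```python
import numpy as np
from scipy.linalg import sqrtm

def sqrt_low_rank_update(alpha, U, V):
    """Return M (k x k) with (alpha*I_n + U V^H)^{1/2} = sqrt(alpha)*I_n + U @ M @ V^H.
    Valid when A has no eigenvalues on the negative real axis; only k x k
    dense linear algebra is needed."""
    k = U.shape[1]
    S = sqrtm(alpha * np.eye(k) + V.conj().T @ U)
    return np.linalg.inv(S + np.sqrt(alpha) * np.eye(k))

rng = np.random.default_rng(1)
n, k, alpha = 40, 3, 2.0
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k)) / n   # mild perturbation, spectrum stays off R^-
A = alpha * np.eye(n) + U @ V.conj().T

M = sqrt_low_rank_update(alpha, U, V)
root = np.sqrt(alpha) * np.eye(n) + U @ M @ V.conj().T
print(np.linalg.norm(root @ root - A) / np.linalg.norm(A))  # ~ machine precision
```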

They also derive an iteration for computing it, a structure-exploiting variant of the Denman-Beavers (DB) iteration:

The (scaled) DB iteration is \[ \begin{aligned} \mathrm{X}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{X}_i+\mu_i^{-1} \mathrm{Y}_i^{-1}\right), & \mathrm{X}_0=\mathrm{A}, \\ \mathrm{Y}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{Y}_i+\mu_i^{-1} \mathrm{X}_i^{-1}\right), & \mathrm{Y}_0=\mathrm{I}, \end{aligned} \] where the positive scaling parameter \(\mu_i \in \mathbb{R}\) can be used to accelerate the convergence of the method in its initial steps. The choice \(\mu_i=1\) yields the unscaled DB method, for which \(\mathrm{X}_i\) and \(\mathrm{Y}_i\) converge quadratically to \(\mathrm{A}^{1 / 2}\) and \(\mathrm{A}^{-1 / 2}\), respectively.
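To see it in action, here is the plain dense, unscaled iteration (\(\mu_i=1\)) in NumPy; this is the classical iteration applied naively, not the paper's structured low-rank variant, and the example setup is mine:

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
n, k, alpha = 30, 3, 2.0
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k)) / n
A = alpha * np.eye(n) + U @ V.conj().T

X, Y = A.copy(), np.eye(n)
target = sqrtm(A)
for i in range(8):
    # Unscaled Denman-Beavers step (mu_i = 1); both updates use the old X and Y.
    X, Y = 0.5 * (X + np.linalg.inv(Y)), 0.5 * (Y + np.linalg.inv(X))
    print(i, np.linalg.norm(X - target) / np.linalg.norm(target))
# The error roughly squares each step: X_i -> A^{1/2} and Y_i -> A^{-1/2}.
```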

… for \(i \geq 0\) the iterates \(\mathrm{X}_i\) and \(\mathrm{Y}_i\) can be written in the form \[ \begin{aligned} & \mathrm{X}_i=\beta_i \mathrm{I}_n+\mathrm{U} \mathrm{B}_i \mathrm{V}^{H}, \quad \beta_i \in \mathbb{C}, \quad \mathrm{B}_i \in \mathbb{C}^{k \times k}, \\ & \mathrm{Y}_i=\gamma_i \mathrm{I}_n+\mathrm{U} \mathrm{C}_i \mathrm{V}^{H}, \quad \gamma_i \in \mathbb{C}, \quad \mathrm{C}_i \in \mathbb{C}^{k \times k} . \\ & \end{aligned}\]
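That structural claim is easy to check numerically: if \(\mathrm{V}^{H} w=0\) then \(\mathrm{X}_i w=\beta_i w\), so \(\beta_i\) can be read off from any unit vector in the null space of \(\mathrm{V}^{H}\), and \(\mathrm{X}_i-\beta_i \mathrm{I}\) should have numerical rank at most \(k\). A verification sketch of my own, reusing the dense unscaled iteration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, alpha = 30, 3, 2.0
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k)) / n
A = alpha * np.eye(n) + U @ V.T

# A unit vector w with V^T w = 0, so that X_i w = beta_i * w.
w = np.linalg.svd(V.T)[2][-1]

X, Y = A.copy(), np.eye(n)
for i in range(5):
    X, Y = 0.5 * (X + np.linalg.inv(Y)), 0.5 * (Y + np.linalg.inv(X))
    beta = w @ X @ w          # Rayleigh quotient recovers beta_i here
    rank = np.linalg.matrix_rank(X - beta * np.eye(n))
    print(i, beta, rank)      # rank stays <= k at every iteration
```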

This gets us a computational speedup, although of a rather complicated kind. For a start, its constant factor is very favourable compared to the naive approach, but it also scales somewhat favourably with \(n\): less than cubic, though more than quadratic, depending on how fast the iteration converges. That in turn depends both on the problem and on good choices of the scaling parameters \(\mu_i\), for which they give a recipe, but it gets kinda complicated and engineering-y.

Anyway, let us suppose \(\mathrm{U}=\mathrm{V}=\mathrm{Z}\) and replay that. Then

\[ \mathrm{A}^{1 / 2}=\alpha^{1 / 2} \mathrm{I}_n+\mathrm{Z}\left(\left(\alpha \mathrm{I}_k+\mathrm{Z}^{H} \mathrm{Z}\right)^{1 / 2}+\alpha^{1 / 2} \mathrm{I}_k\right)^{-1} \mathrm{Z}^{H}. \]
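The payoff in this symmetric case is that we never need to form the \(n \times n\) root at all: applying \(\mathrm{A}^{1/2}\) to a vector costs one \(k \times k\) square root up front plus matrix-vector products with \(\mathrm{Z}\). A sketch of my own, assuming real \(\mathrm{Z}\) and \(\alpha > 0\) so that \(\mathrm{A}=\alpha\mathrm{I}+\mathrm{Z}\mathrm{Z}^{H}\) is positive definite:

```python
import numpy as np
from scipy.linalg import sqrtm

def make_sqrt_matvec(alpha, Z):
    """Return a function x -> (alpha*I + Z Z^T)^{1/2} x, doing only O(n k)
    work per call after an O(n k^2 + k^3) setup."""
    k = Z.shape[1]
    M = np.linalg.inv(sqrtm(alpha * np.eye(k) + Z.T @ Z)
                      + np.sqrt(alpha) * np.eye(k))
    return lambda x: np.sqrt(alpha) * x + Z @ (M @ (Z.T @ x))

rng = np.random.default_rng(4)
n, k, alpha = 500, 5, 0.5
Z = rng.standard_normal((n, k))
x = rng.standard_normal(n)

apply_sqrt = make_sqrt_matvec(alpha, Z)
y = apply_sqrt(x)

# Check against a dense computation.
A = alpha * np.eye(n) + Z @ Z.T
print(np.linalg.norm(sqrtm(A) @ x - y) / np.linalg.norm(y))  # small relative error
```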

The Denman-Beavers step becomes

\[ \begin{aligned} \mathrm{X}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{X}_i+\mu_i^{-1} \mathrm{Y}_i^{-1}\right), & \mathrm{X}_0=\mathrm{A}, \\ \mathrm{Y}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{Y}_i+\mu_i^{-1} \mathrm{X}_i^{-1}\right), & \mathrm{Y}_0=\mathrm{I}, \end{aligned} \]

This looked useful, but now I note that it gives us a full-size square root rather than a low-rank one, which is not helpful to me.

References

Del Moral, Pierre, and Angele Niclas. 2018. “A Taylor Expansion of the Square Root Matrix Functional.” arXiv.
Dolcetti, Alberto, and Donato Pertici. 2020. “Real Square Roots of Matrices: Differential Properties in Semi-Simple, Symmetric and Orthogonal Cases.” arXiv.
Fasi, Massimiliano, Nicholas J. Higham, and Xiaobo Liu. 2023. “Computing the Square Root of a Low-Rank Perturbation of the Scaled Identity Matrix.” SIAM Journal on Matrix Analysis and Applications 44 (1): 156–74.
Kessy, Agnan, Alex Lewin, and Korbinian Strimmer. 2018. “Optimal Whitening and Decorrelation.” The American Statistician 72 (4): 309–14.
Minka, Thomas P. 2000. “Old and New Matrix Algebra Useful for Statistics.”
Petersen, Kaare Brandt, and Michael Syskind Pedersen. 2012. “The Matrix Cookbook.”
Pleiss, Geoff, Martin Jankowiak, David Eriksson, Anil Damle, and Jacob Gardner. 2020. “Fast Matrix Square Roots with Applications to Gaussian Processes and Bayesian Optimization.” Advances in Neural Information Processing Systems 33.
Song, Yue, Nicu Sebe, and Wei Wang. 2022. “Fast Differentiable Matrix Square Root.”
