Nearly-low-rank Hermitian matrices

a.k.a. perturbations of the identity, low-rank-plus-diagonal matrices



Assumed audience:

People with undergrad linear algebra

Here is a decomposition that some matrix might possess that ends up being useful: \[\mathrm{K}= \sigma^2\mathrm{I} + a\mathrm{Z} \mathrm{Z}^{\dagger}\] where \(\mathrm{K}\in\mathbb{R}^{D\times D}\), \(\mathrm{Z}\in\mathbb{R}^{D\times N}\) with \(N\ll D\) and \(a\in\{-1, 1\}\). For compactness we call matrices of this form nearly-low-rank, as siblings of the actually-low-rank matrices. I write \(\mathrm{Z}^{\dagger}\) for the conjugate transpose of \(\mathrm{Z}\), and use that here rather than the plain transpose because sometimes I want to think of \(\mathrm{Z}\) as a complex matrix, and last I checked, most of what follows generalised easily to that case. Such lowish-rank matrices are clearly Hermitian, and thus arise in, e.g., covariance estimation.

This matrix structure is a workhorse. We might get such a low-rank decomposition from matrix factorisation, or from some prior structuring of a problem; to pick one example, ensemble filters have covariance matrices that look a lot like this. In many applications \(\sigma^2\) is chosen so that the entire matrix is positive definite; whether we need this or not depends upon the application, but usually we do want that.

Sometimes we admit a more general version, where the diagonal is allowed to be other-than-constant, so that \(\mathrm{K}=\mathrm{V} + a\mathrm{Z} \mathrm{Z}^{\dagger}\) where \(\mathrm{V} = \operatorname{diag}(\boldsymbol{s})\) for \(\boldsymbol{s}\in\mathbb{R}^{D}\).
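To make the "workhorse" claim concrete: we never need to form the \(D\times D\) matrix to apply it. Here is a minimal numpy sketch (the function name is mine, not from any library) of a matrix-vector product with such a matrix in \(\mathcal{O}(DN)\) time and memory:

```python
import numpy as np

def nearly_low_rank_matvec(Z, x, sigma2=1.0, a=1.0, s=None):
    """Compute K @ x for K = diag(s) + a * Z @ Z^H, with diag(s) = sigma2 * I by default,
    in O(DN) time, without ever forming the D x D matrix K."""
    diag_part = sigma2 * x if s is None else s * x
    return diag_part + a * (Z @ (Z.conj().T @ x))

# Sanity check against the dense construction.
rng = np.random.default_rng(0)
D, N, sigma2 = 500, 10, 0.5
Z = rng.standard_normal((D, N))
x = rng.standard_normal(D)
K = sigma2 * np.eye(D) + Z @ Z.T
assert np.allclose(nearly_low_rank_matvec(Z, x, sigma2=sigma2), K @ x)
```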

Factorisations

Eigendecomposition

Nakatsukasa (2019) observes that, for general conformable \(\mathrm{Y}\) and \(\mathrm{Z}\),

The nonzero eigenvalues of \(\mathrm{Z} \mathrm{Y}\) are equal to those of \(\mathrm{Y} \mathrm{Z}\): an identity that holds as long as the products are square, even when \(\mathrm{Y}, \mathrm{Z}\) are rectangular. This fact naturally suggests an efficient algorithm for computing eigenvalues and eigenvectors of a low-rank matrix \(\mathrm{K}=\mathrm{Y} \mathrm{Z}\) with \(\mathrm{Y}, \mathrm{Z}^{\dagger} \in \mathbb{C}^{D \times N}, D \gg N\) : form the small \(N \times N\) matrix \(\mathrm{Z} \mathrm{Y}\) and find its eigenvalues and eigenvectors. For nonzero eigenvalues, the eigenvectors are related by \(\mathrm{Y} \mathrm{Z} v=\lambda v \Leftrightarrow \mathrm{Z} \mathrm{Y} w=\lambda w\) with \(w=\mathrm{Z} v\)[…]

Cool. Our low-rank matrix has the additional special structure \(\mathrm{Y}=\mathrm{Z}^\dagger\), i.e. it is Hermitian. We use that the nonzero eigenvalues of \(\mathrm{Z} \mathrm{Z}^{\dagger}\) are equal to those of \(\mathrm{Z}^{\dagger} \mathrm{Z}\). So, to compute eigenvalues and eigenvectors of the low-rank matrix \(\mathrm{Z} \mathrm{Z}^{\dagger}\): form the small \(N \times N\) matrix \(\mathrm{Z}^{\dagger} \mathrm{Z}\) and find its eigenvalues and eigenvectors. For nonzero eigenvalues, the eigenvectors are related by \(\mathrm{Z} \mathrm{Z}^{\dagger} v=\lambda v \Leftrightarrow \mathrm{Z}^{\dagger} \mathrm{Z} w=\lambda w\) with \(w=\mathrm{Z}^{\dagger} v\).
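A minimal numpy sketch of that recipe (function name mine): eigendecompose the small Gram matrix \(\mathrm{Z}^{\dagger}\mathrm{Z}\), then lift the eigenvectors back up through \(\mathrm{Z}\), normalising by \(\sqrt{\lambda}\) so they have unit norm.

```python
import numpy as np

def low_rank_eig(Z, tol=1e-12):
    """Nonzero eigenpairs of the D x D matrix Z @ Z^H via the N x N Gram matrix Z^H @ Z.
    Costs O(D N^2 + N^3) rather than the O(D^3) of a dense eigendecomposition."""
    lam, W = np.linalg.eigh(Z.conj().T @ Z)   # small eigenproblem for Z^H Z
    keep = lam > tol                          # discard the (numerically) zero eigenvalues
    lam, W = lam[keep], W[:, keep]
    V = (Z @ W) / np.sqrt(lam)                # v = Z w / sqrt(lam) are unit eigenvectors of Z Z^H
    return lam, V

rng = np.random.default_rng(1)
D, N = 400, 8
Z = rng.standard_normal((D, N))
lam, V = low_rank_eig(Z)
assert np.allclose(Z @ (Z.T @ V), V * lam)    # check Z Z^H v = lam v, column by column
```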

A classic piece of lore is cheap eigendecomposition of \(\mathrm{K}=\mathrm{Z}\mathrm{Z}^{\dagger}+\sigma^2\mathrm{I}\) by exploiting the low-rank structure via the SVD. First we calculate the thin SVD of \(\mathrm{Z}\) to obtain \(\mathrm{Z}=\mathrm{U}\mathrm{S}\mathrm{V}^{\dagger}\), where \(\mathrm{U}\in\mathbb{R}^{D\times N}\) has orthonormal columns, \(\mathrm{V}\in\mathbb{R}^{N\times N}\) is orthogonal and \(\mathrm{S}\in\mathbb{R}^{N\times N}\) is diagonal, with the singular values \(s_n\) on its diagonal. Then we may write \[ \begin{aligned} \mathrm{K} &= \mathrm{Z} \mathrm{Z}^{\dagger} + \sigma^2 \mathrm{I} \\ &= \mathrm{U} \mathrm{S} \mathrm{V}^{\dagger} \mathrm{V} \mathrm{S} \mathrm{U}^{\dagger} + \sigma^2 \mathrm{I} \\ &= \mathrm{U} \mathrm{S}^2 \mathrm{U}^{\dagger} + \sigma^2 \mathrm{I} \end{aligned} \] Thus the top \(N\) eigenvalues of \(\mathrm{K}\) are \(\sigma^2+s_n^2\), and the corresponding eigenvectors are the columns \(\boldsymbol{u}_n\) of \(\mathrm{U}\). The remaining eigenvalues are all \(\sigma^2\), and for their eigenvectors we may take any orthonormal basis of the orthogonal complement of the column space of \(\mathrm{U}\).
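In code, that looks something like the following sketch (again, the function name is mine):

```python
import numpy as np

def nearly_low_rank_eigh(Z, sigma2):
    """Top-N eigenpairs of K = Z @ Z^H + sigma2 * I via a thin SVD of Z.
    The remaining D - N eigenvalues are all equal to sigma2."""
    U, s, _ = np.linalg.svd(Z, full_matrices=False)   # Z = U S V^H with U of shape D x N
    return sigma2 + s**2, U                           # eigenvalues sigma2 + s_n^2, eigenvectors u_n

rng = np.random.default_rng(2)
D, N, sigma2 = 300, 5, 0.1
Z = rng.standard_normal((D, N))
lam, U = nearly_low_rank_eigh(Z, sigma2)
K = Z @ Z.T + sigma2 * np.eye(D)
assert np.allclose(np.linalg.eigvalsh(K)[::-1][:N], lam)   # top N dense eigenvalues match
assert np.allclose(K @ U, U * lam)                         # and the eigenvector relation holds
```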

Square roots

No useful tricks that I know of right now; see matrix square roots.

Cholesky

Louis Tiao, in Efficient Cholesky decomposition of low-rank updates, summarises Seeger (2004).

Cholesky decomposition of a harmonic mean

This one pops up from time to time too. Suppose \(\mathrm{C},\mathrm{Z}_1,\mathrm{Z}_2\in\mathbb{R}^{D\times N}\) and \(N\ll D\). We wish to find \[ \begin{aligned} \mathrm{C}\mathrm{C}^{\dagger} &=\left((\mathrm{Z}_1\mathrm{Z}_1^{\dagger})^{-1}+(\mathrm{Z}_2\mathrm{Z}_2^{\dagger})^{-1}\right)^{-1}\\ &={\mathrm{Z}_1\mathrm{Z}_1^{\dagger}\left(\mathrm{Z}_1\mathrm{Z}_1^{\dagger}+\mathrm{Z}_2\mathrm{Z}_2^{\dagger}\right)^{-1}\mathrm{Z}_2\mathrm{Z}_2^{\dagger}}, \end{aligned} \] where the last line follows from the Woodbury identity. Can we do that in a low-rank way?
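No answer here, but as a quick sanity check of the identity itself, here is a throwaway numpy sketch with square, full-rank factors, so that all the inverses involved exist (in the \(N\ll D\) regime the individual Gram matrices are singular, which is part of what makes the question awkward):

```python
import numpy as np

# Check (A^{-1} + B^{-1})^{-1} == A (A + B)^{-1} B for symmetric positive definite A, B.
rng = np.random.default_rng(3)
D = 6
Z1 = rng.standard_normal((D, D))
Z2 = rng.standard_normal((D, D))
A, B = Z1 @ Z1.T, Z2 @ Z2.T
lhs = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
rhs = A @ np.linalg.inv(A + B) @ B
assert np.allclose(lhs, rhs)
```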

Inverting

Specifically, we want to solve \(\mathrm{K}\mathrm{X}=\mathrm{Y}\) for \(\mathrm{X}\), with \(\mathrm{Y}\in\mathbb{R}^{D\times M}\), and, in particular, to solve it efficiently, in the sense that we

  1. exploit the low-rank structure of \(\mathrm{K}\) so that computing \(\mathrm{X}=\mathrm{K}^{-1}\mathrm{Y}\) costs less than the \(\mathcal{O}(D^3 + D^2M)\) of a dense factorise-and-solve;
  2. avoid forming the explicit inverse matrix \(\mathrm{K}^{-1}\), which requires \(\mathcal{O}(D^2)\) storage.

The workhorse here is the Woodbury identity, which has many variants, but often the basic version does what we want: Assuming both \(\mathrm{A}\) and \(\mathrm{B}\) are invertible, \[ \left(\mathrm{A}+\mathrm{C B C}^{\dagger}\right)^{-1}=\mathrm{A}^{-1}-\mathrm{A}^{-1} \mathrm{C}\left(\mathrm{B}^{-1}+\mathrm{C}^{\dagger} \mathrm{A}^{-1} \mathrm{C}\right)^{-1} \mathrm{C}^{\dagger} \mathrm{A}^{-1}. \] See Ken Tay’s intro to Woodbury identities for more about those. For now we note that they give us an efficient way of calculating matrix inverses.
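As a throwaway numerical sanity check of that identity (nothing here is library-specific):

```python
import numpy as np

# Verify the Woodbury identity on small random matrices.
rng = np.random.default_rng(4)
D, N = 7, 3
A = np.diag(rng.uniform(1.0, 2.0, D))   # any invertible A; diagonal for simplicity
B = np.diag(rng.uniform(1.0, 2.0, N))   # any invertible B
C = rng.standard_normal((D, N))
Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + C @ B @ C.T)
rhs = Ainv - Ainv @ C @ np.linalg.inv(np.linalg.inv(B) + C.T @ Ainv @ C) @ C.T @ Ainv
assert np.allclose(lhs, rhs)
```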

There is clearly a connection to the eigendecomposition; we may also think of inversion as operating on the eigenvalues.

Classic

I have no idea who invented this; it seems to be part of the folklore now, but also it was not in my undergraduate degree.

Assume \(a=1\). Applying the Woodbury identity, \[\begin{align*} \mathrm{K}^{-1}=\sigma^{-2}\mathrm{I}-\sigma^{-4} \mathrm{Z}\left(\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\dagger} \mathrm{Z}\right)^{-1} \mathrm{Z}^{\dagger}. \end{align*}\] Computing the lower Cholesky decomposition \(\mathrm{L} \mathrm{L}^{\dagger}=\left(\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\dagger} \mathrm{Z}\right)^{-1}\) at a cost of \(\mathcal{O}(N^3)\), and defining \(\mathrm{R}=\sigma^{-2}\mathrm{Z} \mathrm{L}\), we can write the inverse compactly as \[ \mathrm{K}^{-1}=\sigma^{-2}\mathrm{I}-\mathrm{R} \mathrm{R}^{\dagger}. \] Cool, so we have an explicit form for the inverse, which is still a nearly-low-rank operator. We may solve for \(\mathrm{X}\) by matrix multiplication, \[\begin{aligned} \mathrm{K}^{-1}\mathrm{Y} &=\left(\sigma^2 \mathrm{I}+\mathrm{Z} \mathrm{Z}^{\dagger}\right)^{-1}\mathrm{Y}\\ &=\left(\sigma^{-2}\mathrm{I}-\mathrm{R} \mathrm{R}^{\dagger}\right)\mathrm{Y}\\ &=\underbrace{\sigma^{-2}\mathrm{Y}}_{D \times M} - \underbrace{\mathrm{R}}_{D\times N} \underbrace{\mathrm{R}^{\dagger}\mathrm{Y}}_{N\times M} \end{aligned}\] The solution of the linear system costs \(\mathcal{O}\left(N^2 D + NDM +N^3\right)\): \(\mathcal{O}(N^2D)\) to form \(\mathrm{Z}^{\dagger}\mathrm{Z}\) and \(\mathrm{R}\), \(\mathcal{O}(N^3)\) for the Cholesky factor, and \(\mathcal{O}(NDM)\) for the products against \(\mathrm{Y}\).
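Here is that recipe as a numpy sketch (the function name is mine; a production implementation would use triangular solves rather than the explicit inverse of the capacitance matrix):

```python
import numpy as np

def nearly_low_rank_solve(Z, Y, sigma2):
    """Solve (sigma2 * I + Z Z^H) X = Y via the Woodbury form above, at
    O(N^2 D + N D M + N^3) cost instead of the dense O(D^3 + D^2 M)."""
    N = Z.shape[1]
    cap = np.eye(N) + (Z.conj().T @ Z) / sigma2   # capacitance matrix, N x N
    L = np.linalg.cholesky(np.linalg.inv(cap))    # L L^H = cap^{-1}
    R = (Z @ L) / sigma2                          # D x N factor of the inverse
    return Y / sigma2 - R @ (R.conj().T @ Y)

rng = np.random.default_rng(5)
D, N, M, sigma2 = 400, 6, 3, 0.3
Z = rng.standard_normal((D, N))
Y = rng.standard_normal((D, M))
X = nearly_low_rank_solve(Z, Y, sigma2)
assert np.allclose((sigma2 * np.eye(D) + Z @ Z.T) @ X, Y)
```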

The price of this efficient inversion is that pre-multiplying by the nearly-low-rank inverse is not as numerically stable as classic matrix solution methods; but for many purposes this price is acceptable.

What else can we say now? Well, if \(\mathrm{K}\) is positive definite, then so is \(\mathrm{K}^{-1}\), i.e. both have positive eigenvalues. Suppose the eigenvalues of \(\mathrm{K}\) in descending order are \(\lambda^{\mathrm{K}}_{j}, j\in\{1, \ldots, D\}\). Note that for \(j\in \{N+1,\dots,D\}\) we have \(\lambda^{\mathrm{K}}_j=\sigma^2\), and since \(\mathrm{Z}\mathrm{Z}^{\dagger}\) is positive semidefinite, all \(\lambda^{\mathrm{K}}_j\geq\sigma^2\). We can deduce that the nonzero singular values \(s_j\) of \(\mathrm{Z}\) have the form \(s_j=\sqrt{\lambda^{\mathrm{K}}_{j}-\sigma^{2}}\).

The eigenvalues \(\lambda^{\mathrm{K}^{-1}}_j\) of \(\mathrm{K}^{-1}\) in descending order are \(\lambda^{\mathrm{K}^{-1}}_j=(\lambda^{\mathrm{K}}_{D+1-j})^{-1}\). In particular, \(\lambda^{\mathrm{K}^{-1}}_j=\sigma^{-2}\) for \(j\in \{1, \dots, D-N\}\), and the remaining eigenvalues are sandwiched between \(\sigma^{-2}\) and \(0\), as we’d expect of a matrix inverse. It follows that the eigenvalues of \(-\mathrm{R}\mathrm{R}^{\dagger}=\mathrm{K}^{-1}-\sigma^{-2}\mathrm{I}\) all lie in \((-\sigma^{-2},0]\), so the eigenvalues of \(\mathrm{R}\mathrm{R}^{\dagger}\) lie in \([0,\sigma^{-2})\) and the singular values of \(\mathrm{R}\) lie in \([0,\sigma^{-1})\).

Note that the inverse is also a nearly-low-rank matrix, but with \(a=-1\). We can invert that guy also. Applying the Woodbury identity again, \[\begin{aligned} (\mathrm{K}^{-1})^{-1} &=\left(\sigma^{-2}\mathrm{I}-\mathrm{R} \mathrm{R}^{\dagger}\right)^{-1}\\ &=\sigma^{2}\mathrm{I}-\sigma^{4} \mathrm{R}\left(-\mathrm{I}+\sigma^{2}\mathrm{R}^{\dagger} \mathrm{R}\right)^{-1} \mathrm{R}^{\dagger}\\ &=\sigma^{2}\mathrm{I}+\sigma^{4} \mathrm{R}\left(\mathrm{I}-\sigma^{2}\mathrm{R}^{\dagger} \mathrm{R}\right)^{-1} \mathrm{R}^{\dagger}. \end{aligned}\] Here \(\mathrm{I}-\sigma^{2}\mathrm{R}^{\dagger} \mathrm{R}=\mathrm{L}^{\dagger}\mathrm{L}\) is positive definite (indeed \(\mathrm{L}^{\dagger}\mathrm{L}+\sigma^{2}\mathrm{R}^{\dagger}\mathrm{R}=\mathrm{L}^{\dagger}(\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\dagger}\mathrm{Z})\mathrm{L}=\mathrm{L}^{\dagger}(\mathrm{L}\mathrm{L}^{\dagger})^{-1}\mathrm{L}=\mathrm{I}\)), so once again we compute a Cholesky decomposition \(\mathrm{M} \mathrm{M}^{\dagger}=\left(\mathrm{I}-\sigma^{2}\mathrm{R}^{\dagger} \mathrm{R}\right)^{-1}\) at a cost of \(\mathcal{O}(N^3)\), and define \(\mathrm{Z}'=\sigma^{2}\mathrm{R} \mathrm{M}\). We write the recovered matrix as \[ (\mathrm{K}^{-1})^{-1}=\mathrm{K}=\sigma^{2}\mathrm{I}+\mathrm{Z}' \mathrm{Z}'{}^{\dagger}. \]

In principle \(\mathrm{Z}'\mathrm{Z}'^{\dagger}=\mathrm{Z}\mathrm{Z}^{\dagger}\), and in fact \(\mathrm{Z}'=\mathrm{Z}\) if both Cholesky factorisations use the usual positive-diagonal convention, since then \(\mathrm{L}\mathrm{M}\) is a lower-triangular unitary matrix and hence the identity, so that \(\mathrm{Z}'=\mathrm{Z}\mathrm{L}\mathrm{M}=\mathrm{Z}\). In practice, none of this is terribly numerically stable, so I wouldn’t depend upon recovering \(\mathrm{Z}\) exactly in floating point.
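A small numpy sketch of the round trip, which also checks the claim that \(\mathrm{I}-\sigma^{2}\mathrm{R}^{\dagger}\mathrm{R}=\mathrm{L}^{\dagger}\mathrm{L}\) (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(6)
D, N, sigma2 = 200, 4, 0.5
Z = rng.standard_normal((D, N))

# Forward pass: K^{-1} = sigma^{-2} I - R R^H with R = sigma^{-2} Z L.
L = np.linalg.cholesky(np.linalg.inv(np.eye(N) + (Z.T @ Z) / sigma2))
R = (Z @ L) / sigma2

# Backward pass: I - sigma^2 R^H R equals L^H L, hence is positive definite.
back_cap = np.eye(N) - sigma2 * (R.T @ R)
assert np.allclose(back_cap, L.T @ L)

M = np.linalg.cholesky(np.linalg.inv(back_cap))   # M M^H = (I - sigma^2 R^H R)^{-1}
Z_prime = sigma2 * (R @ M)
assert np.allclose(Z_prime @ Z_prime.T, Z @ Z.T)  # recovers the low-rank part of K
assert np.allclose(Z_prime, Z)                    # and, with this convention, Z itself
```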

Terminology: For some reason, these \(\left(-\mathrm{I}+\sigma^{2}\mathrm{R}^{\dagger} \mathrm{R}\right)\) and \(\left(\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\dagger} \mathrm{Z}\right)\) are referred to as capacitance matrices.

If such a capacitance is merely positive semidefinite then we need to do some more work to make the factorisation unique. And if it is indefinite (presumably because of numerical stability problems) then we are potentially in trouble, because the square root is not real. We can still have complex-valued solutions, if we want to go there.

General diagonals

OK, how about if we admit general diagonal matrices \(\mathrm{V}\)? Then \[\begin{align*} \mathrm{K}_{\mathrm{V}}^{-1}=\mathrm{V}^{-1}-\mathrm{V}^{-1} \mathrm{Z}\left(\mathrm{I}+\mathrm{Z}^{\dagger}\mathrm{V}^{-1} \mathrm{Z}\right)^{-1} \mathrm{Z}^{\dagger}\mathrm{V}^{-1}. \end{align*}\] Now we need the Cholesky decomposition \(\mathrm{L} \mathrm{L}^{\dagger}=\left(\mathrm{I}+\mathrm{Z}^{\dagger}\mathrm{V}^{-1}\mathrm{Z}\right)^{-1}\), and we define \(\mathrm{R}=\mathrm{V}^{-1}\mathrm{Z}\mathrm{L}\), i.e. as before but with \(\mathrm{V}^{-1}\) in place of \(\sigma^{-2}\); the new low-rank inverse is \[ \mathrm{K}_{\mathrm{V}}^{-1}=\mathrm{V}^{-1}-\mathrm{R} \mathrm{R}^{\dagger}. \]
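The corresponding solve, as a brief numpy sketch (function name mine; \(\boldsymbol{s}\) is the diagonal of \(\mathrm{V}\)):

```python
import numpy as np

def diag_plus_low_rank_solve(s, Z, Y):
    """Solve (diag(s) + Z Z^H) X = Y using the Woodbury form above."""
    Vinv_Z = Z / s[:, None]                          # V^{-1} Z
    cap = np.eye(Z.shape[1]) + Z.conj().T @ Vinv_Z   # I + Z^H V^{-1} Z
    R = Vinv_Z @ np.linalg.cholesky(np.linalg.inv(cap))
    return Y / s[:, None] - R @ (R.conj().T @ Y)

rng = np.random.default_rng(7)
D, N, M = 300, 5, 2
s = rng.uniform(0.5, 2.0, D)
Z = rng.standard_normal((D, N))
Y = rng.standard_normal((D, M))
X = diag_plus_low_rank_solve(s, Z, Y)
assert np.allclose((np.diag(s) + Z @ Z.T) @ X, Y)
```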

Centred

Suppose \(\mathrm{K}_{\mathrm{Z}\mathrm{C}}=\mathrm{Z}\mathrm{C}\mathrm{Z}^{\dagger}+\sigma^2\mathrm{I}\) where \(\mathrm{C}\) is the \(N\times N\) centering matrix. Then we can no longer use the Woodbury identity directly, because \(\mathrm{C}\) is rank deficient, but a variant Woodbury identity (Harville 1977; Henderson and Searle 1981) applies, to wit: \[\begin{aligned} \left(\mathrm{V}+\mathrm{Z} \mathrm{C} \mathrm{Z}^{\dagger}\right)^{-1}=\mathrm{V}^{-1}-\mathrm{V}^{-1} \mathrm{Z} \mathrm{C}\left(\mathrm{I}+\mathrm{Z}^{\dagger} \mathrm{V}^{-1} \mathrm{Z} \mathrm{C}\right)^{-1} \mathrm{Z}^{\dagger} \mathrm{V}^{-1} \end{aligned}\] from which \[\begin{aligned} \mathrm{K}_{\mathrm{Z}\mathrm{C}}^{-1} &= \left(\mathrm{Z}\mathrm{C}\mathrm{Z}^{\dagger}+\sigma^2\mathrm{I}\right)^{-1}\\ &=\sigma^{-2}\mathrm{I}-\sigma^{-4}\mathrm{Z} \mathrm{C}\left(\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\dagger} \mathrm{Z} \mathrm{C}\right)^{-1} \mathrm{Z}^{\dagger}. \end{aligned}\] Alternatively, since a centering matrix is a Hermitian idempotent projection, \(\mathrm{C}=\mathrm{C}^{\dagger}=\mathrm{C}^2\), we may absorb it into the factor, \(\mathrm{Z}\mathrm{C}\mathrm{Z}^{\dagger}=(\mathrm{Z}\mathrm{C})(\mathrm{Z}\mathrm{C})^{\dagger}\), and reuse the machinery above with \(\mathrm{Z}\) replaced by the centred factor \(\mathrm{Z}\mathrm{C}\). That gives an \(\mathrm{R}\) factor of the form \[\begin{aligned} \mathrm{R}_{\mathrm{C}} &:=\sigma^{-2}\mathrm{Z}\mathrm{C}\operatorname{chol}((\mathrm{I}+\sigma^{-2}\mathrm{C}\mathrm{Z}^{\dagger} \mathrm{Z} \mathrm{C})^{-1}). \end{aligned}\] If \(\mathrm{Z}\) is already centred, this reduces to the plain \(\mathrm{R}\) factor from the previous section. There are a lot of alternate Woodbury identities for alternate setups (Ameli and Shadden 2023; Harville 1976, 1977; Henderson and Searle 1981).

Products

Primal

Specifically, consider \((\mathrm{Y} \mathrm{Y}^{\dagger}+\sigma^2\mathrm{I})(\mathrm{Z} \mathrm{Z}^{\dagger}+\sigma^2\mathrm{I})\). Are nearly-low-rank products cheap?

\[ \begin{aligned} (\mathrm{Y} \mathrm{Y}^{\dagger}+\sigma^2\mathrm{I})(\mathrm{Z} \mathrm{Z}^{\dagger}+\sigma^2\mathrm{I}) &=\mathrm{Y} \mathrm{Y}^{\dagger} \mathrm{Z} \mathrm{Z}^{\dagger}+\sigma^2\mathrm{Y} \mathrm{Y}^{\dagger}+\sigma^2\mathrm{Z} \mathrm{Z}^{\dagger}+\sigma^4\mathrm{I}\\ &=\mathrm{Y} (\mathrm{Y}^{\dagger} \mathrm{Z} )\mathrm{Z}^{\dagger}+\sigma^2\mathrm{Y} \mathrm{Y}^{\dagger}+\sigma^2\mathrm{Z} \mathrm{Z}^{\dagger}+\sigma^4\mathrm{I} \end{aligned} \] which is still a sum of low-rank approximations. At this point it might be natural to consider a tensor decomposition.
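Cheap in the sense that we can apply the product to a vector using only tall-skinny matrices, as in this numpy sketch (names mine):

```python
import numpy as np

def product_matvec(Yf, Zf, sigma2, x):
    """Apply (Y Y^H + sigma2 I)(Z Z^H + sigma2 I) to x via the expansion above,
    without forming any D x D matrix."""
    Zx = Zf @ (Zf.conj().T @ x)
    Yx = Yf @ (Yf.conj().T @ x)
    return Yf @ (Yf.conj().T @ Zx) + sigma2 * Yx + sigma2 * Zx + sigma2**2 * x

rng = np.random.default_rng(8)
D, sigma2 = 250, 0.7
Yf = rng.standard_normal((D, 4))
Zf = rng.standard_normal((D, 6))
x = rng.standard_normal(D)
dense = (Yf @ Yf.T + sigma2 * np.eye(D)) @ ((Zf @ Zf.T + sigma2 * np.eye(D)) @ x)
assert np.allclose(product_matvec(Yf, Zf, sigma2, x), dense)
```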

Inverse

Suppose the low-rank inverse factors of \(\mathrm{Y}\mathrm{Y}^{\dagger}+\sigma^2\mathrm{I}\) and \(\mathrm{Z}\mathrm{Z}^{\dagger}+\sigma^2\mathrm{I}\) are, respectively, \(\mathrm{R}\) and \(\mathrm{C}\), constructed as above. Then we have

\[ \begin{aligned} &(\mathrm{Y} \mathrm{Y}^{\dagger}+\sigma^2\mathrm{I})^{-1}(\mathrm{Z} \mathrm{Z}^{\dagger}+\sigma^2\mathrm{I})^{-1}\\ &=(\sigma^{-2}\mathrm{I}-\mathrm{R} \mathrm{R}^{\dagger})(\sigma^{-2}\mathrm{I}-\mathrm{C} \mathrm{C}^{\dagger})\\ &=\sigma^{-4}\mathrm{I}-\sigma^{-2}\mathrm{R} \mathrm{R}^{\dagger}-\sigma^{-2}\mathrm{C} \mathrm{C}^{\dagger}+\mathrm{R} (\mathrm{R}^{\dagger}\mathrm{C}) \mathrm{C}^{\dagger}\\ \end{aligned} \]

Once again, cheap to evaluate, but not so obviously nice.

Distances

Frobenius

Suppose we want to measure the Frobenius distance between \(\mathrm{K}_{\mathrm{U},\sigma^2}=\mathrm{U}\mathrm{U}^{\dagger}+\sigma^{2}\mathrm{I}\) and \(\mathrm{K}_{\mathrm{R},\gamma^2}=\mathrm{R}\mathrm{R}^{\dagger}+\gamma^{2}\mathrm{I}\). We recall that we might expect things to be nice if they are exactly low rank because, e.g. \[ \begin{aligned} \|\mathrm{U}\mathrm{U}^{\dagger}\|_F^2 =\operatorname{tr}\left(\mathrm{U}\mathrm{U}^{\dagger}\mathrm{U}\mathrm{U}^{\dagger}\right) =\|\mathrm{U}^{\dagger}\mathrm{U}\|_F^2 \end{aligned} \] How does it come out as a distance between two nearly-low-rank matrices? The answer may be found without forming the full matrices. For compactness, we define \(\delta^2=\sigma^2-\gamma^2\), and we use the trick of writing \(-\mathrm{R}\mathrm{R}^{\dagger}=(i\mathrm{R})(i\mathrm{R})^{\dagger}\), where the scalar \(i\) is treated formally and is not conjugated by \({}^{\dagger}\) (for real matrices this is just the plain transpose). \[ \begin{aligned} &\|\mathrm{U}\mathrm{U}^{\dagger}+\sigma^{2}\mathrm{I}-\mathrm{R}\mathrm{R}^{\dagger}-\gamma^{2}\mathrm{I}\|_F^2\\ &=\left\|\mathrm{U}\mathrm{U}^{\dagger} -\mathrm{R}\mathrm{R}^{\dagger} + \delta^2\mathrm{I}\right\|_{F}^2\\ &=\left\|\mathrm{U}\mathrm{U}^{\dagger}+i\mathrm{R}i\mathrm{R}^{\dagger} + \delta^2\mathrm{I} \right\|_{F}^2\\ &=\left\|\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger} + \delta^2\mathrm{I} \right\|_{F}^2\\ &=\left\langle\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger} + \delta^2\mathrm{I},\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger} + \delta^2\mathrm{I} \right\rangle_{F}\\ &=\left\langle\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger},\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}\right\rangle_{F} +\left\langle\delta^2\mathrm{I}, \delta^2\mathrm{I} \right\rangle_{F}\\ &\quad+2\operatorname{Re}\left(\left\langle\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}, \delta^2\mathrm{I} \right\rangle_{F}\right)\\ &=\operatorname{Tr}\left(\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}\right) +\delta^4D\\ &\quad+2\delta^2\operatorname{Re}\operatorname{Tr}\left(\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}\right)\\ &=\operatorname{Tr}\left(\left(\mathrm{U}\mathrm{U}^{\dagger} -\mathrm{R}\mathrm{R}^{\dagger}\right)\left(\mathrm{U}\mathrm{U}^{\dagger} -\mathrm{R}\mathrm{R}^{\dagger}\right)\right) +\delta^4D\\ &\quad+2\delta^2\operatorname{Tr}\left(\mathrm{U}\mathrm{U}^{\dagger} -\mathrm{R}\mathrm{R}^{\dagger}\right)\\ &=\operatorname{Tr}\left(\mathrm{U}\mathrm{U}^{\dagger}\mathrm{U}\mathrm{U}^{\dagger}\right) -2\operatorname{Tr}\left(\mathrm{U}\mathrm{U}^{\dagger}\mathrm{R}\mathrm{R}^{\dagger}\right) + \operatorname{Tr}\left(\mathrm{R}\mathrm{R}^{\dagger}\mathrm{R}\mathrm{R}^{\dagger}\right) +\delta^4D \\ &\quad+2\delta^2\left(\operatorname{Tr}\left(\mathrm{U}\mathrm{U}^{\dagger}\right) -\operatorname{Tr}\left(\mathrm{R}\mathrm{R}^{\dagger}\right)\right)\\ &=\operatorname{Tr}\left(\mathrm{U}^{\dagger}\mathrm{U}\mathrm{U}^{\dagger}\mathrm{U}\right) -2\operatorname{Tr}\left(\mathrm{U}^{\dagger}\mathrm{R}\mathrm{R}^{\dagger}\mathrm{U}\right) + \operatorname{Tr}\left(\mathrm{R}^{\dagger}\mathrm{R}\mathrm{R}^{\dagger}\mathrm{R}\right) +\delta^4D \\ 
&\quad+2\delta^2\left(\operatorname{Tr}\left(\mathrm{U}^{\dagger}\mathrm{U}\right) -\operatorname{Tr}\left(\mathrm{R}^{\dagger}\mathrm{R}\right)\right)\\ &=\left\|\mathrm{U}^{\dagger}\mathrm{U}\right\|^2_F -2\left\|\mathrm{U}^{\dagger}\mathrm{R}\right\|^2_F + \left\|\mathrm{R}^{\dagger}\mathrm{R}\right\|^2_F +\delta^4D +2\delta^2\left(\left\|\mathrm{U}\right\|^2_F -\left\|\mathrm{R}\right\|^2_F\right) \end{aligned} \]
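That final expression involves only products of the tall-skinny factors (nothing \(D\times D\) ever appears), so we can evaluate it cheaply. A numpy sketch (function name mine):

```python
import numpy as np

def nearly_low_rank_frob_dist2(U, sigma2, R, gamma2):
    """Squared Frobenius distance between U U^H + sigma2 I and R R^H + gamma2 I,
    using only products of the tall-skinny factors."""
    D = U.shape[0]
    d2 = sigma2 - gamma2                    # delta^2 above; may be negative, which is fine
    return (np.linalg.norm(U.conj().T @ U) ** 2
            - 2 * np.linalg.norm(U.conj().T @ R) ** 2
            + np.linalg.norm(R.conj().T @ R) ** 2
            + d2 ** 2 * D
            + 2 * d2 * (np.linalg.norm(U) ** 2 - np.linalg.norm(R) ** 2))

rng = np.random.default_rng(9)
D, sigma2, gamma2 = 300, 0.4, 1.1
U = rng.standard_normal((D, 4))
R = rng.standard_normal((D, 7))
dense = np.linalg.norm((U @ U.T + sigma2 * np.eye(D)) - (R @ R.T + gamma2 * np.eye(D))) ** 2
assert np.allclose(nearly_low_rank_frob_dist2(U, sigma2, R, gamma2), dense)
```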

Stochastic approximation

Incoming

Discuss in terms of perturbation theory? (Rellich 1954; Bamieh 2022)

Bamieh (2022) in particular is compact and clear.

Tools

Some of the simplifications mentioned here are supported for PyTorch linear algebra by cornellius-gp/linear_operator.

References

Akimoto, Youhei. 2017. “Fast Eigen Decomposition for Low-Rank Matrix Approximation.” arXiv.
Ameli, Siavash, and Shawn C. Shadden. 2023. “A Singular Woodbury and Pseudo-Determinant Matrix Identities and Application to Gaussian Process Regression.” Applied Mathematics and Computation 452 (April): 128032.
Babacan, S. Derin, Martin Luessi, Rafael Molina, and Aggelos K. Katsaggelos. 2012. “Sparse Bayesian Methods for Low-Rank Matrix Estimation.” IEEE Transactions on Signal Processing 60 (8): 3964–77.
Bach, C, D. Ceglia, L. Song, and F. Duddeck. 2019. “Randomized Low-Rank Approximation Methods for Projection-Based Model Order Reduction of Large Nonlinear Dynamical Problems.” International Journal for Numerical Methods in Engineering 118 (4): 209–41.
Bach, Francis R. 2013. “Sharp Analysis of Low-Rank Kernel Matrix Approximations.” In COLT, 30:185–209.
Bamieh, Bassam. 2022. “A Tutorial on Matrix Perturbation Theory (Using Compact Matrix Notation).” arXiv.
Barbier, Jean, Nicolas Macris, and Léo Miolane. 2017. “The Layered Structure of Tensor Estimation and Its Mutual Information.” arXiv:1709.10368 [Cond-Mat, Physics:math-Ph], September.
Bauckhage, Christian. 2015. “K-Means Clustering Is Matrix Factorization.” arXiv:1512.07548 [Stat], December.
Berry, Michael W., Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. 2007. “Algorithms and Applications for Approximate Nonnegative Matrix Factorization.” Computational Statistics & Data Analysis 52 (1): 155–73.
Brand, Matthew. 2002. “Incremental Singular Value Decomposition of Uncertain Data with Missing Values.” In Computer Vision — ECCV 2002, edited by Anders Heyden, Gunnar Sparr, Mads Nielsen, and Peter Johansen, 2350:707–20. Berlin, Heidelberg: Springer Berlin Heidelberg.
———. 2006. “Fast Low-Rank Modifications of the Thin Singular Value Decomposition.” Linear Algebra and Its Applications, Special Issue on Large Scale Linear and Nonlinear Eigenvalue Problems, 415 (1): 20–30.
Chen, Yudong, and Yuejie Chi. 2018. “Harnessing Structures in Big Data via Guaranteed Low-Rank Matrix Estimation: Recent Theory and Fast Algorithms via Convex and Nonconvex Optimization.” IEEE Signal Processing Magazine 35 (4): 14–31.
Chi, Yuejie, Yue M. Lu, and Yuxin Chen. 2019. “Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview.” IEEE Transactions on Signal Processing 67 (20): 5239–69.
Cichocki, A., N. Lee, I. V. Oseledets, A.-H. Phan, Q. Zhao, and D. Mandic. 2016. “Low-Rank Tensor Networks for Dimensionality Reduction and Large-Scale Optimization Problems: Perspectives and Challenges PART 1.” arXiv:1609.00893 [Cs], September.
Dahleh, Mohammed, Munther A. Dahleh, and George Verghese. 1990. “Matrix Perturbations.” In Lectures on Dynamic Systems and Control, 20:91.
Drineas, Petros, and Michael W. Mahoney. 2005. “On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning.” Journal of Machine Learning Research 6 (December): 2153–75.
Fasi, Massimiliano, Nicholas J. Higham, and Xiaobo Liu. 2023. “Computing the Square Root of a Low-Rank Perturbation of the Scaled Identity Matrix.” SIAM Journal on Matrix Analysis and Applications 44 (1): 156–74.
Flammia, Steven T., David Gross, Yi-Kai Liu, and Jens Eisert. 2012. “Quantum Tomography via Compressed Sensing: Error Bounds, Sample Complexity, and Efficient Estimators.” New Journal of Physics 14 (9): 095022.
Ghashami, Mina, Edo Liberty, Jeff M. Phillips, and David P. Woodruff. 2015. “Frequent Directions : Simple and Deterministic Matrix Sketching.” arXiv:1501.01711 [Cs], January.
Gordon, Geoffrey J. 2002. “Generalized² Linear² Models.” In Proceedings of the 15th International Conference on Neural Information Processing Systems, 593–600. NIPS’02. Cambridge, MA, USA: MIT Press.
Gross, D. 2011. “Recovering Low-Rank Matrices From Few Coefficients in Any Basis.” IEEE Transactions on Information Theory 57 (3): 1548–66.
Gross, David, Yi-Kai Liu, Steven T. Flammia, Stephen Becker, and Jens Eisert. 2010. “Quantum State Tomography via Compressed Sensing.” Physical Review Letters 105 (15).
Hager, William W. 1989. “Updating the Inverse of a Matrix.” SIAM Review 31 (2): 221–39.
Halko, Nathan, Per-Gunnar Martinsson, and Joel A. Tropp. 2010. “Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.” arXiv.
Harbrecht, Helmut, Michael Peters, and Reinhold Schneider. 2012. “On the Low-Rank Approximation by the Pivoted Cholesky Decomposition.” Applied Numerical Mathematics, Third Chilean Workshop on Numerical Analysis of Partial Differential Equations (WONAPDE 2010), 62 (4): 428–40.
Harville, David A. 1976. “Extension of the Gauss-Markov Theorem to Include the Estimation of Random Effects.” The Annals of Statistics 4 (2): 384–95.
———. 1977. “Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems.” Journal of the American Statistical Association 72 (358): 320–38.
Hastie, Trevor, Rahul Mazumder, Jason D. Lee, and Reza Zadeh. 2015. “Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.” In Journal of Machine Learning Research, 16:3367–3402.
Henderson, H. V., and S. R. Searle. 1981. “On Deriving the Inverse of a Sum of Matrices.” SIAM Review 23 (1): 53–60.
Hoaglin, David C., and Roy E. Welsch. 1978. “The Hat Matrix in Regression and ANOVA.” The American Statistician 32 (1): 17–22.
Kala, Radoslaw, and Krzysztof Klaczyński. 1994. “Generalized Inverses of a Sum of Matrices.” Sankhyā: The Indian Journal of Statistics, Series A (1961-2002) 56 (3): 458–64.
Kannan, Ramakrishnan. 2016. “Scalable and Distributed Constrained Low Rank Approximations,” April.
Kannan, Ramakrishnan, Grey Ballard, and Haesun Park. 2016. “A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization.” In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 9:1–11. PPoPP ’16. New York, NY, USA: ACM.
Kim, H., and H. Park. 2008. “Nonnegative Matrix Factorization Based on Alternating Nonnegativity Constrained Least Squares and Active Set Method.” SIAM Journal on Matrix Analysis and Applications 30 (2): 713–30.
Kumar, N. Kishore, and Jan Shneider. 2016. “Literature Survey on Low Rank Approximation of Matrices.” arXiv:1606.06511 [Cs, Math], June.
Liberty, Edo, Franco Woolfe, Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. 2007. “Randomized Algorithms for the Low-Rank Approximation of Matrices.” Proceedings of the National Academy of Sciences 104 (51): 20167–72.
Lim, Yew Jin, and Yee Whye Teh. 2007. “Variational Bayesian Approach to Movie Rating Prediction.” In Proceedings of KDD Cup and Workshop, 7:15–21. Citeseer.
Lin, Zhouchen. 2016. “A Review on Low-Rank Models in Signal and Data Analysis.”
Liu, T., and D. Tao. 2015. “On the Performance of Manhattan Nonnegative Matrix Factorization.” IEEE Transactions on Neural Networks and Learning Systems PP (99): 1–1.
Lu, Jun. 2022. “A Rigorous Introduction to Linear Models.” arXiv.
Mahoney, Michael W. 2010. Randomized Algorithms for Matrices and Data. Vol. 3.
Martinsson, Per-Gunnar. 2016. “Randomized Methods for Matrix Computations and Analysis of High Dimensional Data.” arXiv:1607.01649 [Math], July.
Miller, Kenneth S. 1981. “On the Inverse of the Sum of Matrices.” Mathematics Magazine 54 (2): 67–72.
Minka, Thomas P. 2000. Old and new matrix algebra useful for statistics.
Nakajima, Shinichi, and Masashi Sugiyama. 2012. “Theoretical Analysis of Bayesian Matrix Factorization.” Journal of Machine Learning Research, 66.
Nakatsukasa, Yuji. 2019. “The Low-Rank Eigenvalue Problem.” arXiv.
Nowak, W., and A. Litvinenko. 2013. “Kriging and Spatial Design Accelerated by Orders of Magnitude: Combining Low-Rank Covariance Approximations with FFT-Techniques.” Mathematical Geosciences 45 (4): 411–35.
Petersen, Kaare Brandt, and Michael Syskind Pedersen. 2012. “The Matrix Cookbook.”
Rellich, F. 1954. Perturbation theory of eigenvalue problems. New York: Courant Institute of Mathematical Sciences, New York University.
Rokhlin, Vladimir, Arthur Szlam, and Mark Tygert. 2009. “A Randomized Algorithm for Principal Component Analysis.” SIAM J. Matrix Anal. Appl. 31 (3): 1100–1124.
Saad, Yousef. 2003. Iterative Methods for Sparse Linear Systems: Second Edition. 2nd ed. SIAM.
Salakhutdinov, Ruslan, and Andriy Mnih. 2008. “Bayesian Probabilistic Matrix Factorization Using Markov Chain Monte Carlo.” In Proceedings of the 25th International Conference on Machine Learning, 880–87. ICML ’08. New York, NY, USA: ACM.
Saul, Lawrence K. 2023. “A Geometrical Connection Between Sparse and Low-Rank Matrices and Its Application to Manifold Learning.” Transactions on Machine Learning Research, January.
Seeger, Matthias, ed. 2004. Low Rank Updates for the Cholesky Decomposition.
Seshadhri, C., Aneesh Sharma, Andrew Stolman, and Ashish Goel. 2020. “The Impossibility of Low-Rank Representations for Triangle-Rich Complex Networks.” Proceedings of the National Academy of Sciences 117 (11): 5631–37.
Shi, Jiarong, Xiuyun Zheng, and Wei Yang. 2017. “Survey on Probabilistic Models of Low-Rank Matrix Factorizations.” Entropy 19 (8): 424.
Srebro, Nathan, Jason D. M. Rennie, and Tommi S. Jaakkola. 2004. “Maximum-Margin Matrix Factorization.” In Advances in Neural Information Processing Systems, 17:1329–36. NIPS’04. Cambridge, MA, USA: MIT Press.
Sundin, Martin. 2016. “Bayesian Methods for Sparse and Low-Rank Matrix Problems.” PhD Thesis, KTH Royal Institute of Technology.
Tropp, Joel A., Alp Yurtsever, Madeleine Udell, and Volkan Cevher. 2016. “Randomized Single-View Algorithms for Low-Rank Matrix Approximation.” arXiv:1609.00048 [Cs, Math, Stat], August.
———. 2017. “Practical Sketching Algorithms for Low-Rank Matrix Approximation.” SIAM Journal on Matrix Analysis and Applications 38 (4): 1454–85.
Tufts, D. W., and R. Kumaresan. 1982. “Estimation of Frequencies of Multiple Sinusoids: Making Linear Prediction Perform Like Maximum Likelihood.” Proceedings of the IEEE 70 (9): 975–89.
Türkmen, Ali Caner. 2015. “A Review of Nonnegative Matrix Factorization Methods for Clustering.” arXiv:1507.03194 [Cs, Stat], July.
Udell, M., and A. Townsend. 2019. “Why Are Big Data Matrices Approximately Low Rank?” SIAM Journal on Mathematics of Data Science 1 (1): 144–60.
Wilkinson, William J., Michael Riis Andersen, Joshua D. Reiss, Dan Stowell, and Arno Solin. 2019. “End-to-End Probabilistic Inference for Nonstationary Audio Analysis.” arXiv:1901.11436 [Cs, Eess, Stat], January.
Woodruff, David P. 2014. Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends in Theoretical Computer Science 1.0. Now Publishers.
Woolfe, Franco, Edo Liberty, Vladimir Rokhlin, and Mark Tygert. 2008. “A Fast Randomized Algorithm for the Approximation of Matrices.” Applied and Computational Harmonic Analysis 25 (3): 335–66.
Xinghao Ding, Lihan He, and L. Carin. 2011. “Bayesian Robust Principal Component Analysis.” IEEE Transactions on Image Processing 20 (12): 3419–30.
Yang, Linxiao, Jun Fang, Huiping Duan, Hongbin Li, and Bing Zeng. 2018. “Fast Low-Rank Bayesian Matrix Completion with Hierarchical Gaussian Prior Models.” IEEE Transactions on Signal Processing 66 (11): 2804–17.
Yin, M., J. Gao, and Z. Lin. 2016. “Laplacian Regularized Low-Rank Representation and Its Applications.” IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (3): 504–17.
Yu, Chenhan D., William B. March, and George Biros. 2017. “An \(N \log N\) Parallel Fast Direct Solver for Kernel Matrices.” In arXiv:1701.02324 [Cs].
Yu, Hsiang-Fu, Cho-Jui Hsieh, Si Si, and Inderjit S. Dhillon. 2012. “Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems.” In IEEE International Conference of Data Mining, 765–74.
———. 2014. “Parallel Matrix Factorization for Recommender Systems.” Knowledge and Information Systems 41 (3): 793–819.
Zhang, Kai, Chuanren Liu, Jie Zhang, Hui Xiong, Eric Xing, and Jieping Ye. 2017. “Randomization or Condensation?: Linear-Cost Matrix Sketching Via Cascaded Compression Sampling.” In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 615–23. KDD ’17. New York, NY, USA: ACM.
Zhang, Xiao, Lingxiao Wang, and Quanquan Gu. 2017. “Stochastic Variance-Reduced Gradient Descent for Low-Rank Matrix Recovery from Linear Measurements.” arXiv:1701.00481 [Stat], January.
Zhou, Tianyi, and Dacheng Tao. 2011. “GoDec: Randomized Low-Rank & Sparse Matrix Decomposition in Noisy Case.” In Proceedings of the 28th International Conference on International Conference on Machine Learning, 33–40. ICML’11. Madison, WI, USA: Omnipress.
———. 2012. “Multi-Label Subspace Ensemble.” Journal of Machine Learning Research.
