Nearly-low-rank Hermitian matrices
a.k.a. perturbations of the identity, low-rank-plus-diagonal matrices
August 5, 2014 — October 28, 2024
Assumed audience:
People with undergrad linear algebra
A decomposition that some matrices possess and which ends up being practically useful: \[\mathrm{K}= \sigma^2\mathrm{I} + a\mathrm{Z} \mathrm{Z}^{\dagger}\] where \(\mathrm{K}\in\mathbb{R}^{D\times D}\), \(\mathrm{Z}\in\mathbb{R}^{D\times N}\) with \(N\ll D\) and \(a\in\{-1, 1\}\). For compactness, we call matrices of this form nearly-low-rank, as a sibling of the actually low-rank matrices. I write \(\mathrm{Z}^{\dagger}\) for the conjugate transpose of \(\mathrm{Z}\) and use that here rather than the transpose because sometimes I want to think of \(\mathrm{Z}\) as a complex matrix, and last I checked, most stuff here generalised to that case easily. Such lowish-rank matrices are clearly Hermitian and thus arise in, e.g. covariance estimation.
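To make the practical appeal concrete, here is a minimal NumPy sketch (the class and names are mine, not standard) of keeping \(\mathrm{K}\) implicit as \((\mathrm{Z},\sigma^2,a)\) and applying it to a vector in \(\mathcal{O}(DN)\) time and memory, rather than the \(\mathcal{O}(D^2)\) of forming \(\mathrm{K}\) densely:

```python
import numpy as np

class NearlyLowRank:
    """Implicit representation of K = sigma2 * I + a * Z Z^H."""

    def __init__(self, Z, sigma2, a=1):
        self.Z, self.sigma2, self.a = Z, sigma2, a

    def matvec(self, x):
        # K x = sigma2 * x + a * Z (Z^H x): costs O(D N), never forms K.
        return self.sigma2 * x + self.a * (self.Z @ (self.Z.conj().T @ x))

# Hypothetical usage: a 1000 x 1000 operator stored as a 1000 x 10 factor.
rng = np.random.default_rng(0)
K = NearlyLowRank(rng.standard_normal((1000, 10)), sigma2=0.5)
y = K.matvec(rng.standard_normal(1000))
```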
After many arguments with reviewers I have discovered that you should not use this name, because to some people nearly low rank means that most of the eigenvalues are close to zero, and they cannot abide any other meaning of that phrase. I would parse that as them taking near to mean near in the nuclear norm; there is nothing particularly canonical about that interpretation, IMO, but I will use any bloody name if it saves me time arguing with reviewers. Or even a contorted acronym, how about that?
So, if you are one such person, who finds nearly low rank to be a frustration, imagine that I called them Perturbation Enhanced Diagonal Adjusted Numerical Tensors; this might help.
This matrix structure is a workhorse. We might get such low-rank decompositions from matrix factorisation or from some prior structure in the problem; to pick one example, ensemble filters have covariance matrices of exactly this form. In many applications, \(\sigma^2\) is chosen so that the entire matrix is positive definite; whether we need that depends upon the application, but usually we do.
Sometimes we admit a more general version, where the diagonal is allowed to be other-than-constant, so that \(\mathrm{K}=\mathrm{V} + a\mathrm{Z} \mathrm{Z}^{\dagger}\) where \(\mathrm{V} = \operatorname{diag}(\boldsymbol{s})\) for \(\boldsymbol{s}\in\mathbb{R}^D\).
1 Factorisations
1.1 Eigendecomposition
Nakatsukasa (2019) observes that, for a general product \(\mathrm{Y} \mathrm{Z}\),
The nonzero eigenvalues of \(\mathrm{Z} \mathrm{Y}\) are equal to those of \(\mathrm{Y} \mathrm{Z}\): an identity that holds as long as both products are defined (in which case both are square), even when \(\mathrm{Y}, \mathrm{Z}\) are rectangular. This fact naturally suggests an efficient algorithm for computing eigenvalues and eigenvectors of a low-rank matrix \(\mathrm{K}=\mathrm{Y} \mathrm{Z}\) with \(\mathrm{Y}, \mathrm{Z}^{\dagger} \in \mathbb{C}^{D \times N}, D \gg N\): form the small \(N \times N\) matrix \(\mathrm{Z} \mathrm{Y}\) and find its eigenvalues and eigenvectors. For nonzero eigenvalues, the eigenvectors are related by \(\mathrm{Z} \mathrm{Y} v=\lambda v \Leftrightarrow \mathrm{Y} \mathrm{Z} w=\lambda w\) with \(w=\mathrm{Y} v\)[…]
Cool. Our low-rank part has the additional special structure \(\mathrm{Y}=\mathrm{Z}^\dagger\), i.e. it is Hermitian. We use the fact that the nonzero eigenvalues of \(\mathrm{Z} \mathrm{Z}^{\dagger}\) are equal to those of \(\mathrm{Z}^{\dagger} \mathrm{Z}\). So, to compute eigenvalues and eigenvectors of the low-rank matrix \(\mathrm{X}=\mathrm{Z} \mathrm{Z}^{\dagger}\), we form the small \(N \times N\) matrix \(\mathrm{Z}^{\dagger} \mathrm{Z}\) and find its eigenvalues and eigenvectors. For nonzero eigenvalues, the eigenvectors are related by \(\mathrm{Z} \mathrm{Z}^{\dagger} v=\lambda v \Leftrightarrow \mathrm{Z}^{\dagger} \mathrm{Z} w=\lambda w\) with \(w=\mathrm{Z}^{\dagger} v\).
A classic piece of lore is cheap eigendecomposition of \(\mathrm{K}=\mathrm{Z}\mathrm{Z}^{\dagger}+\sigma^2\mathrm{I}\) by exploiting the low-rank structure and the SVD. First, we calculate the thin SVD of \(\mathrm{Z}\) to obtain \(\mathrm{Z}=\mathrm{U}\mathrm{S}\mathrm{V}^{\dagger}\), where \(\mathrm{U}\in\mathbb{R}^{D\times N}\) has orthonormal columns, \(\mathrm{V}\in\mathbb{R}^{N\times N}\) is orthogonal and \(\mathrm{S}\in\mathbb{R}^{N\times N}\) is diagonal. Then we may write \[ \begin{aligned} \mathrm{K} &= \mathrm{Z} \mathrm{Z}^{\dagger} + \sigma^2 \mathrm{I} \\ &= \mathrm{U} \mathrm{S} \mathrm{V}^{\dagger} \mathrm{V} \mathrm{S} \mathrm{U}^{\dagger} + \sigma^2 \mathrm{I} \\ &= \mathrm{U} \mathrm{S}^2 \mathrm{U}^{\dagger} + \sigma^2 \mathrm{I} \end{aligned} \] Thus the top \(N\) eigenvalues of \(\mathrm{K}\) are \(\sigma^2+s_n^2\), and the corresponding eigenvectors are the columns \(\boldsymbol{u}_n\) of \(\mathrm{U}\). The remaining \(D-N\) eigenvalues are all \(\sigma^2\), and for them we may take as eigenvectors any orthonormal basis of the orthogonal complement of the column space of \(\mathrm{U}\).
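A minimal NumPy sketch of that eigendecomposition (the function name is mine), checked against a dense eigendecomposition:

```python
import numpy as np

def nearly_low_rank_eigh(Z, sigma2):
    """Top-N eigenpairs of K = Z Z^H + sigma2 * I, without forming K.

    The remaining D - N eigenvalues all equal sigma2."""
    U, s, _ = np.linalg.svd(Z, full_matrices=False)   # thin SVD, O(D N^2)
    return sigma2 + s**2, U                           # eigenvalues, eigenvectors

# Sanity check against the dense computation.
rng = np.random.default_rng(0)
D, N, sigma2 = 200, 5, 0.1
Z = rng.standard_normal((D, N))
vals, vecs = nearly_low_rank_eigh(Z, sigma2)
dense_vals = np.linalg.eigvalsh(Z @ Z.T + sigma2 * np.eye(D))[::-1]  # descending
assert np.allclose(vals, dense_vals[:N])
```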
1.2 Square roots
One useful trick lives over at matrix square roots. Note that the operation is not necessarily closed, i.e. it does not necessarily generate a new low-rank matrix.
1.3 Cholesky
Louis Tiao, in Efficient Cholesky decomposition of low-rank updates, summarises Seeger (2004).
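I won't reproduce Seeger's algorithm here; as a flavour of the underlying machinery, here is the standard rank-one Cholesky update, which lets us grow a Cholesky factor of \(\sigma^2\mathrm{I}+\mathrm{Z}\mathrm{Z}^{\dagger}\) from the trivial factor \(\sigma\mathrm{I}\) by \(N\) successive updates at \(\mathcal{O}(D^2 N)\) total cost instead of the \(\mathcal{O}(D^3)\) of a dense factorisation (a sketch for real matrices; names are mine):

```python
import numpy as np

def chol_rank1_update(L, x):
    """Given lower-triangular L with L L^T = A, return the lower-triangular
    factor of A + x x^T. Standard O(D^2) rank-one update."""
    L, x = L.copy(), x.copy()
    D = x.shape[0]
    for k in range(D):
        r = np.hypot(L[k, k], x[k])
        c, s = r / L[k, k], x[k] / L[k, k]
        L[k, k] = r
        if k + 1 < D:
            L[k + 1:, k] = (L[k + 1:, k] + s * x[k + 1:]) / c
            x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L

# Build chol(sigma^2 I + Z Z^T) by N successive rank-one updates.
rng = np.random.default_rng(0)
D, N, sigma = 50, 3, 0.7
Z = rng.standard_normal((D, N))
L = sigma * np.eye(D)
for n in range(N):
    L = chol_rank1_update(L, Z[:, n])
assert np.allclose(L @ L.T, sigma**2 * np.eye(D) + Z @ Z.T)
```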
1.4 Cholesky decomposition of a harmonic mean
This one pops up from time to time too. Suppose \(\mathrm{C},\mathrm{Z}_1,\mathrm{Z}_2\in\mathbb{R}^{D\times N }\) and \(N\ll D\). We wish to find \[ \begin{aligned} \mathrm{C}\mathrm{C}^{\dagger} &=\left((\mathrm{Z}_1\mathrm{Z}_1^{\dagger})^{-1}+(\mathrm{Z}_2\mathrm{Z}_2^{\dagger})^{-1}\right)^{-1}\\ &={\mathrm{Z}_1\mathrm{Z}_1^{\dagger}\left(\mathrm{Z}_1\mathrm{Z}_1^{\dagger}+\mathrm{Z}_2\mathrm{Z}_2^{\dagger}\right)^{-1}\mathrm{Z}_2\mathrm{Z}_2^{\dagger}}. \end{aligned} \] where that last line is the Woodbury identity. Can we do that in a low-rank way?
2 Inverting
Specifically, solving \(\mathrm{K}\mathrm{X}=\mathrm{Y}\) for \(\mathrm{X}\), with \(\mathrm{Y}\in\mathbb{R}^{D\times M}\), and, in particular, solving it efficiently, in the sense that we
- exploit the low-rank structure of \(\mathrm{K}\) so that computing \(\mathrm{K}^{-1}\mathrm{Y}\) costs less than the naive \(\mathcal{O}(D^3 + D^2M)\) of a dense factor-and-solve.
- avoid forming the explicit inverse matrix \(\mathrm{K}^{-1}\), which requires \(\mathcal{O}(D^2)\) storage.
The workhorse tool for these is the Woodbury identity, which has many variants, but often the basic version does what we want: Assuming both \(\mathrm{A}\) and \(\mathrm{B}\) are invertible, \[ \left(\mathrm{A}+\mathrm{C B C}^{\dagger}\right)^{-1}=\mathrm{A}^{-1}-\mathrm{A}^{-1} \mathrm{C}\left(\mathrm{B}^{-1}+\mathrm{C}^{\dagger} \mathrm{A}^{-1} \mathrm{C}\right)^{-1} \mathrm{C}^{\dagger} \mathrm{A}^{-1}. \] See Ken Tay’s intro to Woodbury identities for more about those. For now, we note that they give us an efficient way of calculating matrix inverses.
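A quick numerical sanity check of the identity (all sizes and names here are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 40, 5
A = np.diag(rng.uniform(1.0, 2.0, D))   # invertible A
B = np.diag(rng.uniform(1.0, 2.0, N))   # invertible B
C = rng.standard_normal((D, N))

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + C @ B @ C.T)
rhs = Ainv - Ainv @ C @ np.linalg.inv(np.linalg.inv(B) + C.T @ Ainv @ C) @ C.T @ Ainv
assert np.allclose(lhs, rhs)
```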
There is clearly a connection to the eigendecomposition; we may also think about inversion as operating on the eigenvalues.
2.1 Classic
I have no idea who invented this; it seems to be part of the folklore now, but also it was not in my undergraduate degree.
Assume \(a=1\). Applying the Woodbury identity, \[\begin{align*} \mathrm{K}^{-1}=\sigma^{-2}\mathrm{I}-\sigma^{-4} \mathrm{Z}\left(\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\dagger} \mathrm{Z}\right)^{-1} \mathrm{Z}^{\dagger}. \end{align*}\] Computing the lower Cholesky factor \(\mathrm{L}\) such that \(\mathrm{L} \mathrm{L}^{\dagger}=\left(\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\dagger} \mathrm{Z}\right)^{-1}\) at a cost of \(\mathcal{O}(N^3)\) and defining \(\mathrm{R}=\sigma^{-2}\mathrm{Z} \mathrm{L}\), we can write the inverse compactly as \[ \mathrm{K}^{-1}=\sigma^{-2}\mathrm{I}-\mathrm{R} \mathrm{R}^{\dagger}. \] Cool, so we have an explicit form for the inverse, which is still a nearly-low-rank operator. We may solve for \(\mathrm{X}\) by matrix multiplication, \[\begin{aligned} \mathrm{K}^{-1}\mathrm{Y} &=\left(\sigma^2 \mathrm{I}+\mathrm{Z} \mathrm{Z}^{\dagger}\right)^{-1}\mathrm{Y}\\ &=\left(\sigma^{-2}\mathrm{I}-\mathrm{R} \mathrm{R}^{\dagger}\right)\mathrm{Y}\\ &=\underbrace{\sigma^{-2}\mathrm{Y}}_{D \times M} - \underbrace{\mathrm{R}}_{D\times N} \underbrace{\mathrm{R}^{\dagger}\mathrm{Y}}_{N\times M} \end{aligned}\] The solution of the linear system is thus available at a cost of \(\mathcal{O}\left(N^2 D + NDM +N^3\right)\): \(\mathcal{O}(N^2D)\) to form \(\mathrm{Z}^{\dagger}\mathrm{Z}\) and \(\mathrm{R}\), \(\mathcal{O}(N^3)\) for the Cholesky factor, and \(\mathcal{O}(NDM)\) for the final multiplications.
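Here is a minimal NumPy sketch of that solve (the function name is mine); it mirrors the text by taking the Cholesky factor of the inverted capacitance, although in practice one might prefer a triangular solve against the capacitance itself:

```python
import numpy as np

def nearly_low_rank_solve(Z, sigma2, Y):
    """Compute (sigma2 * I + Z Z^H)^{-1} Y without forming the D x D matrix."""
    D, N = Z.shape
    cap = np.eye(N) + (Z.conj().T @ Z) / sigma2       # capacitance, N x N
    L = np.linalg.cholesky(np.linalg.inv(cap))        # L L^H = cap^{-1}
    R = Z @ L / sigma2                                # K^{-1} = I/sigma2 - R R^H
    return Y / sigma2 - R @ (R.conj().T @ Y)

# Sanity check against a dense solve.
rng = np.random.default_rng(0)
D, N, M, sigma2 = 500, 8, 3, 0.5
Z, Y = rng.standard_normal((D, N)), rng.standard_normal((D, M))
X = nearly_low_rank_solve(Z, sigma2, Y)
assert np.allclose((sigma2 * np.eye(D) + Z @ Z.T) @ X, Y)
```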
The price of this efficient inversion is that pre-multiplying by the nearly-low-rank inverse is not as numerically stable as classic matrix solution methods; but for many purposes, this price is acceptable.
What else can we say now? Well, if \(\mathrm{K}\) is positive definite, then so is \(\mathrm{K}^{-1}\), i.e. both have positive eigenvalues. Suppose the eigenvalues of \(\mathrm{K}\) in descending order are \(\lambda^{\mathrm{K}}_{j}, j\in\{1, \ldots, D\}\). Note that for \(j\in \{N+1,\dots,D\}\) we have \(\lambda^{\mathrm{K}}_j=\sigma^2\), and that all \(\lambda^{\mathrm{K}}_j\geq\sigma^2\). We can deduce that the non-zero singular values \(s_j\) of \(\mathrm{Z}\) have the form \(s_j=\sqrt{\lambda^{\mathrm{K}}_{j}-\sigma^{2}}\).
The eigenvalues \(\lambda^{\mathrm{K}^{-1}}_j\) of \(\mathrm{K}^{-1}\) in descending order are \(\lambda^{\mathrm{K}^{-1}}_j=(\lambda^{\mathrm{K}}_{D+1-j})^{-1}\). Further, \(\lambda^{\mathrm{K}^{-1}}_j=\sigma^{-2}\) for \(j\in \{1, \dots, D-N\}\). The remaining eigenvalues are sandwiched between \(\sigma^{-2}\) and \(0\), as we’d expect with a matrix inverse, from which it follows that the eigenvalues of \(-\mathrm{R}\mathrm{R}^{\dagger}\) are all non-positive, lying in \((-\sigma^{-2},0]\), so the singular values of \(\mathrm{R}\) lie in \([0,\sigma^{-1})\).
Note that the inverse is also a nearly-low-rank matrix, but with \(a=-1\). We can invert that guy also. Applying the Woodbury identity again, \[\begin{aligned} (\mathrm{K}^{-1})^{-1} &=\left(\sigma^{-2}\mathrm{I}-\mathrm{R} \mathrm{R}^{\dagger}\right)^{-1}\\ &=\sigma^{2}\mathrm{I}-\sigma^{4} \mathrm{R}\left(-\mathrm{I}+\sigma^{2}\mathrm{R}^{\dagger} \mathrm{R}\right)^{-1} \mathrm{R}^{\dagger}\\ &=\sigma^{2}\mathrm{I}+\sigma^{4} \mathrm{R}\left(\mathrm{I}-\sigma^{2}\mathrm{R}^{\dagger} \mathrm{R}\right)^{-1} \mathrm{R}^{\dagger}. \end{aligned}\] Once again we compute a Cholesky decomposition, this time \(\mathrm{M} \mathrm{M}^{\dagger}=\left(\mathrm{I}-\sigma^{2}\mathrm{R}^{\dagger} \mathrm{R}\right)^{-1}\), at a cost of \(\mathcal{O}(N^3)\), and define \(\mathrm{Z}'=\sigma^{2}\mathrm{R} \mathrm{M}\). We write the recovered matrix as \[ (\mathrm{K}^{-1})^{-1}=\mathrm{K}=\sigma^{2}\mathrm{I}+\mathrm{Z}' \mathrm{Z}'{}^{\dagger}. \]
In principle, \(\mathrm{Z}'=\mathrm{Z}\) if the Cholesky decomposition is unique, which is true if \(\left(\mathrm{I}-\sigma^{2}\mathrm{R}^{\dagger} \mathrm{R}\right)\) resp. \(\left(\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\dagger} \mathrm{Z}\right)\) is positive definite. In practice, none of this is terribly numerically stable, so I wouldn’t depend upon that in computational practice.
Terminology: For some reason, these matrices \(\left(\mathrm{I}-\sigma^{2}\mathrm{R}^{\dagger} \mathrm{R}\right)\) and \(\left(\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\dagger} \mathrm{Z}\right)\) are referred to as capacitance matrices.
If such a capacitance is merely positive semidefinite, then we need to do some more work to make it unique. And if it is indefinite (presumably because of numerical stability problems), then we are potentially in trouble because the square root is not real. We can still have complex-valued solutions if we want to go there.
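With the signs as written above, a quick numerical check that the double inversion does recover the original factor (assuming the capacitance stays positive definite numerically; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, sigma2 = 30, 4, 0.9
Z = rng.standard_normal((D, N))

# Forward: K^{-1} = I/sigma2 - R R^T.
L = np.linalg.cholesky(np.linalg.inv(np.eye(N) + Z.T @ Z / sigma2))
R = Z @ L / sigma2

# Backward: K = sigma2 I + Z' Z'^T, with Z' = sigma2 R M.
M = np.linalg.cholesky(np.linalg.inv(np.eye(N) - sigma2 * R.T @ R))
Z_recovered = sigma2 * R @ M

assert np.allclose(Z_recovered, Z)
```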
2.2 General diagonals
OK, how about if we admit general diagonal matrices \(\mathrm{V}\)? Then \[\begin{align*} \mathrm{K}_{\mathrm{V}}^{-1}=\mathrm{V}^{-1}-\mathrm{V}^{-1} \mathrm{Z}\left(\mathrm{I}+\mathrm{Z}^{\dagger}\mathrm{V}^{-1} \mathrm{Z}\right)^{-1} \mathrm{Z}^{\dagger}\mathrm{V}^{-1}. \end{align*}\] Now we need the Cholesky factor \(\mathrm{L}\) with \(\mathrm{L} \mathrm{L}^{\dagger}=\left(\mathrm{I}+\mathrm{Z}^{\dagger}\mathrm{V}^{-1}\mathrm{Z}\right)^{-1}\), and we define \(\mathrm{R}=\mathrm{V}^{-1}\mathrm{Z}\mathrm{L}\) (the \(\sigma^{-2}\) of before becomes \(\mathrm{V}^{-1}\)); the new low-rank inverse is \[ \mathrm{K}_{\mathrm{V}}^{-1}=\mathrm{V}^{-1}-\mathrm{R} \mathrm{R}^{\dagger}. \]
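The same sketch as before, with the diagonal stored as a length-\(D\) vector (again, names are mine):

```python
import numpy as np

def nearly_low_rank_solve_diag(Z, v, Y):
    """Compute (diag(v) + Z Z^H)^{-1} Y; v is a length-D vector, Y is D x M."""
    N = Z.shape[1]
    Vinv_Z = Z / v[:, None]                        # V^{-1} Z
    cap = np.eye(N) + Z.conj().T @ Vinv_Z          # I + Z^H V^{-1} Z
    L = np.linalg.cholesky(np.linalg.inv(cap))
    R = Vinv_Z @ L                                 # R = V^{-1} Z L
    return Y / v[:, None] - R @ (R.conj().T @ Y)

# Sanity check against a dense solve.
rng = np.random.default_rng(0)
D, N, M = 200, 5, 2
Z, v, Y = rng.standard_normal((D, N)), rng.uniform(0.5, 2.0, D), rng.standard_normal((D, M))
assert np.allclose((np.diag(v) + Z @ Z.T) @ nearly_low_rank_solve_diag(Z, v, Y), Y)
```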
2.3 Centred
Suppose \(\mathrm{K}_{\mathrm{Z}\mathrm{C}}=\mathrm{Z}\mathrm{C}\mathrm{Z}^{\dagger}+\sigma^2\mathrm{I}\) where \(\mathrm{C}\) is a centring matrix. Then we can no longer use the Woodbury identity directly because \(\mathrm{C}\) is rank deficient, but a variant Woodbury identity (Harville 1977; Henderson and Searle 1981) applies, to wit: \[\begin{aligned} \left(\mathrm{V}+\mathrm{Z} \mathrm{C} \mathrm{Z}^{\dagger}\right)^{-1}=\mathrm{V}^{-1}-\mathrm{V}^{-1} \mathrm{Z} \mathrm{C}\left(\mathrm{I}+\mathrm{Z}^{\dagger} \mathrm{V}^{-1} \mathrm{Z} \mathrm{C}\right)^{-1} \mathrm{Z}^{\dagger} \mathrm{V}^{-1} \end{aligned}\] from which \[\begin{aligned} \mathrm{K}_{\mathrm{Z}\mathrm{C}}^{-1} &= \left(\mathrm{Z}\mathrm{C}\mathrm{Z}^{\dagger}+\sigma^2\mathrm{I}\right)^{-1}\\ &=\sigma^{-2}\mathrm{I}-\sigma^{-4}\mathrm{Z} \mathrm{C}\left(\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\dagger} \mathrm{Z} \mathrm{C}\right)^{-1} \mathrm{Z}^{\dagger}. \end{aligned} \] Alternatively, since the centring matrix is Hermitian and idempotent, \(\mathrm{C}=\mathrm{C}^{\dagger}=\mathrm{C}^2\), we have \(\mathrm{Z}\mathrm{C}\mathrm{Z}^{\dagger}=(\mathrm{Z}\mathrm{C})(\mathrm{Z}\mathrm{C})^{\dagger}\), so the classic construction goes through with the centred factor \(\mathrm{Z}\mathrm{C}\) in place of \(\mathrm{Z}\). We recover a different form for the \(\mathrm{R}\) factor, namely \[\begin{aligned} \mathrm{R}_{\mathrm{C}} &:=\sigma^{-2}\,\mathrm{Z}\mathrm{C}\operatorname{chol}((\mathrm{I}+\sigma^{-2}\mathrm{C}\mathrm{Z}^{\dagger} \mathrm{Z} \mathrm{C})^{-1}), \end{aligned}\] so that \(\mathrm{K}_{\mathrm{Z}\mathrm{C}}^{-1}=\sigma^{-2}\mathrm{I}-\mathrm{R}_{\mathrm{C}}\mathrm{R}_{\mathrm{C}}^{\dagger}\). If \(\mathrm{Z}\) is already centred, so that \(\mathrm{Z}\mathrm{C}=\mathrm{Z}\), this reduces to the classic \[\begin{aligned} \mathrm{R}_{\mathrm{C}} &=\sigma^{-2}\mathrm{Z}\operatorname{chol}((\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\dagger} \mathrm{Z} )^{-1}). \end{aligned}\] There are a lot of alternate Woodbury identities for alternate setups (Ameli and Shadden 2023; Harville 1976, 1977; Henderson and Searle 1981).
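In code, the cheapest route is arguably to centre the factor once and reuse the classic machinery; a sketch reusing the hypothetical `nearly_low_rank_solve` from above:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, M, sigma2 = 200, 6, 2, 0.4
Z, Y = rng.standard_normal((D, N)), rng.standard_normal((D, M))

# Right-multiplying by C = I_N - (1/N) 1 1^T removes the mean over the N
# columns, and Z C Z^H = (Z C)(Z C)^H because C is Hermitian and idempotent.
Zc = Z - Z.mean(axis=1, keepdims=True)       # Z C, computed in O(D N)
X = nearly_low_rank_solve(Zc, sigma2, Y)     # (Z C Z^H + sigma2 I)^{-1} Y

C = np.eye(N) - np.ones((N, N)) / N
assert np.allclose((Z @ C @ Z.T + sigma2 * np.eye(D)) @ X, Y)
```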
3 Determinant
The matrix determinant lemma tells us:
Suppose \(\mathrm{A}\) is an invertible \(n\times n\) matrix and \(\mathrm{U}, \mathrm{V}\) are \(n\times m\) matrices. Then \[ \operatorname{det}\left(\mathrm{A}+\mathrm{U V}^{\boldsymbol{\top}}\right)=\operatorname{det}\left(\mathrm{I}_{\mathrm{m}}+\mathrm{V}^{\boldsymbol{\top}} \mathrm{A}^{-1} \mathrm{U}\right) \operatorname{det}(\mathrm{A}) . \]
In the special case \(\mathrm{A}=\mathrm{I}_{\mathrm{n}}\) this is the Weinstein-Aronszajn identity. Given additionally an invertible \(m\)-by-\(m\) matrix \(\mathrm{W}\), the relationship can also be expressed as \[ \operatorname{det}\left(\mathrm{A}+\mathrm{U} \mathrm{W} \mathrm{V}^{\boldsymbol{\top}}\right)=\operatorname{det}\left(\mathrm{W}^{-1}+\mathrm{V}^{\top} \mathrm{A}^{-1} \mathrm{U}\right) \operatorname{det}(\mathrm{W}) \operatorname{det}(\mathrm{A}). \]
Clearly, \[ \operatorname{det}\left(\mathrm{V}+\mathrm{Z}\mathrm{Z}^\top\right)=\operatorname{det}\left(\mathrm{I}_{N}+\mathrm{Z}^\top \mathrm{V}^{-1} \mathrm{Z}\right) \operatorname{det}(\mathrm{V}). \]
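This makes cheap log-determinants available too, which is typically what we want in, e.g., Gaussian likelihood computations; a minimal NumPy sketch (function name mine):

```python
import numpy as np

def nearly_low_rank_logdet(Z, sigma2):
    """log det(sigma2 * I + Z Z^T) via the matrix determinant lemma."""
    D, N = Z.shape
    # det(sigma2 I_D + Z Z^T) = det(I_N + Z^T Z / sigma2) * sigma2^D
    _, logdet_cap = np.linalg.slogdet(np.eye(N) + Z.T @ Z / sigma2)
    return logdet_cap + D * np.log(sigma2)

# Sanity check against the dense log-determinant.
rng = np.random.default_rng(0)
D, N, sigma2 = 300, 6, 0.3
Z = rng.standard_normal((D, N))
assert np.isclose(nearly_low_rank_logdet(Z, sigma2),
                  np.linalg.slogdet(sigma2 * np.eye(D) + Z @ Z.T)[1])
```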
4 Products
4.1 Primal
Specifically, \((\mathrm{Y} \mathrm{Y}^{\dagger}+\sigma^2\mathrm{I})(\mathrm{Z} \mathrm{Z}^{\dagger}+\sigma^2\mathrm{I})\). Are low rank products cheap?
\[ \begin{aligned} (\mathrm{Y} \mathrm{Y}^{\dagger}+\sigma^2\mathrm{I})(\mathrm{Z} \mathrm{Z}^{\dagger}+\sigma^2\mathrm{I}) &=\mathrm{Y} \mathrm{Y}^{\dagger} \mathrm{Z} \mathrm{Z}^{\dagger}+\sigma^2\mathrm{Y} \mathrm{Y}^{\dagger}+\sigma^2\mathrm{Z} \mathrm{Z}^{\dagger}+\sigma^4\mathrm{I}\\ &=\mathrm{Y} (\mathrm{Y}^{\dagger} \mathrm{Z} )\mathrm{Z}^{\dagger}+\sigma^2\mathrm{Y} \mathrm{Y}^{\dagger}+\sigma^2\mathrm{Z} \mathrm{Z}^{\dagger}+\sigma^4\mathrm{I} \end{aligned} \] which is still a scaled identity plus a sum of low-rank terms. At this point it might be natural to consider a tensor decomposition.
4.2 Inverse
Suppose the low-rank factors of the inverses of \(\mathrm{Y}\mathrm{Y}^{\dagger}+\sigma^2\mathrm{I}\) and \(\mathrm{Z}\mathrm{Z}^{\dagger}+\sigma^2\mathrm{I}\) are, respectively, \(\mathrm{R}\) and \(\mathrm{C}\). Then we have
\[ \begin{aligned} &(\mathrm{Y} \mathrm{Y}^{\dagger}+\sigma^2\mathrm{I})^{-1}(\mathrm{Z} \mathrm{Z}^{\dagger}+\sigma^2\mathrm{I})^{-1}\\ &=(\sigma^{-2}\mathrm{I}-\mathrm{R} \mathrm{R}^{\dagger})(\sigma^{-2}\mathrm{I}-\mathrm{C} \mathrm{C}^{\dagger})\\ &=\sigma^{-4}\mathrm{I}-\sigma^{-2}\mathrm{R} \mathrm{R}^{\dagger}-\sigma^{-2}\mathrm{C} \mathrm{C}^{\dagger}+\mathrm{R} (\mathrm{R}^{\dagger}\mathrm{C}) \mathrm{C}^{\dagger}\\ \end{aligned} \]
Once again, cheap to evaluate, but not so obviously nice.
5 Distances
5.1 Frobenius
Suppose we want to measure the Frobenius distance between \(\mathrm{K}_{\mathrm{U},\sigma^2}\) and \(\mathrm{K}_{\mathrm{R},\gamma^2}\). We recall that we might expect things to be nice if they are exactly low rank because, e.g. \[ \begin{aligned} \|\mathrm{U}\mathrm{U}^{\dagger}\|_F^2 =\operatorname{tr}\left(\mathrm{U}\mathrm{U}^{\dagger}\mathrm{U}\mathrm{U}^{\dagger}\right) =\|\mathrm{U}^{\dagger}\mathrm{U}\|_F^2 \end{aligned} \] How does it come out as a distance between two nearly-low-rank matrices? The answer may be found without forming the full matrices. For compactness, we define \(\delta^2=\sigma^2-\gamma^2\). (In the \(i\)-stacking step below we treat the matrices as real, so that \((i\mathrm{R})(i\mathrm{R})^{\dagger}=-\mathrm{R}\mathrm{R}^{\dagger}\); the final identity only involves traces of Hermitian products and holds in the complex case too.) \[ \begin{aligned} &\|\mathrm{U}\mathrm{U}^{\dagger}+\sigma^{2}\mathrm{I}-\mathrm{R}\mathrm{R}^{\dagger}-\gamma^{2}\mathrm{I}\|_F^2\\ &=\left\|\mathrm{U}\mathrm{U}^{\dagger} -\mathrm{R}\mathrm{R}^{\dagger} + \delta^2\mathrm{I}\right\|_{F}^2\\ &=\left\|\mathrm{U}\mathrm{U}^{\dagger}+(i\mathrm{R})(i\mathrm{R})^{\dagger} + \delta^2\mathrm{I} \right\|_{F}^2\\ &=\left\|\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger} + \delta^2\mathrm{I} \right\|_{F}^2\\ &=\left\langle\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger} + \delta^2\mathrm{I},\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger} + \delta^2\mathrm{I} \right\rangle_{F}\\ &=\left\langle\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger},\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}\right\rangle_{F} +\left\langle\delta^2\mathrm{I}, \delta^2\mathrm{I} \right\rangle_{F}\\ &\quad+2\operatorname{Re}\left(\left\langle\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}, \delta^2\mathrm{I} \right\rangle_{F}\right)\\ &=\operatorname{Tr}\left(\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}\right) +\delta^4D\\ &\quad+2\delta^2\operatorname{Re}\operatorname{Tr}\left(\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}\right)\\ &=\operatorname{Tr}\left(\left(\mathrm{U}\mathrm{U}^{\dagger} -\mathrm{R}\mathrm{R}^{\dagger}\right)\left(\mathrm{U}\mathrm{U}^{\dagger} -\mathrm{R}\mathrm{R}^{\dagger}\right)\right) +\delta^4D\\ &\quad+2\delta^2\operatorname{Tr}\left(\mathrm{U}\mathrm{U}^{\dagger} -\mathrm{R}\mathrm{R}^{\dagger}\right)\\ &=\operatorname{Tr}\left(\mathrm{U}\mathrm{U}^{\dagger}\mathrm{U}\mathrm{U}^{\dagger}\right) -2\operatorname{Tr}\left(\mathrm{U}\mathrm{U}^{\dagger}\mathrm{R}\mathrm{R}^{\dagger}\right) + \operatorname{Tr}\left(\mathrm{R}\mathrm{R}^{\dagger}\mathrm{R}\mathrm{R}^{\dagger}\right) +\delta^4D \\ &\quad+2\delta^2\left(\operatorname{Tr}\left(\mathrm{U}\mathrm{U}^{\dagger}\right) -\operatorname{Tr}\left(\mathrm{R}\mathrm{R}^{\dagger}\right)\right)\\ &=\operatorname{Tr}\left(\mathrm{U}^{\dagger}\mathrm{U}\mathrm{U}^{\dagger}\mathrm{U}\right) -2\operatorname{Tr}\left(\mathrm{U}^{\dagger}\mathrm{R}\mathrm{R}^{\dagger}\mathrm{U}\right) + \operatorname{Tr}\left(\mathrm{R}^{\dagger}\mathrm{R}\mathrm{R}^{\dagger}\mathrm{R}\right) +\delta^4D \\ 
&\quad+2\delta^2\left(\operatorname{Tr}\left(\mathrm{U}^{\dagger}\mathrm{U}\right) -\operatorname{Tr}\left(\mathrm{R}^{\dagger}\mathrm{R}\right)\right)\\ &=\left\|\mathrm{U}^{\dagger}\mathrm{U}\right\|^2_F -2\left\|\mathrm{U}^{\dagger}\mathrm{R}\right\|^2_F + \left\|\mathrm{R}^{\dagger}\mathrm{R}\right\|^2_F +\delta^4D +2\delta^2\left(\left\|\mathrm{U}\right\|^2_F -\left\|\mathrm{R}\right\|^2_F\right) \end{aligned} \]
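A sketch verifying that final identity numerically, using only the small Gram matrices (names mine; real case):

```python
import numpy as np

def nearly_low_rank_frob_dist2(U, sigma2, R, gamma2):
    """Squared Frobenius distance between U U^T + sigma2 I and R R^T + gamma2 I."""
    D = U.shape[0]
    d2 = sigma2 - gamma2                       # delta^2, may be negative
    UtU, RtR, UtR = U.T @ U, R.T @ R, U.T @ R  # all small Gram matrices
    return (np.sum(UtU**2) - 2 * np.sum(UtR**2) + np.sum(RtR**2)
            + d2**2 * D + 2 * d2 * (np.sum(U**2) - np.sum(R**2)))

# Check against the dense distance; the two factors may have different ranks.
rng = np.random.default_rng(0)
D, sigma2, gamma2 = 150, 0.7, 0.3
U, R = rng.standard_normal((D, 4)), rng.standard_normal((D, 6))
dense = np.linalg.norm((U @ U.T + sigma2 * np.eye(D))
                       - (R @ R.T + gamma2 * np.eye(D)), 'fro')**2
assert np.isclose(nearly_low_rank_frob_dist2(U, sigma2, R, gamma2), dense)
```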
6 Stochastic approximation
TBC
7 Tools
Some of the simplifications mentioned here are supported in cornellius-gp/linear_operator, a structured linear-algebra library for PyTorch.
8 Incoming
Discuss in terms of perturbation theory? (Rellich 1954; Bamieh 2022)
Bamieh (2022) in particular is compact and clear.