Singular Value Decomposition
The ML workhorse for linear algebra
2014-08-05 — 2025-05-23
Assumed audience:
People with undergrad linear algebra
An important matrix factorisation. TBC
1 What is SVD?
For any \(A\in\R^{m\times n}\) of rank \(r\), the (thin) singular‐value decomposition is
\[ A \;=\; U\,\Sigma\,V^{T}, \]
where
- \(U\in\R^{m\times r}\) has orthonormal columns (\(U^T U = I_r\)),
- \(V\in\R^{n\times r}\) has orthonormal columns (\(V^T V = I_r\)),
- \(\Sigma = \mathrm{diag}(\sigma_1,\dots,\sigma_r)\), with \(\sigma_1\ge\cdots\ge\sigma_r>0\).
You can do SO MUCH with this, but rapidly run into problems of computational cost and numerical stability if you are naive about it.
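For concreteness, a quick numpy check of the definition. Note that numpy's "thin" SVD keeps \(k = \min(m,n)\) singular values, including any zeros; truncate to the nonzero ones to recover the rank-\(r\) version above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5
A = rng.standard_normal((m, n))

# full_matrices=False gives the thin SVD: U is m×k, Vt is k×n, k = min(m, n)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert np.allclose(U @ np.diag(s) @ Vt, A)          # A = U Σ Vᵀ
assert np.allclose(U.T @ U, np.eye(s.size))         # orthonormal columns of U
assert np.allclose(Vt @ Vt.T, np.eye(s.size))       # orthonormal columns of V
assert np.all(np.diff(s) <= 0) and np.all(s >= 0)   # σ₁ ≥ ⋯ ≥ σₖ ≥ 0
```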
2 Randomized methods
There are a few; everyone name-checks Halko, Martinsson, and Tropp (2010) and Bach et al. (2019). I’ve used Halko, Martinsson, and Tropp (2010) a lot because it is implemented in PyTorch (as torch.svd_lowrank). I am curious about Allen-Zhu and Li (2017), which claims to be stable at low floating-point precision; that is very desirable in these big-network times.
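A minimal usage sketch of PyTorch’s torch.svd_lowrank, which follows the Halko, Martinsson, and Tropp approach:

```python
import torch

torch.manual_seed(0)
A = torch.randn(1000, 200)

# q is a (slightly overestimated) target rank; niter is the number of
# power iterations, which sharpens accuracy when the spectrum decays slowly.
U, S, V = torch.svd_lowrank(A, q=10, niter=2)

A_hat = U @ torch.diag(S) @ V.T              # rank-10 approximation of A
print(torch.linalg.matrix_norm(A - A_hat))   # Frobenius error
```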
3 Incremental updates / downdates
A rank-1 incremental update corresponds to appending a new column \(a\in\R^m\) to \(A\): Given current SVD \(A = U\,\Sigma\,V^T\), form
\[ A' = [\,A\;\;a\,] \in \R^{m\times (n+1)}. \]
We want \(A' = U'\,\Sigma'\,{V'}^T\) cheaply.
Project \(a\) onto \(\mathrm{span}(U)\):
\[ w = U^T a \quad(\in\R^r), \]
and compute the residual
\[ p = a - U\,w,\quad \alpha = \|p\|_2. \]
If \(\alpha>0\), set \(q = p/\alpha\) (else pick any unit \(q\perp U\)).
Form the small “core”
\[ K \;=\; \begin{bmatrix} \Sigma & w \\ 0 & \alpha \end{bmatrix} \;\in\R^{(r+1)\times(r+1)}. \]
Compute SVD of \(K\):
\[ K = \widetilde U\,\widetilde\Sigma\,\widetilde V^T, \]
where \(\widetilde U,\widetilde V\in\R^{(r+1)\times(r+1)}\).
Update full factors:
\[ U' = [\,U\;\;q\,]\;\widetilde U, \quad \Sigma' = \widetilde\Sigma, \quad V' = \begin{bmatrix} V & 0 \\ 0 & 1\end{bmatrix} \;\widetilde V. \]
Cost: the SVD of one \((r+1)\times(r+1)\) matrix, \(O(r^3)\), plus \(O((m+n)r^2)\) for the products that update \(U\) and \(V\), instead of an SVD of the full \(m\times n\) matrix.
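A minimal numpy sketch of the recipe above. The function name, the tolerance, and the random choice of \(q\) in the degenerate branch are mine:

```python
import numpy as np

def svd_append_column(U, s, V, a, tol=1e-12):
    """Thin SVD of [A  a] given the thin SVD A = U @ np.diag(s) @ V.T.

    U: (m, r), s: (r,), V: (n, r), a: (m,). Returns (U', s', V').
    """
    m, r = U.shape
    w = U.T @ a                        # w = Uᵀa: coordinates of a in span(U)
    p = a - U @ w                      # residual, orthogonal to span(U)
    alpha = np.linalg.norm(p)
    if alpha > tol:
        q = p / alpha
    else:                              # a ∈ span(U): any unit q ⟂ U works
        q = np.random.default_rng().standard_normal(m)  # assumes r < m
        q -= U @ (U.T @ q)
        q /= np.linalg.norm(q)
        alpha = 0.0
    # small (r+1)×(r+1) core K = [[Σ, w], [0, α]]
    K = np.zeros((r + 1, r + 1))
    K[:r, :r] = np.diag(s)
    K[:r, r] = w
    K[r, r] = alpha
    Uk, sk, Vkt = np.linalg.svd(K)     # the only dense SVD, and it is tiny
    U_new = np.hstack([U, q[:, None]]) @ Uk
    V_aug = np.block([[V, np.zeros((V.shape[0], 1))],
                      [np.zeros((1, r)), np.ones((1, 1))]])
    return U_new, sk, V_aug @ Vkt.T

# sanity check against a from-scratch factorisation
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
a = rng.standard_normal(6)
U2, s2, V2 = svd_append_column(U, s, Vt.T, a)
assert np.allclose(U2 @ np.diag(s2) @ V2.T, np.hstack([A, a[:, None]]))
```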
There are also downdates (removing a column), which are pretty similar.
For fancy variations, there are many papers (Brand 2006, 2002; Bunch and Nielsen 1978; Gu and Eisenstat 1995, 1993; Sarwar et al. 2002; Zhang 2022).
4 …as Frobenius minimiser
TODO.
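In the meantime, the headline fact is the Eckart–Young(–Mirsky) theorem: the rank-\(k\) truncation of the SVD is a best rank-\(k\) approximation in Frobenius norm. With \(u_i, v_i\) the columns of \(U, V\),
\[ A_k \;=\; \sum_{i=1}^{k} \sigma_i\, u_i v_i^{T} \;\in\; \operatorname*{arg\,min}_{\operatorname{rank}(B)\le k} \|A - B\|_{F}, \qquad \|A - A_k\|_{F}^{2} \;=\; \sum_{i=k+1}^{r} \sigma_i^{2}. \]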
5 …for PCA
This is nearly trivial but needs spelling out: SVD gives us PCA, among other useful things. Centre the data matrix \(X\in\R^{n\times d}\) by subtracting the column means; then the right singular vectors of \(X\) are the principal axes, and \(\sigma_i^2/(n-1)\) are the corresponding explained variances (the eigenvalues of the sample covariance).
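A numpy sketch, checking the SVD route against the eigendecomposition of the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(2)
# anisotropic Gaussian cloud: variance concentrated along the first axis
X = rng.standard_normal((500, 3)) * np.array([3.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)            # centre first; PCA is SVD of centred data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

axes = Vt                          # rows are the principal axes
variances = s**2 / (len(Xc) - 1)   # explained variances

# agrees with the eigendecomposition of the sample covariance matrix
evals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
assert np.allclose(variances, evals)

scores = Xc @ Vt.T                 # principal-component scores of each row
```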