I write the singular value decomposition of a \(d_1\times d_2\) matrix \(\mathbf{B}\) as

\[ \mathbf{B} = \mathbf{Q}_1\boldsymbol{\Sigma}\mathbf{Q}_2 \]

where \(\mathbf{Q}_1\) and \(\mathbf{Q}_2\) are unitary matrices of dimensions \(d_1\times d_1\) and \(d_2\times d_2\) respectively, and \(\boldsymbol{\Sigma}\) is a \(d_1\times d_2\) matrix with non-negative entries on its diagonal and zeros elsewhere.

The diagonal entries of \(\boldsymbol{\Sigma}\), written
\(\sigma_i(\mathbf{B})\), are the *singular values* of \(\mathbf{B}\).
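As a quick numerical sanity check, here is a minimal sketch assuming NumPy, whose `np.linalg.svd` returns the three factors in the same order as the decomposition above. The matrix sizes and the random test matrix are illustrative choices, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 4, 3
# An arbitrary complex test matrix (illustrative only).
B = rng.standard_normal((d1, d2)) + 1j * rng.standard_normal((d1, d2))

# full_matrices=True gives square unitary factors:
# Q1 is d1 x d1, Q2 is d2 x d2, and sigma holds min(d1, d2) singular values.
Q1, sigma, Q2 = np.linalg.svd(B, full_matrices=True)

# Embed the singular values in a d1 x d2 rectangular "diagonal" matrix.
Sigma = np.zeros((d1, d2))
Sigma[: len(sigma), : len(sigma)] = np.diag(sigma)

# The product of the three factors should recover B up to rounding error.
reconstruction_error = np.linalg.norm(B - Q1 @ Sigma @ Q2)
print(sigma, reconstruction_error)
```

Note that NumPy returns the singular values as a vector sorted in descending order, so the rectangular \(\boldsymbol{\Sigma}\) has to be assembled by hand.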

For a Hermitian matrix \(\mathbf{H}\) we may write an eigenvalue decomposition

\[ \mathbf{H} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^* \]

for unitary \(\mathbf{Q}\) and a real diagonal matrix \(\boldsymbol{\Lambda}\) whose entries \(\lambda_i(\mathbf{H})\) are the eigenvalues of \(\mathbf{H}\).
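The Hermitian case can be sketched numerically too, assuming NumPy's `np.linalg.eigh`. The construction of `H` by symmetrising a random matrix is only an illustrative choice. The last lines check a standard fact connecting the two decompositions: the singular values of a Hermitian matrix are the absolute values of its eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
H = (M + M.conj().T) / 2  # Hermitian by construction

# eigh is specialised to Hermitian input: real eigenvalues in ascending
# order, and a unitary matrix Q whose columns are the eigenvectors.
lam, Q = np.linalg.eigh(H)
H_rebuilt = Q @ np.diag(lam) @ Q.conj().T

# Singular values of H, for comparison with |eigenvalues|.
sing = np.linalg.svd(H, compute_uv=False)
abs_eigs_descending = np.sort(np.abs(lam))[::-1]
```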

## Spectral norm

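The matrix norm induced by the vector \(\ell_2\) norm, which equals the largest singular value:

\[ \|\mathbf{B}\|_2:=\sup_{\mathbf{x}\neq \mathbf{0}}\frac{\|\mathbf{B}\mathbf{x}\|_2}{\|\mathbf{x}\|_2}=\max_i \sigma_i(\mathbf{B}) \]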
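A minimal sketch, assuming NumPy and the usual definition of the spectral norm as the norm induced by the vector \(\ell_2\) norm, or equivalently the largest singular value. The random unit vectors give only a crude check of the induced-norm characterisation, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((5, 3))  # illustrative test matrix

# Largest singular value (svd returns them in descending order).
sigma_max = np.linalg.svd(B, compute_uv=False)[0]

# NumPy's matrix 2-norm is the spectral norm.
spectral = np.linalg.norm(B, 2)

# Crude check of the induced norm: no unit vector is stretched
# by more than sigma_max.
X = rng.standard_normal((3, 1000))
X /= np.linalg.norm(X, axis=0)  # normalise each column to unit length
max_stretch = np.linalg.norm(B @ X, axis=0).max()
```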

## Frobenius norm

This coincides with the \(\ell_2\) norm when the matrix happens to be a column vector.

We can define this in terms of the entries \(b_{jk}\) of \(\mathbf{B}\):

\[ \|\mathbf{B}\|_F^2:=\sum_{j=1}^{d_1}\sum_{k=1}^{d_2}|b_{jk}|^2 \]

Equivalently, for any \(\mathbf{B}\), square or not,

\[ \|\mathbf{B}\|_F^2=\text{tr}(\mathbf{B}\mathbf{B}^*) \]

If we have the SVD, we might instead use

\[ \|\mathbf{B}\|_F^2=\sum_{j=1}^{\min(d_1,d_2)}\sigma_{j}(\mathbf{B})^2 \]
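The three expressions for the squared Frobenius norm can be compared numerically. This is a sketch assuming NumPy; the test matrix is deliberately rectangular and complex, and the trace identity holds for it as well.

```python
import numpy as np

rng = np.random.default_rng(3)
# A rectangular complex test matrix (illustrative only).
B = rng.standard_normal((4, 6)) + 1j * rng.standard_normal((4, 6))

# Sum of squared absolute values of the entries.
fro_entries = np.sum(np.abs(B) ** 2)

# Trace of B B^*; the result is real up to rounding, so keep the real part.
fro_trace = np.trace(B @ B.conj().T).real

# Sum of squared singular values.
fro_svd = np.sum(np.linalg.svd(B, compute_uv=False) ** 2)

# NumPy's built-in Frobenius norm, squared, for reference.
fro_numpy = np.linalg.norm(B, "fro") ** 2
```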

## Schatten norms

A family of norms incorporating the nuclear, Frobenius and spectral norms.

If the singular values of an \(m\times n\) matrix \(A\) are denoted by \(\sigma_i(A)\), then for \(p\in[1,\infty)\) the Schatten \(p\)-norm is defined by

\[ \|A\|_{p}=\left(\sum _{i=1}^{\min\{m,\,n\}}\sigma _{i}^{p}(A)\right)^{1/p}. \]

The most familiar cases are \(p = 1, 2, \infty\). The case \(p = 2\) yields the Frobenius norm, introduced before. The case \(p = \infty\), taken as the limit \(p\to\infty\), yields the spectral norm, which is the matrix norm induced by the vector 2-norm (see above). Finally, \(p = 1\) yields the nuclear norm

\[ \|A\|_{*}=\operatorname {trace} \left({\sqrt {A^{*}A}}\right)=\sum _{i=1}^{\min\{m,\,n\}}\!\sigma _{i}(A) \]
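A minimal sketch of the Schatten \(p\)-norm, assuming NumPy. The helper name `schatten_norm` is hypothetical, and \(p=\infty\) is handled as the limiting case (the largest singular value). The three familiar cases can be compared against NumPy's built-in `"nuc"`, `"fro"` and `2` matrix norms.

```python
import numpy as np


def schatten_norm(A, p):
    """Schatten p-norm of A for p in [1, inf] (hypothetical helper)."""
    sigma = np.linalg.svd(A, compute_uv=False)
    if np.isinf(p):
        # Limiting case: the largest singular value.
        return sigma.max()
    return np.sum(sigma ** p) ** (1.0 / p)


rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))  # illustrative test matrix

nuclear = schatten_norm(A, 1)         # compare np.linalg.norm(A, "nuc")
frobenius = schatten_norm(A, 2)       # compare np.linalg.norm(A, "fro")
spectral = schatten_norm(A, np.inf)   # compare np.linalg.norm(A, 2)
```

Since all three are \(\ell_p\) norms of the same vector of singular values, they are ordered: spectral ≤ Frobenius ≤ nuclear.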

## Bregman divergence

TODO: Relation to exponential family and maximum likelihood.

Mark Reid: Meet the Bregman divergences:

If you have some abstract way of measuring the “distance” between any two points and, for any choice of distribution over points the mean point minimises the average distance to all the others, then your distance measure must be a Bregman divergence.
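To make this concrete, here is a sketch assuming the standard definition of the Bregman divergence generated by a differentiable convex function \(\phi\), namely \(D_\phi(x, y)=\phi(x)-\phi(y)-\langle\nabla\phi(y),\,x-y\rangle\). The code illustrates the easier, forward direction of the statement: for a Bregman divergence, the mean minimises the average divergence \(\frac{1}{N}\sum_i D_\phi(x_i, c)\) over the centre \(c\). The choice of negative entropy as \(\phi\) (which gives the generalised KL divergence), the helper names and the random data are all illustrative assumptions.

```python
import numpy as np


def bregman(phi, grad_phi, x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)


def neg_entropy(x):
    # Convex on the positive orthant; generates the generalised KL divergence.
    return np.sum(x * np.log(x))


def grad_neg_entropy(x):
    return np.log(x) + 1.0


rng = np.random.default_rng(5)
points = rng.uniform(0.1, 2.0, size=(200, 3))  # positive vectors
mean = points.mean(axis=0)


def average_divergence(c):
    """Average divergence from every point to a candidate centre c."""
    return np.mean([bregman(neg_entropy, grad_neg_entropy, x, c) for x in points])


at_mean = average_divergence(mean)

# Multiplicative perturbations keep the candidate centres positive.
perturbed = [
    average_divergence(mean * np.exp(0.1 * rng.standard_normal(3)))
    for _ in range(50)
]
```

Every perturbed centre should give an average divergence at least as large as the one at the mean, even though this divergence is not symmetric in its arguments.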