Matrix norms, divergences, metrics


I write the singular value decomposition of a \(d_1\times d_2\) matrix \(\mathbf{B}\) as

\[ \mathbf{B} = \mathbf{Q}_1\boldsymbol{\Sigma}\mathbf{Q}_2^* \]

where \(\mathbf{Q}_1,\,\mathbf{Q}_2\) are unitary and \(\boldsymbol{\Sigma}\) is a rectangular diagonal matrix with non-negative entries, of respective dimensions \(d_1\times d_1,\,d_1\times d_2,\,d_2\times d_2\).

The diagonal entries of \(\boldsymbol{\Sigma}\), written \(\sigma_i(\mathbf{B})\), are the singular values of \(\mathbf{B}\).
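A quick numerical check of these shapes (a minimal numpy sketch; the matrix \(\mathbf{B}\) and its dimensions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 5, 3
B = rng.standard_normal((d1, d2))

# numpy returns Q1, the singular values, and Q2^* directly.
Q1, s, Q2h = np.linalg.svd(B, full_matrices=True)

# Rebuild the rectangular diagonal Sigma (d1 x d2).
Sigma = np.zeros((d1, d2))
Sigma[:min(d1, d2), :min(d1, d2)] = np.diag(s)

# Shapes match the dimensions above, and B is recovered.
assert Q1.shape == (d1, d1) and Sigma.shape == (d1, d2) and Q2h.shape == (d2, d2)
assert np.allclose(B, Q1 @ Sigma @ Q2h)
```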

For Hermitian matrices \(\mathbf{H}\) we may write an eigenvalue decomposition

\[ \mathbf{H} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^* \]

for unitary \(\mathbf{Q}\) and diagonal \(\boldsymbol{\Lambda}\) whose entries \(\lambda_i(\mathbf{H})\) are the eigenvalues.
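Numerically (a minimal sketch; the Hermitian `H` here is constructed from arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
H = (A + A.conj().T) / 2  # symmetrise so that H is Hermitian

# eigh exploits the Hermitian structure; the eigenvalues come back real.
lam, Q = np.linalg.eigh(H)

# H is recovered as Q diag(lambda) Q^*.
assert np.allclose(H, Q @ np.diag(lam) @ Q.conj().T)
```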

Spectral norm

πŸ—

Frobenius norm

It coincides with the vector \(\ell_2\) norm when the matrix happens to be a column vector.

We can define this in terms of the entries \(b_{jk}\) of \(\mathbf{B}\):

\[ \|\mathbf{B}\|_F^2:=\sum_{j=1}^{d_1}\sum_{k=1}^{d_2}|b_{jk}|^2 \]

Equivalently, for \(\mathbf{B}\) of any shape,

\[ \|\mathbf{B}\|_F^2=\text{tr}(\mathbf{B}\mathbf{B}^*) \]

If we have the SVD, we might instead use

\[ \|\mathbf{B}\|_F^2=\sum_{j=1}^{\min(d_1,d_2)}\sigma_{j}(\mathbf{B})^2 \]
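The three characterisations can be checked against each other (a minimal numpy sketch with an arbitrary \(\mathbf{B}\)):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 3))

entrywise = np.sum(np.abs(B) ** 2)                        # sum of |b_jk|^2
trace_form = np.trace(B @ B.conj().T)                     # tr(B B^*)
sv_form = np.sum(np.linalg.svd(B, compute_uv=False) ** 2) # sum of sigma_j^2

assert np.isclose(entrywise, trace_form) and np.isclose(trace_form, sv_form)
assert np.isclose(np.linalg.norm(B, ord="fro") ** 2, sv_form)
```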

Schatten norms

A family of norms incorporating the nuclear, Frobenius, and spectral norms as special cases.

If the singular values of \(\mathbf{B}\) are denoted by \(\sigma_i\), then the Schatten \(p\)-norm is defined by

\[ \|\mathbf{B}\|_{p}=\left(\sum_{i=1}^{\min\{d_1,\,d_2\}}\sigma_{i}^{p}(\mathbf{B})\right)^{1/p}. \]

The most familiar cases are \(p = 1, 2, \infty\). The case \(p = 2\) yields the Frobenius norm, introduced before. The case \(p = \infty\) yields the spectral norm, the matrix norm induced by the vector \(\ell_2\) norm (see above). Finally, \(p = 1\) yields the nuclear norm

\[ \|\mathbf{B}\|_{*}=\text{tr}\left(\sqrt{\mathbf{B}^{*}\mathbf{B}}\right)=\sum_{i=1}^{\min\{d_1,\,d_2\}}\sigma_{i}(\mathbf{B}) \]
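A direct implementation from the singular values (a sketch; the helper name `schatten_norm` is my own):

```python
import numpy as np

def schatten_norm(B, p):
    """Schatten p-norm from the singular values; p=np.inf gives the spectral norm."""
    s = np.linalg.svd(B, compute_uv=False)
    if np.isinf(p):
        return s.max()
    return (s ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 3))

assert np.isclose(schatten_norm(B, 1), np.linalg.norm(B, ord="nuc"))   # nuclear
assert np.isclose(schatten_norm(B, 2), np.linalg.norm(B, ord="fro"))   # Frobenius
assert np.isclose(schatten_norm(B, np.inf), np.linalg.norm(B, ord=2))  # spectral
```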

Bregman divergence

πŸ— Relation to exponential family and maximum likelihood.

Mark Reid: Meet the Bregman divergences:

If you have some abstract way of measuring the “distance” between any two points and, for any choice of distribution over points, the mean point minimises the average distance to all the others, then your distance measure must be a Bregman divergence.
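To make that concrete: the Bregman divergence generated by a differentiable convex function \(F\) is \(D_F(\mathbf{p},\mathbf{q}) := F(\mathbf{p}) - F(\mathbf{q}) - \langle\nabla F(\mathbf{q}),\,\mathbf{p}-\mathbf{q}\rangle\). A minimal numerical sketch of the mean-minimiser property (the helper names and random data are my own choices; \(F(\mathbf{x})=\|\mathbf{x}\|^2\) makes \(D_F\) the squared Euclidean distance):

```python
import numpy as np

def bregman(F, gradF, p, q):
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    return F(p) - F(q) - gradF(q) @ (p - q)

F = lambda x: x @ x      # F(x) = ||x||^2 ...
gradF = lambda x: 2 * x  # ... so D_F(p, q) = ||p - q||^2

rng = np.random.default_rng(5)
points = rng.standard_normal((100, 3))
mean = points.mean(axis=0)

def avg_div(c):
    """Average divergence from the sample points to a candidate centre c."""
    return np.mean([bregman(F, gradF, p, c) for p in points])

# As the quote asserts, the mean beats any perturbed candidate centre.
for _ in range(10):
    c = mean + 0.1 * rng.standard_normal(3)
    assert avg_div(mean) <= avg_div(c)
```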