Matrix norms, divergences, metrics

Matrix norms! Ways of measuring the “size” of a matrix, or, more typically for me, the “distance” between matrices, which is the size of one minus the other. Useful as optimisation objectives, or as metrics for comparing matrices.

I write the singular value decomposition of an \(m\times n\) matrix \(\mathrm{B}\) \[ \mathrm{B} = \mathrm{U}\boldsymbol{\Sigma}\mathrm{V}^T \] into unitary matrices \(\mathrm{U},\, \mathrm{V}\) and a rectangular diagonal matrix \(\boldsymbol{\Sigma}\) with non-negative entries, of respective dimensions \(m\times m,\,n\times n,\,m\times n\).

The diagonal entries of \(\boldsymbol{\Sigma}\), written \(\sigma_i(\mathrm{B})\), are the singular values of \(\mathrm{B}\); the entries of \(\mathrm{B}\) itself are written \(b_{jk}\).

Frobenius norm

Coincides with the \(\ell_2\) norm when the matrix happens to be a column vector, and in fact can easily be described as the \(\ell_2\) norm of the singular values.

We can define the Frobenius norm in several equivalent ways, \[ \begin{aligned} \|\mathrm{B}\|_F^2 &=\sum_{j=1}^{m}\sum_{k=1}^{n}|b_{jk}|^2\\ &=\operatorname{tr}\left(\mathrm{B}\mathrm{B}^{T}\right)\\ &=\operatorname{tr}\left(\mathrm{B}^T\mathrm{B}\right)\\ &=\sum_{j=1}^{\min(m,n)}\sigma_{j}(\mathrm{B})^2\\ &=\langle\mathrm{B},\mathrm{B}\rangle_F \end{aligned} \]
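
A quick numerical sanity check of those equivalences (the random matrix here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 3))

# Four equivalent computations of the squared Frobenius norm
entrywise = np.sum(np.abs(B) ** 2)                           # sum of |b_jk|^2
trace_form = np.trace(B @ B.T)                               # equals trace(B.T @ B)
sv_form = np.sum(np.linalg.svd(B, compute_uv=False) ** 2)    # sum of squared singular values
builtin = np.linalg.norm(B, "fro") ** 2

assert np.allclose([entrywise, trace_form, sv_form], builtin)
```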

This one seems to be mostly useful because it is tractable, especially in the squared form as written here. There are many matrix calculus results that use it, because differentiating traces is so convenient: it usually produces low-dimensional quadratic forms if you play it right.

It is useful to define the Frobenius norm in terms of the Frobenius inner product that we slyly introduced above: \[ \begin{aligned} \langle\mathrm{A}, \mathrm{B}\rangle_{\mathrm{F}}=\sum_{i, j} \overline{A_{i j}} B_{i j}=\operatorname{Tr}\left(\overline{\mathrm{A}^T} \mathrm{B}\right) \equiv \operatorname{Tr}\left(\mathrm{A}^{\dagger} \mathrm{B}\right). \end{aligned} \]

Handy property for partitioned matrices: \[ \begin{aligned} \left\|\left[\begin{array}{c|c} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \hline \mathbf{A}_{21} & \mathbf{A}_{22} \end{array}\right]\right\|_F^2 &=\left\|\mathbf{A}_{11} \right\|_F^2 + \left\|\mathbf{A}_{12}\right\|_F^2 + \left\|\mathbf{A}_{21}\right\|_F^2 + \left\|\mathbf{A}_{22}\right\|_F^2 \end{aligned} \]
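
That partition identity is easy to verify numerically (block sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A11, A12 = rng.standard_normal((2, 2)), rng.standard_normal((2, 3))
A21, A22 = rng.standard_normal((3, 2)), rng.standard_normal((3, 3))
A = np.block([[A11, A12], [A21, A22]])

# Squared Frobenius norm of the partitioned matrix is the sum over blocks
lhs = np.linalg.norm(A, "fro") ** 2
rhs = sum(np.linalg.norm(M, "fro") ** 2 for M in (A11, A12, A21, A22))
assert np.allclose(lhs, rhs)
```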

Nuclear norm

A.k.a. trace norm, Schatten 1-norm. The sum of the singular values:

\[ \begin{aligned} \|\mathrm{B}\|_* &=\operatorname{tr}\left(\sqrt{\mathrm{B}^T\mathrm{B}}\right)\\ &=\sum_{j=1}^{\min(m,n)}\sigma_{j}(\mathrm{B}) \end{aligned} \]
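
In numpy we can compute this either directly from the singular values or via the matrix square root of the Gram matrix, as in the trace form above:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 3))

sv = np.linalg.svd(B, compute_uv=False)
nuclear = sv.sum()

# Same thing via trace(sqrt(B^T B)): take the matrix square root through
# the eigendecomposition of the symmetric PSD Gram matrix.
w = np.linalg.eigvalsh(B.T @ B)
trace_form = np.sum(np.sqrt(np.clip(w, 0.0, None)))

assert np.allclose(nuclear, trace_form)
assert np.allclose(nuclear, np.linalg.norm(B, "nuc"))  # numpy's built-in
```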

Schatten norms

Generalising nuclear and Frobenius norms. The Schatten p-norm is defined \[ \|\mathrm{B}\|_{p}=\left(\sum _{i=1}^{\min\{m,\,n\}}\sigma _{i}^{p}(\mathrm{B})\right)^{1/p}. \]

The case \(p = 2\) yields the Frobenius norm. The case \(p =\infty\) yields the spectral norm. Finally, \(p = 1\) yields the nuclear norm, also known as the Ky Fan \(n\)-norm (the Ky Fan \(k\)-norm is the sum of the \(k\) largest singular values; the nuclear norm is the case \(k=\min(m,n)\)).
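
A minimal implementation of the Schatten p-norm, checked against the three named special cases:

```python
import numpy as np

def schatten_norm(B, p):
    """Schatten p-norm: the l_p norm of the vector of singular values."""
    sv = np.linalg.svd(B, compute_uv=False)
    if np.isinf(p):
        return sv.max()                      # spectral norm
    return (np.sum(sv ** p)) ** (1.0 / p)

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 4))

assert np.allclose(schatten_norm(B, 1), np.linalg.norm(B, "nuc"))   # nuclear
assert np.allclose(schatten_norm(B, 2), np.linalg.norm(B, "fro"))   # Frobenius
assert np.allclose(schatten_norm(B, np.inf), np.linalg.norm(B, 2))  # spectral
```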

Operator norm

The absolutely classic one, defined in terms of norms on two other spaces, and using the matrix as an operator mapping between them. Define two norms, one on \(\mathbb{R}^m\) and one on \(\mathbb{R}^n\). These norms induce an operator norm for matrices \(\mathrm{B}: \mathbb{R}^n \to \mathbb{R}^m\), \[ \|\mathrm{B}\|_{\text{op}}=\sup _{\|v\|\leq 1}\|\mathrm{B} v\|=\sup _{v \neq 0} \frac{\|\mathrm{B} v\|}{\|v\|}, \] where \(v \in \mathbb{R}^n\). Note that the operator norm depends on the underlying norms. When the norm on both spaces is \(\ell_p\) for some \(1\leq p \leq \infty\), i.e. \(\|v\|_p=\left(\sum_i |v_i|^p\right)^{1 / p}\), we conventionally write \(\|\mathrm{B}\|_p\). The definition of the operator norm gives rise to the following useful inequality, which was the workhorse of my functional analysis classes for proving convergence and thus equivalence of norms: \[ \|\mathrm{B} v\| \leq\|\mathrm{B}\|_{\text{op}}\|v\| \] for any \(v \in \mathbb{R}^n\).
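
Conveniently, numpy's matrix norms with `ord` in \(\{1, 2, \infty\}\) are exactly the \(\ell_p\)-induced operator norms (max column sum, largest singular value, max row sum), so the workhorse inequality is easy to check:

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 3))
v = rng.standard_normal(3)

# ||B v||_p <= ||B||_p * ||v||_p for the lp-induced operator norms
for p in (1, 2, np.inf):
    lhs = np.linalg.norm(B @ v, p)
    rhs = np.linalg.norm(B, p) * np.linalg.norm(v, p)
    assert lhs <= rhs + 1e-12
```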

Spectral norm

The operator norm induced by taking \(\ell_2\) on both spaces, i.e. \(p=2\): the largest singular value.

Useful relation to the Frobenius norm: \[\|\mathrm{B}\|_2 \leqslant\|\mathrm{B}\|_F \leqslant \sqrt{\operatorname{rank}(\mathrm{B})}\,\|\mathrm{B}\|_2 \leqslant \sqrt{n}\,\|\mathrm{B}\|_2.\]
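
Checking the chain numerically (the \(\sqrt{\operatorname{rank}}\) factor is the tight one; \(\operatorname{rank}(\mathrm{B}) \leq n\) gives the looser \(\sqrt{n}\) version):

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 6))

spec = np.linalg.norm(B, 2)        # largest singular value
frob = np.linalg.norm(B, "fro")
r = np.linalg.matrix_rank(B)

assert spec <= frob + 1e-12
assert frob <= np.sqrt(r) * spec + 1e-12
```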

Bregman divergence

🏗 Relation to exponential family and maximum likelihood.

Mark Reid: Meet the Bregman divergences:

If you have some abstract way of measuring the “distance” between any two points and, for any choice of distribution over points the mean point minimises the average distance to all the others, then your distance measure must be a Bregman divergence.
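
The easiest concrete instance is the Bregman divergence of \(F(x)=\|x\|^2\), which works out to squared Euclidean distance; a sketch of the mean-minimiser property for that case:

```python
import numpy as np

def bregman_sq(p, q):
    # Bregman divergence of F(x) = ||x||^2:  F(p) - F(q) - <grad F(q), p - q>,
    # which simplifies to the squared Euclidean distance ||p - q||^2.
    return np.sum(p**2) - np.sum(q**2) - 2 * q @ (p - q)

rng = np.random.default_rng(6)
points = rng.standard_normal((100, 3))
mean = points.mean(axis=0)

# The mean minimises the average divergence (in the second argument)
# to all the points; no perturbed candidate does better.
avg_at_mean = np.mean([bregman_sq(p, mean) for p in points])
for _ in range(10):
    other = mean + 0.1 * rng.standard_normal(3)
    assert avg_at_mean <= np.mean([bregman_sq(p, other) for p in points])
```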



The inequality table

from Petersen and Pedersen (2012):

E. H. Rasmussen has, in as yet unpublished material, derived and collected the following inequalities. They are collected in a table as below, assuming \(\mathrm{B}\) is an \(m \times n\) matrix and \(d=\operatorname{rank}(\mathrm{B})\),


which are to be read as, e.g.

\[ \|\mathrm{B}\|_2 \leq \sqrt{m} \cdot\|\mathrm{B}\|_{\infty} \]
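
That particular entry is quick to check numerically (here \(\|\mathrm{B}\|_{\infty}\) is the max absolute row sum, as induced by \(\ell_\infty\)):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 5, 3
B = rng.standard_normal((m, n))

# ||B||_2 <= sqrt(m) * ||B||_inf
assert np.linalg.norm(B, 2) <= np.sqrt(m) * np.linalg.norm(B, np.inf) + 1e-12
```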

Sorry there are some weird line breaks in that table; not sure how to fix that.

Ken Tay notes

Mazumder, Hastie, and Tibshirani (2010) has a very nice little lemma (Lemma 6 in the paper) that links the nuclear norm of a matrix (i.e. the sum of its singular values) to the solution of an optimization problem involving Frobenius norms. Here is the lemma: Lemma: For any matrix \(Z \in \mathbb{R}^{m \times n}\), we have \[ \|Z\|_*=\min _{U, V: Z=U V^T} \frac{1}{2}\left(\|U\|_F^2+\|V\|_F^2\right) . \] If \(\operatorname{rank}(Z)=k \leq \min (m, n)\), then the minimum above is attained at a factor decomposition \(Z=U_{m \times k} V_{n \times k}^T\).
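
The minimiser in the lemma can be written down explicitly from the SVD, splitting the singular values evenly between the two factors; a numerical check:

```python
import numpy as np

rng = np.random.default_rng(8)
Z = rng.standard_normal((6, 4))

U_, s, Vt = np.linalg.svd(Z, full_matrices=False)
# The minimising factorisation Z = U V^T absorbs sqrt(Sigma) into each factor
U = U_ @ np.diag(np.sqrt(s))
V = Vt.T @ np.diag(np.sqrt(s))

assert np.allclose(U @ V.T, Z)
objective = 0.5 * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(V, "fro") ** 2)
assert np.allclose(objective, s.sum())  # equals the nuclear norm of Z
```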



Chen, Yudong, and Yuejie Chi. 2018. “Harnessing Structures in Big Data via Guaranteed Low-Rank Matrix Estimation: Recent Theory and Fast Algorithms via Convex and Nonconvex Optimization.” IEEE Signal Processing Magazine 35 (4): 14–31.
Dokmanić, Ivan, and Rémi Gribonval. 2017. “Beyond Moore-Penrose Part II: The Sparse Pseudoinverse.” arXiv:1706.08701 [Cs, Math], June.
Mazumder, Rahul, Trevor Hastie, and Robert Tibshirani. 2010. “Spectral Regularization Algorithms for Learning Large Incomplete Matrices.” Journal of Machine Learning Research 11 (80): 2287–2322.
Mercer, A. McD. 2000. “Bounds for A–G, A–H, G–H, and a Family of Inequalities of Ky Fan’s Type, Using a General Method.” Journal of Mathematical Analysis and Applications 243 (1): 163–73.
Minka, Thomas P. 2000. “Old and New Matrix Algebra Useful for Statistics.”
Moakher, Maher, and Philipp G. Batchelor. 2006. “Symmetric Positive-Definite Matrices: From Geometry to Applications and Visualization.” In Visualization and Processing of Tensor Fields, edited by Joachim Weickert and Hans Hagen, 285–98. Berlin, Heidelberg: Springer Berlin Heidelberg.
Omladič, Matjaž, and Peter Šemrl. 1990. “On the Distance Between Normal Matrices.” Proceedings of the American Mathematical Society 110 (3): 591–96.
Petersen, Kaare Brandt, and Michael Syskind Pedersen. 2012. “The Matrix Cookbook.”
Sharma, Rajesh. 2008. “Some More Inequalities for Arithmetic Mean, Harmonic Mean and Variance.” Journal of Mathematical Inequalities, no. 1: 109–14.
