In which I think about parameterisations and implementations of finite-dimensional energy-preserving operators, a.k.a. orthonormal matrices. A type of tool in the linear feedback process library, closely related to stability in linear dynamical systems, since every orthonormal matrix is the forward operator of an energy-preserving system, which is an edge case for certain natural types of stability. Also important in random low-dimensional projections.

Uses include maintaining stable gradients in recurrent neural networks (Arjovsky, Shah, and Bengio 2016; Jing et al. 2017; Mhammedi et al. 2017) and efficient invertible normalising flows (van den Berg et al. 2018; Hasenclever, Tomczak, and Welling 2017). Also, parameterising stable Multi-Input-Multi-Output (MIMO) delay networks in signal processing. Probably other stuff too.

Terminology: Some writers refer to orthogonal matrices (but I prefer that to mean matrices where the columns are not necessarily 2-norm 1), and some refer to unitary matrices, which implies that the matrix is over the complex field instead of the reals but is basically the same from my perspective.

We also might want to consider the implied manifolds upon which these objects live, the Stiefel manifold. Formally, the Stiefel manifold $V_{k, m}$ is the space of $k$ frames in the $m$ -dimensional real Euclidean space $R^{m},$ represented by the set of $m \times k$ matrices $M$ such that $M^{'} M = I_{k},$ where $I_{k}$ is the $k \times k$ identity matrix. Usually my purposes are served here by $k = m$ . There are some interesting cases in low dimensional projections served by $k < m,$ including $k = 1.$

Finding an orthonormal matrix is equivalent to choosing a finite orthonormal basis, so any way we can parameterise such a basis gives us an orthonormal matrix.

NB the normalisation implies that the basis for an $n \times n$ matrix has at most $n (n - 1)$ free parameters.

TODO: discuss rectangular and square orthogonal matrices.

1 Take the QR decomposition

HT Russell Tsuchida for pointing out that the $Q$ matrix in the QR decomposition, $M = Q R$ by construction gives me an orthonormal matrix from any square matrix. Likewise with the $U, V$ matrices in the $M = U Σ V^{*}$ SVD. This construction is overparameterised, with $n^{2}$ free parameters.

The construction of the QR decomposition Householder reflections is, Wikipedia reckons, $O (n^{3})$ multiplications for an $n \times n$ matrix.

2 Other decompositions?

We can get something which looks similar via the Lanczos algorithm, which handles warm starts, finding $A = Q T Q^{⊤}$ for orthonormal $Q$ ; although the $T$ matrix is tri-diagonal which is not quite what we want.

Question: do the spectral radius upper-bounds of NO-BEARS (Lee et al. 2019; Zhu et al. 2020), give us a pointer towards another method for finding such matrices? (HT Dario Draca for mentioning this.) I think that gets us something “sub”-orthonormal in general, since it will upper-bound the determinant. Or something.

3 Iterative normalising

Have a nearly orthonormal matrix? van den Berg et al. (2018) gives a contraction towards a nearby orthonormal matrix: $Q^{(k + 1)} = Q^{(k)} (I + \frac{1}{2} (I - Q^{(k) ⊤} Q^{(k)})) .$ This reputedly converges if ${‖ Q^{(0) ⊤} Q^{(0)} - I ‖}_{2} < 1.$ They attribute this to Björck and Bowie (1971) and Kovarik (1970), wherein it is derived from the Newton iteration for solving $Q^{- 1} -$ The iterations in this method look like $(n^2). An $O (n)$ option would be nice, but is intuitively not possible. This one is differentiable, however.

4 Perturbing an existing orthonormal matrix

Unitary transforms map unitary matrices to unitary matrices. We can even start from the identity matrix and perturb it to traverse the space of unitary matrices.

4.1 Householder reflections

We can apply successive reflections about hyperplanes, the so-called Householder reflections, to an orthonormal matrix to construct a new one. For a unit vector $v$ the associated Householder reflection is $H (v) = I - 2 v v^{*} .$ NB $det H = - 1$ so we need to apply an even number of Householder reflections to preserve orthonormality.

4.2 Givens rotation

One obvious method for constructing unitary matrices is composing Givens rotations, which are atomic rotations about 2 axes.

A Givens rotation is represented by a matrix of the form $G (i, j, θ) = [\begin{matrix} 1 & \dots & 0 & \dots & 0 & \dots & 0 \\ ⋮ & ⋱ & ⋮ & ⋮ & ⋮ \\ 0 & \dots & c & \dots & - s & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\ 0 & \dots & s & \dots & c & \dots & 0 \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & \dots & 0 & \dots & 0 & \dots & 1 \end{matrix}],$ where $c = \cos θ$ and $s = \sin θ$ appear at the intersections of the ith and jth rows and columns. The product $G (i, j, θ) x$ represents a $θ$ -radian counterclockwise rotation of the vector x in the $(i, j)$ plane.

5 Cayley map

The Cayley map maps the skew-symmetric matrices to the orthogonal matrices of positive determinant, and parameterising skew-symmetric matrices is easy; just take the upper triangular component of some matrix and flip/negate it. This still requires a matrix inversion in general, AFAICS.

6 Exponential map

The exponential map (Golinski et al., 2019). Given a skew-symmetric matrix A, i.e. a $D \times D$ matrix such that $A^{⊤} = - A$ , the matrix exponential $Q = \exp A$ is always an orthogonal matrix with determinant 1. Moreover, any orthogonal matrix with determinant 1 can be written this way. However, computing the matrix exponential takes in general $O (D^{3})$ time, so this parameterisation is only suitable for small-dimensional data.

7 Parametric sub-families

Citing MATLAB, Nick Higham gives the following two parametric families of orthonormal matrices. These are clearly far from covering the whole space of orthonormal matrices.

$q_{i j} = \frac{2}{\sqrt{2 n + 1}} \sin (\frac{2 i j π}{2 n + 1})$

$q_{i j} = \sqrt{\frac{2}{n}} \cos (\frac{(i - 1 / 2) (j - 1 / 2) π}{n})$

8 Newton-Schulz iteration

Noted at SGD at scale, in Deriving Muon (Bernstein and Newhouse 2024b)

Muon: An optimizer for hidden layers in neural networks | Keller Jordan blog

Some papers using it here:

(Bernstein et al. 2020; Bernstein and Newhouse 2024b, 2024a; HighamMatrix?; Large et al. 2024; Yang, Simon, and Bernstein 2023)

The modula.systems Newton-Schulz docs are best for now I think

9 Structured

Orthogonal convolutions? TBD

10 Random distributions over

I wonder what the distribution of orthonormal decompositions matrices is for some, say, matrix with independent standard Gaussian entries? Nick Higham has the answer, in his compact introduction to random orthonormal matrices. A uniform, rotation-invariant distribution is given by the Haar measure over the group of orthogonal matrices. He also gives the construction for drawing them by random Householder reflections derived from random standard normal vectors. See random rotations.

11 Hurwitz matrix

A related concept. Hurwitz matrices define asymptotically stable systems of ODEs, which is not the same as conserving the energy of a vector. Also they pack the transfer function polynomial in a weird way.

Hurwitz matrix

12 References

Anderson, Olkin, and Underhill. 1987. “Generation of Random Orthogonal Matrices.” SIAM Journal on Scientific and Statistical Computing.

Arjovsky, Shah, and Bengio. 2016. “Unitary Evolution Recurrent Neural Networks.” In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48. ICML’16.

Bernstein, and Newhouse. 2024a. “Old Optimizer, New Norm: An Anthology.”

———. 2024b. “Modular Duality in Deep Learning.”

Bernstein, Vahdat, Yue, et al. 2020. “On the Distance Between Two Neural Networks and the Stability of Learning.”

Björck, and Bowie. 1971. “An Iterative Algorithm for Computing the Best Estimate of an Orthogonal Matrix.” SIAM Journal on Numerical Analysis.

Chou, Rauhut, and Ward. 2023. “Robust Implicit Regularization via Weight Normalization.”

De Sena, Haciihabiboglu, Cvetkovic, et al. 2015. “Efficient Synthesis of Room Acoustics via Scattering Delay Networks.” IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Edelman, and Rao. 2005. “Random Matrix Theory.” Acta Numerica.

Hasenclever, Tomczak, and Welling. 2017. “Variational Inference with Orthogonal Normalizing Flows.”

Hendeković. 1974. “On Parametrization of Orthogonal and Unitary Matrices with Respect to Their Use in the Description of Molecules.” Chemical Physics Letters.

Jarlskog. 2005. “A Recursive Parametrization of Unitary Matrices.” Journal of Mathematical Physics.

Jing, Shen, Dubcek, et al. 2017. “Tunable Efficient Unitary Neural Networks (EUNN) and Their Application to RNNs.” In PMLR.

Kovarik. 1970. “Some Iterative Methods for Improving Orthonormality.” SIAM Journal on Numerical Analysis.

Large, Liu, Huh, et al. 2024. “Scalable Optimization in the Modular Norm.”

Lee, Danieletto, Miotto, et al. 2019. “Scaling Structural Learning with NO-BEARS to Infer Causal Transcriptome Networks.” In PACIFIC SYMPOSIUM ON BIOCOMPUTING 2020.

Menzer, and Faller. 2010. “Unitary Matrix Design for Diffuse Jot Reverberators.”

Mezzadri. 2007. “How to Generate Random Matrices from the Classical Compact Groups.”

Mhammedi, Hellicar, Rahman, et al. 2017. “Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections.” In PMLR.

Papamakarios, Nalisnick, Rezende, et al. 2021. “Normalizing Flows for Probabilistic Modeling and Inference.” Journal of Machine Learning Research.

Regalia, and Sanjit. 1989. “Kronecker Products, Unitary Matrices and Signal Processing Applications.” SIAM Review.

Schroeder. 1961. “Improved Quasi-Stereophony and ‘Colorless’ Artificial Reverberation.” The Journal of the Acoustical Society of America.

Schroeder, and Logan. 1961. “‘Colorless’ Artificial Reverberation.” Audio, IRE Transactions on.

Tilma, and Sudarshan. 2002. “Generalized Euler Angle Paramterization for SU(N).” Journal of Physics A: Mathematical and General.

Valimaki, and Laakso. 2012. “Fractional Delay Filters-Design and Applications.” In Nonuniform Sampling: Theory and Practice.

van den Berg, Hasenclever, Tomczak, et al. 2018. “Sylvester Normalizing Flows for Variational Inference.” In UAI18.

Yang, Simon, and Bernstein. 2023. “A Spectral Condition for Feature Learning.”

Zhu, Pfadler, Wu, et al. 2020. “Efficient and Scalable Structure Learning for Bayesian Networks: Algorithms and Applications.”