Deep linear networks
Let’s pretend our networks are almost polynomial
2025-08-08 — 2025-08-09
I want a theory that predicts which features deep nets learn, when they learn them, and why. But neural nets are messy and hard to analyse, so we need some way of simplifying them that still preserves the properties we care about.
Deep linear networks (DLNs) are one attempt at that: models that keep depth, non-convexity, and hierarchical representation formation while remaining analytically tractable. In principle, they let me connect data geometry (singular values/vectors) to gradient-flow trajectories: which modes win first, how layers align, and why low-rank, semantic structure emerges.
I haven’t been terribly convinced that they are plausible models for the things I care about; they are just linear functions written in a weird way, right? But what about mixtures of linear functions? Gated neural networks (Li and Sompolinsky 2022; A. Saxe, Sodhani, and Lewallen 2022) are a weird type of mixture over a weird type of linear model that juuuuuuust about approximates a ReLU activation function if you treat it right. I just saw Devon Jarvis present Jarvis et al. (2024), which pushes this quite a long way as a model for a kind of multi-modal network, so I am now more receptive to the idea that these things could actually be useful.
A dot-point curriculum that gets me from deep linear networks to their gated (and ReLU-equivalent) extensions.
The absolute essentials: SVD, eigendecompositions, matrix calculus, gradient flow/continuous-time limit. Note the classic non-convexity result (in weight space) and why depth changes dynamics (Baldi and Hornik 1989; Fukumizu 1998).
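For concreteness, and to fix notation for the sketches below: write Σ_x for the input covariance, Σ_yx for the input-output correlation, and W_2 W_1 for the end-to-end map of a single-hidden-layer linear network. Gradient flow on the squared loss then reads (this is the starting point of Andrew M. Saxe, McClelland, and Ganguli 2014)

$$
\tau \frac{\mathrm{d}W_1}{\mathrm{d}t} = W_2^\top\left(\Sigma_{yx} - W_2 W_1 \Sigma_x\right),
\qquad
\tau \frac{\mathrm{d}W_2}{\mathrm{d}t} = \left(\Sigma_{yx} - W_2 W_1 \Sigma_x\right) W_1^\top.
$$

Each layer’s update is multiplied by the other layer’s weights, which is exactly why depth changes the dynamics even though the end-to-end map stays linear in the input.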
Next: the exact gradient-flow dynamics of deep linear networks from first principles, including the SVD change of variables, balanced solutions, mode decoupling, and closed-form singular-value trajectories (Andrew M. Saxe, McClelland, and Ganguli 2014).
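Before going further I want a numerical handle on that claim. A minimal sketch (mine, not the paper’s code), assuming whitened inputs (Σ_x = I) and a small, balanced initialisation aligned with the SVD of Σ_yx, which is the regime where the closed form is exact: each mode strength should follow the sigmoidal trajectory u(t) = s·e^{2st/τ} / (e^{2st/τ} − 1 + s/u_0).

```python
# Minimal sketch (not the paper's code): Euler-discretised gradient flow for a
# two-layer linear network, checked against the closed-form sigmoidal mode
# trajectories. Assumes whitened inputs (Sigma_x = I) and a small, balanced,
# SVD-aligned initialisation, where the closed form is exact.
import numpy as np

rng = np.random.default_rng(0)
n = 8                       # input = hidden = output width, for simplicity
tau = 100.0                 # 1 / learning rate
eps = 1e-3                  # initial strength of every mode

# Target correlation Sigma_yx with well-separated singular values.
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
s = np.array([3.0, 2.0, 1.0, 0.5, 0.1, 0.1, 0.1, 0.1])
Sigma_yx = U @ np.diag(s) @ V.T

# Balanced, aligned initialisation: W2 @ W1 = eps * U @ V.T.
W1 = np.sqrt(eps) * V.T
W2 = np.sqrt(eps) * U

def analytic_u(t):
    """Closed-form mode strengths solving tau du/dt = 2 u (s - u), u(0) = eps."""
    e = np.exp(2.0 * s * t / tau)
    return s * e / (e - 1.0 + s / eps)

for t in range(1, 401):
    err = Sigma_yx - W2 @ W1           # error in the end-to-end map
    dW1, dW2 = W2.T @ err, err @ W1.T
    W1 += dW1 / tau                    # Euler step of the gradient flow
    W2 += dW2 / tau
    if t in (100, 200, 400):
        u_sim = np.linalg.svd(W2 @ W1, compute_uv=False)
        # The two rows should agree up to Euler discretisation error:
        # big modes rise first (the "singular-value race"), small ones lag.
        print(f"t={t:3d}  simulated {np.round(u_sim[:5], 2)}")
        print(f"       analytic  {np.round(analytic_u(t)[:5], 2)}")
```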
Structured datasets and “semantic development” in deep linear nets: how hierarchies emerge mode-by-mode, why the time course is set by the singular values of Σ_yx, and why the fixed point is the least-squares map Σ_yx Σ_x⁻¹ (Andrew M. Saxe, McClelland, and Ganguli 2019). I’ll reproduce the hierarchy demo and plot singular-value races.
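To see where the mode-by-mode hierarchy comes from before any training happens, here is a toy of my own (not the paper’s exact dataset): put the items at the leaves of a balanced binary tree and attach one feature to every tree node, “on” for all items under that node. The SVD of the resulting input-output correlation then orders its modes from the coarsest split to the finest.

```python
# Toy hierarchical dataset: 8 items at the leaves of a depth-3 binary tree,
# one binary feature per tree node (a feature is "on" for all items under
# that node). With one-hot item inputs, the input-output correlation is
# proportional to this feature matrix Y.
import numpy as np

depth = 3
n_items = 2 ** depth

rows = []
for level in range(depth + 1):
    n_nodes = 2 ** level
    block = n_items // n_nodes            # leaves under each node at this level
    for node in range(n_nodes):
        row = np.zeros(n_items)
        row[node * block:(node + 1) * block] = 1.0
        rows.append(row)
Y = np.array(rows)                        # shape (15 nodes, 8 items)

U, sv, Vt = np.linalg.svd(Y, full_matrices=False)
print("singular values:", np.round(sv, 2))    # coarse splits come first
for a in range(4):
    # Right singular vectors split the items coarse-to-fine: mode 0 is the
    # shared "everything" component, mode 1 the top-level split, then the
    # level-2 splits (degenerate modes may come out as mixtures within a level).
    print(f"mode {a}:", np.round(Vt[a], 2))
```

The singular values come out strictly decreasing with tree depth (≈ 3.87, 2.65, 1.73, 1.73, 1, …), so, plugged into the trajectories above, coarse semantic distinctions are acquired before fine ones.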
The “low-rank simplicity bias” and its consequences for feature learning and generalization in linear (and near-linear) regimes (Huh et al. 2023) sounds interesting.
Gated Deep Linear Network (GDLN) formalism and the neural race reduction: how gating induces pathway-specific effective datasets and why “fastest path wins” (A. Saxe, Sodhani, and Lewallen 2022).
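The race is easiest for me to believe in a stripped-down form. A toy of my own, abstracting the gating away entirely: two redundant balanced pathways sum into the same output and compete to explain a single mode of strength s, so each pathway’s strength obeys τ du_i/dt = 2 u_i (s − u_1 − u_2). Growth rate is proportional to current strength, so whichever pathway is ahead compounds its lead and soaks up most of the mode before the shared error is driven to zero. As I understand it, in the GDLN setting the advantage comes from a pathway’s effective dataset; here I just hand one pathway a larger initial scale.

```python
# Toy "race": two redundant balanced pathways compete for one mode of
# strength s. Both see the same shared error (s - u1 - u2); each grows at a
# rate proportional to its own current strength, so head starts compound.
import numpy as np

s, tau, dt = 1.0, 100.0, 1.0
u = np.array([1e-2, 1e-3])        # pathway strengths; pathway 0 starts 10x ahead

for _ in range(5000):
    err = s - u.sum()             # shared error both pathways are driven by
    u = u + dt * (2.0 / tau) * u * err

# The multiplicative update is identical for both pathways, so the 10x ratio
# is preserved: the early leader ends up explaining ~10/11 of the mode.
print("final pathway strengths:", np.round(u, 3))   # ~ [0.909, 0.091]
```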
Compare GDLNs with globally-gated deep linear networks and understand the relationship/limits of each gating scheme (Li and Sompolinsky 2022).
After that I think I might have a chance at grokking the ReLU↔︎GDLN equivalence via Rectified Linear Networks (ReLNs) (Jarvis et al. 2024).
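The basic observation is easy to check numerically, and worth internalising before the ReLN machinery. A sketch of my own that shows only the piecewise-linear half of the story, not the construction in Jarvis et al. (2024): on any fixed input, a ReLU network computes the same thing as a linear network with data-dependent 0/1 diagonal gates inserted between the layers.

```python
# On a fixed input, a one-hidden-layer ReLU network equals a gated *linear*
# network whose gate is the 0/1 pattern of active hidden units.
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(4, 16))
x = rng.normal(size=8)

# Ordinary ReLU forward pass.
y_relu = W2 @ np.maximum(W1 @ x, 0.0)

# Same computation written as W2 @ D(x) @ W1 @ x with a diagonal gate D(x).
gate = (W1 @ x > 0).astype(float)        # depends on the input, not the loss
y_gated = W2 @ np.diag(gate) @ W1 @ x

print("ReLU == gated linear:", np.allclose(y_relu, y_gated))   # True
```

The part I still need to understand is how the ReLN construction handles the fact that, in a real ReLU net, the gate pattern depends on the weights themselves rather than being externally specified.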
Deep linear networks also seem to be used as a model in singular learning theory, once again for tractability. I should work out how that works.
Confusing things:
- “linear network” ≠ “linear training dynamics” because depth makes training non-convex and highly structured (Andrew M. Saxe, McClelland, and Ganguli 2014; Baldi and Hornik 1989).
- disentanglement is not necessarily the default inductive bias (Locatello et al. 2019; Jarvis et al. 2024).