Deep linear networks
Let’s pretend our networks are almost polynomial
2025-08-08 — 2025-08-09
Wherein the gradient-flow dynamics of depth-preserving linear nets are described, and singular-value trajectories, mode-by-mode learning, and gated mixtures approximating ReLU are rendered analytically tractable.
I want a theory that predicts which features deep nets learn, when they learn them, and why. But neural nets are messy and hard to analyse, so we need a simplification that stays tractable while still recovering the properties we care about.
Deep linear networks (DLNs) are one attempt: models that keep depth, nonconvexity, and hierarchical representation formation while remaining analytically tractable. In principle, they let me connect data geometry (singular values/vectors) to gradient-flow trajectories: which modes win first, how layers align, and why low-rank semantic structure emerges.
I haven’t been terribly convinced that they’re plausible models for the things I care about — they’re just linear functions written a weird way, right? But what about mixtures of linear functions? Gated neural networks (Li and Sompolinsky 2022; A. Saxe, Sodhani, and Lewallen 2022) are a weird type of mixture over a weird type of linear model that just about approximates a ReLU activation if we treat it right. I just saw Devon Jarvis present Jarvis et al. (2024), which pushes this quite a long way as a model for a kind of multi-modal network, so I’m now more receptive to the idea that these things could actually be useful.
A dot-point curriculum that gets me from deep linear networks to their gated (and ReLU-equivalent) extensions.
- The absolute essentials: SVD, eigendecompositions, matrix calculus, and gradient flow as the continuous-time limit of gradient descent. Note the classic non-convexity result (in weight space) and why depth changes the dynamics (Baldi and Hornik 1989; Fukumizu 1998).
- Next: the exact gradient-flow dynamics of deep linear networks from first principles, including the SVD change of variables, balanced solutions, mode decoupling, and closed-form singular-value trajectories (Andrew M. Saxe, McClelland, and Ganguli 2014). The first sketch after this list checks the two-layer case against the closed form.
- Structured datasets and “semantic development” in deep linear nets: how hierarchies emerge mode-by-mode, and why the time course is set by the singular values of Σ_yx while the fixed point also depends on Σ_x (Andrew M. Saxe, McClelland, and Ganguli 2019). I’ll reproduce the hierarchy demo and plot singular-value races; the second sketch after this list builds a toy hierarchical dataset and shows its singular values grouping by tree level.
- The “low-rank simplicity bias” and its consequences for feature learning and generalization in linear (and near-linear) regimes (Huh et al. 2023) — sounds interesting.
- The Gated Deep Linear Network (GDLN) formalism and the neural race reduction: how gating induces pathway-specific effective datasets and why the “fastest path wins” (A. Saxe, Sodhani, and Lewallen 2022). The third sketch after this list is a stripped-down version of the race.
- Compare GDLNs with globally gated deep linear networks and understand the relationship between, and the limits of, each gating scheme (Li and Sompolinsky 2022).
- After that I think I might have a chance at grokking the ReLU ↔ GDLN equivalence via Rectified Linear Networks (ReLNs) (Jarvis et al. 2024).
- Deep linear networks also seem to be used as a model system in singular learning theory, once again for tractability. I should work out how that works.
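Three quick numpy sketches to go with the items above. First, for the Saxe, McClelland, and Ganguli (2014) item: a two-layer linear net trained by full-batch gradient descent on whitened inputs, so the loss reduces to the squared Frobenius error between Σ_yx and W2·W1. This is my own toy setup rather than code from the paper; the shapes, the target singular values, and names like `S_yx`, `a0`, and `tau` are illustrative choices. From a small, balanced, aligned initialisation, each singular value of W2·W1 should track the logistic closed form a(t) = s / (1 + (s/a0 − 1)·exp(−2st/τ)), up to Euler discretisation error.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                         # input = hidden = output dimension

# Target input-output correlation matrix with three well-separated nonzero modes.
U, _ = np.linalg.qr(rng.normal(size=(d, d)))  # left singular vectors
V, _ = np.linalg.qr(rng.normal(size=(d, d)))  # right singular vectors
s_true = np.array([3.0, 2.0, 1.0] + [0.0] * (d - 3))
S_yx = U @ np.diag(s_true) @ V.T

# Small, balanced, aligned init: W2 @ W1 = a0 * U @ V.T, so every mode starts at a0.
a0 = 1e-3
W1 = np.sqrt(a0) * V.T
W2 = np.sqrt(a0) * U

tau, dt, steps = 1.0, 1e-2, 3000
sv_history = []
for _ in range(steps):
    E = S_yx - W2 @ W1                        # residual correlation
    dW1 = (dt / tau) * (W2.T @ E)             # Euler step of gradient flow on
    dW2 = (dt / tau) * (E @ W1.T)             # 0.5 * ||S_yx - W2 @ W1||_F^2
    W1 += dW1
    W2 += dW2
    sv_history.append(np.linalg.svd(W2 @ W1, compute_uv=False)[:3])

sv_history = np.array(sv_history)
ts = dt * (np.arange(steps) + 1)
for i, s in enumerate(s_true[:3]):
    predicted = s / (1 + (s / a0 - 1) * np.exp(-2 * s * ts / tau))
    gap = np.abs(sv_history[:, i] - predicted).max()
    print(f"mode {i}: s = {s:.1f}, max |simulated - closed form| = {gap:.2e}")
```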
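Second, for the semantic-development item: a toy hierarchical dataset in the spirit of the paper, but my own reconstruction rather than its actual dataset. Items sit at the leaves of a binary tree, inputs are one-hot, and every tree node contributes one output property shared by all of its descendant leaves. With one-hot inputs, Σ_yx is proportional to the property matrix, and its singular values group by tree level: coarse splits get the largest singular values and (by the logistic dynamics above) are learned first.

```python
import numpy as np

depth = 3
n_items = 2 ** depth                      # 8 items at the leaves; one-hot inputs, so X = I

# Property matrix Y: one row per tree node (root included), equal to 1 on the
# leaves below that node and 0 elsewhere; deeper nodes mark finer distinctions.
rows = []
for level in range(depth + 1):
    n_nodes = 2 ** level
    block = n_items // n_nodes
    for node in range(n_nodes):
        row = np.zeros(n_items)
        row[node * block:(node + 1) * block] = 1.0
        rows.append(row)
Y = np.array(rows)                        # shape (2^(depth+1) - 1, n_items)

# With one-hot inputs, Sigma_yx is proportional to Y, so its SVD is the SVD of Y.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)

np.set_printoptions(precision=2, suppress=True)
print("singular values:", s)              # depth 3: sqrt(15), sqrt(7), sqrt(3) x2, 1 x4
print("right singular vectors over items (one per row):")
print(Vt)                                 # Haar-like contrasts, one tree level per value;
                                          # degenerate values may mix vectors within a level
```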
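Third, a stripped-down version of the “fastest path wins” race. This is a scalar caricature in the spirit of the neural race reduction, not the GDLN formalism itself: two balanced two-layer pathways compete to explain the same singular mode of strength s, so their mode strengths a and b obey τ·da/dt = 2a(s − a − b) and τ·db/dt = 2b(s − a − b). In the paper the race is driven by pathway-specific effective datasets; here I just give one pathway a head start at initialisation, which is my own illustrative choice. Since d ln a/dt = d ln b/dt, the ratio a/b is conserved and the early lead fixes the final split of the mode.

```python
s, tau, dt, steps = 2.0, 1.0, 1e-3, 20_000
a, b = 1e-2, 1e-3                 # pathway a gets a 10x head start (illustrative choice)

for _ in range(steps):
    resid = s - a - b             # shared residual both pathways are racing to explain
    a += (dt / tau) * 2 * a * resid
    b += (dt / tau) * 2 * b * resid
    # Both strengths are multiplied by the same factor each step, so a/b is conserved.

print(f"a = {a:.3f}, b = {b:.3f}, a + b = {a + b:.3f} (target s = {s:.1f})")
# Expect a -> (10/11) * s and b -> (1/11) * s: the head-start pathway absorbs most of the mode.
```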
Confusing things:
- “linear network” ≠ “linear training dynamics” because depth makes training non-convex and highly structured (Andrew M. Saxe, McClelland, and Ganguli 2014; Baldi and Hornik 1989).
- disentanglement isn’t necessarily the default inductive bias (Locatello et al. 2019; Jarvis et al. 2024).