Deep linear networks
Let’s pretend our networks are almost polynomial
2025-08-08 — 2025-08-09
I want a theory that predicts which features deep nets learn, when they learn them, and why. But neural nets are messy and hard to analyse, so we need some way of simplifying them that still preserves the properties we care about.
Deep linear networks (DLNs) are one attempt at that: models that keep depth, non-convexity, and hierarchical representation formation while remaining analytically tractable. In principle, they let me connect data geometry (singular values/vectors) to gradient-flow trajectories: which modes win first, how layers align, and why low-rank, semantic structure emerges.
I haven’t been terribly convinced that they are plausible models for the things I care about; they are just linear functions written in a weird way, right? But what about mixtures of linear functions? Gated neural networks (Li and Sompolinsky 2022; A. Saxe, Sodhani, and Lewallen 2022) are a weird type of mixture over a weird type of linear model that juuuuuuust about approximates a ReLU activation function if you treat it right. I just saw Devon Jarvis present Jarvis et al. (2024), which pushes this quite a long way as a model for a kind of multi-modal network, so I am now more receptive to the idea that these things could actually be useful.
A dot-point curriculum that gets me from deep linear networks to their gated (and ReLU-equivalent) extensions.
The absolute essentials: SVD, eigendecompositions, matrix calculus, gradient flow/continuous-time limit. Note the classic non-convexity result (in weight space) and why depth changes dynamics (Baldi and Hornik 1989; Fukumizu 1998).
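For concreteness, and to fix notation for the sketches below: write Σ_x for the input covariance, Σ_yx for the input-output correlation, and W_2 W_1 for the end-to-end map of a single-hidden-layer linear network. Gradient flow on the squared loss then reads (this is the starting point of Andrew M. Saxe, McClelland, and Ganguli 2014)

$$
\tau \frac{\mathrm{d}W_1}{\mathrm{d}t} = W_2^\top\left(\Sigma_{yx} - W_2 W_1 \Sigma_x\right),
\qquad
\tau \frac{\mathrm{d}W_2}{\mathrm{d}t} = \left(\Sigma_{yx} - W_2 W_1 \Sigma_x\right) W_1^\top.
$$

Each layer’s update is multiplied by the other layer’s weights, which is exactly why depth changes the dynamics even though the end-to-end map stays linear in the input.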
Next: the exact gradient-flow dynamics of deep linear networks from first principles, including the SVD change of variables, balanced solutions, mode decoupling, and closed-form singular-value trajectories (Andrew M. Saxe, McClelland, and Ganguli 2014).
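Before going further I want a numerical handle on that claim. A minimal sketch (mine, not the paper’s code), assuming whitened inputs (Σ_x = I) and a small, balanced initialisation aligned with the SVD of Σ_yx, which is the regime where the closed form is exact: each mode strength should follow the sigmoidal trajectory u(t) = s·e^{2st/τ} / (e^{2st/τ} − 1 + s/u_0).

```python
# Minimal sketch (not the paper's code): Euler-discretised gradient flow for a
# two-layer linear network, checked against the closed-form sigmoidal mode
# trajectories. Assumes whitened inputs (Sigma_x = I) and a small, balanced,
# SVD-aligned initialisation, where the closed form is exact.
import numpy as np

rng = np.random.default_rng(0)
n = 8                       # input = hidden = output width, for simplicity
tau = 100.0                 # 1 / learning rate
eps = 1e-3                  # initial strength of every mode

# Target correlation Sigma_yx with well-separated singular values.
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
s = np.array([3.0, 2.0, 1.0, 0.5, 0.1, 0.1, 0.1, 0.1])
Sigma_yx = U @ np.diag(s) @ V.T

# Balanced, aligned initialisation: W2 @ W1 = eps * U @ V.T.
W1 = np.sqrt(eps) * V.T
W2 = np.sqrt(eps) * U

def analytic_u(t):
    """Closed-form mode strengths solving tau du/dt = 2 u (s - u), u(0) = eps."""
    e = np.exp(2.0 * s * t / tau)
    return s * e / (e - 1.0 + s / eps)

for t in range(1, 401):
    err = Sigma_yx - W2 @ W1           # error in the end-to-end map
    dW1, dW2 = W2.T @ err, err @ W1.T
    W1 += dW1 / tau                    # Euler step of the gradient flow
    W2 += dW2 / tau
    if t in (100, 200, 400):
        u_sim = np.linalg.svd(W2 @ W1, compute_uv=False)
        # The two rows should agree up to Euler discretisation error:
        # big modes rise first (the "singular-value race"), small ones lag.
        print(f"t={t:3d}  simulated {np.round(u_sim[:5], 2)}")
        print(f"       analytic  {np.round(analytic_u(t)[:5], 2)}")
```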
Structured datasets and “semantic development” in deep linear nets: how hierarchies emerge mode-by-mode, why the time course is set by the singular values of Σ_yx, and why the fixed point is the least-squares map Σ_yx Σ_x⁻¹ (Andrew M. Saxe, McClelland, and Ganguli 2019). I’ll reproduce the hierarchy demo and plot singular-value races.
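To see where the mode-by-mode hierarchy comes from before any training happens, here is a toy of my own (not the paper’s exact dataset): put the items at the leaves of a balanced binary tree and attach one feature to every tree node, “on” for all items under that node. The SVD of the resulting input-output correlation then orders its modes from the coarsest split to the finest.

```python
# Toy hierarchical dataset: 8 items at the leaves of a depth-3 binary tree,
# one binary feature per tree node (a feature is "on" for all items under
# that node). With one-hot item inputs, the input-output correlation is
# proportional to this feature matrix Y.
import numpy as np

depth = 3
n_items = 2 ** depth

rows = []
for level in range(depth + 1):
    n_nodes = 2 ** level
    block = n_items // n_nodes            # leaves under each node at this level
    for node in range(n_nodes):
        row = np.zeros(n_items)
        row[node * block:(node + 1) * block] = 1.0
        rows.append(row)
Y = np.array(rows)                        # shape (15 nodes, 8 items)

U, sv, Vt = np.linalg.svd(Y, full_matrices=False)
print("singular values:", np.round(sv, 2))    # coarse splits come first
for a in range(4):
    # Right singular vectors split the items coarse-to-fine: mode 0 is the
    # shared "everything" component, mode 1 the top-level split, then the
    # level-2 splits (degenerate modes may come out as mixtures within a level).
    print(f"mode {a}:", np.round(Vt[a], 2))
```

The singular values come out strictly decreasing with tree depth (≈ 3.87, 2.65, 1.73, 1.73, 1, …), so, plugged into the trajectories above, coarse semantic distinctions are acquired before fine ones.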
The “low-rank simplicity bias” and its consequences for feature learning and generalization in linear (and near-linear) regimes (Huh et al. 2023) sounds interesting.
Gated Deep Linear Network (GDLN) formalism and the neural race reduction: how gating induces pathway-specific effective datasets and why “fastest path wins” (A. Saxe, Sodhani, and Lewallen 2022).
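The race is easiest for me to believe in a stripped-down form. A toy of my own, abstracting the gating away entirely: two redundant balanced pathways sum into the same output and compete to explain a single mode of strength s, so each pathway’s strength obeys τ du_i/dt = 2 u_i (s − u_1 − u_2). Growth rate is proportional to current strength, so whichever pathway is ahead compounds its lead and soaks up most of the mode before the shared error is driven to zero. As I understand it, in the GDLN setting the advantage comes from a pathway’s effective dataset; here I just hand one pathway a larger initial scale.

```python
# Toy "race": two redundant balanced pathways compete for one mode of
# strength s. Both see the same shared error (s - u1 - u2); each grows at a
# rate proportional to its own current strength, so head starts compound.
import numpy as np

s, tau, dt = 1.0, 100.0, 1.0
u = np.array([1e-2, 1e-3])        # pathway strengths; pathway 0 starts 10x ahead

for _ in range(5000):
    err = s - u.sum()             # shared error both pathways are driven by
    u = u + dt * (2.0 / tau) * u * err

# The multiplicative update is identical for both pathways, so the 10x ratio
# is preserved: the early leader ends up explaining ~10/11 of the mode.
print("final pathway strengths:", np.round(u, 3))   # ~ [0.909, 0.091]
```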
Compare GDLNs with globally-gated deep linear networks and understand the relationship/limits of each gating scheme (Li and Sompolinsky 2022).
After that I think I might have a chance at grokking the ReLU↔︎GDLN equivalence via Rectified Linear Networks (ReLNs) (Jarvis et al. 2024).
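The basic observation is easy to check numerically, and worth internalising before the ReLN machinery. A sketch of my own that shows only the piecewise-linear half of the story, not the construction in Jarvis et al. (2024): on any fixed input, a ReLU network computes the same thing as a linear network with data-dependent 0/1 diagonal gates inserted between the layers.

```python
# On a fixed input, a one-hidden-layer ReLU network equals a gated *linear*
# network whose gate is the 0/1 pattern of active hidden units.
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(4, 16))
x = rng.normal(size=8)

# Ordinary ReLU forward pass.
y_relu = W2 @ np.maximum(W1 @ x, 0.0)

# Same computation written as W2 @ D(x) @ W1 @ x with a diagonal gate D(x).
gate = (W1 @ x > 0).astype(float)        # depends on the input, not the loss
y_gated = W2 @ np.diag(gate) @ W1 @ x

print("ReLU == gated linear:", np.allclose(y_relu, y_gated))   # True
```

The part I still need to understand is how the ReLN construction handles the fact that, in a real ReLU net, the gate pattern depends on the weights themselves rather than being externally specified.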
Deep linear networks also seem to be used as a model in singular learning theory, once again for tractability. I should work out how that works.
Confusing things:
- “linear network” ≠ “linear training dynamics” because depth makes training non-convex and highly structured (Andrew M. Saxe, McClelland, and Ganguli 2014; Baldi and Hornik 1989).
- disentanglement is not necessarily the default inductive bias (Locatello et al. 2019; Jarvis et al. 2024).