
I want a theory that predicts which features deep nets learn, when they learn them, and why. But neural nets are messy and hard to analyse, so we need some way of simplifying them for analysis that still recovers the properties we care about.

Deep linear networks (DLNs) are one attempt at that: models that keep depth, nonconvexity, and hierarchical representation formation while remaining analytically tractable. In principle, they let me connect data geometry (singular values/vectors) to gradient-flow trajectories: which modes win first, how layers align, and why low-rank, semantic structure emerges.
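To make that concrete, here is a minimal toy sketch (my own, not taken from any of the cited papers; the dimensions and target singular values are made up) of the classic Saxe, McClelland, and Ganguli (2014) picture: a two-layer linear network trained by gradient descent from a small initialisation picks up the singular modes of the input-output map roughly in order of singular value, each one saturating on a timescale inversely proportional to its strength.

```python
# Toy sketch (assumed setup, not from the cited papers): a two-layer deep
# linear network W2 @ W1 fit to a target map by full-batch gradient descent
# on whitened inputs. Per Saxe et al. (2014), each singular mode of the
# target is learned on a timescale ~ 1/s_i, so large modes saturate first.
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 8                                  # input/output dim, hidden width
s_true = np.array([4.0, 2.0, 1.0, 0.5])      # made-up target singular values
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
W_target = U[:, :4] @ np.diag(s_true) @ V[:, :4].T

W1 = 1e-3 * rng.normal(size=(h, d))          # small init => rich/aligned regime
W2 = 1e-3 * rng.normal(size=(d, h))
lr = 0.05

for step in range(801):
    E = W_target - W2 @ W1                   # residual of the product map
    dW2 = E @ W1.T                           # -dL/dW2, L = 0.5*||W_target - W2 W1||_F^2
    dW1 = W2.T @ E                           # -dL/dW1
    W2 += lr * dW2
    W1 += lr * dW1
    if step % 200 == 0:
        s_learned = np.linalg.svd(W2 @ W1, compute_uv=False)[:4]
        print(step, np.round(s_learned, 3))  # largest modes converge first
```

Printing the top singular values of the product at intervals shows the s = 4 mode essentially learned before the s = 0.5 mode has budged, which is the staged, mode-by-mode dynamic the DLN theory predicts.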

I haven’t been terribly convinced that they are plausible models for the things I care about: they are just linear functions written in a weird way, right? But what about mixtures of linear functions? Gated neural networks (Li and Sompolinsky 2022; Saxe, Sodhani, and Lewallen 2022) are a weird type of mixture over a weird type of linear model that juuuuuuust about approximates a ReLU activation function if you treat it right. I just saw Devon Jarvis present Jarvis et al. (2024), which pushes this quite a long way as a model of a kind of multi-modal network, so I am now more receptive to the idea that these things could actually be useful.
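To see why the gated view is less weird than it sounds, here is a tiny numerical check (my own sketch, with made-up shapes): a ReLU layer is exactly a linear layer whose units are multiplied by binary, input-dependent gates, so conditional on the gate pattern the whole network is linear.

```python
# Minimal check of the gated-linear = ReLU identity:
# relu(W x) = g(x) * (W x) with gate g(x) = 1[W x > 0].
# Conditioned on the gates, the map x -> y is exactly linear, so a ReLU net
# is a data-indexed mixture of linear networks. (Toy example, my notation.)
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(16, 10))
W2 = rng.normal(size=(3, 16))
x = rng.normal(size=10)

# Standard ReLU forward pass
h_relu = np.maximum(W1 @ x, 0.0)
y_relu = W2 @ h_relu

# Same pass written as a gated *linear* network: the gate depends on x,
# but given the gate pattern the computation is a product of linear maps.
g = (W1 @ x > 0).astype(float)           # binary gates chosen by the input
y_gated = (W2 * g) @ (W1 @ x)            # == W2 @ diag(g) @ W1 @ x

print(np.allclose(y_relu, y_gated))      # True: identical outputs
print((W2 * g) @ W1)                     # effective linear map for this gate pattern
```

Conditioning on the gates is, as I understand it, roughly the move the gated-network papers above exploit: the deep-linear machinery (singular-mode dynamics, the race between pathways) is applied pathway by pathway, one gate pattern at a time.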

What follows is a dot-point curriculum that gets me from deep linear networks to their gated (and ReLU-equivalent) extensions.

1 References

Atanasov, Bordelon, and Pehlevan. 2021. “Neural Networks as Kernel Learners: The Silent Alignment Effect.” In International Conference on Learning Representations (ICLR).
Baldi, and Hornik. 1989. “Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima.” Neural Networks.
Fukumizu. 1998. “Effect of Batch Learning in Multilayer Neural Networks.” In Proceedings of the 5th International Conference on Neural Information Processing.
Goldt, Mézard, Krzakala, et al. 2020. “Modeling the Influence of Data Structure on Learning in Neural Networks: The Hidden Manifold Model.” Physical Review X.
Huh, Mobahi, Zhang, et al. 2023. “The Low-Rank Simplicity Bias in Deep Networks.”
Jacot, Gabriel, and Hongler. 2018. “Neural Tangent Kernel: Convergence and Generalization in Neural Networks.” In Advances in Neural Information Processing Systems. NIPS ’18.
Jarvis, Klein, Rosman, et al. 2024. “Make Haste Slowly: A Theory of Emergent Structured Mixed Selectivity in Feature Learning ReLU Networks.”
Lee, Xiao, Schoenholz, et al. 2019. “Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent.” In Advances in Neural Information Processing Systems.
Li, and Sompolinsky. 2022. “Globally Gated Deep Linear Networks.” In Proceedings of the 36th International Conference on Neural Information Processing Systems. NIPS ’22.
Locatello, Bauer, Lucic, et al. 2019. “Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations.” In Proceedings of the 36th International Conference on Machine Learning.
Rigotti, Barak, Warden, et al. 2013. “The Importance of Mixed Selectivity in Complex Cognitive Tasks.” Nature.
Saxe, Andrew M., McClelland, and Ganguli. 2014. “Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks.” In International Conference on Learning Representations (ICLR).
Saxe, Andrew M., McClelland, and Ganguli. 2019. “A Mathematical Theory of Semantic Development in Deep Neural Networks.” Proceedings of the National Academy of Sciences.
Saxe, Andrew, Sodhani, and Lewallen. 2022. “The Neural Race Reduction: Dynamics of Abstraction in Gated Networks.” In Proceedings of the 39th International Conference on Machine Learning.
Thompson. 1972. “Principal Submatrices IX: Interlacing Inequalities for Singular Values of Submatrices.” Linear Algebra and Its Applications.