Infinite width limits of neural networks

2020-12-09 — 2021-05-11

Wherein the infinite-width asymptotics of single-hidden-layer networks are shown to yield Gaussian-process limits under iid Gaussian weights, and kernel/NTK viewpoints and implications for implicit regularisation are surveyed.

algebra

Bayes

feature construction

functional analysis

Gaussian

Hilbert space

kernel tricks

machine learning

metrics

model selection

neural nets

nonparametric

probabilistic algorithms

SDEs

stochastic processes

Large-width limits of neural nets. An interesting way of considering overparameterization.

A tractable case of NNs in function space.

1 Neural Network Gaussian Process

For now: See Neural network Gaussian process on Wikipedia.

The field sprang from the insight (Neal 1996a) that, in the infinite-width limit, random neural nets with Gaussian weights and appropriate scaling converge to certain Gaussian processes, and there are useful conclusions we can draw from that.

More generally, we might consider correlated and/or non-Gaussian weights and deep networks. Unless otherwise stated, though, I am thinking about i.i.d. Gaussian weights and a single hidden layer.
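To make the scaling concrete, here is one standard way of writing the single-hidden-layer setup. The symbols \(N\), \(b\), \(v_j\), \(\mathbf{w}_j\), \(\sigma_b\) and \(\sigma_v\) are notation assumed here for illustration, and the unit-variance input weights are a simplifying assumption.

\[
f(\mathbf{x}) = b + \frac{1}{\sqrt{N}} \sum_{j=1}^{N} v_j\, \psi(\mathbf{w}_j^\top \mathbf{x}), \qquad b \sim \mathcal{N}(0, \sigma_b^2),\quad v_j \sim \mathcal{N}(0, \sigma_v^2),\quad \mathbf{w}_j \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).
\]

The summands are i.i.d. across \(j\) with mean zero, so for any width \(N\)

\[
\operatorname{Cov}\big(f(\mathbf{x}_p), f(\mathbf{x}_q)\big) = \sigma_b^2 + \sigma_v^2\, \mathbb{E}\big[\psi(Z_p)\psi(Z_q)\big], \qquad Z_p = \mathbf{w}^\top \mathbf{x}_p,
\]

and, provided \(\psi(Z_p)\) has a finite second moment, the central limit theorem makes the joint law of \(f\) at any finite set of inputs Gaussian as \(N \to \infty\). The \(1/\sqrt{N}\) factor is the "appropriate scaling": without it the output variance would grow with the width.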

In this single-hidden-layer case, we get a tractable covariance structure. See NN kernels.
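As a rough numerical check of that tractable covariance, the sketch below compares the empirical output covariance of many random one-hidden-layer ReLU nets against the closed-form order-1 arc-cosine expression of Cho and Saul (2009). The choice of ReLU, the absence of biases, unit-variance weights, and the function names are all assumptions made for this illustration.

```python
import numpy as np

def relu_expected_kernel(X):
    """Closed form of E[relu(w.x) relu(w.x')] for w ~ N(0, I); rows of X must be nonzero."""
    norms = np.linalg.norm(X, axis=1)
    outer = np.outer(norms, norms)
    cos_theta = np.clip((X @ X.T) / outer, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return outer * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)

def sample_wide_nets(X, width, n_nets, rng):
    """Outputs of n_nets independent random one-hidden-layer ReLU nets at the rows of X."""
    d = X.shape[1]
    W = rng.standard_normal((n_nets, d, width))  # input-to-hidden weights
    v = rng.standard_normal((n_nets, width))     # hidden-to-output weights
    H = np.maximum(X @ W, 0.0)                   # shape (n_nets, n_points, width)
    return np.einsum("knw,kw->kn", H, v) / np.sqrt(width)

rng = np.random.default_rng(0)
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.6, 0.8, 0.0],
              [-1.0, 0.0, 0.0]])
F = sample_wide_nets(X, width=256, n_nets=4000, rng=rng)
K_mc = F.T @ F / F.shape[0]   # empirical second moment; the outputs have mean zero
K_exact = relu_expected_kernel(X)
print(np.round(K_mc, 3))
print(np.round(K_exact, 3))
```

The two matrices should agree up to Monte Carlo error. Note that the covariance already matches at finite width; what the wide limit adds is that the outputs become jointly Gaussian.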

Figure 2: For some reason this evokes multi-layer wide NNs for me

2 Neural Network Tangent Kernel

NTK? See Neural Tangent Kernel.
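For orientation, here is a minimal sketch of the empirical tangent kernel (the inner product of parameter gradients) for a one-hidden-layer ReLU net, compared with its wide-width limit. The bias-free architecture, the unit-variance weights, the \(1/\sqrt{N}\) output scaling and the function names are assumptions made for this example, not a general-purpose implementation.

```python
import numpy as np

def empirical_ntk(X, W, v):
    """Gradient inner products for f(x) = v . relu(W^T x) / sqrt(width)."""
    width = W.shape[1]
    pre = X @ W                        # pre-activations, shape (n_points, width)
    act = np.maximum(pre, 0.0)
    gate = (pre > 0).astype(float)     # derivative of relu
    # df/dv_j = act_j / sqrt(width);  df/dw_j = v_j * gate_j * x / sqrt(width)
    term_v = act @ act.T / width
    term_w = (X @ X.T) * ((gate * v**2) @ gate.T) / width
    return term_v + term_w

def limiting_ntk(X):
    """Width -> infinity limit of empirical_ntk for standard normal W and v."""
    norms = np.linalg.norm(X, axis=1)
    outer = np.outer(norms, norms)
    cos_theta = np.clip((X @ X.T) / outer, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    k_act = outer * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)
    k_gate = (np.pi - theta) / (2.0 * np.pi)   # E[1(w.x > 0) 1(w.x' > 0)]
    return k_act + (X @ X.T) * k_gate

rng = np.random.default_rng(0)
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.6, 0.8, 0.0]])
width = 100_000
W = rng.standard_normal((X.shape[1], width))
v = rng.standard_normal(width)
print(np.round(empirical_ntk(X, W, v), 3))
print(np.round(limiting_ntk(X), 3))
```

At large width a single random draw of the weights already gives a kernel close to the deterministic limit, which is what makes the kernel description of training tractable.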

3 Implicit regularization

Here’s one interesting perspective on wide nets (Zhang et al. 2017) which looks rather like the NTK model, but is it? To read.

The effective capacity of neural networks is large enough for a brute-force memorisation of the entire data set.

Even optimisation on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.

Randomising labels is solely a data transformation, leaving all other properties of the learning problem unchanged.

[…] Explicit regularisation may improve generalisation performance, but is neither necessary nor by itself sufficient for controlling generalisation error. […] Appealing to linear models, we analyse how SGD acts as an implicit regulariser.
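The linear-model argument in the last quoted sentence can be illustrated as follows: single-sample SGD on an underdetermined least-squares problem, started from zero, never leaves the row space of the data, so when it interpolates it lands on the minimum-norm solution. This is a sketch under assumed settings (problem sizes, step size, random labels), not the experiment from the paper.

```python
import numpy as np

def sgd_least_squares(X, y, n_steps, rng):
    """Single-sample SGD on 0.5 * (x_i . w - y_i)^2, started from w = 0."""
    n, d = X.shape
    lr = 1.0 / np.max(np.sum(X**2, axis=1))  # conservative constant step size
    w = np.zeros(d)
    for _ in range(n_steps):
        i = rng.integers(n)                  # pick one training example
        w -= lr * (X[i] @ w - y[i]) * X[i]   # every update is a multiple of x_i
    return w

rng = np.random.default_rng(0)
n, d = 10, 50                                # more parameters than data points
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)          # random labels
w_sgd = sgd_least_squares(X, y, n_steps=5000, rng=rng)
w_min_norm = np.linalg.pinv(X) @ y           # minimum-norm interpolating solution
print(np.max(np.abs(X @ w_sgd - y)))         # training residual
print(np.linalg.norm(w_sgd - w_min_norm))    # distance to the minimum-norm solution
```

Both printed numbers should be close to zero: the random labels are memorised exactly, and among the infinitely many interpolating weight vectors SGD picks the one of smallest Euclidean norm, with no explicit penalty involved.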

4 Dropout

Dropout is sometimes interpreted as approximately sampling from a certain kind of Gaussian process defined by the neural net (Gal and Ghahramani 2015). See Dropout.
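Mechanically, the sampling in question amounts to leaving dropout switched on at prediction time and treating repeated stochastic forward passes as draws. The sketch below does this for a one-hidden-layer tanh net with random, untrained weights; the architecture, keep probability and function name are assumptions for illustration, and it says nothing about whether the resulting spread is a good posterior approximation.

```python
import numpy as np

def mc_dropout_predict(X, W, v, keep_prob, n_passes, rng):
    """Mean and std of network outputs over random dropout masks on the hidden layer."""
    width = W.shape[1]
    H = np.tanh(X @ W)                                     # (n_points, width)
    masks = rng.random((n_passes, 1, width)) < keep_prob   # one mask per forward pass
    H_drop = H[None, :, :] * masks / keep_prob             # inverted dropout scaling
    outputs = H_drop @ v / np.sqrt(width)                  # (n_passes, n_points)
    return outputs.mean(axis=0), outputs.std(axis=0)

rng = np.random.default_rng(0)
width = 100
X = rng.standard_normal((5, 2))
W = rng.standard_normal((2, width))
v = rng.standard_normal(width)
mean, std = mc_dropout_predict(X, W, v, keep_prob=0.9, n_passes=2000, rng=rng)
print(np.round(mean, 3))
print(np.round(std, 3))
```

Because the dropout sits directly before a linear read-out, the Monte Carlo mean approaches the ordinary no-dropout forward pass, and the standard deviation is the quantity that gets read as predictive uncertainty.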

5 As stochastic DEs

We can find an SDE for a given NN-style kernel if we can find Green’s functions \(G\) satisfying \(\sigma^2_\varepsilon \langle G_\cdot(\mathbf{x}_p), G_\cdot(\mathbf{x}_q)\rangle = \mathbb{E} \big[ \psi\big(Z_p\big) \psi\big(Z_q \big) \big].\) Russell Tsuchida observes that \(G_\mathbf{s}(\mathbf{x}_p) = \psi(\mathbf{s}^\top \mathbf{x}_p) \sqrt{\phi(\mathbf{s})}\), where \(\phi\) is the pdf of a standard multivariate normal vector with independent components, is a solution.
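This identity can be checked numerically. The sketch below assumes that the inner product is the \(L^2\) inner product over \(\mathbf{s}\), that \(Z_p = \mathbf{s}^\top \mathbf{x}_p\) with \(\mathbf{s}\) standard normal, and that \(\sigma^2_\varepsilon = 1\); none of these is spelled out above, so treat them as my reading. It works in two dimensions with \(\psi = \tanh\), computing the left side by grid quadrature and the right side by Monte Carlo.

```python
import numpy as np

def green(S, x, psi):
    """G_s(x) = psi(s.x) * sqrt(phi(s)) evaluated at each row s of S."""
    d = S.shape[1]
    phi = np.exp(-0.5 * np.sum(S**2, axis=1)) / (2.0 * np.pi) ** (d / 2)
    return psi(S @ x) * np.sqrt(phi)

def inner_product_quadrature(x_p, x_q, psi, half_width=8.0, n_grid=801):
    """Riemann-sum approximation of the integral of G_s(x_p) G_s(x_q) over s in R^2."""
    axis = np.linspace(-half_width, half_width, n_grid)
    ds = axis[1] - axis[0]
    S1, S2 = np.meshgrid(axis, axis, indexing="ij")
    S = np.column_stack([S1.ravel(), S2.ravel()])
    return np.sum(green(S, x_p, psi) * green(S, x_q, psi)) * ds**2

def expectation_monte_carlo(x_p, x_q, psi, n_samples=400_000, seed=0):
    """Monte Carlo estimate of E[psi(s.x_p) psi(s.x_q)] for s ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((n_samples, len(x_p)))
    return np.mean(psi(S @ x_p) * psi(S @ x_q))

x_p = np.array([1.0, 0.5])
x_q = np.array([-0.3, 1.2])
lhs = inner_product_quadrature(x_p, x_q, np.tanh)
rhs = expectation_monte_carlo(x_p, x_q, np.tanh)
print(lhs, rhs)
```

The two numbers should agree up to Monte Carlo error, since squaring \(\sqrt{\phi(\mathbf{s})}\) under the integral simply reintroduces the Gaussian density that defines the expectation.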

6 References

Adlam, Lee, Xiao, et al. 2020. “Exploring the Uncertainty Properties of Neural Networks’ Implicit Priors in the Infinite-Width Limit.” arXiv:2010.07355 [Cs, Stat].

Arora, Du, Hu, et al. 2019. “On Exact Computation with an Infinitely Wide Neural Net.” In Advances in Neural Information Processing Systems.

Atanasov, Bordelon, and Pehlevan. 2021. “Neural Networks as Kernel Learners: The Silent Alignment Effect.” In International Conference on Learning Representations (ICLR).

Bai, and Lee. 2020. “Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks.” arXiv:1910.01619 [Cs, Math, Stat].

Belkin, Ma, and Mandal. 2018. “To Understand Deep Learning We Need to Understand Kernel Learning.” In International Conference on Machine Learning.

Chen, Minshuo, Bai, Lee, et al. 2021. “Towards Understanding Hierarchical Learning: Benefits of Neural Representations.” arXiv:2006.13436 [Cs, Stat].

Chen, Lin, and Xu. 2020. “Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS.” arXiv:2009.10683 [Cs, Math, Stat].

Cho, and Saul. 2009. “Kernel Methods for Deep Learning.” In Proceedings of the 22nd International Conference on Neural Information Processing Systems. NIPS’09.

Domingos. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” arXiv:2012.00152 [Cs, Stat].

Dunbar, and Aaronson. 2025. “Wide Neural Networks at Initialization Can Be Random Functions.”

Dutordoir, Durrande, and Hensman. 2020. “Sparse Gaussian Processes with Spherical Harmonic Features.” In Proceedings of the 37th International Conference on Machine Learning. ICML’20.

Dutordoir, Hensman, van der Wilk, et al. 2021. “Deep Neural Networks as Point Estimates for Deep Gaussian Processes.” arXiv:2105.04504 [Cs, Stat].

Fan, and Wang. 2020. “Spectra of the Conjugate Kernel and Neural Tangent Kernel for Linear-Width Neural Networks.” In Advances in Neural Information Processing Systems.

Fort, Dziugaite, Paul, et al. 2020. “Deep Learning Versus Kernel Learning: An Empirical Study of Loss Landscape Geometry and the Time Evolution of the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems.

Gal, and Ghahramani. 2015. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In Proceedings of the 33rd International Conference on Machine Learning (ICML-16).

———. 2016. “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.” arXiv:1512.05287 [Stat].

Geifman, Yadav, Kasten, et al. 2020. “On the Similarity Between the Laplace and Neural Tangent Kernels.” arXiv:2007.01580 [Cs, Stat].

Ghahramani. 2013. “Bayesian Non-Parametrics and the Probabilistic Approach to Modelling.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.

Girosi, Jones, and Poggio. 1995. “Regularization Theory and Neural Networks Architectures.” Neural Computation.

Giryes, Sapiro, and Bronstein. 2016. “Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy?” IEEE Transactions on Signal Processing.

He, Lakshminarayanan, and Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems.

Jacot, Gabriel, and Hongler. 2018. “Neural Tangent Kernel: Convergence and Generalization in Neural Networks.” In Advances in Neural Information Processing Systems. NIPS’18.

Karakida, and Osawa. 2020. “Understanding Approximate Fisher Information for Fast Convergence of Natural Gradient Descent in Wide Neural Networks.” Advances in Neural Information Processing Systems.

Kristiadi, Hein, and Hennig. 2021. “An Infinite-Feature Extension for Bayesian ReLU Nets That Fixes Their Asymptotic Overconfidence.” Advances in Neural Information Processing Systems.

Lázaro-Gredilla, and Figueiras-Vidal. 2009. “Inter-Domain Gaussian Processes for Sparse Inference Using Inducing Features.” In Advances in Neural Information Processing Systems.

Lee, Bahri, Novak, et al. 2018. “Deep Neural Networks as Gaussian Processes.” In ICLR.

Lee, Xiao, Schoenholz, et al. 2019. “Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent.” In Advances in Neural Information Processing Systems.

Matthews, Rowland, Hron, et al. 2018. “Gaussian Process Behaviour in Wide Deep Neural Networks.” arXiv:1804.11271 [Cs, Stat].

Meronen, Irwanto, and Solin. 2020. “Stationary Activations for Uncertainty Calibration in Deep Learning.” In Advances in Neural Information Processing Systems.

Neal. 1996a. “Bayesian Learning for Neural Networks.”

———. 1996b. “Priors for Infinite Networks.” In Bayesian Learning for Neural Networks. Lecture Notes in Statistics.

Novak, Xiao, Hron, et al. 2019. “Neural Tangents: Fast and Easy Infinite Neural Networks in Python.” arXiv:1912.02803 [Cs, Stat].

Novak, Xiao, Lee, et al. 2020. “Bayesian Deep Convolutional Networks with Many Channels Are Gaussian Processes.” In The International Conference on Learning Representations.

Pearce, Tsuchida, Zaki, et al. 2019. “Expressive Priors in Bayesian Neural Networks: Kernel Combinations and Periodic Functions.” In Uncertainty in Artificial Intelligence.

Roberts, Yaida, and Hanin. 2022. The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks.

Rossi, Heinonen, Bonilla, et al. 2021. “Sparse Gaussian Processes Revisited: Bayesian Approaches to Inducing-Variable Approximations.” In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics.

Sachdeva, Dhaliwal, Wu, et al. 2022. “Infinite Recommendation Networks: A Data-Centric Approach.”

Shi, Titsias, and Mnih. 2020. “Sparse Orthogonal Variational Inference for Gaussian Processes.” In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics.

Tiao, Dutordoir, and Picheny. 2023. “Spherical Inducing Features for Orthogonally-Decoupled Gaussian Processes.”

Williams. 1996. “Computing with Infinite Networks.” In Proceedings of the 9th International Conference on Neural Information Processing Systems. NIPS’96.

Yang. 2019. “Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture Are Gaussian Processes.” arXiv:1910.12478 [Cond-Mat, Physics:math-Ph].

Yang, and Hu. 2020. “Feature Learning in Infinite-Width Neural Networks.” arXiv:2011.14522 [Cond-Mat].

Zhang, Bengio, Hardt, et al. 2017. “Understanding Deep Learning Requires Rethinking Generalization.” In Proceedings of ICLR.