Large-width limits of neural nets. An interesting way of considering overparameterization.

## Neural Network Gaussian Process

For now: See Neural network Gaussian process on Wikipedia.

The field that sprang from the insight (Neal 1996a) that random neural nets with Gaussian weights and appropriate scaling converge to Gaussian processes in the infinite-width limit, and that there are useful conclusions we can draw from that.

More generally we might consider correlated and/or non-Gaussian weights, and deep networks. Unless otherwise stated though, I am thinking about i.i.d. Gaussian weights, and a single hidden layer.

In this single-hidden-layer case we get tractable covariance structure. See NN kernels.
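For example, with ReLU activations and hidden-layer weights \(w \sim \mathcal N(0, I)\), the single-hidden-layer covariance \(\mathbb{E}[\psi(w^\top x)\psi(w^\top y)]\) has the arc-cosine closed form of Cho and Saul. A quick Monte Carlo sanity check (my own sketch, with made-up inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 0.0])
y = np.array([0.6, 0.8, 0.0])

# Monte Carlo: average relu(w.x) * relu(w.y) over random hidden units w ~ N(0, I)
W = rng.standard_normal((200_000, 3))
mc = np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ y, 0.0))

# Closed form: the order-1 arc-cosine kernel (up to normalization),
# (|x||y| / 2 pi) * (sin(theta) + (pi - theta) * cos(theta))
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
cf = (np.linalg.norm(x) * np.linalg.norm(y) / (2 * np.pi)) * (
    np.sin(theta) + (np.pi - theta) * np.cos(theta)
)

print(mc, cf)  # the two estimates agree to a couple of decimal places
```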

## Neural Network Tangent Kernel

See Neural tangent kernel on Wikipedia. Ferenc Huszár provides some Intuition on the Neural Tangent Kernel, i.e. the paper (Lee et al. 2019).

It turns out the neural tangent kernel becomes particularly useful when studying learning dynamics in infinitely wide feed-forward neural networks. Why? Because in this limit, two things happen:

- First: if we initialize \(\theta_0\) randomly from appropriately chosen distributions, the initial NTK of the network, \(k_{\theta_0}\), approaches a deterministic kernel as the width increases. That is, at initialization \(k_{\theta_0}\) doesn't really depend on \(\theta_0\): it is a fixed kernel, independent of the specific initialization.
- Second: in the infinite-width limit the kernel \(k_{\theta_t}\) stays constant over time as we optimise \(\theta_t\). This removes the parameter dependence during training.
These two facts together imply that gradient descent, in the limit of infinite width and infinitesimally small learning rate, can be understood as a pretty simple algorithm called kernel gradient descent with a fixed kernel function that depends only on the architecture (number of layers, activations, etc.). These results, taken together with an older known result (Neal 1996b), allow us to characterise the probability distribution of the minima that gradient descent converges to in this infinite limit as a Gaussian process.
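The first fact is easy to poke at numerically. In this sketch (my own illustration, not code from the papers), the empirical NTK entry \(k_{\theta_0}(x_1, x_2) = \nabla_\theta f_{\theta_0}(x_1)^\top \nabla_\theta f_{\theta_0}(x_2)\) of a one-hidden-layer ReLU net under the NTK parameterization fluctuates across random initializations when the net is narrow, and concentrates as it widens:

```python
import numpy as np

def ntk_entry(x1, x2, m, rng):
    """Empirical NTK entry of f(x) = a . relu(W x) / sqrt(m) at a random init."""
    d = x1.shape[0]
    W = rng.standard_normal((m, d))
    a = rng.standard_normal(m)

    def grad(x):
        h = W @ x
        g_a = np.maximum(h, 0.0) / np.sqrt(m)            # df/da
        g_W = ((a * (h > 0)) / np.sqrt(m))[:, None] * x  # df/dW
        return np.concatenate([g_a, g_W.ravel()])

    return grad(x1) @ grad(x2)

rng = np.random.default_rng(1)
x1, x2 = np.array([1.0, 0.0]), np.array([0.6, 0.8])
narrow = [ntk_entry(x1, x2, 10, rng) for _ in range(8)]
wide = [ntk_entry(x1, x2, 10_000, rng) for _ in range(8)]
print(np.std(narrow), np.std(wide))  # spread shrinks dramatically with width
```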

google/neural-tangents: Fast and Easy Infinite Neural Networks in Python

When are Neural Networks more powerful than Neural Tangent Kernels? introduces two recent interesting takes, Bai and Lee (2020) and M. Chen et al. (2021), which consider quadratic approximations instead of mere linearization. Sequel to Ultra-Wide Deep Nets and Neural Tangent Kernel.

## Implicit regularization

Here's one interesting perspective on wide nets (Zhang et al. 2017) which looks rather like the NTK model, but is it? To read.

The effective capacity of neural networks is large enough for a brute-force memorization of the entire data set.

Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.

Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.

[…] Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error. […] Appealing to linear models, we analyze how SGD acts as an implicit regularizer.
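The linear-model version of that implicit-regularization claim is easy to demonstrate: plain gradient descent from a zero initialization on an overparameterized least-squares problem converges to the minimum-\(\ell_2\)-norm interpolant. A minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                      # more parameters than data points
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Plain gradient descent on squared loss, starting from zero
w = np.zeros(d)
for _ in range(50_000):
    w -= 0.01 * X.T @ (X @ w - y)

# The minimum-l2-norm interpolant, computed directly via the pseudoinverse
w_min = np.linalg.pinv(X) @ y

print(np.linalg.norm(X @ w - y), np.linalg.norm(w - w_min))  # both ~ 0
```

Gradient descent started at zero never leaves the row space of \(X\), which is why it picks out the minimum-norm solution among all interpolants.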

## Dropout

Dropout is sometimes argued to approximately sample from a certain kind of Gaussian process posterior induced by a neural net. See Dropout.
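Mechanically, the Monte Carlo dropout recipe is "keep dropout switched on at prediction time and average several stochastic forward passes", with the spread of those passes read as predictive uncertainty. A toy sketch (the weights are arbitrary stand-ins for a trained model, only to show the mechanics):

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed one-hidden-layer net with arbitrary "trained" weights
W1 = rng.standard_normal((50, 1))
W2 = rng.standard_normal((1, 50)) / 50.0

def forward(x, p_drop=0.5):
    h = np.maximum(W1 @ x, 0.0)
    mask = rng.random(h.shape) > p_drop        # dropout stays on at test time
    return (W2 @ (h * mask / (1.0 - p_drop))).item()

x = np.array([0.7])
samples = [forward(x) for _ in range(1000)]
print(np.mean(samples), np.std(samples))  # predictive mean and a spread
```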

## As stochastic DEs

We can find an SDE for a given NN-style kernel if we can find Green's functions satisfying \(\sigma^2_\varepsilon \langle G_\cdot(\mathbf{x}_p), G_\cdot(\mathbf{x}_q)\rangle = \mathbb{E} \big[ \psi\big(Z_p\big) \psi\big(Z_q \big) \big].\) Russell Tsuchida observes that \(G_\mathbf{s}(\mathbf{x}_p) = \psi(\mathbf{s}^\top \mathbf{x}_p) \sqrt{\phi(\mathbf{s})}\), where \(\phi\) is the pdf of an independent standard multivariate normal vector, is a solution.
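The observation is quick to verify: writing \(Z_p = \mathbf{s}^\top \mathbf{x}_p\) with \(\mathbf{s}\) a standard multivariate normal vector (and absorbing the \(\sigma^2_\varepsilon\) scaling), the inner product over \(\mathbf{s}\) gives

\[
\langle G_\cdot(\mathbf{x}_p), G_\cdot(\mathbf{x}_q)\rangle
= \int \psi(\mathbf{s}^\top \mathbf{x}_p)\,\psi(\mathbf{s}^\top \mathbf{x}_q)\,\sqrt{\phi(\mathbf{s})}\,\sqrt{\phi(\mathbf{s})}\,\mathrm{d}\mathbf{s}
= \mathbb{E}\big[\psi(Z_p)\,\psi(Z_q)\big],
\]

since \(\sqrt{\phi}\cdot\sqrt{\phi} = \phi\) is exactly the Gaussian density being integrated against.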

## To files

Steve Hsu said some provocative things in passing; I would like to read them.

## References

*arXiv:2010.07355 [Cs, Stat]*, October.

*Advances in Neural Information Processing Systems*, 10.

*arXiv:1910.01619 [Cs, Math, Stat]*, February.

*International Conference on Machine Learning*, 541–49.

*arXiv:2009.10683 [Cs, Math, Stat]*, October.

*arXiv:2006.13436 [Cs, Stat]*, March.

*Proceedings of the 22nd International Conference on Neural Information Processing Systems*, 22:342–50. NIPS'09. Red Hook, NY, USA: Curran Associates Inc.

*arXiv:2012.00152 [Cs, Stat]*, November.

*Advances in Neural Information Processing Systems*, 33:12.

*Advances in Neural Information Processing Systems*. Vol. 33.

*Proceedings of the 33rd International Conference on Machine Learning (ICML-16)*.

*arXiv:1512.05287 [Stat]*.

*arXiv:2007.01580 [Cs, Stat]*.

*Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences* 371 (1984): 20110553.

*Neural Computation* 7 (2): 219–69.

*IEEE Transactions on Signal Processing* 64 (13): 3444–57.

*Advances in Neural Information Processing Systems*. Vol. 33.

*Advances in Neural Information Processing Systems*, 31:8571–80. NIPS'18. Red Hook, NY, USA: Curran Associates Inc.

*Advances in Neural Information Processing Systems* 33.

*arXiv:2010.02709 [Cs, Stat]*, May.

*ICLR*.

*Advances in Neural Information Processing Systems*, 8570–81.

*arXiv:1804.11271 [Cs, Stat]*.

*Advances in Neural Information Processing Systems*. Vol. 33.

*Bayesian Learning for Neural Networks*, edited by Radford M. Neal, 29β53. Lecture Notes in Statistics. New York, NY: Springer.

*arXiv:1912.02803 [Cs, Stat]*, December.

*The International Conference on Learning Representations*.

*Uncertainty in Artificial Intelligence*, 11.

*arXiv:2006.10739 [Cs]*, June.

*Proceedings of the 9th International Conference on Neural Information Processing Systems*, 295–301. NIPS'96. Cambridge, MA, USA: MIT Press.

*arXiv:1910.12478 [Cond-Mat, Physics:math-Ph]*, December.

*arXiv:2011.14522 [Cond-Mat]*, November.

*Proceedings of ICLR*.
