Ensemble Kalman methods for training neural networks
Data assimilation for network weights
2022-09-20 — 2024-11-29
Wherein neural-network training is approached via ensemble Kalman updates, a dynamical-perspective method is presented, and a connection to stochastic gradient descent is examined through Claudia Schillings’ filter.
\[\renewcommand{\var}{\operatorname{Var}} \renewcommand{\cov}{\operatorname{Cov}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\bb}[1]{\mathbb{#1}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\rv}[1]{\mathsf{#1}} \renewcommand{\vrv}[1]{\vv{\rv{#1}}} \renewcommand{\disteq}{\stackrel{d}{=}} \renewcommand{\gvn}{\mid} \renewcommand{\Ex}{\mathbb{E}} \renewcommand{\Pr}{\mathbb{P}} \renewcommand{\one}{\unicode{x1D7D9}}\]
Training neural networks by ensemble Kalman updates instead of SGD arises naturally from the dynamical perspective on neural networks. TBD.
Claudia Schillings’ filter (Schillings and Stuart 2017) is an elegant variant of the ensemble Kalman filter which manages to be at once more general and simpler than the original, and which may be applicable here. Haber, Lucka, and Ruthotto (2018) use it to train neural nets (!) and show a rather beautiful connection to stochastic gradient descent in their section 3.2.
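Schematically, ensemble Kalman inversion treats the flattened network weights as the unknown in an inverse problem: the forward map is the network evaluated on the training inputs, and the training targets play the role of observations. Each member of an ensemble of weight vectors is then nudged by a gradient-free, Kalman-style update,

\[u_j^{n+1} = u_j^n + C^{ug}_n \left(C^{gg}_n + h^{-1}\Gamma\right)^{-1}\bigl(y_j^{n} - \mathcal{G}(u_j^n)\bigr),\]

where the two covariance terms are the empirical cross-covariance between weights and predictions and the empirical prediction covariance, h is a step size, Γ the observation-noise covariance, and the y terms are (optionally perturbed) copies of the targets. Below is a minimal numpy sketch of this update applied to a toy one-hidden-layer regression network. The step size, the isotropic noise level `gamma`, the ensemble size, and the helper `eki_step` are my own illustrative choices; this is not the exact scheme of either cited paper.

```python
import numpy as np

def eki_step(u, G, y, gamma=1e-2, h=1.0, rng=None):
    """One ensemble Kalman inversion step (a sketch, not the papers' exact scheme).

    u : (J, d) ensemble of flattened network weights
    G : forward map, weight vector -> predictions on the training inputs
    y : (m,) training targets, treated as the "observations"
    """
    rng = np.random.default_rng() if rng is None else rng
    J, m = u.shape[0], y.shape[0]
    g = np.stack([G(uj) for uj in u])              # (J, m) ensemble predictions
    du = u - u.mean(axis=0)                        # weight anomalies
    dg = g - g.mean(axis=0)                        # prediction anomalies
    C_ug = du.T @ dg / J                           # cross-covariance, (d, m)
    C_gg = dg.T @ dg / J                           # prediction covariance, (m, m)
    K = C_ug @ np.linalg.inv(C_gg + (gamma / h) * np.eye(m))   # Kalman-style gain
    y_pert = y + np.sqrt(gamma) * rng.standard_normal((J, m))  # perturbed observations
    return u + (y_pert - g) @ K.T                  # gradient-free weight update


# Toy usage: fit a one-hidden-layer tanh network to noisy 1-D regression data.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 40)
y = np.sin(x) + 0.05 * rng.standard_normal(40)

H = 8                                  # hidden width (arbitrary)
d = 3 * H + 1                          # W1 (H,), b1 (H,), W2 (H,), b2 scalar

def G(theta):
    W1, b1, W2, b2 = theta[:H], theta[H:2*H], theta[2*H:3*H], theta[3*H]
    return np.tanh(np.outer(x, W1) + b1) @ W2 + b2

u = rng.standard_normal((64, d))       # J = 64 ensemble members
for _ in range(300):
    u = eki_step(u, G, y, gamma=1e-2, h=1.0, rng=rng)

print("train MSE:", np.mean((G(u.mean(axis=0)) - y) ** 2))
```

Subsampling the data used inside the forward map at each step gives the iteration a stochastic, minibatch flavour, which is loosely where the comparison with SGD comes in; see their section 3.2 for the actual argument.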