Recurrent neural networks
June 16, 2016 — September 6, 2021
Feedback networks structured to have memory and a notion of “current” and “past” states, which can encode time (or whatever). Many wheels are re-invented with these, but the essential model is that we have a heavily nonlinear state filter inferred by gradient descent.
The connection between these and convolutional neural networks is suggestive for the same reason.
Many different flavours and topologies. On the border with deep automata.
Here I mostly talk about RNNs which have what I would call an uninterpretable hidden state. If we are interested in actually learning about dynamics from some meaningful state, I think of those more as Neural networks that learn dynamics.
1 Intro
As someone who does a lot of signal processing for music, the notion that these generalize linear systems theory is suggestive of interesting DSP applications, e.g. generative music.
- A good overview is Lipton, Berkowitz, and Elkan (2015).
- Awesome RNN is a curated list of links to implementations.
- Andrej Karpathy: The unreasonable effectiveness of RNNs
- Christopher Olah: Understanding LSTM RNNs
- Jeff Donahue: Long-term recurrent NNs
- Niu, Horesh, and Chuang (2019)
2 Flavours
2.1 Linear
If the NN has no nonlinear activations then it is simply a linear system, e.g. an ARIMA model, as seen in classical signal processing. Learning such models by classical gradient descent can be painful, but the tools to mitigate that problem are well understood, even if they are not always feasible. The essential insight is that propagating updates through a linear dynamical system can be explosive, but there are analyses of the system which keep this under control. See Stability of linear dynamical systems for some useful tricks in the general systems-stability case. TBD: discuss this in the context of learning.
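In the meantime, a minimal sketch (mine, not from any of the references here) of what such a linear recurrent filter looks like; rescaling the transition matrix to spectral radius below 1 is the crude version of the stability tricks alluded to above.

```python
import numpy as np

# A linear "RNN" is just a linear state-space filter:
#   x_{t+1} = A x_t + B u_t,   y_t = C x_t.
rng = np.random.default_rng(0)
d_state, d_in, d_out = 4, 1, 1
A = rng.normal(size=(d_state, d_state))
# Rescale so the spectral radius of A is below 1; otherwise iterating A
# explodes, which is the same pathology that makes gradient descent painful.
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))

def linear_rnn(u):
    """Filter an input sequence u of shape (T, d_in)."""
    x = np.zeros(d_state)
    ys = []
    for u_t in u:
        x = A @ x + B @ u_t
        ys.append(C @ x)
    return np.stack(ys)

y = linear_rnn(rng.normal(size=(100, d_in)))  # shape (100, 1)
```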
2.2 Vanilla non-linear
Imagine an ARIMA-type model, as above, but now with nonlinear activations in the state update (Werbos 1990; Elman 1990). These can be even less reliable to train than classic linear models (Y. Bengio, Simard, and Frasconi 1994). The next few flavours are proposed solutions for that.
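For reference, a hand-rolled Elman-style cell (my sketch; torch.nn.RNN packages the same thing), which makes it easy to see where the trouble comes from: backpropagating through T steps multiplies the gradient by the recurrent weight matrix (and the tanh derivative) T times.

```python
import torch

# Elman cell: h_t = tanh(W_ih x_t + W_hh h_{t-1} + b).
class ElmanCell(torch.nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.in2h = torch.nn.Linear(d_in, d_hidden)
        self.h2h = torch.nn.Linear(d_hidden, d_hidden)

    def forward(self, x, h):
        return torch.tanh(self.in2h(x) + self.h2h(h))

cell = ElmanCell(3, 8)
h = torch.zeros(1, 8)
for x in torch.randn(20, 1, 3):   # 20 timesteps, batch of 1
    h = cell(x, h)
# Backprop through these 20 steps multiplies gradients by W_hh (and tanh')
# 20 times over, which is where vanishing/exploding gradients come from.
```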
2.3 Long Short Term Memory (LSTM)
The workhorse.
As always, Christopher Olah wins the visual explanation prize: Understanding LSTM Networks. Alex Graves (Graves 2013) generates handwriting with them. Also neat: LSTM Networks for Sentiment Analysis, which summarises the motivation:

> In a traditional recurrent neural network, during the gradient back-propagation phase, the gradient signal can end up being multiplied a large number of times (as many as the number of timesteps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that, the magnitude of weights in the transition matrix can have a strong impact on the learning process. […]
>
> These issues are the main motivation behind the LSTM model which introduces a new structure called a memory cell. […] A memory cell is composed of four main elements: an input gate, a neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate. […] The gates serve to modulate the interactions between the memory cell itself and its environment.
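To make that gate structure concrete, here is a hand-rolled cell in the standard formulation (a sketch; in practice you would reach for torch.nn.LSTM or torch.nn.LSTMCell):

```python
import torch

class LSTMCellSketch(torch.nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        # One affine map producing all four gate pre-activations at once.
        self.affine = torch.nn.Linear(d_in + d_hidden, 4 * d_hidden)

    def forward(self, x, h, c):
        z = self.affine(torch.cat([x, h], dim=-1))
        i, f, o, g = z.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # input/forget/output gates
        g = torch.tanh(g)                 # candidate cell update
        c = f * c + i * g                 # memory cell: self-recurrent, gated
        h = o * torch.tanh(c)             # exposed hidden state
        return h, c

cell = LSTMCellSketch(3, 8)
h = c = torch.zeros(1, 8)
for x in torch.randn(20, 1, 3):           # 20 timesteps, batch of 1
    h, c = cell(x, h, c)
```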
2.4 Gated Recurrent Unit (GRU)
Simpler than the LSTM, although you end up needing a few more units to compensate, so I am told. Swings and roundabouts (Chung et al. 2014; Chung, Gulcehre, et al. 2015).
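A rough way to see the trade-off, using the stock PyTorch cells (my sketch): the GRU has three gates to the LSTM's four, so at the same width it carries about three quarters of the parameters.

```python
import torch

# Parameter counts for cells of equal hidden width.
d_in, d_hidden = 32, 128
n_lstm = sum(p.numel() for p in torch.nn.LSTMCell(d_in, d_hidden).parameters())
n_gru = sum(p.numel() for p in torch.nn.GRUCell(d_in, d_hidden).parameters())
print(n_lstm, n_gru, n_gru / n_lstm)  # GRU is roughly 3/4 the size of the LSTM
```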
2.5 Unitary
A charming connection with my other research into acoustics: what I would call “Gerzon allpass” filters, i.e. orthonormal matrices, are useful in neural networks because of their favourable normalisation characteristics and general dynamical considerations.
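A sketch of the orthogonal-recurrence idea, leaning on PyTorch's built-in orthogonal parametrisation (available in recent versions; the cell itself is my illustration, not any particular paper's architecture):

```python
import torch
from torch.nn.utils.parametrizations import orthogonal  # assumes torch >= 1.10

# Constrain the recurrent weight to be orthogonal, so repeated application
# neither shrinks nor blows up the hidden state norm -- an allpass filter,
# in DSP terms.
class OrthogonalRNNCell(torch.nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.in2h = torch.nn.Linear(d_in, d_hidden)
        self.h2h = orthogonal(torch.nn.Linear(d_hidden, d_hidden, bias=False))

    def forward(self, x, h):
        return torch.tanh(self.in2h(x) + self.h2h(h))

cell = OrthogonalRNNCell(3, 8)
W = cell.h2h.weight
print(torch.allclose(W.T @ W, torch.eye(8), atol=1e-5))  # True: W is orthogonal
```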
2.6 Probabilistic
i.e. Kalman filters, but rebranded in the fine neural-networks tradition of taking something uncontroversial from another field and putting the word “neural” in front. In practice these are usually variational, but there are some sampling-based ones too.
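For concreteness, the linear-Gaussian special case being rebranded is the classic Kalman filter; a bare-bones numpy sketch of one predict/update step (mine, not from any particular library):

```python
import numpy as np

def kalman_step(m, P, y, A, Q, H, R):
    """Propagate the Gaussian belief N(m, P) over the latent state one step,
    given transition x' = A x + N(0, Q) and observation y = H x + N(0, R)."""
    # Predict.
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Update on the observation y.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = (np.eye(len(m)) - K @ H) @ P_pred
    return m_new, P_new

# Tiny usage example with a 2-d state and scalar observation.
d, k = 2, 1
A, Q = np.eye(d), 0.01 * np.eye(d)
H, R = np.ones((k, d)), 0.1 * np.eye(k)
m, P = np.zeros(d), np.eye(d)
m, P = kalman_step(m, P, np.array([1.0]), A, Q, H, R)
```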
🏗
2.7 Phased
Long story. Something I meant to follow up because I met a guy in a poster session (Neil, Pfeiffer, and Liu 2016). Possibly subsumed into attention mechanisms?
2.8 Attention
The current hotness in time series prediction is transformer-type methods, which are a whole research area unto themselves.
2.9 Reservoir computing
Reservoir computing models are a kooky type of RNN in which the recurrent weights are random and fixed, and only a linear readout is trained.
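A minimal echo-state-network recipe (my illustration of the generic idea, not any particular package): fix random recurrent weights, scale them for stability, and fit only a linear readout, here by ridge regression.

```python
import numpy as np

rng = np.random.default_rng(0)
d_res, d_in = 200, 1
W = rng.normal(size=(d_res, d_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius < 1 ("echo state" property)
W_in = rng.normal(size=(d_res, d_in))

def reservoir_states(u):
    """Run the fixed reservoir over an input sequence u of shape (T, d_in)."""
    x = np.zeros(d_res)
    xs = []
    for u_t in u:
        x = np.tanh(W @ x + W_in @ u_t)
        xs.append(x.copy())
    return np.stack(xs)

# Fit the linear readout to predict the next input value (one-step-ahead).
u = np.sin(np.linspace(0, 20 * np.pi, 1000))[:, None]
X = reservoir_states(u[:-1])
y = u[1:, 0]
W_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(d_res), X.T @ y)  # ridge regression
pred = X @ W_out
```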
3 Connection with continuous time
Clearly related to NODEs (neural ODEs). Some methods exploit both, e.g. Gu et al. (2021).
3.1 Other
TBD
4 Recursive estimation
See recursive identification for generic theory of learning under the distribution shift induced by a moving parameter vector.
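As a reminder of the flavour of such methods, here is exponentially-forgetting recursive least squares (a generic textbook sketch, nothing RNN-specific): the forgetting factor discounts old data so the estimate can track a drifting parameter vector.

```python
import numpy as np

def rls_update(theta, P, x, y, lambda_=0.99):
    """One step of exponentially-forgetting RLS for the model y ~ x @ theta."""
    Px = P @ x
    k = Px / (lambda_ + x @ Px)          # gain
    theta = theta + k * (y - x @ theta)  # correct by prediction error
    P = (P - np.outer(k, Px)) / lambda_  # update (inverse) information matrix
    return theta, P

d = 3
theta, P = np.zeros(d), 10.0 * np.eye(d)
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5])
for t in range(500):
    true_theta += 0.001 * rng.normal(size=d)   # slow parameter drift
    x = rng.normal(size=d)
    y = x @ true_theta + 0.1 * rng.normal()
    theta, P = rls_update(theta, P, x, y)
```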
5 Practicalities
5.1 Loading data
pytorch-forecasting has a utility class TimeSeriesDataSet which loads up examples for us, which is nice. However, it seems to want the data at each time step to be (comparatively) low-dimensional, possibly tabular, so it is not clear how to use it for predicting dense matrices or tensors; i.e. it looks handy for predicting stock prices, but not so much for predicting video frames.
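For my own reference, the tabular use case looks roughly like this (a sketch against the pytorch-forecasting API as I understand it at the time of writing; the column names and lengths are invented, so check the current docs):

```python
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet

# Each row is one (series, timestep) observation with a scalar target --
# fine for stock prices, awkward if each timestep is a large matrix/tensor.
df = pd.DataFrame({
    "series": ["a"] * 100 + ["b"] * 100,
    "time_idx": list(range(100)) * 2,
    "value": [float(v) for v in range(200)],
})

dataset = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="value",
    group_ids=["series"],
    max_encoder_length=24,              # how much history each example sees
    max_prediction_length=6,            # how far ahead it predicts
    time_varying_unknown_reals=["value"],
)
loader = dataset.to_dataloader(train=True, batch_size=32)
```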