Feedback networks structured to have memory and a notion of “current” and “past” states, which can encode time (or whatever else is sequential). Many wheels are re-invented with these, but the essential model is a heavily nonlinear state filter whose parameters are learned by gradient descent.

The connection between these and convolutional neural networks is suggestive for the same reason.

Many different flavours and topologies. On the border with deep automata.

## Intro

As someone who does a lot of signal processing for music, I find the notion that these generalise linear systems theory suggestive of interesting DSP applications, e.g. generative music.

- A good overview is Lipton, Berkowitz, and Elkan (2015).
- Awesome RNN is a curated links list of implementations.
- Andrej Karpathy: The Unreasonable Effectiveness of Recurrent Neural Networks
- Christopher Olah: Understanding LSTM Networks
- Jeff Donahue: Long term recurrent NN
- Niu, Horesh, and Chuang (2019)

## Flavours

### Linear

If the NN has no nonlinear activations then it is simply a linear system, e.g. an ARIMA model, as seen in classical signal processing. Learning such models by classical gradient descent can be painful, but the tools to mitigate that problem are well understood, even if they are not always feasible. The essential insight is that the propagation of linear updates through a dynamical system can be explosive, but analysing the stability of the system tells us when that will happen and how to constrain it. See Stability of linear dynamical systems for some useful tricks in the general systems stability case. TBD: discuss this in the context of learning.
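To make the “explosive” part concrete, here is a minimal sketch, assuming the simplest possible case: a scalar recurrence `h[t] = a * h[t-1] + b * x[t]`. The names `run_linear` and `state_sensitivity` are mine, purely for illustration. The sensitivity of the final state to the initial one is `a ** T`, so it blows up for `|a| > 1` and dies away for `|a| < 1`; in the matrix case the spectral radius of the transition matrix plays the role of `|a|`.

```python
def run_linear(a, b, xs, h0=0.0):
    """Run h[t] = a * h[t-1] + b * x[t] and return the final state."""
    h = h0
    for x in xs:
        h = a * h + b * x
    return h


def state_sensitivity(a, n_steps):
    """d h[T] / d h[0] for the recurrence above, which is simply a ** T."""
    return a ** n_steps


# Sensitivity after 100 steps for a contracting, neutral and expanding system.
for a in (0.9, 1.0, 1.1):
    print(a, state_sensitivity(a, 100))
```

Any gradient with respect to the parameters is built out of such products, which is why it inherits the same behaviour.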

### Vanilla non-linear

Imagine an ARIMA-type model, as above, but now with nonlinear activations in the state update step (P. J. Werbos 1990; Elman 1990). These can be even less reliable to train than classic linear models (Y. Bengio, Simard, and Frasconi 1994). The next few flavours are proposed solutions for that.
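A minimal sketch of such an update, assuming the common Elman-style form `h = tanh(W h_prev + U x + b)` and plain Python lists in place of a tensor library. The names `elman_step`, `run_elman`, `W`, `U` and `b` are illustrative choices of mine, not taken from the cited papers.

```python
import math


def elman_step(h_prev, x, W, U, b):
    """One update h = tanh(W @ h_prev + U @ x + b), written out with lists."""
    h_new = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W[i][j] * h_prev[j] for j in range(len(h_prev)))
        pre += sum(U[i][k] * x[k] for k in range(len(x)))
        h_new.append(math.tanh(pre))
    return h_new


def run_elman(xs, W, U, b, h0):
    """Filter a whole input sequence, returning every hidden state."""
    h, states = h0, []
    for x in xs:
        h = elman_step(h, x, W, U, b)
        states.append(h)
    return states


# Two hidden units, one input channel, three time steps.
W = [[0.5, -0.3], [0.2, 0.1]]
U = [[1.0], [0.5]]
b = [0.0, 0.0]
states = run_elman([[1.0], [0.0], [-1.0]], W, U, b, h0=[0.0, 0.0])
```

Dropping the `tanh` recovers the linear filter of the previous section.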

### Long Short Term Memory (LSTM)

The workhorse.

As always, Christopher Olah wins the visual explanation prize: Understanding LSTM Networks. Also neat: Alex Graves (Graves 2013) generates handwriting.

From LSTM Networks for Sentiment Analysis:

> In a traditional recurrent neural network, during the gradient back-propagation phase, the gradient signal can end up being multiplied a large number of times (as many as the number of timesteps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that, the magnitude of weights in the transition matrix can have a strong impact on the learning process. […]
>
> These issues are the main motivation behind the LSTM model which introduces a new structure called a memory cell. […] A memory cell is composed of four main elements: an input gate, a neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate. […] The gates serve to modulate the interactions between the memory cell itself and its environment.
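The memory cell described above can be sketched for a single unit as follows. This is a hedged illustration assuming the now-standard formulation with sigmoid gates and a `tanh` candidate; the function names and the parameter dictionary layout are my own invention, not any library's API.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM. `p` maps a gate name to (w_x, w_h, bias)."""
    def affine(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(affine("input"))        # how much new content to write
    f = sigmoid(affine("forget"))       # how much of the old cell state to keep
    o = sigmoid(affine("output"))       # how much of the cell to expose
    g = math.tanh(affine("candidate"))  # proposed new content
    c = f * c_prev + i * g              # the self-recurrent memory cell
    h = o * math.tanh(c)
    return h, c


# Hypothetical parameters: forget gate wide open, input gate shut,
# so the cell simply remembers whatever it already held.
params = {
    "input": (0.0, 0.0, -20.0),
    "forget": (0.0, 0.0, 20.0),
    "output": (0.0, 0.0, 20.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(1.0, 0.0, 0.7, params)
```

Note that the cell update is additive: when the forget gate sits near 1, the old state passes through almost unchanged rather than being repeatedly multiplied by a weight matrix, which is the point of the design.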

### Gated Recurrent Unit (GRU)

Simpler than the LSTM, although you end up needing a couple more units, so I am told. Swings and roundabouts. (Chung et al. 2014; Chung, Gulcehre, et al. 2015)
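One rough way to quantify “simpler” is to count parameters. The sketch below assumes the textbook formulations, in which an LSTM layer has four weight blocks (three gates plus a candidate) and a GRU has three, each with one bias vector; some libraries keep extra bias vectors, so their counts can differ slightly. The function names are mine.

```python
def gated_rnn_param_count(n_input, n_hidden, n_blocks):
    """Parameters in a gated layer with `n_blocks` affine maps of (x, h) -> h."""
    per_block = n_hidden * (n_input + n_hidden) + n_hidden  # weights + bias
    return n_blocks * per_block


def lstm_param_count(n_input, n_hidden):
    return gated_rnn_param_count(n_input, n_hidden, n_blocks=4)


def gru_param_count(n_input, n_hidden):
    return gated_rnn_param_count(n_input, n_hidden, n_blocks=3)


print(lstm_param_count(10, 20), gru_param_count(10, 20))
```

At equal width the GRU is about three quarters the size, which leaves room for those extra units within the same parameter budget.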

### Unitary

There is a charming connection with my other research into acoustics: what I would call “Gerzon allpass” filters, or orthonormal matrices, are useful in neural networks because of favourable normalisation characteristics and general dynamical considerations.
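The dynamical consideration can be seen in a toy example. This sketch assumes the simplest orthonormal matrix I can think of, a 2×2 rotation, and shows that repeated application preserves the norm of the state exactly, whereas a slightly scaled copy does not. All helper names are illustrative.

```python
import math


def rotation(theta):
    """A 2x2 rotation matrix, which is orthonormal by construction."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def iterate(A, v, n_steps):
    """Apply the same transition matrix n_steps times, as a linear RNN would."""
    for _ in range(n_steps):
        v = matvec(A, v)
    return v


Q = rotation(0.3)
scaled = [[1.1 * q for q in row] for row in Q]
print(norm(iterate(Q, [3.0, 4.0], 1000)))      # stays at the initial norm
print(norm(iterate(scaled, [3.0, 4.0], 50)))   # grows geometrically
```

The same norm preservation applies to gradients flowing backwards through the transpose, so a purely orthonormal transition neither explodes nor vanishes.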

### Probabilistic

i.e. Kalman filters, but rebranded in the fine neural networks tradition of taking something uncontroversial from another field and putting the word “neural” in front. Practically these are usually variational, but there are some random-sampling-based ones.
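For reference, the un-rebranded original is short. A minimal sketch, assuming a scalar linear-Gaussian model `x[t] = a * x[t-1] + noise(q)`, `y[t] = c * x[t] + noise(r)`; the function name and default values are illustrative only.

```python
def kalman_step(mean, var, y, a=1.0, q=0.1, c=1.0, r=1.0):
    """One predict/update cycle of a scalar Kalman filter.

    `mean`, `var` describe the Gaussian belief about the previous state;
    `y` is the new observation; `q` and `r` are process and observation
    noise variances.
    """
    # Predict: push the belief through the linear dynamics.
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Update: blend prediction and observation according to the Kalman gain.
    gain = var_pred * c / (c * c * var_pred + r)
    mean_new = mean_pred + gain * (y - c * mean_pred)
    var_new = (1.0 - gain * c) * var_pred
    return mean_new, var_new


mean, var = 0.0, 1.0
for y in [1.2, 0.9, 1.1]:
    mean, var = kalman_step(mean, var, y)
```

The neural versions replace the linear maps with networks, at which point the update is no longer available in closed form, hence the variational and sampling machinery.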

🏗

### Phased

Long story. Something I meant to follow up because I met a guy in a poster session (Neil, Pfeiffer, and Liu 2016). Possibly subsumed into attention mechanisms?

### Attention

The current hotness in time series prediction is transformer-type methods which are a whole research area unto themselves.

### Reservoir computing

Reservoir computing models seem to be a kooky type of RNN.

## Connection with continuous time

Clearly related to neural ODEs (NODEs). Some methods exploit both, e.g. Gu et al. (2021).

### Other

TBD

## Practicalities

### Stability of training

We can think of the problem of training recurrent networks as essentially a system identification problem, with all the implied difficulties, including stability problems.

The field has its own special terminology for these, e.g. *vanishing/exploding gradients* (Y. Bengio, Simard, and Frasconi 1994; Pascanu, Mikolov, and Bengio 2013).
Also worth knowing: TBPTT (*truncated back propagation through time*) (Williams and Zipser 1989).
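To see what truncation does to a gradient, here is a toy sketch, assuming the scalar recurrence `h[t] = a * h[t-1] + x[t]` and tracking the sensitivity `d h / d a` forwards rather than running a literal backward pass. Resetting the sensitivity at each window boundary mimics what truncation does: the state is carried forward, but no gradient flows past the boundary. The function name and the fixed-window scheme are my own simplifications.

```python
def grad_wrt_a(xs, a, window=None):
    """d h[T] / d a for h[t] = a * h[t-1] + x[t], optionally truncated.

    The sensitivity s[t] = d h[t] / d a obeys s[t] = h[t-1] + a * s[t-1].
    With `window` set, s is reset to zero at every window boundary.
    """
    h, s = 0.0, 0.0
    for t, x in enumerate(xs):
        if window is not None and t % window == 0:
            s = 0.0  # detach: forget how earlier steps depended on a
        s = h + a * s
        h = a * h + x
    return s


xs = [1.0, 1.0, 1.0]
full = grad_wrt_a(xs, 0.5)
truncated = grad_wrt_a(xs, 0.5, window=2)
print(full, truncated)
```

The truncated gradient is biased, since it ignores dependence on the parameter from before the window, but it cannot accumulate more than `window` factors of `a`, which bounds both cost and explosion.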

### Teacher forcing

TBD.

Side order of *professor forcing*, and *curriculum learning*.
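Until this section is written, a minimal sketch of the basic idea as I understand it: when training a one-step-ahead predictor, teacher forcing conditions each prediction on the true previous value, whereas free-running mode feeds the model its own previous prediction. The `rollout` helper and the toy `step` model below are hypothetical.

```python
def rollout(step, observed, teacher_forcing=True):
    """Predict observed[1:] one step at a time using `step(prev) -> next`.

    With teacher forcing, each prediction is conditioned on the true previous
    value; without it, the model is fed its own previous prediction.
    """
    preds = []
    prev = observed[0]
    for target in observed[1:]:
        pred = step(prev)
        preds.append(pred)
        prev = target if teacher_forcing else pred
    return preds


def halve(value):
    """A stand-in for a learned model."""
    return 0.5 * value


observed = [8.0, 2.0, 1.0]
forced = rollout(halve, observed, teacher_forcing=True)
free = rollout(halve, observed, teacher_forcing=False)
```

The two rollouts agree on the first step and diverge afterwards, which is exactly the train/test mismatch that the side orders above try to address.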

### Loading data

pytorch-forecasting has a utility class, TimeSeriesDataSet, which loads up examples for us, which is nice. However, it seems to want the prediction target at each time step to be (comparatively) low dimensional, possibly tabular data, so it is not clear how to use it for predicting dense matrices or tensors. This looks handy for predicting stock prices, but not so much for predicting video frames.
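For dense targets, one fallback is to cut the windows by hand. The following is a sketch of that idea only, not of the pytorch-forecasting API; `sliding_windows` is a hypothetical helper, and each “frame” can be any object, here a nested list standing in for a matrix.

```python
def sliding_windows(frames, n_context, n_predict):
    """Return (context, target) pairs cut from one long sequence of frames."""
    pairs = []
    last_start = len(frames) - n_context - n_predict
    for start in range(last_start + 1):
        mid = start + n_context
        pairs.append((frames[start:mid], frames[mid:mid + n_predict]))
    return pairs


# Five 1x1 "frames"; predict one frame from the preceding three.
frames = [[[t]] for t in range(5)]
pairs = sliding_windows(frames, n_context=3, n_predict=1)
```

In practice one would wrap something like this in whatever dataset abstraction the training framework expects, and index lazily rather than materialising every pair.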

## References

*arXiv:1902.01028 [Cs, Math, Stat]*, February.

*arXiv:1705.07199 [Cs]*, May.

*Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-Gram Model? On the Future of Language Modeling for HLT*, 20–28. WLM ’12. Montreal, Canada: Association for Computational Linguistics.

*Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48*, 1120–28. ICML’16. New York, NY, USA: JMLR.org.

*Neural Networks* 21 (5): 786–95.

*PMLR*, 342–50.

*IEEE Transactions on Neural Networks and Learning Systems* 27 (1): 62–76.

*Advances in Neural Information Processing Systems 28*, 1171–79. NIPS’15. Cambridge, MA, USA: Curran Associates, Inc.

*IEEE Transactions on Neural Networks* 5 (2): 157–66.

*29th International Conference on Machine Learning*.

*Applications of Evolutionary Computing*, edited by Franz Rothlauf, Jürgen Branke, Stefano Cagnoni, Ernesto Costa, Carlos Cotta, Rolf Drechsler, Evelyne Lutton, et al., 652–63. Lecture Notes in Computer Science 3907. Springer Berlin Heidelberg.

*Nature Reviews Neuroscience* 6 (10): 755–65.

*Proceedings of ICLR*.

*arXiv:1605.08346 [Cs, Math, Stat]*, May.

*Journal of Economic Surveys* 21 (4): 746–85.

*EMNLP 2014*.

*arXiv Preprint arXiv:1409.1259*.

*arXiv:1609.01704 [Cs]*, September.

*NIPS*.

*Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37*, 2067–75. ICML’15. JMLR.org.

*Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2980–88. Curran Associates, Inc.

*arXiv:1611.09913 [Cs, Stat]*.

*arXiv Preprint arXiv:1603.09025*.

*arXiv:1610.01989 [Cs, Stat]*, September.

*Proceedings of the National Academy of Sciences* 112 (45): E6233–42.

*Cognitive Science* 14: 179–211.

*arXiv:1704.02798 [Cs, Stat]*, April.

*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2199–2207. Curran Associates, Inc.

*arXiv:1512.05287 [Stat]*.

*Neural Computation* 12 (10): 2451–71.

*Journal of Machine Learning Research* 3 (Aug): 115–43.

*Proceedings of the 24th International Conference on Neural Information Processing Systems*, 2348–56. NIPS’11. USA: Curran Associates Inc.

*Supervised Sequence Labelling with Recurrent Neural Networks*. Studies in Computational Intelligence, v. 385. Heidelberg ; New York: Springer.

*arXiv:1308.0850 [Cs]*, August.

*arXiv:1502.04623 [Cs]*, February.

*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 4125–33. Curran Associates, Inc.

*2009 International Joint Conference on Neural Networks*, 1018–24.

*arXiv:2110.13985 [Cs]*, October.

*The Journal of Machine Learning Research* 19 (1): 1025–68.

*NIPS*.

*Expert Systems with Applications* 39 (2): 1597–1606.

*Advances in Neural Information Processing Systems*.

*IEEE Signal Processing Magazine* 29 (6): 82–97.

*International Journal of Uncertainty Fuzziness and Knowledge Based Systems* 6: 107–15.

*A Field Guide to Dynamical Recurrent Neural Networks*. IEEE Press.

*Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference*, 473–79.

*Neural Computation* 9 (8): 1735–80.

*arXiv:1511.05101 [Cs, Math, Stat]*, November.

*Tutorial on Training Recurrent Neural Networks, Covering BPPT, RTRL, EKF and the “Echo State Network” Approach*. Vol. 5. GMD-Forschungszentrum Informationstechnik.

*PMLR*, 1733–41.

*Proceedings of the 32nd International Conference on Machine Learning (ICML-15)*, 2342–50.

*arXiv:1506.02078 [Cs]*, June.

*arXiv:2006.16236 [Cs, Stat]*, August.

*Advances in Neural Information Processing Systems 29*. Curran Associates, Inc.

*arXiv:1402.3511 [Cs]*, February.

*arXiv Preprint arXiv:1511.05121*.

*Advances In Neural Information Processing Systems*.

*arXiv:1612.06212 [Cs]*, December.

*Proceedings of the IEEE* 86 (11): 2278–2324.

*Neural Computation* 17 (11): 2337–82.

*arXiv:1506.00019 [Cs]*, May.

*Computer Science Review* 3 (3): 127–49.

*Computational Neuroscience: A Comprehensive Approach*, 575–605. Chapman & Hall/CRC.

*Advances In Neural Information Processing Systems*.

*arXiv Preprint arXiv:1705.09279*.

*Proceedings of the 27th International Conference on International Conference on Machine Learning*, 735–42. ICML’10. USA: Omnipress.

*Proceedings of the 28th International Conference on International Conference on Machine Learning*, 1033–40. ICML’11. USA: Omnipress.

*Neural Networks: Tricks of the Trade*, 479–535. Lecture Notes in Computer Science. Springer.

*PMLR*, 2401–9.

*Eleventh Annual Conference of the International Speech Communication Association*.

*arXiv:1805.10369 [Cs, Stat]*, May.

*Nature* 518: 529–33.

*IEEE Transactions on Audio, Speech, and Language Processing* 20 (1): 14–22.

*Neural Networks* 25 (January): 70–83.

*arXiv:1610.09513 [Cs]*, October.

*arXiv:1904.12933 [Quant-Ph, Stat]*, April.

*arXiv:1703.00381 [Cs, Stat]*, March.

*arXiv:1211.5063 [Cs]*, 1310–18.

*arXiv:1511.06309 [Cs]*, November.

*arXiv:1612.09158 [Cs, Stat]*, December.

*arXiv:1611.04500 [Cs, Stat]*.

*arXiv:1803.05428 [Cs, Eess, Stat]*, March.

*arXiv:1506.01698 [Cs]*, June.

*Nature* 323 (6088): 533–36.

*arXiv:1802.03335 [Stat]*, February.

*Automatica*, Trends in System Identification, 31 (12): 1691–1724.

*arXiv:2002.03629 [Cs, Stat]*, February.

*2004 IEEE International Joint Conference on Neural Networks, 2004. Proceedings*, 2:843–848 vol.2.

*arXiv:1705.08209 [Cs]*, May.

*Advances in Neural Information Processing Systems*, 1345–52.

*arXiv:1506.03478 [Cs, Stat]*, June.

*arXiv:1505.00393 [Cs]*, May.

*arXiv:1711.11053 [Stat]*, November.

*Proceedings of the IEEE* 78 (10): 1550–60.

*Neural Networks* 1 (4): 339–56.

*Neural Computation* 2 (4): 490–501.

*Neural Computation* 1 (2): 270–80.

*Advances in Neural Information Processing Systems*, 4880–88.

*Advances in Neural Information Processing Systems 29*.

*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2856–64. Curran Associates, Inc.

*arXiv:1502.08029 [Cs, Stat]*, February.
