Recurrent neural networks

June 16, 2016 — September 6, 2021

Feedback networks structured to have memory and a notion of “current” and “past” states, which can encode time (or whatever). Many wheels are re-invented with these, but the essential model is that we have a heavily nonlinear state filter inferred by gradient descent.

The connection between these and convolutional neural networks is suggestive for the same reason.

Many different flavours and topologies. On the border with deep automata.

Here I mostly talk about RNNs which have what I would call an uninterpretable hidden state. If we are interested in actually learning about dynamics from some meaningful state, I think of those more as Neural networks that learn dynamics.

Figure 1

1 Intro

As someone who does a lot of signal processing for music, I find the notion that these generalise linear systems theory suggestive of interesting DSP applications, e.g. generative music.

2 Flavours

2.1 Linear

If the NN has no nonlinear activations then it is simply a linear system, e.g. an ARIMA model, as seen in classical signal processing. Learning such models by classical gradient descent can be painful, but the tools to mitigate that problem are well understood, even if they are not always feasible. The essential insight is that the propagation of linear updates through a dynamical system can be explosive, but there are analyses of the system which mitigate this problem. See Stability of linear dynamical systems for some useful tricks in the general systems stability case. TBD: discussing this in the context of learning.
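A minimal sketch of the explosive-propagation point, assuming a noiseless linear recurrence h_t = A h_{t-1}. The matrix A, the helper names and the sizes are illustrative choices of mine, not taken from any library. The Jacobian of h_T with respect to h_0 is the matrix power A^T, so its norm shrinks or grows geometrically depending on whether the spectral radius of A is below or above 1.

```python
import numpy as np

def spectral_radius(A):
    """Largest absolute eigenvalue of the transition matrix A."""
    return np.max(np.abs(np.linalg.eigvals(A)))

def jacobian_norm(A, steps):
    """Spectral norm of d h_T / d h_0 = A^steps for h_t = A h_{t-1}."""
    return np.linalg.norm(np.linalg.matrix_power(A, steps), 2)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
A_stable = 0.9 * A / spectral_radius(A)    # rescaled to spectral radius 0.9
A_unstable = 1.1 * A / spectral_radius(A)  # rescaled to spectral radius 1.1

# Same matrix up to a scalar, very different long-run gradient behaviour.
for steps in (10, 50, 200):
    print(steps, jacobian_norm(A_stable, steps), jacobian_norm(A_unstable, steps))
```

Constraining or monitoring the spectral radius is one of the classical mitigations; whether that is feasible depends on how the transition matrix is parameterised.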

2.2 Vanilla non-linear

Imagine an ARIMA-type model, as above, but now with nonlinear activations in the state update step (Werbos 1990; Elman 1990). These can be even less reliable to train than classic linear models (Y. Bengio, Simard, and Frasconi 1994). The next few flavours are proposed solutions for that.
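For concreteness, a minimal sketch of an Elman-style recurrence, assuming the common textbook form h_t = tanh(W h_{t-1} + U x_t + b). The function names, weight scales and dimensions are illustrative assumptions, and there is no training loop here, only the forward state update.

```python
import numpy as np

def elman_step(h, x, W, U, b):
    """One state update: h_t = tanh(W h_{t-1} + U x_t + b)."""
    return np.tanh(W @ h + U @ x + b)

def run_rnn(xs, W, U, b):
    """Run the recurrence over xs of shape (T, input_dim), starting from h_0 = 0."""
    h = np.zeros(W.shape[0])
    states = []
    for x in xs:
        h = elman_step(h, x, W, U, b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(1)
hidden_dim, input_dim, T = 4, 2, 20
W = 0.5 * rng.standard_normal((hidden_dim, hidden_dim))  # recurrent weights
U = rng.standard_normal((hidden_dim, input_dim))         # input weights
b = np.zeros(hidden_dim)
states = run_rnn(rng.standard_normal((T, input_dim)), W, U, b)
print(states.shape)
```

Replacing np.tanh with the identity recovers the linear case of the previous section.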

2.3 Long Short Term Memory (LSTM)

The workhorse.

As always, Christopher Olah wins the visual explanation prize: Understanding LSTM Networks. Also neat: LSTM Networks for Sentiment Analysis. Alex Graves (Graves 2013) generates handwriting.

In a traditional recurrent neural network, during the gradient back-propagation phase, the gradient signal can end up being multiplied a large number of times (as many as the number of timesteps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that, the magnitude of weights in the transition matrix can have a strong impact on the learning process. […]

These issues are the main motivation behind the LSTM model which introduces a new structure called a memory cell. […] A memory cell is composed of four main elements: an input gate, a neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate. […] The gates serve to modulate the interactions between the memory cell itself and its environment.
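The memory cell in the quoted description can be sketched as a single update step. This assumes the common formulation with a forget gate (Gers, Schmidhuber, and Cummins 2000) and no peephole connections; the stacked weight layout and all names are illustrative assumptions rather than any library's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h, c, x, W, b):
    """One LSTM update.

    W has shape (4 * hidden_dim, hidden_dim + input_dim) and stacks the weights
    for the input gate, forget gate, output gate and candidate update, in that order.
    """
    n = h.shape[0]
    z = W @ np.concatenate([h, x]) + b
    i = sigmoid(z[:n])             # input gate
    f = sigmoid(z[n:2 * n])        # forget gate
    o = sigmoid(z[2 * n:3 * n])    # output gate
    g = np.tanh(z[3 * n:])         # candidate cell update
    c_new = f * c + i * g          # self-recurrent memory cell
    h_new = o * np.tanh(c_new)     # hidden state exposed to the rest of the network
    return h_new, c_new

rng = np.random.default_rng(2)
n, d = 3, 2
W = 0.1 * rng.standard_normal((4 * n, n + d))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.standard_normal((5, d)):
    h, c = lstm_step(h, c, x, W, b)
print(h, c)
```

The point of the additive update c_new = f * c + i * g is that, when the forget gate sits near 1, the cell state is carried forward without being repeatedly multiplied by a weight matrix.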

2.4 Gated Recurrent Unit (GRU)

Simpler than the LSTM, although you end up needing a couple more units, so I am told. Swings and roundabouts. (Chung et al. 2014; Chung, Gulcehre, et al. 2015)
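A rough sketch of what "simpler" means here, assuming the usual GRU formulation with an update gate and a reset gate and no separate memory cell. Conventions differ between papers on which way the update gate interpolates, so treat the last line of gru_step as one choice among several. The parameter-count helper is a hypothetical illustration that counts one weight matrix and one bias per block.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h, x, Wz, Wr, Wh, bz, br, bh):
    """One GRU update with update gate z and reset gate r."""
    hx = np.concatenate([h, x])
    z = sigmoid(Wz @ hx + bz)                                 # update gate
    r = sigmoid(Wr @ hx + br)                                 # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h, x]) + bh)   # candidate state
    return (1.0 - z) * h + z * h_tilde                        # interpolate old and candidate

def n_params(hidden_dim, input_dim, n_blocks):
    """Parameters in n_blocks weight matrices of shape (hidden, hidden + input) plus biases."""
    return n_blocks * (hidden_dim * (hidden_dim + input_dim) + hidden_dim)

# Three blocks for a GRU versus four for an LSTM at the same width.
print(n_params(128, 32, 3), n_params(128, 32, 4))
```

At equal width the GRU has three quarters of the LSTM's parameters, which is one way to read the trade between simplicity and needing more units.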

2.5 Unitary

There is a charming connection with my other research into acoustics: what I would call “Gerzon allpass” filters, or orthonormal matrices, are useful in neural networks because of favourable normalisation characteristics and general dynamical considerations.
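A small numerical illustration of the normalisation property, assuming a real orthogonal transition matrix (the real-valued special case of a unitary one) obtained from a QR decomposition. The sizes, step count and scale factors are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, steps = 16, 200

# The Q factor of a QR decomposition of a square Gaussian matrix is orthogonal.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

def propagate(M, v, steps):
    """Apply the linear map M to v repeatedly and return the final norm."""
    for _ in range(steps):
        v = M @ v
    return np.linalg.norm(v)

v0 = rng.standard_normal(n)
# Ratio of final to initial norm when all singular values are 0.95, 1.0 or 1.05.
for scale in (0.95, 1.0, 1.05):
    print(scale, propagate(scale * Q, v0, steps) / np.linalg.norm(v0))
```

With singular values exactly 1 the signal neither vanishes nor explodes, while a 5% departure in either direction compounds to a factor of 0.95^200 or 1.05^200 over 200 steps. The same argument applies to gradients propagated backwards through the transpose.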

2.6 Probabilistic

i.e. Kalman filters, but rebranded in the fine neural networks tradition of taking something uncontroversial from another field and putting the word “neural” in front. Practically these are usually variational, but there are some random sampling based ones.
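For reference, the uncontroversial thing being rebranded looks like this in its simplest form: a scalar Kalman filter, assuming the model x_t = a x_{t-1} + N(0, q) with observations y_t = x_t + N(0, r). The parameter values and names are illustrative. The neural variants, as I understand them, swap the linear maps for networks and the exact Gaussian update for an approximate (typically variational) one.

```python
import numpy as np

def kalman_step(mean, var, y, a, q, r):
    """One predict/update cycle for x_t = a x_{t-1} + N(0, q), y_t = x_t + N(0, r)."""
    # Predict: push the current belief through the linear dynamics.
    pred_mean = a * mean
    pred_var = a * a * var + q
    # Update: blend prediction and observation using the Kalman gain.
    gain = pred_var / (pred_var + r)
    new_mean = pred_mean + gain * (y - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var

rng = np.random.default_rng(4)
a, q, r = 0.95, 0.1, 0.5
x, mean, var = 0.0, 0.0, 1.0
for _ in range(50):
    x = a * x + np.sqrt(q) * rng.standard_normal()   # simulate the latent state
    y = x + np.sqrt(r) * rng.standard_normal()       # simulate a noisy observation
    mean, var = kalman_step(mean, var, y, a, q, r)
print(x, mean, var)
```

Here the hidden state of the recurrence is a distribution (a mean and a variance) rather than a point, which is the sense in which these models are probabilistic.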


2.7 Phased

Long story. Something I meant to follow up because I met a guy in a poster session (Neil, Pfeiffer, and Liu 2016). Possibly subsumed into attention mechanisms?

2.8 Attention

The current hotness in time series prediction is transformer-type methods which are a whole research area unto themselves.

2.9 Reservoir computing

Reservoir computing models seem to be a kooky type of RNN.

3 Connection with continuous time

Clearly related to neural ODEs (NODEs). Some methods exploit both, e.g. Gu et al. (2021).

3.1 Other


4 Recursive estimation

See recursive identification for generic theory of learning under the distribution shift induced by a moving parameter vector.

5 Practicalities

5.1 Loading data

pytorch-forecasting has a utility class, TimeSeriesDataSet, which loads up examples for us, which is nice. However, it seems to want the prediction data for each time step to be (comparatively) low dimensional, possibly tabular, so it is not clear how to use it for predicting dense matrices or tensors. That is, it looks handy for predicting stock prices, but not so much for predicting video frames.
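For the dense-tensor case, one fallback is to cut the windows by hand. The following is a plain NumPy sketch of that idea and is not the pytorch-forecasting API: sliding_windows and its arguments encoder_length and prediction_length are hypothetical names, and a real pipeline would probably wrap this in a lazy dataset class rather than materialising every window.

```python
import numpy as np

def sliding_windows(series, encoder_length, prediction_length):
    """Cut a series of shape (T, ...) into (history, target) pairs.

    Trailing dimensions are carried along untouched, so each time step may be
    a scalar, a vector, or a dense array such as a video frame.
    """
    series = np.asarray(series)
    n_examples = series.shape[0] - encoder_length - prediction_length + 1
    if n_examples <= 0:
        raise ValueError("series is too short for the requested window lengths")
    histories = np.stack([series[i:i + encoder_length] for i in range(n_examples)])
    targets = np.stack([
        series[i + encoder_length:i + encoder_length + prediction_length]
        for i in range(n_examples)
    ])
    return histories, targets

# Ten toy 4x4 "frames" standing in for video.
frames = np.arange(10 * 4 * 4, dtype=float).reshape(10, 4, 4)
histories, targets = sliding_windows(frames, encoder_length=5, prediction_length=2)
print(histories.shape, targets.shape)
```

Each example pairs five consecutive frames with the two frames that follow them.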

6 Incoming

7 References

Aicher, Foti, and Fox. 2020. “Adaptively Truncating Backpropagation Through Time to Control Gradient Bias.” In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference.
Allen-Zhu, and Li. 2019. “Can SGD Learn Recurrent Neural Networks with Provable Generalization?” arXiv:1902.01028 [Cs, Math, Stat].
Anderson, and Berg. 2017. “The High-Dimensional Geometry of Binary Neural Networks.” arXiv:1705.07199 [Cs].
Arisoy, Sainath, Kingsbury, et al. 2012. “Deep Neural Network Language Models.” In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-Gram Model? On the Future of Language Modeling for HLT. WLM ’12.
Arjovsky, Shah, and Bengio. 2016. “Unitary Evolution Recurrent Neural Networks.” In Proceedings of the 33rd International Conference on Machine Learning - Volume 48. ICML’16.
Auer, Burgsteiner, and Maass. 2008. “A Learning Rule for Very Simple Universal Approximators Consisting of a Single Layer of Perceptrons.” Neural Networks.
Balduzzi, Frean, Leary, et al. 2017. “The Shattered Gradients Problem: If Resnets Are the Answer, Then What Is the Question?” In PMLR.
Bazzani, Torresani, and Larochelle. 2017. “Recurrent Mixture Density Network for Spatiotemporal Visual Attention.”
Ben Taieb, and Atiya. 2016. “A Bias and Variance Analysis for Multistep-Ahead Time Series Forecasting.” IEEE Transactions on Neural Networks and Learning Systems.
Bengio, Y., Simard, and Frasconi. 1994. “Learning Long-Term Dependencies with Gradient Descent Is Difficult.” IEEE Transactions on Neural Networks.
Bengio, Samy, Vinyals, Jaitly, et al. 2015. “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks.” In Advances in Neural Information Processing Systems 28. NIPS’15.
Boulanger-Lewandowski, Bengio, and Vincent. 2012. “Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription.” In 29th International Conference on Machine Learning.
Bown, and Lexer. 2006. “Continuous-Time Recurrent Neural Networks for Generative and Interactive Musical Performance.” In Applications of Evolutionary Computing. Lecture Notes in Computer Science 3907.
Buhusi, and Meck. 2005. “What Makes Us Tick? Functional and Neural Mechanisms of Interval Timing.” Nature Reviews Neuroscience.
Chang, Chen, Haber, et al. 2019. “AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks.” In Proceedings of ICLR.
Charles, Yin, and Rozell. 2016. “Distributed Sequence Memory of Multidimensional Inputs in Recurrent Networks.” arXiv:1605.08346 [Cs, Math, Stat].
Chevillon. 2007. “Direct Multi-Step Estimation and Forecasting.” Journal of Economic Surveys.
Cho, van Merriënboer, Bahdanau, et al. 2014. “On the Properties of Neural Machine Translation: Encoder-Decoder Approaches.” arXiv Preprint arXiv:1409.1259.
Cho, van Merriënboer, Gulcehre, et al. 2014. “Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation.” In EMNLP 2014.
Chung, Ahn, and Bengio. 2016. “Hierarchical Multiscale Recurrent Neural Networks.” arXiv:1609.01704 [Cs].
Chung, Gulcehre, Cho, et al. 2014. “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.” In NIPS.
Chung, Gulcehre, Cho, et al. 2015. “Gated Feedback Recurrent Neural Networks.” In Proceedings of the 32nd International Conference on Machine Learning - Volume 37. ICML’15.
Chung, Kastner, Dinh, et al. 2015. “A Recurrent Latent Variable Model for Sequential Data.” In Advances in Neural Information Processing Systems 28.
Collins, Sohl-Dickstein, and Sussillo. 2016. “Capacity and Trainability in Recurrent Neural Networks.” In arXiv:1611.09913 [Cs, Stat].
Cooijmans, Ballas, Laurent, et al. 2016. “Recurrent Batch Normalization.” arXiv Preprint arXiv:1603.09025.
Dasgupta, Yoshizumi, and Osogami. 2016. “Regularized Dynamic Boltzmann Machine with Delay Pruning for Unsupervised Learning of Temporal Sequences.” arXiv:1610.01989 [Cs, Stat].
Doelling, and Poeppel. 2015. “Cortical Entrainment to Music and Its Modulation by Expertise.” Proceedings of the National Academy of Sciences.
Elman. 1990. “Finding Structure in Time.” Cognitive Science.
Fortunato, Blundell, and Vinyals. 2017. “Bayesian Recurrent Neural Networks.” arXiv:1704.02798 [Cs, Stat].
Fraccaro, Sønderby, Paquet, et al. 2016. “Sequential Neural Models with Stochastic Layers.” In Advances in Neural Information Processing Systems 29.
Gal, and Ghahramani. 2016. “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.” In arXiv:1512.05287 [Stat].
Gers, Schmidhuber, and Cummins. 2000. “Learning to Forget: Continual Prediction with LSTM.” Neural Computation.
Gers, Schraudolph, and Schmidhuber. 2002. “Learning Precise Timing with LSTM Recurrent Networks.” Journal of Machine Learning Research.
Gilpin. 2023. “Model Scale Versus Domain Knowledge in Statistical Forecasting of Chaotic Systems.” Physical Review Research.
Graves. 2011. “Practical Variational Inference for Neural Networks.” In Proceedings of the 24th International Conference on Neural Information Processing Systems. NIPS’11.
———. 2012. Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence, v. 385.
———. 2013. “Generating Sequences With Recurrent Neural Networks.” arXiv:1308.0850 [Cs].
Gregor, Danihelka, Graves, et al. 2015. “DRAW: A Recurrent Neural Network For Image Generation.” arXiv:1502.04623 [Cs].
Gruslys, Munos, Danihelka, et al. 2016. “Memory-Efficient Backpropagation Through Time.” In Advances in Neural Information Processing Systems 29.
Grzyb, Chinellato, Wojcik, et al. 2009. “Which Model to Use for the Liquid State Machine?” In 2009 International Joint Conference on Neural Networks.
Gu, Johnson, Goel, et al. 2021. “Combining Recurrent, Convolutional, and Continuous-Time Models with Linear State Space Layers.” In Advances in Neural Information Processing Systems.
Hardt, Ma, and Recht. 2018. “Gradient Descent Learns Linear Dynamical Systems.” The Journal of Machine Learning Research.
Hazan, Hananel, and Manevitz. 2012. “Topological Constraints and Robustness in Liquid State Machines.” Expert Systems with Applications.
Hazan, Elad, Singh, and Zhang. 2017. “Learning Linear Dynamical Systems via Spectral Filtering.” In NIPS.
He, Wang, and Hopcroft. 2016. “A Powerful Generative Model Using Random Weights for the Deep Image Representation.” In Advances in Neural Information Processing Systems.
Hinton, Deng, Yu, et al. 2012. “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups.” IEEE Signal Processing Magazine.
Hochreiter. 1998. “The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions.” International Journal of Uncertainty Fuzziness and Knowledge Based Systems.
Hochreiter, Bengio, Frasconi, et al. 2001. “Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies.” In A Field Guide to Dynamical Recurrent Neural Networks.
Hochreiter, and Schmidhuber. 1997a. “LSTM Can Solve Hard Long Time Lag Problems.” In Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference.
Hochreiter, and Schmidhuber. 1997b. “Long Short-Term Memory.” Neural Computation.
Huszár. 2015. “How (Not) to Train Your Generative Model: Scheduled Sampling, Likelihood, Adversary?” arXiv:1511.05101 [Cs, Math, Stat].
Jaeger. 2002. “Tutorial on Training Recurrent Neural Networks, Covering BPPT, RTRL, EKF and the ‘Echo State Network’ Approach.”
Jing, Shen, Dubcek, et al. 2017. “Tunable Efficient Unitary Neural Networks (EUNN) and Their Application to RNNs.” In PMLR.
Jozefowicz, Zaremba, and Sutskever. 2015. “An Empirical Exploration of Recurrent Network Architectures.” In Proceedings of the 32nd International Conference on Machine Learning (ICML-15).
Karpathy, Johnson, and Fei-Fei. 2015. “Visualizing and Understanding Recurrent Networks.” arXiv:1506.02078 [Cs].
Katharopoulos, Vyas, Pappas, et al. 2020. “Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention.” arXiv:2006.16236 [Cs, Stat].
Kingma, Salimans, Jozefowicz, et al. 2016. “Improving Variational Inference with Inverse Autoregressive Flow.” In Advances in Neural Information Processing Systems 29.
Koutník, Greff, Gomez, et al. 2014. “A Clockwork RNN.” arXiv:1402.3511 [Cs].
Krishnamurthy, Can, and Schwab. 2022. “Theory of Gating in Recurrent Neural Networks.” Physical Review. X.
Krishnan, Shalit, and Sontag. 2015. “Deep Kalman Filters.” arXiv Preprint arXiv:1511.05121.
Lamb, Goyal, Zhang, et al. 2016. “Professor Forcing: A New Algorithm for Training Recurrent Networks.” In Advances In Neural Information Processing Systems.
Laurent, and von Brecht. 2016. “A Recurrent Neural Network Without Chaos.” arXiv:1612.06212 [Cs].
LeCun. 1998. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE.
Legenstein, Naeger, and Maass. 2005. “What Can a Neuron Learn with Spike-Timing-Dependent Plasticity?” Neural Computation.
Lillicrap, and Santoro. 2019. “Backpropagation Through Time and the Brain.” Current Opinion in Neurobiology, Machine Learning, Big Data, and Neuroscience.
Lipton, Berkowitz, and Elkan. 2015. “A Critical Review of Recurrent Neural Networks for Sequence Learning.” arXiv:1506.00019 [Cs].
Lukoševičius, and Jaeger. 2009. “Reservoir Computing Approaches to Recurrent Neural Network Training.” Computer Science Review.
Maass, Natschläger, and Markram. 2004. “Computational Models for Generic Cortical Microcircuits.” In Computational Neuroscience: A Comprehensive Approach.
MacKay, Vicol, Ba, et al. 2018. “Reversible Recurrent Neural Networks.” In Advances In Neural Information Processing Systems.
Maddison, Lawson, Tucker, et al. 2017. “Filtering Variational Objectives.” arXiv Preprint arXiv:1705.09279.
Martens. 2010. “Deep Learning via Hessian-Free Optimization.” In Proceedings of the 27th International Conference on Machine Learning. ICML’10.
Martens, and Sutskever. 2011. “Learning Recurrent Neural Networks with Hessian-Free Optimization.” In Proceedings of the 28th International Conference on Machine Learning. ICML’11.
———. 2012. “Training Deep and Recurrent Networks with Hessian-Free Optimization.” In Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science.
Mhammedi, Hellicar, Rahman, et al. 2017. “Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections.” In PMLR.
Mikolov, Karafiát, Burget, et al. 2010. “Recurrent Neural Network Based Language Model.” In Eleventh Annual Conference of the International Speech Communication Association.
Miller, and Hardt. 2018. “When Recurrent Models Don’t Need To Be Recurrent.” arXiv:1805.10369 [Cs, Stat].
Mnih. 2015. “Human-Level Control Through Deep Reinforcement Learning.” Nature.
Mohamed, Dahl, and Hinton. 2012. “Acoustic Modeling Using Deep Belief Networks.” IEEE Transactions on Audio, Speech, and Language Processing.
Monner, and Reggia. 2012. “A Generalized LSTM-Like Training Algorithm for Second-Order Recurrent Neural Networks.” Neural Networks.
Neil, Pfeiffer, and Liu. 2016. “Phased LSTM: Accelerating Recurrent Network Training for Long or Event-Based Sequences.” arXiv:1610.09513 [Cs].
Niu, Horesh, and Chuang. 2019. “Recurrent Neural Networks in the Eye of Differential Equations.” arXiv:1904.12933 [Quant-Ph, Stat].
Nussbaum-Thom, Cui, Ramabhadran, et al. 2016. “Acoustic Modeling Using Bidirectional Gated Recurrent Convolutional Units.” In.
Oliva, Poczos, and Schneider. 2017. “The Statistical Recurrent Unit.” arXiv:1703.00381 [Cs, Stat].
Pascanu, Mikolov, and Bengio. 2013. “On the Difficulty of Training Recurrent Neural Networks.” In arXiv:1211.5063 [Cs].
Patraucean, Handa, and Cipolla. 2015. “Spatio-Temporal Video Autoencoder with Differentiable Memory.” arXiv:1511.06309 [Cs].
Pillonetto. 2016. “The Interplay Between System Identification and Machine Learning.” arXiv:1612.09158 [Cs, Stat].
Ravanbakhsh, Schneider, and Poczos. 2016. “Deep Learning with Sets and Point Clouds.” In arXiv:1611.04500 [Cs, Stat].
Roberts, Engel, Raffel, et al. 2018. “A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music.” arXiv:1803.05428 [Cs, Eess, Stat].
Rohrbach, Rohrbach, and Schiele. 2015. “The Long-Short Story of Movie Description.” arXiv:1506.01698 [Cs].
Rumelhart, Hinton, and Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature.
Ryder, Golightly, McGough, et al. 2018. “Black-Box Variational Inference for Stochastic Differential Equations.” arXiv:1802.03335 [Stat].
Sjöberg, Zhang, Ljung, et al. 1995. “Nonlinear Black-Box Modeling in System Identification: A Unified Overview.” Automatica, Trends in System Identification.
Sompolinsky, Crisanti, and Sommers. 1988. “Chaos in Random Neural Networks.” Physical Review Letters.
Song, Meng, Liao, et al. 2020. “Nonlinear Equation Solving: A Faster Alternative to Feedforward Computation.” arXiv:2002.03629 [Cs, Stat].
Steil. 2004. “Backpropagation-Decorrelation: Online Recurrent Learning with O(N) Complexity.” In 2004 IEEE International Joint Conference on Neural Networks, 2004. Proceedings.
Surace, and Pfister. 2016. “Online Maximum Likelihood Estimation of the Parameters of Partially Observed Diffusion Processes.” In.
Sutskever. 2013. “Training Recurrent Neural Networks.”
Takamoto, Praditia, Leiteritz, et al. 2022. “PDEBench: An Extensive Benchmark for Scientific Machine Learning.” In.
Tallec, and Ollivier. 2017. “Unbiasing Truncated Backpropagation Through Time.”
Taylor, Hinton, and Roweis. 2006. “Modeling Human Motion Using Binary Latent Variables.” In Advances in Neural Information Processing Systems.
Theis, and Bethge. 2015. “Generative Image Modeling Using Spatial LSTMs.” arXiv:1506.03478 [Cs, Stat].
Visin, Kastner, Cho, et al. 2015. “ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks.” arXiv:1505.00393 [Cs].
Voelker, Kajic, and Eliasmith. n.d. “Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks.”
Wang, and Niepert. 2019. “State-Regularized Recurrent Neural Networks.”
Wen, Torkkola, and Narayanaswamy. 2017. “A Multi-Horizon Quantile Recurrent Forecaster.” arXiv:1711.11053 [Stat].
Werbos. 1988. “Generalization of Backpropagation with Application to a Recurrent Gas Market Model.” Neural Networks.
———. 1990. “Backpropagation Through Time: What It Does and How to Do It.” Proceedings of the IEEE.
Williams, and Peng. 1990. “An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories.” Neural Computation.
Williams, and Zipser. 1989. “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks.” Neural Computation.
Wisdom, Powers, Hershey, et al. 2016. “Full-Capacity Unitary Recurrent Neural Networks.” In Advances in Neural Information Processing Systems.
Wisdom, Powers, Pitton, et al. 2016. “Interpretable Recurrent Neural Networks Using Sequential Sparse Recovery.” In Advances in Neural Information Processing Systems 29.
Wu, Zhang, Zhang, et al. 2016. “On Multiplicative Integration with Recurrent Neural Networks.” In Advances in Neural Information Processing Systems 29.
Yao, Torabi, Cho, et al. 2015. “Describing Videos by Exploiting Temporal Structure.” arXiv:1502.08029 [Cs, Stat].