Recurrent neural networks

Feedback networks structured to have memory and a notion of “current” and “past” states, which can encode time (or whatever). Many wheels are re-invented with these, but the essential model is that we have a heavily nonlinear state filter inferred by gradient descent.

The connection between these and convolutional neural networks is suggestive for the same reason.

Many different flavours and topologies. On the border with deep automata.


As someone who does a lot of signal processing for music, the notion that these generalise linear systems theory is suggestive of interesting DSP applications, e.g. generative music.



The linear case, as seen in normal signal processing.
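A minimal sketch of the linear case, assuming the simplest possible example, a scalar one-pole recursion. The names `one_pole`, `a` and `b` are made up for illustration. The point is only that stability is decided by the size of the feedback coefficient.

```python
def one_pole(xs, a, b=1.0, h0=0.0):
    """Linear recurrent state filter: h[t] = a * h[t-1] + b * x[t]."""
    h = h0
    out = []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out

# Impulse response: a unit spike followed by silence.
impulse = [1.0] + [0.0] * 9
stable = one_pole(impulse, a=0.5)    # decays like 0.5**t
unstable = one_pole(impulse, a=1.5)  # grows like 1.5**t
```

With `|a| < 1` the state forgets the impulse geometrically; with `|a| > 1` it blows up.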

Vanilla non-linear

The main problem here is that training is unstable: gradients tend to vanish or explode as they are propagated back through many time steps, unless you are clever. See (Bengio, Simard, and Frasconi 1994). The next three types are proposed solutions for that.
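A rough sketch of the recurrence and of why it misbehaves, assuming an Elman-style tanh network. The function names `rnn_forward` and `backprop_gain` are hypothetical, and the recurrent matrix is taken to be a scaled orthogonal matrix purely so that the effect is easy to read off.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, h0):
    """Elman-style recurrence: h[t] = tanh(W_h @ h[t-1] + W_x @ x[t])."""
    h, states = h0, []
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x)
        states.append(h)
    return np.stack(states)

def backprop_gain(states, W_h):
    """Spectral norm of d h[T] / d h[0], a product of per-step Jacobians."""
    J = np.eye(W_h.shape[0])
    for h in states:
        # Jacobian of one step: diag(1 - h[t]**2) @ W_h
        J = ((1.0 - h**2)[:, None] * W_h) @ J
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
n, T = 4, 20
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))  # random orthogonal matrix
xs = np.zeros((T, n))                          # silent input keeps h at 0
W_x = np.eye(n)
for scale in (0.5, 1.5):
    W_h = scale * Q
    states = rnn_forward(xs, W_h, W_x, h0=np.zeros(n))
    print(scale, backprop_gain(states, W_h))   # roughly scale**T
```

In this contrived setting the back-propagated signal shrinks or grows geometrically in the number of time steps, depending on whether the recurrent weights contract or expand.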

Long Short Term Memory (LSTM)

The workhorse.

As always, Christopher Olah wins the visual explanation prize: Understanding LSTM Networks. Alex Graves (Graves 2013) generates handwriting. Also neat: LSTM Networks for Sentiment Analysis:

In a traditional recurrent neural network, during the gradient back-propagation phase, the gradient signal can end up being multiplied a large number of times (as many as the number of timesteps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that, the magnitude of weights in the transition matrix can have a strong impact on the learning process. […]

These issues are the main motivation behind the LSTM model which introduces a new structure called a memory cell. […] A memory cell is composed of four main elements: an input gate, a neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate. […] The gates serve to modulate the interactions between the memory cell itself and its environment.
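The memory cell described above can be sketched as follows. This assumes the now-common formulation with a forget gate and no peephole connections; `lstm_step` and the stacked weight layout are illustrative choices, not taken from any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W has shape (4 * n_hidden, n_hidden + n_in)."""
    n = h.shape[0]
    z = W @ np.concatenate([h, x]) + b
    i = sigmoid(z[0 * n:1 * n])   # input gate
    f = sigmoid(z[1 * n:2 * n])   # forget gate
    o = sigmoid(z[2 * n:3 * n])   # output gate
    g = np.tanh(z[3 * n:4 * n])   # candidate update to the cell
    c_new = f * c + i * g         # self-recurrent memory cell
    h_new = o * np.tanh(c_new)    # what the cell exposes to its environment
    return h_new, c_new

n_hidden, n_in = 3, 2
W = np.zeros((4 * n_hidden, n_hidden + n_in))
b = np.zeros(4 * n_hidden)
b[0:n_hidden] = -20.0               # input gate shut
b[n_hidden:2 * n_hidden] = 20.0     # forget gate open
h, c = np.zeros(n_hidden), np.array([1.0, -2.0, 3.0])
for _ in range(100):
    h, c = lstm_step(np.ones(n_in), h, c, W, b)
# c has barely moved after 100 steps: the cell remembers.
```

The additive update `c_new = f * c + i * g` is the point: when the forget gate is near one, the cell state is carried forward almost unchanged rather than being repeatedly squashed through the weight matrix.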

Gated Recurrent Unit (GRU)

Simpler than the LSTM, although you end up needing a few more of them, so I am told. Swings and roundabouts. (Chung et al. 2014; Chung, Gulcehre, et al. 2015)
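For comparison, a sketch of one GRU step, assuming the convention in which the update gate `z` weights the new candidate (some papers swap the roles of `z` and `1 - z`). The name `gru_step` and the argument layout are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, W_z, W_r, W_c, b_z, b_r, b_c):
    """One GRU step; each W has shape (n_hidden, n_hidden + n_in)."""
    hx = np.concatenate([h, x])
    z = sigmoid(W_z @ hx + b_z)   # update gate
    r = sigmoid(W_r @ hx + b_r)   # reset gate
    # Candidate state sees the previous state only through the reset gate.
    cand = np.tanh(W_c @ np.concatenate([r * h, x]) + b_c)
    return (1.0 - z) * h + z * cand

n_hidden, n_in = 3, 2
zeros_W = np.zeros((n_hidden, n_hidden + n_in))
zeros_b = np.zeros(n_hidden)
h = np.array([1.0, -2.0, 3.0])
# Update gate held shut (large negative bias): the state passes through.
h_kept = gru_step(np.ones(n_in), h, zeros_W, zeros_W, zeros_W,
                  zeros_b - 50.0, zeros_b, zeros_b)
```

There is one state vector and two gates, rather than the LSTM's separate cell and three gates, which is where the saving comes from.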


Charming connection with my other research into acoustics: what I would call “Gerzon allpass” filters, i.e. orthonormal matrices, are useful in neural networks because of favourable normalisation characteristics and general dynamical considerations.
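A small numerical illustration of the normalisation property, assuming a real orthogonal matrix obtained from a QR decomposition of a random matrix. This is only one convenient way to get such a matrix; the papers in this area use more structured parametrisations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 8, 1000
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))  # Q.T @ Q is the identity

h = rng.normal(size=n)
norm0 = np.linalg.norm(h)
for _ in range(T):
    h = Q @ h                                  # linear recurrence, no input
norm_T = np.linalg.norm(h)
# norm_T matches norm0 up to rounding: energy is neither lost nor gained.
```

Because every singular value of `Q` is one, repeated application neither shrinks nor inflates the state, which is the allpass property in filter language.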


i.e. Kalman filters, but rebranded in the fine neural networks tradition of taking something uncontroversial from another field and putting the word “neural” in front. Practically these are usually variational, but there are some random-sampling-based ones.
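For reference, the uncontroversial original: a sketch of one step of a scalar linear-Gaussian Kalman filter. The model, the name `kalman_step` and the default parameters are assumptions for illustration; the neural variants replace these linear maps with networks and the exact update with an approximate one.

```python
def kalman_step(m, p, y, a=1.0, q=0.0, r=1.0):
    """One predict/update step for x[t] = a*x[t-1] + w,  y[t] = x[t] + v.

    m, p: posterior mean and variance of the state from the previous step.
    q, r: variances of the process noise w and the observation noise v.
    """
    # Predict: push the belief through the linear dynamics.
    m_pred = a * m
    p_pred = a * a * p + q
    # Update: blend prediction and observation by their relative certainty.
    k = p_pred / (p_pred + r)          # Kalman gain
    m_new = m_pred + k * (y - m_pred)
    p_new = (1.0 - k) * p_pred
    return m_new, p_new

m, p = 0.0, 1.0
for y in [2.0, 1.5, 1.8, 2.2]:
    m, p = kalman_step(m, p, y)
```

The state here is a belief (mean and variance) rather than a point value, but structurally it is still a recurrent filter.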



Long story, bro. Something I meant to follow up because I met a guy in a poster session. Possibly subsumed into attention mechanisms?

Keras implementation by Francesco Ferroni. TensorFlow implementation by Enea Ceolini.

Lasagne implementation by Danny Neil.


It’s still the wild west. Invent a category, name it and stake a claim. There’s publications in them thar hills.



TBPTT (truncated back propagation through time), state filters, filter stability.
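A toy sketch of the truncation idea, assuming a scalar tanh recurrence with a loss on the final state only. The name `tbptt_grad` and the setup are hypothetical; practical TBPTT usually works on chunks of a long sequence with a loss at every step.

```python
import math

def tbptt_grad(xs, w, y, k, h0=0.0):
    """Gradient of 0.5 * (h[T] - y)**2 w.r.t. w, backpropagating at most k steps.

    The recurrence is h[t] = tanh(w * h[t-1] + x[t]).
    """
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    T = len(xs)
    delta = hs[T] - y                      # dL/dh[T]
    grad = 0.0
    for t in range(T, max(T - k, 0), -1):
        local = 1.0 - hs[t] ** 2           # tanh' at step t
        grad += delta * local * hs[t - 1]  # direct dependence of h[t] on w
        delta *= local * w                 # push the error back to h[t-1]
    return grad

xs = [0.5, -0.3, 0.8, 0.1]
full = tbptt_grad(xs, w=0.7, y=0.2, k=len(xs))   # ordinary BPTT
short = tbptt_grad(xs, w=0.7, y=0.2, k=1)        # only the last step
```

Setting `k` to the sequence length recovers the full gradient; smaller `k` is cheaper but biased, since the contributions of earlier steps are dropped.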

Allen-Zhu, Zeyuan, and Yuanzhi Li. 2019. “Can SGD Learn Recurrent Neural Networks with Provable Generalization?” February 3, 2019.

Anderson, Alexander G., and Cory P. Berg. 2017. “The High-Dimensional Geometry of Binary Neural Networks.” May 19, 2017.

Arisoy, Ebru, Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran. 2012. “Deep Neural Network Language Models.” In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-Gram Model? On the Future of Language Modeling for HLT, 20–28. WLM ’12. Montreal, Canada: Association for Computational Linguistics.

Arjovsky, Martin, Amar Shah, and Yoshua Bengio. 2016. “Unitary Evolution Recurrent Neural Networks.” In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, 1120–8. ICML’16. New York, NY, USA.

Auer, Peter, Harald Burgsteiner, and Wolfgang Maass. 2008. “A Learning Rule for Very Simple Universal Approximators Consisting of a Single Layer of Perceptrons.” Neural Networks 21 (5): 786–95.

Balduzzi, David, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. 2017. “The Shattered Gradients Problem: If Resnets Are the Answer, Then What Is the Question?” In PMLR, 342–50.

Bengio, Samy, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks.” In Advances in Neural Information Processing Systems 28, 1171–9. NIPS’15. Cambridge, MA, USA: Curran Associates, Inc.

Bengio, Y., P. Simard, and P. Frasconi. 1994. “Learning Long-Term Dependencies with Gradient Descent Is Difficult.” IEEE Transactions on Neural Networks 5 (2): 157–66.

Ben Taieb, Souhaib, and Amir F. Atiya. 2016. “A Bias and Variance Analysis for Multistep-Ahead Time Series Forecasting.” IEEE Transactions on Neural Networks and Learning Systems 27 (1): 62–76.

Boulanger-Lewandowski, Nicolas, Yoshua Bengio, and Pascal Vincent. 2012. “Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription.” In 29th International Conference on Machine Learning.

Bown, Oliver, and Sebastian Lexer. 2006. “Continuous-Time Recurrent Neural Networks for Generative and Interactive Musical Performance.” In Applications of Evolutionary Computing, edited by Franz Rothlauf, Jürgen Branke, Stefano Cagnoni, Ernesto Costa, Carlos Cotta, Rolf Drechsler, Evelyne Lutton, et al., 652–63. Lecture Notes in Computer Science 3907. Springer Berlin Heidelberg.

Buhusi, Catalin V., and Warren H. Meck. 2005. “What Makes Us Tick? Functional and Neural Mechanisms of Interval Timing.” Nature Reviews Neuroscience 6 (10): 755–65.

Chang, Bo, Minmin Chen, Eldad Haber, and Ed H. Chi. 2019. “AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks.” In Proceedings of ICLR.

Charles, Adam, Dong Yin, and Christopher Rozell. 2016. “Distributed Sequence Memory of Multidimensional Inputs in Recurrent Networks.” May 26, 2016.

Chevillon, Guillaume. 2007. “Direct Multi-Step Estimation and Forecasting.” Journal of Economic Surveys 21 (4): 746–85.

Cho, Kyunghyun, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. “Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation.” In EMNLP 2014.

Cho, Kyunghyun, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. “On the Properties of Neural Machine Translation: Encoder-Decoder Approaches.” 2014.

Chung, Junyoung, Sungjin Ahn, and Yoshua Bengio. 2016. “Hierarchical Multiscale Recurrent Neural Networks.” September 6, 2016.

Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.” In NIPS.

Chung, Junyoung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. “Gated Feedback Recurrent Neural Networks.” In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, 2067–75. ICML’15.

Chung, Junyoung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. 2015. “A Recurrent Latent Variable Model for Sequential Data.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2980–8. Curran Associates, Inc.

Collins, Jasmine, Jascha Sohl-Dickstein, and David Sussillo. 2016. “Capacity and Trainability in Recurrent Neural Networks.” In.

Cooijmans, Tim, Nicolas Ballas, César Laurent, Çağlar Gülçehre, and Aaron Courville. 2016. “Recurrent Batch Normalization.” 2016.

Dasgupta, Sakyasingha, Takayuki Yoshizumi, and Takayuki Osogami. 2016. “Regularized Dynamic Boltzmann Machine with Delay Pruning for Unsupervised Learning of Temporal Sequences.” September 22, 2016.

Doelling, Keith B., and David Poeppel. 2015. “Cortical Entrainment to Music and Its Modulation by Expertise.” Proceedings of the National Academy of Sciences 112 (45): E6233–E6242.

Elman, Jeffrey L. 1990. “Finding Structure in Time.” Cognitive Science 14: 179–211.

Fortunato, Meire, Charles Blundell, and Oriol Vinyals. 2017. “Bayesian Recurrent Neural Networks.” April 10, 2017.

Fraccaro, Marco, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. 2016. “Sequential Neural Models with Stochastic Layers.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2199–2207. Curran Associates, Inc.

Gal, Yarin, and Zoubin Ghahramani. 2016. “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.” In.

Gers, Felix A., Jürgen Schmidhuber, and Fred Cummins. 2000. “Learning to Forget: Continual Prediction with LSTM.” Neural Computation 12 (10): 2451–71.

Gers, Felix A., Nicol N. Schraudolph, and Jürgen Schmidhuber. 2002. “Learning Precise Timing with LSTM Recurrent Networks.” Journal of Machine Learning Research 3 (Aug): 115–43.

Graves, Alex. 2011. “Practical Variational Inference for Neural Networks.” In Proceedings of the 24th International Conference on Neural Information Processing Systems, 2348–56. NIPS’11. USA: Curran Associates Inc.

———. 2012. Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence, v. 385. Heidelberg ; New York: Springer.

———. 2013. “Generating Sequences with Recurrent Neural Networks.” August 4, 2013.

Gregor, Karol, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. 2015. “DRAW: A Recurrent Neural Network for Image Generation.” February 16, 2015.

Gruslys, Audrunas, Remi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. 2016. “Memory-Efficient Backpropagation Through Time.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 4125–33. Curran Associates, Inc.

Grzyb, B. J., E. Chinellato, G. M. Wojcik, and W. A. Kaminski. 2009. “Which Model to Use for the Liquid State Machine?” In 2009 International Joint Conference on Neural Networks, 1018–24.

Hazan, Elad, Karan Singh, and Cyril Zhang. 2017. “Learning Linear Dynamical Systems via Spectral Filtering.” In NIPS.

Hazan, Hananel, and Larry M. Manevitz. 2012. “Topological Constraints and Robustness in Liquid State Machines.” Expert Systems with Applications 39 (2): 1597–1606.

He, Kun, Yan Wang, and John Hopcroft. 2016. “A Powerful Generative Model Using Random Weights for the Deep Image Representation.” In Advances in Neural Information Processing Systems.

Hinton, G., Li Deng, Dong Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, et al. 2012. “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups.” IEEE Signal Processing Magazine 29 (6): 82–97.

Hochreiter, Sepp. 1998. “The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions.” International Journal of Uncertainty Fuzziness and Knowledge Based Systems 6: 107–15.

Hochreiter, Sepp, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. “Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies.” In A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997a. “LSTM Can Solve Hard Long Time Lag Problems.” In Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference, 473–79.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997b. “Long Short-Term Memory.” Neural Computation 9 (8): 1735–80.

Huszár, Ferenc. 2015. “How (Not) to Train Your Generative Model: Scheduled Sampling, Likelihood, Adversary?” November 16, 2015.

Jaeger, Herbert. 2002. Tutorial on Training Recurrent Neural Networks, Covering BPPT, RTRL, EKF and the “Echo State Network” Approach. Vol. 5. GMD-Forschungszentrum Informationstechnik.

Jing, Li, Yichen Shen, Tena Dubcek, John Peurifoy, Scott Skirlo, Yann LeCun, Max Tegmark, and Marin Soljačić. 2017. “Tunable Efficient Unitary Neural Networks (EUNN) and Their Application to RNNs.” In PMLR, 1733–41.

Jozefowicz, Rafal, Wojciech Zaremba, and Ilya Sutskever. 2015. “An Empirical Exploration of Recurrent Network Architectures.” In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2342–50.

Karpathy, Andrej, Justin Johnson, and Li Fei-Fei. 2015. “Visualizing and Understanding Recurrent Networks.” June 5, 2015.

Katharopoulos, Angelos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. “Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention.” August 31, 2020.

Kingma, Diederik P., Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. “Improving Variational Inference with Inverse Autoregressive Flow.” In Advances in Neural Information Processing Systems 29. Curran Associates, Inc.

Koutník, Jan, Klaus Greff, Faustino Gomez, and Jürgen Schmidhuber. 2014. “A Clockwork RNN.” February 14, 2014.

Krishnan, Rahul G., Uri Shalit, and David Sontag. 2015. “Deep Kalman Filters.” 2015.

Lamb, Alex, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, and Yoshua Bengio. 2016. “Professor Forcing: A New Algorithm for Training Recurrent Networks.” In Advances in Neural Information Processing Systems.

Laurent, Thomas, and James von Brecht. 2016. “A Recurrent Neural Network Without Chaos.” December 19, 2016.

LeCun, Y. 1998. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE 86 (11): 2278–2324.

Legenstein, Robert, Christian Naeger, and Wolfgang Maass. 2005. “What Can a Neuron Learn with Spike-Timing-Dependent Plasticity?” Neural Computation 17 (11): 2337–82.

Lipton, Zachary C., John Berkowitz, and Charles Elkan. 2015. “A Critical Review of Recurrent Neural Networks for Sequence Learning.” May 29, 2015.

Lukoševičius, Mantas, and Herbert Jaeger. 2009. “Reservoir Computing Approaches to Recurrent Neural Network Training.” Computer Science Review 3 (3): 127–49.

Maass, W., T. Natschläger, and H. Markram. 2004. “Computational Models for Generic Cortical Microcircuits.” In Computational Neuroscience: A Comprehensive Approach, 575–605. Chapman & Hall/CRC.

MacKay, Matthew, Paul Vicol, Jimmy Ba, and Roger Grosse. 2018. “Reversible Recurrent Neural Networks.” In Advances in Neural Information Processing Systems.

Maddison, Chris J., Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Whye Teh. 2017. “Filtering Variational Objectives.” 2017.

Martens, James. 2010. “Deep Learning via Hessian-Free Optimization.” In Proceedings of the 27th International Conference on Machine Learning, 735–42. ICML’10. USA: Omnipress.

Martens, James, and Ilya Sutskever. 2011. “Learning Recurrent Neural Networks with Hessian-Free Optimization.” In Proceedings of the 28th International Conference on Machine Learning, 1033–40. ICML’11. USA: Omnipress.

———. 2012. “Training Deep and Recurrent Networks with Hessian-Free Optimization.” In Neural Networks: Tricks of the Trade, 479–535. Lecture Notes in Computer Science. Springer.

Mhammedi, Zakaria, Andrew Hellicar, Ashfaqur Rahman, and James Bailey. 2017. “Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections.” In PMLR, 2401–9.

Mikolov, Tomáš, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. “Recurrent Neural Network Based Language Model.” In Eleventh Annual Conference of the International Speech Communication Association.

Miller, John, and Moritz Hardt. 2018. “When Recurrent Models Don’t Need to Be Recurrent.” May 25, 2018.

Mnih, V. 2015. “Human-Level Control Through Deep Reinforcement Learning.” Nature 518: 529–33.

Mohamed, A.-R., G. E. Dahl, and G. Hinton. 2012. “Acoustic Modeling Using Deep Belief Networks.” IEEE Transactions on Audio, Speech, and Language Processing 20 (1): 14–22.

Monner, Derek, and James A. Reggia. 2012. “A Generalized LSTM-Like Training Algorithm for Second-Order Recurrent Neural Networks.” Neural Networks 25 (January): 70–83.

Nussbaum-Thom, Markus, Jia Cui, Bhuvana Ramabhadran, and Vaibhava Goel. 2016. “Acoustic Modeling Using Bidirectional Gated Recurrent Convolutional Units.” In, 390–94.

Oliva, Junier B., Barnabas Poczos, and Jeff Schneider. 2017. “The Statistical Recurrent Unit.” March 1, 2017.

Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. 2013. “On the Difficulty of Training Recurrent Neural Networks.” In, 1310–8.

Patraucean, Viorica, Ankur Handa, and Roberto Cipolla. 2015. “Spatio-Temporal Video Autoencoder with Differentiable Memory.” November 19, 2015.

Pillonetto, Gianluigi. 2016. “The Interplay Between System Identification and Machine Learning.” December 29, 2016.

Ravanbakhsh, Siamak, Jeff Schneider, and Barnabas Poczos. 2016. “Deep Learning with Sets and Point Clouds.” In.

Roberts, Adam, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. 2018. “A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music.” March 13, 2018.

Rohrbach, Anna, Marcus Rohrbach, and Bernt Schiele. 2015. “The Long-Short Story of Movie Description.” June 4, 2015.

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36.

Ryder, Thomas, Andrew Golightly, A. Stephen McGough, and Dennis Prangle. 2018. “Black-Box Variational Inference for Stochastic Differential Equations.” February 9, 2018.

Sjöberg, Jonas, Qinghua Zhang, Lennart Ljung, Albert Benveniste, Bernard Delyon, Pierre-Yves Glorennec, Håkan Hjalmarsson, and Anatoli Juditsky. 1995. “Nonlinear Black-Box Modeling in System Identification: A Unified Overview.” Automatica, Trends in System Identification, 31 (12): 1691–1724.

Song, Yang, Chenlin Meng, Renjie Liao, and Stefano Ermon. 2020. “Nonlinear Equation Solving: A Faster Alternative to Feedforward Computation.” February 10, 2020.

Steil, J. J. 2004. “Backpropagation-Decorrelation: Online Recurrent Learning with O(N) Complexity.” In 2004 IEEE International Joint Conference on Neural Networks, 2004. Proceedings, 2:843–48 vol.2.

Surace, Simone Carlo, and Jean-Pascal Pfister. 2016. “Online Maximum Likelihood Estimation of the Parameters of Partially Observed Diffusion Processes.” In.

Tallec, Corentin, and Yann Ollivier. 2017. “Unbiasing Truncated Backpropagation Through Time.” May 23, 2017.

Taylor, Graham W., Geoffrey E. Hinton, and Sam T. Roweis. 2006. “Modeling Human Motion Using Binary Latent Variables.” In Advances in Neural Information Processing Systems, 1345–52.

Theis, Lucas, and Matthias Bethge. 2015. “Generative Image Modeling Using Spatial LSTMs.” June 10, 2015.

Visin, Francesco, Kyle Kastner, Kyunghyun Cho, Matteo Matteucci, Aaron Courville, and Yoshua Bengio. 2015. “ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks.” May 3, 2015.

Wen, Ruofeng, Kari Torkkola, and Balakrishnan Narayanaswamy. 2017. “A Multi-Horizon Quantile Recurrent Forecaster.” November 29, 2017.

Werbos, Paul J. 1988. “Generalization of Backpropagation with Application to a Recurrent Gas Market Model.” Neural Networks 1 (4): 339–56.

Werbos, P. J. 1990. “Backpropagation Through Time: What It Does and How to Do It.” Proceedings of the IEEE 78 (10): 1550–60.

Williams, Ronald J., and Jing Peng. 1990. “An Efficient Gradient-Based Algorithm for on-Line Training of Recurrent Network Trajectories.” Neural Computation 2 (4): 490–501.

Williams, Ronald J., and David Zipser. 1989. “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks.” Neural Computation 1 (2): 270–80.

Wisdom, Scott, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. 2016. “Full-Capacity Unitary Recurrent Neural Networks.” In Advances in Neural Information Processing Systems, 4880–8.

Wisdom, Scott, Thomas Powers, James Pitton, and Les Atlas. 2016. “Interpretable Recurrent Neural Networks Using Sequential Sparse Recovery.” In Advances in Neural Information Processing Systems 29.

Wu, Yuhuai, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R Salakhutdinov. 2016. “On Multiplicative Integration with Recurrent Neural Networks.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2856–64. Curran Associates, Inc.

Yao, Li, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. “Describing Videos by Exploiting Temporal Structure.” February 27, 2015.