Recurrent / convolutional / state-space

Translating between means of approximating time series dynamics

A meeting point for some related ideas from different fields. Perspectives on analysing systems in terms of a latent, noisy state, and/or their history of noisy observations. This notebook is dedicated to the possibly-surprising fact we can move between hidden-state-type representations, and observed-state-only representations, and indeed mix them together conveniently. I have many thoughts about this, but nothing concrete to write down at the moment.

State space models

Linear systems

See linear feedback systems and linear filter design. for stuff about FIR vs IIR filters.

Linear Time-Invariant systems

Let us talk about Fourier transforms and spectral properties.

Koopman operators

Learning state is pointless! infer directly from observations! See Koopmania.


Miller and Hardt (2018)

See RNNs.


Stability of learning

Hochreiter et al. (2001); Hochreiter (1998); Lamb et al. (2016);Hardt, Ma, and Recht (2018) etc

Stability of dynamics

Conversion between representations


Interesting package of tools from Christopher Ré’s lab, at the intersection of recurrent networks and . See HazyResearch/state-spaces: Sequence Modeling with Structured State Spaces. I find these aesthetically satisfying, because I spent 2 years of my PhD trying to solve the same problem, and failed. These folks did a better job, so I find it slightly validating that the idea was not stupid. Gu et al. (2021):

Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency. We introduce a simple sequence model inspired by control systems that generalizes these approaches while addressing their shortcomings. The Linear State-Space Layer (LSSL) maps a sequence u↦y by simply simulating a linear continuous-time state-space representation x˙=Ax+Bu,y=Cx+Du. Theoretically, we show that LSSL models are closely related to the three aforementioned families of models and inherit their strengths. For example, they generalize convolutions to continuous-time, explain common RNN heuristics, and share features of NDEs such as time-scale adaptation. We then incorporate and generalize recent theory on continuous-time memorization to introduce a trainable subset of structured matrices A that endow LSSLs with long-range memory. Empirically, stacking LSSL layers into a simple deep neural network obtains state-of-the-art results across time series benchmarks for long dependencies in sequential image classification, real-world healthcare regression tasks, and speech. On a difficult speech classification task with length-16000 sequences, LSSL outperforms prior approaches by 24 accuracy points, and even outperforms baselines that use hand-crafted features on 100x shorter sequences.

Gu, Goel, and Ré (2021):

A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of 10000 or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \(x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t)\), and showed that for appropriate choices of the state matrix \(A\), this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning \(A\) with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation 60× faster (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.

Related? Li et al. (2022)

Interesting parallel to the recursive/non-recursive transformer duality in How the RWKV language models. Question: Can they do the jobs of transformers? Nearly (Vardasbi et al. 2023).


Arjovsky, Martin, Amar Shah, and Yoshua Bengio. 2016. Unitary Evolution Recurrent Neural Networks.” In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, 1120–28. ICML’16. New York, NY, USA:
Atal, B. S. 2006. The History of Linear Prediction.” IEEE Signal Processing Magazine 23 (2): 154–61.
Ben Taieb, Souhaib, and Amir F. Atiya. 2016. A Bias and Variance Analysis for Multistep-Ahead Time Series Forecasting.” IEEE transactions on neural networks and learning systems 27 (1): 62–76.
Bengio, Y., P. Simard, and P. Frasconi. 1994. Learning Long-Term Dependencies with Gradient Descent Is Difficult.” IEEE Transactions on Neural Networks 5 (2): 157–66.
Bordes, Antoine, Léon Bottou, and Patrick Gallinari. 2009. SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent.” Journal of Machine Learning Research 10 (December): 1737–54.
Cakir, Emre, Ezgi Can Ozan, and Tuomas Virtanen. 2016. Filterbank Learning for Deep Neural Network Based Polyphonic Sound Event Detection.” In Neural Networks (IJCNN), 2016 International Joint Conference on, 3399–3406. IEEE.
Chang, Bo, Minmin Chen, Eldad Haber, and Ed H. Chi. 2019. AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks.” In Proceedings of ICLR.
Chang, Bo, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. 2018. Reversible Architectures for Arbitrarily Deep Residual Neural Networks.” In arXiv:1709.03698 [Cs, Stat].
Chang, Bo, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. 2018. Multi-Level Residual Networks from Dynamical Systems View.” In PRoceedings of ICLR.
Chung, Junyoung, Sungjin Ahn, and Yoshua Bengio. 2016. Hierarchical Multiscale Recurrent Neural Networks.” arXiv:1609.01704 [Cs], September.
Chung, Junyoung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. 2015. A Recurrent Latent Variable Model for Sequential Data.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2980–88. Curran Associates, Inc.
Collins, Jasmine, Jascha Sohl-Dickstein, and David Sussillo. 2016. Capacity and Trainability in Recurrent Neural Networks.” In arXiv:1611.09913 [Cs, Stat].
Cooijmans, Tim, Nicolas Ballas, César Laurent, Çağlar Gülçehre, and Aaron Courville. 2016. Recurrent Batch Normalization.” arXiv Preprint arXiv:1603.09025.
Doucet, Arnaud, Nando Freitas, and Neil Gordon. 2001. Sequential Monte Carlo Methods in Practice. New York, NY: Springer New York.
Fraccaro, Marco, Sø ren Kaae Sø nderby, Ulrich Paquet, and Ole Winther. 2016. Sequential Neural Models with Stochastic Layers.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2199–2207. Curran Associates, Inc.
Goodwin, M M, and M Vetterli. 1999. Matching Pursuit and Atomic Signal Models Based on Recursive Filter Banks.” IEEE Transactions on Signal Processing 47 (7): 1890–1902.
Grosse, Roger, Rajat Raina, Helen Kwong, and Andrew Y. Ng. 2007. Shift-Invariant Sparse Coding for Audio Classification.” In The Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI2007), 9:8.
Gu, Albert, Karan Goel, and Christopher Ré. 2021. Efficiently Modeling Long Sequences with Structured State Spaces.”
Gu, Albert, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. 2021. Combining Recurrent, Convolutional, and Continuous-Time Models with Linear State Space Layers.” In Advances in Neural Information Processing Systems, 34:572–85. Curran Associates, Inc.
Haber, Eldad, and Lars Ruthotto. 2018. Stable Architectures for Deep Neural Networks.” Inverse Problems 34 (1): 014004.
Hardt, Moritz, Tengyu Ma, and Benjamin Recht. 2018. Gradient Descent Learns Linear Dynamical Systems.” The Journal of Machine Learning Research 19 (1): 1025–68.
Haykin, Simon S., ed. 2001. Kalman Filtering and Neural Networks. Adaptive and Learning Systems for Signal Processing, Communications, and Control. New York: Wiley.
Hazan, Elad, Karan Singh, and Cyril Zhang. 2017. Learning Linear Dynamical Systems via Spectral Filtering.” In NIPS.
Heaps, Sarah E. 2020. Enforcing Stationarity Through the Prior in Vector Autoregressions.” arXiv:2004.09455 [Stat], April.
Hochreiter, Sepp. 1998. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions.” International Journal of Uncertainty Fuzziness and Knowledge Based Systems 6: 107–15.
Hochreiter, Sepp, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies.” In A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.
Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. Long Short-Term Memory.” Neural Computation 9 (8): 1735–80.
Hürzeler, Markus, and Hans R. Künsch. 2001. Approximating and Maximising the Likelihood for a General State-Space Model.” In Sequential Monte Carlo Methods in Practice, 159–75. Statistics for Engineering and Information Science. Springer, New York, NY.
Ionides, E. L., C. Bretó, and A. A. King. 2006. Inference for Nonlinear Dynamical Systems.” Proceedings of the National Academy of Sciences 103 (49): 18438–43.
Ionides, Edward L., Anindya Bhadra, Yves Atchadé, and Aaron King. 2011. Iterated Filtering.” The Annals of Statistics 39 (3): 1776–1802.
Jaeger, Herbert. 2002. Tutorial on Training Recurrent Neural Networks, Covering BPPT, RTRL, EKF and the” Echo State Network” Approach. Vol. 5. GMD-Forschungszentrum Informationstechnik.
Jing, Li, Yichen Shen, Tena Dubcek, John Peurifoy, Scott Skirlo, Yann LeCun, Max Tegmark, and Marin Soljačić. 2017. Tunable Efficient Unitary Neural Networks (EUNN) and Their Application to RNNs.” In PMLR, 1733–41.
Kailath, Thomas. 1980. Linear Systems. Prentice-Hall Information and System Science Series. Englewood Cliffs, N.J: Prentice-Hall.
Kailath, Thomas, Ali H. Sayed, and Babak Hassibi. 2000. Linear Estimation. Prentice Hall Information and System Sciences Series. Upper Saddle River, N.J: Prentice Hall.
Kaul, Shiva. 2020. Linear Dynamical Systems as a Core Computational Primitive.” In Advances in Neural Information Processing Systems. Vol. 33.
Kingma, Diederik P., Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. Improving Variational Inference with Inverse Autoregressive Flow.” In Advances in Neural Information Processing Systems 29. Curran Associates, Inc.
Krishnan, Rahul G., Uri Shalit, and David Sontag. 2015. Deep Kalman Filters.” arXiv Preprint arXiv:1511.05121.
———. 2017. Structured Inference Networks for Nonlinear State Space Models.” In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2101–9.
Kutschireiter, Anna, Simone Carlo Surace, Henning Sprekeler, and Jean-Pascal Pfister. 2015a. “A Neural Implementation for Nonlinear Filtering.” arXiv Preprint arXiv:1508.06818.
Kutschireiter, Anna, Simone C Surace, Henning Sprekeler, and Jean-Pascal Pfister. 2015b. Approximate Nonlinear Filtering with a Recurrent Neural Network.” BMC Neuroscience 16 (Suppl 1): P196.
Lamb, Alex, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, and Yoshua Bengio. 2016. Professor Forcing: A New Algorithm for Training Recurrent Networks.” In Advances In Neural Information Processing Systems.
Laurent, Thomas, and James von Brecht. 2016. A Recurrent Neural Network Without Chaos.” arXiv:1612.06212 [Cs], December.
Li, Yuhong, Tianle Cai, Yi Zhang, Deming Chen, and Debadeepta Dey. 2022. What Makes Convolutional Models Great on Long Sequence Modeling? arXiv.
Lipton, Zachary C. 2016. Stuck in a What? Adventures in Weight Space.” arXiv:1602.07320 [Cs], February.
Ljung, Lennart. 1999. System Identification: Theory for the User. 2nd ed. Prentice Hall Information and System Sciences Series. Upper Saddle River, NJ: Prentice Hall PTR.
Ljung, Lennart, and Torsten Söderström. 1983. Theory and Practice of Recursive Identification. The MIT Press Series in Signal Processing, Optimization, and Control 4. Cambridge, Mass: MIT Press.
MacKay, Matthew, Paul Vicol, Jimmy Ba, and Roger Grosse. 2018. Reversible Recurrent Neural Networks.” In Advances In Neural Information Processing Systems.
Marelli, D., and Minyue Fu. 2010. A Recursive Method for the Approximation of LTI Systems Using Subband Processing.” IEEE Transactions on Signal Processing 58 (3): 1025–34.
Martens, James, and Ilya Sutskever. 2011. Learning Recurrent Neural Networks with Hessian-Free Optimization.” In Proceedings of the 28th International Conference on International Conference on Machine Learning, 1033–40. ICML’11. USA: Omnipress.
Mattingley, J., and S. Boyd. 2010. Real-Time Convex Optimization in Signal Processing.” IEEE Signal Processing Magazine 27 (3): 50–61.
Megretski, A. 2003. Positivity of Trigonometric Polynomials.” In 42nd IEEE International Conference on Decision and Control (IEEE Cat. No.03CH37475), 4:3814–3817 vol.4.
Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. 2017. SampleRNN: An Unconditional End-to-End Neural Audio Generation Model.” In Proceedings of International Conference on Learning Representations (ICLR) 2017.
Mhammedi, Zakaria, Andrew Hellicar, Ashfaqur Rahman, and James Bailey. 2017. Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections.” In PMLR, 2401–9.
Miller, John, and Moritz Hardt. 2018. When Recurrent Models Don’t Need To Be Recurrent.” arXiv:1805.10369 [Cs, Stat], May.
Moradkhani, Hamid, Soroosh Sorooshian, Hoshin V. Gupta, and Paul R. Houser. 2005. Dual State–Parameter Estimation of Hydrological Models Using Ensemble Kalman Filter.” Advances in Water Resources 28 (2): 135–47.
Nerrand, O., P. Roussel-Ragot, L. Personnaz, G. Dreyfus, and S. Marcos. 1993. Neural Networks and Nonlinear Adaptive Filtering: Unifying Concepts and New Algorithms.” Neural Computation 5 (2): 165–99.
Oliveira, Maurício C. de, and Robert E. Skelton. 2001. Stability Tests for Constrained Linear Systems.” In Perspectives in Robust Control, 241–57. Lecture Notes in Control and Information Sciences. Springer, London.
Roberts, Adam, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. 2018. A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music.” arXiv:1803.05428 [Cs, Eess, Stat], March.
Routtenberg, Tirza, and Joseph Tabrikian. 2010. Blind MIMO-AR System Identification and Source Separation with Finite-Alphabet.” IEEE Transactions on Signal Processing 58 (3): 990–1000.
Seuret, Alexandre, and Frédéric Gouaisbaut. 2013. Wirtinger-Based Integral Inequality: Application to Time-Delay Systems.” Automatica 49 (9): 2860–66.
Sjöberg, Jonas, Qinghua Zhang, Lennart Ljung, Albert Benveniste, Bernard Delyon, Pierre-Yves Glorennec, Håkan Hjalmarsson, and Anatoli Juditsky. 1995. Nonlinear Black-Box Modeling in System Identification: A Unified Overview.” Automatica, Trends in System Identification, 31 (12): 1691–1724.
Smith, Leonard A. 2000. “Disentangling Uncertainty and Error: On the Predictability of Nonlinear Systems.” In Nonlinear Dynamics and Statistics.
Söderström, T., and P. Stoica, eds. 1988. System Identification. Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
Stepleton, Thomas, Razvan Pascanu, Will Dabney, Siddhant M. Jayakumar, Hubert Soyer, and Remi Munos. 2018. Low-Pass Recurrent Neural Networks - A Memory Architecture for Longer-Term Correlation Discovery.” arXiv:1805.04955 [Cs, Stat], May.
Sutskever, Ilya. 2013. Training Recurrent Neural Networks.” PhD Thesis, Toronto, Ont., Canada, Canada: University of Toronto.
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper with Convolutions.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9.
Telgarsky, Matus. 2017. Neural Networks and Rational Functions.” In PMLR, 3387–93.
Thickstun, John, Zaid Harchaoui, and Sham Kakade. 2017. Learning Features of Music from Scratch.” In Proceedings of International Conference on Learning Representations (ICLR) 2017.
Vardasbi, Ali, Telmo Pessoa Pires, Robin M. Schmidt, and Stephan Peitz. 2023. State Spaces Aren’t Enough: Machine Translation Needs Attention.” arXiv.
Welch, Peter D. 1967. The Use of Fast Fourier Transform for the Estimation of Power Spectra: A Method Based on Time Averaging over Short, Modified Periodograms.” IEEE Transactions on Audio and Electroacoustics 15 (2): 70–73.
Werbos, Paul J. 1988. Generalization of Backpropagation with Application to a Recurrent Gas Market Model.” Neural Networks 1 (4): 339–56.
———. 1990. Backpropagation Through Time: What It Does and How to Do It.” Proceedings of the IEEE 78 (10): 1550–60.
Wiatowski, Thomas, Philipp Grohs, and Helmut Bölcskei. 2018. Energy Propagation in Deep Convolutional Neural Networks.” IEEE Transactions on Information Theory 64 (7): 1–1.
Williams, Ronald J., and Jing Peng. 1990. An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories.” Neural Computation 2 (4): 490–501.
Wisdom, Scott, Thomas Powers, James Pitton, and Les Atlas. 2016. Interpretable Recurrent Neural Networks Using Sequential Sparse Recovery.” In Advances in Neural Information Processing Systems 29.
Yu, D., and L. Deng. 2011. Deep Learning and Its Applications to Signal and Information Processing [Exploratory DSP].” IEEE Signal Processing Magazine 28 (1): 145–54.
Zinkevich, Martin. 2003. Online Convex Programming and Generalized Infinitesimal Gradient Ascent.” In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, 928–35. ICML’03. Washington, DC, USA: AAAI Press.

No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.