# Recurrent / convolutional / state-space

Translating between means of approximating time series dynamics

April 5, 2016 — March 5, 2024

Bayes
convolution
dynamical systems
functional analysis
linear algebra
machine learning
making things
music
networks
neural nets
nonparametric
probability
signal processing
sparser than thou
state space models
statistics
time series

A meeting point for some related ideas from different fields: perspectives on analysing systems in terms of a latent, noisy state, and/or their history of noisy observations. This notebook is dedicated to the possibly-surprising fact that we can move between hidden-state-type representations and observed-state-only representations, and indeed mix them together conveniently. I have had many thoughts about this, but they are largely irrelevant now, since the S4 family came along and effectively actioned all of them.

## 1 Linear systems

See linear feedback systems and linear filter design for stuff about FIR vs IIR filters.
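
As a toy illustration of the hidden-state versus observed-history duality in the FIR/IIR setting, here is a minimal sketch. It assumes a first-order IIR filter with made-up coefficients `a` and `b` (not taken from any reference here), and compares it with the FIR filter whose taps are that filter's truncated impulse response.

```python
def iir_first_order(u, a=0.9, b=1.0):
    """Recursive (IIR) form: y[t] = a*y[t-1] + b*u[t], starting from rest.

    The single number `prev` plays the role of a hidden state.
    """
    y, prev = [], 0.0
    for u_t in u:
        prev = a * prev + b * u_t
        y.append(prev)
    return y


def fir(u, taps):
    """Non-recursive (FIR) form: y[t] = sum_k taps[k] * u[t-k].

    Uses only the history of inputs, no carried state.
    """
    return [
        sum(h * u[t - k] for k, h in enumerate(taps) if t - k >= 0)
        for t in range(len(u))
    ]


def truncated_impulse_response(a=0.9, b=1.0, n_taps=64):
    """First n_taps samples of the IIR impulse response, b * a**k."""
    return [b * a**k for k in range(n_taps)]


u = [1.0, 0.5, -0.3, 0.0, 2.0, -1.0, 0.25, 0.0]
y_iir = iir_first_order(u)
# With as many taps as input samples, the FIR filter reproduces the IIR output.
y_fir = fir(u, truncated_impulse_response(n_taps=len(u)))
max_gap = max(abs(p - q) for p, q in zip(y_iir, y_fir))
```

For longer inputs the FIR version is only an approximation; with `abs(a) < 1` the neglected taps decay geometrically, so the truncation error shrinks as taps are added.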

### 1.1 Linear Time-Invariant systems

Let us talk about Fourier transforms and spectral properties.
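
One spectral property worth having in hand is the convolution theorem: an LTI system acts by convolution in time, which is pointwise multiplication in frequency. The following is a small sketch using a naive O(n²) DFT written out by hand, with circular convolution for simplicity; the signals `h` and `u` are arbitrary illustrative values.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def circular_convolve(h, u):
    """Time-domain circular convolution of two equal-length sequences."""
    n = len(u)
    return [sum(h[k] * u[(t - k) % n] for k in range(n)) for t in range(n)]


h = [0.5, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0, 0.0]  # impulse response
u = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0]   # input signal

y_time = circular_convolve(h, u)
# Multiply the spectra pointwise, then transform back.
y_freq = [z.real for z in idft([H * U for H, U in zip(dft(h), dft(u))])]
spectral_gap = max(abs(p - q) for p, q in zip(y_time, y_freq))
```

In practice one would use an FFT and zero-padding to get linear rather than circular convolution; the naive transform is here only to keep the sketch dependency-free.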

## 2 Koopman operators

Learning state is pointless! Infer directly from observations! See Koopmania.

## 3 RNNs

Miller and Hardt (2018)

See RNNs.

## 4 Stability of learning

Hochreiter et al. (2001); Hochreiter (1998); Lamb et al. (2016); Hardt, Ma, and Recht (2018), etc.
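
A toy picture of the classic instability, vanishing gradients through time, is sketched below. It assumes a scalar RNN `h_t = tanh(w * h_{t-1})` with an arbitrary weight `w`; this is an illustrative special case, not a model from the cited papers.

```python
import math


def rnn_state_gradient(w, h0, steps):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule through one step: d tanh(w*h_prev)/d h_prev = w * (1 - h**2)
        grad *= w * (1.0 - h * h)
    return grad


grad_short = rnn_state_gradient(w=0.9, h0=0.5, steps=5)
grad_long = rnn_state_gradient(w=0.9, h0=0.5, steps=100)
```

Each factor has magnitude at most `abs(w)`, so when `abs(w) < 1` the sensitivity to the initial state shrinks at least geometrically with the horizon; with larger weights and unsaturated units the same product can instead blow up.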

## 6 S4

Interesting package of tools from Christopher Ré’s lab, at the intersection of recurrent networks and . See HazyResearch/state-spaces: Sequence Modeling with Structured State Spaces. I find these aesthetically satisfying, because I spent 2 years of my PhD trying to solve the same problem, and failed. These folks did a better job, so I find it slightly validating that the idea was not stupid. Gu et al. (2021):

> Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency. We introduce a simple sequence model inspired by control systems that generalizes these approaches while addressing their shortcomings. The Linear State-Space Layer (LSSL) maps a sequence $u \mapsto y$ by simply simulating a linear continuous-time state-space representation $\dot{x} = Ax + Bu,\ y = Cx + Du$. Theoretically, we show that LSSL models are closely related to the three aforementioned families of models and inherit their strengths. For example, they generalize convolutions to continuous-time, explain common RNN heuristics, and share features of NDEs such as time-scale adaptation. We then incorporate and generalize recent theory on continuous-time memorization to introduce a trainable subset of structured matrices $A$ that endow LSSLs with long-range memory. Empirically, stacking LSSL layers into a simple deep neural network obtains state-of-the-art results across time series benchmarks for long dependencies in sequential image classification, real-world healthcare regression tasks, and speech. On a difficult speech classification task with length-16000 sequences, LSSL outperforms prior approaches by 24 accuracy points, and even outperforms baselines that use hand-crafted features on 100x shorter sequences.
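
The recurrent/convolutional equivalence the abstract alludes to can be checked numerically on a toy system. The sketch below is not the LSSL implementation: it assumes a made-up 2-state system, and it swaps in a crude Euler-style discretization where the papers use more careful schemes such as the bilinear transform. All function names are hypothetical.

```python
def mat_vec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def euler_discretize(A, B, dt):
    """Crude Euler-style step: A_bar = I + dt*A, B_bar = dt*B (an assumption)."""
    n = len(A)
    A_bar = [
        [(1.0 if i == j else 0.0) + dt * A[i][j] for j in range(n)]
        for i in range(n)
    ]
    return A_bar, [dt * b for b in B]


def run_recurrent(A_bar, B_bar, C, D, u):
    """RNN view: x[k] = A_bar x[k-1] + B_bar u[k], y[k] = C x[k] + D u[k]."""
    x, y = [0.0] * len(B_bar), []
    for u_k in u:
        x = [ax + b * u_k for ax, b in zip(mat_vec(A_bar, x), B_bar)]
        y.append(dot(C, x) + D * u_k)
    return y


def ssm_kernel(A_bar, B_bar, C, length):
    """Convolution kernel K[j] = C A_bar^j B_bar for j = 0..length-1."""
    kernel, v = [], list(B_bar)
    for _ in range(length):
        kernel.append(dot(C, v))
        v = mat_vec(A_bar, v)
    return kernel


def run_convolutional(kernel, D, u):
    """CNN view: y[k] = sum_j K[j] u[k-j] + D u[k], no state carried."""
    return [
        sum(kernel[j] * u[k - j] for j in range(k + 1)) + D * u[k]
        for k in range(len(u))
    ]


A = [[-0.5, 1.0], [-1.0, -0.5]]  # toy damped oscillator
B, C, D = [1.0, 0.0], [1.0, 1.0], 0.1
A_bar, B_bar = euler_discretize(A, B, dt=0.1)
u = [1.0, 0.0, -0.5, 2.0, 0.3, 0.0, 0.0, 1.0, -1.0, 0.5]

y_rnn = run_recurrent(A_bar, B_bar, C, D, u)
y_cnn = run_convolutional(ssm_kernel(A_bar, B_bar, C, len(u)), D, u)
ssm_gap = max(abs(p - q) for p, q in zip(y_rnn, y_cnn))
```

The two outputs agree up to floating-point error. The recurrent form needs constant memory per step, while the convolutional form can be evaluated in parallel over the whole sequence once the kernel is known; computing that kernel cheaply for long sequences is the hard part.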

Gu, Goel, and Ré (2021):

> A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of 10000 or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) $x'(t) = Ax(t) + Bu(t),\ y(t) = Cx(t) + Du(t)$, and showed that for appropriate choices of the state matrix $A$, this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning $A$ with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation 60× faster (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.

Related? Li et al. (2022). There is an interesting parallel to the recursive/non-recursive transformer duality in How the RWKV language models. Question: Can they do the jobs of transformers?

But actually, yes. See Mamba.

## 7 Incoming

• Simchowitz, Boczar, and Recht (2019)

> We analyze a simple prefiltered variation of the least squares estimator for the problem of estimation with biased, semi-parametric noise, an error model studied more broadly in causal statistics and active learning. We prove an oracle inequality which demonstrates that this procedure provably mitigates the variance introduced by long-term dependencies. We then demonstrate that prefiltered least squares yields, to our knowledge, the first algorithm that provably estimates the parameters of partially-observed linear systems that attains rates which do not incur a worst-case dependence on the rate at which these dependencies decay. The algorithm is provably consistent even for systems which satisfy the weaker marginal stability condition obeyed by many classical models based on Newtonian mechanics. In this context, our semi-parametric framework yields guarantees for both stochastic and worst-case noise.

## 8 References

Arjovsky, Shah, and Bengio. 2016. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48. ICML’16.
Atal. 2006. IEEE Signal Processing Magazine.
Ben Taieb, and Atiya. 2016. IEEE transactions on neural networks and learning systems.
Bengio, Simard, and Frasconi. 1994. IEEE Transactions on Neural Networks.
Bordes, Bottou, and Gallinari. 2009. Journal of Machine Learning Research.
Cai, Zhu, Wang, et al. 2024.
Cakir, Ozan, and Virtanen. 2016. In Neural Networks (IJCNN), 2016 International Joint Conference on.
Chang, Chen, Haber, et al. 2019. In Proceedings of ICLR.
Chang, Meng, Haber, Tung, et al. 2018. In Proceedings of ICLR.
Chang, Meng, Haber, Ruthotto, et al. 2018. In arXiv:1709.03698 [Cs, Stat].
Chung, Ahn, and Bengio. 2016. arXiv:1609.01704 [Cs].
Chung, Kastner, Dinh, et al. 2015. In Advances in Neural Information Processing Systems 28.
Collins, Sohl-Dickstein, and Sussillo. 2016. In arXiv:1611.09913 [Cs, Stat].
Cooijmans, Ballas, Laurent, et al. 2016. arXiv Preprint arXiv:1603.09025.
Dai, Lai, Yang, et al. 2019. arXiv:1902.01388 [Cs, Stat].
Doucet, Freitas, and Gordon. 2001. Sequential Monte Carlo Methods in Practice.
Fraccaro, Sønderby, Paquet, et al. 2016. In Advances in Neural Information Processing Systems 29.
Goodwin, and Vetterli. 1999. IEEE Transactions on Signal Processing.
Grosse, Raina, Kwong, et al. 2007. In The Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI2007).
Gu, and Dao. 2023.
Gu, Goel, and Ré. 2021.
Gu, Johnson, Goel, et al. 2021. In Advances in Neural Information Processing Systems.
Haber, and Ruthotto. 2018. Inverse Problems.
Hardt, Ma, and Recht. 2018. The Journal of Machine Learning Research.
Haykin, ed. 2001. Kalman Filtering and Neural Networks. Adaptive and Learning Systems for Signal Processing, Communications, and Control.
Hazan, Singh, and Zhang. 2017. In NIPS.
Heaps. 2020. arXiv:2004.09455 [Stat].
Hochreiter. 1998. International Journal of Uncertainty Fuzziness and Knowledge Based Systems.
Hochreiter, Bengio, Frasconi, et al. 2001. In A Field Guide to Dynamical Recurrent Neural Networks.
Hochreiter, and Schmidhuber. 1997. Neural Computation.
Hu, Baumann, Gui, et al. 2024.
Hürzeler, and Künsch. 2001. In Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science.
Ionides, Edward L., Bhadra, Atchadé, et al. 2011. The Annals of Statistics.
Ionides, E. L., Bretó, and King. 2006. Proceedings of the National Academy of Sciences.
Jing, Shen, Dubcek, et al. 2017. In PMLR.
Kailath. 1980. Linear Systems. Prentice-Hall Information and System Science Series.
Kailath, Sayed, and Hassibi. 2000. Linear Estimation. Prentice Hall Information and System Sciences Series.
Kaul. 2020. In Advances in Neural Information Processing Systems.
Kingma, Salimans, Jozefowicz, et al. 2016. In Advances in Neural Information Processing Systems 29.
Kolter, and Manek. 2019. In Advances in Neural Information Processing Systems.
Krishnamurthy, Can, and Schwab. 2022. Physical Review. X.
Krishnan, Shalit, and Sontag. 2015. arXiv Preprint arXiv:1511.05121.
———. 2017. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.
Kutschireiter, Surace, Sprekeler, et al. 2015a. “A Neural Implementation for Nonlinear Filtering.” arXiv Preprint arXiv:1508.06818.
Kutschireiter, Surace, Sprekeler, et al. 2015b. BMC Neuroscience.
Lamb, Goyal, Zhang, et al. 2016. In Advances In Neural Information Processing Systems.
Laurent, and von Brecht. 2016. arXiv:1612.06212 [Cs].
Li, Cai, Zhang, et al. 2022.
Lipton. 2016. arXiv:1602.07320 [Cs].
Ljung. 1999. System Identification: Theory for the User. Prentice Hall Information and System Sciences Series.
Ljung, and Söderström. 1983. Theory and Practice of Recursive Identification. The MIT Press Series in Signal Processing, Optimization, and Control 4.
MacKay, Vicol, Ba, et al. 2018. In Advances In Neural Information Processing Systems.
Marelli, and Fu. 2010. IEEE Transactions on Signal Processing.
Martens, and Sutskever. 2011. In Proceedings of the 28th International Conference on Machine Learning. ICML’11.
Mattingley, and Boyd. 2010. IEEE Signal Processing Magazine.
Megretski. 2003. In 42nd IEEE International Conference on Decision and Control (IEEE Cat. No.03CH37475).
Mehri, Kumar, Gulrajani, et al. 2017. In Proceedings of International Conference on Learning Representations (ICLR) 2017.
Mhammedi, Hellicar, Rahman, et al. 2017. In PMLR.
Miller, and Hardt. 2018. arXiv:1805.10369 [Cs, Stat].
Nerrand, Roussel-Ragot, Personnaz, et al. 1993. Neural Computation.
Nishikawa, and Suzuki. 2024.
Oliveira, and Skelton. 2001. In Perspectives in Robust Control. Lecture Notes in Control and Information Sciences.
Patro, and Agneeswaran. 2024.
Roberts, Engel, Raffel, et al. 2018. arXiv:1803.05428 [Cs, Eess, Stat].
Routtenberg, and Tabrikian. 2010. IEEE Transactions on Signal Processing.
Seuret, and Gouaisbaut. 2013. Automatica.
Simchowitz, Boczar, and Recht. 2019. arXiv:1902.00768 [Cs, Math, Stat].
Sjöberg, Zhang, Ljung, et al. 1995. Automatica, Trends in System Identification.
Smith. 2000. “Disentangling Uncertainty and Error: On the Predictability of Nonlinear Systems.” In Nonlinear Dynamics and Statistics.
Söderström, and Stoica, eds. 1988. System Identification.
Stepleton, Pascanu, Dabney, et al. 2018. arXiv:1805.04955 [Cs, Stat].
Sutskever. 2013.
Szegedy, Liu, Jia, et al. 2015. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Telgarsky. 2017. In PMLR.
Thickstun, Harchaoui, and Kakade. 2017. In Proceedings of International Conference on Learning Representations (ICLR) 2017.
Vardasbi, Pires, Schmidt, et al. 2023.
Welch. 1967. IEEE Transactions on Audio and Electroacoustics.
Werbos. 1988. Neural Networks.
———. 1990. Proceedings of the IEEE.
Wiatowski, Grohs, and Bölcskei. 2018. IEEE Transactions on Information Theory.
Williams, and Peng. 1990. Neural Computation.
Wisdom, Powers, Pitton, et al. 2016. In Advances in Neural Information Processing Systems 29.
Yu, and Deng. 2011. IEEE Signal Processing Magazine.
Zinkevich. 2003. In Proceedings of the Twentieth International Conference on Machine Learning. ICML’03.