# Recurrent / convolutional / state-space

Translating between means of approximating time series dynamics

April 5, 2016 — March 5, 2024

A meeting point for some related ideas from different fields. Perspectives on analysing systems in terms of a latent, noisy state, and/or their history of noisy observations. This notebook is dedicated to the possibly-surprising fact we can move between *hidden-state*-type representations, and *observed-state-only* representations, and indeed mix them together conveniently. I ~~have~~ had many thoughts about this, but they are largely irrelevant now since the S4 family came along and effectively actioned all of them

## 1 Linear systems

See linear feedback systems and linear filter design. for stuff about FIR vs IIR filters.

### 1.1 Linear Time-Invariant systems

Let us talk about Fourier transforms and spectral properties.

## 2 Koopman operators

Learning state is pointless! infer directly from observations! See Koopmania.

## 3 RNNs

Miller and Hardt (2018)

See RNNs.

## 4 Stability of learning

Hochreiter et al. (2001); Hochreiter (1998);Lamb et al. (2016);Hardt, Ma, and Recht (2018) etc

## 5 Conversion between representations

## 6 S4

Interesting package of tools from Christopher Ré’s lab, at the intersection of recurrent networks and . See HazyResearch/state-spaces: Sequence Modeling with Structured State Spaces. I find these aesthetically satisfying, because I spent 2 years of my PhD trying to solve the same problem, and failed. These folks did a better job, so I find it slightly validating that the idea was not stupid. Gu et al. (2021):

Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency. We introduce a simple sequence model inspired by control systems that generalizes these approaches while addressing their shortcomings. The Linear State-Space Layer (LSSL) maps a sequence u↦y by simply simulating a linear continuous-time state-space representation x˙=Ax+Bu,y=Cx+Du. Theoretically, we show that LSSL models are closely related to the three aforementioned families of models and inherit their strengths. For example, they generalize convolutions to continuous-time, explain common RNN heuristics, and share features of NDEs such as time-scale adaptation. We then incorporate and generalize recent theory on continuous-time memorization to introduce a trainable subset of structured matrices A that endow LSSLs with long-range memory. Empirically, stacking LSSL layers into a simple deep neural network obtains state-of-the-art results across time series benchmarks for long dependencies in sequential image classification, real-world healthcare regression tasks, and speech. On a difficult speech classification task with length-16000 sequences, LSSL outperforms prior approaches by 24 accuracy points, and even outperforms baselines that use hand-crafted features on 100x shorter sequences.

Gu, Goel, and Ré (2021):

A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of 10000 or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) $ x’(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t) $, and showed that for appropriate choices of the state matrix $ A $, this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning $ A $ with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation 60× faster (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.

Related? Li et al. (2022) Interesting parallel to the recursive/non-recursive transformer duality in How the RWKV language models. Question: Can they do the jobs of transformers?

Nearly (Vardasbi et al. 2023; Nishikawa and Suzuki 2024).

But actually, yes. See

### 6.1 Mamba

See Mamba

## 7 Incoming

Simchowitz, Boczar, and Recht (2019)

We analyze a simple prefiltered variation of the least squares estimator for the problem of estimation with biased, semi-parametric noise, an error model studied more broadly in causal statistics and active learning. We prove an oracle inequality which demonstrates that this procedure provably mitigates the variance introduced by long-term dependencies. We then demonstrate that prefiltered least squares yields, to our knowledge, the first algorithm that provably estimates the parameters of partially-observed linear systems that attains rates which do not not incur a worst-case dependence on the rate at which these dependencies decay. The algorithm is provably consistent even for systems which satisfy the weaker marginal stability condition obeyed by many classical models based on Newtonian mechanics. In this context, our semi-parametric framework yields guarantees for both stochastic and worst-case noise.

## 8 References

*Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48*. ICML’16.

*IEEE Signal Processing Magazine*.

*IEEE transactions on neural networks and learning systems*.

*IEEE Transactions on Neural Networks*.

*Journal of Machine Learning Research*.

*Neural Networks (IJCNN), 2016 International Joint Conference on*.

*Proceedings of ICLR*.

*PRoceedings of ICLR*.

*arXiv:1709.03698 [Cs, Stat]*.

*arXiv:1609.01704 [Cs]*.

*Advances in Neural Information Processing Systems 28*.

*arXiv:1611.09913 [Cs, Stat]*.

*arXiv Preprint arXiv:1603.09025*.

*arXiv:1902.01388 [Cs, Stat]*.

*Sequential Monte Carlo Methods in Practice*.

*Advances in Neural Information Processing Systems 29*.

*IEEE Transactions on Signal Processing*.

*The Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI2007)*.

*Advances in Neural Information Processing Systems*.

*Inverse Problems*.

*The Journal of Machine Learning Research*.

*Kalman Filtering and Neural Networks*. Adaptive and Learning Systems for Signal Processing, Communications, and Control.

*NIPS*.

*arXiv:2004.09455 [Stat]*.

*International Journal of Uncertainty Fuzziness and Knowledge Based Systems*.

*A Field Guide to Dynamical Recurrent Neural Networks*.

*Neural Computation*.

*Sequential Monte Carlo Methods in Practice*. Statistics for Engineering and Information Science.

*The Annals of Statistics*.

*Proceedings of the National Academy of Sciences*.

*Tutorial on Training Recurrent Neural Networks, Covering BPPT, RTRL, EKF and the” Echo State Network” Approach*.

*PMLR*.

*Linear Systems*. Prentice-Hall Information and System Science Series.

*Linear Estimation*. Prentice Hall Information and System Sciences Series.

*Advances in Neural Information Processing Systems*.

*Advances in Neural Information Processing Systems 29*.

*Advances in Neural Information Processing Systems*.

*Physical Review. X*.

*arXiv Preprint arXiv:1511.05121*.

*Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence*.

*arXiv Preprint arXiv:1508.06818*.

*BMC Neuroscience*.

*Advances In Neural Information Processing Systems*.

*arXiv:1612.06212 [Cs]*.

*arXiv:1602.07320 [Cs]*.

*System Identification: Theory for the User*. Prentice Hall Information and System Sciences Series.

*Theory and Practice of Recursive Identification*. The MIT Press Series in Signal Processing, Optimization, and Control 4.

*Advances In Neural Information Processing Systems*.

*IEEE Transactions on Signal Processing*.

*Proceedings of the 28th International Conference on International Conference on Machine Learning*. ICML’11.

*IEEE Signal Processing Magazine*.

*42nd IEEE International Conference on Decision and Control (IEEE Cat. No.03CH37475)*.

*Proceedings of International Conference on Learning Representations (ICLR) 2017*.

*PMLR*.

*arXiv:1805.10369 [Cs, Stat]*.

*Advances in Water Resources*.

*Neural Computation*.

*Perspectives in Robust Control*. Lecture Notes in Control and Information Sciences.

*arXiv:1803.05428 [Cs, Eess, Stat]*.

*IEEE Transactions on Signal Processing*.

*Automatica*.

*arXiv:1902.00768 [Cs, Math, Stat]*.

*Automatica*, Trends in System Identification,.

*Nonlinear Dynamics and Statistics*.

*System Identification*.

*arXiv:1805.04955 [Cs, Stat]*.

*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*.

*PMLR*.

*Proceedings of International Conference on Learning Representations (ICLR) 2017*.

*IEEE Transactions on Audio and Electroacoustics*.

*Neural Networks*.

*Proceedings of the IEEE*.

*IEEE Transactions on Information Theory*.

*Neural Computation*.

*Advances in Neural Information Processing Systems 29*.

*IEEE Signal Processing Magazine*.

*Proceedings of the Twentieth International Conference on International Conference on Machine Learning*. ICML’03.