A grab-bag of perspectives and tricks for recursive identification of dynamical systems, i.e. updating a model so that it produces correct forward predictions given the past.

Keywords: *multi-step prediction*, *time horizon*, *teacher forcing*.
The various things that are meant by “autoregressive”.

A common core of ideas pops up here in forecasting, state filtering, system identification (including particle versions), RNNs and forward operator learning. The Koopman operator could be described as an alternative perspective.

## Classic systems learning

Landmark papers according to Lindström et al. (2012):

Augmenting the unobserved state vector is a well known technique, used in the system identification community for decades, see e.g. (L. Ljung 1979; Lindström et al. 2008; Söderström and Stoica 1988). Similar ideas, using Sequential Monte Carlo methods, were suggested by (Kitagawa 1998; Liu and West 2001). Combined state and parameter estimation is also the standard technique for data assimilation in high-dimensional systems, see (Evensen 2009a, 2009b; Moradkhani et al. 2005).

However, introducing random walk dynamics to the parameters with fixed variance leads to a new dynamical stochastic system with properties that may be different from the properties of the original system. That implies that the variance of the random walk should be decreased, when the method is used for offline parameter estimation, cf. (Hürzeler and Künsch 2001).
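A minimal sketch of the augmented-state idea, assuming a toy scalar AR(1) model with an unknown coefficient `a`, a plain bootstrap particle filter, and a geometric schedule for shrinking the random-walk standard deviation. The function name `augmented_particle_filter`, the model, the prior and the schedule (`tau0`, `decay`) are all illustrative assumptions, not the specific schemes of the papers cited above (Liu and West, for example, use kernel shrinkage rather than this simple decay).

```python
import numpy as np

def augmented_particle_filter(y, n_particles=2000, tau0=0.05, decay=0.99,
                              sigma_x=0.5, sigma_y=0.5, seed=0):
    """Bootstrap particle filter on the augmented state (x_t, a_t).

    Assumed toy model: x_t = a * x_{t-1} + sigma_x * noise, y_t = x_t + sigma_y * noise.
    The unknown coefficient a is given artificial random-walk dynamics whose
    standard deviation tau shrinks geometrically (an illustrative schedule).
    Returns the filtered posterior mean of a at each time step.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, n_particles)     # initial state particles
    a = rng.uniform(-1.0, 1.0, n_particles)   # prior draws for the parameter
    a_means = np.empty(len(y))
    tau = tau0
    for t, y_t in enumerate(y):
        # Propagate the parameter with a random walk of decreasing variance.
        a = a + tau * rng.normal(size=n_particles)
        # Propagate the state using each particle's own parameter value.
        x = a * x + sigma_x * rng.normal(size=n_particles)
        # Weight by the Gaussian observation likelihood (computed in log space).
        logw = -0.5 * ((y_t - x) / sigma_y) ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        a_means[t] = np.sum(w * a)
        # Multinomial resampling of the joint (state, parameter) particles.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        x, a = x[idx], a[idx]
        tau *= decay
    return a_means

# Simulate data from the assumed model and run the filter.
rng = np.random.default_rng(1)
a_true, T = 0.8, 300
x_true = np.zeros(T)
for t in range(1, T):
    x_true[t] = a_true * x_true[t - 1] + 0.5 * rng.normal()
y = x_true + 0.5 * rng.normal(size=T)
a_hat = augmented_particle_filter(y)
print(f"final estimate of a: {a_hat[-1]:.3f} (true value {a_true})")
```

Setting `decay=1.0` recovers the fixed-variance random walk that the quoted passage warns about: the filter then targets a system with a time-varying parameter rather than the original one.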

## The pushforward trick

When writing Takamoto et al. (2022) we learned a useful way of thinking about this problem from Brandstetter, Worrall, and Welling (2022), which solved many difficulties at once for us.
They treat it as a distribution shift problem, one where we can reduce the magnitude of the implied shift using what they call the *pushforward trick*. In their words:

We approach the problem in probabilistic terms. The solver maps \(p_k \mapsto \mathcal{A}_{\sharp} p_k\) at iteration \(k+1\), where \(\mathcal{A}_{\sharp}: \mathbb{P}(X) \rightarrow \mathbb{P}(X)\) is the pushforward operator for \(\mathcal{A}\) and \(\mathbb{P}(X)\) is the space of distributions on \(X\). After a single test time iteration, the solver sees samples from \(\mathcal{A}_{\sharp} p_k\) instead of the distribution \(p_{k+1}\), and unfortunately \(\mathcal{A}_{\sharp} p_k \neq p_{k+1}\) because errors always survive training. The test time distribution is thus shifted, which we refer to as the distribution shift problem. This is a domain adaptation problem. We mitigate the distribution shift problem by adding a stability loss term, accounting for the distribution shift. A natural candidate is an adversarial-style loss

\[
L_{\text{stability}}=\mathbb{E}_k \mathbb{E}_{\mathbf{u}^{k+1} \mid \mathbf{u}^k, \mathbf{u}^k \sim p_k}\left[\mathbb{E}_{\boldsymbol{\epsilon} \mid \mathbf{u}^k}\left[\mathcal{L}\left(\mathcal{A}\left(\mathbf{u}^k+\boldsymbol{\epsilon}\right), \mathbf{u}^{k+1}\right)\right]\right]
\]

where \(\boldsymbol{\epsilon} \mid \mathbf{u}^k\) is an adversarial perturbation sampled from an appropriate distribution. For the perturbation distribution, we choose \(\boldsymbol{\epsilon}\) such that \(\left(\mathbf{u}^k+\boldsymbol{\epsilon}\right) \sim \mathcal{A}_{\sharp} p_k\). This can be easily achieved by using \(\left(\mathbf{u}^k+\boldsymbol{\epsilon}\right)=\mathcal{A}\left(\mathbf{u}^{k-1}\right)\) for \(\mathbf{u}^{k-1}\) one step causally preceding \(\mathbf{u}^k\). Our total loss is then \(L_{\text{one-step}}+L_{\text{stability}}\). We call this the pushforward trick.

We implement this by unrolling the solver for 2 steps but only backpropagating errors on the last unroll step, as shown in Figure. … This is not only faster, it also seems to be more stable. Exactly why, we are not sure, but we think it may be to ensure the perturbations are large enough. Training the adversarial distribution itself to minimize the error, defeats the purpose of using it as an adversarial distribution.

Adversarial losses were also introduced in Sanchez-Gonzalez et al. (2020) and later used in Mayr et al. (2023), where Brownian motion noise is used for \(\boldsymbol{\epsilon}\) and there is some similarity to Noisy Nodes (Godwin et al. 2022), where noise injection is found to stabilize training of deep graph neural networks. There are also connections with zero-stability (Hairer et al. 1993) from the ODE solver literature. Zero-stability is the condition that perturbations in the input conditions are damped out sublinearly in time, that is \(\left\|\mathcal{A}\left(\mathbf{u}^0+\boldsymbol{\epsilon}\right)-\mathbf{u}^1\right\|<\kappa\|\boldsymbol{\epsilon}\|\), for appropriate norm and small \(\kappa\). The pushforward trick can be seen to minimize \(\kappa\) directly.

That is an interesting justification for a very simple trick; we train better by using a *two steps forward, one step back* approach: unroll the solver for two steps but backpropagate through only the last one, so that the input to that last step is the *pushforward* of the previous state.
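A minimal sketch of the pushforward trick, assuming a toy linear "solver" \(\mathcal{A}(\mathbf{u}) = W\mathbf{u}\), a squared-error loss and hand-written gradients in NumPy. The function name `pushforward_loss_and_grad`, the toy dynamics `M` and the learning rate are illustrative assumptions. The key point is that the intermediate prediction `v` is treated as a constant, which is what a stop-gradient or detach operation would do in an autodiff framework.

```python
import numpy as np

def pushforward_loss_and_grad(W, u_prev, u_curr, u_next):
    """One-step loss plus pushforward stability loss for the linear model A(u) = W u.

    Returns (loss, grad), where grad is taken with respect to W while holding the
    first unrolled prediction v = A(u_prev) fixed (no gradient flows through it).
    """
    # One-step term: predict u_next from the true current state.
    r1 = W @ u_curr - u_next
    g1 = np.outer(r1, u_curr)
    # Stability term: predict u_next from the model's own previous output.
    v = W @ u_prev                 # first unroll step, treated as a constant
    r2 = W @ v - u_next
    g2 = np.outer(r2, v)           # gradient through the last unroll step only
    loss = 0.5 * (r1 @ r1 + r2 @ r2)
    return loss, g1 + g2

# Toy ground-truth dynamics (an assumption for illustration): a damped rotation.
theta = 0.3
M = 0.95 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
traj = [np.array([1.0, 0.0])]
for _ in range(50):
    traj.append(M @ traj[-1])
traj = np.stack(traj)              # shape (51, 2)

# Plain gradient descent over all consecutive triples (u^{k-1}, u^k, u^{k+1}).
W = np.eye(2)
for epoch in range(500):
    grad = np.zeros_like(W)
    for k in range(1, len(traj) - 1):
        _, g = pushforward_loss_and_grad(W, traj[k - 1], traj[k], traj[k + 1])
        grad += g
    W -= 0.05 * grad
print("learned W:\n", W)
print("true M:\n", M)
```

Backpropagating through `v` as well would turn this into ordinary two-step unrolled training; the quoted passage argues against that because it lets the model shrink its own perturbations.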


## Backpropagation through time

Backpropagation through time is how we usually discuss learning parameters in classic recurrent neural networks (Werbos 1988, 1990).

We can think of the problem of learning recurrent networks as essentially a system identification problem, with all the implied difficulties, including stability problems.

RNN research has its own special terminology, e.g. *vanishing/exploding gradients* (Bengio, Simard, and Frasconi 1994; Pascanu, Mikolov, and Bengio 2013).
There is also TBPTT (*truncated backpropagation through time*) (Williams and Zipser 1989), which makes explicit *with respect to when* gradients are taken, i.e. how far back in time each error is propagated.
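A minimal sketch of truncation, assuming a scalar linear RNN \(h_t = w h_{t-1} + x_t\) with loss \(\tfrac12\sum_t (h_t - y_t)^2\) and hand-written gradients. The names `forward`, `loss` and `tbptt_grad` and the toy model are illustrative assumptions. Each per-step error is pushed back at most `k` steps; with `k` at least the sequence length this is full backpropagation through time.

```python
import numpy as np

def forward(w, x):
    """Run the scalar linear RNN h_t = w * h_{t-1} + x_t, starting from zero."""
    h = np.zeros(len(x))
    prev = 0.0
    for t, x_t in enumerate(x):
        prev = w * prev + x_t
        h[t] = prev
    return h

def loss(w, x, y):
    """Sum of squared errors between hidden states and targets."""
    return 0.5 * np.sum((forward(w, x) - y) ** 2)

def tbptt_grad(w, x, y, k):
    """Gradient of the loss w.r.t. w, backpropagating each error at most k steps."""
    h = forward(w, x)
    h_prev = np.concatenate([[0.0], h[:-1]])   # h_{t-1} aligned with time t
    grad = 0.0
    for t in range(len(x)):
        delta = h[t] - y[t]                     # derivative of step-t loss w.r.t. h_t
        for s in range(t, max(t - k, -1), -1):
            grad += delta * h_prev[s]           # direct dependence of h_s on w
            delta *= w                          # push the error back: dh_s/dh_{s-1} = w
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = rng.normal(size=20)
w = 0.9
for k in (1, 5, 20):
    print(f"k={k:2d}  truncated gradient = {tbptt_grad(w, x, y, k):.4f}")
```

The factor `delta *= w` is also where vanishing and exploding gradients show up in this toy model: for \(|w| < 1\) the contributions from distant steps decay geometrically, and for \(|w| > 1\) they grow.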

## Method of adjoints

See method of adjoints.

## References

*Proceedings of The 35th Uncertainty in Artificial Intelligence Conference*, 799–808. PMLR.

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)* 72 (3): 269–342.

*arXiv:1511.07367 [Stat]*, November.

*Proceedings of the National Academy of Sciences* 111 (52): 18507–12.

*arXiv:1707.01069 [Cs, Stat]*, July.

*International Conference on Machine Learning*, 544–52.

*IEEE Transactions on Neural Networks* 5 (2): 157–66.

*Time Series Analysis: Forecasting and Control*. Fifth edition. Wiley Series in Probability and Statistics. Hoboken, New Jersey: John Wiley & Sons, Inc.

*International Conference on Learning Representations*.

*The Annals of Applied Statistics* 3 (1): 319–48.

*Proceedings of the National Academy of Sciences* 113 (15): 3932–37.

*SIAM Journal on Scientific Computing* 24 (3): 1076–89.

*Journal of Economic Surveys* 21 (4): 746–85.

*Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2980–88. Curran Associates, Inc.

*arXiv:2102.07850 [Cs, Stat]*, June.

*Journal of the Royal Statistical Society: Series B (Statistical Methodology)* 68 (3): 411–36.

*Statistics and Computing* 22 (5): 1009–20.

*Sequential Monte Carlo Methods in Practice*. New York, NY: Springer New York.

*arXiv:1304.5768 [Stat]*, April.

*Bayesian Analysis* 11 (2): 325–52.

*Time Series Analysis by State Space Methods*. 2nd ed. Oxford Statistical Science Series 38. Oxford: Oxford University Press.

*Bulletin of the American Meteorological Society* 78 (11): 2577–92.

*Ocean Dynamics* 53 (4): 343–67.

*Data Assimilation - The Ensemble Kalman Filter*. Berlin; Heidelberg: Springer.

*IEEE Control Systems* 29 (3): 83–104.

*Annual Review of Statistics and Its Application* 5 (1): 421–49.

*Journal of The Royal Society Interface* 7 (43): 271–83.

*arXiv:1411.5172 [Cs, Stat]*, November.

*International Journal of Forecasting*, Forecasting Long Memory Processes, 18 (2): 167–79.

*Sequential Monte Carlo Methods in Practice*, 159–75. Statistics for Engineering and Information Science. Springer, New York, NY.

*PMLR*, 1607–16.

*arXiv:1810.07951 [Cs]*, October.

*Proceedings of the National Academy of Sciences* 103 (49): 18438–43.

*The Annals of Statistics* 39 (3): 1776–1802.

*Proceedings of the National Academy of Sciences* 112 (3): 719–24.

*IFAC Proceedings Volumes*, 15th IFAC Symposium on System Identification, 42 (10): 774–85.

*Statistical Science* 30 (3): 328–51.

*Proceedings of the 38th International Conference on Machine Learning*, 5443–52. PMLR.

*arXiv:2005.08926 [Cs, Stat]*, November.

*Journal of the American Statistical Association*, 1203–15.

*arXiv Preprint arXiv:1511.05121*.

*Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence*, 2101–9.

*Advances In Neural Information Processing Systems*.

*arXiv Preprint arXiv:1705.10306*.

*Neural Computation* 17 (11): 2337–82.

*Ecology Letters* 10 (7): 551.

*Journal of the American Statistical Association* 105 (492): 1617–25.

*International Conference on Artificial Intelligence and Statistics*, 3870–82. PMLR.

*Current Opinion in Neurobiology*, Machine Learning, Big Data, and Neuroscience, 55 (April): 82–89.

*IFAC-PapersOnLine (System Identification, Volume 16)*, 45:1785–90. 16th IFAC Symposium on System Identification. IFAC & Elsevier Ltd.

*Computational Statistics & Data Analysis* 52 (6): 2877–91.

*Sequential Monte Carlo Methods in Practice*, 197–223. Statistics for Engineering and Information Science. Springer, New York, NY.

*IEEE Transactions on Automatic Control* 24 (1): 36–50.

*Stochastic Approximation and Optimization of Random Systems*. Vol. 17. Birkhäuser.

*Theory and Practice of Recursive Identification*. The MIT Press Series in Signal Processing, Optimization, and Control 4. Cambridge, Mass: MIT Press.

*arXiv Preprint arXiv:1705.09279*.

*arXiv:2004.12550 [Stat]*, October.

*Journal of Open Source Software* 4 (38): 1292.

*Advances in Water Resources* 28 (2): 135–47.

*arXiv Preprint arXiv:1705.11140*.

*arXiv:1703.00381 [Cs, Stat]*, March.

*arXiv:1211.5063 [Cs]*, 1310–18.

*arXiv:1812.01892 [Cs]*, December.

*Proceedings of the 37th International Conference on Machine Learning*, 8459–68. PMLR.

*Automatica*, Trends in System Identification, 31 (12): 1691–1724.

*System Identification*. Upper Saddle River, NJ, USA: Prentice-Hall, Inc.

*bioRxiv*, February, 272005.

*Monthly Weather Review* 131 (7): 1485–90.

*arXiv:1711.11053 [Stat]*, November.

*Neural Networks* 1 (4): 339–56.

*Proceedings of the IEEE* 78 (10): 1550–60.

*Neural Computation* 2 (4): 490–501.

*Neural Computation* 1 (2): 270–80.
