Recursive identification

Learning forward dynamics by looking at time series.

A grab-bag of perspectives and tricks for recursive identification of dynamical systems, i.e. updating a model which produces the correct forward predictions, given the past.

Keywords: multi-step prediction, time horizon, teacher forcing. The various things that are meant by “autoregressive”.

A common core of ideas pops up in forecasting, state-filtering system identification (including the particle versions), RNNs and forward operator learning. The Koopman operator could be described as an alternative perspective on the same problem.

Classic systems learning

Landmark papers according to Lindström et al. (2012):

Augmenting the unobserved state vector is a well known technique, used in the system identification community for decades, see e.g. Ljung (1979), Lindström et al. (2008), and Söderström and Stoica (1988). Similar ideas, using Sequential Monte Carlo methods, were suggested by Kitagawa (1998) and Liu and West (2001). Combined state and parameter estimation is also the standard technique for data assimilation in high-dimensional systems, see Moradkhani et al. (2005) and Evensen (2009a, 2009b).

However, introducing random walk dynamics to the parameters with fixed variance leads to a new dynamical stochastic system with properties that may be different from the properties of the original system. That implies that the variance of the random walk should be decreased, when the method is used for offline parameter estimation, cf. (Hürzeler and Künsch 2001).
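To make the augmented-state idea concrete, here is a minimal sketch, not taken from any of the cited papers. We assume a toy scalar model \(x_t = a x_{t-1} + \text{noise}\), observed as \(y_t = x_t + \text{noise}\), and run a bootstrap particle filter on the augmented state \((x_t, a_t)\), where \(a_t = a_{t-1} + \eta_t\) is a random walk whose standard deviation shrinks over time. The model, the function names, the noise levels and the \(1/\sqrt{1+t}\) decay schedule are all illustrative assumptions.

```python
import math
import random


def simulate_ar1(a, n_steps, sigma_x, sigma_y, rng):
    """Noisy observations y_t = x_t + noise of x_t = a * x_{t-1} + noise."""
    x, ys = 0.0, []
    for _ in range(n_steps):
        x = a * x + rng.gauss(0.0, sigma_x)
        ys.append(x + rng.gauss(0.0, sigma_y))
    return ys


def augmented_particle_filter(ys, n_particles=1000, sigma_x=0.5, sigma_y=0.5,
                              jitter0=0.1, seed=0):
    """Bootstrap particle filter on the augmented state (x_t, a_t)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    params = [rng.uniform(-1.0, 1.0) for _ in range(n_particles)]  # prior on a
    estimates = []
    for t, y in enumerate(ys):
        # Random-walk parameter dynamics with a shrinking step size
        # (the 1/sqrt(1+t) schedule is an illustrative assumption).
        jitter = jitter0 / math.sqrt(1.0 + t)
        params = [a + rng.gauss(0.0, jitter) for a in params]
        # Propagate each state under its own particle's parameter.
        xs = [a * x + rng.gauss(0.0, sigma_x) for a, x in zip(params, xs)]
        # Gaussian observation log-weights, shifted so the largest weight is 1.
        logw = [-0.5 * ((y - x) / sigma_y) ** 2 for x in xs]
        top = max(logw)
        weights = [math.exp(lw - top) for lw in logw]
        # Multinomial resampling of (state, parameter) pairs together.
        idx = rng.choices(range(n_particles), weights=weights, k=n_particles)
        xs = [xs[i] for i in idx]
        params = [params[i] for i in idx]
        estimates.append(sum(params) / n_particles)
    return estimates


data_rng = random.Random(1)
ys = simulate_ar1(a=0.8, n_steps=300, sigma_x=0.5, sigma_y=0.5, rng=data_rng)
estimates = augmented_particle_filter(ys)
print(f"final estimate of a: {estimates[-1]:.3f} (simulated with a = 0.8)")
```

With a fixed `jitter` the filter would target a different stochastic system in which the parameter genuinely wanders, which is the concern raised above. Shrinking it is one crude way to approach the offline estimate.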

The pushforward trick

When writing Takamoto et al. (2022) we learned a useful way of thinking about this problem from Brandstetter, Worrall, and Welling (2022), which solved many difficulties at once for us. They think about it as a distribution shift problem, but one where we can reduce the magnitude of the implied distribution shift, which they call the pushforward trick.

We approach the problem in probabilistic terms. The solver maps \(p_k \mapsto \mathcal{A}_{\sharp} p_k\) at iteration \(k+1\), where \(\mathcal{A}_{\sharp}: \mathbb{P}(X) \rightarrow \mathbb{P}(X)\) is the pushforward operator for \(\mathcal{A}\) and \(\mathbb{P}(X)\) is the space of distributions on \(X\). After a single test time iteration, the solver sees samples from \(\mathcal{A}_{\sharp} p_k\) instead of the distribution \(p_{k+1}\), and unfortunately \(\mathcal{A}_{\sharp} p_k \neq p_{k+1}\) because errors always survive training. The test time distribution is thus shifted, which we refer to as the distribution shift problem. This is a domain adaptation problem.

We mitigate the distribution shift problem by adding a stability loss term, accounting for the distribution shift. A natural candidate is an adversarial-style loss

\[ L_{\text{stability}}=\mathbb{E}_k \mathbb{E}_{\mathbf{u}^{k+1} \mid \mathbf{u}^k, \mathbf{u}^k \sim p_k}\left[\mathbb{E}_{\boldsymbol{\epsilon} \mid \mathbf{u}^k}\left[\mathcal{L}\left(\mathcal{A}\left(\mathbf{u}^k+\boldsymbol{\epsilon}\right), \mathbf{u}^{k+1}\right)\right]\right] \]

where \(\boldsymbol{\epsilon} \mid \mathbf{u}^k\) is an adversarial perturbation sampled from an appropriate distribution. For the perturbation distribution, we choose \(\boldsymbol{\epsilon}\) such that \(\left(\mathbf{u}^k+\boldsymbol{\epsilon}\right) \sim \mathcal{A}_{\sharp} p_k\). This can be easily achieved by using \(\left(\mathbf{u}^k+\boldsymbol{\epsilon}\right)=\mathcal{A}\left(\mathbf{u}^{k-1}\right)\) for \(\mathbf{u}^{k-1}\) one step causally preceding \(\mathbf{u}^k\). Our total loss is then \(L_{\text{one-step}}+L_{\text{stability}}\). We call this the pushforward trick. We implement this by unrolling the solver for 2 steps but only backpropagating errors on the last unroll step, as shown in Figure. … This is not only faster, it also seems to be more stable. Exactly why, we are not sure, but we think it may be to ensure the perturbations are large enough. Training the adversarial distribution itself to minimize the error, defeats the purpose of using it as an adversarial distribution.

Adversarial losses were also introduced in Sanchez-Gonzalez et al. (2020) and later used in Mayr et al. (2023), where Brownian motion noise is used for \(\boldsymbol{\epsilon}\) and there is some similarity to Noisy Nodes (Godwin et al. 2022), where noise injection is found to stabilize training of deep graph neural networks. There are also connections with zero-stability (Hairer et al. 1993) from the ODE solver literature. Zero-stability is the condition that perturbations in the input conditions are damped out sublinearly in time, that is \(\left\|\mathcal{A}\left(\mathbf{u}^0+\boldsymbol{\epsilon}\right)-\mathbf{u}^1\right\|<\kappa\|\boldsymbol{\epsilon}\|\), for appropriate norm and small \(\kappa\). The pushforward trick can be seen to minimize \(\kappa\) directly.

That is an interesting justification for a very simple trick: we train better by using a two-steps-forward, one-step-back approach, where the input to the second forward step is the model's own pushforward of the previous state and gradients flow only through that second step.
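A toy sketch of the recipe may help. It is not the implementation from Brandstetter, Worrall, and Welling (2022). We assume a hypothetical one-parameter surrogate \(\mathcal{A}_\theta(u)=\theta u\), a squared-error loss, and noise-free scalar data, so the gradients can be written by hand. The stop-gradient shows up as the pushed-forward input being treated as a constant when we differentiate.

```python
def model(theta, u):
    """Hypothetical one-parameter surrogate solver A_theta(u) = theta * u."""
    return theta * u


def pushforward_grad(theta, traj):
    """Gradient of L_one-step + L_stability w.r.t. theta (squared error)."""
    grad, n = 0.0, 0
    for k in range(1, len(traj) - 1):
        u_prev, u_k, u_next = traj[k - 1], traj[k], traj[k + 1]
        # One-step term: d/dtheta (theta * u_k - u_next)^2.
        grad += 2.0 * (model(theta, u_k) - u_next) * u_k
        # Stability term: the input is the model's own prediction of u_k.
        # We hold it constant (stop-gradient), so only the last unroll
        # step contributes to the derivative.
        u_pushed = model(theta, u_prev)
        grad += 2.0 * (model(theta, u_pushed) - u_next) * u_pushed
        n += 1
    return grad / n


def train(traj, theta=0.0, lr=0.1, steps=500):
    """Plain gradient descent on the combined loss."""
    for _ in range(steps):
        theta -= lr * pushforward_grad(theta, traj)
    return theta


a_true = 0.9
traj = [a_true ** k for k in range(20)]  # exact trajectory u^{k+1} = a_true * u^k
theta_hat = train(traj)
print(f"learned theta: {theta_hat:.6f}")
```

In an autodiff framework the same effect would come from detaching the first-step prediction before feeding it back in. Here both loss terms vanish at the true parameter because the data are noise-free. With an imperfect model class the stability term is what exposes the model to its own errors during training.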

Figure: Brandstetter, Worrall, and Welling’s (2022) answer to the distribution shift problem.

Backpropagation through time

Backpropagation through time is how we usually discuss learning parameters in classic recurrent neural networks (Werbos 1988, 1990).

We can think of the problem of learning recurrent networks as essentially a system identification problem with all the implied difficulties including stability problems.

RNN research has its own special terminology, e.g. vanishing/exploding gradients (Bengio, Simard, and Frasconi 1994; Pascanu, Mikolov, and Bengio 2013). There is also TBPTT (truncated backpropagation through time) (Williams and Zipser 1989), which makes explicit the window of time steps over which gradients are taken.
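Both terms can be seen in a deliberately tiny example. The sketch below assumes a scalar linear recurrence \(h_t = w h_{t-1} + x_t\) with a loss on the final state only. The function names and the setup are illustrative assumptions. The contribution from \(k\) steps back is scaled by \(w^k\), so it vanishes for \(|w|<1\) and explodes for \(|w|>1\), and truncation simply drops the terms beyond a chosen window.

```python
def run_rnn(w, xs):
    """States of h_t = w * h_{t-1} + x_t with h_0 = 0; hs[t] follows t inputs."""
    h, hs = 0.0, [0.0]
    for x in xs:
        h = w * h + x
        hs.append(h)
    return hs


def loss(w, xs, target):
    """Squared error on the final state only."""
    return 0.5 * (run_rnn(w, xs)[-1] - target) ** 2


def grad_w(w, xs, target, truncation=None):
    """dL/dw by backpropagation through time, optionally truncated.

    Unrolling gives dh_T/dw = sum_k w**k * h_{T-1-k}; truncation keeps
    only the `truncation` most recent terms of that sum.
    """
    hs = run_rnn(w, xs)
    n_steps = len(xs)
    window = n_steps if truncation is None else min(truncation, n_steps)
    dh_dw = 0.0
    for k in range(window):
        dh_dw += (w ** k) * hs[n_steps - 1 - k]
    return (hs[n_steps] - target) * dh_dw


xs = [1.0] * 50
full = grad_w(0.5, xs, target=0.0)
truncated = grad_w(0.5, xs, target=0.0, truncation=10)
print(f"full BPTT gradient: {full:.4f}, truncated to 10 steps: {truncated:.4f}")
```

For a contracting recurrence the truncated gradient is a close approximation. For \(|w|>1\) the dropped terms dominate, and truncation biases the gradient badly.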

Method of adjoints

See method of adjoints.


References

Aicher, Christopher, Nicholas J. Foti, and Emily B. Fox. 2020. “Adaptively Truncating Backpropagation Through Time to Control Gradient Bias.” In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, 799–808. PMLR.
Andrieu, Christophe, Arnaud Doucet, and Roman Holenstein. 2010. “Particle Markov Chain Monte Carlo Methods.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 (3): 269–342.
Archer, Evan, Il Memming Park, Lars Buesing, John Cunningham, and Liam Paninski. 2015. “Black Box Variational Inference for State Space Models.” arXiv:1511.07367 [Stat], November.
Babtie, Ann C., Paul Kirk, and Michael P. H. Stumpf. 2014. “Topological Sensitivity Analysis for Systems Biology.” Proceedings of the National Academy of Sciences 111 (52): 18507–12.
Bamler, Robert, and Stephan Mandt. 2017. “Structured Black Box Variational Inference for Latent Time Series Models.” arXiv:1707.01069 [Cs, Stat], July.
Becker, Philipp, Harit Pandya, Gregor Gebhardt, Cheng Zhao, C. James Taylor, and Gerhard Neumann. 2019. “Recurrent Kalman Networks: Factorized Inference in High-Dimensional Deep Feature Spaces.” In International Conference on Machine Learning, 544–52.
Bengio, Y., P. Simard, and P. Frasconi. 1994. “Learning Long-Term Dependencies with Gradient Descent Is Difficult.” IEEE Transactions on Neural Networks 5 (2): 157–66.
Box, George E. P., Gwilym M. Jenkins, Gregory C. Reinsel, and Greta M. Ljung. 2016. Time Series Analysis: Forecasting and Control. Fifth edition. Wiley Series in Probability and Statistics. Hoboken, New Jersey: John Wiley & Sons, Inc.
Brandstetter, Johannes, Daniel Worrall, and Max Welling. 2022. “Message Passing Neural PDE Solvers.” In International Conference on Learning Representations.
Bretó, Carles, Daihai He, Edward L. Ionides, and Aaron A. King. 2009. “Time Series Analysis via Mechanistic Models.” The Annals of Applied Statistics 3 (1): 319–48.
Brunton, Steven L., Joshua L. Proctor, and J. Nathan Kutz. 2016. “Discovering Governing Equations from Data by Sparse Identification of Nonlinear Dynamical Systems.” Proceedings of the National Academy of Sciences 113 (15): 3932–37.
Cao, Y., S. Li, L. Petzold, and R. Serban. 2003. “Adjoint Sensitivity Analysis for Differential-Algebraic Equations: The Adjoint DAE System and Its Numerical Solution.” SIAM Journal on Scientific Computing 24 (3): 1076–89.
Chevillon, Guillaume. 2007. “Direct Multi-Step Estimation and Forecasting.” Journal of Economic Surveys 21 (4): 746–85.
Chung, Junyoung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. 2015. “A Recurrent Latent Variable Model for Sequential Data.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2980–88. Curran Associates, Inc.
Corenflos, Adrien, James Thornton, George Deligiannidis, and Arnaud Doucet. 2021. “Differentiable Particle Filtering via Entropy-Regularized Optimal Transport.” arXiv:2102.07850 [Cs, Stat], June.
Del Moral, Pierre, Arnaud Doucet, and Ajay Jasra. 2006. “Sequential Monte Carlo Samplers.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (3): 411–36.
———. 2011. “An Adaptive Sequential Monte Carlo Method for Approximate Bayesian Computation.” Statistics and Computing 22 (5): 1009–20.
Doucet, Arnaud, Nando Freitas, and Neil Gordon. 2001. Sequential Monte Carlo Methods in Practice. New York, NY: Springer New York.
Doucet, Arnaud, Pierre E. Jacob, and Sylvain Rubenthaler. 2013. “Derivative-Free Estimation of the Score Vector and Observed Information Matrix with Application to State-Space Models.” arXiv:1304.5768 [Stat], April.
Drovandi, Christopher C., Anthony N. Pettitt, and Roy A. McCutchan. 2016. “Exact and Approximate Bayesian Inference for Low Integer-Valued Time Series Models with Intractable Likelihoods.” Bayesian Analysis 11 (2): 325–52.
Durbin, J., and S. J. Koopman. 2012. Time Series Analysis by State Space Methods. 2nd ed. Oxford Statistical Science Series 38. Oxford: Oxford University Press.
Errico, Ronald M. 1997. “What Is an Adjoint Model?” Bulletin of the American Meteorological Society 78 (11): 2577–92.
Evensen, Geir. 2003. “The Ensemble Kalman Filter: Theoretical Formulation and Practical Implementation.” Ocean Dynamics 53 (4): 343–67.
———. 2009a. Data Assimilation - The Ensemble Kalman Filter. Berlin; Heidelberg: Springer.
———. 2009b. “The Ensemble Kalman Filter for Combined State and Parameter Estimation.” IEEE Control Systems 29 (3): 83–104.
Fearnhead, Paul, and Hans R. Künsch. 2018. “Particle Filters and Data Assimilation.” Annual Review of Statistics and Its Application 5 (1): 421–49.
Gahungu, Paterne, Christopher W. Lanyon, Mauricio A. Álvarez, Engineer Bainomugisha, Michael Thomas Smith, and Richard David Wilkinson. 2022. “Adjoint-Aided Inference of Gaussian Process Driven Differential Equations.” In.
Godwin, Jonathan, Michael Schaarschmidt, Alexander Gaunt, Alvaro Sanchez-Gonzalez, Yulia Rubanova, Petar Veličković, James Kirkpatrick, and Peter Battaglia. 2022. “Simple GNN Regularisation for 3D Molecular Property Prediction & Beyond.” arXiv.
He, Daihai, Edward L. Ionides, and Aaron A. King. 2010. “Plug-and-Play Inference for Disease Dynamics: Measles in Large and Small Populations as a Case Study.” Journal of The Royal Society Interface 7 (43): 271–83.
Heinonen, Markus, and Florence d’Alché-Buc. 2014. “Learning Nonparametric Differential Equations with Operator-Valued Kernels and Gradient Matching.” arXiv:1411.5172 [Cs, Stat], November.
Hurvich, Clifford M. 2002. “Multistep Forecasting of Long Memory Series Using Fractional Exponential Models.” International Journal of Forecasting, Forecasting Long Memory Processes, 18 (2): 167–79.
Hürzeler, Markus, and Hans R. Künsch. 2001. “Approximating and Maximising the Likelihood for a General State-Space Model.” In Sequential Monte Carlo Methods in Practice, 159–75. Statistics for Engineering and Information Science. Springer, New York, NY.
Ingraham, John, and Debora Marks. 2017. “Variational Inference for Sparse and Undirected Models.” In PMLR, 1607–16.
Innes, Michael. 2018. “Don’t Unroll Adjoint: Differentiating SSA-Form Programs.” arXiv:1810.07951 [Cs], October.
Ionides, E. L., C. Bretó, and A. A. King. 2006. “Inference for Nonlinear Dynamical Systems.” Proceedings of the National Academy of Sciences 103 (49): 18438–43.
Ionides, Edward L., Anindya Bhadra, Yves Atchadé, and Aaron King. 2011. “Iterated Filtering.” The Annals of Statistics 39 (3): 1776–1802.
Ionides, Edward L., Dao Nguyen, Yves Atchadé, Stilian Stoev, and Aaron A. King. 2015. “Inference for Dynamic and Latent Variable Models via Iterated, Perturbed Bayes Maps.” Proceedings of the National Academy of Sciences 112 (3): 719–24.
Johnson, Steven G. 2012. “Notes on Adjoint Methods for 18.335,” 6.
Kantas, N., A. Doucet, S. S. Singh, and J. M. Maciejowski. 2009. “An Overview of Sequential Monte Carlo Methods for Parameter Estimation in General State-Space Models.” IFAC Proceedings Volumes, 15th IFAC Symposium on System Identification, 42 (10): 774–85.
Kantas, Nikolas, Arnaud Doucet, Sumeetpal S. Singh, Jan Maciejowski, and Nicolas Chopin. 2015. “On Particle Methods for Parameter Estimation in State-Space Models.” Statistical Science 30 (3): 328–51.
Kidger, Patrick, Ricky T. Q. Chen, and Terry J. Lyons. 2021. “‘Hey, That’s Not an ODE’: Faster ODE Adjoints via Seminorms.” In Proceedings of the 38th International Conference on Machine Learning, 5443–52. PMLR.
Kidger, Patrick, James Morrill, James Foster, and Terry Lyons. 2020. “Neural Controlled Differential Equations for Irregular Time Series.” arXiv:2005.08926 [Cs, Stat], November.
Kitagawa, Genshiro. 1998. “A Self-Organizing State-Space Model.” Journal of the American Statistical Association, 1203–15.
Krishnan, Rahul G., Uri Shalit, and David Sontag. 2015. “Deep Kalman Filters.” arXiv Preprint arXiv:1511.05121.
———. 2017. “Structured Inference Networks for Nonlinear State Space Models.” In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2101–9.
Lamb, Alex, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, and Yoshua Bengio. 2016. “Professor Forcing: A New Algorithm for Training Recurrent Networks.” In Advances In Neural Information Processing Systems.
Le, Tuan Anh, Maximilian Igl, Tom Jin, Tom Rainforth, and Frank Wood. 2017. “Auto-Encoding Sequential Monte Carlo.” arXiv Preprint arXiv:1705.10306.
Legenstein, Robert, Christian Naeger, and Wolfgang Maass. 2005. “What Can a Neuron Learn with Spike-Timing-Dependent Plasticity?” Neural Computation 17 (11): 2337–82.
Lele, S. R., B. Dennis, and F. Lutscher. 2007. “Data Cloning: Easy Maximum Likelihood Estimation for Complex Ecological Models Using Bayesian Markov Chain Monte Carlo Methods.” Ecology Letters 10 (7): 551.
Lele, Subhash R., Khurram Nadeem, and Byron Schmuland. 2010. “Estimability and Likelihood Inference for Generalized Linear Mixed Models Using Data Cloning.” Journal of the American Statistical Association 105 (492): 1617–25.
Li, Xuechen, Ting-Kam Leonard Wong, Ricky T. Q. Chen, and David Duvenaud. 2020. “Scalable Gradients for Stochastic Differential Equations.” In International Conference on Artificial Intelligence and Statistics, 3870–82. PMLR.
Lillicrap, Timothy P, and Adam Santoro. 2019. “Backpropagation Through Time and the Brain.” Current Opinion in Neurobiology, Machine Learning, Big Data, and Neuroscience, 55 (April): 82–89.
Lindström, Erik, Edward Ionides, Jan Frydendall, and Henrik Madsen. 2012. “Efficient Iterated Filtering.” In IFAC-PapersOnLine (System Identification, Volume 16), 45:1785–90. 16th IFAC Symposium on System Identification. IFAC & Elsevier Ltd.
Lindström, Erik, Jonas Ströjby, Mats Brodén, Magnus Wiktorsson, and Jan Holst. 2008. “Sequential Calibration of Options.” Computational Statistics & Data Analysis 52 (6): 2877–91.
Liu, Jane, and Mike West. 2001. “Combined Parameter and State Estimation in Simulation-Based Filtering.” In Sequential Monte Carlo Methods in Practice, 197–223. Statistics for Engineering and Information Science. Springer, New York, NY.
Ljung, L. 1979. “Asymptotic Behavior of the Extended Kalman Filter as a Parameter Estimator for Linear Systems.” IEEE Transactions on Automatic Control 24 (1): 36–50.
Ljung, Lennart, Georg Ch Pflug, and Harro Walk. 2012. Stochastic Approximation and Optimization of Random Systems. Vol. 17. Birkhäuser.
Ljung, Lennart, and Torsten Söderström. 1983. Theory and Practice of Recursive Identification. The MIT Press Series in Signal Processing, Optimization, and Control 4. Cambridge, Mass: MIT Press.
Maddison, Chris J., Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Whye Teh. 2017. “Filtering Variational Objectives.” arXiv Preprint arXiv:1705.09279.
Margossian, Charles C., Aki Vehtari, Daniel Simpson, and Raj Agrawal. 2020. “Hamiltonian Monte Carlo Using an Adjoint-Differentiated Laplace Approximation: Bayesian Inference for Latent Gaussian Models and Beyond.” arXiv:2004.12550 [Stat], October.
Mayr, Andreas, Sebastian Lehner, Arno Mayrhofer, Christoph Kloss, Sepp Hochreiter, and Johannes Brandstetter. 2023. “Boundary Graph Neural Networks for 3D Simulations.” arXiv.
Mitusch, Sebastian K., Simon W. Funke, and Jørgen S. Dokken. 2019. “Dolfin-Adjoint 2018.1: Automated Adjoints for FEniCS and Firedrake.” Journal of Open Source Software 4 (38): 1292.
Moradkhani, Hamid, Soroosh Sorooshian, Hoshin V. Gupta, and Paul R. Houser. 2005. “Dual State–Parameter Estimation of Hydrological Models Using Ensemble Kalman Filter.” Advances in Water Resources 28 (2): 135–47.
Naesseth, Christian A., Scott W. Linderman, Rajesh Ranganath, and David M. Blei. 2017. “Variational Sequential Monte Carlo.” arXiv Preprint arXiv:1705.11140.
Oliva, Junier B., Barnabas Poczos, and Jeff Schneider. 2017. “The Statistical Recurrent Unit.” arXiv:1703.00381 [Cs, Stat], March.
Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. 2013. “On the Difficulty of Training Recurrent Neural Networks.” In arXiv:1211.5063 [Cs], 1310–18.
Rackauckas, Christopher, Yingbo Ma, Vaibhav Dixit, Xingjian Guo, Mike Innes, Jarrett Revels, Joakim Nyberg, and Vijay Ivaturi. 2018. “A Comparison of Automatic Differentiation and Continuous Sensitivity Analysis for Derivatives of Differential Equation Solutions.” arXiv:1812.01892 [Cs], December.
Sanchez-Gonzalez, Alvaro, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter Battaglia. 2020. “Learning to Simulate Complex Physics with Graph Networks.” In Proceedings of the 37th International Conference on Machine Learning, 8459–68. PMLR.
Sjöberg, Jonas, Qinghua Zhang, Lennart Ljung, Albert Benveniste, Bernard Delyon, Pierre-Yves Glorennec, Håkan Hjalmarsson, and Anatoli Juditsky. 1995. “Nonlinear Black-Box Modeling in System Identification: A Unified Overview.” Automatica, Trends in System Identification, 31 (12): 1691–1724.
Söderström, T., and P. Stoica, eds. 1988. System Identification. Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
Stapor, Paul, Fabian Fröhlich, and Jan Hasenauer. 2018. “Optimization and Uncertainty Analysis of ODE Models Using 2nd Order Adjoint Sensitivity Analysis.” bioRxiv, February, 272005.
Sutskever, Ilya. 2013. “Training Recurrent Neural Networks.” PhD Thesis, Toronto, Ont., Canada: University of Toronto.
Takamoto, Makoto, Timothy Praditia, Raphael Leiteritz, Dan MacKinlay, Francesco Alesiani, Dirk Pflüger, and Mathias Niepert. 2022. “PDEBench: An Extensive Benchmark for Scientific Machine Learning.” In.
Tallec, Corentin, and Yann Ollivier. 2017. “Unbiasing Truncated Backpropagation Through Time.” arXiv.
Tippett, Michael K., Jeffrey L. Anderson, Craig H. Bishop, Thomas M. Hamill, and Jeffrey S. Whitaker. 2003. “Ensemble Square Root Filters.” Monthly Weather Review 131 (7): 1485–90.
Wen, Ruofeng, Kari Torkkola, and Balakrishnan Narayanaswamy. 2017. “A Multi-Horizon Quantile Recurrent Forecaster.” arXiv:1711.11053 [Stat], November.
Werbos, Paul J. 1988. “Generalization of Backpropagation with Application to a Recurrent Gas Market Model.” Neural Networks 1 (4): 339–56.
———. 1990. “Backpropagation Through Time: What It Does and How to Do It.” Proceedings of the IEEE 78 (10): 1550–60.
Williams, Ronald J., and Jing Peng. 1990. “An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories.” Neural Computation 2 (4): 490–501.
Williams, Ronald J., and David Zipser. 1989. “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks.” Neural Computation 1 (2): 270–80.
