Feedback system identification, linear

In system identification, we infer the parameters of a stochastic dynamical system of a certain type, i.e. usually ones with feedback, so that we can e.g. simulate it, or deconvolve it to find the inputs and hidden state, maybe using state filters. In statistical terms, this is the parameter inference problem for dynamical systems.

Moreover, it totally works without Gaussian noise; that’s just convenient in optimal linear filtering, Kalman filtering isn’t rocket science, after all. Also, mathematically Gaussian is a useful crutch if you decide to go to a continuous time index, cf Gaussian processes.

This is the mostly offline version. There is a sub-notebook focussing on online recursive estimation.


Oppenheim and Verghese, Signals, Systems, and Inference is free online. ritvikmath explains partial autocorrelation as a graphical model, which is not complicated, but for some reason I never had it laid out this way in my own time series courses. See also Kenneth Tay, The relationship between MA(q)/AR(p) processes and ACF/PACF

(Martin 1999):

Consider the basic autoregressive model,

\[ Y(k) = \sum_{j=1}^pa_jY(k-j)=\epsilon(k). \]

Estimating AR(p) coefficients:

The [power] spectrum is easily obtained from [the above] as

\[ P(f) = \frac{\sigma^2}{|1+ \sum_{j=1}^pa_jz^{-1}|^2},\\ z=\exp 2\pi if\delta t \]

with \(\delta t\) the intersample spacing. […] for any given set of data, we need to be able to estimate the AR coefficients \(\{a_j\}_{j=1}^N\) conveniently. Three methods for achieving this are the Yule-Walker, Burg and Covariance methods. The Yule-Walker technique uses the sample autocovariance to obtain the coefficients; the Covariance method defines, for a set of numbers \(\mathbf{a}=\{a_j\}_{j=1}^N,\) a quantity known as the total forward and backward prediction error power:

\[ E(Y,\mathbf{a}) = \frac{1}{2(N-p)}\sum_{n=p+1}^N\left\{ \left|Y(n)+\sum{j=1}^pa_jY(n-p)\right|^2 + \left|Y(n-p)+\sum{j=1}^pa^*_jY(n-p+j)\right|^2 \right\} \]

and minimises this w.r.t. \(\mathbf{a}\). As \(E(Y, \mathbf{a})\) is a quadratic function of \(\mathbf{a}\), \(\partial E(Y, \mathbf{a})/partial a\) is linear in \(\mathbf{a}\) and so this is a linear optimisation problem. The Burg method is a constrained minimisation of \(E(Y, \mathbf{a})\) using the Levinson recursion, a computational device derived from the Yule-Walker method.

Instrumental variable regression

Unevenly sampled

Model estimation/system identification

You don’t know a parameterised model for the data (and hence a precise bandwidth) and you wish to estimate it.

This is a system identification problem, although the non-uniform sampling means that it has an unusual form.

(Martin 1999) summarizes:

One could consider the general problem in an approximate way as the missing data problem with a very high proportion of missing data points, but (Jones 1981, 1984) this is not very realistic. This has led to the consideration of the continuous-time model […] . (Lii and Masry 1992) shows that the coefficients in that equation may be obtained from the [irregularly sampled autocorrelation moments, but], the estimation of these requires a large amount of data and the results are asymptotic in the limit of infinite data. The other continuous-time approach is that of Jones (Jones 1981, 1984) who has used Kalman recursive estimation […] to obtain a likelihood function \(\operatorname{lik}(x|b)\) which is then maximised w.r.t. b to obtain an estimate of the true parameters.

There is a partial review and comparison of methods in (P. M. Broersen 2006; Stoica and Moses 2005). From the latter:

(Martin 1999) applied autoregressive modeling to irregularly sampled data using a dedicated method. It was particularly good in extracting sinusoids from noise in short data sets. (SΓΆderstrΓΆm and Mossberg 2000) evaluated the performance of methods for identifying continuous-time autoregressive processes, which replace the differentiation operator by different approximations. (Larsson and SΓΆderstrΓΆm 2002) apply this idea to randomly sampled autoregressive data. They report promising results for low-order processes. (Lahalle, Fleury, and Rivoira 2004) estimate continuous-time ARMA models. Unfortunately, their method requires explicit use of a model for irregular sampling instants. The precise shape of that distribution is very important for the result, but it is almost impossible to establish it from practical data.

No generally satisfactory spectral estimator for irregular data has been defined yet. Continuous time series models can be estimated for irregular data, and they are the only possible candidates for obtaining the Cramér-Rao lower boundary, because the true process for irregular data is a continuous-time process. (Jones 1981) has formulated the maximum likelihood estimator for irregular observations. However, (Jones 1984) also found that the likelihood has several local maxima and the optimisation requires extremely good initial estimates. (P. M. T. Broersen and Bos 2006) used the method of Jones to obtain maximum likelihood estimates for irregular data. If simulations started with the true process parameters as initial conditions, that was sometimes, but not always, good enough to converge to the global maximum of the likelihood. However, sometimes even those perfect and nonrealisable starting values were not capable of letting the likelihood converge to an acceptable model. So far, no practical maximum likelihood method for irregular data has solved all numerical problems, and certainly no satisfactory realisable initial conditions can be given. As an example, it has been verified in simulations that taking the estimated AR( pβ€”1) model together with an additional zero for order p as starting values for AR( p) estimation does not always converge to acceptable AR( p) models. The model with the maximum value of the likelihood might not in all cases be accurate and many good models have significantly lower numerical values of the likelihood. (Martin 1999) suggests that the exact likelihood is sensitive to round-off errors. (P. M. T. Broersen and Bos 2006) calculated the likelihood as a function of true model parameters, multiplied by a constant factor. Only the likelihood for a single pole was smooth. Two poles already gave a number of sharp peaks in the likelihood, and three or more poles gave a very rough surface of the likelihood. The scene is full of local minima, and the optimisation cannot find the global minimum, unless it starts very close to it.


Asymptotic methods based on gridding observations.

Method of transformed coefficients

Useful tool: equivalence of a continuous time Ito integral and a discrete ARIMA process (attributed by Martin (1998) to Bartlett (1946)) also implies you can estimate the model without estimating missing data, which is satisfying, although the precise form this takes is less satisfying.

A popular overview seem to be Martin (1999).

State filters

(Note that you can also do the signal reconstruction problem using state filters, but I’m interested in doing system identification using state filters.) [Jones (1981); MartinAutoregression1998gave this a go; while (Martin 1998) mentioned problems, I’m curious when it does work, since this seems natural, simple, and easier to make robust against model violations than the other methods.

(Martin 1998):

It is well known that if a univariate continuous time autoregression is sampled at equally spaced time intervals, the resulting, discrete time process is ARMA(p,p-1). If the sampling includes observational error, the resulting process is ARMA(p,p); however, these 2p parameters depend only on the p continuous time autoregression coefficients and the observational error variance. Modeling, the process as a continuous time autoregression with observational error may be much more parsimonious than modeling the discrete time process, whether or not the data are equally spaced. The direct modeling of observational error has the effect of smoothing noisy data and may eliminate the need for moving average terms.


Gradient descent learns Linear Dynamical systems

Linear Predictive Coding

LPC introductions traditionally start with a physical model of the human vocal tract as a resonating pipe, then mumble away the details. This confused the hell out of me. AFAICT, an LPC model is just a list of AR regression coefficients and a driving noise source coefficient. This is β€œcoding” because you can round the numbers, pack them down a smidgen and then use it to encode certain time series, such as the human voice, compactly. But it’s still a regression analysis, and can be treated as such.

The twists are that

  • we usually think about it in a compression context
  • Traditionally one performs many regressions to get time-varying models

It’s commonly described as a physical model because we can imagine these regression coefficients corresponding to a simplified physical model of the human vocal tract; But we can think of the regression coefficients as corresponding to any all-pole linear system, so I don’t think that brings special insight; especially as the models of, say, a resonating pipe, would intuitively be described by time-delays corresponding to the length of the pipe, not time-lags corresponding to a corresponding sample plus computational convenience. Sure we can get similar spectral response for this model as with a pipe, according to linear systems theory, but if you are going to assume so much advanced linear systems theory anyway, and mix it with crappy physics, why not just start with the linear systems and ditch the physics?

To discuss: these coefficients as spectrogram smoothing.


Agarwal, Anish, Muhammad Jehangir Amjad, Devavrat Shah, and Dennis Shen. 2018. β€œTime Series Analysis via Matrix Estimation.” arXiv:1802.09064 [Cs, Stat], February.
Akaike, Htrotugu. 1973. β€œMaximum Likelihood Identification of Gaussian Autoregressive Moving Average Models.” Biometrika 60 (2): 255–65.
Ansley, Craig F., and Robert Kohn. 1986. β€œA Note on Reparameterizing a Vector Autoregressive Moving Average Model to Enforce Stationarity.” Journal of Statistical Computation and Simulation 24 (2): 99–106.
Antoniano-Villalobos, Isadora, and Stephen G. Walker. 2016. β€œA Nonparametric Model for Stationary Time Series.” Journal of Time Series Analysis 37 (1): 126–42.
Atal, B. S. 2006. β€œThe History of Linear Prediction.” IEEE Signal Processing Magazine 23 (2): 154–61.
Bartlett, M. S. 1946. β€œOn the Theoretical Specification and Sampling Properties of Autocorrelated Time-Series.” Supplement to the Journal of the Royal Statistical Society 8 (1): 27–41.
Berkhout, A. J., and P. R. Zaanen. 1976. β€œA Comparison Between Wiener Filtering, Kalman Filtering, and Deterministic Least Squares Estimation*.” Geophysical Prospecting 24 (1): 141–97.
Box, G. E. P., and David A. Pierce. 1970. β€œDistribution of Residual Autocorrelations in Autoregressive-Integrated Moving Average Time Series Models.” Journal of the American Statistical Association 65 (332): 1509–26.
Box, George E. P., Gwilym M. Jenkins, Gregory C. Reinsel, and Greta M. Ljung. 2016. Time Series Analysis: Forecasting and Control. Fifth edition. Wiley Series in Probability and Statistics. Hoboken, New Jersey: John Wiley & Sons, Inc.
Broersen, P. M. T., and R. Bos. 2006. β€œEstimating Time-Series Models from Irregularly Spaced Data.” In IEEE Transactions on Instrumentation and Measurement, 55:1124–31.
Broersen, Petrus MT. 2006. Automatic Autocorrelation and Spectral Analysis. Secaucus, NJ, USA: Springer.
Broersen, Piet M. T., Stijn de Waele, and Robert Bos. 2004. β€œAutoregressive Spectral Analysis When Observations Are Missing.” Automatica 40 (9): 1495–1504.
BΓΌhlmann, Peter, and Hans R KΓΌnsch. 1999. β€œBlock Length Selection in the Bootstrap for Time Series.” Computational Statistics & Data Analysis 31 (3): 295–310.
Carmi, Avishy Y. 2013. β€œCompressive System Identification: Sequential Methods and Entropy Bounds.” Digital Signal Processing 23 (3): 751–70.
β€”β€”β€”. 2014. β€œCompressive System Identification.” In Compressed Sensing & Sparse Filtering, edited by Avishy Y. Carmi, Lyudmila Mihaylova, and Simon J. Godsill, 281–324. Signals and Communication Technology. Springer Berlin Heidelberg.
Chen, Bin, and Yongmiao Hong. 2012. β€œTesting for the Markov Property in Time Series.” Econometric Theory 28 (01): 130–78.
Christ, Maximilian, Andreas W. Kempa-Liehr, and Michael Feindt. 2016. β€œDistributed and Parallel Time Series Feature Extraction for Industrial Big Data Applications.” arXiv:1610.07717 [Cs], October.
Delft, Anne van, and Michael Eichler. 2016. β€œLocally Stationary Functional Time Series.” arXiv:1602.05125 [Math, Stat], February.
Doucet, Arnaud, Pierre E. Jacob, and Sylvain Rubenthaler. 2013. β€œDerivative-Free Estimation of the Score Vector and Observed Information Matrix with Application to State-Space Models.” arXiv:1304.5768 [Stat], April.
Durbin, J., and S. J. Koopman. 1997. β€œMonte Carlo Maximum Likelihood Estimation for Non-Gaussian State Space Models.” Biometrika 84 (3): 669–84.
β€”β€”β€”. 2012. Time Series Analysis by State Space Methods. 2nd ed. Oxford Statistical Science Series 38. Oxford: Oxford University Press.
Eguchi, Shoichi, and Yuma Uehara. n.d. β€œSchwartz-Type Model Selection for Ergodic Stochastic Differential Equation Models.” Scandinavian Journal of Statistics n/a (n/a).
Geer, Sara van de. 2002. β€œOn Hoeffdoing’s Inequality for Dependent Random Variables.” In Empirical Process Techniques for Dependent Data. BirkhhΓ€user.
Geweke, John, and Richard Meese. 1981. β€œEstimating Regression Models of Finite but Unknown Order.” Journal of Econometrics 16 (1): 162.
Gu, Albert, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher RΓ©. 2021. β€œCombining Recurrent, Convolutional, and Continuous-Time Models with Linear State Space Layers.” In Advances in Neural Information Processing Systems, 34:572–85. Curran Associates, Inc.
Hardt, Moritz, Tengyu Ma, and Benjamin Recht. 2018. β€œGradient Descent Learns Linear Dynamical Systems.” The Journal of Machine Learning Research 19 (1): 1025–68.
Harvey, A., and S. J. Koopman. 2005. β€œStructural Time Series Models.” In Encyclopedia of Biostatistics. John Wiley & Sons, Ltd.
Hazan, Elad, Karan Singh, and Cyril Zhang. 2017. β€œLearning Linear Dynamical Systems via Spectral Filtering.” In NIPS.
Heaps, Sarah E. 2020. β€œEnforcing Stationarity Through the Prior in Vector Autoregressions.” arXiv:2004.09455 [Stat], April.
Hefny, Ahmed, Carlton Downey, and Geoffrey Gordon. 2015. β€œA New View of Predictive State Methods for Dynamical System Learning.” arXiv:1505.05310 [Cs, Stat], May.
Hencic, Andrew, and Christian GouriΓ©roux. 2015. β€œNoncausal Autoregressive Model in Application to Bitcoin/USD Exchange Rates.” In Econometrics of Risk, edited by Van-Nam Huynh, Vladik Kreinovich, Songsak Sriboonchitta, and Komsan Suriya, 17–40. Studies in Computational Intelligence 583. Springer International Publishing.
Holan, Scott H., Robert Lund, and Ginger Davis. 2010. β€œThe ARMA Alphabet Soup: A Tour of ARMA Model Variants.” Statistics Surveys 4: 232–74.
Ives, Anthony R., Karen C. Abbott, and Nicolas L. Ziebarth. 2010. β€œAnalysis of Ecological Time Series with ARMA( p,q ) Models.” Ecology 91 (3): 858–71.
Jones, Richard H. 1981. β€œFitting a Continuous Time Autoregression to Discrete Data.” In Applied Time Series Analysis II, 651–82.
β€”β€”β€”. 1984. β€œFitting Multivariate Models to Unequally Spaced Data.” In Time Series Analysis of Irregularly Observed Data, 158–88. Springer.
Kailath, Thomas, Ali H. Sayed, and Babak Hassibi. 2000. Linear Estimation. Prentice Hall Information and System Sciences Series. Upper Saddle River, N.J: Prentice Hall.
Kalouptsidis, Nicholas, Gerasimos Mileounis, Behtash Babadi, and Vahid Tarokh. 2011. β€œAdaptive Algorithms for Sparse System Identification.” Signal Processing 91 (8): 1910–19.
KavčiΔ‡, A., and J. M. F. Moura. 2000. β€œMatrices with Banded Inverses: Inversion Algorithms and Factorization of Gauss-Markov Processes.” IEEE Transactions on Information Theory 46 (4): 1495–1509.
Kay, Steven M. 1993. Fundamentals of Statistical Signal Processing. Prentice Hall Signal Processing Series. Englewood Cliffs, N.J: Prentice-Hall PTR.
Kemerait, R., and D. Childers. 1972. β€œSignal Detection and Extraction by Cepstrum Techniques.” IEEE Transactions on Information Theory 18 (6): 745–59.
KΓΌnsch, Hans Rudolf. 1986. β€œDiscrimination Between Monotonic Trends and Long-Range Dependence.” Journal of Applied Probability 23 (4): 1025–30.
Lahalle, E., G. Fleury, and A. Rivoira. 2004. β€œContinuous ARMA Spectral Estimation from Irregularly Sampled Observations.” In Proceedings of the 21st IEEE Instrumentation and Measurement Technology Conference, 2004. IMTC 04, 2:923–927 Vol.2.
Larsson, Erik K., and Torsten SΓΆderstrΓΆm. 2002. β€œIdentification of Continuous-Time AR Processes from Unevenly Sampled Data.” Automatica 38 (4): 709–18.
Lii, Keh-Shin, and Elias Masry. 1992. β€œModel Fitting for Continuous-Time Stationary Processes from Discrete-Time Data.” Journal of Multivariate Analysis 41 (1): 56–79.
Ljung, Lennart. 1999. System Identification: Theory for the User. 2nd ed. Prentice Hall Information and System Sciences Series. Upper Saddle River, NJ: Prentice Hall PTR.
Ljung, Lennart, and Torsten SΓΆderstrΓΆm. 1983. Theory and Practice of Recursive Identification. The MIT Press Series in Signal Processing, Optimization, and Control 4. Cambridge, Mass: MIT Press.
Makhoul, J. 1975. β€œLinear Prediction: A Tutorial Review.” Proceedings of the IEEE 63 (4): 561–80.
Manton, J. H., V. Krishnamurthy, and H. V. Poor. 1998. β€œJames-Stein State Filtering Algorithms.” IEEE Transactions on Signal Processing 46 (9): 2431–47.
Marelli, D. 2007. β€œA Functional Analysis Approach to Subband System Approximation and Identification.” IEEE Transactions on Signal Processing 55 (2): 493–506.
Martin, R. J. 1998. β€œAutoregression and Irregular Sampling: Filtering.” Signal Processing 69 (3): 229–48.
β€”β€”β€”. 1999. β€œAutoregression and Irregular Sampling: Spectral Estimation.” Signal Processing 77 (2): 139–57.
Matos, Joao Amaro de, and Marcelo Fernandes. 2007. β€œTesting the Markov Property with High Frequency Data.” Journal of Econometrics, Semiparametric methods in econometrics, 141 (1): 44–64.
McDonald, Daniel J., Cosma Rohilla Shalizi, and Mark Schervish. 2011a. β€œGeneralization Error Bounds for Stationary Autoregressive Models.” arXiv:1103.0942 [Cs, Stat], March.
β€”β€”β€”. 2011b. β€œRisk Bounds for Time Series Without Strong Mixing.” arXiv:1106.0730 [Cs, Stat], June.
McLeod, A. Ian. 1998. β€œHyperbolic Decay Time Series.” Journal of Time Series Analysis 19 (4): 473–83.
McLeod, A. Ian, and Ying Zhang. 2008. β€œFaster ARMA Maximum Likelihood Estimation.” Computational Statistics & Data Analysis 52 (4): 2166–76.
Milanese, M., and A. Vicino. 1993. β€œInformation-Based Complexity and Nonparametric Worst-Case System Identification.” Journal of Complexity 9 (4): 427–46.
Pagano, Marcello. 1974. β€œEstimation of Models of Autoregressive Signal Plus White Noise.” The Annals of Statistics 2 (1): 99–108.
Pereyra, M., P. Schniter, Γ‰ Chouzenoux, J. C. Pesquet, J. Y. Tourneret, A. O. Hero, and S. McLaughlin. 2016. β€œA Survey of Stochastic Simulation and Optimization Methods in Signal Processing.” IEEE Journal of Selected Topics in Signal Processing 10 (2): 224–41.
Pillonetto, Gianluigi. 2016. β€œThe Interplay Between System Identification and Machine Learning.” arXiv:1612.09158 [Cs, Stat], December.
Plis, Sergey, David Danks, and Jianyu Yang. 2015. β€œMesochronal Structure Learning.” Uncertainty in Artificial Intelligence : Proceedings of the … Conference. Conference on Uncertainty in Artificial Intelligence 31 (July).
Pugachev, V. S., and I. N. SinitοΈ sοΈ‘yn. 2001. Stochastic systems: theory and applications. River Edge, NJ: World Scientific.
Ragazzini, J. R., and L. A. Zadeh. 1952. β€œThe Analysis of Sampled-Data Systems.” Transactions of the American Institute of Electrical Engineers, Part II: Applications and Industry 71 (5): 225–34.
Roy, Anindya, Tucker S. Mcelroy, and Peter Linton. 2019. β€œConstrained Estimation of Causal Invertible VARMA.” Statistica Sinica 29 (1): 455–78.
Scargle, Jeffrey D. 1981. β€œStudies in Astronomical Time Series Analysis. I-Modeling Random Processes in the Time Domain.” The Astrophysical Journal Supplement Series 45: 1–71.
Shen, Kaiming, and Wei Yu. 2018. β€œFractional Programming for Communication Systemsβ€”Part I: Power Control and Beamforming.” IEEE Transactions on Signal Processing 66 (10): 2616–30.
Simchowitz, Max, Horia Mania, Stephen Tu, Michael I. Jordan, and Benjamin Recht. 2018. β€œLearning Without Mixing: Towards A Sharp Analysis of Linear System Identification.” arXiv:1802.08334 [Cs, Math, Stat], February.
SΓΆderstrΓΆm, T., and M. Mossberg. 2000. β€œPerformance evaluation of methods for identifying continuous-time autoregressive processes.” Automatica 1 (36): 53–59.
Stoica, Petre, and Randolph L. Moses. 2005. Spectral Analysis of Signals. 1 edition. Upper Saddle River, N.J: Prentice Hall.
Taniguchi, Masanobu, and Yoshihide Kakizawa. 2000. Asymptotic Theory of Statistical Inference for Time Series. Springer Series in Statistics. New York: Springer.
Tufts, D. W., and R. Kumaresan. 1982. β€œEstimation of Frequencies of Multiple Sinusoids: Making Linear Prediction Perform Like Maximum Likelihood.” Proceedings of the IEEE 70 (9): 975–89.
Unser, Michael A., and Pouya Tafti. 2014. An Introduction to Sparse Stochastic Processes. New York: Cambridge University Press.
Vandenberghe, Lieven. 2012. β€œConvex Optimization Techniques in System Identification.” IFAC Proceedings Volumes, 16th IFAC Symposium on System Identification, 45 (16): 71–76.
Wedig, W. 1984. β€œA Critical Review of Methods in Stochastic Structural Dynamics.” Nuclear Engineering and Design 79 (3): 281–87.
Werbos, Paul J. 1988. β€œGeneralization of Backpropagation with Application to a Recurrent Gas Market Model.” Neural Networks 1 (4): 339–56.
Xu, Aolin, and Maxim Raginsky. 2017. β€œInformation-Theoretic Analysis of Generalization Capability of Learning Algorithms.” In Advances In Neural Information Processing Systems.

No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.