Some Classical and Some New Ideas for Identification of Linear Systems *

Lennart Ljung, Division of Automatic Control, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden

* Support from the European Research Council under the advanced grant LEARN, contract 267381, is gratefully acknowledged.
Email address: [email protected] (Lennart Ljung).

Abstract

This paper gives an overview of identification of linear systems. It covers the classical approach of parametric methods using Maximum Likelihood and Prediction Error Methods, as well as classical non-parametric methods through spectral analysis. It also covers very recent techniques dealing with convex formulations by regularization of FIR and ARX models, as well as new alternatives to spectral analysis, through local linear models. An example of identification of aircraft dynamics illustrates the approaches.

Key words: System identification; Maximum Likelihood; Prediction Error Methods; Spectral Analysis; Regularization; Local Polynomial Methods

1 Introduction

1.1 An Introductory Example: Aircraft Dynamics

Consider a physical system, with observed input and output signals, see Figure 1. Let us take a modern military aircraft, like the Swedish fighter Gripen, as an example. From one of the earlier test flights, some data were recorded as depicted in Figure 2.

Fig. 1. The Swedish aircraft Gripen.

Fig. 2. Data from an early test flight of Gripen. These data cover 3 seconds of flight and are sampled at 60 Hz. Panels, each plotted against time in seconds: (a) The output: pitch rate. (b) Control input 1: elevator angle. (c) Control input 2: leading edge flap. (d) Control input 3: canard angle.

In order to be able to simulate the aircraft and to design an effective autopilot, it is necessary to understand how, in this case, the pitch rate is affected by the three inputs. We need mathematical expressions for this. A fair amount of knowledge exists about aircraft dynamics, but let us just try a simple difference equation relation. Denote the output, the pitch rate, at sample number $t$ by $y(t)$ and the three control inputs at the same time by $u_k(t)$, $k = 1, 2, 3$. Then assume that we can write

\[
\begin{aligned}
y(t) = {} & -a_1 y(t-1) - a_2 y(t-2) - a_3 y(t-3) \\
& + b_{1,1} u_1(t-1) + b_{1,2} u_1(t-2) + b_{2,1} u_2(t-1) + b_{2,2} u_2(t-2) \\
& + b_{3,1} u_3(t-1) + b_{3,2} u_3(t-2)
\end{aligned}
\tag{1}
\]

In this simple relationship we can adjust the parameters to fit the observed data as well as possible by a common Least Squares fit. We use only the first 90 data points of the observed data. That gives certain numerical values of the 9 parameters above:

\[
\begin{aligned}
y(t) - 1.15 y(t-1) + 0.50 y(t-2) - 0.35 y(t-3) = {} & -0.54 u_1(t-1) + 0.04 u_1(t-2) \\
& + 0.15 u_2(t-1) + 0.16 u_2(t-2) \\
& + 0.16 u_3(t-1) + 0.07 u_3(t-2)
\end{aligned}
\tag{2}
\]

We may note that this model is unstable: it has a pole at 1.0026. That is in order, because the pitch channel is unstable at the velocity and altitude in question.

How can we test if this model is OK? Since we used only half of the observed data for the estimation, we can test the model on the whole data record. Since the model is unstable and thus simulation is difficult, it is natural to let the model predict future outputs, say 5 samples ahead, and compare with the measured outputs. That is done in Figure 3. We see that the simple model (2) provides quite reasonable predictions over data it has not seen before. These could conceivably be improved if more elaborate model structures than (1) were tried out.

Fig. 3. The measured output (solid line) compared to the 5-step-ahead predicted one (dashed line). Vertical axis: pitch rate; horizontal axis: time (seconds).

1.2 System Identification

System Identification is about building mathematical models of dynamical systems from observed input-output signals, like we did in (2). This problem area contains a number of considerations, like

• what model type, e.g. (1), should be used?
• how should the parameters in the model be adjusted?
• what inputs should be applied to obtain a good model?
• how do we assess the quality of the model?
• how do we gain confidence in an estimated model?
• ....

There is a very extensive literature on the subject, with many textbooks, like [3] and [12]. Most of the techniques for system identification have their origins in estimation paradigms from mathematical statistics, and classical methods like Maximum Likelihood (ML) have been important elements in the area. In this article the main ingredients of this "classical" view of System Identification will be reviewed. Quite recently, alternative techniques, mostly from machine learning and convex optimization, but also with roots in classical statistics, have emerged. The main elements of these will also be reviewed here.

2 Classical Approach to Parametric Methods

2.1 Model Structures

A model structure $\mathcal{M}$ is a parameterized collection of models that describe the relations between the input and output signals of the system. The parameters are denoted by $\theta$, so $\mathcal{M}(\theta)$ is a particular model. That model gives a rule to predict (one step ahead) the output at time $t$, i.e. $y(t)$, based on observations of previous input-output data up to time $t-1$ (denoted by $Z^{t-1}$):

\[
\hat{y}(t|\theta) = g(t, \theta, Z^{t-1}) \tag{3}
\]

For linear systems, a general model structure is given by the transfer function $G$ from input to output and the transfer function $H$ from a white noise source $e$ to the additive output disturbances:

\[
y(t) = G(q, \theta) u(t) + H(q, \theta) e(t) \tag{4a}
\]
\[
\mathrm{E}\, e^2(t) = \lambda; \qquad \mathrm{E}\, e(t) e(k) = 0 \ \text{ if } k \neq t \tag{4b}
\]

This model is in discrete time and $q$ denotes the shift operator, $q y(t) = y(t+1)$. We assume for simplicity that the sampling interval is one time unit. The expansion of $G(q, \theta)$ in the inverse (backwards) shift operator gives the impulse response of the system:

\[
G(q, \theta) u(t) = \sum_{k=1}^{\infty} g_k(\theta) q^{-k} u(t) = \sum_{k=1}^{\infty} g_k(\theta) u(t-k) \tag{5}
\]

The discrete-time Fourier transform gives the frequency response of the system:

\[
G(e^{i\omega}, \theta) = \sum_{k=1}^{\infty} g_k(\theta) e^{-ik\omega} \tag{6}
\]

The natural predictor for (4a) is

\[
\hat{y}(t|\theta) = \frac{H(q, \theta) - 1}{H(q, \theta)}\, y(t) + \frac{G(q, \theta)}{H(q, \theta)}\, u(t) \tag{7}
\]

Note that the expansion of $H$ starts with a "1", so the numerator in the first term starts with $h_1 q^{-1}$, and there is a delay in $y$. The question now is how to parameterize $G$ and $H$.

2.1.1 Black-Box Models

Common black-box (i.e. no physical insight or interpretation) parameterizations are to let $G$ and $H$ be rational in the shift operator:

\[
G(q, \theta) = \frac{B(q)}{F(q)}; \qquad H(q, \theta) = \frac{C(q)}{D(q)} \tag{8a}
\]
\[
B(q) = b_1 q^{-1} + b_2 q^{-2} + \ldots + b_{nb} q^{-nb} \tag{8b}
\]
\[
F(q) = 1 + f_1 q^{-1} + \ldots + f_{nf} q^{-nf} \tag{8c}
\]
\[
\theta = [b_1, b_2, \ldots, f_{nf}] \tag{8d}
\]

$C$ and $D$ are, like $F$, monic, i.e. they start with a "1".

A very common case is that $F = D = A$ and $C = 1$, which gives the ARX model:

\[
y(t) = \frac{B(q)}{A(q)} u(t) + \frac{1}{A(q)} e(t) \quad \text{or} \tag{9a}
\]
\[
A(q) y(t) = B(q) u(t) + e(t) \quad \text{or} \tag{9b}
\]
\[
y(t) + a_1 y(t-1) + \ldots + a_{na} y(t-na) \tag{9c}
\]
\[
\qquad = b_1 u(t-1) + \ldots + b_{nb} u(t-nb) + e(t) \tag{9d}
\]

This is the model structure we used in (1) in the introductory example.

Other common black-box structures of this kind are FIR (Finite Impulse Response model, $F = C = D = 1$), ARMAX ($F = D = A$), and BJ (Box-Jenkins, all four polynomials different).

2.1.2 Grey-Box Models

If some physical facts are known about the system, it is possible to build them into a Grey-Box Model. It could, for example, be that for the airplane in the introduction, the motion equations are known from Newton's laws, but certain parameters are unknown, like the aerodynamic derivatives. Then it is natural to build a continuous-time state-space model from the physical equations:

\[
\begin{aligned}
\dot{x}(t) &= A(\theta) x(t) + B(\theta) u(t) \\
y(t) &= C(\theta) x(t) + D(\theta) u(t) + v(t)
\end{aligned}
\tag{10}
\]

Here $\theta$ corresponds to unknown physical parameters, while the other matrix entries signify known physical behaviour. This model can be sampled with the well-known sampling formulas to give

\[
\begin{aligned}
x(t+1) &= F(\theta) x(t) + G(\theta) u(t) \\
y(t) &= C(\theta) x(t) + D(\theta) u(t) + w(t)
\end{aligned}
\tag{11}
\]

See [8] for a deeper discussion of sampling of systems with disturbances. The model (11) has the transfer function from $u$ to $y$

\[
G(q, \theta) = C(\theta) [qI - F(\theta)]^{-1} G(\theta) + D(\theta) \tag{12}
\]

so we have achieved a particular parameterization of the general linear model (4a).

2.2 Fitting Time-Domain Data

Suppose now we have collected a data record in the time domain

\[
Z^N = \{u(1), y(1), \ldots, u(N), y(N)\} \tag{13}
\]

It is most natural to compare the model-predicted values (7) with the actual outputs and form the criterion of fit

\[
V_N(\theta) = \frac{1}{N} \sum_{t=1}^{N} [y(t) - \hat{y}(t|\theta)]^2 \tag{14}
\]

and form the parameter estimate

\[
\hat{\theta}_N = \arg\min_{\theta} V_N(\theta) \tag{15}
\]

We call this the Prediction Error Method, PEM. It coincides with the Maximum Likelihood, ML, method if the noise source $e$ is Gaussian. See, e.g., [3] or [6] for more details.

2.3 Fitting Frequency-Domain Data

Suppose instead that we have been given frequency-domain data. That could be in the input-output form

\[
Y_N(e^{i\omega_k}),\ U_N(e^{i\omega_k}), \quad k = 1, 2, \ldots, M \tag{16}
\]
\[
Y_N(z) = \frac{1}{\sqrt{N}} \sum_{k=1}^{N} y(k) z^{-k} \tag{17}
\]

or as observed samples from the frequency function

\[
\hat{\hat{G}}_N(e^{i\omega_k}), \quad k = 1, 2, \ldots, M \tag{18}
\]
\[
\text{e.g.} \quad \hat{\hat{G}}_N(e^{i\omega}) = \frac{Y_N(e^{i\omega})}{U_N(e^{i\omega})} \quad \text{(ETFE)} \tag{19}
\]

By taking the Fourier transform of (4a) we see that

\[
Y(e^{i\omega}) = G(e^{i\omega}, \theta) U(e^{i\omega}) \tag{20}
\]

plus a noise term that has variance

\[
\lambda |H(e^{i\omega}, \theta)|^2 \tag{21}
\]

Simple least squares (LS) curve fitting says that we should fit observations with weights that are inversely proportional to the measurement variance. That gives the weighted LS criterion

\[
V_N(\theta) = \sum_{k=1}^{M} \frac{|Y_N(e^{i\omega_k}) - G(e^{i\omega_k}, \theta) U_N(e^{i\omega_k})|^2}{|H(e^{i\omega_k}, \theta)|^2} \tag{22}
\]

(the constant $\lambda$ does not affect the minimization of $V_N$). It can readily be verified that (22) coincides with (14) by Parseval's identity in case $M = N$ and the frequencies $\omega_k$ are selected as the DFT grid.

Notice that (22) can be written as

\[
V_N(\theta) = \sum_{k=1}^{M} \left| \frac{Y_N(e^{i\omega_k})}{U_N(e^{i\omega_k})} - G(e^{i\omega_k}, \theta) \right|^2 \cdot \frac{|U_N(e^{i\omega_k})|^2}{|H(e^{i\omega_k}, \theta)|^2} \tag{23}
\]

We can see this as a properly weighted curve fit of the frequency function to the ETFE (19).

3 Bias and Variance

The observations, certainly those of the output from the system, are affected by noise and disturbances, which of course also will influence the estimated model (15). The disturbances are typically described as stochastic processes, which makes the estimate $\hat{\theta}_N$ a random variable. This has a certain probability density function (pdf) and a mean and a variance. The difference between the mean and a true description of the system measures the bias of the model. If the mean coincides with the true system, the estimate is said to be unbiased. The total error in a model thus has two contributions: the bias and the variance.

3.1 Trade-off between Bias and Variance

Generally speaking, the quality of the model depends on the quality of the measured data and the flexibility of the chosen model structure (3). A more flexible model structure typically has smaller bias, since it is easier to come closer to the true system. At the same time, it will have a higher variance: with higher flexibility it is easier to be fooled by disturbances. So the trade-off between bias and variance to reach a small total error is a choice of balanced flexibility of the model structure.

As the model gets more flexible, the fit to the estimation data in (15), $V_N(\hat{\theta}_N)$, will always improve. To account for the variance contribution, it is thus necessary to modify this fit to assess the total quality of the model. A much used technique for this is Akaike's criterion, e.g. [1],

\[
\hat{\theta}_N = \arg\min \left[ V_N(\theta) + 2\, \frac{\dim \theta}{N} \right] \tag{24}
\]

where the minimization also takes place over a family of model structures with different numbers of parameters ($\dim \theta$).

Another important technique is to evaluate the criterion function for the model on another set of data, validation data, and pick the model which gives the best fit to this independent data set. This is known as cross validation.

3.2 Asymptotic Properties of the Model

Except in simple special cases it is quite difficult to compute the pdf of the estimate $\hat{\theta}_N$. However, its asymptotic properties as $N \to \infty$ are easier to establish. The basic results can be summarized as follows ($\mathrm{E}$ denotes mathematical expectation):

•
\[
\hat{\theta}_N \to \theta^* = \arg\min_{\theta} \mathrm{E} \lim_{N \to \infty} V_N(\theta) \tag{25}
\]
So the estimate will converge to the best possible model, which gives the smallest average prediction error.

•
\[
\mathrm{Cov}\, \hat{\theta}_N \sim \frac{\lambda}{N} \left[ \mathrm{Cov}\, \frac{d}{d\theta} \hat{y}(t|\theta) \right]^{-1} \tag{26}
\]
So the covariance matrix of the parameter estimate is given by the inverse covariance matrix of the gradient of the predictor with respect to the parameters. $\lambda$ is the variance of the optimal prediction errors (the innovations).

See [3], chapters 8 and 9, for a general treatment.

These results are valid for quite general model structures. Now, specialize to linear models (4a) and assume that the true system is described by

\[
y(t) = G_0(q) u(t) + H_0(q) e(t) \tag{27}
\]

where $G_0$ and $H_0$ could be general transfer functions, possibly much more complicated than the model. Then

•
\[
\theta^* = \arg\min_{\theta} \int_{-\pi}^{\pi} |G(e^{i\omega}, \theta) - G_0(e^{i\omega})|^2\, \frac{\Phi_u(\omega)}{|H(e^{i\omega}, \theta)|^2}\, d\omega \tag{28}
\]
That is, the frequency function of the limiting model will approximate the true frequency function as well as possible in a frequency norm given by the input spectrum $\Phi_u$ and the noise model.

•
\[
\mathrm{Cov}\, G(e^{i\omega}, \hat{\theta}_N) \sim \frac{n}{N}\, \frac{\Phi_v(\omega)}{\Phi_u(\omega)} \quad \text{as } n, N \to \infty \tag{29}
\]
where $n$ is the model order and $\Phi_v$ is the noise spectrum $\lambda |H_0(e^{i\omega})|^2$. The variance of the estimated frequency function at a given frequency is thus, for a high-order model, proportional to the noise-to-signal ratio at that frequency. That is a natural and intuitive result.

4 Approximating Linear Systems by ARX Models

Suppose the true linear system is given by

\[
y(t) = G_0(q) u(t) + H_0(q) e(t) \tag{30}
\]

Suppose we build an ARX model (9) for larger and larger orders $n = na$ and $m = nb$:

\[
A_n(q) y(t) = B_m(q) u(t) + e(t) \tag{31}
\]

Then it is well known from [5] that as the orders tend to infinity, at the same time as the number of data $N$ increases even faster, we have for the ARX estimate

\[
\frac{\hat{B}_m(q)}{\hat{A}_n(q)} \to G_0(q) \tag{32a}
\]
\[
\frac{1}{\hat{A}_n(q)} \to H_0(q) \quad \text{as } n, m \to \infty \tag{32b}
\]

This is quite a useful result. ARX models are easy to estimate. The estimates are calculated by linear least squares techniques, which are convex and numerically robust. Estimating a high-order ARX model, possibly followed by some model order reduction, could thus be a viable alternative to the numerically more demanding general PEM criterion (15). This has been extensively used, e.g. by [14], [15].

The only drawback with high-order ARX models is that they may suffer from high variance. That is the problem we now turn to.

5 Regularization of Linear Regression Models

5.1 Linear Regressions

A Linear Regression problem has the form

\[
y(t) = \varphi^T(t) \theta + e(t) \tag{33}
\]

Here $y$ (the output) and $\varphi$ (the regression vector) are observed variables, $e$ is a noise disturbance and $\theta$ is the unknown parameter vector. In general $e(t)$ is assumed to be independent of $\varphi(t)$.

It is convenient to rewrite (33) in vector form, by stacking all the elements (rows) in $y(t)$ and $\varphi^T(t)$ to form the vectors (matrices) $Y$ and $\Phi$ and obtain

\[
Y = \Phi \theta + E \tag{34}
\]

The least squares estimate of the parameter $\theta$ is

\[
\hat{\theta}_N = \arg\min_{\theta} |Y - \Phi \theta|^2 \quad \text{or} \tag{35a}
\]
\[
\hat{\theta}_N = R_N^{-1} F_N; \qquad R_N = \Phi^T \Phi; \qquad F_N = \Phi^T Y \tag{35b}
\]

5.2 Regularized Least Squares

It can be shown that the variance of $\hat{\theta}$ could be quite large, in particular if $\Phi$ has many columns and/or is ill-conditioned. Therefore it makes sense to regularize the estimate by a matrix $P$:

\[
\hat{\theta}_N = \arg\min_{\theta} |Y - \Phi \theta|^2 + \theta^T P^{-1} \theta \quad \text{or} \tag{36a}
\]
\[
\hat{\theta}_N = (R_N + P^{-1})^{-1} F_N \tag{36b}
\]

The presence of the matrix $P$ will improve the numerical properties of the estimation and decrease the variance of the estimate, at the same time as some bias is introduced. Suppose that the data have been generated by (34) for a certain "true" vector $\theta_0$, with noise with variance $\mathrm{E}\, E E^T = I$. ($\mathrm{E}$ denotes mathematical expectation.) Then, the mean square error (MSE) of the estimate is

\[
\mathrm{E}[(\hat{\theta}_N - \theta_0)(\hat{\theta}_N - \theta_0)^T] = (R_N + P^{-1})^{-1} (R_N + P^{-1} \theta_0 \theta_0^T P^{-1}) (R_N + P^{-1})^{-1} \tag{37}
\]

A rational choice of $P$ is one that makes this MSE matrix small. How shall we think of good such choices?

5.3 Bayesian Interpretation

Let us suppose $\theta$ is a random vector. That will make $y$ in (34) random variables that are correlated with $\theta$. If the prior (before $Y$ has been observed) covariance matrix of $\theta$ is $P$, then it is known that the maximum a posteriori (after $Y$ has been observed) estimate of $\theta$ is given by (36a). [See [2] for all technical details in this section.]

So a natural choice of $P$ is to let it reflect how much is known about the vector $\theta$.

5.4 "Empirical Bayes"

Can we estimate this matrix $P$ in some way? Consider (34). If $\theta$ is a Gaussian random vector with zero mean and covariance matrix $P$, and $E$ is a random Gaussian vector with zero mean and covariance matrix $I$, and $\Phi$ is a known, deterministic matrix, then from (34) also $Y$ will be a Gaussian random vector with zero mean and covariance matrix

\[
Z(P) = \Phi P \Phi^T + I \tag{38}
\]

(Two times) the negative logarithm of the probability density function (pdf) of the Gaussian random vector $Y$ will thus be

\[
W(Y, P) = Y^T Z(P)^{-1} Y + \log \det Z(P) \tag{39}
\]

That will also be the negative log likelihood function for estimating $P$ from observations $Y$, so the ML estimate of $P$ will be

\[
\hat{P} = \arg\min_{P} W(Y, P) \tag{40}
\]

We have thus lifted the problem of estimating $\theta$ to a problem where we estimate parameters (in) $P$ that describe the distribution of $\theta$. Such parameters are commonly known as hyperparameters.

If the matrix $\Phi$ is not deterministic, but depends on $E$ in such a way that row $\varphi^T(t)$ is independent of the element $e(t)$ in $E$, it is still true that $W(Y, P)$ in (39) will be the negative log likelihood function for estimating $P$ from $Y$, although then $Y$ is not necessarily Gaussian itself. [See, e.g., Lemma 5.1 in [3].]

5.5 FIR Models

Let us now return to the impulse response (5) and assume it is finite (FIR):

\[
G(q, \theta) u(t) = \sum_{k=1}^{m} b_k u(t-k) = \varphi_u^T(t) \theta_b \tag{41}
\]

where we have collected the $m$ elements of $u(t-k)$ in $\varphi_u(t)$ and the $m$ impulse response coefficients $b_k$ in $\theta_b$. That means that the estimation of FIR models is a linear regression problem. All that was said above about linear regressions, regularization and estimation of hyperparameters can thus be applied to estimation of FIR models. In particular, suitable choices of $P$ should reflect what is reasonable to assume about an impulse response: if the system is stable, $b$ should decay exponentially, and if the impulse response is smooth, neighbouring values should have a positive correlation. That means that a typical regularization matrix $P^b$ for $\theta_b$ would be a matrix whose $k, j$ element is something like

\[
P^b_{k,j}(\alpha) = C \min(\lambda^k, \lambda^j); \qquad \lambda < 1; \qquad \alpha = [C, \lambda] \tag{42}
\]

(Here $\lambda$ is a decay rate, not the innovation variance in (4b).) The hyperparameter $\alpha$ can then be tuned by (40):

\[
\hat{\alpha} = \arg\min_{\alpha} W(Y, P^b(\alpha)) \tag{43}
\]

This method of estimating the impulse response, possibly followed by a model reduction of the high-order FIR model ("modred(idss(firmodel),n)"), has been extensively tested in Monte Carlo simulations in [2]. They clearly show that the approach is a viable alternative to the classical ML/PEM methods, and may in some cases provide better models. An important reason for that is that the tricky question of model order determination is avoided.

5.6 ARX Models

Recall that high-order ARX models provide increasingly better approximations of general linear systems. We can write the ARX model (9) as

\[
\begin{aligned}
y(t) &= -a_1 y(t-1) - \ldots - a_n y(t-n) + b_1 u(t-1) + \ldots + b_m u(t-m) \\
&= \varphi_y^T(t) \theta_a + \varphi_u^T(t) \theta_b = \varphi^T(t) \theta
\end{aligned}
\tag{44}
\]

where $\varphi_y$ and $\theta_a$ are made up from $y$ and $a$ in an obvious way. That means that also the ARX model is a linear regression, to which the same ideas of regularization can be applied. Eq. (44) shows that the predictor consists of two impulse responses, one from $y$ and one from $u$, and similar ideas on the parameterization of the regularization matrix can be used. It is natural to partition the $P$-matrix in (36a) along with $\theta_a$, $\theta_b$ and use

\[
P(\alpha_1, \alpha_2) = \begin{bmatrix} P^a(\alpha_1) & 0 \\ 0 & P^b(\alpha_2) \end{bmatrix} \tag{45}
\]

with $P^{a,b}(\alpha)$ as in (42).

5.7 Related Work

The text in this section essentially follows [2]. Important contributions of the same kind, based on ideas from machine learning, have been described in [11] and [10].

6 Nonparametric Models of Linear Systems

Classical non-parametric models of linear systems are methods to estimate the frequency functions $G(e^{i\omega})$, $H(e^{i\omega})$ in (6) directly from data (often the ETFE (19)) without first finding any model parameters. Since

\[
\Phi_v(\omega) = \lambda |H(e^{i\omega})|^2 \tag{46}
\]

is the spectrum of the disturbances, such methods are often referred to as spectral analysis.

6.1 Classical Spectral Analysis

Classical spectral analysis is a way of directly smoothing the ETFE by averaging over a window sliding across it:

\[
\hat{G}(e^{i\omega}) = \frac{\int_{-\pi}^{\pi} W_\gamma(\xi - \omega) \beta(\xi) \hat{\hat{G}}_N(e^{i\xi})\, d\xi}{\int_{-\pi}^{\pi} W_\gamma(\xi - \omega) \beta(\xi)\, d\xi} \tag{47}
\]

Here $\beta$ is a weighting that may account for the varying reliability of $\hat{\hat{G}}_N$ over the frequencies. $W_\gamma$ is the window which performs the smoothing. $\gamma$ is a parameter that governs the width of the window, which decides the trade-off between frequency resolution and noise sensitivity. This is a variant of the fundamental bias-variance trade-off which is present in all estimation problems. See Section 6.4 in [3] for more details on this.

6.2 Local Polynomial Techniques

Quite recently, an alternative way of smoothing the ETFE has been suggested [13]. Consider the frequency measurements (16). Assume that they have been collected on an equidistant grid, and denote for simplicity

\[
U(k) = U_N(e^{i\omega_k}); \qquad Y(k) = Y_N(e^{i\omega_k}); \qquad G_k = G(e^{i\omega_k}) \tag{48}
\]

They are related to the frequency function as

\[
Y(k) = G_k U(k) + T_k + E_k \tag{49}
\]

where $T_k$ are transient errors and $E_k$ is noise. If $G_k$ and $T_k$ were constant over a certain frequency interval, they could easily be estimated by averaging over the more rapidly changing noise term. Now assume that the frequency function and the transient error change rather slowly with $k$, so that they are not constant but may vary with frequency like a low-order ($p$) polynomial:

\[
G_{k+r} = \sum_{j=0}^{p} x_j r^j; \qquad T_{k+r} = \sum_{j=0}^{p} y_j r^j; \qquad \beta_k = [x_j, y_j,\ j = 0, \ldots, p] \tag{50}
\]

Then we could use the model (49) over a frequency range around $k$ with more observations ($2M + 1$) than the number of unknown parameters ($2p + 2$) and estimate the parameters by least squares:

\[
\hat{\beta}_k = \arg\min_{\beta_k} \sum_{r=-M}^{M} |Y(k+r) - G_{k+r} U(k+r) - T_{k+r}|^2 \tag{51}
\]

The central estimate, $r = 0$, will then be our estimate:

\[
\hat{G}(e^{i\omega_k}) = G_{k+0} = \hat{x}_0 \tag{52}
\]

This Least Squares estimation has to be performed at each frequency value $\omega_k$ of interest. The calculations are thus more extensive than for the classical estimate (47), but it is shown in [13] that the accuracy is much better. An interesting alternative is to use rational rather than polynomial approximations in (50), as discussed and illustrated in [9].

7 Conclusions

System Identification is an area of clear importance for practical systems work. It now has a well-developed theory and is a standard tool in industrial applications. Even though the area is quite mature, with many links to classical theory, new exciting and fruitful ideas keep being developed. This article has tried to illustrate both these aspects. Further discussions and views on the current status and future perspectives on system identification are given in, e.g., [4] and [7].

References

[1] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19:716–723, 1974.
[2] T. Chen, H. Ohlsson, and L. Ljung. On the estimation of transfer functions, regularizations and Gaussian processes - Revisited. Automatica, 48(8):1525–1535, 2012.
[3] L. Ljung. System Identification - Theory for the User. Prentice-Hall, Upper Saddle River, N.J., 2nd edition, 1999.
[4] L. Ljung. Perspectives on system identification. IFAC Annual Reviews, Spring Issue, 2010.
[5] L. Ljung and B. Wahlberg. Asymptotic properties of the least-squares method for estimating transfer functions and disturbance spectra. Adv. Appl. Prob., 24:412–440, 1992.
[6] L. Ljung. Prediction error estimation methods. Circuits, Systems, and Signal Processing, 21(1):11–21, 2002.
[7] L. Ljung, H. Hjalmarsson, and H. Ohlsson. Four encounters with system identification. European Journal of Control, 17(5-6):449–471, 2011.
[8] L. Ljung and A. Wills. Issues in sampling and estimating continuous-time models with stochastic disturbances. Automatica, 46(5):925–931, 2010.
[9] T. McKelvey and G. Guerin. Nonparametric frequency response estimation using a local rational model. In Proc. 18th IFAC Symposium on System Identification, Brussels, Belgium, July 2012.
[10] G. Pillonetto, A. Chiuso, and G. De Nicolao. Prediction error identification of linear systems: a nonparametric Gaussian regression approach. Automatica, 47(2):291–305, 2011.
[11] G. Pillonetto and G. De Nicolao. A new kernel-based approach for linear system identification. Automatica, 46(1):81–93, January 2010.
[12] R. Pintelon and J. Schoukens. System Identification - A Frequency Domain Approach. IEEE Press, New York, 2nd edition, 2012.
[13] J. Schoukens, G. Vandersteen, K. Barbe, and R. Pintelon. Nonparametric preprocessing in system identification - a powerful tool. European Journal of Control, 15:260–274, 2009.
[14] Y.C. Zhu. Asymptotic properties of prediction error methods. Int. J. Adaptive Control and Signal Processing, 3:357–373, 1989.
[15] Y.C. Zhu and A. C. M. P. Backx. Identification of Multivariable Industrial Processes for Diagnosis and Control. Springer Verlag, London, 1993.
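The least-squares fit of an ARX structure, as used for (1) and formalized in (9) and (35b), can be sketched as follows. This is only an illustration under stated assumptions: the Gripen record is not reproduced here, so the data are synthetic, the second-order test system is made up, and the helper names `arx_regressors` and `fit_arx` are hypothetical.

```python
import numpy as np

def arx_regressors(y, u, na, nb):
    """Return (Phi, Y) with rows phi(t) = [-y(t-1..t-na), u(t-1..t-nb)]."""
    n = max(na, nb)
    Phi = np.array([
        np.concatenate((-y[t - na:t][::-1], u[t - nb:t][::-1]))
        for t in range(n, len(y))
    ])
    return Phi, y[n:]

def fit_arx(y, u, na, nb):
    """Least-squares estimate of the a- and b-parameters of an ARX model."""
    Phi, Y = arx_regressors(y, u, na, nb)
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return theta[:na], theta[na:]

# Synthetic data from an assumed (hypothetical) stable second-order system:
# y(t) + a1 y(t-1) + a2 y(t-2) = b1 u(t-1) + b2 u(t-2) + e(t)
rng = np.random.default_rng(0)
N = 2000
u = rng.standard_normal(N)
e = 0.05 * rng.standard_normal(N)
a_true = np.array([-1.5, 0.7])
b_true = np.array([1.0, 0.5])
y = np.zeros(N)
for t in range(2, N):
    y[t] = (-a_true[0] * y[t - 1] - a_true[1] * y[t - 2]
            + b_true[0] * u[t - 1] + b_true[1] * u[t - 2] + e[t])

a_hat, b_hat = fit_arx(y, u, na=2, nb=2)

# One-step-ahead prediction errors and the criterion of fit (14)
# evaluated on the estimation data.
Phi, Y = arx_regressors(y, u, 2, 2)
V_N = np.mean((Y - Phi @ np.concatenate((a_hat, b_hat))) ** 2)
print("a_hat:", a_hat, "b_hat:", b_hat, "V_N:", V_N)
```

For a multi-input case such as (1), one would append one block of lagged values per input to each regressor row; the estimate itself is still a single call to the least-squares solver.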
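The regularized FIR estimate of Section 5.5 can be sketched along the following lines: build the kernel (42), pick the hyperparameters by minimizing (39) as in (43), and compute the estimate (36b). Several things here are assumptions made for the illustration, not taken from the paper: the data are synthetic with unit noise variance, the minimization in (43) is replaced by a coarse grid search, and the function names are hypothetical. The estimate (36b) is evaluated in the algebraically equivalent form $P \Phi^T (\Phi P \Phi^T + I)^{-1} Y$ to avoid inverting a possibly ill-conditioned $P$.

```python
import numpy as np

def tc_kernel(m, C, lam):
    """Regularization matrix (42): P[k, j] = C * min(lam**k, lam**j), k, j = 1..m."""
    decay = lam ** np.arange(1, m + 1)
    return C * np.minimum.outer(decay, decay)

def fir_regressors(y, u, m):
    """Rows phi_u(t) = [u(t-1), ..., u(t-m)] for t = m, ..., N-1."""
    Phi = np.array([u[t - m:t][::-1] for t in range(m, len(y))])
    return Phi, y[m:]

def neg_log_marginal(Y, Phi, P):
    """W(Y, P) in (39), with Z(P) = Phi P Phi^T + I as in (38)."""
    Z = Phi @ P @ Phi.T + np.eye(len(Y))
    _, logdet = np.linalg.slogdet(Z)
    return Y @ np.linalg.solve(Z, Y) + logdet

def regularized_ls(Y, Phi, P):
    """Estimate (36b), written as P Phi^T (Phi P Phi^T + I)^{-1} Y."""
    Z = Phi @ P @ Phi.T + np.eye(len(Y))
    return P @ Phi.T @ np.linalg.solve(Z, Y)

# Synthetic example: exponentially decaying impulse response, unit-variance noise.
rng = np.random.default_rng(1)
N, m = 300, 40
g_true = 0.8 ** np.arange(1, m + 1)
u = rng.standard_normal(N)
y = np.convolve(u, np.concatenate(([0.0], g_true)))[:N] + rng.standard_normal(N)
Phi, Y = fir_regressors(y, u, m)

# Coarse grid search standing in for the minimization (43).
grid = [(C, lam) for C in (0.1, 1.0, 10.0) for lam in (0.5, 0.7, 0.9)]
C_hat, lam_hat = min(grid, key=lambda a: neg_log_marginal(Y, Phi, tc_kernel(m, *a)))

theta_reg = regularized_ls(Y, Phi, tc_kernel(m, C_hat, lam_hat))
theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
err_reg = np.linalg.norm(theta_reg - g_true)
err_ls = np.linalg.norm(theta_ls - g_true)
print("hyperparameters:", C_hat, lam_hat, "errors (reg, LS):", err_reg, err_ls)
```

A gradient-based optimizer over $\alpha = [C, \lambda]$ would be the more usual choice in practice; the grid is used here only to keep the sketch short and deterministic.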
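The local polynomial estimate (50)-(52) of Section 6.2 can be sketched as below and compared with the plain ETFE (19). The setting is an assumption chosen for illustration: a noise-free, made-up FIR system driven by a random (non-periodic) input, so that the only error in the ETFE is the transient term $T_k$ in (49). The function name `local_polynomial_frf` and the choices $M = 6$, $p = 2$ are hypothetical, and frequency indices are wrapped around the DFT grid.

```python
import numpy as np

def local_polynomial_frf(U, Y, M=6, p=2):
    """At each DFT bin k, fit (49)-(50) over r = -M..M by least squares (51)
    and return the central value x_0 as the estimate (52)."""
    N = len(U)
    r = np.arange(-M, M + 1)
    powers = r[:, None] ** np.arange(p + 1)[None, :]      # r**j, shape (2M+1, p+1)
    G_hat = np.empty(N, dtype=complex)
    for k in range(N):
        idx = (k + r) % N                                  # wrap around the grid
        # Unknowns: x_0..x_p (multiplying U) and y_0..y_p (transient polynomial).
        A = np.hstack((U[idx, None] * powers, powers.astype(complex)))
        beta, *_ = np.linalg.lstsq(A, Y[idx], rcond=None)
        G_hat[k] = beta[0]
    return G_hat

# Noise-free synthetic data: y(t) = sum_{k=1}^{3} g_k u(t-k), with u(t) = 0 for t < 0.
rng = np.random.default_rng(2)
N = 1024
g = np.array([1.0, 0.5, 0.25])
u = rng.standard_normal(N)
y = np.convolve(u, np.concatenate(([0.0], g)))[:N]

# DFTs scaled as in (17); the index offset relative to (17) is a common phase
# factor in U and Y and cancels in the ratio.
U = np.fft.fft(u) / np.sqrt(N)
Y = np.fft.fft(y) / np.sqrt(N)

omega = 2 * np.pi * np.arange(N) / N
G_true = sum(g[k - 1] * np.exp(-1j * omega * k) for k in range(1, 4))

G_etfe = Y / U                                             # ETFE (19)
G_lpm = local_polynomial_frf(U, Y, M=6, p=2)

err_etfe = np.median(np.abs(G_etfe - G_true))
err_lpm = np.median(np.abs(G_lpm - G_true))
print("median |error|, ETFE:", err_etfe, "local polynomial:", err_lpm)
```

The window must satisfy $2M + 1 \geq 2p + 2$ for the local least-squares problem to be determined; a wider window lowers the variance at the price of a larger polynomial approximation error.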
