Some Classical and Some New Ideas for Identification of Linear Systems *
Lennart Ljung
Division of Automatic Control, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden
This paper gives an overview of identification of linear systems. It covers the classical approach of parametric methods using
Maximum Likelihood and Prediction Error Methods, as well as classical non-parametric methods through spectral analysis.
It also covers very recent techniques dealing with convex formulations by regularization of FIR and ARX models, as well as
new alternatives to spectral analysis, through local linear models.
An example of identification of aircraft dynamics illustrates the approaches.
Key words: System identification; Maximum Likelihood; Prediction Error Methods; Spectral Analysis; Regularization; Local Polynomial Methods
An Introductory Example: Aircraft Dynamics
Consider a physical system, with observed input and output signals, see Figure 1. Let us take a modern military aircraft, like the Swedish fighter Gripen, as an example. From one of the earlier test flights, some data were recorded as depicted in Figure 2.

Fig. 1. The Swedish aircraft Gripen.
Fig. 2. Data from an early test flight of Gripen. These data cover 3 seconds of flight and are sampled at 60 Hz. Panels (horizontal axis: time in seconds): (a) the output: pitch rate; (b) control input 1: elevator angle; (c) control input 2: leading edge flap; (d) control input 3: canard angle.
* Support from the European Research Council under the advanced grant LEARN, contract 267381, is gratefully acknowledged.
Email address: [email protected] (Lennart Ljung).
In order to be able to simulate the aircraft and to design an effective autopilot, it is necessary to understand how, in this case, the pitch rate is affected by the three inputs. We need mathematical expressions for this. A fair amount of knowledge exists about aircraft dynamics, but let us just try a simple difference equation relation. Denote the output, the pitch rate, at sample number t by y(t) and the three control inputs at the same time by u_k(t), k = 1, 2, 3. Then assume that we can write

$$
\begin{aligned}
y(t) = {} & -a_1 y(t-1) - a_2 y(t-2) - a_3 y(t-3) \\
& + b_{1,1} u_1(t-1) + b_{1,2} u_1(t-2) \\
& + b_{2,1} u_2(t-1) + b_{2,2} u_2(t-2) \\
& + b_{3,1} u_3(t-1) + b_{3,2} u_3(t-2)
\end{aligned} \tag{1}
$$

In this simple relationship we can adjust the parameters to fit the observed data as well as possible by a common Least Squares fit. We use only the first 90 data points of the observed data. That gives certain numerical values of the 9 parameters above:

$$
\begin{aligned}
y(t) - 1.15\,y(t-1) + 0.50\,y(t-2) - 0.35\,y(t-3) = {} & -0.54\,u_1(t-1) + 0.04\,u_1(t-2) \\
& + 0.15\,u_2(t-1) + 0.16\,u_2(t-2) \\
& + 0.16\,u_3(t-1) + 0.07\,u_3(t-2)
\end{aligned} \tag{2}
$$

We may note that this model is unstable (it has a pole at 1.0026), but that is in order, because the pitch channel is unstable at the velocity and altitude in question.

How can we test if this model is OK? Since we used only half of the observed data for the estimation, we can test the model on the whole data record. Since the model is unstable and thus simulation is difficult, it is natural to let the model predict future outputs, say 5 samples ahead, and compare with the measured outputs. That is done in Figure 3. We see that the simple model (2) provides quite reasonable predictions over data it has not seen before. These could conceivably be improved if more elaborate model structures than (1) were tried out.

Fig. 3. The measured output (solid line) compared to the 5-step-ahead predicted one (dashed line). Axes: pitch rate versus time (seconds).

Classical Approach to Parametric Methods

System Identification
System Identification is about building mathematical models of dynamical systems from observed input-output signals, like we did in (2). This problem area contains a number of considerations, like
• what model type, e.g. (1), should be used?
• how should the parameters in the model be adjusted?
• what inputs should be applied to obtain a good model?
• how do we assess the quality of the model?
• how do we gain confidence in an estimated model?

There is a very extensive literature on the subject, with many textbooks, like [3] and [12]. Most of the techniques for system identification have their origins in estimation paradigms from mathematical statistics, and classical methods like Maximum Likelihood (ML) have been important elements in the area. In this article the main ingredients of this "classical" view of System Identification will be reviewed. Quite recently, alternative techniques, mostly from machine learning and convex optimization, but also with roots in classical statistics, have emerged. The main elements of these will also be reviewed here.

Model Structures
A model structure M is a parameterized collection of models that describe the relations between the input and output signals of the system. The parameters are denoted by θ, so M(θ) is a particular model. That model gives a rule to predict (one-step-ahead) the output at time t, i.e. y(t), based on observations of previous input-output data up to time t − 1 (denoted by Z^{t−1}):

$$ \hat y(t|\theta) = g(t, \theta, Z^{t-1}) \tag{3} $$

For linear systems, a general model structure is given by the transfer function G from input to output and the transfer function H from a white noise source e to output additive disturbances:

$$ y(t) = G(q,\theta)u(t) + H(q,\theta)e(t) \tag{4a} $$
$$ E e^2(t) = \lambda; \quad E e(t)e(k) = 0 \ \text{if } k \neq t \tag{4b} $$

This model is in discrete time and q denotes the shift operator, qy(t) = y(t + 1). We assume for simplicity that the sampling interval is one time unit. The expansion of G(q, θ) in the inverse (backwards) shift operator gives the impulse response of the system:

$$ G(q,\theta) = \sum_{k=1}^{\infty} g_k(\theta) q^{-k}, \qquad G(q,\theta)u(t) = \sum_{k=1}^{\infty} g_k(\theta) u(t-k) \tag{5} $$

The discrete time Fourier transform gives the frequency response of the system:

$$ G(e^{i\omega},\theta) = \sum_{k=1}^{\infty} g_k(\theta) e^{-ik\omega} \tag{6} $$
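As a small numerical illustration of the impulse response expansion and the frequency response above, the following sketch computes the coefficients g_k of a rational transfer function by running its difference equation and checks that the truncated sum of g_k e^{-ikω} agrees with the closed-form frequency response. The first-order transfer function and its coefficient values are hypothetical, chosen only for illustration, and the helper names are not taken from the paper.

```python
import numpy as np

def impulse_response(b, f, n):
    """First n impulse response coefficients g_1..g_n of G(q) = B(q)/F(q).

    b = [b_1, ..., b_nb] are the coefficients of q^-1, q^-2, ... in B(q);
    f = [f_1, ..., f_nf] are the coefficients of the monic F(q) = 1 + f_1 q^-1 + ...
    """
    g = np.zeros(n + 1)                      # g[0] is the (zero) direct term
    for k in range(1, n + 1):
        acc = b[k - 1] if k <= len(b) else 0.0
        for j, fj in enumerate(f, start=1):
            if k - j >= 0:
                acc -= fj * g[k - j]
        g[k] = acc
    return g[1:]

def freq_response_from_impulse(g, omega):
    """Truncated sum over k = 1..len(g) of g_k * exp(-i k omega)."""
    k = np.arange(1, len(g) + 1)
    return np.sum(g * np.exp(-1j * k * omega))

# Hypothetical example: G(q) = 0.5 q^-1 / (1 - 0.8 q^-1)
b, f = [0.5], [-0.8]
g = impulse_response(b, f, 200)

omega = 0.3
G_sum = freq_response_from_impulse(g, omega)
z = np.exp(1j * omega)
G_exact = (0.5 / z) / (1 - 0.8 / z)          # closed form with q replaced by e^{i omega}
print(abs(G_sum - G_exact))
```

Because the system in this example is stable, the impulse response decays geometrically and 200 terms are far more than needed; for a slowly decaying response the truncation length would have to be increased.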
The natural predictor for (4a) is

$$ \hat y(t|\theta) = \frac{H(q,\theta) - 1}{H(q,\theta)}\, y(t) + \frac{G(q,\theta)}{H(q,\theta)}\, u(t) \tag{7} $$
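A minimal sketch of how the one-step-ahead predictor for (4a) can be computed by filtering is given below. It assumes rational G = B/F and H = C/D with hypothetical first-order coefficients, zero initial conditions, and a small hand-written filter routine (`filt` is an illustrative helper, not a library function). When the data are generated by the same model, the prediction errors should reproduce the white noise e(t).

```python
import numpy as np

def filt(num, den, x):
    """Apply the filter num(q^-1)/den(q^-1) to x with zero initial state.

    num and den list coefficients in increasing powers of q^-1; den[0] must be 1.
    """
    y = np.zeros(len(x))
    for t in range(len(x)):
        acc = sum(num[k] * x[t - k] for k in range(len(num)) if t - k >= 0)
        acc -= sum(den[k] * y[t - k] for k in range(1, len(den)) if t - k >= 0)
        y[t] = acc
    return y

# Hypothetical first-order model (values chosen only for illustration)
B, F = [0.0, 0.5], [1.0, -0.8]     # G(q) = 0.5 q^-1 / (1 - 0.8 q^-1)
C, D = [1.0, 0.3], [1.0, -0.6]     # H(q) = (1 + 0.3 q^-1) / (1 - 0.6 q^-1)

rng = np.random.default_rng(0)
N = 300
u = rng.standard_normal(N)
e = 0.1 * rng.standard_normal(N)
y = filt(B, F, u) + filt(C, D, e)  # data generated according to (4a)

# Predictor: yhat = (1 - 1/H) y + (G/H) u, where 1/H = D/C
Hinv_y = filt(D, C, y)
Hinv_Gu = filt(D, C, filt(B, F, u))
yhat = (y - Hinv_y) + Hinv_Gu
eps = y - yhat                     # prediction errors, equal to e(t) for the true model
print(np.max(np.abs(eps - e)))
```

The inverse noise filter 1/H = D/C must be stable for this to work, which is the case here since the zero of C lies inside the unit circle.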
Note that the expansion of H starts with a "1", so the numerator in the first term starts with h_1 q^{-1}, and there is a delay in y. The question now is how to parameterize G and H.

Black-Box models
Common black-box (i.e. no physical insight or interpretation) parameterizations are to let G and H be rational in the shift operator:

$$ G(q,\theta) = \frac{B(q)}{F(q)}; \quad H(q,\theta) = \frac{C(q)}{D(q)} \tag{8} $$
$$ B(q) = b_1 q^{-1} + b_2 q^{-2} + \ldots + b_{nb} q^{-nb} $$
$$ F(q) = 1 + f_1 q^{-1} + \ldots + f_{nf} q^{-nf} $$
$$ \theta = [b_1, b_2, \ldots, f_{nf}] $$

C and D are, like F, monic, i.e. start with a "1".

A very common case is that F = D = A and C = 1, which gives the ARX model:

$$
\begin{aligned}
& y(t) = \frac{B(q)}{A(q)} u(t) + \frac{1}{A(q)} e(t) \quad \text{or} \\
& A(q)y(t) = B(q)u(t) + e(t) \quad \text{or} \\
& y(t) + a_1 y(t-1) + \ldots + a_{na} y(t-na) = b_1 u(t-1) + \ldots + b_{nb} u(t-nb) + e(t)
\end{aligned} \tag{9}
$$

This is the model structure we used in (1) in the introductory example.

Other common black-box structures of this kind are FIR (Finite Impulse Response model, F = C = D = 1), ARMAX (F = D = A), and BJ (Box-Jenkins, all four polynomials different).

Grey-Box Models
If some physical facts are known about the system, it is possible to build that into a Grey-Box Model. It could, for example, be that for the airplane in the introduction, the motion equations are known from Newton's laws, but certain parameters are unknown, like the aerodynamical derivatives. Then it is natural to build a continuous-time state-space model from physical equations:

$$
\begin{aligned}
\dot x(t) &= A(\theta)x(t) + B(\theta)u(t) \\
y(t) &= C(\theta)x(t) + D(\theta)u(t) + v(t)
\end{aligned} \tag{10}
$$

Here θ corresponds to unknown physical parameters, while the other matrix entries signify known physical behaviour. This model can be sampled with the well-known sampling formulas to give

$$
\begin{aligned}
x(t+1) &= F(\theta)x(t) + G(\theta)u(t) \\
y(t) &= C(\theta)x(t) + D(\theta)u(t) + w(t)
\end{aligned} \tag{11}
$$

See [8] for deeper discussion of sampling of systems with disturbances.

The model (11) has the transfer function from u to y

$$ G(q,\theta) = C(\theta)[qI - F(\theta)]^{-1} G(\theta) + D(\theta) \tag{12} $$

so we have achieved a particular parameterization of the general linear model (4a).

Fitting Time-Domain Data
Suppose now we have collected a data record in the time domain:

$$ Z^N = \{u(1), y(1), \ldots, u(N), y(N)\} \tag{13} $$

It is most natural to compare the model predicted values (7) with the actual outputs and form the criterion of fit

$$ V_N(\theta) = \frac{1}{N} \sum_{t=1}^{N} [y(t) - \hat y(t|\theta)]^2 \tag{14} $$

and form the parameter estimate

$$ \hat\theta_N = \arg\min_{\theta} V_N(\theta) \tag{15} $$

We call this the Prediction Error Method, PEM. It coincides with the Maximum Likelihood, ML, method if the noise source e is Gaussian. See, e.g., [3] or [6] for more details.

Fitting Frequency-Domain Data
Suppose instead that we have been given frequency-domain data. That could be in the input-output form

$$ Y_N(e^{i\omega_k}),\ U_N(e^{i\omega_k}), \quad k = 1, 2, \ldots, M \tag{16} $$

with

$$ Y_N(z) = \frac{1}{\sqrt{N}} \sum_{k=1}^{N} y(k) z^{-k} \tag{17} $$

or being observed samples from the frequency function

$$ \hat G_N(e^{i\omega_k}), \quad k = 1, 2, \ldots, M \tag{18} $$

e.g. the empirical transfer function estimate (ETFE)

$$ \hat G_N(e^{i\omega}) = \frac{Y_N(e^{i\omega})}{U_N(e^{i\omega})} \tag{19} $$

By taking the Fourier transform of (4a) we see that

$$ Y(e^{i\omega}) = G(e^{i\omega},\theta) U(e^{i\omega}) \tag{20} $$
plus a noise term that has variance

$$ \lambda |H(e^{i\omega},\theta)|^2 \tag{21} $$

Simple least squares (LS) curve fitting says that we should fit observations with weights that are inversely proportional to the measurement variance. That gives the weighted LS criterion

$$ V_N(\theta) = \sum_{k=1}^{M} \frac{|Y_N(e^{i\omega_k}) - G(e^{i\omega_k},\theta) U_N(e^{i\omega_k})|^2}{|H(e^{i\omega_k},\theta)|^2} \tag{22} $$

(the constant λ does not affect the minimization of V_N). It can readily be verified that (22) coincides with (14) by Parseval's identity in case M = N and the frequencies ω_k are selected as the DFT grid.

Notice that (22) can be written as

$$ V_N(\theta) = \sum_{k=1}^{M} \left| \frac{Y_N(e^{i\omega_k})}{U_N(e^{i\omega_k})} - G(e^{i\omega_k},\theta) \right|^2 \cdot \frac{|U_N(e^{i\omega_k})|^2}{|H(e^{i\omega_k},\theta)|^2} \tag{23} $$

We can see that as a properly weighted curve-fitting of the frequency function to the ETFE (19).

Bias and Variance
The observations, certainly of the output from the system, are affected by noise and disturbances, which of course also will influence the estimated model (15). The disturbances are typically described as stochastic processes, which makes the estimate θ̂_N a random variable. This has a certain probability density function (pdf) and a mean and a variance. The difference between the mean and a true description of the system measures the bias of the model. If the mean coincides with the true system, the estimate is said to be unbiased. The total error in a model thus has two contributions: the bias and the variance.

Trade-off between bias and variance
Generally speaking, the quality of the model depends on the quality of the measured data and the flexibility of the chosen model structure (3). A more flexible model structure typically has smaller bias, since it is easier to come closer to the true system. At the same time, it will have a higher variance: with higher flexibility it is easier to be fooled by disturbances. So the trade-off between bias and variance to reach a small total error is a choice of balanced flexibility of the model structure.

As the model gets more flexible, the fit to the estimation data in (15), V_N(θ̂_N), will always improve. To account for the variance contribution, it is thus necessary to modify this fit to assess the total quality of the model. A much used technique for this is Akaike's criterion, e.g. [1],

$$ \hat\theta_N = \arg\min \big[ V_N(\theta) + 2\,\ldots \big] \tag{24} $$

where the minimization also takes place over a family of model structures with different numbers of parameters (dim θ).

Another important technique is to evaluate the criterion function for the model on another set of data, validation data, and pick the model which gives the best fit to this independent data set. This is known as cross validation.

Asymptotic Properties of the Model
Except in simple special cases it is quite difficult to compute the pdf of the estimate θ̂_N. However, its asymptotic properties as N → ∞ are easier to establish. The basic results can be summarized as follows (E denotes mathematical expectation):

$$ \hat\theta_N \to \theta^* = \arg\min_{\theta} E \lim_{N \to \infty} V_N(\theta) \tag{25} $$

So the estimate will converge to the best possible model, which gives the smallest average prediction error.

$$ \mathrm{Cov}\, \hat\theta_N \sim \frac{\lambda}{N} \left[ \mathrm{Cov}\, \frac{d}{d\theta} \hat y(t|\theta) \right]^{-1} \tag{26} $$

So the covariance matrix of the parameter estimate is given by the inverse covariance matrix of the gradient of the predictor with respect to the parameters. λ is the variance of the optimal prediction errors (the innovations). See [3], chapters 8 and 9, for a general treatment.

These results are valid for quite general model structures. Now, specialize to linear models (4a) and assume that the true system is described by

$$ y(t) = G_0(q)u(t) + H_0(q)e(t) \tag{27} $$

which could be general transfer functions, possibly much more complicated than the model. Then

$$ \theta^* = \arg\min_{\theta} \int_{-\pi}^{\pi} |G(e^{i\omega},\theta) - G_0(e^{i\omega})|^2 \frac{\Phi_u(\omega)}{|H(e^{i\omega},\theta)|^2}\, d\omega \tag{28} $$
That is, the frequency function of the limiting model will approximate the true frequency function as well as possible in a frequency norm given by the input spectrum Φ_u and the noise model.

$$ \mathrm{Cov}\, G(e^{i\omega}, \hat\theta_N) \sim \frac{n}{N} \frac{\Phi_v(\omega)}{\Phi_u(\omega)} \quad \text{as } n, N \to \infty \tag{29} $$

where n is the model order and Φ_v is the noise spectrum λ|H_0(e^{iω})|². The variance of the estimated frequency function at a given frequency is thus, for a high order model, proportional to the noise-to-signal ratio at that frequency. That is a natural and intuitive result.

Approximating Linear Systems by ARX
Suppose the true linear system is given by

$$ y(t) = G_0(q)u(t) + H_0(q)e(t) \tag{30} $$

Suppose we build an ARX model (9) for larger and larger orders n = na and m = nb:

$$ A_n(q)y(t) = B_m(q)u(t) + e(t) \tag{31} $$

Then it is well known from [5] that as the orders tend to infinity at the same time as the number of data N increases even faster, we have for the ARX estimate

$$ \frac{\hat B_m(q)}{\hat A_n(q)} \to G_0(q); \quad \frac{1}{\hat A_n(q)} \to H_0(q) \quad \text{as } n, m \to \infty \tag{32} $$

This is quite a useful result. ARX models are easy to estimate. The estimates are calculated by linear least squares techniques, which are convex and numerically robust. Estimating a high order ARX model, possibly followed by some model order reduction, could thus be a viable alternative to the numerically more demanding general PEM criterion (15). This has been extensively used, e.g. by [14], [15].

The only drawback with high order ARX models is that they may suffer from high variance. That is the problem we now turn to.

Regularization of Linear Regression Models

Linear Regressions
A Linear Regression problem has the form

$$ y(t) = \varphi^T(t)\theta + e(t) \tag{33} $$

Here y (the output) and φ (the regression vector) are observed variables, e is a noise disturbance and θ is the unknown parameter vector. In general e(t) is assumed to be independent of φ(t).

It is convenient to rewrite (33) in vector form, by stacking all the elements (rows) in y(t) and φ^T(t) to form the vectors (matrices) Y and Φ and obtain

$$ Y = \Phi\theta + E \tag{34} $$

The least squares estimate of the parameter θ is

$$ \hat\theta_N = \arg\min_{\theta} |Y - \Phi\theta|^2 \quad \text{or} \tag{35a} $$
$$ \hat\theta_N = R_N^{-1} F_N; \quad R_N = \Phi^T\Phi; \quad F_N = \Phi^T Y \tag{35b} $$

Regularized Least Squares
It can be shown that the variance of θ̂ could be quite large, in particular if Φ has many columns and/or is ill-conditioned. Therefore it makes sense to regularize the estimate by a matrix P:

$$ \hat\theta_N = \arg\min_{\theta} |Y - \Phi\theta|^2 + \theta^T P^{-1} \theta \quad \text{or} \tag{36a} $$
$$ \hat\theta_N = (R_N + P^{-1})^{-1} F_N \tag{36b} $$

The presence of the matrix P will improve the numerical properties of the estimation and decrease the variance of the estimate, at the same time as some bias is introduced. Suppose that the data have been generated by (34) for a certain "true" vector θ_0 with noise with variance E[E E^T] = I. (E denotes mathematical expectation.) Then the mean square error (MSE) of the estimate is

$$ E[(\hat\theta_N - \theta_0)(\hat\theta_N - \theta_0)^T] = (R_N + P^{-1})^{-1} (R_N + P^{-1}\theta_0\theta_0^T P^{-1}) (R_N + P^{-1})^{-1} \tag{37} $$

A rational choice of P is one that makes this MSE matrix small. How shall we think of good such choices?

Bayesian Interpretation
Let us suppose θ is a random vector. That will make y in (34) random variables that are correlated with θ. If the prior (before Y has been observed) covariance matrix of θ is P, then it is known that the maximum a posteriori
(after Y has been observed) estimate of θ is given by (36a). [See [2] for all technical details in this section.] So a natural choice of P is to let it reflect how much is known about the vector θ.

"Empirical Bayes"
Can we estimate this matrix P in some way? Consider (34). If θ is a Gaussian random vector with zero mean and covariance matrix P, and E is a random Gaussian vector with zero mean and covariance matrix I, and Φ is a known, deterministic matrix, then from (34) also Y will be a Gaussian random vector with zero mean and covariance matrix

$$ Z(P) = \Phi P \Phi^T + I \tag{38} $$

(Two times) the negative logarithm of the probability density function (pdf) of the Gaussian random vector Y will thus be

$$ W(Y, P) = Y^T Z(P)^{-1} Y + \log\det Z(P) \tag{39} $$

That will also be the negative log likelihood function for estimating P from observations Y, so the ML estimate of P will be

$$ \hat P = \arg\min_{P} W(Y, P) \tag{40} $$

We have thus lifted the problem of estimating θ to a problem where we estimate parameters (in) P that describe the distribution of θ. Such parameters are commonly known as hyperparameters.

If the matrix Φ is not deterministic, but depends on E in such a way that row φ^T(t) is independent of the element e(t) in E, it is still true that W(P) in (39) will be the negative log likelihood function for estimating P from Y, although then Y is not necessarily Gaussian itself. [See, e.g., Lemma 5.1 in [3].]

FIR Models
Let us now return to the impulse response (5) and assume it is finite (FIR):

$$ G(q,\theta)u(t) = \sum_{k=1}^{m} b_k u(t-k) = \varphi_u^T(t)\theta_b \tag{41} $$

where we have collected the m elements of u(t − k) in φ_u(t) and the m impulse response coefficients b_k in θ_b. That means that the estimation of FIR models is a linear regression problem. All that was said above about linear regressions, regularization and estimation of hyperparameters can thus be applied to estimation of FIR models. In particular, suitable choices of P should reflect what is reasonable to assume about an impulse response: if the system is stable, b should decay exponentially, and if the impulse response is smooth, neighbouring values should have a positive correlation. That means that a typical regularization matrix P^b for θ_b would be a matrix whose k, j element is something like

$$ P^b_{k,j}(\alpha) = C \min(\lambda^k, \lambda^j); \quad \lambda < 1, \quad \alpha = [C, \lambda] \tag{42} $$

The hyperparameter α can then be tuned by (40):

$$ \hat\alpha = \arg\min_{\alpha} W(Y, P^b(\alpha)) \tag{43} $$

This method of estimating impulse responses, possibly followed by a model reduction of the high order FIR model ("modred(idss(firmodel),n)"), has been extensively tested in Monte Carlo simulations in [2]. They clearly show that the approach is a viable alternative to the classical ML/PEM methods, and may in some cases provide better models. An important reason for that is that the tricky question of model order determination is avoided.

ARX Models
Recall that high order ARX models provide increasingly better approximations of general linear systems. We can write the ARX model (9) as

$$
\begin{aligned}
y(t) &= -a_1 y(t-1) - \ldots - a_n y(t-n) + b_1 u(t-1) + \ldots + b_m u(t-m) \\
&= \varphi_y^T(t)\theta_a + \varphi_u^T(t)\theta_b = \varphi^T(t)\theta
\end{aligned} \tag{44}
$$

where φ_y and θ_a are made up from y and a in an obvious way. That means that also the ARX model is a linear regression, to which the same ideas of regularization can be applied. Eq. (44) shows that the predictor consists of two impulse responses, one from y and one from u, and similar ideas on the parameterization of the regularization matrix can be used. It is natural to partition the P-matrix in (36a) along with θ_a, θ_b and use

$$ P(\alpha_1, \alpha_2) = \begin{bmatrix} P^a(\alpha_1) & 0 \\ 0 & P^b(\alpha_2) \end{bmatrix} \tag{45} $$

with P^{a,b}(α) as in (42).
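The regularized FIR estimate with hyperparameters tuned by the likelihood criterion (39) can be sketched numerically as follows. This is only an illustration under stated assumptions: the simulated system, the data length, the unit noise variance and the coarse hyperparameter grid are all hypothetical, a grid search stands in for a proper numerical minimization, and the function names are not taken from the paper or from any toolbox. The regularized estimate is computed in the algebraically equivalent form P Φ^T (Φ P Φ^T + I)^{-1} Y, which avoids inverting P.

```python
import numpy as np

def fir_regressors(u, y, m):
    """Rows phi_u(t) = [u(t-1), ..., u(t-m)] for t = m..N-1, and the matching outputs."""
    N = len(u)
    Phi = np.column_stack([u[m - k:N - k] for k in range(1, m + 1)])
    return Phi, y[m:]

def tc_kernel(m, C, lam):
    """Regularization matrix with elements C * min(lam**k, lam**j), k, j = 1..m."""
    p = lam ** np.arange(1, m + 1)
    return C * np.minimum.outer(p, p)

def neg_log_marginal(Y, Phi, P):
    """W(Y, P) = Y^T Z^-1 Y + log det Z with Z = Phi P Phi^T + I (unit noise variance assumed)."""
    Z = Phi @ P @ Phi.T + np.eye(len(Y))
    _, logdet = np.linalg.slogdet(Z)
    return Y @ np.linalg.solve(Z, Y) + logdet

def regularized_ls(Y, Phi, P):
    """(R + P^-1)^-1 F, evaluated as P Phi^T (Phi P Phi^T + I)^-1 Y."""
    Z = Phi @ P @ Phi.T + np.eye(len(Y))
    return P @ Phi.T @ np.linalg.solve(Z, Y)

# Hypothetical simulated data: exponentially decaying impulse response, unit-variance noise
rng = np.random.default_rng(1)
m, N = 30, 150
theta0 = 0.5 * 0.8 ** np.arange(m)               # b_1, ..., b_m
u = rng.standard_normal(N)
y = np.convolve(u, np.concatenate(([0.0], theta0)))[:N] + rng.standard_normal(N)
Phi, Y = fir_regressors(u, y, m)

# Coarse grid search over alpha = [C, lambda]
grid = [(C, lam) for C in (0.1, 1.0, 10.0) for lam in (0.5, 0.7, 0.8, 0.9, 0.95)]
C_hat, lam_hat = min(grid, key=lambda a: neg_log_marginal(Y, Phi, tc_kernel(m, *a)))

theta_reg = regularized_ls(Y, Phi, tc_kernel(m, C_hat, lam_hat))
theta_ls = np.linalg.lstsq(Phi, Y, rcond=None)[0]   # unregularized estimate for comparison
print(C_hat, lam_hat)
print(np.sum((theta_reg - theta0) ** 2), np.sum((theta_ls - theta0) ** 2))
```

The same functions apply to the ARX case by stacking the regressors from y and u side by side and using a block-diagonal regularization matrix built from two such kernels.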
Related work
The text in this section essentially follows [2]. Important
contributions of the same kind, based on ideas from machine learning, have been described in [11] and [10].
Nonparametric Models of Linear Systems
Classical non-parametric models of linear systems are methods to estimate the frequency functions G(e^{iω}), H(e^{iω}) in (6) directly from data (often the ETFE (19)) without first finding any model parameters. Since

$$ \Phi_y(\omega) = \lambda |H(e^{i\omega})|^2 \tag{46} $$

is the spectrum of the disturbances, such methods are often referred to as spectral analysis.

Classical Spectral Analysis
Classical spectral analysis is a way of directly smoothing the ETFE by averaging over a window sliding across it:

$$ \hat G(e^{i\omega}) = \frac{\int W_\gamma(\xi - \omega)\,\beta(\xi)\,\hat G_N(e^{i\xi})\, d\xi}{\int W_\gamma(\xi - \omega)\,\beta(\xi)\, d\xi} \tag{47} $$

Here β is a weighting that may account for the varying reliability of Ĝ_N over the frequencies. W_γ is the window which performs the smoothing. γ is a parameter that governs the width of the window, which decides the trade-off between frequency resolution and noise sensitivity. This is a variant of the fundamental bias-variance trade-off which is present in all estimation problems. See Section 6.4 in [3] for more details around this.

Local Polynomial Techniques
Quite recently, an alternative way of smoothing the ETFE has been suggested, [13]. Consider the frequency measurements (16). Assume that they have been collected on an equidistant grid, and denote for simplicity

$$ U(k) = U_N(e^{i\omega_k}); \quad Y(k) = Y_N(e^{i\omega_k}); \quad G_k = G(e^{i\omega_k}) \tag{48} $$

They are related to the frequency function as

$$ Y(k) = G_k U(k) + T_k + E_k \tag{49} $$

where T_k are transient errors and E_k is noise. If G_k and T_k were constant over a certain frequency interval, they could easily be estimated by averaging over the more rapidly changing noise term. Now assume that the frequency function and the transient error change rather slowly with k, so that they are not constant but may vary with frequency like a low order (p) polynomial:

$$ G_{k+r} = \sum_{j=0}^{p} x_j r^j; \quad T_{k+r} = \sum_{j=0}^{p} y_j r^j; \quad \beta_k = [x_j, y_j,\ j = 0, \ldots, p] \tag{50} $$

Then we could use the model (49) over a frequency range around k of more observations (2M + 1) than the number of unknown parameters (2p + 2) and estimate the parameters by least squares:

$$ \hat\beta_k = \arg\min_{\beta_k} \sum_{r=-M}^{M} |Y(k+r) - G_{k+r} U(k+r) - T_{k+r}|^2 \tag{51} $$

The central estimate, r = 0, will then be our estimate:

$$ \hat G(e^{i\omega_k}) = G_{k+0} = \hat x_0 \tag{52} $$

This Least Squares estimation has to be performed at each frequency value ω_k of interest. The calculations are thus more extensive than for the classical estimate (47), but it is shown in [13] that the accuracy is much better. An interesting alternative is to use rational rather than polynomial approximations in (50), as discussed and illustrated in [9].

Conclusions
System Identification is an area of clear importance for practical systems work. It now has a well developed theory and is a standard tool in industrial applications. Even though the area is quite mature with many links to classical theory, new exciting and fruitful ideas keep being developed. This article has tried to illustrate both these aspects. Further discussions and views on the current status and future perspectives on system identification are given in e.g. [4] and [7].

References
[1] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19:716–723, 1974.
[2] Tianshi Chen, Henrik Ohlsson, and Lennart Ljung. On the estimation of transfer functions, regularizations and Gaussian processes-Revisited. Automatica, 48(8):1525–1535, 2012.
[3] L. Ljung. System Identification - Theory for the User. Prentice-Hall, Upper Saddle River, N.J., 2nd edition, 1999.
[4] L. Ljung. Perspectives on system identification. IFAC Annual Reviews, Spring Issue, 2010.
[5] L. Ljung and B. Wahlberg. Asymptotic properties of the least-squares method for estimating transfer functions and disturbance spectra. Adv. Appl. Prob., 24:412–440, 1992.
[6] Lennart Ljung. Prediction error estimation methods. Circuits, Systems, and Signal Processing, 21(1):11–21, 2002.
[7] Lennart Ljung, Hakan Hjalmarsson, and Henrik Ohlsson. Four Encounters with System Identification. European Journal of Control, 17(5-6):449–471, 2011.
[8] Lennart Ljung and Adrian Wills. Issues in sampling and estimating continuous-time models with stochastic disturbances. Automatica, 46(5):925–931, 2010.
[9] T. McKelvey and G. Guerin. Nonparametric frequency response estimation using a local rational model. In Proc. 18th IFAC Symposium on System Identification, Brussels, Belgium, July 2012.
[10] G. Pillonetto, A. Chiuso, and G. De Nicolao. Prediction error identification of linear systems: a nonparametric Gaussian regression approach. Automatica, 47(2):291–305, 2011.
[11] G. Pillonetto and G. De Nicolao. A new kernel-based approach for linear system identification. Automatica, 46(1):81–93, January 2010.
[12] R. Pintelon and J. Schoukens. System Identification – A Frequency Domain Approach. IEEE Press, New York, 2nd edition, 2012.
[13] J. Schoukens, G. Vandersteen, K. Barbe, and R. Pintelon. Nonparametric preprocessing in system identification – a powerful tool. European Journal of Control, 15:260–274, 2009.
[14] Y.C. Zhu. Asymptotic properties of prediction error methods. Int. J. Adaptive Control and Signal Processing, 3:357–373, 1989.
[15] Y.C. Zhu and A. C. M. P. Backx. Identification of Multivariable Industrial Processes for Diagnosis and Control. Springer Verlag, London, 1993.