Linköping studies in science and technology. Thesis.
No. 1426
Initialization Methods for
System Identification
Christian Lyzell
Division of Automatic Control
Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
http://www.control.isy.liu.se
[email protected]
Linköping 2009
This is a Swedish Licentiate’s Thesis.
Swedish postgraduate education leads to a Doctor’s degree and/or a Licentiate’s degree.
A Doctor’s Degree comprises 240 ECTS credits (4 years of full-time studies).
A Licentiate’s degree comprises 120 ECTS credits,
of which at least 60 ECTS credits constitute a Licentiate’s thesis.
Linköping studies in science and technology. Thesis.
No. 1426
Initialization Methods for System Identification
Christian Lyzell
[email protected]
www.control.isy.liu.se
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping
Sweden
ISBN 978-91-7393-477-0
ISSN 0280-7971
LiU-TEK-LIC-2009:34
© 2009 Christian Lyzell
Printed by LiU-Tryck, Linköping, Sweden 2009
To my better two thirds Carolin and Tuva
Abstract
In the system identification community, a popular framework for the problem of estimating a parametrized model structure from a sequence of input and output pairs is the prediction-error method. This method tries to find the parameters that maximize the prediction capability of the corresponding model via the minimization of a chosen cost function of the prediction error. This optimization problem is often quite complex, with several local minima, and is commonly solved using a local search algorithm. Thus, it is important to find a good initial estimate for the local search algorithm. This is the main topic of this thesis.
The first problem considered is the regressor selection problem for estimating the
order of dynamical systems. The general problem formulation is difficult to solve and
the worst case complexity equals the complexity of the exhaustive search of all possible
combinations of regressors. To circumvent this complexity, we propose a relaxation of
the general formulation as an extension of the nonnegative garrote regularization method.
The proposed method provides means to order the regressors via their time lag, and a novel algorithmic approach for the ARX and LPV-ARX cases is given.
Thereafter, the initialization of linear time-invariant polynomial models is considered.
Usually, this problem is solved via some multi-step instrumental variables method. For the estimation of state-space models, which are closely related to the polynomial models via canonical forms, the state-of-the-art estimation method is the subspace identification method. It turns out that this method can be easily extended to handle the
estimation of polynomial models. The modifications are minor and only involve some intermediate calculations where already available tools can be used. Furthermore, with the
proposed method other a priori information about the structure can be readily handled,
including a certain class of linear gray-box structures. The proposed extension is not
restricted to the discrete-time case and can be used to estimate continuous-time models.
The final topic in this thesis is the initialization of discrete-time systems containing
polynomial nonlinearities. In the continuous-time case, the tools of differential algebra,
especially Ritt’s algorithm, have been used to prove that such a model structure is globally
identifiable if and only if it can be written as a linear regression model. In particular, this
implies that once Ritt’s algorithm has been used to rewrite the nonlinear model structure
into a linear regression model, the parameter estimation problem becomes trivial. Motivated by the above and the fact that most system identification problems involve sampled
data, a version of Ritt’s algorithm for the discrete-time case is provided. This algorithm
is closely related to the continuous-time version and enables the handling of noise signals
without differentiations.
Acknowledgments
First of all, I would like to thank my supervisor Professor Lennart Ljung and my co-supervisor Dr. Martin Enqvist. You have shown a great deal of patience in guiding me towards the results presented in this thesis and it would almost surely not have been the same without your support. Especially Martin, I have learned a lot from our discussions and I am most grateful for your attention to detail and the time you have given me.
Professor Lennart Ljung, as the head of the Division of Automatic Control, and our
administrator Ulla Salaneck deserve extra gratitude for their immense work in managing
our group. I feel most comfortable in your guidance down the dark slippery slope.¹
I would also like to thank Professor Torkel Glad and Dr. Jacob Roll for their guidance
and for co-authoring some of the papers that now constitute the foundation of this thesis.
Furthermore, working with Dr. Roland Tóth, Dr. Peter Heuberger and Professor Paul
Van den Hof at the Delft Center for Systems and Control is a breeze and I hope our
collaboration will continue in the future.
This work has been supported by the Swedish Research Council which is gratefully
acknowledged.
Writing a thesis, even if it is only for a licentiate degree, can be cumbersome and
quite often frustrating. Therefore, help in different shapes from our computer gurus Dr. Gustaf Hendeby and Lic. Henrik Tidefelt is much appreciated. Your knowledge of Linux, TeX, C++ and computer graphics is substantial and your willingness to explain is even more impressive. This thesis is written in a LaTeX template constructed by Gustaf
and all the figures have been generated via the languages of shapes and mpscatter
written by Henrik.
In a project such as this, it sometimes happens that incomprehensible formulations
magically appear in the text. Luckily, I am surrounded by friends and colleagues with eyes
like eagles and I am most grateful to Lic. Daniel Ankelhed, Rikard Falkeborn, Fredrik
Lindsten, Lic. Henrik Ohlsson, Dr. Umut Orguner, Daniel Petersson, Lic. Henrik Tidefelt
and Lic. Johanna Wallén for proofreading various parts of this thesis.
Without my friends, within and outside this group, I would have been lost. The time
we spend on “tjöta”, sharing a pint or even a balcony, is what life is all about. Every time
I get to beat you at Guitar Hero, I feel that I am a little bit better than you, which is quite
a satisfactory feeling ☺. Seriously, you rock!
Finally, I would like to thank the loves of my life, Carolin and Tuva. Please, bear with me. I can change, I promise. . .
¹We fell only once, which is a feat in itself (ERNSI, 2009).
Contents

1 Introduction 1
  1.1 Research Motivation 1
  1.2 Contributions 6
  1.3 Thesis Outline 7

2 Linear Regression Problems 9
  2.1 The Least-Squares Estimator 9
  2.2 Instrumental Variables 10
  2.3 Regressor Selection 11
  2.4 Selection Criteria 15

3 System Identification 17
  3.1 Introduction 17
  3.2 Model Structures 18
    3.2.1 Linear Time-Invariant 18
    3.2.2 Linear Parameter-Varying 21
    3.2.3 Nonlinear Time-Invariant 22
    3.2.4 General Theory 22
  3.3 Instrumental Variables 24
  3.4 Subspace Identification 28
    3.4.1 Discrete Time 28
    3.4.2 Continuous Time 34
  3.5 An Algebraic Approach 36
  3.6 Model Validation 38

4 The Nonnegative Garrote in System Identification 41
  4.1 Introduction 41
  4.2 Problem Formulation 45
  4.3 Applications to ARX Models 46
  4.4 Applications to LPV-ARX Models 50
  4.5 Discussion 55
  4.A Parametric Quadratic Programming 56

5 Utilizing Structure Information in Subspace Identification 59
  5.1 Introduction 59
  5.2 OE Models 60
    5.2.1 Discrete Time 61
    5.2.2 Continuous Time 67
  5.3 ARMAX Models 69
  5.4 Special Gray-Box Models 71
  5.5 Discussion 75

6 Difference Algebra and System Identification 77
  6.1 Introduction 77
  6.2 Algebraic Concepts 79
    6.2.1 Signal Shifts 79
    6.2.2 Polynomials 80
    6.2.3 Systems of Polynomials 84
  6.3 Identifiability 88
  6.4 Identification Aspects 90
  6.5 Discussion 93

Bibliography 95
Notation

Symbols, Operators and Functions

N — the set of natural numbers (0 ∈ N)
Z — the set of integers
Z+ — the set of positive integers
R — the set of real numbers
∈ — belongs to
A ⊂ B — A is a subset of B
A \ B — set difference, {x | x ∈ A ∧ x ∉ B}
≜ — equal by definition
≼ — component-wise inequality (for vectors), negative semidefiniteness (for a matrix A with A ≼ 0)
arg min f(x) — the value of x that minimizes f(x)
q — the shift operator, qu(t) = u(t + 1)
(x(t))_{t=0}^{M} — the sequence x(0), x(1), . . . , x(M)
‖x‖₁ — Σ_{i=1}^{n} |x_i|
‖x‖₂ — (Σ_{i=1}^{n} x_i²)^{1/2}
θ — parameter vector
u(t) — input signal at time t
y(t) — output signal at time t

Abbreviations and Acronyms

AIC — Akaike's Information Criterion
ARX — AutoRegressive (system) with eXternal input
ARMAX — AutoRegressive Moving Average (system) with eXternal input
IV — Instrumental Variables
FS — Forward Stepwise regression
LTI — Linear Time-Invariant (system)
LPV — Linear Parameter-Varying (system)
LS — Least-Squares
MDL — Minimum Description Length
NARX — Nonlinear AutoRegressive (system) with eXternal input
NFIR — Nonlinear Finite Impulse Response (system)
NNG — NonNegative Garrote
OE — Output Error (system)
OCF — Observer Canonical Form
PEM — Prediction-Error Method
QP — Quadratic Programming
SID — Subspace IDentification
1 Introduction
The art of modeling is an inherent nature of the human being and plays a major role
throughout our lives. From the moment of our birth, empirical data of the surrounding
environment are gathered, either through our own experiences or via the experience of
others. The data are then used to construct models, which can help us predict the outcome
of our actions in different situations. Some models protect us from danger, while others
help us to plan our actions to get the desired outcome. As an example, consider the
problem of traveling from one location to another, where there is a constraint on the time
of arrival. Usually, there are several choices of transportation, for example, one can go
by car or take a bus. When using the car, one needs to consult a map to get an idea of
which roads to choose and how much time is needed to cover the distance. Here, the map
represents a model of the terrain and the road network available.
It is important to differentiate the model from the system itself, that is, the system is
the reality that the model tries to explain. In the traveling example, the system is the true
terrain, while the map is a two dimensional model of the system with finite resolution.
System identification is a subset of mathematical modeling, which treats the subject
of modeling of dynamical systems from empirical data. In the following section, we will
try to form a basic understanding of the problems involved when trying to model a system
from measured data.
1.1 Research Motivation
The fundamental concept of system identification can be described as follows: given some observations of a system, find a mathematical model that explains the observations as accurately as possible, thus yielding a valid model of the system itself. The observations
usually consist of a collection of measurements of input u(t) and output y(t) signals, respectively:
Z^N = {(u(t), y(t))}_{t=1}^{N}.  (1.1)
[Block diagram: the input u(t) and the noise v(t) enter the system, which produces the output y(t).]
Figure 1.1: A schematic description of the signals involved in the system identification process.
In most real world processes, not all variables which affect the output of the system are
measured. Such unmeasured signals v(t) are referred as disturbances or noise. Figure 1.1
shows a schematic diagram of the signals involved in the identification process.
A common method in system identification is the prediction-error method (PEM) (see,
for instance, Ljung, 1999). In this method, one tries to find the parameter vector θ which
best describes the data in the sense that some cost function is minimized
θ̂_N = arg min_{θ∈R^n} V_N(ε(t, θ), Z^N).  (1.2)
A common choice of the cost function is the quadratic cost function
V_N(θ, Z^N) = (1/N) Σ_{t=1}^{N} (y(t) − ŷ(t|t − 1, θ))²,  (1.3)
where ŷ(t|t − 1, θ) is the predictor of the output defined by the chosen model structure.
Even though it is often quite straightforward to formulate the parameter estimation problem as the optimization problem (1.2), it turns out that it can sometimes be quite tricky to find the global optimum.
An important subproblem of system identification is to find the model with the lowest
complexity, within some model set, which describes a given set of data sufficiently well
according to some criterion (1.2). There are several reasons for considering this problem.
One reason is that even though a higher model complexity will yield a better adaptation
to the data used for the estimation, it might be that the model is adapting too well. Thus,
the model does not properly represent the system itself, only the data that is used for the
estimation. Also, a higher model complexity usually means a higher computational cost,
both in time and in memory. Hence, a model with lower complexity might be preferred
to one with higher complexity if the loss in the ability to describe the data is not too
great. The model order selection problem is in several ways a difficult problem and is
best described by considering a simple example:
Example 1.1
Consider the simple linear regression model
y(t) = ϕᵀ(t)θ + e(t),  (1.4)
where y(t) and ϕ(t) are known entities, e(t) is an unknown disturbance, and θ ∈ Rn is
the unknown parameter vector that we want to estimate. For the special case of a white
noise disturbance e(t), the predictor of the output of the model structure (1.4) is given by
ŷ(t|t − 1, θ) = ϕᵀ(t)θ.  (1.5)
A common measure of the complexity of linear regression models is the number of parameters used. Thus, the problem of finding the least-squares estimate of θ using only k ≤ n parameters can be formulated as
minimize_{θ∈R^n}  (1/N) Σ_{t=1}^{N} (y(t) − ϕᵀ(t)θ)²
subject to  card{i ∈ N | θ_i ≠ 0, 1 ≤ i ≤ n} ≤ k,  (1.6)
where the cardinality operator card returns the number of elements in the set. In addition,
to find the most appropriate model order, the problem (1.6) needs to be solved for all
1 ≤ k ≤ n. The optimization problem (1.6) can be shown to be NP-hard (see, for
instance, Amaldi and Kann, 1998). Instead of solving (1.6), one could try to estimate all
possible combinations of the regressors and choose the combination that yields the best
prediction of the data. It is easy to see that the total number of combinations is given by
Σ_{k=1}^{n} (n choose k) = 2^n − 1,  (1.7)
which grows quite rapidly with the number of possible regressors n. Thus, there is a need
for algorithms to select model order with lower computational complexity.
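The combinatorial growth in (1.7) is easy to see numerically. The following sketch (not part of the thesis; the data, seed and parameter values are invented for illustration) enumerates all nonempty regressor subsets by brute force and records the best least-squares fit for each cardinality k:

```python
# Brute-force "all possible regressors" search for Example 1.1: fit every
# nonempty subset of the n candidate regressors by least squares and keep
# the best subset per cardinality.  The 2^n - 1 subsets make this
# infeasible for large n, which motivates the relaxations of Chapter 4.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
N, n = 200, 6
Phi = rng.standard_normal((N, n))            # regressor matrix, rows phi(t)^T
theta_true = np.array([1.0, 0.0, -0.5, 0.0, 0.0, 0.25])   # sparse truth
y = Phi @ theta_true + 0.05 * rng.standard_normal(N)

best = {}                                    # k -> (cost, subset)
count = 0
for k in range(1, n + 1):
    for subset in combinations(range(n), k):
        count += 1
        th, *_ = np.linalg.lstsq(Phi[:, subset], y, rcond=None)
        cost = np.mean((y - Phi[:, subset] @ th) ** 2)
        if k not in best or cost < best[k][0]:
            best[k] = (cost, subset)

print(count)                                 # number of fitted models, 2^n - 1
print(best[3][1])                            # support of the best 3-regressor model
```

Already at n = 6 this fits 63 models; at n = 20 it would be over a million, which is why lower-complexity selection algorithms are needed.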
There exists a wide variety of methods to handle the difficulties described above, but many of these methods do not take the properties of dynamical systems into account. A selection of these methods will be presented in a following chapter, where the modifications needed in the dynamical case will also be discussed.
Now, let us consider a direct application of the PEM on a special class of linear models.
A common model structure is the output error (OE) model structure, which is defined by
the input-output relationship
y(t) = (B_p(q)/F_p(q)) u(t) + e(t),  (1.8)
where y(t) is the output, u(t) the input and e(t) is the measurement noise. The polynomials Bp (q) and Fp (q) are given by
B_p(q) = b₁q^{−n_k} + · · · + b_{n_b}q^{−n_k−n_b+1},  (1.9a)
F_p(q) = 1 + f₁q^{−1} + · · · + f_{n_f}q^{−n_f},  (1.9b)
where n_b, n_f, n_k ∈ N and q is the shift operator, that is, qu(t) = u(t + 1). The predictor
of the output defined by the OE model structure is
ŷ(t|t − 1, θ) = (B_p(q)/F_p(q)) u(t),  (1.10)
where the unknown parameters in θ consist of the coefficients of the numerator and denominator polynomials. The problem of estimating the parameters from (1.2), given a set of data (1.1), using the quadratic loss function (1.3), may turn out to have several local minima.
Example 1.2: (Söderström, 1975)
Let the system be given by the OE model
y(t) = (q^{−1}/(1 − 0.7q^{−1})²) u(t),  (1.11)
and let the input be generated by
u(t) = (1 − 0.7q^{−1})²(1 + 0.7q^{−1})² v(t),  (1.12)
where v(t) is white Gaussian noise with zero mean and unit variance. Now, let us try to
fit a model
ŷ(t|t − 1, θ) = (b₁q^{−1}/(1 + a₁q^{−1} + a₂q^{−2})) u(t),  (1.13)
to the data Z^N = {(y(t), u(t))}_{t=1}^{N} generated by (1.11) and (1.12). This is done by minimizing the loss function as in (1.3). It can be shown (Söderström, 1975) that the minimization with respect to b₁ can be done separately and an analytic solution as a function of the values of a₁ and a₂ can be found. Hence, without loss of generality, we need only consider the parameter values of a₁ and a₂ when plotting the level curves of the cost function (1.3). Such a contour plot, for the case N → ∞, is given in Figure 1.2.
[Contour plot over the (a₁, a₂) plane; the stability region is a triangle.]
Figure 1.2: A contour plot of the cost function (1.3) for different values of the parameters a1 and a2. The triangle is the region of stability.
Here we notice that there are two local minima: the global minimum at b1 = 1, a1 = −1.4, a2 = 0.49 and a local one at b1 = −0.23, a1 = 1.367, a2 = 0.513. Thus, it is important to find a good initial estimate for the optimization problem (1.2), if one wishes to apply a local search algorithm, to be able to find the true parameter values. It is worth noting that the non-uniqueness of the minima is due to the special choice of the input signal (1.12) and is not an inherent property of the system (1.11) itself.
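Example 1.2 can be checked numerically. The sketch below (not from the thesis; the sample size, seed and evaluation points are this sketch's own choices) simulates (1.11)–(1.12) without noise, concentrates b1 out of the cost by least squares, and compares the cost at the true parameters with the cost near the spurious local minimum:

```python
# Numerical companion to Example 1.2: evaluate the concentrated cost over
# (a1, a2), with b1 minimized analytically for each point.
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(1)
N = 20000
v = rng.standard_normal(N)
# u(t) = (1 - 0.7 q^-1)^2 (1 + 0.7 q^-1)^2 v(t) = (1 - 0.49 q^-2)^2 v(t)
u = lfilter([1.0, 0.0, -0.98, 0.0, 0.2401], [1.0], v)
# y(t) = q^-1 / (1 - 0.7 q^-1)^2 u(t), denominator 1 - 1.4 q^-1 + 0.49 q^-2
y = lfilter([0.0, 1.0], [1.0, -1.4, 0.49], u)

def cost(a1, a2):
    """V_N of (1.3) with b1 concentrated out by least squares."""
    g = lfilter([0.0, 1.0], [1.0, a1, a2], u)   # predictor output for b1 = 1
    b1 = (y @ g) / (g @ g)                       # optimal b1 for this (a1, a2)
    return np.mean((y - b1 * g) ** 2)

V_global = cost(-1.4, 0.49)    # the true parameters
V_local = cost(1.367, 0.513)   # near the spurious local minimum
print(V_global, V_local)
```

A local search started near the second point would converge to a useless model even though the data are noise-free, which is exactly the initialization problem this thesis addresses.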
Even for the seemingly simple model structure in the example above, it turns out that the
PEM formulation of the parameter estimation problem can be quite difficult to solve. To
this end, a lot of effort has been put into finding good initial estimates which lie close to
the global optimum, from where one can start a local search. For the OE model structure,
and other closely related structures, there is still ongoing research to find good initial
estimates and we will return to this problem at a later stage.
Now, let us consider an even simpler model structure:
Example 1.3
Let the system that we want to model be given by the input-output relation
y(t) = θ₀u(t) + θ₀²u(t − 1),  (1.14)
where θ0 = 0.9 and let the input be a sum of sines u(t) = sin(t) + sin(t/3). Plotting the
cost function (1.3) with
ŷ(t|t − 1, θ) = θu(t) + θ²u(t − 1),  (1.15)
for some different values of θ yields the curve in Figure 1.3.
[Plot of V_N(θ, Z^N) against θ ∈ [−2, 2], showing two local minima.]
Figure 1.3: The cost function (1.3) for the system (1.14) with the predictor (1.15) for some different parameter values θ.
There are two local minima present: the global minimum at θ = 0.9 and a local one at θ = −1.4279. Thus, also for this simple example, it is important to have a good initial estimate to find the true global minimum.
The difficulties that were found in the example above are due to the square on the unknown
parameter. Fortunately, the model structure discussed above has a favorable structure, that
is, only polynomial nonlinearities are present. This enables the use of certain recently developed algebraic methods, which can be used to rewrite (1.14) to an equivalent form
where all the nonlinearities are moved from the unknown parameter to the signals involved. Thus, instead of minimizing a fourth order polynomial, one only needs to find the
unique minimum of a second order polynomial. The mentioned algebraic techniques will
be discussed later in the thesis.
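The effect of such a rewriting can be illustrated on Example 1.3. In the sketch below (an illustration of the idea only, not of the algebraic machinery itself; the sample length and grid are arbitrary choices of this sketch), the quartic cost in the scalar θ exhibits two local minima, while the linear regression in the reparametrized coordinates (θ₁, θ₂) = (θ, θ²) has a unique minimizer:

```python
# Example 1.3 revisited: the nonlinear parametrization gives a quartic cost
# with two local minima, while the linear reparametrization
# y(t) = theta1 * u(t) + theta2 * u(t-1) is an ordinary least-squares problem.
import numpy as np

tt = np.arange(0, 2000)
s = np.sin(tt) + np.sin(tt / 3)
u, u_prev = s[1:], s[:-1]                    # u(t) and u(t-1)
theta0 = 0.9
y = theta0 * u + theta0**2 * u_prev          # the system (1.14), noise-free

# Nonlinear parametrization: grid the quartic cost and count local minima.
grid = np.arange(-2.0, 2.0, 0.01)
V = np.array([np.mean((y - th * u - th**2 * u_prev) ** 2) for th in grid])
minima = [grid[i] for i in range(1, len(grid) - 1)
          if V[i] < V[i - 1] and V[i] < V[i + 1]]

# Linear reparametrization: one unique minimum, recovered in closed form.
Phi = np.column_stack([u, u_prev])
theta12, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(len(minima), theta12)
```

The grid search finds the two minima of Figure 1.3, while the least-squares step returns (θ₁, θ₂) ≈ (0.9, 0.81) directly, from which θ₀ can be read off.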
1.2 Contributions
The main contributions of this thesis are new methods to approach the problems discussed in the section above. Recent developments in the statistical learning community have opened up new efficient algorithms for the model selection problem. These algorithms also have interesting implications for the model order selection of linear dynamic regression models:
C. Lyzell, J. Roll, and L. Ljung. The use of nonnegative garrote for order
selection of ARX models. In Proceedings of the 47th IEEE Conference on
Decision and Control, Cancun, Mexico, December 2008
This method can be modified to also include the structural selection problem, that is, the
problem of choosing between several possible grouped regressors. These modifications
have been developed to include a slightly more general model class in:
R. Tóth, C. Lyzell, M. Enqvist, P. Heuberger, and P. Van den Hof. Order and
structural dependence selection of LPV-ARX models using a nonnegative
garrote approach. In Proceedings of the 48th IEEE Conference on Decision
and Control, Shanghai, China, December 2009
The initialization of linear polynomial model structures is a mature topic in system
identification and several methods exist. In the last two decades, a new algorithm for
estimation of linear state-space models has been developed. By utilizing the structure
of certain canonical state-space forms, this method can be altered to work also as an
initialization method for certain linear polynomial models, such as the OE model in Example 1.2, which is the topic of:
C. Lyzell, M. Enqvist, and L. Ljung. Handling certain structure information
in subspace identification. In Preprints of the 15th IFAC Symposium on
System Identification, Saint-Malo, France, July 2009a
The tools of differential algebra have shown to be quite useful when analyzing different system identification aspects of certain continuous-time model structures. It turns out
that a model structure containing only polynomial nonlinearities is globally identifiable if
and only if it can be written as a linear regression model. If these methods were available
for discrete-time equivalent model structures, this would imply that the parameter estimation problem in Example 1.3 could be rewritten as an optimization problem with only one
minimum. It turns out that some of the results can be generalized, with slight alterations,
to handle discrete-time model structures:
C. Lyzell, T. Glad, M. Enqvist, and L. Ljung. Identification aspects of Ritt’s
algorithm for discrete-time systems. In Preprints of the 15th IFAC Symposium on System Identification, Saint-Malo, France, July 2009b
Published material not included in the thesis is:
C. Lyzell and G. Hovland. Verification of the dynamics of the 5-DOF Gantry-Tau parallel kinematic machine. In Proceedings of Robotics and Applications and Telematics, Würzburg, Germany, August 2007
1.3 Thesis Outline
This thesis is structured as follows: In Chapter 2, the fundamental linear regression model is reviewed, which includes the regressor selection problem discussed in Example 1.1.
Chapter 3 concerns the system identification framework and different model structures
and estimation methods are reviewed. In Chapter 4 a novel algorithmic contribution to the
model selection and the structural selection problem in the dynamical case is described.
The initialization of different linear polynomial model structures is analyzed in Chapter 5,
while Chapter 6 presents an algebraic framework for analyzing system identification aspects of discrete-time model structures with polynomial nonlinearities.
2 Linear Regression Problems
The concept of linear regression goes back to the work of Gauss (1809) on the motion of
heavenly bodies and is still today a widely used tool in the field of applied science. The
linear regression model is the simplest type of a parametric model and is described by the
relationship
y(t) = ϕᵀ(t)θ + v(t).  (2.1)
Here, y(t) is a measurable quantity called the observations or the output, ϕ(t) is a known
vector of independent variables or regressors and θ is a vector of unknown parameters.
The signal v(t) is called disturbance or noise and represents the error that might occur
when measuring the regressed variable y(t). In this chapter, we will review different
methods for estimating the parameter vector θ in (2.1) and also the problem of determining
which regressors are important for describing the data. For further details on the subjects
at hand, see, for instance, Draper and Smith (1981), Casella and Berger (2002) or Hastie
et al. (2009) and the references therein.
2.1 The Least-Squares Estimator
How does one find a reliable estimate of the parameter vector θ in (2.1) given a sequence
of observations y(t) and regressors ϕ(t) where the variable t ranges from 1 to N ? Assuming that the noise is a white Gaussian process with zero mean, which is uncorrelated
with the regressors, it can be shown that the least-squares (LS) estimator
θ̂_N^{LS} = arg min_{θ∈R^{n_θ}} (1/N) Σ_{t=1}^{N} (y(t) − ϕᵀ(t)θ)²,  (2.2)
is a statistically efficient estimator (see, for example, Casella and Berger, 2002). The
difference y(t) − ϕᵀ(t)θ is known as the residual and represents the remaining unmodeled
behavior of the data. By introducing the notation
R_N ≜ (1/N) Σ_{t=1}^{N} ϕ(t)ϕᵀ(t),  (2.3a)
f_N ≜ (1/N) Σ_{t=1}^{N} ϕ(t)y(t),  (2.3b)
the solution to the unconstrained quadratic optimization problem (2.2) can be written as
θ̂_N^{LS} = R_N^{−1} f_N,  (2.4)
given that the matrix R_N has full rank. Determining the estimate θ̂_N^{LS} directly via the formula (2.4) is not numerically sound, and (2.2) should in practice be solved using a more efficient algorithm (see, for instance, Golub and Van Loan, 1996).
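As a small numerical illustration (with synthetic data invented for this sketch), the closed form (2.4) can be compared against an orthogonalization-based least-squares solver of the kind alluded to above; on well-conditioned data the two agree, but the solver route avoids forming R_N explicitly:

```python
# The least-squares estimator (2.2): closed form (2.4) via the normal
# equations versus a QR/SVD-based solver.
import numpy as np

rng = np.random.default_rng(2)
N = 500
phi = rng.standard_normal((N, 3))            # rows are phi(t)^T
theta_true = np.array([0.5, -1.0, 2.0])
y = phi @ theta_true + 0.1 * rng.standard_normal(N)

R_N = phi.T @ phi / N                        # (2.3a)
f_N = phi.T @ y / N                          # (2.3b)
theta_normal = np.linalg.solve(R_N, f_N)     # (2.4), normal equations
theta_solver, *_ = np.linalg.lstsq(phi, y, rcond=None)  # orthogonalization-based

print(theta_normal, theta_solver)
```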
In some applications, one might have prior knowledge of the values of certain parameters or there might exist a linear relationship between some of them, that is, the
parameters satisfy some linear equality constraints Aθ = b. Adding these constraints
to the optimization problem (2.2) results in an equality constrained least-squares (LSE)
problem
minimize_{θ∈R^{n_θ}}  (1/N) Σ_{t=1}^{N} (y(t) − ϕᵀ(t)θ)²  (2.5)
subject to  Aθ = b.
A simple solution to this class of optimization problems is to eliminate the linear equality
constraints by an appropriate change of coordinates. If A has full rank, a possible coordinate change can be found by determining the QR decomposition with column pivoting
AᵀΠ = [Q₁ Q₂] [R; 0]  (2.6a)
and making the change of variables
θ = Q₁R^{−T}Πᵀb + Q₂θ̃.  (2.6b)
This coordinate change eliminates the equality constraints and only an ordinary least-squares problem in the variable θ̃ remains to be solved. See, for example, Nocedal and Wright (2006) for further details on the handling of linear equality constraints.
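The elimination in (2.6) can be sketched as follows (the data, the constraints and the variable names are this sketch's own inventions; a pivoted QR routine is assumed to be available):

```python
# Equality-constrained least squares (2.5) via the change of variables (2.6).
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(3)
N, n, m = 300, 4, 2
Phi = rng.standard_normal((N, n))
y = rng.standard_normal(N)
A = rng.standard_normal((m, n))              # constraints A theta = b
b = rng.standard_normal(m)

# A^T Pi = [Q1 Q2][R; 0]  (2.6a); `piv` encodes the column permutation Pi.
Q, R, piv = qr(A.T, pivoting=True)
Q1, Q2, R1 = Q[:, :m], Q[:, m:], R[:m, :]

# Particular solution theta_p = Q1 R^{-T} Pi^T b, which satisfies A theta_p = b.
w = solve_triangular(R1.T, b[piv], lower=True)
theta_p = Q1 @ w

# Remaining unconstrained least-squares problem in theta_tilde (2.6b).
theta_tilde, *_ = np.linalg.lstsq(Phi @ Q2, y - Phi @ theta_p, rcond=None)
theta_hat = theta_p + Q2 @ theta_tilde
print(theta_hat)
```

By construction the columns of Q₂ span the null space of A, so any θ̃ yields a feasible θ, and the constrained problem reduces to an ordinary least-squares problem of dimension n − m.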
2.2 Instrumental Variables
Let the system have the linear regression structure
y(t) = ϕᵀ(t)θ₀ + v(t).  (2.7)
The least-squares estimator (2.4), using the notation in (2.3), becomes
θ̂_N^{LS} = R_N^{−1} f_N = θ₀ + R_N^{−1} ((1/N) Σ_{t=1}^{N} ϕ(t)v(t)).
Now, if the regressors and the noise are independent and R_N^{−1} is a finite matrix, then it holds that E θ̂_N^{LS} = θ₀, that is, the least-squares estimator will be unbiased.
When the regressors and the noise are correlated, it might happen that the least-squares
estimator is biased and therefore needs to be modified. To this end, the instrumental
variables (IV) method was proposed in Reiersøl (1941). A thorough treatment of the
method is given in Söderström and Stoica (1980) and the fundamentals can be found in,
for instance, Ljung (1999) and Söderström and Stoica (1989). The basic idea of the IV
method is to find variables ζ(t), which are uncorrelated with the noise v(t), so that the
estimator
θ̂_N^{IV} = ((1/N) Σ_{t=1}^{N} ζ(t)ϕᵀ(t))^{−1} ((1/N) Σ_{t=1}^{N} ζ(t)y(t)),  (2.8)
is asymptotically unbiased. The variables ζ(t) are often referred to as instruments. The
variance-optimal instruments depend on the true system (see, for example, Söderström and Stoica, 1980), which is generally not known beforehand. Thus, it may be quite difficult to choose appropriate instruments, and different methods for approximating the optimal instruments have been designed (see, for example, the IV4 method in Söderström and Stoica, 1980). It is clear that the least-squares estimator is a special case of IV when
choosing the instruments as ζ(t) = ϕ(t).
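A minimal illustration of the bias issue (a toy system constructed for this sketch, not an example from the text): the output is measured in noise, so the regressor y(t − 1) is correlated with the equation noise, making LS biased, while IV with delayed inputs as instruments recovers the true parameters:

```python
# LS bias versus IV consistency when the regressors and the noise are
# correlated.  True parameters: [0.5, 1.0].
import numpy as np

rng = np.random.default_rng(4)
N = 50000
u = rng.standard_normal(N)
w = rng.standard_normal(N)                   # output measurement noise
x = np.zeros(N)
for t in range(1, N):
    x[t] = 0.5 * x[t - 1] + u[t - 1]         # noise-free system state
y = x + w                                    # measured output

# Regressors phi(t) = [y(t-1), u(t-1)]; instruments zeta(t) = [u(t-2), u(t-1)],
# which are uncorrelated with w but correlated with the regressors.
t = np.arange(2, N)
phi = np.column_stack([y[t - 1], u[t - 1]])
zeta = np.column_stack([u[t - 2], u[t - 1]])
Y = y[t]

theta_ls = np.linalg.solve(phi.T @ phi, phi.T @ Y)    # biased
theta_iv = np.linalg.solve(zeta.T @ phi, zeta.T @ Y)  # (2.8), consistent
print(theta_ls, theta_iv)
```

Here y(t) = 0.5y(t − 1) + u(t − 1) + (w(t) − 0.5w(t − 1)), so the noise term is correlated with y(t − 1), and the LS estimate of the first coefficient is pulled noticeably away from 0.5.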
2.3 Regressor Selection
An important problem when dealing with linear regression models is the selection of
regressors, that is, finding the regressors that are significant for describing the data. There
exists a wide variety of methods for regressor selection, some of which are described
in Hastie et al. (2009). Recent developments can be found in Hesterberg et al. (2008).
A brute force way of selecting the most significant regressors is to estimate all possible
combinations and then select the best one according to some criterion. The selection
criterion should take both the fit to data and the number of parameters into account. This
kind of method is often referred to as all possible regressors and efficient implementations
can be found in, for example, Furnival and Wilson (1974) or in Niu et al. (1996). A
huge disadvantage of the above strategy is that the computational complexity grows quite
rapidly with the number of regressors and it is therefore not applicable in cases with a large
number of possible regressors, as is often the case in biological and medical applications.
If a large number of regressors are available, a more computationally tractable solution to
the regressor selection problem is to use regularization methods.
One of the first regularization methods proposed in the statistical community was
presented in Hoerl and Kennard (1970) and is referred to as ridge regression:
$$\hat{\theta}_N^{\mathrm{ridge}} \triangleq \underset{\theta\in\mathbb{R}^{n_\theta}}{\arg\min}\; \frac{1}{N}\sum_{t=1}^{N}\big(y(t)-\varphi^T(t)\theta\big)^2 + \frac{\lambda}{N}\|\theta\|_2^2. \qquad (2.9)$$
The nonnegative variable λ is called the regularization parameter, which balances the
need for a good fit to data and the size of the estimated parameter values. For λ equal
to zero, only the fit to data will be considered and the solution equals the least-squares
estimate (2.4). As λ increases, the focus shifts from having a good fit to data to having
small parameter values and in the limit all the parameter values will be zero.
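For a fixed λ, the minimizer of (2.9) has the closed form (ΦᵀΦ + λI)⁻¹Φᵀy, where Φ stacks the regressors ϕᵀ(t) row-wise. A small sketch with made-up data (the data and dimensions below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up regression data: y = Phi @ theta0 + noise
N, n = 200, 5
Phi = rng.standard_normal((N, n))
theta0 = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = Phi @ theta0 + 0.1 * rng.standard_normal(N)

def ridge(Phi, y, lam):
    """Closed-form minimizer of (2.9): solve (Phi^T Phi + lam I) theta = Phi^T y."""
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

theta_ls = ridge(Phi, y, 0.0)      # lam = 0 recovers the LS estimate (2.4)
theta_shrunk = ridge(Phi, y, 1e6)  # a very large lam shrinks all parameters towards zero
```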
In Tibshirani (1996), a slightly different regularization method was proposed, referred to as lasso regression:
$$\hat{\theta}_N^{\mathrm{lasso}} \triangleq \underset{\theta\in\mathbb{R}^{n_\theta}}{\arg\min}\; \frac{1}{N}\sum_{t=1}^{N}\big(y(t)-\varphi^T(t)\theta\big)^2 + \frac{\lambda}{N}\|\theta\|_1. \qquad (2.10)$$
Even though (2.9) and (2.10) might appear similar, their solutions have quite different
properties. To illustrate this difference, rewrite (2.9) and (2.10) as
$$\underset{\theta\in\mathbb{R}^{n_\theta}}{\text{minimize}}\quad \frac{1}{N}\sum_{t=1}^{N}\big(y(t)-\varphi^T(t)\theta\big)^2 \quad \text{subject to}\quad \|\theta\|_p^p \le s, \qquad (2.11)$$
where s ≥ 0 and p ∈ {1, 2}, respectively, and there is a one-to-one correspondence
between the variables s and λ. In Figure 2.1, the feasible set and the cost function in (2.11)
are illustrated for a problem with two parameters.
[Figure 2.1 here: two panels in the (θ1, θ2) plane, a) Ridge regression and b) Lasso regression]
Figure 2.1: The feasible set (shaded area) and the level curves of (2.11) for ridge (to
the left) and lasso (to the right) regression. The true parameter values are denoted by
θ0 and the estimate by θ̂.
In the ridge regression case, the spherical shape of the feasible set allows both parameters to be nonzero at the optimum. This is a drawback of the method in the sense that even
though the parameter values will shrink towards zero as the parameter s decreases, most
of them will be nonzero until s is very small. Thus, it may be difficult to decide which
regressors are more important than the others (see, for instance, Hastie et al., 2009).
In the lasso regression case, the diamond shape of the feasible set forces the less important parameter to be exactly zero. The lasso method generally releases one parameter
at a time as s increases until all parameters are nonzero, which makes the regressor selection problem easier. The formulation (2.11) with p = 1 can easily be rewritten as
a convex quadratic optimization problem by introducing slack variables for the ℓ1-norm
constraint (Tibshirani, 1996). This makes it a simple problem to solve for any given λ,
but it is not clear how to solve the problem efficiently for all nonnegative λ.
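For a single fixed λ, problem (2.10) can also be attacked directly with cyclic coordinate descent, where each scalar update is a soft-thresholding step. This is only an illustrative sketch on invented data, not the path algorithm discussed next:

```python
import numpy as np

def lasso_cd(Phi, y, lam, n_sweeps=500):
    """Cyclic coordinate descent for the lasso (2.10) at a fixed lambda.
    Each coordinate minimization reduces to soft thresholding."""
    theta = np.zeros(Phi.shape[1])
    col_sq = (Phi ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(Phi.shape[1]):
            # partial residual with coordinate j removed
            r = y - Phi @ theta + Phi[:, j] * theta[j]
            rho = Phi[:, j] @ r
            theta[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
    return theta

# Made-up data: the middle regressor is irrelevant
rng = np.random.default_rng(3)
Phi = rng.standard_normal((50, 3))
y = Phi @ np.array([2.0, 0.0, -1.0])

theta_unreg = lasso_cd(Phi, y, 0.0)    # lam = 0 gives the LS estimate
theta_sparse = lasso_cd(Phi, y, 30.0)  # moderate lam zeros the irrelevant one
```

Note how, unlike ridge regression, the irrelevant parameter is set exactly to zero rather than merely shrunk.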
An important contribution to the regularized regression problem was presented in the
seminal paper by Efron et al. (2004), where a new way of describing how the regressors
should be selected was presented. The algorithm is called least angle regression (LARS)
and is most easily explained via the forward stepwise (FS) regression method. The FS method
is quite simple and builds a model sequentially by adding one parameter at a time. At each
time step, it identifies the regressor that is most correlated with the unmodeled response
(the residual) and then updates the least-squares parameter estimate accordingly.
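The FS procedure can be sketched as follows; the zero-mean, unit-norm regressors and the data below are hypothetical:

```python
import numpy as np

def forward_stepwise(Phi, y, n_steps):
    """FS regression sketch: repeatedly add the regressor most correlated
    with the current residual and refit least squares on the active set."""
    active = []
    for _ in range(n_steps):
        theta = np.zeros(Phi.shape[1])
        if active:
            theta[active] = np.linalg.lstsq(Phi[:, active], y, rcond=None)[0]
        corr = np.abs(Phi.T @ (y - Phi @ theta))  # correlation with residual
        corr[active] = -np.inf                    # exclude already-used regressors
        active.append(int(np.argmax(corr)))
    theta = np.zeros(Phi.shape[1])
    theta[active] = np.linalg.lstsq(Phi[:, active], y, rcond=None)[0]
    return theta, active

# Made-up data: y depends on regressors 1 and 3 only
rng = np.random.default_rng(2)
Phi = rng.standard_normal((100, 4))
Phi = (Phi - Phi.mean(axis=0)) / np.linalg.norm(Phi - Phi.mean(axis=0), axis=0)
y = 3.0 * Phi[:, 1] + 0.5 * Phi[:, 3]

theta_fs, order = forward_stepwise(Phi, y, 2)  # picks regressor 1 first, then 3
```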
The LARS method uses a similar strategy. At each time step, LARS locates the regressor that is most correlated with the residual. But, instead of fitting the parameter
completely in the least-squares sense, it only fits the parameter until another unused regressor attains an equal correlation with the residual. This process is repeated, adding
one regressor at a time, until the full linear regression model is retrieved. To illustrate the
LARS procedure, let us consider a simple example.
Example 2.1
Let us consider a linear regression model (2.1) involving two parameters
y(t) = ϕ1(t)θ1 + ϕ2(t)θ2,   (2.12)
where we for simplicity assumed noise-free output. Introducing the vector of stacked
outputs
$$y \triangleq \begin{bmatrix} y(1) & y(2) & \cdots & y(N) \end{bmatrix}^T,$$
and similarly for ϕ1 and ϕ2 , then (2.12) may be written in vector form as
y = ϕ1θ1 + ϕ2θ2.   (2.13)
In the following, assume (without loss of generality) that the vectors y, ϕ1 and ϕ2 have
zero mean. Furthermore, assume the regressor vectors ϕ1 and ϕ2 have been scaled to
have unit norm and that ϕ1 has a higher correlation with y than ϕ2 does. Here, by a
predictor we mean
ŷ = ϕ1 θ̂1 + ϕ2 θ̂2 ,
for some estimates θ̂1 and θ̂2 of the parameters θ1 and θ2 , respectively. The least-squares
predictor of (2.13) using both regressors is illustrated in Figure 2.2 by ŷ2. The best least-squares predictor when only one regressor may be used, in this case ϕ1, is given by the
predictor ŷ1 in Figure 2.2.
The FS regression method starts with both parameters equal to zero, that is, in the
predictor denoted ŷ0 in Figure 2.2. Since ϕ1 has the highest correlation with y of the
two available regressors, the FS method first adds ϕ1 to the model which results in the
predictor ŷ1 . Then, ϕ2 is added and the estimate is updated resulting in ŷ2 . The FS
solution path is illustrated by the solid line in Figure 2.2.
[Figure 2.2 here]
Figure 2.2: The solution path of the FS method (solid line) and the LARS procedure (dashed line) for the simple two-regressor model (2.12).
LARS has a solution path similar to the one given by the FS method. It starts with
both parameters equal to zero, that is, the point denoted µ̂0 in Figure 2.2. In the same
way as for the FS method, the regressor ϕ1 is added to the model. Instead of fitting this
estimate completely, LARS changes the value of the corresponding parameter so that the
predictor moves continuously towards the least-squares predictor ŷ1 until the regressor
ϕ2 has as high correlation with the residual y − ϕ1 θ̂1 as ϕ1 does. This happens when
the predictor denoted µ̂1 in Figure 2.2 is reached. Finally, the regressor ϕ2 is added to
the model and the parameter values change continuously until the least-squares predictor
µ̂2 = ŷ2 is found. The solution path of the LARS algorithm is depicted as the dashed line
in Figure 2.2. For further examples and illustrations of the LARS algorithm, the reader is
referred to Efron et al. (2004) and Hastie et al. (2009).
By the description of the LARS algorithm and the example above, one notices that the corresponding solution path is piecewise affine, that is, the resulting estimate θ̂_N^LARS is a piecewise affine function; see also Figure 2.2. One of the main results in Efron et al. (2004) showed that the lasso regression can be solved via the LARS algorithm if the step length is restricted (see also Hastie et al., 2009). In addition to providing an efficient algorithm for finding the entire solution path of the lasso method, this implies that the solution to the lasso regularization problem (2.10) is piecewise affine in the regularization parameter. In other words, the solution path θ̂_N^lasso(λ) is a piecewise affine function of the parameter λ.
Another interesting regularization method is the nonnegative garrote (NNG), which was presented in Breiman (1995):
$$\begin{aligned}\underset{w\in\mathbb{R}^{n_\theta}}{\text{minimize}}\quad & \frac{1}{N}\sum_{t=1}^{N}\big(y(t)-\varphi(t)^T(\hat{\theta}_N^{\mathrm{LS}}\odot w)\big)^2 + \frac{\lambda}{N}\sum_{i=1}^{n_\theta} w_i \\ \text{subject to}\quad & w \succeq 0,\end{aligned} \qquad (2.14)$$
where ⊙ denotes componentwise multiplication and ⪰ denotes componentwise inequality. This method differs from the previous methods in that it uses the least-squares estimate (2.4) as a starting value, and then finds the importance weights w of the parameters
instead of manipulating the parameter values directly. In Yuan and Lin (2006) it was
shown that also the NNG regularization problem (2.14) has a piecewise affine solution
path and an efficient algorithm similar to the LARS method was presented.
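For a single fixed λ, problem (2.14) can be solved approximately with projected gradient steps; the sketch below uses invented data and is not the efficient path algorithm of Yuan and Lin (2006):

```python
import numpy as np

def nng(Phi, y, theta_ls, lam, n_iter=2000):
    """Projected-gradient sketch of the nonnegative garrote (2.14) for a
    fixed lambda. With Psi = Phi * diag(theta_ls), the objective becomes
    (1/N)||y - Psi w||^2 + (lam/N) sum(w) over w >= 0."""
    N = len(y)
    Psi = Phi * theta_ls                               # scale columns by LS estimate
    step = N / (2.0 * np.linalg.norm(Psi.T @ Psi, 2))  # 1 / Lipschitz constant
    w = np.ones(Phi.shape[1])
    for _ in range(n_iter):
        grad = (2.0 / N) * Psi.T @ (Psi @ w - y) + lam / N
        w = np.maximum(w - step * grad, 0.0)           # project onto w >= 0
    return w

# Made-up data with one irrelevant regressor
rng = np.random.default_rng(4)
Phi = rng.standard_normal((100, 3))
y = Phi @ np.array([2.0, 0.0, -1.0]) + 0.01 * rng.standard_normal(100)

theta_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]
w = nng(Phi, y, theta_ls, lam=1.0)   # importance weights of the parameters
```

The weight of the irrelevant parameter is driven to zero, while the weights of the active parameters stay close to one.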
The algorithm used in Efron et al. (2004) can be generalized to a larger class of regularization problems and the optimization algorithms used are referred to as parametric
optimization, see, for example, Rosset and Zhu (2007) or Roll (2008).
The regularization methods presented above yield a collection of parameter estimates
along the solution path and somehow one must decide which of these estimates to use.
This is the topic of the next section.
2.4 Selection Criteria
In the estimation and selection methods discussed so far, the cost function
$$V_N(\theta) \triangleq \frac{1}{N}\sum_{t=1}^{N}\big(y(t)-\varphi^T(t)\theta\big)^2 \qquad (2.15)$$
has been used to determine estimates of the parameters. One drawback of this cost function is that it only takes the fit to data into consideration when selecting the model and
thus rewards higher order structures with more degrees of freedom. Thus, a different kind
of cost function is needed, which also considers the number of parameters used to achieve
the fit to prevent over-fitting.
To this end, several criteria have been proposed in the literature, some of which can
be found in, for instance, Ljung (1999) and Hastie et al. (2009). In this thesis, two different selection criteria will be used. The first one is an approximation of Akaike's information criterion (AIC) (see Akaike, 1969), which has the form
$$W_N^{\mathrm{AIC}}(\theta) \triangleq V_N(\theta)\left(1 + \frac{2\dim\theta}{N}\right), \qquad (2.16)$$
where the dim operator yields the number of nonzero elements of a vector. The AIC criterion has a tendency to select higher order models, and a different criterion called the minimum description length (MDL) (see Rissanen, 1978) can be used instead:
$$W_N^{\mathrm{MDL}}(\theta) \triangleq V_N(\theta)\left(1 + \frac{\dim\theta \log N}{N}\right). \qquad (2.17)$$
The primary use of these two criteria in this thesis is to automatize the selection of which parameter estimate to use out of the entire solution path given by the regularization methods presented in the previous section. As an example, consider the piecewise affine solution path θ̂_N^lasso(λ) to the regularization problem (2.10). The MDL choice is then given by minimizing W_N^MDL(θ̂_N^lasso(λ)) over all λ ≥ 0. Due to the piecewise affine solution path of the lasso method, it suffices to find the minimum of W_N^MDL(θ̂_N^lasso(λ)) for a finite number of points, namely the values of λ for which the local affine representation of the solution path changes.
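The two criteria (2.16)-(2.17) are straightforward to evaluate; a small helper sketch, where the two candidate estimates are hypothetical values standing in for points on a solution path:

```python
import numpy as np

def w_aic(V, theta, N):
    """Approximate AIC (2.16); dim(theta) counts the nonzero elements."""
    return V * (1.0 + 2.0 * np.count_nonzero(theta) / N)

def w_mdl(V, theta, N):
    """MDL criterion (2.17)."""
    return V * (1.0 + np.count_nonzero(theta) * np.log(N) / N)

# Two hypothetical candidates from a solution path: a sparse estimate with
# a slightly worse fit and a dense estimate with a slightly better fit.
N = 100
sparse = (1.00, np.array([1.2, 0.0, 0.0]))   # (V_N, theta)
dense = (0.99, np.array([1.2, 0.1, -0.2]))

best_mdl = min([sparse, dense], key=lambda c: w_mdl(c[0], c[1], N))
```

Here the penalty on the number of parameters outweighs the marginal improvement in fit, so the sparse candidate is selected.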
3 System Identification
The subject of system identification, which already has been touched upon in the introductory chapters, is the art of modeling dynamical systems given a collection of measurements of the input and output signals of the system. Here, a more thorough treatment of
the concepts needed in later chapters will be given. The topics presented here only form
a minor subset of the vast subject that is system identification. To be able to grasp the
diversity of the topic and to get a more complete picture of its essence one should explore
the contents of the de facto standard textbooks Ljung (1999) or Söderström and Stoica
(1989).
3.1 Introduction
System identification is the art of modeling dynamical systems from a given set of data
$$Z^N = \{u(t),\, y(t)\}_{t=1}^{N}, \qquad (3.1)$$
where u and y are the input and the output signals, respectively, of the system that should
be modeled, see also Figure 1.1. A common assumption is that the system can be well
approximated using some parametrized model structure which may represent the prior
knowledge of the system. The choice of parametrization is highly dependent on the application at hand, where experience and system knowledge are vital to decide an appropriate
parametrization. Once the model structure has been decided upon, one can adapt a model
to the measurements by minimizing some criterion
$$\hat{\theta}_N = \underset{\theta\in D}{\arg\min}\; V_N(\theta, Z^N), \qquad (3.2)$$
where the unknown parameter vector θ represents the parametrization of the model structure. Thus, there are mainly two choices that need to be made: an appropriate parametrization of the model and a cost function which reflects the application in which the model
will be used. In automatic control applications, the model is often used to predict the output of the system given the current state and input signal. This is represented by choosing
a cost function of the form
$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\ell\big(L(q)\varepsilon(t,\theta)\big), \qquad (3.3)$$
where `(·) is some convex function and L(q) is a filter which removes unwanted properties in the data. The quantity
ε(t, θ) = y(t) − ŷ(t|t − 1, θ),   (3.4)
is called the prediction error and ŷ(t|t−1, θ) is the one-step-ahead predictor representing
the model of the system. The method of finding the predictor which best describes a set
of data (3.1) by minimizing the criterion (3.3) is commonly referred to as the prediction-error method (PEM). The cost function (3.3) is quite general and it is not uncommon
to choose a less complex criterion. The choice in this thesis is the quadratic cost function (1.3), that is,
$$V_N(\theta, Z^N) = \frac{1}{N}\sum_{t=1}^{N}\varepsilon(t,\theta)^2. \qquad (3.5)$$
This choice is probably the most widely used special case of the general cost function (3.3). The main motivation for its use is its simplicity, both in terms of deriving theoretical properties of the resulting estimator and in terms of implementing simple/practical
algorithms for adapting models to data. For a thorough description of the PEM and its
properties, see Ljung (1999) and the references therein.
In the sections that follow, some examples of common choices of model parametrization will be given. Furthermore, as was pointed out in the introductory chapter, since
the PEM often involves solving a non-convex optimization problem, some methods for
finding an initial estimate which lies close to the optimum will be presented. Finally, the
chapter is concluded with a collection of methods for validating estimated models.
3.2 Model Structures
In this section, some of the commonly used model parametrizations will be presented.
Each choice of parametrization incorporates some knowledge about the system, for example, where the noise enters the system. Each assumption imposes some structure on
the estimation problem (3.3), which should be utilized to increase the efficiency of the
optimization routine. The section is concluded with a mathematical formalization of the
concepts introduced.
3.2.1 Linear Time-Invariant
A widely used assumption in system identification is that the system at hand is linear
time-invariant (LTI):
[Figure 3.1 here: block diagram where u(t) passes through G(q, θ), e(t) passes through H(q, θ), and the two outputs are summed to give y(t)]
Figure 3.1: A common representation of LTI systems.
Definition 3.1. A system is linear if its output response to a linear combination of inputs
is the same linear combination of the output responses of the individual inputs. It is said
to be time invariant if its response to a certain input signal does not depend on absolute
time.
A common representation of LTI systems is given by the class of linear transfer-function models
y(t) = G(q, θ)u(t) + H(q, θ)e(t),   (3.6)
where q is the forward shift operator, that is, qu(t) = u(t + 1). Here, y(t) is an ny
dimensional vector of outputs, u(t) is an nu dimensional vector of inputs, and e(t) is
the noise signal with appropriate dimensions, respectively. Furthermore, the transfer functions G(q, θ) and H(q, θ) are rational functions in q and the coefficients are given
by the elements of the parameter vector θ. In this presentation, we will assume that
ny = nu = 1. The predictor associated with the output of (3.6) is given by (see, for
example, Ljung, 1999, page 80)
ŷ(t|t − 1, θ) ≜ H⁻¹(q, θ)G(q, θ)u(t) + (1 − H⁻¹(q, θ))y(t),   (3.7)
where H −1 (q, θ) , 1/H(q, θ). The model structure (3.6) is quite general and some
special cases deserve attention. A simple case of (3.6) is the ARX model structure, that is,
Ap(q)y(t) = Bp(q)u(t) + e(t),   (3.8)
where
$$A_p(q) = 1 + a_1 q^{-1} + \cdots + a_{n_a} q^{-n_a}, \qquad (3.9a)$$
$$B_p(q) = b_1 q^{-n_k} + \cdots + b_{n_b} q^{-n_k-n_b+1}. \qquad (3.9b)$$
In the case when no input is present, that is Bp (q) = 0 in (3.8), the model structure
is referred to as an AR model. The ARX model structure (3.8) can be generalized by
assuming that the noise is described by a moving average process, which results in the
ARMAX model structure
Ap(q)y(t) = Bp(q)u(t) + Cp(q)e(t),   (3.10)
where
$$C_p(q) = 1 + c_1 q^{-1} + \cdots + c_{n_c} q^{-n_c}.$$
An important special case of (3.6) is the output error (OE) model structure given by
$$y(t) = \frac{B_p(q)}{F_p(q)}u(t) + e(t). \qquad (3.11)$$
The predictors of the output given by the model structures (3.8)-(3.11) can be derived
using (3.7) with straightforward manipulations.
Example 3.1: (Estimation of ARX models)
Let us consider finding the PEM estimate (3.3) with the cost function (3.5) for the ARX
model structure (3.8) where we, without loss of generality, assume that nk = 1. Since
G(q, θ) = Bp (q)/Ap (q) and H(q, θ) = 1/Ap (q), the predictor (3.7) of the output can be
written as
$$\hat{y}(t|t-1,\theta) = B_p(q)u(t) + \big(1 - A_p(q)\big)y(t) = \varphi(t)^T\theta, \qquad (3.12)$$
where
$$\varphi(t) \triangleq \begin{bmatrix} -y(t-1) & \cdots & -y(t-n_a) & u(t-1) & \cdots & u(t-n_b) \end{bmatrix}^T, \qquad (3.13)$$
$$\theta \triangleq \begin{bmatrix} a_1 & \cdots & a_{n_a} & b_1 & \cdots & b_{n_b} \end{bmatrix}^T. \qquad (3.14)$$
Thus, for the ARX model structure (3.8) with the quadratic cost function (3.5), the PEM
estimate coincides with the LS estimate (2.2) and the simple methods presented in Section 2.1 can be used.
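To make the example concrete, here is a minimal sketch that builds the regression (3.12)-(3.14) and solves it by least squares; the simulated first-order system below is made up for illustration:

```python
import numpy as np

def estimate_arx(u, y, na, nb):
    """LS/PEM estimate of the ARX model (3.8) with nk = 1: build the
    regressor (3.13) row by row and solve the linear regression."""
    n0 = max(na, nb)
    Phi = np.array([
        [-y[t - k] for k in range(1, na + 1)] +
        [u[t - k] for k in range(1, nb + 1)]
        for t in range(n0, len(y))
    ])
    return np.linalg.lstsq(Phi, y[n0:], rcond=None)[0]

# Simulate a noise-free first-order system y(t) = 0.5 y(t-1) + 2 u(t-1),
# that is, a1 = -0.5 and b1 = 2 in the notation of (3.9)
rng = np.random.default_rng(5)
u = rng.standard_normal(300)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.5 * y[t - 1] + 2.0 * u[t - 1]

theta = estimate_arx(u, y, na=1, nb=1)  # recovers [a1, b1] = [-0.5, 2.0]
```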
The development in the automatic control community during the 1960s led to a different representation of LTI models becoming popular, namely the state-space model
x(t + 1) = A(θ)x(t) + B(θ)u(t) + w(t),   (3.15a)
y(t) = C(θ)x(t) + D(θ)u(t) + v(t).   (3.15b)
Here, x(t) is the auxiliary state vector of dimension nx and the white signals w(t) and
v(t) represent the process and measurement noise, respectively. The best linear predictor
associated with the output of (3.15) is given by (see, for example, Ljung, 1999, page 98)
$$\hat{y}(t|t-1,\theta) = C(\theta)\big(qI - A(\theta) + K(\theta)C(\theta)\big)^{-1}\Big[\big(B(\theta) - K(\theta)D(\theta)\big)u(t) + K(\theta)y(t)\Big], \qquad (3.16)$$
where I denotes the identity matrix of appropriate dimensions and K(θ) is the Kalman
gain. Using the residual ε(t, θ) = y(t)− ŷ(t|t−1, θ), one may rewrite the predictor (3.16)
in state-space form as
x̂(t + 1|t, θ) = A(θ)x̂(t|t, θ) + B(θ)u(t) + K(θ)ε(t, θ),   (3.17a)
y(t) = C(θ)x̂(t|t, θ) + D(θ)u(t) + ε(t, θ).   (3.17b)
Now, assume that the system agrees with (3.15) for some θ = θ0 . Then the corresponding
residual ε(t, θ0 ) is a white noise sequence with zero mean coinciding with the innovation
process of the system and (3.17) may be represented by
x̂(t + 1|t, θ) = A(θ)x̂(t|t, θ) + B(θ)u(t) + K(θ)e(t),   (3.18a)
y(t) = C(θ)x̂(t|t, θ) + D(θ)u(t) + e(t),   (3.18b)
where e(t) is the innovation process. The representation (3.18) is referred to as the innovations form of (3.15). There exist several canonical state-space representations of the
linear transfer-function structures given by (3.6). An example of a representation, which
will be utilized in a later chapter, is the observer canonical form (OCF), which for the
ARMAX case (3.10) with nk = 1 can be written as
$$x(t+1) = \begin{bmatrix} -a_1 & 1 & 0 & \cdots & 0 \\ -a_2 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -a_{n-1} & 0 & 0 & \cdots & 1 \\ -a_n & 0 & 0 & \cdots & 0 \end{bmatrix} x(t) + \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_{n-1} \\ b_n \end{bmatrix} u(t) + \begin{bmatrix} c_1 - a_1 \\ c_2 - a_2 \\ \vdots \\ c_{n-1} - a_{n-1} \\ c_n - a_n \end{bmatrix} e(t), \qquad (3.19a)$$
$$y(t) = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \end{bmatrix} x(t) + e(t). \qquad (3.19b)$$
The unknown parameters in the matrices (3.19) correspond directly to the coefficients
of the polynomials (3.10). For the OE model structure (3.11), there is no process noise,
and thus K(θ) = 0 in (3.18). For a complete description of the representation of LTI
models and the estimation of such, see, for instance, Ljung (1999) or Söderström and
Stoica (1989).
3.2.2 Linear Parameter-Varying
The model structures presented so far have all belonged to the class of LTI models, that is,
the coefficients of the polynomials in (3.6) and the matrices (3.15) have all been static. A
direct generalization of these model structures is obtained if one allows the polynomials
and matrices to depend on a known time-varying parameter p : R → P, where P ⊂ R is
the scheduling space. These model structures are called linear parameter-varying (LPV).
The LPV-ARX model structure is defined in the SISO case as
A(p, q)y(t) = B(p, q)u(t) + e(t),   (3.20)
where
A(p, q) = 1 + a1 (p)q −1 + · · · + ana (p)q −na ,
B(p, q) = b1 (p)q −nk + · · · + bnb (p)q −nk −nb +1 ,
and the coefficient functions ai , bj : P → R have a static dependence on the measured
variable p. Introduce
$$\begin{bmatrix}\phi_1(p) & \ldots & \phi_{n_g}(p)\end{bmatrix} \triangleq \begin{bmatrix} a_1(p) & \ldots & a_{n_a}(p) & b_1(p) & \ldots & b_{n_b}(p)\end{bmatrix},$$
with ng ≜ na + nb. A common assumption is that each of the functions φi is linearly parametrized as
$$\phi_i(p) = \theta_{i0} + \sum_{j=1}^{s_i}\theta_{ij}\,\psi_{ij}(p),$$
where the $(\theta_{ij})_{i=1,j=0}^{n_g,s_i}$ are unknown parameters and the $(\psi_{ij})_{i=1,j=1}^{n_g,s_i}$ are basis functions chosen by the user. In this case, straightforward definitions of the parameter vector and the regression vector allow us to rewrite (3.20) as a linear regression (2.1). Thus, the LPV-ARX
model structure can be estimated from data
$$Z^N = \{u(t),\, p(t),\, y(t)\}_{t=1}^{N}, \qquad (3.21)$$
using the simple tools presented in Section 2.1.
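The rewriting of (3.20) as a linear regression can be sketched as follows; the single basis function ψ(p) = p, the shared basis for all coefficients, and the simulated system are simplifying assumptions made for this illustration:

```python
import numpy as np

def lpv_arx_regression(u, y, p, na, nb, basis):
    """Rewrite the LPV-ARX model (3.20) as a linear regression: every
    delayed signal is multiplied by [1, psi_1(p(t)), ...], assuming (for
    simplicity) the same basis for all coefficient functions."""
    n0 = max(na, nb)
    rows, targets = [], []
    for t in range(n0, len(y)):
        signals = [-y[t - k] for k in range(1, na + 1)] + \
                  [u[t - k] for k in range(1, nb + 1)]
        rows.append([s * b for s in signals
                     for b in [1.0] + [f(p[t]) for f in basis]])
        targets.append(y[t])
    return np.array(rows), np.array(targets)

# Simulated (made-up) system: y(t) = (0.2 + 0.3 p(t)) y(t-1) + u(t-1)
rng = np.random.default_rng(6)
n = 400
u = rng.standard_normal(n)
p = rng.uniform(-1.0, 1.0, n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = (0.2 + 0.3 * p[t]) * y[t - 1] + u[t - 1]

Phi, Y = lpv_arx_regression(u, y, p, na=1, nb=1, basis=[lambda s: s])
theta = np.linalg.lstsq(Phi, Y, rcond=None)[0]  # approx [-0.2, -0.3, 1.0, 0.0]
```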
The LPV state-space (LPV-SS) model form used in this thesis is given by
x(t + 1) = A(p, θ)x(t) + B(p, θ)u(t) + K(p, θ)e(t),   (3.22a)
y(t) = C(p, θ)x(t),   (3.22b)
where the matrices depend on the scheduling variable p. The relationship between the
LPV-ARX (3.20) and the LPV-SS (3.22) models is quite involved and details are given in Tóth (2008), which also provides an overview of different methods for the estimation of LPV models.
3.2.3 Nonlinear Time-Invariant
In later chapters, the initialization of nonlinear model structures will be considered. A
simple example of such a structure is the nonlinear ARX model structure (NARX) given by
the relation
y(t) = g(ϕ(t), θ) + e(t),   (3.23)
where g is some function; a special case was considered in Example 1.3. The regressor vector ϕ(t) contains information about past inputs and outputs, for example, delayed input and output samples. The most general model
structure that is going to be considered in this thesis is the nonlinear state-space model
x(t + 1) = f(x(t), u(t); θ),   (3.24a)
y(t) = h(x(t), u(t); θ) + e(t),   (3.24b)
where f and h are nonlinear polynomial functions. The identification of nonlinear models
is a vast field where the utilization of structure is vital for a successful estimation of the
parameters. Details and references to different methods are given in Ljung (1999).
3.2.4 General Theory
In the sections above, we have used the concept of model structures quite freely without
formalizing its meaning. This is often sufficient when dealing with practical aspects of
system identification, but can be quite limiting when analyzing certain theoretical properties. For example, consider the concept of identifiability for a model structure, which,
loosely speaking, means that the parameters can be determined uniquely from a model
defined by the model structure.
Example 3.2: (Identifiability)
Consider the model structure defined by the difference equation
y(t) = θ²u(t − 1) + e(t),   (3.25)
where e(t) is white noise and the parameter θ ∈ R is free. Now, the model given by
y(t) = u(t − 1) + e(t),   (3.26)
lies in the set of models given by (3.25) for all θ ∈ R. However, when trying to determine
the parameter which gives the model (3.26), one finds that both θ = 1 and θ = −1 result
in the same model. Thus, the model structure (3.25) is not identifiable.
The example above indicates that the notion of identifiability is important, since it determines how a model structure may be used in practice. If the model is only to be used
for predicting the output of the system, then the identifiability of the model structure is
not important. For instance, in the example above, it does not matter which parameter
value one chooses since both values yield the same predicted output. On the other hand,
if the parameters represent some physical entity one must be careful when deriving conclusions about the system, since a positive parameter value may have a totally different
interpretation than a negative one.
To be able to analyze the identifiability of model structures, it is necessary to have a
well defined framework. Furthermore, the framework needs to be general so that all kinds
of model structures, both linear and nonlinear, can be analyzed in the same way.
The presentation given below follows Ljung (1999, Sections 4.5 and 5.7) quite closely
and we start by defining the concept of a model. In the following, let Z t denote a data
set (3.1) containing input and output pairs from time 1 up until time t.
Definition 3.2. A predictor model m of a dynamical system is a sequence of functions
$$g_m : \mathbb{R}\times\mathbb{R}^{n_y(t-1)}\times\mathbb{R}^{n_u(t-1)} \to \mathbb{R}^{n_y}, \qquad t\in\mathbb{Z}_+,$$
which predicts the output y(t) from past data, that is,
ŷ(t|t − 1) = gm (t, Z t−1 ).
As an intermediate step to finding a rigorous definition of the notion of a model structure, we need the following concept.
Definition 3.3. A model set M∗ is a collection of predictor models
M∗ = {gmα | α ∈ A},
where A is some index set.
The predictor associated with (3.25) is easily seen to be ŷ(t|t − 1, θ) = θ²u(t − 1),
assuming that the noise e(t) is white with zero mean. According to the definition,
M∗ = {θ²u(t − 1) | θ ∈ R},
is an example of a model set.
Definition 3.4. A model structure M is a differentiable mapping from a connected open
subset DM ⊂ Rnθ to a model set M∗ , such that the gradients of the predictor functions
exist and are stable.
The definition above includes a technical requirement that the gradients of the predictor functions are stable (see, for instance, Definition 5.1 in Khalil, 2002) which is used to
prove convergence and consistency properties of the PEM. This property is not essential
for this presentation and the reader is referred to Ljung (1999, Chapter 8).
Now, let us once again consider (3.25). Then
M∗ = {θ²u(t − 1) | θ ∈ R},
is the corresponding model set. Define the mapping M as θ 7→ θ2 u(t − 1), which maps
the parameter θ to the elements of the model set M∗ . Since DM = R is an open and
connected set, and the gradient of the predictor function is given by
$$\frac{d}{d\theta}\,\hat{y}(t|t-1,\theta) = 2\theta u(t-1),$$
it is clear that M constitutes a model structure. Thus, the opening statement in Example 3.2 is valid if one takes the model structure defined by (3.25) to be M.
We are now ready to define the concept of identifiability.
Definition 3.5. A model structure M is globally identifiable at θ∗ ∈ DM if M(θ) =
M(θ∗ ) for some θ ∈ DM implies that θ = θ∗ .
The strict formulation of the meaning of the equality M(θ) = M(θ∗ ) in the definition
above is quite technical and the reader is referred to Ljung (1999) for details. For the
simple examples used in this thesis it suffices to treat the equality as meaning that the
predictor models are effectively the same.
In Example 3.2 we hinted that the model structure M generated by (3.25) is not
globally identifiable. This is in alignment with the definition above, since
M(−1) = M(1),
for all t ∈ Z+ . The definition of a model structure is quite tedious to use and we will for
the sake of convenience often refer to expressions like (3.25) as model structures.
3.3 Instrumental Variables
In the introduction of this chapter, the basics of the PEM for fitting a parametrized model
structure to a sequence of input and output pairs (3.1) were presented as the solution to an,
in general, non-convex optimization problem (see also Example 1.2). Thus, if the initial
parameters for the local search optimization routine are chosen poorly, this means that
one cannot guarantee that the globally optimal parameter values are found. Here, we
will present a popular method for finding good initial estimates called the instrumental
variables (IV) method (see Söderström and Stoica, 1980). The basics of this method for
linear regression models have already been given in Section 2.2 and here we will restrict
ourselves to the estimation of single-input single-output LTI transfer-function models
y(t) = G(q, θ)u(t) + H(q, θ)e(t),   (3.27)
where G(q, θ) and H(q, θ) are rational functions in the forward shift operator q and the
coefficients are given by the elements of the parameter vector θ (see Section 3.2.1). Furthermore, we assume that e(t) is white Gaussian noise with zero mean and that the system
lies in the model structure defined by (3.27), that is, the system agrees with (3.27) for some
choice θ = θ0 . If the deterministic part of (3.27) is given by
$$G(q,\theta) = \frac{B_p(q,\theta)}{A_p(q,\theta)},$$
then we can rewrite (3.27) as
Ap(q, θ)y(t) = Bp(q, θ)u(t) + Ap(q, θ)H(q, θ)e(t).   (3.28)
By introducing ϕ(t) as in (3.13), we may write (3.28) as
y(t) = ϕ(t)T η + v(t),   (3.29)
where the parameter vector η only contains the parameters involved in G(q, θ) and the
noise contribution has been collected in
v(t) ≜ Ap(q, θ)H(q, θ)e(t).   (3.30)
Now, consider the problem of estimating the parameters η given a dataset (3.1) generated
by the system, that is, the dataset agrees with (3.27) for the choice θ = θ0 . If one could
find a ζ(t) such that
$$\lim_{N\to\infty}\frac{1}{N}\sum_{t=1}^{N}\zeta(t)v(t) = 0, \qquad (3.31)$$
then it would, asymptotically as N → ∞, hold that
$$\frac{1}{N}\sum_{t=1}^{N}\zeta(t)y(t) = \frac{1}{N}\sum_{t=1}^{N}\zeta(t)\varphi(t)^T\eta,$$
and an estimate of η could be found via
$$\hat{\eta}_N^{\mathrm{IV}} = \underset{\eta\in\mathbb{R}^{n_\eta}}{\arg\min}\;\left\|\frac{1}{N}\sum_{t=1}^{N}\zeta(t)\big(y(t)-\varphi(t)^T\eta\big)\right\|_2^2. \qquad (3.32)$$
The elements of ζ(t) are referred to as the instrumental variables and the ordinary least-squares estimate of η is retrieved by choosing ζ(t) = ϕ(t) in (3.32). So, how should the
instrumental variables ζ(t) be chosen in this case? By considering (3.30), one notices that
v(t) = Ap (q, θ0 )H(q, θ0 )e(t),
for some realization of e(t). Thus, the optimal instruments must depend on the system
which is what we are trying to estimate. Also, one notices that (3.32) is not general
enough and that one would want to include a filter which attenuates the noise of the
residual ε(t, θ) = y(t) − ϕ(t)T η. This leads to the extended IV approach, which finds
$$\hat{\eta}_N^{\mathrm{EIV}} = \underset{\eta\in\mathbb{R}^{n_\eta}}{\arg\min}\;\left\|\frac{1}{N}\sum_{t=1}^{N}\zeta(t)L(q)\big(y(t)-\varphi(t)^T\eta\big)\right\|_Q^2, \qquad (3.33)$$
for some user defined choice of filter L(q) and weighting matrix Q. It can be shown (see,
for example, Ljung, 1999) that the optimal choices, under the assumption that the system
agrees with (3.27) for some θ = θ0 and that e(t) is white Gaussian noise with zero mean,
are given by
$$\zeta^{\mathrm{opt}}(t) = L^{\mathrm{opt}}(q)\begin{bmatrix} -G(q,\theta_0)u(t-1) & \cdots & -G(q,\theta_0)u(t-n_a) & u(t-1) & \cdots & u(t-n_b)\end{bmatrix}^T,$$
where the filter L^opt(q) is chosen as
$$L^{\mathrm{opt}}(q) = \frac{1}{A_p(q,\theta_0)H(q,\theta_0)},$$
and Q^opt as the identity matrix of appropriate dimension. These particular choices are quite intuitive, since L^opt(q)v(t) = e(t) and the instrumental variables are built up from the noise-free outputs and old inputs, which are uncorrelated with e(t).
Since the optimal IV method requires knowledge of the system, it is not feasible in practice and some approximations have to be made. To this end, a multistep IV algorithm called IV4 was invented, which step by step removes the stochastic part of the data by generating and applying appropriate instrumental variables (see, for instance, Söderström and Stoica, 1980). The IV4 method is summarized in Algorithm 1.
Algorithm 1 IV 4
Given: A dataset (3.1) and the model parameters {na , nb , nk }.
1) Rewrite (3.27) as a linear regression as in (3.28) - (3.29) and estimate θ via
(1)
η̂N = arg min
η∈Rnη
N
X
2
y(t) − ϕ(t)T η ,
t=1
(1)
(1)
(1)
b (q) = B
b (q)/A
b (q).
and denote the corresponding transfer-function by G
N
N
N
2) Generate the instruments
(1)
b (q)u(t),
x(1) (t) = G
N
T
ζ (1) (t) = − x(1) (t − 1) · · · − x(1) (t − na ) u(t − 1) · · · u(t − nb ) ,
3.3
27
Instrumental Variables
and determine the IV estimate using these instruments
(2)
η̂N
X
1 N (1)
2
T
= arg min ζ (t) y(t) − ϕ(t) η .
N
η∈Rnη
2
t=1
b (2) (q) = B
b (2) (q)/A
b(2) (q).
Denote the corresponding transfer-function by G
N
N
N
3) Let
(2)
(2)
(2)
b (q)y(t) − B
b (q)u(t).
ŵN = A
N
N
Postulate an AR process of order na + nb , that is,
(2)
L(q)ŵN = e(t),
b N (q).
and estimate it using the LS method. Denote the result by L
4) Generate the instruments
\[
x^{(2)}(t) = \hat G_N^{(2)}(q)u(t),
\]
\[
\zeta^{(2)}(t) = \hat L_N(q)\begin{bmatrix} -x^{(2)}(t-1) & \cdots & -x^{(2)}(t-n_a) & u(t-1) & \cdots & u(t-n_b) \end{bmatrix}^T,
\]
and determine the final estimate
\[
\hat\eta_N^{\mathrm{IV}} = \operatorname*{arg\,min}_{\eta\in\mathbb{R}^{n_\eta}} \left\| \frac{1}{N}\sum_{t=1}^{N} \zeta^{(2)}(t)\hat L_N(q)\bigl(y(t) - \varphi(t)^T\eta\bigr) \right\|^2.
\]
This algorithm is implemented in the system identification toolbox for MATLAB (see
Ljung, 2009) and is used to find initial estimates for the PEM. For an extensive coverage
of the IV method and its properties the reader is referred to, for example, Söderström and
Stoica (1980). The IV4 method only finds an estimate of the deterministic part of (3.27),
that is, no noise model H(q, θ) is estimated. If a noise model is sought, additional steps
are needed. As an example, consider the estimation of an ARMAX model
\[
y(t) = \frac{B_p(q)}{A_p(q)}u(t) + \frac{C_p(q)}{A_p(q)}e(t),
\]
given a dataset (3.1). Then one can take the following steps to find an estimate.
1) Apply the IV4 algorithm to find an estimate of the deterministic part
\[
G(q,\hat\theta) = \frac{\hat B_N(q)}{\hat A_N(q)}.
\]
2) Find the residual
\[
\hat w_N(t) = \hat A_N(q)y(t) - \hat B_N(q)u(t). \qquad (3.34)
\]
3) Postulate an AR process \(L(q)\hat w_N(t) = e(t)\) and estimate it using the least-squares method. Denote the result by \(\hat L_N(q)\).
4) Determine an estimate of the innovations by \(\hat e_N(t) = \hat L_N(q)\hat w_N(t)\) and find the least-squares estimate of \(C_p(q)\) via the relationship
\[
\hat w_N(t) - \hat e_N(t) = \bigl(C_p(q) - 1\bigr)\hat e_N(t) + \tilde e(t),
\]
where the noise contribution has been collected in \(\tilde e(t)\).
Steps 3) - 4) constitute a method that is often called Durbin's method and was first presented in Durbin (1959). For different methods of estimating the stochastic part of (3.27), see Stoica and Moses (2005).
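As an illustration, steps 3) and 4) can be sketched numerically. The following Python fragment (the helper names fit_ar and durbin_ma are ours, not from the cited references) estimates the single coefficient of an MA model from a residual sequence:

```python
import numpy as np

def fit_ar(w, order):
    # Least-squares AR fit of L(q)w(t) = e(t) with L(q) = 1 + l1*q^-1 + ...
    N = len(w)
    Phi = np.column_stack([-w[order - i - 1:N - i - 1] for i in range(order)])
    l = np.linalg.lstsq(Phi, w[order:], rcond=None)[0]
    return np.concatenate(([1.0], l))

def durbin_ma(w, nc, ar_order):
    # Step 3): fit a long AR model to the residual sequence w(t).
    L = fit_ar(w, ar_order)
    # Step 4): innovations estimate e(t) = L(q)w(t), then regress
    # w(t) - e(t) = c1*e(t-1) + ... + c_nc*e(t-nc) + noise.
    e = np.convolve(L, w)[:len(w)]
    N = len(w)
    Phi = np.column_stack([e[nc - i - 1:N - i - 1] for i in range(nc)])
    c = np.linalg.lstsq(Phi, (w - e)[nc:], rcond=None)[0]
    return np.concatenate(([1.0], c))
```

For data generated by \(w(t) = e(t) + 0.5e(t-1)\) and a long AR model, the recovered coefficient is close to 0.5, consistent with Durbin (1959).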
In this section we have shown how to utilize the IV method to find estimates of
transfer-function models (3.27). The case when one is interested in obtaining estimates of
state-space models is a bit more involved and for that a different method, which utilizes
the IV philosophy, is better suited.
3.4 Subspace Identification

In this section, a presentation of the subspace identification (SID) approach to the estimation of state-space models will be given. The method is mainly based on linear algebraic
techniques, where projections are used to find the range and null spaces of certain linear
mappings (represented by matrices) which reveal information about the matrices in the
state-space model. This also enables efficient implementations based on numerical linear
algebra (see, for instance, Golub and Van Loan, 1996). A thorough treatment of the SID
framework can be found in Van Overschee and De Moor (1996a) and a nice introduction
is given in Verhaegen and Verdult (2007).
The first part of this section deals with the estimation of the discrete-time state-space
models, where our presentation is based on Verhaegen and Verdult (2007) and Viberg
et al. (1997). A different view of SID in the discrete-time case can be found in Jansson
and Wahlberg (1996) and Savas and Lindgren (2006). Next, a short description of a SID
method for continuous-time state-space models, similar to the discrete-time case, is given.
It is based on frequency-domain techniques and a full description of the algorithm can be
found in McKelvey and Akçay (1994) and Van Overschee and De Moor (1996b).
The results in this presentation are not new and the reader who is already familiar
with SID might just want to browse the parts titled “estimating B(θ) and D(θ)” and
“estimating K(θ)” which are important in a later chapter.
3.4.1 Discrete Time
Consider the discrete-time state-space model (3.15). For notational convenience, we will
often not explicitly write out the dependence of the unknown parameter vector θ. Furthermore, we will assume that the system matrix A(θ) is Hurwitz and that the system operates
in open loop. Now, recursive application of (3.15) yields
\[
y(t+\tau) = CA^{\tau}x(t) + \sum_{i=0}^{\tau-1} CA^{\tau-i-1}Bu(t+i) + Du(t+\tau) + \sum_{i=0}^{\tau-1} CA^{\tau-i-1}w(t+i) + v(t+\tau). \qquad (3.35)
\]
Introduce the vector of α stacked future outputs
\[
y_\alpha(t) \triangleq \begin{bmatrix} y(t)^T & y(t+1)^T & \cdots & y(t+\alpha-1)^T \end{bmatrix}^T, \qquad (3.36)
\]
where α is a user-defined prediction horizon. Similarly, define the vectors of stacked
inputs, process and measurement noise as uα (t), wα (t), and vα (t), respectively. In this
notation, (3.35) can be expressed as
\[
y_\alpha(t) = \Gamma_\alpha x(t) + \Phi_\alpha u_\alpha(t) + \Psi_\alpha w_\alpha(t) + v_\alpha(t), \qquad (3.37)
\]
where
\[
\Gamma_\alpha \triangleq \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{\alpha-1} \end{bmatrix},
\qquad
\Phi_\alpha \triangleq \begin{bmatrix} D & 0 & \cdots & 0 \\ CB & D & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ CA^{\alpha-2}B & CA^{\alpha-3}B & \cdots & D \end{bmatrix}, \qquad (3.38)
\]
and
\[
\Psi_\alpha \triangleq \begin{bmatrix} 0 & 0 & \cdots & 0 \\ C & 0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ CA^{\alpha-2} & \cdots & C & 0 \end{bmatrix}. \qquad (3.39)
\]
The matrix Γα in (3.38) is referred to as the extended observability matrix. By stacking
the data, (3.37) can be written in matrix form as
\[
Y_{\beta,\alpha,N} = \Gamma_\alpha X_{\beta,N} + \Phi_\alpha U_{\beta,\alpha,N} + \Psi_\alpha W_{\beta,\alpha,N} + V_{\beta,\alpha,N}, \qquad (3.40)
\]
where we introduced the notation
\[
Y_{\beta,\alpha,N} \triangleq \begin{bmatrix} y_\alpha(\beta) & y_\alpha(\beta+1) & \cdots & y_\alpha(\beta+N-1) \end{bmatrix}, \qquad (3.41)
\]
with \(U_{\beta,\alpha,N}\), \(W_{\beta,\alpha,N}\), and \(V_{\beta,\alpha,N}\) defined accordingly, and
\[
X_{\beta,N} \triangleq \begin{bmatrix} x(\beta) & x(\beta+1) & \cdots & x(\beta+N-1) \end{bmatrix}. \qquad (3.42)
\]
The parameter β is a design variable, whose interpretation will become clear in the ensuing paragraphs. Equation (3.40) is often referred to as the data equation and constitutes the foundation of modern approaches to SID (see, for instance, Van Overschee and
De Moor, 1996a).
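The stacked data matrices such as (3.41) are block Hankel matrices built directly from the measured signals. A minimal Python sketch of the construction, assuming the signal is stored as a (T, ny) array:

```python
import numpy as np

def block_hankel(z, beta, alpha, N):
    # Z_{beta,alpha,N}: the (alpha*ny) x N matrix whose j-th column stacks
    # z(beta+j), z(beta+j+1), ..., z(beta+j+alpha-1); z has shape (T, ny).
    T, ny = z.shape
    assert beta + alpha + N - 1 <= T
    cols = [z[beta + j:beta + j + alpha].reshape(-1) for j in range(N)]
    return np.array(cols).T
```

The same routine builds \(Y_{\beta,\alpha,N}\), \(U_{\beta,\alpha,N}\) and the instrument blocks \(U_{0,\beta,N}\), \(Y_{0,\beta,N}\) below.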
Estimating A(θ) and C(θ)
In this presentation, we are going to make use of the IV framework for SID given in Viberg
et al. (1997). The basic idea is to isolate the extended observability matrix Γα in (3.40) via
appropriate matrix projections explained below. Once an estimate of Γα has been found,
it is straightforward to create estimates of A(θ) and C(θ) via the relationship (3.38). First,
the operator
\[
\Pi^{\perp}_{U_{\beta,\alpha,N}} \triangleq I - U_{\beta,\alpha,N}^T\bigl(U_{\beta,\alpha,N}U_{\beta,\alpha,N}^T\bigr)^{-1}U_{\beta,\alpha,N}, \qquad (3.43)
\]
projects the row space of a matrix onto the orthogonal complement of the row space of the matrix \(U_{\beta,\alpha,N}\). Thus, it holds that \(U_{\beta,\alpha,N}\Pi^{\perp}_{U_{\beta,\alpha,N}} = 0\) and multiplying (3.40) by \(\Pi^{\perp}_{U_{\beta,\alpha,N}}\) from the right yields
\[
Y_{\beta,\alpha,N}\Pi^{\perp}_{U_{\beta,\alpha,N}} = \Gamma_\alpha X_{\beta,N}\Pi^{\perp}_{U_{\beta,\alpha,N}} + \Psi_\alpha W_{\beta,\alpha,N}\Pi^{\perp}_{U_{\beta,\alpha,N}} + V_{\beta,\alpha,N}\Pi^{\perp}_{U_{\beta,\alpha,N}}. \qquad (3.44)
\]
Secondly, to eliminate the noise influence in (3.44), we need to find a matrix \(Z_\beta\) such that
\[
\lim_{N\to\infty} \frac{1}{N} W_{\beta,\alpha,N}\Pi^{\perp}_{U_{\beta,\alpha,N}}Z_\beta^T = 0,
\qquad
\lim_{N\to\infty} \frac{1}{N} V_{\beta,\alpha,N}\Pi^{\perp}_{U_{\beta,\alpha,N}}Z_\beta^T = 0, \qquad (3.45)
\]
and that the matrix
\[
\lim_{N\to\infty} \frac{1}{N} X_{\beta,N}\Pi^{\perp}_{U_{\beta,\alpha,N}}Z_\beta^T
\]
has full rank \(n_x\). Equation (3.44) would then, asymptotically, be equivalent to
\[
Y_{\beta,\alpha,N}\Pi^{\perp}_{U_{\beta,\alpha,N}}Z_\beta^T = \Gamma_\alpha X_{\beta,N}\Pi^{\perp}_{U_{\beta,\alpha,N}}Z_\beta^T. \qquad (3.46)
\]
Hence, the column space of \(\Gamma_\alpha\) could then be determined from the matrix on the left hand side of (3.46). Under certain mild assumptions on the input signal and the noise contribution, it is shown in Viberg et al. (1997) that
\[
Z_\beta = \begin{bmatrix} U_{0,\beta,N} \\ Y_{0,\beta,N} \end{bmatrix}, \qquad (3.47)
\]
where
\[
U_{0,\beta,N} = \begin{bmatrix} u_\beta(0) & u_\beta(1) & \cdots & u_\beta(N-1) \end{bmatrix}, \qquad (3.48a)
\]
\[
Y_{0,\beta,N} = \begin{bmatrix} y_\beta(0) & y_\beta(1) & \cdots & y_\beta(N-1) \end{bmatrix}, \qquad (3.48b)
\]
fulfills these requirements. Thus, the user-defined parameter β defines the number of old inputs and outputs that are going to be used as instrumental variables. The choice (3.47) is quite intuitive, since the ith column of the matrices \(W_{\beta,\alpha,N}\) and \(V_{\beta,\alpha,N}\) only contains white noise terms from time \(t = \beta+i-1\) and onwards, so that any \(Z_\beta^T\) whose ith row is constructed from data prior to \(t = \beta+i-1\) will satisfy (3.45).
Now, as noted earlier, the extended observability matrix \(\Gamma_\alpha\) can be estimated by the column space of the matrix \(Y_{\beta,\alpha,N}\Pi^{\perp}_{U_{\beta,\alpha,N}}Z_\beta^T\). This subspace can be found via the singular value decomposition (SVD)
\[
Y_{\beta,\alpha,N}\Pi^{\perp}_{U_{\beta,\alpha,N}}Z_\beta^T = \begin{bmatrix} U_1 & U_2 \end{bmatrix}\begin{bmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix}, \qquad (3.49)
\]
where the \(\alpha n_y \times n_x\) matrix \(U_1\) determines a basis for the column space (see, for example, Golub and Van Loan, 1996). Thus, an estimate of \(\Gamma_\alpha\) is given by \(\hat\Gamma_\alpha = U_1\). Finally, from the relation (3.38) one can find estimates
\[
\hat A = \begin{bmatrix} \hat\Gamma_{\alpha,1} \\ \vdots \\ \hat\Gamma_{\alpha,\alpha-1} \end{bmatrix}^{\dagger}\begin{bmatrix} \hat\Gamma_{\alpha,2} \\ \vdots \\ \hat\Gamma_{\alpha,\alpha} \end{bmatrix},
\qquad
\hat C = \hat\Gamma_{\alpha,1}, \qquad (3.50)
\]
where
\[
\hat\Gamma_\alpha = \begin{bmatrix} \hat\Gamma_{\alpha,1} \\ \vdots \\ \hat\Gamma_{\alpha,\alpha} \end{bmatrix}
\]
has been partitioned into α blocks of size \(n_y \times n_x\), and \(\dagger\) denotes the Moore-Penrose pseudoinverse.
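The estimates (3.50) rest on the shift structure of \(\Gamma_\alpha\), which can be illustrated in isolation: any basis \(\hat\Gamma_\alpha = \Gamma_\alpha T\) of the column space yields \(\hat A = T^{-1}AT\), so the eigenvalues of A are recovered exactly. A Python sketch (the function name estimate_AC is ours):

```python
import numpy as np

def estimate_AC(Gamma_hat, ny):
    # Shift-invariance of the extended observability matrix, cf. (3.50):
    # the top (alpha-1) block rows of Gamma times A equal the bottom ones.
    A = np.linalg.lstsq(Gamma_hat[:-ny], Gamma_hat[ny:], rcond=None)[0]
    C = Gamma_hat[:ny]
    return A, C
```

In practice \(\hat\Gamma_\alpha = U_1\) from (3.49) plays the role of the rotated basis, so only quantities invariant under a state-space basis change, such as the eigenvalues of A, are recovered.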
Estimating B(θ) and D(θ)
Assume now that the estimates \(\hat A\) and \(\hat C\) have been found according to (3.50). Setting \(t = 0\) in (3.35) and replacing the variable τ in the resulting equation with t yields
\[
y(t) = CA^t x(0) + \sum_{i=0}^{t-1} CA^{t-i-1}Bu(i) + Du(t) + n(t), \qquad (3.51)
\]
where the noise contributions have been collected in \(n(t)\). Equivalently, (3.51) can be rewritten as
\[
y(t) = CA^t x(0) + \Bigl(\sum_{i=0}^{t-1} u(i)^T \otimes CA^{t-i-1}\Bigr)\operatorname{vec}(B) + \bigl(u(t)^T \otimes I_{n_y}\bigr)\operatorname{vec}(D) + n(t), \qquad (3.52)
\]
where ⊗ is the Kronecker product and the vec-operator stacks the columns of its argument in a column vector. Thus, since the noise \(n(t)\) is uncorrelated with the input \(u(t)\) due to the open loop assumption, the matrices B and D can be estimated via the least-squares problem
\[
\operatorname*{arg\,min}_{\theta} \frac{1}{N}\sum_{t=1}^{N} \bigl\| y(t) - \varphi(t)^T\theta \bigr\|^2, \qquad (3.53)
\]
where we have introduced \(\theta = \begin{bmatrix} x(0)^T & \operatorname{vec}(B)^T & \operatorname{vec}(D)^T \end{bmatrix}^T\) and
\[
\varphi(t)^T = \begin{bmatrix} \hat C\hat A^t & \displaystyle\sum_{i=0}^{t-1} u(i)^T \otimes \hat C\hat A^{t-i-1} & u(t)^T \otimes I_{n_y} \end{bmatrix}.
\]
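Once \(\hat A\) and \(\hat C\) are fixed, (3.53) is an ordinary least-squares problem in x(0), vec(B) and vec(D). A direct, unoptimized Python sketch on noise-free data (the function name estimate_BD is ours):

```python
import numpy as np

def estimate_BD(y, u, A, C):
    # Regressor of (3.52): y(t) = C A^t x(0)
    #   + (sum_i u(i)^T kron C A^(t-i-1)) vec(B) + (u(t)^T kron I) vec(D),
    # solved for theta = [x(0); vec(B); vec(D)] by least squares (3.53).
    N, ny = y.shape
    nu = u.shape[1]
    nx = A.shape[0]
    Apow = [np.eye(nx)]
    for _ in range(N):
        Apow.append(Apow[-1] @ A)
    rows = []
    for t in range(N):
        r_x0 = C @ Apow[t]
        r_B = np.zeros((ny, nu * nx))
        for i in range(t):
            r_B += np.kron(u[i][None, :], C @ Apow[t - i - 1])
        r_D = np.kron(u[t][None, :], np.eye(ny))
        rows.append(np.hstack([r_x0, r_B, r_D]))
    Phi = np.vstack(rows)
    theta = np.linalg.lstsq(Phi, y.reshape(-1), rcond=None)[0]
    x0 = theta[:nx]
    B = theta[nx:nx + nu * nx].reshape(nx, nu, order='F')
    D = theta[nx + nu * nx:].reshape(ny, nu, order='F')
    return x0, B, D
```

The identity \((x^T \otimes M)\operatorname{vec}(B) = MBx\) is what makes the regressor construction work; note that vec stacks columns, hence the Fortran-order reshapes.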
Estimating K(θ)
Assume now that an innovation model (3.18) is to be estimated. Using the methodology presented above, estimates \(\hat A\), \(\hat B\), \(\hat C\) and \(\hat D\) may be found. Thus, what remains to be estimated is K(θ). To this end, we are going to use the approach presented in Van Overschee and De Moor (1996a) with the notation given in Haverkamp (2001). If an estimate of the state sequence \(X_{\beta,N}\) was at hand, this would be a simple task. In fact, estimates of the process noise and the measurement noise are then given by
\[
\begin{pmatrix} \hat W_{\beta,1,N-1} \\ \hat V_{\beta,1,N-1} \end{pmatrix} = \begin{pmatrix} \hat X_{\beta+1,N-1} \\ Y_{\beta,1,N-1} \end{pmatrix} - \begin{pmatrix} \hat A & \hat B \\ \hat C & \hat D \end{pmatrix}\begin{pmatrix} \hat X_{\beta,N-1} \\ U_{\beta,1,N-1} \end{pmatrix}, \qquad (3.54)
\]
which in turn yields estimates of the noise covariances
\[
\begin{pmatrix} \hat Q & \hat S \\ \hat S^T & \hat R \end{pmatrix} = \frac{1}{N}\begin{pmatrix} \hat W_{\beta,1,N} \\ \hat V_{\beta,1,N} \end{pmatrix}\begin{pmatrix} \hat W_{\beta,1,N} \\ \hat V_{\beta,1,N} \end{pmatrix}^T. \qquad (3.55)
\]
Finally, an estimate of the Kalman filter gain K(θ) is given by the solution to the corresponding Riccati equation
\[
\hat K = (\hat A\hat P\hat C^T + \hat S)(\hat C\hat P\hat C^T + \hat R)^{-1}, \qquad (3.56a)
\]
\[
\hat P = \hat A\hat P\hat A^T + \hat Q - (\hat A\hat P\hat C^T + \hat S)(\hat C\hat P\hat C^T + \hat R)^{-1}(\hat A\hat P\hat C^T + \hat S)^T. \qquad (3.56b)
\]
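Given the covariance estimates, \(\hat K\) can be obtained by iterating (3.56b) to a fixed point and then applying (3.56a). A Python sketch, assuming the iteration converges (which holds under standard stabilizability and detectability conditions):

```python
import numpy as np

def kalman_gain(A, C, Q, R, S, iters=500):
    # Fixed-point iteration of the Riccati equation (3.56b); K via (3.56a).
    nx = A.shape[0]
    P = np.eye(nx)
    for _ in range(iters):
        APC_S = A @ P @ C.T + S
        M = np.linalg.inv(C @ P @ C.T + R)
        P = A @ P @ A.T + Q - APC_S @ M @ APC_S.T
    K = (A @ P @ C.T + S) @ np.linalg.inv(C @ P @ C.T + R)
    return K, P
```

For a scalar system with A = 0.5, C = 1, Q = R = 1 and S = 0, the fixed point satisfies \(P^2 - 0.25P - 1 = 0\), which is easily verified against the iteration.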
The difficult part of estimating K(θ) lies in reconstructing the state sequences \(\hat X_{\beta,N}\) and \(\hat X_{\beta+1,N}\). Now, assume that estimates \(\hat A\), \(\hat B\), \(\hat C\) and \(\hat D\) have been found via (3.50) and (3.53), respectively. With \(w(t) = Ke(t)\) and \(v(t) = e(t)\), the data equation (3.40) becomes
\[
Y_{\beta,\alpha,N} = \hat\Gamma_\alpha X_{\beta,N} + \hat\Phi_\alpha U_{\beta,\alpha,N} + \tilde\Psi_\alpha E_{\beta,\alpha,N}, \qquad (3.57)
\]
where \(\hat\Gamma_\alpha\) and \(\hat\Phi_\alpha\) are as in (3.38) with the system matrices replaced by their respective estimates and
\[
\tilde\Psi_\alpha \triangleq \begin{bmatrix} I_{n_y} & 0 & \cdots & 0 \\ CK & I_{n_y} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ CA^{\alpha-2}K & \cdots & CK & I_{n_y} \end{bmatrix}. \qquad (3.58)
\]
Introduce the instrumental variables as
\[
\tilde Z_\beta \triangleq \begin{bmatrix} U_{\beta,\alpha,N} \\ U_{0,\beta,N} \\ Y_{0,\beta,N} \end{bmatrix}, \qquad (3.59)
\]
with the corresponding projection matrix
\[
\Pi_{\tilde Z_\beta} = \tilde Z_\beta^T\bigl(\tilde Z_\beta\tilde Z_\beta^T\bigr)^{-1}\tilde Z_\beta. \qquad (3.60)
\]
Multiplying (3.57) by \(\Pi_{\tilde Z_\beta}\) from the right and taking the limit as β → ∞ yields
\[
\lim_{\beta\to\infty} Y_{\beta,\alpha,N}\Pi_{\tilde Z_\beta} = \lim_{\beta\to\infty}\bigl(\hat\Gamma_\alpha X_{\beta,N}\Pi_{\tilde Z_\beta} + \hat\Phi_\alpha U_{\beta,\alpha,N}\Pi_{\tilde Z_\beta} + \tilde\Psi_\alpha E_{\beta,\alpha,N}\Pi_{\tilde Z_\beta}\bigr), \qquad (3.61)
\]
where \(\hat\Gamma_\alpha\) and \(\hat\Phi_\alpha\) have been constructed from the estimates \(\hat A\), \(\hat B\), \(\hat C\) and \(\hat D\) according to (3.38). Since the columns of \(E_{\beta,\alpha,N}\) are independent of the rows of \(\tilde Z_\beta^T\), the last term on the right hand side of (3.61) will disappear. Also, since \(U_{\beta,\alpha,N}\) is completely spanned by \(\Pi_{\tilde Z_\beta}\), it holds that
\[
\lim_{\beta\to\infty} U_{\beta,\alpha,N}\Pi_{\tilde Z_\beta} = \lim_{\beta\to\infty} U_{\beta,\alpha,N},
\]
and (3.61) may be reduced to
\[
\lim_{\beta\to\infty} Y_{\beta,\alpha,N}\Pi_{\tilde Z_\beta} = \hat\Gamma_\alpha \lim_{\beta\to\infty} X_{\beta,N}\Pi_{\tilde Z_\beta} + \lim_{\beta\to\infty} \hat\Phi_\alpha U_{\beta,\alpha,N}. \qquad (3.62)
\]
Thus, to find an estimate of \(X_{\beta,N}\), we need to be able to say something about
\[
\lim_{\beta\to\infty} X_{\beta,N}\Pi_{\tilde Z_\beta}.
\]
To this end, insert (3.18b) into (3.18a), resulting in
\[
\hat x(t+1) = \bar A\hat x(t) + \bar Bu(t) + Ky(t), \qquad (3.63)
\]
where we have introduced
\[
\bar A \triangleq \hat A - K\hat C \qquad \text{and} \qquad \bar B \triangleq \hat B - K\hat D.
\]
Now, recursive application of (3.63) yields
\[
\hat X_{\beta,N} = \bar A^{\beta}\hat X_{0,N} + \mathcal{C}_{u,\beta}U_{0,\beta,N} + \mathcal{C}_{y,\beta}Y_{0,\beta,N}, \qquad (3.64)
\]
where
\[
\mathcal{C}_{u,\beta} \triangleq \begin{bmatrix} \bar A^{\beta-1}\bar B & \bar A^{\beta-2}\bar B & \cdots & \bar B \end{bmatrix},
\qquad
\mathcal{C}_{y,\beta} \triangleq \begin{bmatrix} \bar A^{\beta-1}K & \bar A^{\beta-2}K & \cdots & K \end{bmatrix}.
\]
Hence, if \(\bar A\) is stable then \(\bar A^{\beta}\hat X_{0,N}\) will disappear as β → ∞ and \(\hat X_{\beta,N}\) is completely spanned by \(\Pi_{\tilde Z_\beta}\) in (3.60). This implies that
\[
\lim_{\beta\to\infty} X_{\beta,N}\Pi_{\tilde Z_\beta} = \lim_{\beta\to\infty} X_{\beta,N}.
\]
Thus, (3.62) can, finally, be written as
\[
\lim_{\beta\to\infty} Y_{\beta,\alpha,N}\Pi_{\tilde Z_\beta} = \hat\Gamma_\alpha \lim_{\beta\to\infty} X_{\beta,N} + \lim_{\beta\to\infty} \hat\Phi_\alpha U_{\beta,\alpha,N},
\]
and an estimate of \(X_{\beta,N}\), for large β, may be found by
\[
\hat X_{\beta,N} = \hat\Gamma_\alpha^{\dagger}\bigl(Y_{\beta,\alpha,N}\Pi_{\tilde Z_\beta} - \hat\Phi_\alpha U_{\beta,\alpha,N}\bigr). \qquad (3.65)
\]
Thus, there are several steps involved in finding an estimate of the Kalman gain K(θ). First, one needs to reconstruct the state sequences \(X_{\beta,N}\) and \(X_{\beta+1,N}\) to get estimates of the covariance matrices (3.55). Finally, one needs to solve the Riccati equation (3.56) for \(\hat K\).
Summary
The estimation of a complete state-space innovations model (3.18), as described in the
sections above, involves several steps. First, the matrices A(θ) and C(θ) are estimated
via (3.50). Then, these estimates are used to estimate the matrices B(θ) and D(θ)
via (3.53). Finally, an estimate of K(θ) is found by the Riccati equation (3.56). The
steps are summarized in Algorithm 2.
Algorithm 2 SID
Given: A dataset (3.1), the model order \(n_x\) and the user-defined prediction and instrumental variables horizons {α, β}.
1) Construct the Hankel matrices \(U_{\beta,\alpha,N}\) and \(Y_{\beta,\alpha,N}\) as in (3.41). Also, find the corresponding projection matrix \(\Pi^{\perp}_{U_{\beta,\alpha,N}}\) via (3.43) and construct the instruments \(Z_\beta^T\) as in (3.47).
2) Find the column space of \(Y_{\beta,\alpha,N}\Pi^{\perp}_{U_{\beta,\alpha,N}}Z_\beta^T\) via (3.49) as an estimate \(\hat\Gamma_\alpha\) of the extended observability matrix (3.38).
3) Determine the estimates \(\hat A\) and \(\hat C\) of A(θ) and C(θ), respectively, via (3.50).
4) Determine the estimates \(\hat B\) and \(\hat D\) of B(θ) and D(θ), respectively, via (3.53).
5) Reconstruct the state sequences \(X_{\beta,N}\) and \(X_{\beta+1,N}\) via (3.65) using the instruments defined in (3.59) and \(\hat\Phi_\alpha\) as in (3.38) with the system matrices replaced by the corresponding estimates.
6) Estimate the covariance matrices via (3.55) and find an estimate \(\hat K\) of K(θ) by solving the corresponding Riccati equation (3.56).
The precise effects of the user-defined prediction and instrumental variables horizons
α and β are not yet fully understood, other than that they must be greater than nx . For efficient implementations and suggestions of different design choices, the reader is referred
to Van Overschee and De Moor (1996a) and the references therein.
3.4.2 Continuous Time
So far we have only considered estimating discrete-time state-space models (3.15) from
sampled data (3.1). In this section, we will deal with the identification of continuous-time
state-space models
\[
\dot x(t) = A(\theta)x(t) + B(\theta)u(t), \qquad (3.66a)
\]
\[
y(t) = C(\theta)x(t) + D(\theta)u(t) + e(t), \qquad (3.66b)
\]
where the output y(t), the input u(t) and the noise e(t) are assumed to be differentiable.
Furthermore, the input and the noise are assumed to be uncorrelated. The presentation below is based on frequency domain techniques presented in McKelvey and Akçay (1994).
The resulting algorithm is quite similar to the discrete-time SID method presented above.
Consider the continuous-time state-space model (3.66). Now, by introducing
\[
y_\alpha(t) = \begin{bmatrix} y(t)^T & \tfrac{d}{dt}y(t)^T & \cdots & \tfrac{d^{\alpha-1}}{dt^{\alpha-1}}y(t)^T \end{bmatrix}^T,
\]
and similarly for u(t) and e(t), it holds that
\[
y_\alpha(t) = \Gamma_\alpha x(t) + \Phi_\alpha u_\alpha(t) + e_\alpha(t). \qquad (3.67)
\]
Here, Γα and Φα are the same matrices as in the time-discrete case (3.38). Applying the
continuous-time Fourier transform to (3.67) yields
\[
W_\alpha(\omega)\otimes Y(\omega) = \Gamma_\alpha X(\omega) + \Phi_\alpha\bigl(W_\alpha(\omega)\otimes U(\omega)\bigr) + W_\alpha(\omega)\otimes E(\omega), \qquad (3.68)
\]
where Y(ω), X(ω), U(ω) and E(ω) are the Fourier transforms of the signals y(t), x(t), u(t) and e(t), respectively, and
\[
W_\alpha(\omega) \triangleq \begin{bmatrix} 1 & i\omega & \cdots & (i\omega)^{\alpha-1} \end{bmatrix}^T.
\]
Assume that we have samples of the Fourier transforms of the input and output signals at frequencies \((\omega_k)_{k=1}^{M}\). Introduce
\[
Y_{\beta,\alpha,M} \triangleq \begin{bmatrix} W_\alpha(\omega_\beta)\otimes Y(\omega_\beta) & \cdots & W_\alpha(\omega_M)\otimes Y(\omega_M) \end{bmatrix}, \qquad (3.69)
\]
and similarly for \(U_{\beta,\alpha,M}\) and \(E_{\beta,\alpha,M}\). Now, (3.68) can be written as
\[
Y_{\beta,\alpha,M} = \Gamma_\alpha X_{\beta,M} + \Phi_\alpha U_{\beta,\alpha,M} + E_{\beta,\alpha,M}, \qquad (3.70)
\]
with
\[
X_{\beta,M} \triangleq \begin{bmatrix} X(\omega_\beta) & \cdots & X(\omega_M) \end{bmatrix}.
\]
Equation (3.70) is quite similar to the discrete-time data equation (3.40). The only significant difference is that the elements of the matrices in (3.70) are in general complex numbers and certain modifications have to be made. Define the projection matrix
\[
\Pi^{\perp}_{U_{\beta,\alpha,M}} \triangleq I - U_{\beta,\alpha,M}^H\bigl(U_{\beta,\alpha,M}U_{\beta,\alpha,M}^H\bigr)^{-1}U_{\beta,\alpha,M}, \qquad (3.71)
\]
where superscript H denotes the complex conjugate transpose of a matrix. Multiplying (3.70) with (3.71) from the right yields
\[
Y_{\beta,\alpha,M}\Pi^{\perp}_{U_{\beta,\alpha,M}} = \Gamma_\alpha X_{\beta,M}\Pi^{\perp}_{U_{\beta,\alpha,M}} + E_{\beta,\alpha,M}\Pi^{\perp}_{U_{\beta,\alpha,M}}. \qquad (3.72)
\]
Now, it can be shown that
\[
Y_{\beta,\alpha,M}\Pi^{\perp}_{U_{\beta,\alpha,M}}U_{1,\beta,M}^H = \Gamma_\alpha X_{\beta,M}\Pi^{\perp}_{U_{\beta,\alpha,M}}U_{1,\beta,M}^H \qquad (3.73)
\]
holds asymptotically as M → ∞, where the matrix \(U_{1,\beta,M}\) is the instrumental variables matrix (see, for example, McKelvey and Akçay, 1994). In a similar fashion as in the discrete-time case one can now find the column space of \(\Gamma_\alpha\) via the matrix \(G \triangleq Y_{\beta,\alpha,M}\Pi^{\perp}_{U_{\beta,\alpha,M}}U_{1,\beta,M}^H\) (see McKelvey and Akçay, 1994):
\[
\begin{bmatrix} \operatorname{Re}(G) & \operatorname{Im}(G) \end{bmatrix} = \begin{bmatrix} U_1 & U_2 \end{bmatrix}\begin{bmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix}, \qquad (3.74)
\]
where the operators Re and Im extract the real and imaginary parts of a complex number, respectively. An estimate \(\hat\Gamma_\alpha\) of \(\Gamma_\alpha\) is now given by \(U_1\) and the matrices A(θ) and C(θ) can be estimated as in (3.50).
To find estimates of B(θ) and D(θ), one only needs to solve
\[
\operatorname*{arg\,min}_{B,D} \sum_{t=1}^{M} \bigl\| Y(\omega_t) - \bigl(D + \hat C(i\omega_t I - \hat A)^{-1}B\bigr)U(\omega_t) \bigr\|^2. \qquad (3.75)
\]
For an efficient implementation of the frequency domain SID method for estimating continuous-time state-space models (3.66), the reader is referred to McKelvey and Akçay (1994). These problems are often ill-conditioned, and for higher order systems certain care is needed when implementing an algorithm (see, for instance, Van Overschee and De Moor, 1996b).
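Since (3.75) is linear in B and D once \(\hat A\) and \(\hat C\) are fixed, it reduces to a complex least-squares problem, solvable by stacking real and imaginary parts. A SISO Python sketch with U(ω) = 1, that is, fitting frequency response samples directly (the function name is ours):

```python
import numpy as np

def estimate_BD_freq(G, omegas, A, C):
    # G(iw) = D + C (iw I - A)^{-1} B is linear in B and D once A and C
    # are fixed; stack real and imaginary parts to get a real LS problem.
    nx = A.shape[0]
    rows = [np.concatenate([(C @ np.linalg.inv(1j * w * np.eye(nx) - A)).ravel(),
                            [1.0]])
            for w in omegas]
    Phi = np.array(rows)
    Phi_ri = np.vstack([Phi.real, Phi.imag])
    rhs = np.concatenate([G.real, G.imag])
    theta = np.linalg.lstsq(Phi_ri, rhs, rcond=None)[0]
    return theta[:nx].reshape(nx, 1), theta[nx]
```

For well-spread frequencies and a low-order system this is well-conditioned; the ill-conditioning mentioned above appears for high orders and wide frequency ranges, where the cited references use better-conditioned parametrizations.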
The SID algorithms presented in this section and in the previous one have proven to work well in practice and have become the de facto standard for initializing the PEM when estimating state-space models given a dataset. One drawback with these methods is that
it is difficult to include prior information about the system into the model. For example,
physically motivated state-space models often contain elements in the system matrices
that are known beforehand. This limitation will be discussed in a later chapter.
3.5 An Algebraic Approach
In the well-cited paper by Ljung and Glad (1994), a new framework for analyzing the identifiability of nonlinear model structures was introduced, based on the techniques of differential algebra. The procedure used in Ljung and Glad (1994) is called Ritt's algorithm and provides means to simplify a system of differential equations via algebraic manipulations and differentiations of the signals involved. For instance, the main theorem in Ljung and Glad (1994) states that any nonlinear model structure
\[
\dot x(t) = f(x(t), u(t), \theta), \qquad y(t) = h(x(t), u(t), \theta), \qquad (3.76)
\]
where f and h are polynomials, is globally identifiable if and only if it can be rewritten as a linear regression model
\[
\tilde f\bigl(u(t), y(t), \dot u(t), \dot y(t), \ldots\bigr) = \tilde g\bigl(u(t), y(t), \dot u(t), \dot y(t), \ldots\bigr)\theta, \qquad (3.77)
\]
for some polynomials \(\tilde f\) and \(\tilde g\). The implications of this result for system identification are twofold. First, instead of analyzing the properties of a globally identifiable nonlinear model structure (3.76), which may be quite difficult, one can work with an equivalent linear regression model (3.77), which significantly simplifies the analysis. Second, since an algorithm that performs the transformation of the nonlinear model to an equivalent linear regression model is provided, the estimation of the parameters is trivial once the transformation has been made.
The price to be paid for using Ritt’s algorithm to transform the nonlinear model structure (3.76) into the linear regression model (3.77) is the introduction of derivatives of the
signals u(t) and y(t). Thus, the framework proposed in Ljung and Glad (1994) requires
noise-free data and means to find the exact derivatives of the signals involved. Let us
consider a simple example to illustrate the steps involved in the manipulations.
Example 3.3
Consider the following continuous-time model structure
\[
y(t) = \theta u(t) + \theta^2\dot u(t), \qquad (3.78)
\]
where the dot operator denotes differentiation with respect to time t. As was noted in the discrete-time case depicted in Example 1.3, applying the PEM directly to the model structure (3.78) may result in a non-convex optimization problem, see Figure 1.3. Instead, let us try to apply Ritt's algorithm, as described in Ljung and Glad (1994). Via a series of differentiations and polynomial divisions, the algorithm provides a linear regression model
\[
y(t)\ddot u(t) - \dot y(t)\dot u(t) = \bigl(u(t)\ddot u(t) - \dot u^2(t)\bigr)\theta, \qquad (3.79)
\]
plus an additional relationship between the input and output signals that must be fulfilled for all \(t \in \mathbb{R}\):
\[
y(t)\dot u^4(t) - u(t)\dot u^3(t)\dot y(t) - \dot u^3(t)\dot y^2(t) - u(t)y(t)\dot u^2(t)\ddot u(t) + u^2(t)\dot u(t)\dot y(t)\ddot u(t) + 2y(t)\dot u^2(t)\dot y(t)\ddot u(t) - y^2(t)\dot u(t)\ddot u^2(t) = 0. \qquad (3.80)
\]
Thus, to estimate the unknown parameter θ in (3.78), given a sequence of input and output data, one can use (3.79) to get an initial estimate for the PEM, which guarantees, in the noise-free case and with exact derivatives, that the global optimum will be found.
The linear regression (3.79) can be derived in quite a simple manner by eliminating the nonlinear term \(\theta^2\) via some elementary algebra. First, differentiate (3.78) with respect to t, remembering that θ is constant, which results in
\[
\dot y(t) = \theta\dot u(t) + \theta^2\ddot u(t). \qquad (3.81)
\]
By comparing (3.78) with (3.81) one notices that multiplying (3.78) by \(\ddot u(t)\) and (3.81) by \(\dot u(t)\) and subtracting the results yields (3.79). These simple algebraic manipulations are basically what Ritt's algorithm does, and the multiplication and subtraction scheme used above forms what is known as polynomial division.
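The claim can be verified numerically: with a smooth input whose derivatives are known in closed form, the least-squares solution of (3.79) recovers θ exactly from noise-free data. A Python sketch:

```python
import numpy as np

t = np.linspace(0.0, 10.0, 200)
theta_true = 2.0
u, du, ddu = np.sin(t), np.cos(t), -np.sin(t)   # exact derivatives
y = theta_true * u + theta_true**2 * du          # model (3.78)
dy = theta_true * du + theta_true**2 * ddu       # its time derivative (3.81)

# Linear regression (3.79): y*ddu - dy*du = (u*ddu - du**2) * theta,
# solved for theta by scalar least squares.
f = y * ddu - dy * du
g = u * ddu - du**2
theta_hat = np.sum(g * f) / np.sum(g**2)
```

With noisy data or numerically differentiated signals the picture changes, for the reasons discussed in the following paragraph.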
In the example above one notices that by making use of Ritt's algorithm to eliminate the nonlinearities associated with θ in (3.78), one increases the complexity of the interactions between the signals y(t), u(t) and the corresponding derivatives. This may not be a problem in the noise-free case, but if a noise signal e(t) were added to (3.78) it would also undergo the nonlinear transformations. In the general case, this implies that products between the noise e(t), the input u(t), the output y(t), and their time derivatives would appear in the linear regression model. This may complicate the estimation problem significantly, since the regressor variables and the noise signal may become correlated, and Ljung and Glad (1994) do not consider the case when noise is present. In a later chapter we will discuss the generalization of these results to discrete time. Besides making it possible to analyze discrete-time nonlinear systems, this opens up the possibility of dealing with noise.
The methods of differential algebra used here were first formalized by Ritt (1950) and then further developed by Kolchin (1973). These methods were much later introduced into
the automatic control community by Diop (1991). Further details on the use of differential
algebraic methods in automatic control problems can be found in Fliess and Glad (1994)
and algorithmic details are given in Glad (1997).
3.6 Model Validation
An important part of the system identification cycle is to ensure the validity of the model
that has been estimated. To this end, several methods have been developed (see, for
example, Ljung, 1999), and in this section we will only cover a small subset. For further
discussions and references regarding the subject of model validation, the reader is referred
to the books by Ljung (1999) and Söderström and Stoica (1989).
Many of the validation tests in system identification make, in some way, use of the
residual
\[
\varepsilon(t, \hat\theta_N) = y(t) - \hat y(t|\hat\theta_N),
\]
where the parameter estimate θ̂N has been found using some method. A simple, yet
effective validation procedure, is the cross-validation method. Here, the predicted output
of the model for a different dataset, referred to as validation data, than the dataset used
when estimating the model, referred to as estimation data, is compared to the measured
output. Different criteria have been proposed to quantify the closeness of the predicted
and the measured output. A popular choice is the model fit value
kε(t, θ̂N )k2
fit = 100 1 −
,
ky(t) − ȳk2
(3.82)
which gives the relative performance increase, in percent, of using the estimated model
compared to just using the mean ȳ of the output as a predictor. The model fit (3.82)
generally depends on the amount of noise in the data and more noise usually means a
lower model fit. Therefore it is recommended that one also considers the graph containing
the predicted and the measured output side by side and takes the prediction performance
into account when determining if the model is valid or not.
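The model fit (3.82) is straightforward to compute; a minimal Python sketch:

```python
import numpy as np

def model_fit(y, y_pred):
    # Model fit (3.82): relative improvement, in percent, over using the
    # output mean as a predictor; 100 means a perfect prediction.
    return 100.0 * (1.0 - np.linalg.norm(y - y_pred)
                    / np.linalg.norm(y - np.mean(y)))
```

Note that the value can be negative when the model predicts worse than the output mean.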
Another simple test is given by the cross-correlation method. This method is based
on the covariance between the residuals and past inputs
\[
\hat R_{\varepsilon u}^{N}(\tau) \triangleq \frac{1}{N}\sum_{t=1}^{N} \varepsilon(t)u(t-\tau). \qquad (3.83)
\]
A large value of the covariance (3.83) for some time shift τ indicates that the corresponding shifted input is important for describing the data and should be incorporated into the model. This is usually found out by a simple χ² hypothesis test (see Ljung, 1999). It can also be informative to consider the correlation among the residuals themselves
\[
\hat R_{\varepsilon}^{N}(\tau) \triangleq \frac{1}{N}\sum_{t=1}^{N} \varepsilon(t)\varepsilon(t-\tau). \qquad (3.84)
\]
A large value of the correlation (3.84) for some time shift τ ≠ 0 indicates that some part of \(\varepsilon(t,\hat\theta_N)\) could have been predicted from past data. Thus, the measured output y(t) could have been predicted better. This effect can also be detected by a simple χ² hypothesis test (see Ljung, 1999).
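The covariance (3.83) is equally simple to compute. The Python sketch below evaluates it for a range of time shifts and shows how a delayed dependence appears as a large value at the corresponding lag (the χ² test itself is omitted):

```python
import numpy as np

def cross_corr(eps, u, max_lag):
    # Sample cross-covariance (3.83): R(tau) = (1/N) sum_t eps(t) u(t - tau),
    # computed here for tau = 1, ..., max_lag.
    N = len(eps)
    return np.array([np.dot(eps[tau:], u[:N - tau]) / N
                     for tau in range(1, max_lag + 1)])
```

Under the whiteness hypothesis each value is approximately Gaussian with variance proportional to 1/N, which is what the χ² test aggregates over the lags.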
If several models, within the same model set, with different orders have been estimated, one should not only take the adaptation to data into account when deciding which of the models to use, but also the complexity of the model. To this end, several measures have been proposed that, besides taking the fit to data into account, also penalize the number of parameters used to achieve this fit. In this thesis we will only make use of the AIC and MDL selection criteria, which have already been presented in Section 2.4.
The methods described above are quite simple but have been shown to work well in practice. It is important to note that model validation methods usually focus on detecting one particular deficiency of the estimated models, and therefore it is important to use several methods when testing their validity. In this thesis, model validation will not play a major role and the methods described in this section are sufficient for our purpose.
4 The Nonnegative Garrote in System Identification
Regressor and structure selection of linear regression models have been thoroughly researched in the statistical community for some time. Different regressor selection methods have been proposed, such as the ridge and lasso regression methods, see Section 2.3. Especially the lasso regression has won popularity because of its ability to set less important parameters exactly to zero. However, these methods were not developed with dynamical systems in mind, where the regressors are ordered via the time lag. To this end, a modified variant of the nonnegative garrote (NNG) method (Breiman, 1995) will be analyzed.
4.1 Introduction
An important subproblem of system identification is to find the model with the lowest
complexity, within some model set, which describes a given set of data sufficiently well.
There are several reasons for considering this problem. One reason is that even though
a higher model complexity will yield a better adaptation to the data used for the estimation, it might be that the model is adapting too well. Thus, the model does not properly
represent the system itself, only the data that is used for the estimation. Another aspect is
the possible constraints on the computational resources available in certain time-critical
applications. A higher model complexity usually means a higher computational cost, both
in time and in memory. Hence, a model with lower complexity might be preferred to one
with higher complexity if the loss in the ability to describe the data is not too great.
In the statistical community, this problem is often referred to as the regressor selection
problem, for which several solutions have been proposed in the literature (see Chapter 2
for a selection of these methods). For linear regression models, the complexity is defined
as the total number of parameters used. This is a good measure of the complexity when
there is no dynamical dependence between the regressors. For dynamical systems, such
as the ARX model structure (3.8), the complexity is not defined only by the number of
parameters used, but also via the order of the model. Here, the order is defined by the
maximum of the highest time lag of the input and the highest time lag of the output used in
the model. Choosing the model which describes the data sufficiently well given a number
of parameters is not uniquely defined and we have already presented several criteria which
consider both the adaptation to data and the number of parameters used in Section 2.4.
Now, let us consider using the lasso and the NNG regressor selection methods presented in Section 2.3 for estimating the order of an ARX model
\[
A_p(q)y(t) = B_p(q)u(t) + e(t), \qquad (4.1)
\]
given measurements from a real-life process. This model structure may be represented as a linear regression
\[
y(t) = \varphi(t)^T\theta + e(t), \qquad (4.2)
\]
where
\[
\varphi(t) \triangleq \begin{bmatrix} -y(t-1) & \cdots & -y(t-n_a) & u(t-1) & \cdots & u(t-n_b) \end{bmatrix}^T,
\]
\[
\theta \triangleq \begin{bmatrix} a_1 & \cdots & a_{n_a} & b_1 & \cdots & b_{n_b} \end{bmatrix}^T,
\]
and the use of the regularization methods presented in Section 2.3 is straightforward.
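Forming the regression (4.2) and solving for θ by least squares is direct; a Python sketch on simulated noise-free ARX data (the function name is ours):

```python
import numpy as np

def arx_regression(y, u, na, nb):
    # Build phi(t) = [-y(t-1) ... -y(t-na), u(t-1) ... u(t-nb)]^T and solve
    # the least-squares problem for theta in (4.2).
    n = max(na, nb)
    N = len(y)
    rows = [np.concatenate([-y[t - na:t][::-1], u[t - nb:t][::-1]])
            for t in range(n, N)]
    Phi = np.array(rows)
    return np.linalg.lstsq(Phi, y[n:], rcond=None)[0]
```

The regularized variants of Section 2.3 replace the plain least-squares step with a penalized criterion but use the same regressor matrix.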
Example 4.1: (Fan and Plate)
Let us consider using the lasso (2.10) and the NNG (2.14) regularization methods to estimate the order of a "fan and plate" process illustrated in Figure 4.1.

Figure 4.1: The "Fan and Plate" process.

The system consists of a fan connected to an electrical motor and a hinged plate which is able to swing with
low friction. The electrical motor is driven by an electrical current which forces the fan to
generate wind gusts. The wind then hits the plate which starts to move accordingly. This
system is known to be well-behaved from an identification point of view and can be well
approximated by an ARX model (4.1). The input u(t) is the voltage driving the fan and
the output y(t) is the angle of the plate. Data has been collected using a random binary
telegraph signal (see Ljung, 1999) as input, switching between 2 V and 6 V with a probability of 0.08 and the sampling time was set to Ts = 0.04 s. In total, 1000 data points have
been collected where a part is illustrated in Figure 4.2. The data is split into two parts: the first two thirds have been used as estimation data and the remaining third for validation.

Figure 4.2: Parts of the data collected from the fan and plate process. The lower plot shows the input signal and the upper plot the corresponding output.
As discussed in Section 2.3, it can be shown that the solution paths of the lasso (2.10)
and the NNG (2.14) are piecewise affine functions of the penalizing parameter λ. This
property enables efficient algorithms for finding the complete solution path and the values
of λ where the local affine representation of the solution changes will be referred to as
breakpoints (see Figure 2.2). The solution paths are illustrated in Figure 4.3, where the
lasso is given in the left column and the NNG in the right. The first row shows the values
of the model fit (3.82) for the corresponding method evaluated on validation data, with
the indices of the breakpoints on the horizontal axis. The second row depicts which
parameters in the \(A_p(q)\) polynomial in (4.1) are nonzero, and similarly for the last row, where the nonzero coefficients of \(B_p(q)\) are represented.
For the lasso method (the first column of Figure 4.3), the model fit is good up to
a certain point, and the number of nonzero parameters used to achieve the fit is small,
especially for the Ap(q) polynomial illustrated in the second row. In fact, minimizing the
MDL model order selection criterion $W_N^{\mathrm{MDL}}\big(\hat{\theta}_N^{\mathrm{lasso}}(\lambda)\big)$ in (2.17), evaluated on estimation
data, with respect to λ chooses the 88th breakpoint, which corresponds to using only five
nonzero coefficients in the Ap(q) polynomial and eight in Bp(q). Unfortunately, this
choice corresponds to using the model orders na = 20, nb = 8 and nk = 0. Thus, few
nonzero parameters do not automatically imply a low model order.
The NNG method (the second column of Figure 4.3) behaves similarly to the lasso, and
the MDL choice is the 39th breakpoint. This corresponds to λ = 0.2311, which results in
the model orders na = 17, nb = 3 and nk = 1. Thus, the NNG detects that there is a
possible time delay in the system, but fails to yield a lower order model.
The example above shows that the statistical definition of model complexity does not take
the order of dynamical systems into account; that is, the regularization methods presented
in Section 2.3 only minimize the number of parameters used and do not consider the
4 The Nonnegative Garrote in System Identification
Figure 4.3: The lasso and the NNG solution paths for the fan and plate data. The
first row shows the values of the model fit for the corresponding methods evaluated
on validation data, with the indices of the breakpoints on the horizontal axis. The
second row illustrates which coefficients of the Ap(q) polynomial are nonzero, and
similarly for Bp(q) in the last row.
order of the time lags. To take such structural information into account, one needs to
incorporate an ordering of the time lags into the regularization problem. For the lasso
method, this is difficult to achieve, since one works directly with the absolute values of the
regressors. The NNG instead allows ordering of the regressors according to their importance
rather than their absolute values, which is appealing. The possibility of incorporating structural
information into the NNG for linear regression models was also mentioned in Yuan et al.
(2007), but no solution was given.
The chapter is organized as follows: First, the NNG is formulated as an optimization
problem in which general linear inequality constraints can be handled. Then, a modification
of the NNG is proposed to handle the ordering of ARX models via the time lag. The
proposed method is evaluated on two different sets of data, and the result is compared to
that of an exhaustive search among all possible regressors. Finally, the proposed method
is generalized to order selection of the LPV-ARX model structure.
4.2 Problem Formulation
The main focus of this chapter is to incorporate structural information into the regressor
selection problem for linear regression models

$$y(t) = \varphi(t)^T\theta + e(t). \qquad (4.3)$$

Here, the regression vector $\varphi(t)$ is a vector-valued, possibly nonlinear, function of old
input and output data. This description covers a wide class of model structures, for
instance, the ARX and the LPV-ARX model structures presented in Chapter 3. When
selecting regressors for dynamical systems, there may be a natural or desired ordering of
the regressors; for example, in the ARX case the regressors are ordered via their time lag.
Let us now turn our attention to the formulation of the NNG problem

$$\begin{aligned} \underset{w \in \mathbb{R}^{n_\theta}}{\text{minimize}} \quad & \frac{1}{N}\sum_{t=1}^{N}\Big(y(t) - \varphi(t)^T\big(\hat{\theta}_N^{\mathrm{LS}} \odot w\big)\Big)^2 + \frac{\lambda}{N}\sum_{i=1}^{n_\theta} w_i \\ \text{subject to} \quad & w \succeq 0, \end{aligned} \qquad (4.4)$$

where $\odot$ denotes element-wise multiplication and $\hat{\theta}_N^{\mathrm{LS}}$ is the least-squares estimate of the
parameters. By stacking the output y(t) and the regressors $\varphi(t)^T$ into matrices Y and Φ,
respectively, one can rewrite the NNG (4.4) as

$$\begin{aligned} \underset{w \in \mathbb{R}^{n_\theta}}{\text{minimize}} \quad & \big\|Y - \Phi\hat{\Theta}_N^{\mathrm{LS}} w\big\|^2 + \lambda \mathbf{1}^T w \\ \text{subject to} \quad & w \succeq 0, \end{aligned} \qquad (4.5)$$

where the scaling factor 1/N has been removed and we have introduced
$\hat{\Theta}_N^{\mathrm{LS}} \triangleq \operatorname{diag}\big(\hat{\theta}_N^{\mathrm{LS}}\big)$.
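For a single fixed λ, (4.5) can be solved numerically with any bound-constrained smooth solver, since the objective is a convex quadratic and the only constraints are w ⪰ 0. The sketch below uses SciPy's L-BFGS-B as an illustrative stand-in; it is not the path-following approach developed later in this chapter:

```python
import numpy as np
from scipy.optimize import minimize

def nng_fixed_lambda(Phi, Y, lam):
    """Solve the NNG problem (4.5) for a single penalty value lam:
        min_w ||Y - Phi diag(theta_ls) w||^2 + lam * sum(w),  w >= 0,
    and return the shrunken parameter estimate theta_ls * w."""
    theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    M = Phi * theta_ls                 # equals Phi @ diag(theta_ls)
    def cost(w):
        r = M @ w - Y
        return r @ r + lam * w.sum()
    def grad(w):
        return 2.0 * M.T @ (M @ w - Y) + lam
    res = minimize(cost, np.ones(M.shape[1]), jac=grad, method='L-BFGS-B',
                   bounds=[(0.0, None)] * M.shape[1])
    return theta_ls * res.x
```

At λ = 0 the optimum is w = 1, so the estimate reduces to the least-squares one; as λ grows, the weights are driven toward the bound at zero.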
To incorporate structural orderings of the regressors, we exchange the simple bound constraints $w \succeq 0$ in (4.5) for the more general linear inequality constraint $Aw \preceq b$, which
may, of course, contain simple bounds. Expanding the matrix norm in (4.5) yields

$$\begin{aligned} \underset{w \in \mathbb{R}^{n_\theta}}{\text{minimize}} \quad & \frac{1}{2}w^T Q w + f^T w + \lambda\mathbf{1}^T w \\ \text{subject to} \quad & Aw \preceq b, \end{aligned} \qquad (4.6)$$

where $Q = 2\hat{\Theta}_N^{\mathrm{LS}}\Phi^T\Phi\hat{\Theta}_N^{\mathrm{LS}}$ and $f = -2\hat{\Theta}_N^{\mathrm{LS}}\Phi^T Y$. The modified formulation (4.6) of the
original NNG method (4.4) using general linear inequality constraints will in the following
be denoted by MNNG. The problem (4.6) belongs to the special class of optimization
problems referred to as parametric quadratic programs (PQPs) (see, for instance, Gould,
1991). These problems have a piecewise affine solution path, that is, the solution w(λ)
to (4.6) is a piecewise affine function of the parameter λ. This enables efficient methods
for finding the complete solution path, and an algorithm for the strictly convex case, when
Q is positive definite, is given in Appendix 4.A.
4.3 Applications to ARX Models
As previously mentioned, the regressors in the ARX model structure (3.8) are naturally
ordered by their time lag. The original NNG method (4.4) does not take such orderings into
consideration; it just sets the weights of the less important regressors low, regardless of
their order. To be able to penalize higher-order lags first, we will use the more general
formulation (4.6) by adding constraints on the weights. For ARX models, these constraints
are chosen as

$$1 \geq w_1 \geq w_2 \geq \cdots \geq w_{n_a} \geq 0, \qquad (4.7a)$$
$$1 \geq w_{n_a+1} \geq w_{n_a+2} \geq \cdots \geq w_{n_a+n_b} \geq 0. \qquad (4.7b)$$

In this presentation we have chosen to include the upper bound that all weights should
be less than or equal to one. There is no strong theoretical motivation for this choice, and
the effect of this upper bound is small in the examples that follow. The inclusion of
the constraints (4.7) into the MNNG problem (4.6) provides a natural extension of the
NNG method for order selection of ARX models in system identification, one which ensures
an ordering of the time lags. Note that the ordering (4.7) lets the selection algorithm choose
independently between using old inputs or old outputs to describe the data; in other words,
it lets the coefficients of the zero polynomial and the pole polynomial be
chosen independently. This yields automatic order selection and a natural way to
weigh the importance of input lags against output lags.
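In the form Aw ⪯ b required by (4.6), the chains (4.7) become one row per pairwise inequality plus the two end bounds. A hypothetical helper (not from the thesis) illustrating the construction:

```python
import numpy as np

def arx_ordering_constraints(na, nb):
    """Encode the chains (4.7),
        1 >= w_1 >= ... >= w_na >= 0,
        1 >= w_{na+1} >= ... >= w_{na+nb} >= 0,
    as A w <= b for use in the MNNG problem (4.6)."""
    n = na + nb
    A_rows, b_vals = [], []
    for lo, hi in ((0, na), (na, n)):
        e = np.zeros(n); e[lo] = 1.0            # w_first <= 1
        A_rows.append(e); b_vals.append(1.0)
        for i in range(lo, hi - 1):             # w_{i+1} - w_i <= 0
            e = np.zeros(n); e[i] = -1.0; e[i + 1] = 1.0
            A_rows.append(e); b_vals.append(0.0)
        e = np.zeros(n); e[hi - 1] = -1.0       # -w_last <= 0
        A_rows.append(e); b_vals.append(0.0)
    return np.vstack(A_rows), np.array(b_vals)
```

Each chain of length m contributes m + 1 rows, so the constraint matrix stays sparse and small even for large initial model orders.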
Let us first evaluate the proposed method on a simple example where the system has
an ARX structure.
Example 4.2: (ARX system)
As a first evaluation of the proposed modification of the NNG regressor selection method,
let the system have the ARX structure

$$A_p(q)y(t) = B_p(q)u(t) + e(t), \qquad (4.8a)$$

where e(t) is zero-mean white Gaussian noise with unit variance and

$$\begin{aligned} A_p(q) &= 1 - 1.25q^{-1} + 0.4375q^{-2} - 0.3594q^{-3} + 0.1719q^{-4} + 0.3125q^{-5} \\ &\quad - 0.2764q^{-6} + 0.1360q^{-7} - 0.0769q^{-8} + 0.0137q^{-9}, \end{aligned} \qquad (4.8b)$$
$$B_p(q) = 1 + 0.25q^{-1} - 0.25q^{-2}. \qquad (4.8c)$$
Thus, the true model order is given by na = 9, nb = 3 and nk = 0. For data collection,
the system (4.8) is simulated using a white Gaussian noise input u(t) with zero mean and
unit variance. The data is then split into two separate sets, each of length N = 1000, one
for estimation and one for validation.
In the system identification toolbox for MATLAB there are functions for estimating
the order of an ARX model, namely struc, arxstruc and selstruc (see Ljung,
2009). These commands implement an exhaustive search among all specified model orders.
Here, we let na and nb be chosen independently from the set {1, 2, . . . , 20}, with
the time delay fixed to nk = 0. The result, where the model orders have been chosen
by the minimum value of the cost function on the validation dataset, is given in the
Figure 4.4: The resulting solution paths using the exhaustive search implemented
in the system identification toolbox for MATLAB and the modified NNG method for
data generated by the ARX system (4.8). The first row shows the model fit values
evaluated on validation data for the corresponding method. The second row illustrates
which coefficients of the Ap(q) polynomial are nonzero, and similarly for
Bp(q) in the last row.
left column of Figure 4.4. The first row illustrates the model fit values (3.82) evaluated
on validation data, with the different model orders depicted on the horizontal axis. The
middle row shows which coefficients are nonzero in the pole polynomial Ap(q), and
similarly the last row for the zero polynomial Bp(q). The MDL choice (2.17) of model
order for the exhaustive search implementation is given by na = 8 and nb = 3 when
evaluated on estimation data.
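The exhaustive search performed by struc, arxstruc and selstruc can be mimicked in a few lines. The following Python stand-in is a simplified sketch, not the MATLAB toolbox code; it scores each candidate by squared one-step prediction error rather than the model fit (3.82):

```python
import numpy as np

def arx_regression(y, u, na, nb, nk):
    """Stack the ARX regressors [-y(t-1), ..., -y(t-na), u(t-nk), ...,
    u(t-nk-nb+1)] into a matrix, together with the stacked output."""
    t0 = max(na, nk + nb - 1)
    cols = [-y[t0 - i:len(y) - i] for i in range(1, na + 1)]
    cols += [u[t0 - j:len(y) - j] for j in range(nk, nk + nb)]
    return np.column_stack(cols), y[t0:]

def exhaustive_arx_search(ye, ue, yv, uv, orders, nk=1):
    """Naive stand-in for struc/arxstruc/selstruc: fit every (na, nb)
    pair by least squares on estimation data and keep the pair with the
    smallest squared prediction error on validation data."""
    best_loss, best_order = np.inf, None
    for na in orders:
        for nb in orders:
            Phi_e, Ye = arx_regression(ye, ue, na, nb, nk)
            theta, *_ = np.linalg.lstsq(Phi_e, Ye, rcond=None)
            Phi_v, Yv = arx_regression(yv, uv, na, nb, nk)
            loss = np.sum((Yv - Phi_v @ theta) ** 2)
            if loss < best_loss:
                best_loss, best_order = loss, (na, nb)
    return best_order, best_loss
```

The double loop makes the quadratic cost in the maximal order explicit, which is the scaling the MNNG method is later compared against.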
Now, for the MNNG method, the initial least-squares parameter estimate $\hat{\theta}_N^{\mathrm{LS}}$ is found
by solving (2.2) for the orders na = 20, nb = 20 and nk = 0. Due to the presence
of noise, all estimated parameter values will be nonzero. Solving (4.6) for all λ ≥ 0
under the proposed constraints (4.7) and determining the model fit for validation data
results in the right column of Figure 4.4. Here, the horizontal axis depicts the indices
of the breakpoints where the local affine representation of the piecewise solution path
as a function of λ changes. We notice that as λ increases, more and more parameter
values become exactly zero, until a certain point where only one parameter in the zero
polynomial Bp (q) remains. The MDL choice (2.17) yields, for this particular realization
of input and noise signals, the same model order (na = 8, nb = 3) as the exhaustive
search implementation.
In the example above we saw that the proposed modification of the NNG performed
equally well, when searching for the order of an ARX model, as the exhaustive search
implemented in the system identification toolbox for MATLAB. Neither of the two methods
proposed the true model order, even though na = 9 and nb = 3 clearly exists as an
element of the solution path of both methods. This implies that the last parameter of the
pole polynomial Ap(q) does not yield any significant contribution to the model fit and
may thus be disregarded.
Now, let us return to the fan and plate process used in the introductory example.
Example 4.3: (Example 4.1 revisited)
Consider once again the fan and plate system in Example 4.1. Using the exhaustive
search implementation, letting na and nb be chosen independently from the set
{1, 2, . . . , 20} with nk = 1 fixed, results in the left column of Figure 4.5.
As before, the model orders have been chosen by the minimum value of the cost function
determined on validation data. The horizontal axis depicts the total number of parameters
used in the corresponding model. The maximal model fit (3.82) is seen to correspond
to choosing na = 4 and nb = 2, which coincides with the MDL choice (2.17) when
evaluated on estimation data.
Now, let us consider using the NNG (4.4) with the proposed linear inequality constraints (4.7). Solving (4.6) for all λ ≥ 0 yields the solution path illustrated in
the right column of Figure 4.5. The model fit values have been determined on validation
data, with the indices of the breakpoints of the piecewise affine solution path on the horizontal
axis.
Comparing this result with the ordinary NNG solution given in the right column of
Figure 4.3, one notices that the sparsity pattern of the coefficients has changed. The gaps
in the coefficient list of the ordinary NNG have disappeared, at the cost of using more
parameters in the MNNG method. Even though the number of parameters has increased, the
model order has decreased. Here, the maximal model fit value for the MNNG method
corresponds to an ARX model with na = 9 and nb = 4, which coincides with the MDL choice.
Thus, we have managed to decrease the model order chosen by the NNG method by adding
the linear inequality constraints (4.7) on the weights w when solving (4.6). Unfortunately,
this choice uses twice as many parameters as are really needed (compare with the
result using an exhaustive search).
The example above indicates that the proposed modification of the NNG may decrease
the model order chosen by the method. One drawback is that the chosen model order,
at least in this example, is much higher than really needed. On the plus side, the
proposed method scales much better than the exhaustive search when the number of
Figure 4.5: The resulting solution paths using the exhaustive search implemented
in the system identification toolbox for MATLAB and the MNNG method for the fan
and plate data in Example 4.1. The first row shows the model fit values for the
corresponding methods evaluated on validation data. The second row illustrates which
coefficients of the Ap(q) polynomial are nonzero, and similarly for Bp(q) in the
last row.
parameters increases. The computational complexity of the MNNG method increases only
linearly with the number of parameters, while the complexity of the exhaustive
search increases quadratically. This should yield a significant computational advantage
for the MNNG compared to the exhaustive search for high initial model orders or when
multiple inputs and outputs are present.
Remark 4.1. It has recently been brought to the author's attention that the NNG can be
seen as a special case of the adaptive lasso (see, for instance, Zou, 2006), for which
consistency and other desirable statistical properties have been established. What implications
this has for the modifications proposed here for dynamical systems is not clear, and it
remains a topic for future research.
4.4 Applications to LPV-ARX Models
Let us consider using the MNNG for order selection of LPV-ARX models (see Section 3.2.2).
This problem adds an extra dimension to the selection problem compared to the ordinary
ARX case, since it is also important to select which basis functions should be included
in the model. The problem of choosing which basis functions to incorporate into the
model is referred to as structure selection.
The LPV-ARX model structure is defined in the SISO case as

$$A(p, q)y(t) = B(p, q)u(t) + e(t), \qquad (4.9)$$

where

$$A(p, q) = 1 + a_1(p)q^{-1} + \cdots + a_{n_a}(p)q^{-n_a},$$
$$B(p, q) = b_1(p)q^{-n_k} + \cdots + b_{n_b}(p)q^{-n_k-n_b+1},$$

and the coefficient functions $a_i, b_j : P \to \mathbb{R}$ have a static dependence on the measured
variable p. Introduce

$$\big(\phi_1(p), \ldots, \phi_{n_g}(p)\big) \triangleq \big(a_1(p), \ldots, a_{n_a}(p), b_1(p), \ldots, b_{n_b}(p)\big),$$

with $n_g \triangleq n_a + n_b$. Here, we assume that each of the functions $\phi_i$ may be linearly
parametrized as

$$\phi_i(p) = \theta_{i0} + \sum_{j=1}^{s_i} \theta_{ij}\psi_{ij}(p),$$

where $(\theta_{ij})_{i=1,j=0}^{n_g,s_i}$ are unknown parameters and $(\psi_{ij})_{i=1,j=1}^{n_g,s_i}$ are basis functions. The
LPV-ARX model structure can be written as a linear regression model (4.2) with

$$\varphi(t) \triangleq \big(-y(t-1),\ -\psi_{11}(p(t))y(t-1),\ \ldots,\ -\psi_{1s_1}(p(t))y(t-1),\ -y(t-2),\ \ldots,\ -\psi_{n_a s_{n_a}}(p(t))y(t-n_a),\ u(t),\ \ldots\big)^T, \qquad (4.10)$$
$$\theta \triangleq \big(\theta_{1,0},\ \ldots,\ \theta_{1,s_1},\ \theta_{2,0},\ \ldots,\ \theta_{n_g,s_{n_g}}\big)^T.$$
From the structure of the linear regression model above, one notices that the regressors
are grouped. That is, each time-shifted signal x(t − i), where x(t) denotes either the input
u(t) or the output y(t), is involved in si + 1 regressors, where si is the number of basis
functions ψij, j = 1, . . . , si, within the group. The groups correspond to the scheduling-variable-dependent coefficients ai(p(t)) and bj(p(t)) of the polynomials A(p, q) and
B(p, q), respectively.
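Assembling the grouped regression vector (4.10) from data is mechanical. The sketch below is a hypothetical helper that assumes, for simplicity, that every group uses the same basis functions, evaluated at the current scheduling value p(t):

```python
import numpy as np

def lpv_arx_regression(y, u, p, na, nb, nk, basis):
    """Build the grouped LPV-ARX regression matrix of (4.10): each lagged
    signal contributes one constant column plus one column per basis
    function psi(p(t)). Simplifying assumption: identical basis functions
    in every group."""
    N = len(y)
    t0 = max(na, nk + nb - 1)          # first sample with full history
    pt = p[t0:N]
    cols = []
    for i in range(1, na + 1):         # groups for the a_i(p) coefficients
        sig = -y[t0 - i:N - i]
        cols.append(sig)
        cols += [psi(pt) * sig for psi in basis]
    for j in range(nk, nk + nb):       # groups for the b_j(p) coefficients
        sig = u[t0 - j:N - j]
        cols.append(sig)
        cols += [psi(pt) * sig for psi in basis]
    return np.column_stack(cols), y[t0:N]
```

With this grouping, the columns belonging to one coefficient function are contiguous, which is convenient when forming the grouped constraints introduced next.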
As in the ARX case, we want to use as few time lags of the input u(t) and the output
y(t) as possible. This can be accomplished by adding the constraints

$$\sum_{j=0}^{s_1} w_{1j} \geq \sum_{j=0}^{s_2} w_{2j} \geq \cdots \geq \sum_{j=0}^{s_{n_a}} w_{n_a j},$$
$$\sum_{j=0}^{s_{n_a+1}} w_{(n_a+1)j} \geq \cdots \geq \sum_{j=0}^{s_{n_g}} w_{n_g j}, \qquad (4.11)$$
to the NNG method (4.4). This choice is similar to the constraints on the weights in the
ARX case, see (4.7), but now with grouped variables. There are several possible choices
of constraints, of which (4.11) is only one. Another possibility is to require that the
minimum weight in one group be greater than the maximum weight in the following group.
This choice is more restrictive than (4.11) and probably not suitable in
this application. The effect of different choices is not known and is a subject for future
research.
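Like (4.7), the grouped ordering (4.11) fits the form Aw ⪯ b of the MNNG problem (4.6), with one row per pair of adjacent groups. A hypothetical helper sketching the construction (the elementwise bounds w ⪰ 0 from (4.4) would be appended separately):

```python
import numpy as np

def lpv_group_constraints(sizes_a, sizes_b):
    """Encode the grouped ordering (4.11) as A w <= b: the summed weights
    of consecutive coefficient groups are nonincreasing in the time lag,
    separately for the A(p, q) and B(p, q) sides. sizes_a[i] = s_i + 1 is
    the number of weights in group i (constant term plus basis functions)."""
    sizes = list(sizes_a) + list(sizes_b)
    n = sum(sizes)
    starts = np.cumsum([0] + sizes)
    rows = []
    for lo, hi in ((0, len(sizes_a)), (len(sizes_a), len(sizes))):
        for g in range(lo, hi - 1):
            e = np.zeros(n)
            e[starts[g]:starts[g + 1]] = -1.0     # -sum(group g)
            e[starts[g + 1]:starts[g + 2]] = 1.0  # +sum(group g+1) <= 0
            rows.append(e)
    if not rows:
        return np.zeros((0, n)), np.zeros(0)
    return np.vstack(rows), np.zeros(len(rows))
```

The more restrictive alternative mentioned above (minimum of one group above the maximum of the next) would instead need one row per pair of individual weights across adjacent groups.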
Now, let us evaluate the proposed method in some simulations. Unfortunately, it
seems to be difficult to find data from real-life LPV systems, so the evaluation has to be
done on simulated data.
Example 4.4: (LPV- SS system)
As a first evaluation of the proposed modification of the NNG regressor selection method,
let the system be given by the LPV-SS model

$$\begin{aligned} x(t+1) &= \begin{bmatrix} 0 & p(t) \\ 1 & p(t) \end{bmatrix} x(t) + \begin{bmatrix} 1 \\ 1 \end{bmatrix} u(t) + \begin{bmatrix} 1 \\ 1 \end{bmatrix} e(t), \\ y(t) &= \begin{bmatrix} 1 & 0 \end{bmatrix} x(t). \end{aligned} \qquad (4.12)$$
Here, we assume that the scheduling variable p(t) is contained within the closed interval
[−0.4, 0.4] and that e(t) is white Gaussian noise with zero mean and unit variance. The
frozen poles of the system, when letting p vary within the interval [−0.4, 0.4] are given
in the upper left corner of Figure 4.6. Using the method for transforming an LPV- SS
model to an equivalent LPV- ARX model described in Tóth (2008), the model (4.12) may
be rewritten as
y(t) − p(t − 1)y(t − 1) − p(t − 1)y(t − 2) = u(t − 1) + e(t − 1).
(4.13)
Thus, the true model order is given by na = 2, nb = 1 and nk = 1.
For data collection, the system (4.12) is simulated using a white Gaussian noise input
u(t) with zero mean and unit variance and a uniformly distributed scheduling variable
p(t) in the interval [−0.4, 0.4]. The data is then split into two separate sets, each of length
N = 5000, one for estimation and one for validation. In the initial least-squares estimate,
we let na = 5, nb = 5 and the basis functions be given by
$$\psi_{i1}(p) = p(t), \qquad \psi_{i2}(p) = p(t-1), \qquad \psi_{i3}(p) = p(t-2), \qquad \psi_{i4}(p) = p(t-3).$$
The result of running the NNG with the additional constraints (4.11) is shown in Figure 4.6,
where the model fit in the upper right corner has been determined on validation data.
The two figures in the last row of Figure 4.6 show which parameters θ in (4.10), corresponding to the coefficients in the A(p, q) and the B(p, q) polynomials, are nonzero,
respectively. The estimated coefficients of the A(p, q) and the B(p, q) polynomials for
the maximal model fit correspond to the 83rd breakpoint and are given by

$$\begin{aligned} a_1(p(t)) &= -0.9999\,p(t-1), & b_0(p(t)) &= -0.0099 - 0.0435\,p(t-1), \\ a_2(p(t)) &= -0.9385\,p(t-1), & b_1(p(t)) &= 0.9932 - 0.0001\,p(t-2), \end{aligned}$$

and $b_2(p(t)) = -0.0642\,p(t-3)$, where the remaining coefficients are zero. This is
quite close to the system (4.13). The exceptions are that the polynomials b0, b2 and the
Figure 4.6: The result of the MNNG for the LPV-SS system given by (4.12). The
upper left corner depicts the frozen poles of the system (4.12) as the scheduling
variable p(t) varies over [−0.4, 0.4]. The upper right corner shows the model fit for the
estimated models evaluated on validation data. The lower left corner illustrates
which coefficients of the A(p, q) polynomial are nonzero, and similarly for B(p, q)
in the lower right corner.
coefficient in the b1 polynomial corresponding to the basis function p(t − 2) should be
zero. This is an effect of the choice of ordering (4.11) of the weights of the regressors,
which implies that the sum of the weights for the b0 coefficient must be greater than the
sum for b1. Since b1 plays an important role in the system (4.13), the sum of the weights
for this coefficient must be large, which in turn forces the sum of the weights for the b0
coefficient to be large.
The example above indicates that the MNNG method might be useful for finding the order
of the system as well as the structural dependence, that is, which basis functions to use in
the model. It should be noted that the good results in the example are due to the facts that the
system lies in the model set and that the noise level is quite low compared to the
strength of the output signal.
Now, let us evaluate the proposed method on a more complex problem where we make
use of nonlinear basis functions.
Example 4.5: (LPV- ARX system)
As a second evaluation of the proposed modification of the NNG regressor selection
method, let the system be given by the LPV-ARX model

$$A(q, p)y = B(q, p)u + e, \qquad (4.14a)$$

where

$$\begin{aligned} A(q, p) = 1 &+ (0.24 + 0.1p)q^{-1} - (0.1\sqrt{-p} - 0.6)q^{-2} + 0.3\sin(p)q^{-3} \\ &+ (0.17 + 0.1p)q^{-4} + 0.3\cos(p)q^{-5} - 0.27q^{-6} \\ &+ (0.01p)q^{-7} - 0.07q^{-8} + 0.01\cos(p)q^{-9}, \end{aligned} \qquad (4.14b)$$
$$B(q, p) = 1 + (1.25 - p)q^{-1} - (0.2 + \sqrt{-p})q^{-2}. \qquad (4.14c)$$
Here, we assume that the scheduling variable p(t) is contained within the interval [−2π, 0]
and that e(t) is white Gaussian noise with zero mean and variance 0.1. Thus, the true
model order is given by na = 9, nb = 3 and nk = 0. The frozen poles of the system,
when letting p vary in the interval [−2π, 0] are given in the upper left corner of Figure 4.7.
For data collection, the system (4.14) is simulated using a white Gaussian noise input
u(t) with zero mean and unit variance and a uniformly distributed scheduling variable
p(t) in the interval [−2π, 0]. The data is then split into two separate sets, each of length
N = 5000, one for estimation and one for validation. In the initial least-squares estimate,
we let na = 9, nb = 3 and the basis functions be given by
$$\psi_{i1}(p) = p, \qquad \psi_{i2}(p) = \sqrt{-p}, \qquad \psi_{i3}(p) = \sin(p), \qquad \psi_{i4}(p) = \cos(p).$$
Thus, the system is in the model set. The result of running the NNG with the additional
constraints (4.11) is shown in Figure 4.7, where the model fit in the upper right corner
has been determined on validation data. The two columns in the last row of Figure 4.7
show which parameters θ in (4.10), corresponding to the coefficients in the A(p, q) and
the B(p, q) polynomials, are nonzero, respectively. The estimated coefficients for the
maximal model fit, as well as those suggested by MDL, contain far too many nonzero
parameter values to be presented here. Instead, let us consider the 145th breakpoint,
where the model fit starts to decrease significantly. The parameter values of the A(p, q)
polynomial are then given by
$$\begin{aligned} a_1(p) &= 0.2107 + 0.0807p, & a_2(p) &= 0.5501 - 0.0683\sqrt{-p}, \\ a_3(p) &= 0.0083\sqrt{-p} + 0.3045\sin p, & a_4(p) &= 0.1153 + 0.0837p, \\ a_5(p) &= 0.2955\cos p, & a_6(p) &= -0.2643, \\ a_7(p) &= 0.0026p - 0.0048\sqrt{-p}, & a_8(p) &= -0.0562, \end{aligned}$$
where the remaining coefficients are zero. This is close to the system (4.14), with the
exception of some small parameter values that should not be there, for example the
coefficient in the a3(p) polynomial corresponding to the basis function $\sqrt{-p}$, and the
lack of a9. The latter is due to the fact that a9 lies at the end of the ordering (4.11) and
is therefore penalized
Figure 4.7: The result of the MNNG for the LPV-ARX system given by (4.14). The
upper left corner depicts the frozen poles of the system (4.14) as the scheduling
variable p(t) varies over [−2π, 0]. The upper right corner shows the model fit for the
estimated models evaluated on validation data. The lower left corner illustrates
which coefficients of the A(p, q) polynomial are nonzero, and similarly for B(p, q)
in the lower right corner.
more than the others. The corresponding estimate of the B(p, q) polynomial is also quite
close to the system and is given by

$$\begin{aligned} b_0(p) &= 0.9873 - 0.0024\sin p, \\ b_1(p) &= 1.2309 - 1.0136p + 0.0001\sin p, \\ b_2(p) &= -0.3319 - 0.7094\sqrt{-p}. \end{aligned}$$

Here we notice that certain parameters are present that should not be there, for example
the parameter in the $b_0(p)$ polynomial corresponding to $\sin p$.
The two examples for the LPV-ARX case presented above indicate that the proposed
modification of the NNG may be used successfully when determining which basis functions
should be used to describe the data. The examples also suggest that the choice of
ordering (4.11) might not be optimal, since it does not recover the system even in such
simple examples. Finding a better ordering than (4.11) remains future work.
4.5 Discussion
In this chapter, a method for order and structural dependence selection of ARX and
LPV-ARX models was introduced. The method is a modified variant of the NNG method
introduced by Breiman (1995), where constraints on the weights are added according to the
natural ordering of the regressors in ARX models. Different examples were given, both
on data from simulations and from a real-life system, and the results look promising. The
choice of ordering in the LPV-ARX case is still an open question, and the effects of
different choices should be analyzed further.
The regularization parameter λ in (4.6) was used to penalize both the pole and the
zero polynomials, thus letting the ordering of the output lags and the input lags be independent. Another possibility is to make use of two penalizing parameters λy and λu ,
one for weights on the coefficients of the pole polynomial and one for the weights on the
coefficients of the zero polynomial. This idea would need to make use of the methods
in multi-parametric quadratic programming and although they are not as simple as the
one presented for a single parameter, efficient algorithms exist (see, for instance, Tøndel
et al., 2003). The proposed method is easily extended to the multivariable case. Here, one
could still use one regularization parameter λ and it would be interesting to observe the
behavior of the proposed method in this context.
The order and structure selection problem of LPV- ARX models is a special case of the
related NARX problem and the possibility of extending the ideas to the NARX case should
be considered in future work. Furthermore, it would also be interesting to include the
instrumental variables framework into the regression selection algorithms presented in
this chapter. This would enable automatic order selection of, for example, output error
models.
Finally, the classical NNG method as presented by Breiman (1995) has a close relationship with the adaptive lasso (see, for example, Zou, 2006), for which some desirable
statistical properties have been proved. The consequences for the MNNG approach proposed here are not clear and should be investigated.
Appendix
4.A Parametric Quadratic Programming
The nature of the PQP problem is quite similar to the ordinary QP problem, for which
the active-set method is a common choice (see, for instance, Nocedal and Wright, 2006).
Thus, it is only natural to incorporate these ideas into the PQP algorithms as much as
possible. In this section, we will derive the formulas for the PQP which is a special case
of the more general setting presented in Roll (2008). The resulting algorithm can also be
found in Gould (1991) and Stein et al. (2008). A different perspective of the solution to
the PQP problem is given in Romanko (2004). From here on, we will assume convexity,
that is, the matrix Q is nonsingular and positive definite.
The algorithm derived below is based on the fact that the solution to (4.6) is a continuous
piecewise affine function of λ with a finite number of breakpoints (see, for example,
Roll, 2008). Thus, a simple procedure to find the solution path w(λ) for all λ ≥ 0 is
to first find the initial solution w(0). Then one determines the direction in which the
solution changes as a function of λ and continues in this direction until one hits
a constraint. The solution is then updated and a new direction is found. The procedure
is repeated until λ reaches infinity. This simple idea is the foundation of the algorithm
derived below.
The Lagrangian (Boyd and Vandenberghe, 2004) associated with (4.6) is

$$L(w, \mu) = \frac{1}{2}w^T Q w + f^T w + \lambda\mathbf{1}^T w + \mu^T(Aw - b), \qquad (4.15)$$

where μ is a vector of Lagrangian multipliers. This yields the Karush-Kuhn-Tucker (KKT)
conditions (Boyd and Vandenberghe, 2004):

$$\begin{aligned} Qw + f + \lambda\mathbf{1} + A^T\mu &= 0, & \text{(4.16a)} \\ Aw - b &\preceq 0, & \text{(4.16b)} \\ \mu_i(A_i w - b_i) &= 0, \quad i \in \mathcal{I}, & \text{(4.16c)} \\ \mu \succeq 0, \quad \lambda &\geq 0, & \text{(4.16d)} \end{aligned}$$
where I is the index set of the inequality constraints. Now, let W ⊂ I denote the set of
active¹ constraints. Solving (4.16) is then equivalent to solving

$$\begin{bmatrix} Q & A_{\mathcal{W}}^T \\ A_{\mathcal{W}} & 0 \end{bmatrix} \begin{bmatrix} w \\ \mu_{\mathcal{W}} \end{bmatrix} = \begin{bmatrix} -f \\ b_{\mathcal{W}} \end{bmatrix} + \lambda\begin{bmatrix} -\mathbf{1} \\ 0 \end{bmatrix}. \qquad (4.17)$$

Until now, no consideration of the effect of the parameter λ has been taken. Differentiation with respect to λ yields

$$\begin{bmatrix} Q & A_{\mathcal{W}}^T \\ A_{\mathcal{W}} & 0 \end{bmatrix} \begin{bmatrix} \partial w/\partial\lambda \\ \partial\mu_{\mathcal{W}}/\partial\lambda \end{bmatrix} = \begin{bmatrix} -\mathbf{1} \\ 0 \end{bmatrix}. \qquad (4.18)$$
The question that remains to be answered is how to use (4.17) and (4.18) to obtain an
efficient algorithm. Recalling the initial discussion, the initial solution is found by
solving (4.17) for λ = 0, where the active set $\mathcal{W}$ is found, if not apparent, by a Phase I
linear program (Nocedal and Wright, 2006). The search direction is found via (4.18),
so the final detail is to find the step length δλ ≥ 0. There are, in principle, two different
ways of hitting a constraint. Firstly, the updated w should remain feasible, that is,

$$A_i\Big(w + \frac{\partial w}{\partial\lambda}\,\delta\lambda\Big) = b_i, \quad i \in \mathcal{I}, \qquad (4.19)$$

which is always the case for $i \in \mathcal{W}$. Thus, we should find the smallest δλ ≥ 0 satisfying (4.19) for some $i \notin \mathcal{W}$. Furthermore, since (4.19) can be rewritten as

$$A_i\frac{\partial w}{\partial\lambda}\,\delta\lambda = b_i - A_i w \geq 0,$$

it must additionally hold that $A_i\,\partial w/\partial\lambda > 0$. Secondly, there might be a constraint that
becomes inactive. This can only happen when there is an $i \in \mathcal{W}$ such that $\partial\mu_i/\partial\lambda < 0$. In
this case, one should find the smallest δλ ≥ 0 satisfying

$$\mu_i + \frac{\partial\mu_i}{\partial\lambda}\,\delta\lambda = 0, \qquad (4.20)$$

for some $i \in \mathcal{W}$. Finally, the step length is given by the smallest δλ ≥ 0 among the
candidates generated by (4.19) and (4.20). The results are gathered in Algorithm 3.
Algorithm 3 Parametric Quadratic Programming
Given: A convex parametric quadratic program (4.6).

1) Initialization: Let λ = 0 and find the solution to (4.6) via a Phase I linear program. Set $\mathcal{W} = \mathcal{W}_0$, $w = w_0$ and $\mu_{\mathcal{W}} = \mu_{\mathcal{W}_0}$, where $\mathcal{W}_0$, $w_0$ and $\mu_{\mathcal{W}_0}$ are the results from the linear program. Finally, let $S = \{(\lambda, w)\} = \{(0, w_0)\}$.

2) Directions: Solve (4.18) for $\partial w/\partial\lambda$ and $\partial\mu_{\mathcal{W}}/\partial\lambda$.

3) Step length: Find the minimal δλ ≥ 0 satisfying one of the following:
   i) If $A_i\big(w + \frac{\partial w}{\partial\lambda}\delta\lambda\big) = b_i$ and $A_i\,\partial w/\partial\lambda > 0$ for some $i \notin \mathcal{W}$, then $\mathcal{W} \leftarrow \mathcal{W} \cup \{i\}$.
   ii) If $\mu_i + \frac{\partial\mu_i}{\partial\lambda}\delta\lambda = 0$ and $\partial\mu_i/\partial\lambda < 0$ for some $i \in \mathcal{W}$, then $\mathcal{W} \leftarrow \mathcal{W} \setminus \{i\}$.
   If no feasible solution δλ ≥ 0 exists, set δλ ← ∞.

4) Update: Set λ ← λ + δλ, $w \leftarrow w + \frac{\partial w}{\partial\lambda}\delta\lambda$ and $\mu_{\mathcal{W}} \leftarrow \mu_{\mathcal{W}} + \frac{\partial\mu_{\mathcal{W}}}{\partial\lambda}\delta\lambda$. Add the new pair $S \leftarrow S \cup \{(\lambda, w)\}$.

5) Termination: Stop if λ = ∞, else continue from step 2.

¹The correct notion of $\mathcal{W}$ is the working set, which in general is a subset of the active set. The primary function of the working set, instead of using the full active set, is to avoid primal degeneracy in the implementation of the algorithm; see Nocedal and Wright (2006) for details.
It is worth noting that the index set $\mathcal{W}$ used in Algorithm 3 does not necessarily contain
all active constraints. This becomes clear when several constraints become active at once:
then only one constraint is added to $\mathcal{W}$, while the active set contains them all. This
technicality prevents the addition of linearly dependent constraints to the reduced KKT
system (4.17), so that it always has full rank.
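For the strictly convex case, Algorithm 3 can be implemented quite compactly. The Python sketch below is an illustration, not the thesis's implementation: it re-solves the reduced KKT systems (4.17)-(4.18) from scratch at every breakpoint, expects the caller to supply a feasible starting point in place of the Phase I linear program, and does not guard against the degenerate ties discussed above:

```python
import numpy as np

def pqp_path(Q, f, A, b, w0, W0=(), tol=1e-9):
    """Trace the piecewise-affine solution path of the parametric QP (4.6),
        minimize 0.5 w'Qw + f'w + lam 1'w  subject to  A w <= b,
    for lam >= 0, following Algorithm 3 with Q positive definite. The
    caller supplies the solution w0 and working set W0 at lam = 0.
    Returns the list of breakpoints [(lam_j, w_j), ...]."""
    n = Q.shape[0]
    one, W, lam = np.ones(n), list(W0), 0.0
    path = [(lam, np.asarray(w0, dtype=float).copy())]
    while True:
        idx = np.array(W, dtype=int)
        AW = A[idx].reshape(len(W), n)
        K = np.block([[Q, AW.T], [AW, np.zeros((len(W), len(W)))]])
        # Current point from (4.17) and search direction from (4.18)
        z = np.linalg.solve(K, np.concatenate([-f - lam * one, b[idx]]))
        d = np.linalg.solve(K, np.concatenate([-one, np.zeros(len(W))]))
        w, mu = z[:n], z[n:]
        dw, dmu = d[:n], d[n:]
        step, event = np.inf, None
        for i in range(A.shape[0]):          # (4.19): constraint activates
            if i not in W and A[i] @ dw > tol:
                s = (b[i] - A[i] @ w) / (A[i] @ dw)
                if s < step:
                    step, event = s, ('add', i)
        for k, i in enumerate(W):            # (4.20): multiplier hits zero
            if dmu[k] < -tol:
                s = -mu[k] / dmu[k]
                if s < step:
                    step, event = s, ('drop', i)
        if event is None:                    # affine for all remaining lam
            return path
        step = max(step, 0.0)
        lam += step
        path.append((lam, w + step * dw))
        if event[0] == 'add':
            W.append(event[1])
        else:
            W.remove(event[1])
```

The helper assumes the working set stays linearly independent; ties, where several constraints activate at the same λ, would need the safeguards discussed above.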
For the particular choice of ordering (4.7), the initial solution to (4.6) is easy to find. Let λ = 0, then (4.6) is an ordinary least-squares problem with inequality constraints and the obvious solution is given by w = 1. To find µW we need to solve (4.16). Since

Qw + f = 2(Θ̂_N^LS)ᵀΦᵀΦ Θ̂_N^LS 1 − 2(Θ̂_N^LS)ᵀΦᵀY = 2(Θ̂_N^LS)ᵀΦᵀ(Φ Θ̂_N^LS 1 − Y) = 2(Θ̂_N^LS)ᵀΦᵀ(Φ θ̂_N^LS − Y) = 0,    (4.21)

where the last equality follows from the fact that the regression matrix Φ is orthogonal to the residuals Φ θ̂_N^LS − Y, one finds that µW = 0.
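The orthogonality argument behind (4.21) is easy to check numerically; the following is a small numpy sketch (all data are synthetic stand-ins, not from the thesis):

```python
import numpy as np

# synthetic regression problem; Phi and Y are illustrative stand-ins
rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 3))   # regression matrix
Y = rng.standard_normal(50)          # observations

# ordinary least-squares estimate and its residuals
theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
residual = Phi @ theta_ls - Y

# the normal equations force the regressors to be orthogonal to the
# residuals, which is exactly why Qw + f vanishes at w = 1 in (4.21)
print(np.allclose(Phi.T @ residual, 0))  # True
```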
5 Utilizing Structure Information in Subspace Identification
The prediction-error approach to parameter estimation of linear time-invariant models often involves solving a non-convex optimization problem (as discussed in Section 1.1). It may therefore be difficult to guarantee that the global optimum will be found. A common way to handle this problem is to find an initial estimate, hopefully lying in the region of attraction of the global optimum, using some other method. The prediction-error estimate can then be obtained by a local search starting at the initial estimate. In this chapter, a new approach for finding an initial estimate of linear polynomial models is presented, which utilizes structure information and the subspace method. The polynomial models are first written in the observer canonical state-space form, whose specific structure is then exploited, rendering least-squares estimation problems with linear equality constraints.
5.1 Introduction
The estimation of linear time-invariant polynomial models (3.6), given a sequence of input and output pairs (3.1), is a classical problem in system identification. Several methods exist, of which the PEM (Section 3.1) and the IV method (Section 3.3) are common choices. In particular, the PEM has been shown to have desirable statistical properties and has also proved to work well in practice. One drawback with the PEM is that the procedure typically involves solving a non-convex optimization problem (see Example 1.2). It is therefore important to have a good initial estimate of the parameters, from which the PEM estimate can be found by a local search. To this end, several different methods have been developed which, in one way or another, utilize the structure of the problem. For instance, the system identification toolbox (Ljung, 2009) in MATLAB uses a variant of the multi-step IV algorithm presented in Section 3.3 as the initialization method for the PEM for a wide variety of linear model structures.
In the last three decades, the subspace identification (SID) methods (see Section 3.4) have been developed and become an important set of tools for estimating state-space models. These methods have been proved to yield reliable estimates in a numerically stable and efficient manner. In this chapter, we are going to consider how to incorporate certain problem-specific structure information into the SID methods. Consider, for example, using a SID method as an initialization method when estimating the linear transfer-function model structures described in Section 3.2.1.
Example 5.1
Assume that one would like to find an OE model (3.11), given a data set (3.1), with two parameters in the numerator and three in the denominator

y(t) = (b1 q⁻¹ + b2 q⁻²) / (1 + a1 q⁻¹ + a2 q⁻² + a3 q⁻³) u(t) + e(t).    (5.1)

Writing the above model in the observer canonical form (OCF) (3.19) yields

x(t + 1) = [−a1 1 0; −a2 0 1; −a3 0 0] x(t) + [b1; b2; 0] u(t),    (5.2a)
y(t) = [1 0 0] x(t) + e(t).    (5.2b)
Using a SID method will lead to a third-order state-space model in some unknown basis. This implies that, after converting the resulting state-space model to the OCF, the third element of the B(θ) matrix will in general not be zero. Thus, to use a SID method to find an initial estimate for the PEM, it would be desirable to incorporate the knowledge of the structure of the B(θ) matrix into the SID scheme.
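The structure exploited here can be sketched in a few lines of numpy (the thesis itself works in MATLAB notation; the coefficient values below are purely illustrative):

```python
import numpy as np

# coefficients of an OE model of the form (5.1): two numerator,
# three denominator parameters (illustrative values)
a = np.array([0.5, -0.2, 0.1])   # a1, a2, a3
b = np.array([1.0, 0.3])         # b1, b2

na = len(a)
# observer canonical form (5.2): -a in the first column, shifted identity beside it
A = np.column_stack([-a, np.eye(na, na - 1)])
B = np.concatenate([b, np.zeros(na - len(b))])  # last element is structurally zero
C = np.eye(1, na)

# the characteristic polynomial of A has coefficients [1, a1, a2, a3],
# so the poles of the model are recovered directly from A
print(np.allclose(np.poly(A), np.concatenate([[1.0], a])))  # True
```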
Motivated by the example above, the goal of this chapter is to explore the possibilities of incorporating structure information into the SID scheme. A theoretical analysis of the use of SID methods to estimate (unstructured) ARMAX models is given in Bauer (2005, 2009). Even though the main focus of this chapter is on the discrete-time problem formulation, the continuous-time case follows readily if one substitutes the discrete-time SID method for a continuous-time equivalent, such as the ones presented in, for instance, McKelvey and Akçay (1994) and Van Overschee and De Moor (1996b).
The structure of this chapter is as follows: First, we consider the estimation of OE models, both in discrete time and continuous time. Then follows the estimation of discrete-time ARMAX models, for which the OE estimation is an intermediate step. Finally, the ideas are further developed to handle certain gray-box model structures.
5.2 OE Models
In this section and the following, the SID algorithm, described in Section 3.4, will be
modified to handle the identification of general OE and ARMAX models. The method for
OE models is described first, since it constitutes an intermediate step when estimating an
ARMAX model. To avoid notational difficulties, only SISO systems are initially discussed.
5.2.1 Discrete Time
To illustrate the basic idea of using a SID method for estimating a discrete-time transfer-function OE model (3.11), let us return to the example in the introduction.
Example 5.2: (Example 5.1 revisited)
Assume that one wants to estimate an OE model (5.1) or its equivalent (5.2) using the SID method presented in Section 3.4.1. After the first three steps in Algorithm 2, estimates Â and Ĉ of the A(θ) and C(θ) matrices, respectively, have been derived in some unknown basis. If these estimates could be transformed into the OCF (5.2), one would know that the last element of B(θ) should be zero. This can be achieved in several ways, and probably the simplest way is to remove the column in the regression matrix, corresponding to the parameter that should be zero, before solving the least-squares problem (3.53). In this way, the remaining parameters are estimated as though this parameter truly was zero. A different approach, yielding the same result, is to add a linear equality constraint to the least-squares problem (3.53) and solve it with, for instance, the method presented in Section 2.1. This method will be illustrated below since it is more general and remains valid if the known parameter is nonzero.
Thus, the question is how to transform the estimates Â and Ĉ to the OCF without knowledge of the remaining matrices in the state-space equation. This is simply done by determining the characteristic equation of Â, which can be done with the following MATLAB pseudo-code:

â = poly(Â),
Â_OCF = [−â(2 : na + 1)ᵀ, eye(na, na − 1)],    (5.3)

and Ĉ_OCF = eye(1, na), where na is the degree of the pole polynomial.
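An equivalent of the pseudo-code (5.3) in numpy might read as follows (the function and variable names are illustrative):

```python
import numpy as np

def to_ocf(A_hat):
    """Transform an estimated state matrix to observer canonical form, as in (5.3).

    Only the characteristic polynomial of A_hat is needed, so no explicit
    similarity transform has to be formed.
    """
    na = A_hat.shape[0]
    a_hat = np.poly(A_hat)   # coefficients [1, a1, ..., ana]
    A_ocf = np.column_stack([-a_hat[1:], np.eye(na, na - 1)])
    C_ocf = np.eye(1, na)
    return A_ocf, C_ocf

# sanity check: the OCF has the same characteristic polynomial (and hence
# the same eigenvalues) as the original estimate
A_hat = np.array([[0.9, 0.2], [-0.1, 0.5]])
A_ocf, C_ocf = to_ocf(A_hat)
print(np.allclose(np.poly(A_ocf), np.poly(A_hat)))  # True
```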
The next step is to find an estimate of the B(θ) matrix via the least-squares problem (3.53). Since the Â and Ĉ matrices now are in the OCF (5.2), we know that the third element of the B(θ) matrix is zero. Thus, we add the linear equality constraint

[0 0 1 0; 0 0 0 1] [B; D] = [0; 0],

when solving the least-squares problem (3.53), and a solution is attained using the method presented in Section 2.3 or with lsqlin in the MATLAB optimization toolbox. Note that the same result is achieved by removing the columns of the regression matrix corresponding to the parameters that should be zero before solving (3.53).
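For zero-valued constraints, the equality-constrained solve and the column-removal trick give the same estimate; a minimal numpy sketch of the column-removal variant (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.standard_normal((100, 4))          # regressors for the stacked [B; D]
theta_true = np.array([1.0, 0.3, 0.0, 0.0])  # last two parameters known to be zero
y = Phi @ theta_true + 0.01 * rng.standard_normal(100)

# enforce the zero constraints by dropping the corresponding columns
free = [0, 1]
theta = np.zeros(4)
theta[free], *_ = np.linalg.lstsq(Phi[:, free], y, rcond=None)

print(np.round(theta, 2))   # approximately [1.0, 0.3, 0.0, 0.0]
```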
It is known that the transformation of a state-space model to these kinds of canonical forms is not numerically sound, since the similarity transform may be badly conditioned. Here, we do not need to explicitly form the transformation matrix and its inverse, since the structure of the problem is so simple. Furthermore, even though the problem of finding the eigenvalues of a matrix is numerically ill-conditioned, using the eigenvalues to find the coefficients of the characteristic equation is within the round-off error of the matrix in question. Thus, the method used above to transform the estimated matrices to the OCF is numerically stable. This does not, of course, change the fact that the OCF is extremely sensitive to perturbations in the parameter values. For example, a small perturbation of the last coefficient ana may yield a large change in the eigenvalues.
The method used in the example above is easily seen to be valid whenever nb ≤ na ,
where the only difference is the number of known zeros in the B(θ) and D(θ) matrices in
the final step. But what if one needs a model with nb > na ?
Example 5.3: (OE model with nb > na)
Consider a discrete-time OE model with nb = 5, na = 2 and nk = 3:

y(t) = (b1 q⁻³ + b2 q⁻⁴ + b3 q⁻⁵ + b4 q⁻⁶ + b5 q⁻⁷) / (1 + a1 q⁻¹ + a2 q⁻²) u(t) + e(t).

Rewriting the above equation in the OCF (3.19) with delayed input yields

x(t + 1) = [−a1 1 0 0 0; −a2 0 1 0 0; 0 0 0 1 0; 0 0 0 0 1; 0 0 0 0 0] x(t) + [b1; b2; b3; b4; b5] u(t − 2),
y(t) = [1 0 0 0 0] x(t) + e(t).
This means that if one wants to estimate this model, given data, using a SID method,
then one needs to estimate a fifth order state-space model. Furthermore, one needs to
be able to constrain some of the elements of A(θ) to be zero, that is, the coefficients of
the characteristic polynomial. To the author’s knowledge, no existing subspace method
can handle such constraints and the estimates of the coefficients âi in the characteristic
equation of A(θ) will in general all be nonzero.
A possible way to solve this problem would be to estimate the A(θ) matrix, transform
it to OCF and just truncate the characteristic polynomial by setting unwanted coefficients
to zero. This might work well in some cases, but is probably a bad idea in the case of
undermodeling.
Here, we are going to propose a different solution by introducing virtual inputs. Let us rewrite the original equation by splitting the rational expression into two separate terms

ŷ(t|t − 1, θ) = (b1 + b2 q⁻¹ + b3 q⁻² + b4 q⁻³ + b5 q⁻⁴) / (1 + a1 q⁻¹ + a2 q⁻²) q⁻³ u(t)
             = (b1 + b2 q⁻¹ + b3 q⁻²) / (1 + a1 q⁻¹ + a2 q⁻²) q⁻³ u(t) + (b4 + b5 q⁻¹) / (1 + a1 q⁻¹ + a2 q⁻²) q⁻⁶ u(t).

Now, polynomial division of the rational expressions yields

ŷ(t|t − 1, θ) = ( b1 + ((b2 − b1 a1) q⁻¹ + (b3 − b1 a2) q⁻²) / (1 + a1 q⁻¹ + a2 q⁻²) ) u(t − 3)
             + ( b4 + ((b5 − b4 a1) q⁻¹ + (−b4 a2) q⁻²) / (1 + a1 q⁻¹ + a2 q⁻²) ) u(t − 6),

which can be written as

x(t + 1) = [−a1 1; −a2 0] x(t) + [b2 − b1 a1, b5 − b4 a1; b3 − b1 a2, −b4 a2] [u1(t); u2(t)],
y(t) = [1 0] x(t) + [b1 b4] [u1(t); u2(t)] + e(t),    (5.4)
where u1(t) ≜ u(t − 3) and u2(t) ≜ u(t − 6). Thus, the original model can now be estimated as a second-order state-space model with two input signals instead of one. The constraints on the characteristic polynomial have been implicitly taken care of.
It is worth noting that other choices of virtual inputs are possible; for example, one could have chosen u1(t) ≜ u(t − 2), u2(t) ≜ u(t − 4) and u3(t) ≜ u(t − 6). In this way, the polynomial divisions performed above would not be necessary and the D(θ) matrix in the state-space model (5.4) would become zero. The latter choice of virtual inputs is probably preferable when nk > 0, but the former choice, resulting in (5.4), is also valid when nk = 0.
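The polynomial division above is easily verified numerically; the following numpy sketch checks the identity b1·(1 + a1 q⁻¹ + a2 q⁻²) + (b2 − b1 a1) q⁻¹ + (b3 − b1 a2) q⁻² = b1 + b2 q⁻¹ + b3 q⁻² for illustrative coefficient values:

```python
import numpy as np

a1, a2 = 0.7, -0.2
b1, b2, b3 = 1.0, 0.5, 0.3

den = np.array([1.0, a1, a2])                      # 1 + a1 q^-1 + a2 q^-2
num = np.array([b1, b2, b3])                       # b1 + b2 q^-1 + b3 q^-2
rem = np.array([0.0, b2 - b1 * a1, b3 - b1 * a2])  # remainder after dividing out b1

# b1 * denominator + remainder should reproduce the original numerator
reconstructed = b1 * den + rem
print(np.allclose(reconstructed, num))  # True
```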
Hence, after this initial transformation, we can proceed as before. First, the estimates Â and Ĉ are found and transformed into the OCF by the same method as before. Now, since the estimate of a2 is known, it is clear that the last element of B(θ) in (5.4) is related to the last element of D(θ). This relation may be enforced by adding the constraint

[0 0 0 1 0 −â2] [vec B; vec D] = 0,

where the vec-operator stacks the columns of the argument in a column vector, to the least-squares problem (3.53).
From the simple example above, one can see how the general algorithm works. In fact, consider a general OE model

y(t) = (b1 q^{−nk} + · · · + b_{nb} q^{−nk−nb+1}) / (1 + a1 q⁻¹ + · · · + a_{na} q^{−na}) u(t) + e(t).

Then the number of virtual inputs is nu = ⌈nb/(na + 1)⌉ and the total number of parameters in B(θ) and D(θ) is np = nu na + nu, where nl = np − nb parameters are fixed by a linear constraint. The predictor can be rewritten as

ŷ(t|t − 1, θ) ≜ (b1 q^{−nk} + · · · + b_{nb} q^{−nk−nb+1}) / (1 + a1 q⁻¹ + · · · + a_{na} q^{−na}) u(t) = Σ_{k=1}^{nu} Gk(q) uk(t),    (5.5a)

where the virtual inputs uk(t) ≜ u(t − nk − (k − 1)(na + 1)) have been introduced. With

mk ≜ (k − 1)(na + 1),

the transfer functions are given by

Gk(q) = b_{mk+1} + Pk(q) / (1 + a1 q⁻¹ + · · · + a_{na} q^{−na}),    k = 1, . . . , nu,    (5.5b)

where

Pk(q) = Σ_{i=1}^{na} (b_{mk+i+1} − ai b_{mk+1}) q^{−i},    k = 1, . . . , nu − 1,    (5.5c)

Pnu(q) = Σ_{i=1}^{na−nl} (b_{mnu+i+1} − ai b_{mnu+1}) q^{−i} + Σ_{i=na−nl+1}^{na} (−ai b_{mnu+1}) q^{−i}.    (5.5d)
The first sum in (5.5d) is left out if na = nl. Writing the above transfer-function form into the OCF (3.19), the columns of the B(θ) matrix are given by

Bk(θ) = [b_{mk+2} − a1 b_{mk+1};  b_{mk+3} − a2 b_{mk+1};  . . . ;  b_{mk+na+1} − a_{na} b_{mk+1}],    (5.6a)

for columns k = 1, . . . , nu − 1, and the last column can be written as

Bnu(θ) = [b_{mnu+2} − a1 b_{mnu+1};  . . . ;  b_{mnu+na−nl+1} − a_{na−nl} b_{mnu+1};  −a_{na−nl+1} b_{mnu+1};  . . . ;  −a_{na} b_{mnu+1}].    (5.6b)

Finally, one gets

D(θ) = [b_{m1+1}  . . .  b_{mnu+1}].    (5.7)

Assume now that we have estimates Â and Ĉ, which have been transformed into the OCF (3.19), that is,

Â = [−â  I_{na×(na−1)}],    Ĉ = I_{1×na},    (5.8)

where â is the vector containing the estimated coefficients of the pole polynomial.
The linear equality constraints when estimating B(θ) (5.6) and D(θ) (5.7) in the least-squares problem (3.53) can be expressed as

[0_{nl×(nu na−nl)}  I_{nl×nl}  0_{nl×(nu−1)}  −âl] [vec B(θ); vec D(θ)] = 0_{nl×1},    (5.9)

where 0_{m×n} denotes the zero matrix of dimensions m × n and

âl ≜ [â_{na−nl+1}  . . .  â_{na}]ᵀ.
It is worth noting that there is some freedom in choosing the virtual inputs and the particular choice above is just one way of dealing with the case when nk might be zero.
Now, let us return to the previous example and see how the general formulas work.

Example 5.4: (Example 5.3 revisited)
Consider once again the discrete-time OE model with nb = 5, na = 2 and nk = 3:

y(t) = (b1 q⁻³ + b2 q⁻⁴ + b3 q⁻⁵ + b4 q⁻⁶ + b5 q⁻⁷) / (1 + a1 q⁻¹ + a2 q⁻²) u(t) + e(t).

Now, let us use the general formulas (5.5) and (5.9) to find the linear equality constraints. It holds that nu = ⌈nb/(na + 1)⌉ = 2, and we should introduce the virtual inputs

uk(t) ≜ u(t − nk − (k − 1)(na + 1)),    k = 1, . . . , nu,

that is, u1(t) = u(t − 3) and u2(t) = u(t − 6). The total number of parameters in B(θ) and D(θ) is given by np = nu na + nu = 6 and the number of linear constraints is nl = np − nb = 1. Now, applying the problem-specific parameters {na, nb, nk, nl} to the general predictor (5.5) yields

ŷ(t|t − 1, θ) = ( b1 + ((b2 − b1 a1) q⁻¹ + (b3 − b1 a2) q⁻²) / (1 + a1 q⁻¹ + a2 q⁻²) ) u1(t)
             + ( b4 + ((b5 − b4 a1) q⁻¹ + (−b4 a2) q⁻²) / (1 + a1 q⁻¹ + a2 q⁻²) ) u2(t),

which coincides with the result in Example 5.3 and may be rewritten as (5.4). The linear constraints (5.9) become

[0 0 0 1 0 −â2] [vec B(θ); vec D(θ)] = 0,

which equals the choice made in Example 5.3. Thus, (5.5) and (5.9) are direct generalizations of the choices made in Example 5.3.
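The bookkeeping of (5.5) and the constraint pattern of (5.9) are mechanical and can be sketched in numpy (the helper names are illustrative, not from the thesis):

```python
import math
import numpy as np

def virtual_input_setup(na, nb, nk):
    """Virtual-input bookkeeping of Section 5.2: number of virtual inputs,
    total number of parameters in B and D, and number of linear constraints."""
    nu = math.ceil(nb / (na + 1))
    n_params = nu * na + nu
    nl = n_params - nb
    return nu, n_params, nl

def constraint_matrix(na, nb, nk, a_hat):
    """Assemble the left-hand matrix of (5.9) for given pole estimates a_hat."""
    nu, n_params, nl = virtual_input_setup(na, nb, nk)
    M = np.zeros((nl, n_params))
    # identity block picking the last nl elements of vec B
    M[:, nu * na - nl : nu * na] = np.eye(nl)
    # the -a_l block acting on the last element of vec D
    M[:, -1] = -a_hat[na - nl : na]
    return M

# Example 5.4: na = 2, nb = 5, nk = 3 gives nu = 2, np = 6, nl = 1
print(virtual_input_setup(2, 5, 3))   # (2, 6, 1)
# with pole estimates [0.7, -0.2], the single constraint row is [0 0 0 1 0 -â2]
print(constraint_matrix(2, 5, 3, np.array([0.7, -0.2])))
```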
The ideas presented so far for the initialization of a discrete-time OE model are summarized in Algorithm 4.
Algorithm 4 OE estimation using SID.
Input: A data set (3.1) and the model parameters {na, nb, nk}.
1. Introduce the virtual inputs as described in (5.5a).
2. Find estimates Â and Ĉ of the A(θ) and C(θ) matrices of order na via Steps 1-3 in Algorithm 2.
3. Transform the estimates Â and Ĉ to the OCF (3.19) using (5.3).
4. Estimate the B(θ) and D(θ) matrices via (3.53) with the equality constraints (5.9).
5. Identify the polynomials Fp(q) and Bp(q) in (3.11) with the elements of the estimated matrices Â, B̂ and D̂, respectively, using (3.19) and (5.6)-(5.7).
To evaluate the algorithm, let us consider a Monte Carlo study of the estimation of a
discrete-time OE model.
Example 5.5
Let the true system be given by (3.11) with

Bp(q) = q⁻² − 0.4834 q⁻³ − 0.2839 q⁻⁴ − 0.02976 q⁻⁵,
Fp(q) = 1 + 0.7005 q⁻¹ − 0.2072 q⁻²,    (5.10)

and let e(t) be white Gaussian noise with unit variance and zero mean. For the Monte Carlo study, output data y(t) have been generated with M = 1000 realizations of a zero mean white Gaussian process, with unit variance and length 1000, as input u(t). The noise also has a different realization for each Monte Carlo run. The result is given in Table 5.1. The results using the proposed subspace method are denoted by SID, and IV denotes the results using the standard initialization method in the system identification toolbox (Ljung, 2009) in MATLAB, that is, the IV estimate is found by

oe(data, [nb, nf, nk], 'maxiter', 0).

The first two columns (BIAS) give a measure of the bias of the parameter estimates

Bias(θ̂) ≜ (1/M) Σ_{i=1}^{M} (θ0 − θ̂i).    (5.11)

The last two columns (VAR) present estimates of the variances of the parameter estimates

Var(θ̂) ≜ (1/(M − d)) Σ_{i=1}^{M} (θ̂i − θ̄̂)²,    (5.12)

where d is the number of parameters and θ̄̂ is the mean value of the parameter estimates.
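Given an M × d array of Monte Carlo estimates, (5.11) and (5.12) are one-liners; a numpy sketch with synthetic estimates (the true values below are those of Fp(q) in (5.10); the estimates are simulated, not produced by the actual algorithms):

```python
import numpy as np

rng = np.random.default_rng(2)
M, d = 1000, 2
theta0 = np.array([0.7005, -0.2072])   # true parameter values of Fp(q)
# stand-in for M Monte Carlo estimates (normally produced by the estimator)
theta_hat = theta0 + 0.05 * rng.standard_normal((M, d))

bias = np.mean(theta0 - theta_hat, axis=0)                                 # (5.11)
var = np.sum((theta_hat - theta_hat.mean(axis=0)) ** 2, axis=0) / (M - d)  # (5.12)

print(bias.shape, var.shape)   # (2,) (2,)
```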
For this example, the SID method yields lower variance when estimating the numerator
polynomial Bp (q) than the traditional IV estimator. One notices that some outliers were
present in the IV estimates of the Bp (q) polynomial which is the cause of the high variance
values. On the other hand, the IV estimator yields lower bias than the proposed SID
method for some of the parameters.
Table 5.1: Monte Carlo analysis of the BIAS estimate (5.11) and the VAR estimate (5.12) when estimating the parameters in the model (5.10).

         BIAS                  VAR
         SID       IV          SID       IV
b1       0.0026    0.0233      0.0013    0.2223
b2       0.0667    0.1660      0.2266    176.9809
b3       0.1055    0.0684      0.0648    15.3032
b4       0.0862    0.0810      0.0739    29.9095
a1       0.0642    0.0491      0.2275    0.2035
a2       0.0042    0.0455      0.1808    0.1777
Empirical studies by the author have shown that the proposed method works quite well
in general and, more often than not, yields a higher model fit (3.82) on validation data
than the IV method. Occasionally, the method breaks down, but in these cases it is the
underlying SID method that fails to estimate the matrix A(θ) properly.
Even though only single-input single-output notation was used throughout the section,
it is clear that the method also works for multiple-input single-output (MISO) systems.
This is due to the use of the OCF which may be used for representing MISO systems.
Since multiple-input multiple-output (MIMO) systems can be seen to consist of ny MISO
systems, the above procedure may also be used to get an initial estimate of MIMO systems.
5.2.2 Continuous Time
Finally, let us evaluate the proposed method when estimating a continuous-time OE model from discrete-time data. To this end, a frequency-domain SID method is needed, for instance, the method presented in Section 3.4.2. This algorithm differs only slightly from the time-domain methods, and the order in which one estimates the system matrices is the same. Thus, the method developed in the discrete-time case applies readily.
Example 5.6
Let the true system be given by the Rao and Garnier (2002) test system

y(t) = (−6400 p + 1600) / (p⁴ + 5 p³ + 408 p² + 416 p + 1600) u(t),

where p is the (ordinary) differential operator. This system is non-minimum phase, it has one fast oscillatory mode with relative damping 0.1 and one slow mode with relative damping 0.25. The system can be written on the OCF as

ẋ(t) = [−5 1 0 0; −408 0 1 0; −416 0 0 1; −1600 0 0 0] x(t) + [0; 0; −6400; 1600] u(t),
y(t) = [1 0 0 0] x(t) + e(t).
For data collection, the system is sampled with a sampling time of Ts = 10 ms and simulated with a random binary input of length N = 1023, and white Gaussian noise of unit variance is added to the output, which gives a signal-to-noise ratio of 7.35 dB. The data is then Fourier transformed into the frequency domain and pre-filtered before estimation, where only the data lying in the frequency window [0, 30] rad/s are kept, resulting in a data set of length M = 512. To get a continuous-time model, the sampling time is assumed to be zero. This will introduce some bias in the estimates, but since the sampling rate is rather high this effect will be negligible.
The standard initial estimator in MATLAB, which will be denoted by IV, is determined via the command

oe(data, [2, 4], 'focus', [0, 30], 'maxiter', 0)
Now, the SID estimates of the A(θ) and C(θ) matrices can be determined by

model = n4sid(data, 4, 'maxiter', 0, 'focus', [0, 30])

and transformed into the OCF via

a = poly(model.A)
A = [-a(2 : end).', eye(na, na - 1)]
C = eye(1, na)

Now, the regressor matrix for estimating B(θ) is created by

H = data.Y ./ data.U
y = [real(H); imag(H)]
for k = 1 : M
    Phi(k, :) = C * inv(1i * data.Frequency(k) * eye(na) - A)
end
Phi = [real(Phi); imag(Phi)]

where we have scaled (3.75) with U(iω). Since the first two elements of B(θ) should be zero, the estimate of B(θ) is found by

B = [0; 0; Phi(:, [3, 4])\y]
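The frequency-domain regression above can be mimicked in numpy; the sketch below uses a noise-free frequency response of the Rao-Garnier system instead of measured data, so the constrained estimate recovers B(θ) exactly (grid and names are illustrative):

```python
import numpy as np

# continuous-time OCF matrices of the Rao-Garnier test system
A = np.array([[-5.0, 1, 0, 0],
              [-408.0, 0, 1, 0],
              [-416.0, 0, 0, 1],
              [-1600.0, 0, 0, 0]])
B_true = np.array([0.0, 0.0, -6400.0, 1600.0])
C = np.eye(1, 4)

na = 4
w = np.linspace(0.5, 30.0, 128)   # frequency grid in rad/s
# regressor rows C (iw I - A)^-1 and the noise-free response H(iw)
Phi = np.array([(C @ np.linalg.inv(1j * wk * np.eye(na) - A)).ravel() for wk in w])
H = Phi @ B_true

# stack real and imaginary parts; the first two elements of B are known
# to be zero, so only the last two columns enter the least-squares solve
Phi_ri = np.vstack([Phi.real, Phi.imag])
H_ri = np.concatenate([H.real, H.imag])
b_free, *_ = np.linalg.lstsq(Phi_ri[:, 2:], H_ri, rcond=None)
B_hat = np.concatenate([[0.0, 0.0], b_free])

print(np.allclose(B_hat, B_true))  # True
```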
The validation of the methods is given in Figure 5.1, where the amplitude and phase are illustrated for the different estimates. Also, a cross-validation simulation is given.

Figure 5.1: Top: The amplitude (left) and the phase (right) plots of the true system and the estimates given by the SID and the IV methods. Bottom: A simulation on validation data of the true system and the estimated models (zoomed) with the corresponding model fit values (3.82) in parentheses: True (34.2%), IV (3.5%), SID (32.8%).

One
notices that the SID method captures the two modes correctly, but the static gain estimate
is biased. The IV estimate does not perform as well and it has some difficulties estimating
the phase.
Further information on different continuous-time identification methods using sampled
data can be found in Garnier and Wang (2008) and the references therein. It is worth
mentioning that, even though the IV (using oe with maxiter equal to zero) seems to
perform poorly compared to the SID method in the example above, the estimate is good
enough to be used as an initial starting value for the PEM, which converges to the global
optimum in just a few iterations.
5.3 ARMAX Models
To find an initial estimate of the ARMAX model (3.10), the only thing that remains, after using the proposed procedure for the OE model identification, is to estimate the Kalman gain K(θ) in (3.15), which has the form (3.19):

K(θ) = [c1 − a1;  . . . ;  cnc − anc;  −a_{nc+1};  . . . ;  −a_{na}].    (5.13)
Here, we must have nc ≤ na, which limits the flexibility of the noise model. Now, the estimation of K(θ) requires an estimate of the process and measurement noises. To this end, the state matrix Xβ,N in (3.40) is reconstructed as outlined in Section 3.4.1. Once the estimates of the system matrices and the state sequence have been found, the process noise and the measurement noise can be approximated by the residuals in (3.54), that is,

[Ŵ; V̂] = [X̂_{β+1,N}; Y_{β,1,N−1}] − [Â B̂; Ĉ D̂] [X̂_{β,N−1}; U_{β,1,N−1}].    (5.14)

As mentioned in Section 3.4.1, the common way to estimate K(θ) is to form the covariance matrices of (5.14) and solve the corresponding Riccati equation. In this presentation, we are going to make use of the fact that our estimated model is on the OCF. From (3.15) and (3.18) it holds that

K(θ) V̂ = Ŵ,    (5.15)

where K(θ) has the structure (5.13). Since, at this point, estimates âi of the elements ai in (5.13) already have been determined, the least-squares estimate of K(θ) can be found via (5.15) with the equality constraints

[0_{(na−nc)×nc}  I_{(na−nc)×(na−nc)}] K(θ) = −â_{nc+1:na},    (5.16)

where

â_{nc+1:na} = [â_{nc+1}  . . .  â_{na}]ᵀ.
This yields in turn an estimate of the Cp(q) polynomial in (3.10), as shown in (3.19). The proposed procedure for the estimation of ARMAX models is summarized in Algorithm 5.

Algorithm 5 ARMAX estimation using SID.
Given: A data set (3.1) and the model parameters {na, nb, nc, nk}. Limitation: nc ≤ na.
1. Find estimates of the system matrices Â, B̂, Ĉ and D̂ by Algorithm 4.
2. Reconstruct the state matrix Xβ,N, see Section 3.4.1.
3. Find estimates Ŵ and V̂ of the residuals from (5.14).
4. Estimate K(θ) by solving (5.15) in a least-squares sense with the equality constraints (5.16).
5. Transform the estimated state-space model to the transfer-function form (3.10).
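Step 4 of Algorithm 5 is an ordinary equality-constrained least-squares solve; a numpy sketch with synthetic residual sequences standing in for Ŵ and V̂ (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
na, nc = 3, 1
a_hat = np.array([0.6, -0.1, 0.05])   # estimated pole coefficients

# synthetic residual sequences standing in for What (na x N) and Vhat (1 x N)
N = 500
V = rng.standard_normal((1, N))
K_true = np.array([[0.4], [-a_hat[1]], [-a_hat[2]]])
W = K_true @ V + 0.01 * rng.standard_normal((na, N))

# (5.15): K V = W. Rows nc+1..na of K are fixed to -a_hat by (5.16),
# so only the first nc rows need to be estimated.
K = np.zeros((na, 1))
K[nc:, 0] = -a_hat[nc:]
for r in range(nc):
    K[r, 0] = np.linalg.lstsq(V.T, W[r], rcond=None)[0][0]

print(np.round(K.ravel(), 2))   # approximately [0.4, 0.1, -0.05]
```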
Now, let us consider a simple example.

Example 5.7
Assume that one wants to estimate an ARMAX model (3.10)

y(t) = (b1 q⁻¹ + b2 q⁻²) / (1 + a1 q⁻¹ + a2 q⁻²) u(t) + (1 + c1 q⁻¹) / (1 + a1 q⁻¹ + a2 q⁻²) e(t),

which can be rewritten as

x(t + 1) = [−a1 1; −a2 0] x(t) + [b1; b2] u(t) + [c1 − a1; −a2] e(t),
y(t) = [1 0] x(t) + e(t).

The use of Algorithm 5 is straightforward and the equality constraint in step 4 is given by

[0 1] K(θ) = −â2,

where â2 is found in the first step of the algorithm.
Thus, the estimation procedure for the ARMAX model structure is only a minor extension of the discrete-time OE identification algorithm given in Section 5.2. Now, let us compare the results using Algorithm 5 and the IV based estimator given by

armax(data, [na, nb, nc, nk], 'maxiter', 0)

on some simulated data.
Example 5.8
Let the true system be given by (3.10) with

Ap(q) = 1 − 0.06353 q⁻¹ + 0.006253 q⁻² + 0.0002485 q⁻³,
Bp(q) = 1 − 0.8744 q⁻¹ − 0.3486 q⁻² + 0.331 q⁻³,
Cp(q) = 1 + 0.3642 q⁻¹,

and let e(t) be white Gaussian noise with unit variance and zero mean. For the Monte Carlo analysis, output data y(t) have been generated with M = 1000 realizations of a zero mean white Gaussian process, with unit variance and length 1000, as input u(t). The noise also has a different realization for each Monte Carlo run. The result is given in Table 5.2, and the two methods seem to have comparable performance, where the proposed SID method only has a slight advantage in the bias and the variance.
Table 5.2: Monte Carlo analysis of the BIAS estimate (5.11) and the VAR estimate (5.12) when estimating the parameters in Example 5.8.

         BIAS                  VAR
         SID       IV          SID       IV
a1       0.0137    0.0329      0.0101    0.0447
a2       0.0025    0.0100      0.0016    0.0047
a3       0.0040    0.0072      0.0004    0.0021
b1       0.0000    0.0001      0.0001    0.0001
b2       0.0137    0.0367      0.0102    0.0560
b3       0.0083    0.0190      0.0027    0.0153
b4       0.0018    0.0157      0.0036    0.0125
c1       0.0141    0.0141      0.0102    0.0213
The results in the example above are quite typical for the ARMAX case. In the author's experience, the IV based initialization method for ARMAX models (implemented in Ljung, 2009) often works better than the corresponding method for OE models. Thus, the benefit of using the proposed SID method, compared to the IV method, seems to be greater in the OE case than in the ARMAX case.
5.4 Special Gray-Box Models
The ideas presented so far have been based on a coordinate change of the state-space
model to the OCF, which works well for certain black-box model structures. But what
if a different structure is present in the state-space form, like the ones in linear gray-box
models?
Example 5.9: (DC motor)
A simple example of a gray-box structure is given by the state-space model

ẋ(t) = [0 1; 0 θ1] x(t) + [0; θ2] u(t),    (5.18a)
y(t) = [1 0] x(t) + e(t),    (5.18b)

which is often used to represent a DC motor (see, for instance, Ljung, 1999, page 95). The states x1 and x2 represent the angle and the angular velocity, respectively, of the motor shaft, and the parameters θ1 and θ2 contain the information about the values of the resistance, inductance, friction, and inertia of the motor.
Now, assume that estimates Â and Ĉ of the matrices A(θ) and C(θ) in (5.18) have been found using some continuous-time SID method:

Â = [â11 â12; â21 â22],    Ĉ = [ĉ11 ĉ12].

Then, asymptotically, there is a similarity transform T satisfying

[â11 â12; â21 â22] = [t11 t12; t21 t22] [0 1; 0 θ1] [t11 t12; t21 t22]⁻¹,
[ĉ11 ĉ12] = [1 0] [t11 t12; t21 t22]⁻¹.

Multiplying the above relation with T⁻¹ from the left yields

[â11 t̃11 + â21 t̃12   â12 t̃11 + â22 t̃12;  â11 t̃21 + â21 t̃22   â12 t̃21 + â22 t̃22] = [t̃21  t̃22;  t̃21 θ1  t̃22 θ1],
[ĉ11 ĉ12] = [t̃11 t̃12],

where t̃ij denotes the elements of T⁻¹. The equations not containing θ1 become

â11 t̃11 + â21 t̃12 − t̃21 = 0,
â12 t̃11 + â22 t̃12 − t̃22 = 0,
t̃11 = ĉ11,
t̃12 = ĉ12,

which can easily be solved for t̃11, t̃12, t̃21 and t̃22. Thus, a similarity transform, which takes the estimates Â and Ĉ as close as possible to the gray-box structure represented by (5.18), may be found, unless the system of equations is under-determined.
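The four linear equations of the example can be solved mechanically; a numpy sketch (the transform T below is randomly generated to emulate the unknown basis of a SID estimate, and all values are illustrative):

```python
import numpy as np

theta1, theta2 = -2.0, 3.0
A = np.array([[0.0, 1.0], [0.0, theta1]])   # gray-box structure (5.18)
C = np.array([[1.0, 0.0]])

# pretend a SID method returned the system in a random, unknown basis
rng = np.random.default_rng(4)
T = rng.standard_normal((2, 2)) + 2 * np.eye(2)   # well-conditioned transform
A_hat = T @ A @ np.linalg.inv(T)
C_hat = C @ np.linalg.inv(T)

# solve the four linear equations of Example 5.9 for the elements of T^-1
t11, t12 = C_hat[0]                                # from C_hat = [1 0] T^-1
t21 = A_hat[0, 0] * t11 + A_hat[1, 0] * t12
t22 = A_hat[0, 1] * t11 + A_hat[1, 1] * t12
T_inv = np.array([[t11, t12], [t21, t22]])

# transforming back exposes theta1 in the (2, 2) position of A
A_gray = T_inv @ A_hat @ np.linalg.inv(T_inv)
print(np.round(A_gray[1, 1], 6))   # -2.0
```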
Inspired by the example above, let us consider a more general case. Here, let A(θ) and C(θ) represent the gray-box structure of the system we would like to estimate. By using some SID method, estimates Â and Ĉ of A(θ) and C(θ), respectively, can be found. Asymptotically, as the number of data points tends to infinity, the matrix estimates are related to the structured matrices via a similarity transform T from the true values, that is, it holds that

Â = T A(θ) T⁻¹,    Ĉ = C(θ) T⁻¹.    (5.19)

If one would be able to determine the transform T, then one could find the parameters θ by solving some system of equations. Multiplying (5.19) with T from the right yields

ÂT = T A(θ),    ĈT = C(θ).

This may be written as

[I ⊗ Â − A(θ)ᵀ ⊗ I;  I ⊗ Ĉ] vec T = [0; vec C(θ)],    (5.20)

where ⊗ denotes the Kronecker product. For certain linear gray-box structures, where all the unknown parameters of A(θ) lie in one row or one column and all the elements of C(θ) are known, the equation (5.20) can be solved for T.
Example 5.10: (OCF)
In the Sections 5.2 and 5.3, dealing with the initialization of the OE and the ARMAX model structures, the OCF (3.19) was used as a state-space representation of the transfer-function models. Instead of determining the eigenvalues of the Â matrix to transform the estimated state-space model to the "gray-box" structure (3.19), one can determine the similarity transform via (5.20) to achieve the same result. Doing this, the left hand side of (5.20) becomes

[ Â + a1 I   a2 I   a3 I   · · ·   a_{na−1} I   a_{na} I
     −I        Â      0    · · ·       0           0
      0       −I      Â    · · ·       0           0
      ⋮         ⋮      ⋮      ⋱          ⋮           ⋮
      0        0      0    · · ·      −I           Â
      Ĉ        0      0    · · ·       0           0
      0        Ĉ      0    · · ·       0           0
      ⋮         ⋮      ⋮      ⋱          ⋮           ⋮
      0        0      0    · · ·       0           Ĉ ] vec T,

and the right hand side is a vector with all elements equal to zero except the (na² + 1)st element, which is equal to one. Thus, it is only the first na equations that depend on the unknown parameters. The remaining equations sum up to na², which is equal to the number of unknowns in T. Thus, (5.20) is solvable for T whenever the lower part of the matrix above has full rank.
The procedure for finding T may work when A(θ) only has one column depending on the unknown parameters. If A(θ) has all parameters in one row, one should instead try to solve (5.19) for T⁻¹, which is done by multiplying the equation with T⁻¹ from the left and then vectorizing the result with respect to T⁻¹.
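A numpy sketch of this construction for the OCF case of Example 5.10 (illustrative values; the θ-dependent block row of (5.20) is discarded and the remaining square system is solved):

```python
import numpy as np

rng = np.random.default_rng(5)
na = 3
a = np.array([0.8, -0.3, 0.1])                      # true pole coefficients
A_gray = np.column_stack([-a, np.eye(na, na - 1)])  # OCF structure (3.19)
C_gray = np.eye(1, na)

# estimates in a random, unknown basis, as a SID method would return them
T_true = rng.standard_normal((na, na)) + 2 * np.eye(na)
A_hat = T_true @ A_gray @ np.linalg.inv(T_true)
C_hat = C_gray @ np.linalg.inv(T_true)

# build (5.20); only the first na rows depend on the unknown parameters,
# so they are dropped and the remaining square system is solved for vec T
I = np.eye(na)
top = np.kron(I, A_hat) - np.kron(A_gray.T, I)   # from A_hat T = T A(theta)
bottom = np.kron(I, C_hat)                       # from C_hat T = C(theta)
lhs = np.vstack([top[na:], bottom])
rhs = np.concatenate([np.zeros(na * na - na), C_gray.ravel(order='F')])

T = np.linalg.solve(lhs, rhs).reshape((na, na), order='F')

# transforming the estimates back to the OCF recovers the pole coefficients
A_ocf = np.linalg.solve(T, A_hat @ T)
print(np.allclose(-A_ocf[:, 0], a))   # True
```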
We are now ready to give a procedure to find an initial estimate of the special gray-box structures discussed above. Assume that a data set (3.1) and a gray-box structure of order nx have been given. Furthermore, assume that all unknown parameters in A(θ) are in either one row or one column.
1) Find estimates Â and Ĉ of the A(θ) and C(θ) matrices of order n_x × n_x and n_y × n_x, respectively, where n_x = dim(x) and n_y = dim(y), via (3.50).
2) Find the similarity transform T that takes the estimates Â and Ĉ as close as possible to the gray-box structure via (5.20) and transform the estimates Â ← T Â T⁻¹ and Ĉ ← Ĉ T⁻¹.
3) Solve the least-squares problem (3.53) with equality constraints on the known elements in the B(θ) and D(θ) matrices.
4) Reconstruct the state matrix Xβ,N via (3.65).
5) Re-estimate the system matrices according to
$$
\underset{A(\theta),B(\theta),C(\theta),D(\theta)}{\arg\min}\;
\left\|
\begin{bmatrix} \hat{X}_{\beta+1,N} \\ Y_{\beta+1,\alpha,N} \end{bmatrix}
-
\begin{bmatrix} A(\theta) & B(\theta) \\ C(\theta) & D(\theta) \end{bmatrix}
\begin{bmatrix} \hat{X}_{\beta,N-1} \\ U_{\beta,\alpha,N-1} \end{bmatrix}
\right\|^2,
$$
with equality constraints on all known elements.
The last step of the procedure ensures that the constraints on the elements in the A(θ)
matrix are, in some way, taken into account. Let us test the above procedure in a simple
example.
Example 5.11
Let the true system be given by
$$
x(t+1) = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ \theta_1 & 0 & \theta_2 \end{bmatrix} x(t)
+ \begin{bmatrix} \theta_3 \\ 0 \\ \theta_4 \end{bmatrix} u(t), \quad (5.21\text{a})
$$
$$
y(t) = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix} x(t) + e(t), \quad (5.21\text{b})
$$
with θ = (−0.3, 0.3, 0.1, −0.1)^T. Here, e(t) is white Gaussian noise of unit variance.
For the Monte Carlo analysis, output data y(t) have been generated with M = 500 realizations of a zero mean white Gaussian process, with unit variance and length 10000,
as input u(t). The parameters θ have been estimated for each realization with the SID
procedure presented above. The mean value and the variance of the parameter estimates
were given by
θ̄̂ = (−0.2873, 0.2675, 0.1003, −0.0921)^T,    (5.22a)
V̂ar θ̂ = 10⁻² × (0.2395, 0.6518, 0.0096, 0.0144)^T.    (5.22b)
If no consideration is given to the zero at the (3, 2) element in the system matrix A(θ)
in (5.18), that is, the final step 5 in the proposed procedure is not performed, the mean
values of the parameter estimates become
θ̄̃₁ = −0.2049,    θ̄̃₂ = 0.3957,    (5.23a)
and their variances
V̂ar θ̃₁ = 0.0293,    V̂ar θ̃₂ = 0.1004.    (5.23b)
This shows, as could be expected, a slight increase in variance when not taking the structure of A(θ) into account. Furthermore, there also seems to be a significant increase in
the bias of the parameter estimates.
The procedure presented above is similar to the method proposed in Xie and Ljung (2002).
There, an initial model of the system is first found by an unstructured subspace method
and then the similarity transform is found by solving a least-squares problem. The method
then alternates between solving a nonlinear parameter estimation problem and improving
the similarity transform.
5.5 Discussion
In this chapter, a new algorithm for initial parameter estimation of certain linear model structures, which makes use of the standard SID methods, has been presented. The algorithm is valid for both discrete-time and continuous-time identification.
The proposed method might be generalized to handle Box-Jenkins models by using
the method presented in Reynders and De Roeck (2009) or similar SID methods, which
is a topic for future work. Furthermore, SID methods for LPV state-space models and
Wiener-Hammerstein systems might benefit from the ideas presented.
The original SID methods, presented in Section 3.4, contain means to estimate the
number of states needed to describe the system. This is done by analyzing the singular values of the estimated observability matrix Γα , that is, the relation in (3.49). With
the ideas presented in this chapter, this may be extended to automatic order selection of
OE and ARMAX models, where the number of parameters used in the estimated B(θ),
D(θ) and K(θ) matrices may be determined by some order selection algorithm for linear
regression models.
The use of the OCF to represent state-space models makes it simple to choose the linear equality constraints that should be used when estimating the B(θ) and D(θ) matrices in the state-space model. The drawback is the sensitivity to small perturbations in the parameters of the OCF. Thus, it would be interesting to find a different representation which does not suffer as much from this drawback. Furthermore,
a different representation may enable a one-shot method for MIMO systems, instead of
estimating several MISO systems.
Finally, in addition to the structure information we have incorporated into the SID
method in this chapter, there might also be some other prior information that is known.
For example, if the static gain of the system is known a priori, how does one incorporate
such structure information?
6 Difference Algebra and System Identification
The framework of differential algebra, especially Ritt’s algorithm, has turned out to be a
useful tool when analyzing the identifiability of certain nonlinear continuous-time model
structures (Ljung and Glad, 1994). This framework is conceptual rather than practical and
it provides means to analyze complex nonlinear model structures via the much simpler
linear regression models
y(t) = ϕ(t)^T θ + e(t).    (6.1)
One difficulty when working with continuous-time signals is dealing with white noise signals, and in Ljung and Glad (1994) these effects are ignored. In this chapter, difference algebraic techniques, which mimic the differential algebraic techniques, will be
developed. Besides making it possible to analyze discrete-time systems, this opens up the
possibility of dealing with noise.
6.1 Introduction
The analysis of nonlinearly parametrized model structures is an important branch of system identification. For instance, many physically motivated model structures fall into this
category (see, for instance, Ljung, 1999). In this chapter, we will try to generalize the
differential algebraic framework presented in Ljung and Glad (1994) for discrete-time
systems, see also Section 3.5. Let us first consider a simple example.
Example 6.1: (Example 1.3 revisited)
Consider the problem of estimating the parameter θ in
y(t) = θu(t) + θ²u(t − 1) + e(t),    (6.2)
where y(t) is the output, u(t) is the input, and e(t) is the noise. This is the same model
structure as in Example 1.3. Also, by exchanging the time shift u(t − 1) in (6.2) with the
derivative u̇(t), this corresponds directly to the model structure in Example 3.3. Let us,
inspired by the discussion in Example 3.3, time shift (6.2)
y(t − 1) = θu(t − 1) + θ²u(t − 2) + e(t − 1).    (6.3)
By examining the equations, we see that by multiplying (6.2) by u(t − 2) and (6.3) by
u(t − 1), and then subtracting the result, we obtain
u(t − 2)y(t) − u(t − 1)y(t − 1) = (u(t − 2)u(t) − u²(t − 1))θ + u(t − 2)e(t) − u(t − 1)e(t − 1).    (6.4)
This may be written as a linear regression model if one replaces y(t) in (6.1) with the left
hand side of (6.4) and ϕ(t) in (6.1) with the coefficient in front of θ in (6.4), respectively.
Thus, instead of analyzing the properties of the model structure defined by (6.2) one may
work with a linear regression model for which there are many well known properties.
Furthermore, if one formulates the estimation problem as minimizing the squared
prediction error (3.5) with respect to the parameter θ using the representation (6.2) and
the same signals as in Example 1.3, one notices that the optimization problem becomes
non-convex, even for such a simple model structure, see Figure 6.1.
[Plot of V_N(θ, Z^N) versus θ ∈ [−3, 3].]
Figure 6.1: The cost function for the problem of estimating the parameter θ in (6.2)
given the same data as in Example 1.3. The solid line represents using (6.2) directly
and the dashed line using the reformulated description (6.4).
On the other hand, formulating the equivalent estimation problem, but now with the
equivalent representation (6.4), one gets a convex optimization problem as illustrated in
Figure 6.1. This suggests that, at least in this simple example, algebraic techniques may be useful in finding initial estimates for certain nonlinear model structures.
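The effect of the reformulation can be checked numerically. The following sketch is an illustration with assumed values (true θ = 0.7, noise standard deviation 0.1, N = 10000), not the data of Example 1.3: it simulates (6.2) and estimates θ from the linear regression (6.4) in a single least-squares step.

```python
import numpy as np

rng = np.random.default_rng(1)
N, theta = 10000, 0.7
u = rng.standard_normal(N)
e = 0.1 * rng.standard_normal(N)
u_prev = np.concatenate([[0.0], u[:-1]])   # u(t-1), zero initial condition
y = theta * u + theta**2 * u_prev + e      # simulate model (6.2)

# Linear regression (6.4): lhs(t) = phi(t) * theta + noise, for t = 2, ..., N-1
lhs = u[:-2] * y[2:] - u[1:-1] * y[1:-1]   # u(t-2)y(t) - u(t-1)y(t-1)
phi = u[:-2] * u[2:] - u[1:-1] ** 2        # u(t-2)u(t) - u(t-1)^2
theta_hat = (phi @ lhs) / (phi @ phi)      # one-parameter least squares
print(theta_hat)                           # estimate close to the true theta
```

The noise term in (6.4) is uncorrelated with the regressor for white, mutually independent u and e, so the one-shot estimate is consistent, in contrast to the non-convex minimization of (6.2).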
The calculations in the example above are inspired by the methods of differential algebra,
especially the framework based on Ritt’s algorithm presented in Ljung and Glad (1994).
The main result in Ljung and Glad (1994) shows that a continuous-time model structure,
containing only polynomial nonlinearities, is globally identifiable if and only if it can be
written as a linear regression. In particular, this implies that once the necessary differential
algebraic manipulations have been made, an initial estimate of the unknown parameters
may be found by solving a simple least-squares problem. For further references to the
differential algebraic framework and its application to control theoretical problems, see
Section 3.5.
In this chapter, motivated by the result presented in Ljung and Glad (1994) and the
fact that most system identification problems involve sampled data, we aim to generalize
these results to discrete time. Also, if one can generalize the methods of differential
algebra to work for discrete-time systems, differentiation of noise signals will no longer
be an issue and the noise can be manipulated as any other signal. Attempts to mimic Ritt’s
algorithm for discrete-time systems have already been made by Kotsios (2001), where the
so-called δ-operators are introduced and different products of these are discussed. In
contrast, the aim in this presentation is to generalize Ritt’s algorithm with a minimum
amount of changes compared to the continuous-time case.
The structure of the chapter is as follows: In Section 6.2, the basic algebraic concepts used in this chapter are presented. Furthermore, Ritt’s algorithm is formulated and
its finite time termination is proved. The generalization of the identifiability results presented in Ljung and Glad (1994) is discussed in Section 6.3. Finally, some implications for
the initialization of system identification problems are given in Section 6.4.
6.2 Algebraic Concepts
In this section, we are interested in formalizing the algebra concerning solutions to systems of polynomial equations, where the polynomials depend only on a finite number of variables (which are themselves elements of one or several time-dependent signals), that is, the solution to systems of difference equations. For example, the polynomial x(t)³ + 1 in the variable x(t) is a function f of the sequence X_t = (x(t − τ))_{τ=0}^{∞} which maps X_t to the polynomial X_{t,1}³ + 1 (where X_{t,1} is the first element in the sequence X_t). The solution to the difference equation f = 0, that is x(t)³ + 1 = 0, is thus a sequence (x(τ))_{τ=−∞}^{∞} satisfying f(X_t) = 0 for all t. In general, the solution to a system of difference equations will be found by algebraic manipulations involving the backward shift operator q⁻¹, which applied to our example polynomial results in q⁻¹f(X_t) = x(t − 1)³ + 1. Thus, the time-shifted polynomial q⁻¹f is a function of the second element X_{t,2} of the sequence X_t in the same way as f is a function of X_{t,1}. Thus, the two polynomials f(X_t) and q⁻¹f(X_t) are considered to be different. For the sake of notational convenience, from here on the argument of the polynomials will be left out. Starting with the basics of difference algebra for time-dependent signals, we will move on to polynomials, and finally an algorithm is presented for systems of difference polynomials.
6.2.1 Signal Shifts
As discussed above, we are interested in systems described by polynomials in time-dependent variables and their time shifts. The shifts will be denoted by
u^{(k)} ≜ q^{−k} u(t),    k ∈ N,    (6.5)
where q is the forward time shift operator and the order of the displacement is given in the parenthesis¹. To simplify the notation in the examples to come, we also introduce u̇ ≜ u^{(1)} and ü ≜ u^{(2)}. A fundamental concept for the algorithmic aspects of differential
algebra is ranking. This is a total ordering (see, for instance, Lang, 2002) of all variables
and their derivatives. In the discrete-time case it corresponds to a total ordering of all
time-shifted variables.
Definition 6.1 (Ordering, Ranking). A binary operator ≺, which is a total ordering
satisfying
(i) u^{(µ)} ≺ u^{(µ+σ)},
(ii) u^{(µ)} ≺ y^{(ν)} ⟹ u^{(µ+σ)} ≺ y^{(ν+σ)},
for all µ, ν, σ ∈ N, is called an ordering of the signals u and y and we say that u is ranked
lower than y if u ≺ y.
There are many possible choices of orderings of signals. For example, let u and y be
two signals. Then two possible orderings are
u ≺ y ≺ u̇ ≺ ẏ ≺ ü ≺ ÿ ≺ · · ·    (6.6a)
u ≺ u̇ ≺ ü ≺ · · · ≺ y ≺ ẏ ≺ ÿ ≺ · · ·    (6.6b)
The latter ordering will often be written in short as u^{(·)} ≺ y^{(·)}. Let us turn our attention to polynomials with variables that are time-shifted signals. These polynomials will be used to represent difference equations as discussed in the introduction to this section.
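The two orderings in (6.6) correspond to different sort keys on time-shifted variables. A minimal sketch, where the pair representation (name, shift) is an assumption of the example, not the thesis' notation:

```python
# A time-shifted variable u^{(k)} is represented as the pair (name, k).
SIGNALS = ("u", "y")  # the base ordering u before y

def key_66a(var):
    """Ordering (6.6a): u, y, udot, ydot, uddot, yddot, ... (shift first)."""
    name, shift = var
    return (shift, SIGNALS.index(name))

def key_66b(var):
    """Ordering (6.6b): u, udot, uddot, ..., y, ydot, yddot, ... (signal first)."""
    name, shift = var
    return (SIGNALS.index(name), shift)

variables = [("y", 1), ("u", 0), ("y", 0), ("u", 2), ("u", 1), ("y", 2)]
print(sorted(variables, key=key_66a))
print(sorted(variables, key=key_66b))
```

Both keys are total orderings that satisfy conditions (i) and (ii) of Definition 6.1: increasing the shift of every variable by the same amount preserves the comparison.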
6.2.2 Polynomials
As with signals, polynomials can also be time-shifted. In fact, the polynomial f^{(σ)} is the result when all variables in the polynomial f have been shifted σ ∈ N time steps. Even though most of the results and definitions in this section apply to polynomials in static variables, the formulations will focus on polynomials in time-dependent variables.
To illustrate the algebraic concepts below in a simple manner, using the notation introduced in (6.5), let
f ≜ u̇y + ü³ẏ²,    (6.7a)
g ≜ ẏ² + ÿ,    (6.7b)
h ≜ u + u̇²,    (6.7c)
be polynomials in the sequences U_t = (u(t − τ))_{τ=0}^{∞} and Y_t = (y(t − τ))_{τ=0}^{∞}.
To be able to order polynomials we need to find the highest ranked variable in the
polynomial.
Definition 6.2 (Leader, Degree). The highest ranked time-shifted variable in a, possibly
time-shifted, polynomial f is called the leader and is denoted by `f . The degree of a
variable x in f is the highest exponent of x that occurs in f and is denoted by degx f .
¹ Alternatively, we may use shifts forwards in time, u^{(k)} ≜ q^{k} u(t) for all k ∈ N, and the same theory applies.
The polynomial (6.7a), with the ordering (6.6a), has the leader ℓ_f = ü with deg_ü f = 3, and if the ordering is changed to (6.6b), the leader is given by ℓ_f = ẏ with deg_ẏ f = 2. Thus, the leader depends on both the polynomial at hand and the ordering of the time-shifted signals that is used. The ranking of polynomials can now be defined.
Definition 6.3 (Ranking). Let f and g be two polynomials with leaders ℓ_f and ℓ_g, respectively. We say that f is ranked lower than g, denoted f ≺ g, if either ℓ_f ≺ ℓ_g or if ℓ_f = ℓ_g and deg_{ℓ_f} f < deg_{ℓ_f} g. If ℓ_f = ℓ_g and deg_{ℓ_f} f = deg_{ℓ_f} g, we say that f and g have equal ranking and write f ∼ g.
The polynomials f and g in (6.7), with the ordering (6.6a), have the leaders ℓ_f = ü and ℓ_g = ÿ, respectively. Since ü ≺ ÿ, according to (6.6a), it follows that f ≺ g.
In this section, we will be dealing with the elimination of variables in time-shifted
polynomials. In this context the following concept is important.
Definition 6.4 (Reduced). Let f be a polynomial with leader ℓ_f. A polynomial g is said to be reduced with respect to f if there is no positive time shift of ℓ_f in g and if deg_{ℓ_f} g < deg_{ℓ_f} f.
Using the ordering (6.6b) with the polynomials f and g in (6.7), the leaders are given by ℓ_f = ẏ and ℓ_g = ÿ, respectively. Thus, in this case, f is reduced with respect to g but not vice versa. The above concepts (ranking and reduced) are related as follows.
Lemma 6.1
Let f and g be two polynomials. If f ≺ g under some ordering, then f is also reduced
with respect to g under that ordering.
Proof: If f ≺ g, then either ℓ_f ≺ ℓ_g or ℓ_f = ℓ_g and deg_{ℓ_f} f < deg_{ℓ_g} g. In the former case, it follows from Definition 6.1 that ℓ_f ≺ ℓ_g ≺ ℓ_g^{(σ)} for all σ ∈ Z₊. Thus, f does not depend on ℓ_g^{(σ)} for any σ ∈ N and in particular 0 = deg_{ℓ_g} f < deg_{ℓ_g} g. Hence, f must be reduced with respect to g. The latter case follows in a similar fashion.
That the two concepts are not equivalent is easily seen if one chooses the ordering (6.6a) with the simple polynomials f = y and g = u. Since f does not depend on the variable u, it holds that f is reduced with respect to g, but f ⊀ g since ℓ_g ≺ ℓ_f.
Before we continue providing the tool needed to reduce polynomials with respect to
each other, some additional concepts of difference polynomials are needed. These will
not play a major role in what follows but are used to guarantee the existence of solutions
to the resulting reduced systems of polynomials.
Definition 6.5 (Separant, Initial). The separant S_f of a polynomial f is the partial derivative of f with respect to the leader, while the initial I_f is the coefficient of the highest power of the leader in f.
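The separant and initial are mechanical to compute with a computer algebra system. A small sketch using SymPy on f in (6.7a) under the ordering (6.6a); the ASCII names udot, uddot, ydot stand in for the dotted variables and are an assumption of this sketch:

```python
import sympy as sp

udot, uddot, y, ydot = sp.symbols("udot uddot y ydot")
f = udot * y + uddot**3 * ydot**2   # f in (6.7a)
leader = uddot                      # leader of f under the ordering (6.6a)

S_f = sp.diff(f, leader)            # separant: partial derivative w.r.t. the leader
I_f = sp.LC(f, leader)              # initial: coefficient of the leader's highest power
print(S_f, I_f)                     # 3*uddot**2*ydot**2 and ydot**2
```

The printed values agree with the worked example in the text below the definition.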
The polynomial f in (6.7a) has, under the ordering (6.6a), the leader ℓ_f = ü. This implies that the separant is given by S_f = 3ü²ẏ² and the initial by I_f = ẏ². The tool needed to reduce polynomials with respect to each other is a variant of the standard polynomial division.
Lemma 6.2 (Pseudo-division)
Let f and g be polynomials in the variable x of degree m and n, respectively, written in the form
f = a_m x^m + · · · + a_0  and  g = b_n x^n + · · · + b_0,
where m ≥ n. Then there exist polynomials Q ≠ 0, Q̄ and R such that
Qf = Q̄g + R,
where deg_x R < n. Furthermore, with Q given by b_n^{m−n+1}, the polynomials Q̄ and R are unique.
Proof: A proof can be found in Mishra (1993, p. 168).
The suggested choice of Q in the lemma above is the initial I_g to the power of m − n + 1. It is also worth noting that the polynomials f and g in Lemma 6.2 may be multivariable, which implies that the coefficients of the variable x are polynomials in the remaining variables. Now, let us illustrate the concept of pseudo-division on the polynomials f and g in (6.7), where the variable in question is ẏ. Since I_g = 1, it holds that
f = ü³g + (−ü³ÿ + u̇y),
so that Q = 1, Q̄ = ü³ and R = −ü³ÿ + u̇y, which satisfies Lemma 6.2 if we notice that deg_ẏ R = 0. During the algebraic simplifications, it is important that the solutions to the original system of polynomial equations are preserved. To this end, the following results show how pseudo-division can be used to eliminate variables while preserving the solution.
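The pseudo-division above can be reproduced with SymPy's `pdiv`, which returns Q̄ and R such that I_g^{m−n+1} f = Q̄g + R; since I_g = 1 here, the identity reduces to f = Q̄g + R. The ASCII names stand in for the dotted variables and are an assumption of this sketch:

```python
import sympy as sp

udot, uddot, y, ydot, yddot = sp.symbols("udot uddot y ydot yddot")
f = udot * y + uddot**3 * ydot**2   # f in (6.7a)
g = ydot**2 + yddot                 # g in (6.7b)

q, r = sp.pdiv(f, g, ydot)          # pseudo-division with respect to ydot
assert sp.expand(q * g + r - f) == 0            # exact, since LC(g, ydot) = 1
assert sp.degree(r, ydot) < sp.degree(g, ydot)  # remainder has lower degree
print(q, r)                         # uddot**3 and udot*y - uddot**3*yddot
```

The returned quotient and remainder coincide with Q̄ = ü³ and R = −ü³ÿ + u̇y in the worked example.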
Lemma 6.3
Let g be a polynomial with leader ℓ_g and let f be a polynomial containing ℓ_g^{(σ)} for some σ ∈ N. Then there exist polynomials R and Q such that
(i) R does not contain ℓ_g^{(σ)}.
(ii) Every solution of f = 0, g = 0 is also a solution of R = 0, g = 0.
(iii) Every solution of R = 0, g = 0 with Q ≠ 0 is also a solution of f = 0, g = 0.
Proof: Let m and n be the degrees of f and g^{(σ)} as polynomials in ℓ_g^{(σ)}, respectively. On one hand, if m ≥ n, then pseudo-division according to Lemma 6.2 yields polynomials Q₁ ≠ 0, Q̄₁ and R₁ such that
Q₁f = Q̄₁g^{(σ)} + R₁,    (6.8)
with deg_{ℓ_g^{(σ)}} R₁ < n ≤ m. If R₁ still depends on the variable ℓ_g^{(σ)}, then further pseudo-division yields polynomials Q₂ ≠ 0, Q̄₂ and R̃₂ such that
Q₂g^{(σ)} = Q̄₂R₁ + R̃₂,    (6.9)
with deg_{ℓ_g^{(σ)}} R̃₂ < deg_{ℓ_g^{(σ)}} R₁. Combining (6.9) and (6.8) yields
Q₁Q̄₂f = (Q₂ + Q̄₁Q̄₂)g^{(σ)} + R₂,
where we have defined R₂ ≜ −R̃₂. Continuing in this manner, by recursive application of Lemma 6.2, one may always construct polynomials Q, Q̄ and R such that
Qf = Q̄g^{(σ)} + R,    (6.10)
with deg_{ℓ_g^{(σ)}} R = 0, that is, R does not depend on the variable ℓ_g^{(σ)}. Now, assume that such polynomials have been constructed. Then the first statement in the lemma holds true. For the second statement, rewrite (6.10) as
R = Qf − Q̄g^{(σ)}.
Since g^{(σ)} is just g with all variables time shifted σ ≥ 0 steps, it must hold that g^{(σ)} = 0 whenever g = 0. Thus, R = 0 whenever f = 0 and g = 0. For the final statement, rewrite (6.10), assuming Q ≠ 0, as
f = (1/Q)(Q̄g^{(σ)} + R).
Since g = 0 implies g^{(σ)} = 0, it follows that f = 0 whenever R = 0 and g = 0, which concludes the proof for the case when m ≥ n. On the other hand, if m < n, then Lemma 6.2 yields polynomials Q ≠ 0, Q̄ and R such that
Qg^{(σ)} = Q̄f + R
with deg_{ℓ_g^{(σ)}} R < m, and similar arguments as in the former case can be applied.
In the proof of Lemma 6.3 above, we see that it is possible that several pseudo-divisions are needed in the case σ > 0 to eliminate ℓ_g^{(σ)} from the polynomial f. This is the main difference between the continuous-time case and the discrete-time case. In the continuous-time case, g^{(σ)} is affine in ℓ_g^{(σ)} (due to the product rule of differentiation) and thus only one pseudo-division is needed to eliminate ℓ_g^{(σ)} from f. In the discrete-time case, time shifting a polynomial does not change its degree with respect to a variable, that is, deg_{ℓ_g} g = deg_{ℓ_g^{(σ)}} g^{(σ)}, and several pseudo-divisions might be needed.
Now, the following extension of Lemma 6.3 provides (as we will see later) the main
step in Ritt’s algorithm.
Lemma 6.4
Let f be a polynomial which is not reduced with respect to the polynomial g. Then there
exist polynomials R and Q such that
(i) R is reduced with respect to g.
(ii) Every solution of f = 0, g = 0 is also a solution of R = 0, g = 0.
(iii) Every solution of R = 0, g = 0 with Q 6= 0 is also a solution of f = 0, g = 0.
Proof: If f is not reduced with respect to g, then either f contains some positive time shift of the leader ℓ_g of g, or else f contains ℓ_g to a higher power than g.
In the latter case, Lemma 6.3 yields polynomials Q, Q̄ and R such that
Qf = Q̄g + R,    (6.11)
where R is reduced with respect to g. Thus, the proof of the statements follows in the same way as in the proof of Lemma 6.3.
In the former case, the polynomial f contains ℓ_g^{(σ₁)} for some σ₁ ∈ Z₊. Lemma 6.3 then yields polynomials Q₁, Q̄₁ and R₁ satisfying
Q₁f = Q̄₁g^{(σ₁)} + R₁,    (6.12)
where R₁ does not contain ℓ_g^{(σ₁)}. If R₁ still contains ℓ_g^{(σ₂)} for some σ₂ ∈ N with σ₂ < σ₁, further use of Lemma 6.3 yields polynomials Q₂, Q̄₂ and R₂ satisfying
Q₂R₁ = Q̄₂g^{(σ₂)} + R₂,    (6.13)
where R₂ does not contain ℓ_g^{(σ₂)}. Combining (6.12) and (6.13) yields
Q₂Q₁f = Q₂Q̄₁g^{(σ₁)} + Q̄₂g^{(σ₂)} + R₂.
Continuing in this manner, by recursive application of Lemma 6.3, one can construct polynomials Q and R such that
Qf = ∑_{i=1}^{n} Q̃_i g^{(σ_i)} + R,    (6.14)
for some sequence of polynomials Q̃_i, i = 1, …, n, where R does not contain ℓ_g^{(σ)} for any σ ∈ N. Thus, we have constructed a polynomial R that is reduced with respect to g. Since g = 0 implies that g^{(σ)} = 0 for all σ ∈ N, the remaining statements follow in the same way as in the proof of Lemma 6.3.
Lemma 6.4 shows how to reduce one difference equation with respect to another while preserving the solutions of both; the main tool for achieving this is the pseudo-division concept presented in Lemma 6.2. This is important when dealing with systems of difference equations, which is the topic of the following section.
6.2.3 Systems of Polynomials
Now, we are ready to define the necessary concepts for systems of polynomial difference equations, which will be represented by sets of difference polynomials. The following definition generalizes the concept of reduced polynomials.
Definition 6.6 (Auto-reduced). A set A = {A₁, …, A_p} of polynomials is called auto-reduced if all elements A_i are pairwise reduced with respect to each other. If, in addition, the polynomials A₁, …, A_p in the auto-reduced set A are in increasing rank, then A is said to be ordered.
Let us once again consider the polynomials f and g in (6.7), but now with the ordering (6.6a). Then it is easy to see that f and g are reduced with respect to each other
and the set {f, g} is auto-reduced. On the other hand, using the ordering (6.6b), we have
already seen that g is not reduced with respect to f and the set {f, g} is not auto-reduced.
The following definition generalizes the concept of ranking to auto-reduced sets.
Definition 6.7 (Ranking). Let A = {A₁, …, A_m} and B = {B₁, …, B_n} be two ordered auto-reduced sets. Then A is said to have a lower ranking than B if either there exists an integer k with 1 ≤ k ≤ min(m, n) satisfying
A_j ∼ B_j, j = 1, …, k − 1,    A_k ≺ B_k,
or else if m > n and A_j ∼ B_j for all j = 1, …, n.
The ordering (6.6a) applied to the polynomials f, g and h in (6.7) yields two ordered auto-reduced sets of polynomials, namely A ≜ {f, g} and B ≜ {h, g}. Since h ≺ f, but not vice versa, it holds that B is ranked lower than A. The concept of reduced polynomials can be generalized as follows.
Definition 6.8 (Characteristic set). A characteristic set for a given set of time-shifted
polynomials is an auto-reduced subset such that no other auto-reduced subset is ranked
lower.
Using the same example of ordered auto-reduced sets A ≜ {f, g} and B ≜ {h, g} chosen from the set {f, g, h} of polynomials (6.7), it is clear that B is the characteristic set under the ordering (6.6a). The basic idea of Ritt's algorithm is to reduce a set of
polynomials by the use of characteristic sets. In each iteration, the characteristic set is
used to reduce the highest ranked polynomial not in the characteristic set to a lower ranked
one. Thus, it is important to guarantee that a sequence with decreasing rank is always
finite for the algorithm to terminate in a finite number of steps.
Lemma 6.5
A sequence of characteristic sets, each one ranked lower than the preceding one, can only
have finite length.
The proof of this statement is a direct consequence of the following result.
Lemma 6.6
A sequence of time-shifted variables from a finite number of signals, each one ranked
lower than the preceding one, can only have finite length.
Proof: Let y1 , . . . , yp denote all the variables whose time shifts appear anywhere in the
sequence. For each yj let σj denote the order of the first appearing time shift. There can
then be only σj lower time-shifted yj in the sequence. The total number of elements is
thus bounded by σ1 + · · · + σp + p.
Finally, we are ready to state the elimination algorithm. Consider a system of polynomial equations of the form
f₁ = 0, …, f_n = 0,    (6.15)
where the fi are polynomials in the variables (y, u, x, θ) and their time-shifts. The elimination procedure is given in Algorithm 6.
Algorithm 6 Ritt-Seidenberg.
Input: Two sets of polynomials F = {f1 , . . . , fn } and G = ∅ with an ordering.
Output: The updated set F as a characteristic set containing the reduced polynomials and
the set G containing information about the separants, initials and the quotients resulting
from the performed pseudo-divisions in the different steps.
1) Compute a characteristic set A = {A1 , . . . , Ap } of F: Order the polynomials in F
according to their leaders, so that the first polynomial has the lowest ordered leader
and initialize A ← {f1 }. For all fi ∈ F, i = 2, . . . , n, test if A ∪ {fi } is auto-reduced
and in that case update A ← A ∪ {fi }.
2) If F \ A 6= ∅, where \ denotes the set difference, then go to Step 4.
3) Add SA , IA for all A ∈ A to G and stop.
4) Let f be the highest ranked unreduced polynomial in F with respect to A and apply
Lemma 6.4 to get polynomials Q and R such that
Qf = Q̄A^{(σ)} + R,
where A is the highest ordered polynomial in A such that f is not reduced with respect
to A. Update G ← G ∪ {Q}.
5) If R = 0 update F ← F \ {f}, else F ← (F \ {f}) ∪ {R} and continue from Step 1.
Algorithm 6 will be referred to as Ritt’s algorithm from here on. Some important properties of the proposed algorithm are given below. The proofs of the results are similar
to corresponding proofs for the continuous-time case, but with the small distinction remarked upon in connection with Lemma 6.3.
Theorem 6.1
Algorithm 6 will reach the stop after a finite number of steps.
Proof: The only possible loop is via Step 5 to Step 1. This involves either the removal
of a polynomial or its replacement with one that is reduced with respect to A or has its
highest unreduced time-shift removed. If R is reduced, then it is possible to construct
a lower auto-reduced set. An infinite loop would thus contradict either Lemma 6.5 or
Lemma 6.6.
Theorem 6.2
Every solution to the initial set F of the algorithm is also a solution to the final set. Every
solution to the final set for which the polynomials of G are nonzero is also a solution of
the initial set.
Proof: Follows from a repeated application of Lemma 6.4.
We conclude this section by repeating Example 6.1, but here the parameter nonlinearity is eliminated using Ritt’s algorithm.
Example 6.2: (Example 6.1 revisited)
Consider the problem of estimating the parameter θ in the model
y(t) = θu(t) + θ²u(t − 1) + e(t),    (6.16)
given input and output data. This problem could be solved by defining θ̃ = θ² and solving a least-squares problem, but let us consider using Ritt's algorithm. Define
f₁ ≜ y − uθ − u̇θ² − e,    (6.17)
f₂ ≜ θ − θ̇,    (6.18)
where the dot operator denotes the backward shift. Let F ≜ {f₁, f₂} and G = ∅, respectively. With the ordering
u^{(·)} ≺ y^{(·)} ≺ e^{(·)} ≺ θ^{(·)},
the algorithm yields:
Iteration 1. The leaders ℓ_{f₁} and ℓ_{f₂} in F are θ and θ̇, respectively. Since ℓ_{f₁} ≺ ℓ_{f₂}, f₁ is reduced with respect to f₂. The other way around, since ℓ̇_{f₁} = θ̇, f₂ is not reduced with respect to f₁, that is, the largest auto-reduced set A is given by {f₁}. The division algorithm yields Q₁f₂ = Q̄₁ḟ₁ + R₁, with
R₁ = −ẏ + ė + u̇θ + üθ̇θ,
Q₁ = u̇ + üθ̇, and Q̄₁ = 1. Since R₁ still depends on θ̇, yet another division is needed, that is, Q₂f₂ = Q̄₂R₁ + R₂, with
R₂ = ẏ − ė − u̇θ − üθ²,
Q₂ = −üθ, and Q̄₂ = 1. Putting it all together yields
Qf₂ = Q̄ḟ₁ + R,
where Q = Q₂ − Q₁Q̄₂, Q̄ = −Q̄₁Q̄₂, and R = R₂. Finally, update G and set F = {f₁, f₃}, where f₃ ≜ R.
Iteration 2. The leaders in F are now θ and θ, respectively. Thus, A = {f₁} (or {f₃}) is the characteristic set of F. Division yields Q̄f₃ = Qf₁ + R, where
R = −üy + üe + u̇ẏ − u̇ė + (üu − u̇²)θ,
Q̄ = u̇, and Q = ü. Update G and set F = {f₁, f₄} with f₄ ≜ R.
Thus, one of the equations that Ritt's algorithm finds is
üy − u̇ẏ = (üu − u̇²)θ + üe − u̇ė,    (6.19)
which coincides with the result (6.4) in Example 6.1. The algorithm continues for another
two iterations, until only one element of F depends on the parameter θ, but these steps
are not presented here.
It is worth noting that in the first iteration, two divisions had to be made. This is one
of the differences between the discrete-time and the continuous-time Ritt’s algorithm. In
continuous-time, the derivative always leaves a polynomial which is affine in the leader
and only one division is needed in each iteration.
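The two reductions of the example can be mirrored with SymPy pseudo-divisions. The remainders agree with f₃ and the regression polynomial behind (6.19), although the cofactors are normalized as in Lemma 6.2 rather than exactly as the Q_i above; the ASCII names stand in for the dotted variables and are an assumption of this sketch:

```python
import sympy as sp

u, udot, uddot = sp.symbols("u udot uddot")
y, ydot = sp.symbols("y ydot")
e, edot = sp.symbols("e edot")
th, thdot = sp.symbols("theta thetadot")

f1 = y - u * th - udot * th**2 - e                     # (6.17)
f2 = th - thdot                                       # (6.18)
f1dot = ydot - udot * thdot - uddot * thdot**2 - edot  # f1 with all variables shifted once

# Eliminate thetadot: pseudo-divide the shifted f1 by f2 (linear in thetadot)
_, f3 = sp.pdiv(f1dot, f2, thdot)
assert thdot not in f3.free_symbols   # f3 = ydot - edot - udot*th - uddot*th**2

# Eliminate theta^2: pseudo-divide f1 by f3 with respect to theta
q, f4 = sp.pdiv(f1, f3, th)
assert sp.degree(f4, th) == 1         # linear in theta, cf. the regression (6.19)
assert sp.expand(f4 - (-uddot*y + uddot*e + udot*ydot - udot*edot
                       + (uddot*u - udot**2)*th)) == 0
```

Setting f4 = 0 and rearranging reproduces (6.19), so the elimination steps of the example reduce to two calls to a standard pseudo-division routine.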
In Ljung and Glad (1994), the differential algebraic tools were used to derive some interesting results on the identifiability of polynomial systems. Since the discrete-time algorithm is so similar, one could expect to get the same results.
6.3 Identifiability
In Ljung and Glad (1994), necessary and sufficient conditions for global identifiability of
continuous-time model structures were given (see Section 3.5). Here, we will discuss the
generalization of these results for discrete-time systems. First, let us recall the definition
of global identifiability (see also Section 3.2.4).
Definition 6.9. A model structure M is globally identifiable at θ∗ ∈ DM if M(θ) =
M(θ∗ ) for some θ ∈ DM implies that θ = θ∗ .
Now, since the discrete-time version of Ritt's algorithm is so similar to the continuous-time case, should we not be able to derive equivalent results concerning the identifiability of discrete-time model structures? That is, is a discrete-time model structure, only containing polynomial nonlinearities, globally identifiable if and only if it can be written as a linear regression model?
The first part of this statement is, at this point, quite easy to prove. Namely, if Ritt’s
algorithm results in a linear regression model, then the corresponding model structure is
globally identifiable.
Theorem 6.3 (Global identifiability)
If the output of Algorithm 6 contains an expression of the form Qθ − P, where the diagonal matrix Q and the vector P do not depend on θ, then θ is globally identifiable, provided that det Q ≠ 0 for the measured data².
Proof: Since, according to Theorem 6.2, every solution of the original equations is also a solution of the output equations, it follows that there can only be one value of θ that is consistent with the measured values, provided that det Q ≠ 0.
Now, let us consider the converse of the above statement. In the continuous-time case,
the proof of this fact is based on the following property: If f (t) and g(t) are analytical
functions satisfying f (t)g(t) = 0 for all t ∈ R, it holds that either f (t) = 0 or g(t) = 0
for all t ∈ R. Unfortunately, this property does not remain valid when the domain changes
² The requirement det Q ≠ 0 can be interpreted as a condition of persistence of excitation of the input signal (see, for instance, Ljung, 1999).
from R to Z. A simple example of a discrete-time signal that does not satisfy the above
property is the Kronecker delta function
    δ(t) ≜ 1 if t = 0,   δ(t) ≜ 0 if t ∈ Z \ {0}.
For instance, it holds that δ(t)δ(t − 1) = 0 for all t ∈ Z even though δ(t) and δ(t − 1)
are nonzero for t = 0 and t = 1, respectively. The absence of the above property for
discrete-time signals hinders a straightforward generalization of the desired identifiability
result. Reasonable assumptions under which the theorem can be proved are yet to be found.
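This counterexample is easy to verify numerically; the following Python sketch (illustrative only) checks that the product vanishes for every tested t, while neither factor is the zero signal:

```python
def delta(t: int) -> int:
    """Kronecker delta on the integers."""
    return 1 if t == 0 else 0

# delta(t) * delta(t - 1) == 0 for every integer t, yet neither factor
# is identically zero: the first is 1 at t = 0, the second at t = 1.
assert all(delta(t) * delta(t - 1) == 0 for t in range(-100, 101))
assert delta(0) == 1 and delta(1 - 1) == 1
```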
Even though we were not able to provide a general identifiability result, equivalent
to that achieved in the continuous-time case, we are still able to draw some conclusions
when the use of Ritt’s algorithm does not result in a linear regression model.
Theorem 6.4 (Unidentifiability)
Let the ranking be given by
    u^(·) ≺ y^(·) ≺ θ1^(·) ≺ · · · ≺ θm^(·) ≺ x^(·),
where x contains any unmeasured variable. Assume that the output of the algorithm is an
auto-reduced set with the following form
    p0(u, u̇, . . . , y, ẏ, . . .),  θ̇1 − θ1,
    p2(u, u̇, . . . , y, ẏ, . . . , θ1, θ2),  . . . ,
    pm(u, u̇, . . . , y, ẏ, . . . , θ1, . . . , θm),  . . .
Furthermore, assume that there exists a solution to this set such that all polynomials in
G are nonzero. Then there are infinitely many values of θ compatible with the u and y of
this solution, that is, the system is unidentifiable.
Proof: Fixing the values of u and y in the first polynomial p0 to those of the given solution means that the corresponding equation is always satisfied. The parameter θ1 can
now be changed to a new arbitrary constant in the second polynomial p2 . If this change
is small enough the remaining equations can now be solved successively for the leader
due to the nonvanishing of the separants (the implicit function theorem, see, for instance,
Rudin, 1976).
The above theorem implies that, in this particular case, if Ritt’s algorithm is not able
to eliminate the time-shift for at least one of the parameters, not necessarily the first
parameter, then the model structure is unidentifiable. Thus, even though we were not
able to prove that every globally identifiable model structure may be rewritten as a linear
regression model, this indicates that Ritt’s algorithm still can be useful when analyzing
the identifiability of discrete-time model structures. Ritt’s algorithm has some interesting
implications for the parameter estimation problem in system identification, since it, in
some cases, results in a linear regression.
6   Difference Algebra and System Identification
6.4 Identification Aspects
In this section we will discuss the use of Ritt’s algorithm from a parameter estimation
perspective. First we discuss some relations to other known methods.
An often-used concept in system identification is the methodology of over-parametrization (see, for instance, Bai, 1998), which can be described as follows: in a regression problem where some coefficients are nonlinear functions of the parameters, replace these functions with new parameters. This concept is easiest to illustrate via an example.
Example 6.3: (Over-parametrization)
Again, consider the problem of estimating the parameter θ in
    y(t) = θu(t) + θ²u(t − 1) + e(t),        (6.20)
given a dataset (3.1). This problem can be solved by defining θ1 = θ and θ2 = θ² and solving the resulting linear regression by least squares, that is, by using over-parametrization. Since one is only interested in the estimate θ̂1, one can just ignore the resulting estimate θ̂2 or reconcile it with the estimate θ̂1. Now, let us try a different approach. Time shifting (6.20),

    y = θu + θ²u̇ + e,

yields

    ẏ = θu̇ + θ²ü + ė,

keeping in mind that θ is constant. Stacking the equations results in the over-parametrized linear regression model
    [ y ; ẏ ] = Φ [ θ ; θ² ] + [ e ; ė ],   where Φ ≜ [ u  u̇ ; u̇  ü ].        (6.21)
Assuming that the matrix Φ is invertible, one can multiply (6.21) by (det Φ)Φ⁻¹ to get

    [ üy − u̇ẏ ; uẏ − u̇y ] = (üu − u̇²) [ θ ; θ² ] + [ üe − u̇ė ; uė − u̇e ].
The first equation is exactly the result of Ritt’s algorithm (see (6.19)). Thus, there seems
to be a connection between over-parametrization and Ritt’s algorithm. This possible connection needs to be analyzed further in future work.
Another popular method in elimination theory is the so-called Gröbner basis (see, for example, Mishra, 1993), which is mainly used to solve systems of static polynomial equations. Let us consider a simple example.
Example 6.4: (Gröbner basis)
Let us, once again, consider Example 6.3. By time-shifting (6.20) and remembering that θ is constant, we get the following system of equations

    y = θu + θ²u̇ + e,
    ẏ = θu̇ + θ²ü + ė.
Here, we will consider the variables (u, u̇, ü, y, ẏ, e, ė) as different static entities. Now,
using the Gröbner basis algorithm implemented in MATHEMATICA via the command
    GroebnerBasis({y − θu − θ²u̇ − e, ẏ − θu̇ − θ²ü − ė}, {θ}),
results in five different polynomials where
    üy − u̇ẏ = (üu − u̇²)θ + üe − u̇ė,
is one of them. This is the same linear regression (6.19) as the one derived using Algorithm 6. The advantage of using the Gröbner basis framework, in place of the methods presented here, is that well-tested and efficient implementations are available. The drawback is that one needs to know beforehand which variables are present in the final result, here (u, u̇, ü, y, ẏ, e, ė). These can be found by trial and error, by adding time-shifted equations to the original until the desired result appears, but to the author's knowledge no one-shot method exists. It would also be interesting to investigate whether, using the same ordering, the Gröbner basis method and Ritt's algorithm in general yield the same results in cases where the variables appearing in the resulting linear regression model are known.
Now, let us consider a more complex system with two parameters using Ritt’s algorithm.
Example 6.5
Consider the problem of estimating the parameters θ = (θ1, θ2)ᵀ in

    y + θ1ẏ + θ1θ2ÿ = θ2u + e,        (6.22)
where the dot denotes backward shift, given input and output data. Running Ritt's algorithm on (6.22) yields the linear regression η = ϕᵀθ + ν, where

    η = [ uy^(3) − u̇ÿ + ÿẏ − y^(3)y ;  u̇ÿẏ − uÿ² − ÿẏ² + ÿ²y ],            (6.23a)
    ϕ = [ y^(3)ẏ − ÿ²,  0 ;  0,  uy^(3)ÿ − u̇ÿ² + ÿ²ẏ − y^(3)ÿy ],          (6.23b)
    ν = [ ėÿ − ey^(3) ;  eÿ² − ėÿẏ + ey^(3)ÿθ2 − ėÿ²θ2 ].                  (6.23c)
Notice that in the expression for θ1 there are second-order products between the different signals, while in the expression for θ2 there are exclusively third-order products. This implies that it will be more difficult, in some sense, to estimate the second parameter than the first. Changing the ordering of the parameters when using the algorithm will yield the opposite result.
Now, let us try to do the elimination differently. As before, time shift (6.22),

    ẏ + θ1ÿ + θ1θ2y^(3) = θ2u̇ + ė,        (6.24)
keeping in mind that the parameters are constant in time. Multiplying (6.22) by y^(3), (6.24) by ÿ, and subtracting the results, the following holds

    yy^(3) − ẏÿ = (ÿ² − ẏy^(3))θ1 + (uy^(3) − u̇ÿ)θ2 + ey^(3) − ėÿ.        (6.25)
This linear regression shares with (6.23) the element of the regressor vector corresponding to the first parameter, while the element corresponding to the second parameter now contains only second-order products. This is due to the fact that Ritt's algorithm continues until it has one equation for each parameter. So the question that arises is: can Ritt's algorithm be modified to take these things into account? There are some possibilities:
1) What enables the calculations above is that all the parameters already occur linearly in the original equation (6.22). Thus, the only thing left to eliminate is the single parameter nonlinearity. Therefore, if the parameter nonlinearities could be ordered higher than the linear occurrences, then one could stop the algorithm prematurely to obtain the linear regression given in (6.25).
2) Another alternative is to stop the algorithm when the first linear regression has been determined, that is, the equation that determines θ1. Then one could estimate this parameter, feed it back into the original equation (6.22), and from there estimate θ2.
Further analysis of possible modifications of Ritt’s algorithm, to tailor it for system identification, needs to be made.
Use of the elimination schemes presented above, that is, Ritt’s algorithm and similar methods, yields linear regression models where the noise is deformed. So when using these
equations to estimate the unknown parameters, a tool to deal with these noise transformations is needed. One such tool is the instrumental variables (IV) method, see Section 3.3.
Example 6.6
Once again, consider the estimation problem presented in Example 6.5. Depending on
the input excitation, this problem could be solved using over-parametrization, see Example 6.3. Despite this, let us try using the result (6.25).
Trying to solve (6.25) directly via the least-squares method will yield a biased estimate of the θ parameters, since the regression vector is correlated with the noise. Thus,
instruments, uncorrelated with the noise, need to be chosen.
Now, let us perform a Monte Carlo simulation study of the estimation of the parameters θ via (6.25), with the least-squares (LS) method and the IV method, respectively. The true parameters are given by θ = (−0.1, 0.2)ᵀ. Let the noise and the input be independent white Gaussian processes with zero mean and unit variance. With M = 100 Monte Carlo runs and data lengths of 10000, this yields

    (1/M) Σ_{k=1}^{M} θ̂_k^LS = (−0.0750, 0.1999)ᵀ,    (1/M) Σ_{k=1}^{M} θ̂_k^IV = (−0.0977, 0.2002)ᵀ,        (6.26)

where u̇² has been used as instrument for the first parameter and the regressor as instrument for the second parameter. As predicted, the LS estimator appears to yield a biased estimate, while the IV estimate appears to be unbiased.
So far, only measured signals, except for the additive noise, have appeared. Below, the parameter estimation of a nonlinear state-space model is considered. Here, the unknown
states need to be eliminated completely, and therefore they are ordered higher than any
time shift of the parameters, that is, the following ordering is used
    u^(·) ≺ y^(·) ≺ θ1^(·) ≺ θ2^(·) ≺ x^(·).
Example 6.7
Consider the problem of estimating the parameters θ = (θ1, θ2)ᵀ in the nonlinear state-space model

    x1(t + 1) = θ1 x1(t) + θ2 x2(t) + u²(t),        (6.27a)
    x2(t + 1) = x1²(t),                             (6.27b)
    y(t) = x1(t) + e(t).                            (6.27c)
Using Ritt's algorithm yields quite lengthy results, and below we only present the linear regression equation η1 = ϕ1 θ1 + ν1 for the first parameter, where

    η1 = ü²ÿ² − u̇²(y^(3))² − ÿ²ẏ + (y^(3))²y,
    ϕ1 = (y^(3))²ẏ − ÿ³,

and

    ν1 = (e^(3))²e − ë²ė − ë²ü² + (e^(3))²u̇² − 2e^(3)ey^(3)
         − 2e^(3)u̇²y^(3) + e(y^(3))² + 2ëėÿ + 2ëü²ÿ − ėÿ²
         − (e^(3))²y + 2e^(3)y^(3)y + ë²ẏ − 2ëÿẏ
         + [ ë³ − (e^(3))²ė + 2e^(3)ėy^(3) − ė(y^(3))²
           − 3ë²ÿ + 3ëÿ² + (e^(3))²ẏ − 2e^(3)y^(3)ẏ ] θ1.
We notice that the linear regression becomes quite complicated, but the main complexity lies in the transformed noise. This means that the main problem lies in choosing the appropriate instruments. If such instruments can be found, these equations enable a way to get an initial estimate of the first parameter, which can later be refined using the PEM. Finding good instruments when using the resulting linear regression of Ritt's algorithm seems difficult. Thus, an automatic method for finding instruments is needed if the method is to be practically applicable.
6.5 Discussion
In this chapter a discrete-time version of Ritt’s algorithm, similar to the one given in
continuous-time, has been presented. The difference lies in the number of pseudo-divisions needed to reduce the polynomials (see the discussion following Lemma 6.3). Yet
another deviation from the continuous-time case became apparent when the generalization
of the identifiability results presented in Ljung and Glad (1994) was attempted. In the discrete-time case, only parts of these results could be established, since a certain property of analytic functions utilized in the continuous-time case was shown, by a simple example, not to hold for discrete-time signals (see Section 6.3).
In Section 6.4, certain aspects of Ritt's algorithm as a tool for finding equations that simplify the computation of initial estimates for certain nonlinear model structures were analyzed. It turns out that the generalization to discrete time makes it possible to deal with noise, provided that one can find instruments which handle the transformed noise model. Furthermore, it was shown through examples that Ritt's algorithm goes unnecessarily far in the algebraic manipulations. The result is a linear regression model containing higher-degree polynomials than necessary.
In the future, it would be interesting to examine the possibilities of tailoring Ritt's algorithm for system identification purposes. Also, the connection between over-parametrization, Gröbner basis methods, and Ritt's algorithm, as indicated in Examples 6.3 and 6.4, needs further thought. Furthermore, if the resulting linear regression model
and 6.4, needs further thought. Furthermore, if the resulting linear regression model
should be used for parameter estimation, how should one choose the instruments to get
unbiased estimates? Also, for which model classes can one guarantee that the application
of the discrete-time Ritt’s algorithm results in a linear regression model? In such cases, is
it possible to determine if the least-squares estimate will be unbiased? These are all open
questions for future research.
Bibliography
H. Akaike. Fitting autoregressive models for prediction. Annals of the Institute of Statistical Mathematics, 21(2):243–247, 1969.
E. Amaldi and V. Kann. On the approximability of minimizing nonzero variables or
unsatisfied relations in linear systems. Theoretical Computer Science, 209:237–260,
1998.
E. Bai. An optimal two-stage identification algorithm for Hammerstein-Wiener nonlinear
systems. Automatica, 34(3):333–338, 1998.
D. Bauer. Estimating linear dynamical systems using subspace methods. Econometric
Theory, 21:181–211, 2005.
D. Bauer. Estimating ARMAX systems for multivariate time series using the state approach to subspace algorithms. Journal of Multivariate Analysis, 100:397–421, 2009.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37
(4):373–384, 1995.
G. Casella and R. L. Berger. Statistical Inference. Duxbury, 2nd edition, 2002.
S. Diop. Elimination in control theory. Mathematics of Control, Signals and Systems, 4
(1):17–32, 1991.
N. Draper and H. Smith. Applied Regression Analysis. Wiley, 2nd edition, 1981.
J. Durbin. Efficient estimation of parameters in moving-average models. Biometrika, 46:
306–316, 1959.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of
Statistics, 32(2):407–451, 2004.
M. Fliess and T. Glad. An algebraic approach to linear and nonlinear control. In Essays
on Control: Perspectives in the Theory and Its Applications, volume 14 of Progress in
System and Control Theory, pages 190–198. Birkhäuser, 1994.
G. M. Furnival and R. W. Wilson. Regressions by leaps and bounds. Technometrics, 16
(4):499–511, 1974.
H. Garnier and L. Wang. Identification of Continuous-time Models from Sampled Data.
Springer, 2008.
K. F. Gauss. Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium. Reprinted translation: Theory of the Motion of the Heavenly Bodies Moving About the Sun in Conic Sections. Dover, 1809.
T. Glad. Solvability of differential algebraic equations and inequalities: An algorithm. In
Proceedings European Control Conference, Brussels, Belgium, 1997.
G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press,
3rd edition, 1996.
N. I. M. Gould. An algorithm for large-scale quadratic programming. IMA Journal of
Numerical Analysis, 11:299–324, 1991.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer, 2nd edition, 2009.
B. Haverkamp. State Space Identification, Theory and Practice. PhD thesis, Technische
Universiteit Delft, Delft, The Netherlands, 2001.
T. Hesterberg, N. H. Choi, L. Meier, and C. Fraley. Least angle and ℓ1 penalized regression: A review. Statistics Surveys, 2:61–93, 2008.
A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics, 12:55–67, 1970.
M. Jansson and B. Wahlberg. A linear regression approach to state-space subspace system
identification. Signal Processing, 52:103–129, 1996.
H. K. Khalil. Nonlinear Systems. Prentice Hall, 3rd edition, 2002.
E. Kolchin. Differential Algebra and Algebraic Groups. Academic Press, 1973.
S. Kotsios. An application of Ritt’s remainder algorithm to discrete polynomial control
systems. IMA Journal of Mathematical Control and Information, 18(18):19–29, 2001.
S. Lang. Algebra. Springer-Verlag, 3rd revised edition, 2002.
L. Ljung. System Identification: Theory for the User. Prentice Hall, 2nd edition, 1999.
L. Ljung. System Identification Toolbox, User's Guide. MathWorks, 7th edition, 2009.
L. Ljung and T. Glad. On global identifiability for arbitrary model parametrizations.
Automatica, 30(2):265–276, 1994.
C. Lyzell and G. Hovland. Verification of the dynamics of the 5-DOF Gantry-Tau parallel kinematic machine. In Proceedings of Robotics and Applications and Telematics,
Würzburg, Germany, August 2007.
C. Lyzell, J. Roll, and L. Ljung. The use of nonnegative garrote for order selection of
ARX models. In Proceedings of the 47th IEEE Conference on Decision and Control,
Cancun, Mexico, December 2008.
C. Lyzell, M. Enqvist, and L. Ljung. Handling certain structure information in subspace
identification. In Preprints of the 15th IFAC Symposium on System Identification,
Saint-Malo, France, July 2009a.
C. Lyzell, T. Glad, M. Enqvist, and L. Ljung. Identification aspects of Ritt’s algorithm for
discrete-time systems. In Preprints of the 15th IFAC Symposium on System Identification, Saint-Malo, France, July 2009b.
T. McKelvey and H. Akçay. An efficient frequency domain state-space identification algorithm. In Proceedings of the 33rd IEEE Conference on Decision and Control, pages 373–378, Lake Buena Vista, Florida, 1994.
B. Mishra. Algorithmic Algebra. Texts and Monographs in Computer Science. Springer-Verlag, 1993.
S. Niu, L. Ljung, and Å. Björck. Decomposition methods for solving least-squares parameter estimation. IEEE Transactions on Signal Processing, 44(11):2847–2852, 1996.
J. Nocedal and S. Wright. Numerical Optimization. Springer, 2nd edition, 2006.
G. P. Rao and H. Garnier. Numerical illustrations of the relevance of direct continuous-time model identification. In Proceedings of the 15th IFAC World Congress, Barcelona,
Spain, July 2002.
O. Reiersøl. Confluence analysis by means of lag moments and other methods of confluence analysis. Econometrica, 9:1–23, 1941.
E. Reynders and G. De Roeck. Subspace identification of (AR)ARMAX, Box-Jenkins,
and generalized model structures. In Preprints of the 15th IFAC Symposium on System
Identification, Saint-Malo, France, July 2009.
J. Rissanen. Modelling by shortest data description. Automatica, 14:175–182, 1978.
J. F. Ritt. Differential Algebra. Dover Publications, Inc., 1950.
J. Roll. Piecewise linear solution paths with application to direct weight optimization.
Automatica, 44(11):2745–2753, 2008.
O. Romanko. An interior point approach to quadratic and parametric quadratic optimization. Master’s thesis, McMaster University, Hamilton, Ontario, 2004.
S. Rosset and J. Zhu. Piecewise linear regularized solution paths. Annals of Statistics, 35
(3):1012–1030, 2007.
W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, 3rd edition, 1976.
B. Savas and D. Lindgren. Rank reduction and volume minimization approach to state-space subspace system identification. Signal Processing, 86(11):3275–3285, 2006.
T. Söderström. On the uniqueness of maximum likelihood identification. Automatica, 11:
193–197, 1975.
T. Söderström and P. Stoica. Instrumental Variable Methods for System Identification.
Springer, 1980.
T. Söderström and P. Stoica. System Identification. Prentice Hall, 1989.
M. Stein, J. Branke, and M. Schmeck. Efficient implementation of an active set algorithm for large-scale portfolio selection. Computers & Operations Research, 35:3945–3961, 2008.
P. Stoica and R. Moses. Spectral Analysis of Signals. Prentice Hall, 2005.
R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal
Statistical Society: Series B, 58(1):267–288, 1996.
P. Tøndel, T. A. Johansen, and A. Bemporad. Further results on multiparametric quadratic
programming. In Conference on Decision and Control, pages 3173–3178, Maui,
Hawaii, 2003.
R. Tóth. Modeling and Identification of Linear Parameter-Varying Systems, an Orthonormal Basis Function Approach. PhD thesis, Technische Universiteit Delft, Delft, The
Netherlands, 2008.
R. Tóth, C. Lyzell, M. Enqvist, P. Heuberger, and P. Van den Hof. Order and structural
dependence selection of LPV-ARX models using a nonnegative garrote approach. In
Proceedings of the 48th IEEE Conference on Decision and Control, Shanghai, China,
December 2009.
P. Van Overschee and B. De Moor. Subspace Identification for Linear Systems. Kluwer
Academic Publishers, 1996a.
P. Van Overschee and B. De Moor. Continuous-time frequency domain subspace system
identification. Signal Processing, 52(5):179–194, 1996b.
M. Verhaegen and V. Verdult. Filtering and System Identification. Cambridge University
Press, 2007.
M. Viberg, B. Wahlberg, and B. Ottersten. Analysis of state space system identification methods based on instrumental variables and subspace fitting. Automatica, 33(9):
1603–1616, 1997.
L. L. Xie and L. Ljung. Estimate physical parameters by black-box modeling. In Proceedings of the 21st Chinese Control Conference, pages 673–677, Hangzhou, China,
August 2002.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society: Series B, 68(1):49–67, 2006.
M. Yuan, V. R. Joseph, and H. Zou. Structured variable selection and estimation. Technical report, Georgia Institute of Technology, Atlanta, Georgia, 2007. To appear in
Annals of Applied Statistics.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical
Association, 101:1418–1429, 2006.
Licentiate Theses
Division of Automatic Control
Linköping University
P. Andersson: Adaptive Forgetting through Multiple Models and Adaptive Control of Car Dynamics. Thesis No. 15, 1983.
B. Wahlberg: On Model Simplification in System Identification. Thesis No. 47, 1985.
A. Isaksson: Identification of Time Varying Systems and Applications of System Identification to
Signal Processing. Thesis No. 75, 1986.
G. Malmberg: A Study of Adaptive Control Missiles. Thesis No. 76, 1986.
S. Gunnarsson: On the Mean Square Error of Transfer Function Estimates with Applications to
Control. Thesis No. 90, 1986.
M. Viberg: On the Adaptive Array Problem. Thesis No. 117, 1987.
K. Ståhl: On the Frequency Domain Analysis of Nonlinear Systems. Thesis No. 137, 1988.
A. Skeppstedt: Construction of Composite Models from Large Data-Sets. Thesis No. 149, 1988.
P. A. J. Nagy: MaMiS: A Programming Environment for Numeric/Symbolic Data Processing.
Thesis No. 153, 1988.
K. Forsman: Applications of Constructive Algebra to Control Problems. Thesis No. 231, 1990.
I. Klein: Planning for a Class of Sequential Control Problems. Thesis No. 234, 1990.
F. Gustafsson: Optimal Segmentation of Linear Regression Parameters. Thesis No. 246, 1990.
H. Hjalmarsson: On Estimation of Model Quality in System Identification. Thesis No. 251, 1990.
S. Andersson: Sensor Array Processing; Application to Mobile Communication Systems and Dimension Reduction. Thesis No. 255, 1990.
K. Wang Chen: Observability and Invertibility of Nonlinear Systems: A Differential Algebraic
Approach. Thesis No. 282, 1991.
J. Sjöberg: Regularization Issues in Neural Network Models of Dynamical Systems. Thesis
No. 366, 1993.
P. Pucar: Segmentation of Laser Range Radar Images Using Hidden Markov Field Models. Thesis
No. 403, 1993.
H. Fortell: Volterra and Algebraic Approaches to the Zero Dynamics. Thesis No. 438, 1994.
T. McKelvey: On State-Space Models in System Identification. Thesis No. 447, 1994.
T. Andersson: Concepts and Algorithms for Non-Linear System Identifiability. Thesis No. 448,
1994.
P. Lindskog: Algorithms and Tools for System Identification Using Prior Knowledge. Thesis
No. 456, 1994.
J. Plantin: Algebraic Methods for Verification and Control of Discrete Event Dynamic Systems.
Thesis No. 501, 1995.
J. Gunnarsson: On Modeling of Discrete Event Dynamic Systems, Using Symbolic Algebraic
Methods. Thesis No. 502, 1995.
A. Ericsson: Fast Power Control to Counteract Rayleigh Fading in Cellular Radio Systems. Thesis
No. 527, 1995.
M. Jirstrand: Algebraic Methods for Modeling and Design in Control. Thesis No. 540, 1996.
K. Edström: Simulation of Mode Switching Systems Using Switched Bond Graphs. Thesis
No. 586, 1996.
J. Palmqvist: On Integrity Monitoring of Integrated Navigation Systems. Thesis No. 600, 1997.
A. Stenman: Just-in-Time Models with Applications to Dynamical Systems. Thesis No. 601, 1997.
M. Andersson: Experimental Design and Updating of Finite Element Models. Thesis No. 611,
1997.
U. Forssell: Properties and Usage of Closed-Loop Identification Methods. Thesis No. 641, 1997.
M. Larsson: On Modeling and Diagnosis of Discrete Event Dynamic Systems. Thesis No. 648, 1997.
N. Bergman: Bayesian Inference in Terrain Navigation. Thesis No. 649, 1997.
V. Einarsson: On Verification of Switched Systems Using Abstractions. Thesis No. 705, 1998.
J. Blom, F. Gunnarsson: Power Control in Cellular Radio Systems. Thesis No. 706, 1998.
P. Spångéus: Hybrid Control using LP and LMI methods – Some Applications. Thesis No. 724,
1998.
M. Norrlöf: On Analysis and Implementation of Iterative Learning Control. Thesis No. 727, 1998.
A. Hagenblad: Aspects of the Identification of Wiener Models. Thesis No. 793, 1999.
F. Tjärnström: Quality Estimation of Approximate Models. Thesis No. 810, 2000.
C. Carlsson: Vehicle Size and Orientation Estimation Using Geometric Fitting. Thesis No. 840,
2000.
J. Löfberg: Linear Model Predictive Control: Stability and Robustness. Thesis No. 866, 2001.
O. Härkegård: Flight Control Design Using Backstepping. Thesis No. 875, 2001.
J. Elbornsson: Equalization of Distortion in A/D Converters. Thesis No. 883, 2001.
J. Roll: Robust Verification and Identification of Piecewise Affine Systems. Thesis No. 899, 2001.
I. Lind: Regressor Selection in System Identification using ANOVA. Thesis No. 921, 2001.
R. Karlsson: Simulation Based Methods for Target Tracking. Thesis No. 930, 2002.
P.-J. Nordlund: Sequential Monte Carlo Filters and Integrated Navigation. Thesis No. 945, 2002.
M. Östring: Identification, Diagnosis, and Control of a Flexible Robot Arm. Thesis No. 948, 2002.
C. Olsson: Active Engine Vibration Isolation using Feedback Control. Thesis No. 968, 2002.
J. Jansson: Tracking and Decision Making for Automotive Collision Avoidance. Thesis No. 965,
2002.
N. Persson: Event Based Sampling with Application to Spectral Estimation. Thesis No. 981, 2002.
D. Lindgren: Subspace Selection Techniques for Classification Problems. Thesis No. 995, 2002.
E. Geijer Lundin: Uplink Load in CDMA Cellular Systems. Thesis No. 1045, 2003.
M. Enqvist: Some Results on Linear Models of Nonlinear Systems. Thesis No. 1046, 2003.
T. Schön: On Computational Methods for Nonlinear Estimation. Thesis No. 1047, 2003.
F. Gunnarsson: On Modeling and Control of Network Queue Dynamics. Thesis No. 1048, 2003.
S. Björklund: A Survey and Comparison of Time-Delay Estimation Methods in Linear Systems.
Thesis No. 1061, 2003.
M. Gerdin: Parameter Estimation in Linear Descriptor Systems. Thesis No. 1085, 2004.
A. Eidehall: An Automotive Lane Guidance System. Thesis No. 1122, 2004.
E. Wernholt: On Multivariable and Nonlinear Identification of Industrial Robots. Thesis No. 1131,
2004.
J. Gillberg: Methods for Frequency Domain Estimation of Continuous-Time Models. Thesis
No. 1133, 2004.
G. Hendeby: Fundamental Estimation and Detection Limits in Linear Non-Gaussian Systems.
Thesis No. 1199, 2005.
D. Axehill: Applications of Integer Quadratic Programming in Control and Communication. Thesis
No. 1218, 2005.
J. Sjöberg: Some Results On Optimal Control for Nonlinear Descriptor Systems. Thesis No. 1227,
2006.
D. Törnqvist: Statistical Fault Detection with Applications to IMU Disturbances. Thesis No. 1258,
2006.
H. Tidefelt: Structural algorithms and perturbations in differential-algebraic equations. Thesis
No. 1318, 2007.
S. Moberg: On Modeling and Control of Flexible Manipulators. Thesis No. 1336, 2007.
J. Wallén: On Kinematic Modelling and Iterative Learning Control of Industrial Robots. Thesis
No. 1343, 2008.
J. Harju Johansson: A Structure Utilizing Inexact Primal-Dual Interior-Point Method for Analysis
of Linear Differential Inclusions. Thesis No. 1367, 2008.
J. D. Hol: Pose Estimation and Calibration Algorithms for Vision and Inertial Sensors. Thesis
No. 1370, 2008.
H. Ohlsson: Regression on Manifolds with Implications for System Identification. Thesis
No. 1382, 2008.
D. Ankelhed: On low order controller synthesis using rational constraints. Thesis No. 1398, 2009.
P. Skoglar: Planning Methods for Aerial Exploration and Ground Target Tracking. Thesis
No. 1420, 2009.
C. Lundquist: Automotive Sensor Fusion for Situation Awareness. Thesis No. 1422, 2009.