Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete
Global optimization methods for estimation of
descriptive models
Examensarbete utfört i Reglerteknik
vid Tekniska högskolan i Linköping
av
Tobias Pettersson
LITH-ISY-EX--08/3994--SE
Handledare:
Henrik Tidefelt
isy, Linköpings universitet
Gunnar Cedersund
ibk, Linköpings universitet
Examinator:
Jacob Roll
isy, Linköpings universitet
Linköping, 28 March, 2008
Division, Department: Division of Automatic Control, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden
Date: 2008-03-28
Language: English (Engelska)
Report category: Examensarbete
ISBN: —
ISRN: LITH-ISY-EX--08/3994--SE
Title of series, numbering / ISSN: —
URL for electronic version: http://www.control.isy.liu.se, http://www.ep.liu.se
Titel/Title: Globala optimeringsmetoder för identifiering av beskrivande modeller (Global optimization methods for estimation of descriptive models)
Author: Tobias Pettersson
Keywords: systems theory, global optimization, descriptive models
Abstract

Using mathematical models with the purpose of understanding and storing knowledge about a system is not a new field in science, with early contributions dating back to, e.g., Kepler's laws of planetary motion.
The aim is to obtain such a comprehensive predictive and quantitative knowledge about a phenomenon that mathematical expressions, or models, can be used to forecast every relevant detail about that phenomenon. Such models can be used for reducing pollution from car engines, preventing aviation incidents, or developing new therapeutic drugs.
Models used to forecast, or predict, the behavior of a system are referred to as predictive models. For such models, the estimation problem aims to find one model; it is well known and can be handled by using standard methods for global nonlinear optimization.
Descriptive models are used to obtain and store quantitative knowledge of a system. Estimation of descriptive models has not been much described in the literature so far; instead the methods used for predictive models have been applied. Rather than finding one particular model, the parameter estimation for descriptive models aims to find every model that contains descriptive information about the system. Thus, the parameter estimation problem for descriptive models cannot be stated as a standard optimization problem.
The main objective of this thesis is to propose methods for estimation of descriptive models. This is done by using methods for nonlinear optimization, including both new and existing theory.
Sammanfattning

Using descriptive models with the purpose of understanding and storing knowledge about a system is not a new area in science; early examples date back to Kepler's models for planetary motion. The ultimate aim is to reach such knowledge about a domain that mathematical expressions or models can be used to predict every detail within that domain. Such knowledge can be used to reduce emissions from car engines, to avoid aviation accidents through diagnosis and monitoring, or to develop new drugs.
Models used to forecast, or predict, the behavior of a system are called predictive models. For these, the problem of estimating the model's unknown parameters is well treated in the literature and can be solved using known methods for nonlinear optimization.
Descriptive models are used to collect and store knowledge about a system's inner composition. So far, estimation of descriptive models has been carried out using methods intended for predictive models. Instead of trying to find one model, as for predictive models, we want to find a set of models describing the system. Every model that contains descriptive information about the system is of interest.
The aim of this work is to propose methods for estimation of descriptive models. This is done by using new as well as already known methods for global optimization.
Acknowledgments
First I want to thank Jacob Roll, Gunnar Cedersund, and Henrik Tidefelt for their support, patience, and helpful insights, and for always giving good answers to my questions. I also would like to thank my father Daniel, my mother Kerstin, Magnus and his wife Linda, David, Josef, and Elias. Your support has been, and will be, very important to me.
Contents

1 Introduction
  1.1 Systems theory
    1.1.1 An overview
    1.1.2 Mathematical modeling
    1.1.3 Model types
    1.1.4 Model applications
    1.1.5 Parameter estimation
  1.2 Systems biology
    1.2.1 Modeling process
  1.3 Thesis objectives
  1.4 Thesis outline

2 Thesis background
  2.1 Nonlinear systems
    2.1.1 Nonlinear State-Space Models
  2.2 Parameter estimation
    2.2.1 Set of possible models
    2.2.2 Prediction error minimization
    2.2.3 Predictive models
    2.2.4 Descriptive models

3 Theory
  3.1 Global optimization
  3.2 Simulated Annealing
    3.2.1 Simulated Annealing Algorithm
    3.2.2 Cluster analysis approach
    3.2.3 Function estimation approach
  3.3 Clustering methods for Global Optimization
  3.4 Test function generation

4 Results
  4.1 Test function suite
    4.1.1 Building a test function
  4.2 Optimization algorithm results
    4.2.1 Gaussian Mixture Modeling Approach (MSAC)
    4.2.2 Local Weighted Regression Approach (MSAF)
    4.2.3 Multilevel Single Linkage (MSL)
  4.3 Evaluation of the benchmark problem

5 Conclusions and proposals for future work
  5.1 Thesis conclusions
  5.2 Proposals for future work

Bibliography
Chapter 1
Introduction
A desirable aim within many fields of science is to obtain such a comprehensive predictive and quantitative knowledge about a phenomenon that mathematical expressions, or models, can be used to forecast every relevant detail about that phenomenon. Such models can be applied to a wide range of applications, e.g., reducing pollution from car engines, preventing aviation incidents through diagnosis and surveillance of airplanes, or developing new therapeutic drugs.
Clearly, this is a vast task that requires knowledge of the specific science field, e.g., biology, chemistry, or economics, as well as general knowledge of how to structure, estimate, and prove the validity of a model [4].
The main problem of this thesis is connected with the model estimation mentioned above. Throughout the presentation, special focus is placed on identification of biological systems, a field named systems biology.
This introductory chapter is intended to serve as a motivation for the thesis, including a brief introduction to the problem of system identification and the field of systems biology. After introducing these basic concepts, the thesis objectives and an outline of this report are presented.
1.1 Systems theory

1.1.1 An overview
Systems theory truly must be considered an interdisciplinary field of science, with applications to, e.g., mechanics, biology, economics, and sociology. There exist many descriptions of a system. One is that a system is a group of elements forming a complex whole [7]. In [15] a more mechanistic view is presented: a system is described as an object in which signals interact and produce observable outputs. The system is affected by external stimuli, which are called inputs if they are controllable and disturbances otherwise. The disturbances are divided into those that can be measured directly and those that cannot. Figure 1.1 shows a graphical representation of a system.
Clearly, the concept of a system is very broad and the elements do not even have to be of a physical kind.
Figure 1.1: A system with controllable input u, observable output y, disturbances that are possible to measure (w), and those that are not (v).
The problem of system identification is to obtain knowledge of the relationship between the external stimuli and the observable output of the system. Such
knowledge is called a model of the system.
Models can be of various kinds and levels of sophistication, adjusted to serve the model's intended use. Knowing how to ride a bike is an example of a mental model. The mental model describing how to ride a bike and how the maneuvers are affected by, e.g., wind gusts emerges during the learning phase. Another type of model is a graphical model, where the relationship between the variables is represented graphically as plots, diagrams, or tables. This thesis deals with a third kind, mathematical models. A mathematical model aims to describe the interaction between the external stimuli and the observable output by use of mathematical expressions [15].
1.1.2 Mathematical modeling
For all the kinds of models described above, the model is built from a combination of observed data and prior knowledge. For example, a child learning to ride a bike will experience repeated failures before the model is good enough for its intended use. Mathematical models are developed mainly by two different approaches: modeling and system identification.
In modeling, the system is divided into subsystems whose properties are known. The subsystems are then put together, forming a model for the whole system. An introduction to modeling is presented in [16].
The other approach is system identification, in which a model structure is adjusted to fit experimental data.
Please note the distinction between the model structure and the identified model. A model structure is a set of possible candidate models, a user's choice built upon a priori knowledge within the specific science field. A model is a certain candidate model included in the model structure.
The system identification process consists of three basic entities: a data set of input-output data, a suitable model structure, and an evaluation method to determine the best model within the chosen structure given the data set. A thorough framework for system identification is presented in [15].
1.1.3 Model types
Each of these two approaches, or a combination of them, gives rise to three different kinds of mathematical models.
Black-box models Models only used for prediction do not require knowledge
about the intrinsic interactions between the variables. Since it is the relation
between the external stimuli and the observable output that is of interest, the
system can be considered to be a “black box” that describes the input-output
behavior [4]. An extensive framework for black-box modelling is presented
in [15].
White-box models It can be possible to describe a system by mathematical
expressions without the need for measurement data. Such models are referred
to as white-box models. The knowledge required to build a white box model
is often application dependent, e.g., mechanical models can be built upon
the knowledge of mechanics.
Gray-box models A common case of system modelling is when the basic model
structure can be postulated but some parameters within it are unknown.
The problem is to estimate the unknown parameters so that the model fits
the measured data.
Such models are called gray-box models and are the kind of models this thesis focuses on.
1.1.4 Model applications
A model can be used to forecast, or predict, the behavior of a system. Predictive
knowledge is useful in many applications, e.g., control, diagnosis, or surveillance.
Since only the input-output relation is of interest for prediction, all model types
listed above can be used. Models used for prediction of a system are referred to as predictive models [4], [26].
Models can also be used to obtain and store quantitative knowledge of a system [15], [26]. The modeling process, further described in Section 1.2.1, aims to explore the validity of some assumptions or hypotheses about the system. Since the intrinsic structure of the model must be known, only white- and gray-box models can be used for this application. This is because the intrinsic structure is used for the interpretation between the identified model and the resulting quantitative knowledge of the system. Models used for this application are referred to as descriptive models [4], [13].
1.1.5 Parameter estimation
For gray-box models, the problem of finding suitable estimates of the unknown parameters in the model is referred to as the parameter estimation problem, or the inverse problem [19]. For a nonlinear system the inverse problem can be stated as a nonlinear optimization problem. The problem of parameter estimation is further discussed in Section 2.2.
1.2 Systems biology
Systems biology aims at understanding biological networks at a system level, rather than analyzing isolated parts of a cell or an organism. Identifying and describing all the parts of an organism is like listing all the parts of an airplane. Analogous to the airplane, the assembly of a biological network has great impact on its functionality. Knowledge about the separate parts of the biological network will still be of great importance, but the focus of systems biology is mainly on understanding a system's structure and dynamics [13].
The dynamic modeling of a biological system can be stated as a reverse problem, i.e., from measurements describing the whole system, obtain detailed knowledge about every single element of the system. This problem requires a well-suited model structure of the system, efficient methods to identify the model parameters, and experimental data of good quality [26].
1.2.1 Modeling process
As mentioned above, the modeling of descriptive models aims to reveal and store quantitative knowledge about every element and detail of the system [13].
Consider the modeling process presented in Figure 1.2, which illustrates a modeling framework described in [4]. The framework combines strengths from gray-box modeling based on mechanistic knowledge with statistical methods for system identification.
1. In Phase I a model structure is built based on mechanistic explanations and
assumptions about the system.
Parameter estimation is then used to fit the model to estimation data. In practice, this is done by adjusting the unknown parameters according to a parameter estimation method. One common parameter estimation method minimizes the prediction errors of the model [15], and is further explained in Section 2.2.2.
If an acceptable agreement between the model and the estimation data can
be achieved, the model validity should be further established by the process
of model validation. One way to prove the validity of a model is to compare
the model outputs with a set of fresh data, the validation data. Another approach, preferably used if no validation data is available, is to use statistical
hypothesis tests to find the likelihood that the data has been generated by
the estimated model.
Model validation includes a number of aspects, e.g., the subjective intuitive
feeling whether the model behaves as expected, whiteness tests of the residuals, or checks of parameter confidence intervals [15].
• If the model turns out to pass the validation process it is accepted,
which indicates that the postulated model structure has captured at
least some true property of the system.
• If the model exposes considerable inconsistencies with the validation data, it must be rejected or modified.
Note that both of these lead the modeling process forward and new knowledge of the system is obtained in both cases [4].
2. The result derived in Phase I is translated into a core-box model, which is a model consisting of a minimal core model, with all details of the original gray-box models, combined with quality links from the minimal core model to its possible predictions [4]. As shown in Figure 1.2, this procedure is used to reveal necessary properties of the remaining explanations about the system.
Figure 1.2: The modeling process. In Phase I, experimental data and mechanistic explanations are combined in model-based hypothesis testing, leading to rejection of explanations. In Phase II, core-box modeling reveals necessary properties in the remaining explanations.
1.3 Thesis objectives
For predictive models, the parameter estimation problem can be stated as finding a parameter vector that minimizes the prediction errors according to some criterion. Hence, the parameter estimation problem for predictive models can be stated as a standard nonlinear optimization problem whose main purpose is to locate a global minimum of some cost function.
For estimation of descriptive models, every parameter vector that gives a model predictor that is "good enough" is of interest. Hence, the parameter estimation problem for descriptive models is to find a set of parameter vectors such that the corresponding model predictors are "good enough". For that reason, the parameter estimation problem for descriptive models cannot be stated as a standard nonlinear optimization problem.
However, we will in this thesis describe how to use nonlinear optimization
methods for estimation of descriptive models. In Section 2.2 a formal description
of the problem is formulated, and in Chapter 3 we present the methods used.
One of the methods used for global optimization is the “Annealed Downhill
Simplex” algorithm. This algorithm and global optimization in general will be
further described in Chapter 3. The idea of this thesis is to modify the algorithm
so that it can be used to handle the parameter estimation problem for descriptive
models. The thesis objectives are:
• Modify the “Annealed Downhill Simplex” algorithm so that it is possible to
restart the algorithm at multiple points at each temperature.
• Find methods to solve the problem that arises when the parameters are of
different scaling.
• Include the possibility to find sub-optimal parameter sets.
1.4 Thesis outline
As noted in this introductory chapter, the problem of parameter estimation (the inverse problem) can be stated as a nonlinear optimization problem.
Chapter 2 surveys previous research in the fields of parameter estimation for nonlinear systems, nonlinear optimization, and methods for evaluating optimization algorithms.
The theory and methods used for the thesis are presented in Chapter 3.
Chapter 4 contains the thesis results. Chapter 5 contains a discussion of the
results, followed by some conclusions and suggestions for further work.
Chapter 2
Thesis background
The intention of this chapter is to review the most basic concepts used in this thesis. Further, this chapter describes the parameter estimation problem when applied to descriptive modeling.
According to Chapter 1, modeling a dynamic system requires three basic entities: a well-suited model structure, an efficient parameter estimation method, and experimental data of good quality. The last of them, the experimental data, is in this thesis assumed to be given. The first and second entities are further described in this chapter.
2.1 Nonlinear systems
As stated in Section 1.1.4, descriptive models are used to reveal and store knowledge of a system [4]. Hence, the model structure to use must have the ability to incorporate physical insights.
2.1.1 Nonlinear State-Space Models
A model structure that fulfills those requirements is the state-space form. For
a state-space model, the system inputs u, noise, and outputs y are related by
differential or difference equations. Using an auxiliary vector x(t), the system of
equations can be written in a very compact and appealing way [15]. A nonlinear state-space model with process noise w(t), measurement noise v(t), and the unknown parameters of the system gathered in θ is given in continuous time as:
ẋ(t) = f(t, x(t), u(t), w(t); θ)
y(t) = h(t, x(t), u(t), v(t); θ).    (2.1)
Note in (2.1) the separation of the system dynamics description from the measurement equation. This is a very natural and appealing way to incorporate physical insights of the system into the model. As mentioned in Chapter 1, setting up a
model structure requires detailed knowledge about the science field for which the
model is used. In this thesis we will not go further into any specific science field.
Instead, we will use a benchmark problem to evaluate the parameter estimation
methods.
The benchmark problem aims to describe the process of insulin signaling in a
fat cell. An extensive discussion is presented in [28] and [4]. In Example 2.1 the
model structure and a picture of the signaling pathway are presented.
Example 2.1: Insulin Signaling
The benchmark problem of this thesis is to describe and store knowledge about the biological pathway shown in Figure 2.1. The pathway is an assumption about how to describe insulin signaling in fat cells.
Figure 2.1: The assumed model structure for the problem of insulin signaling in a fat cell, with states IR, IRins, IRP, IRpPTP, and IRPTP (in the ECM and cytosol) and rate constants k1, k−1, k2, k3, k−3, kR, and kD.
The model structure is given by:

d(IR)/dt = −k1 · IR + kR · IRPTP + k−1 · IRins
d(IRins)/dt = k1 · IR − k2 · IRins − k−1 · IRins
d(IRP)/dt = k2 · IRins − k3 · IRP + k−3 · IRpPTP    (2.2)
d(IRpPTP)/dt = k3 · IRP − k−3 · IRpPTP − kD · IRpPTP
d(IRPTP)/dt = kD · IRpPTP − kR · IRPTP
measIRp = kY1 · (IRP + IRpPTP)
Note that the model structure is presented in nonlinear state-space form. The last of the equations is the measured quantity, in this case a sum of the two variables IRP and IRpPTP. The system input is given by instantly adding a certain amount of insulin to the fat cell. For modeling purposes this can be described by setting initial conditions on the model states:
IR(0) = 10
IRins(0) = 0
IRP(0) = 0
IRpPTP(0) = 0
IRPTP(0) = 0.
In Figure 2.2 the measured data is presented. Clearly, the claim that a large amount of high-quality data is needed for modeling has to be relaxed for this problem.
Figure 2.2: Available measurement data for the benchmark problem.
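
As a rough illustration (this sketch is not from the thesis; it assumes Python with NumPy and SciPy, and the rate constants are placeholders rather than estimated values), the benchmark model (2.2) could be simulated and its output measIRp formed as follows:

# Hedged sketch: simulating the insulin signaling benchmark (2.2).
# The rate constants below are placeholder values, not results from the thesis.
import numpy as np
from scipy.integrate import solve_ivp

def insulin_rhs(t, x, k1, km1, k2, k3, km3, kR, kD):
    IR, IRins, IRP, IRpPTP, IRPTP = x
    dIR     = -k1 * IR + kR * IRPTP + km1 * IRins
    dIRins  =  k1 * IR - k2 * IRins - km1 * IRins
    dIRP    =  k2 * IRins - k3 * IRP + km3 * IRpPTP
    dIRpPTP =  k3 * IRP - km3 * IRpPTP - kD * IRpPTP
    dIRPTP  =  kD * IRpPTP - kR * IRPTP
    return [dIR, dIRins, dIRP, dIRpPTP, dIRPTP]

x0 = [10.0, 0.0, 0.0, 0.0, 0.0]        # initial conditions from Example 2.1
theta = dict(k1=1.0, km1=0.1, k2=1.0, k3=1.0, km3=0.1, kR=0.1, kD=0.5)
sol = solve_ivp(insulin_rhs, (0.0, 15.0), x0, args=tuple(theta.values()))

kY1 = 1.0                                # assumed measurement scaling
measIRp = kY1 * (sol.y[2] + sol.y[3])    # model output, cf. the data in Figure 2.2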
When observing a system, the signals can only be measured in discrete time. In discrete time a state-space model has the form:

x(t + 1) = fd(t, x(t), u(t), w(t); θ)
y(t) = hd(t, x(t), u(t), v(t); θ).    (2.3)

For abbreviation, the past input-output measurements available at time t are denoted:

Z^t = (y^t, u^t) = (y(1), u(1), . . . , y(t), u(t)).
The selected model structure usually includes some unknown parameters; e.g., in Example 2.1 we can find eight unknown parameters k_x. It is often favourable to collect those parameters in one vector θ. As mentioned in Chapter 1, the problem of parameter estimation is, given an experimental data set Z^t, to find values of θ such that the selected model is able to reproduce the experimental results [15].
2.2 Parameter estimation
Section 2.1 described the problem of how to select a suitable model structure
built upon physical insight. Hence, two of the three main entities of the system
identification problem have been described. The remaining task is then to determine suitable values for the unknown parameters of the model, i.e., the parameter
estimation.
2.2.1 Set of possible models
Every different choice of θ in (2.3) gives rise to a new model. If we denote the set of possible θ by D_M, the set of possible models is written as:

M* = {M(θ) | θ ∈ D_M}.    (2.4)
The parameter estimation problem can be stated differently for predictive and
descriptive models and this will be further explored below. However, for both
approaches the problem is to find an adequate mapping from the estimation data
Z N to the set DM :
Z^N → θ̂_N ∈ D_M.    (2.5)

2.2.2 Prediction error minimization
There exist several methods for estimation of the unknown parameters θ in the
models. The prediction-error identification approach includes procedures such as
least-squares and maximum-likelihood for parameter estimation [15].
Prediction-error methods estimate the unknown parameters for the model by
minimizing the difference between measurements and model predictions. For that
reason, a requirement is that a predictor for the model structure of interest is
available. The process of how to build a predictor is treated in [15] for a number
of model structures. A possible predictor for the time discrete state-space model
structure (2.3) is given by simply disregarding the process noise w(t):
x(t + 1, θ) = fd(t, x(t, θ), u(t), 0; θ)
ŷ(t, θ) = hd(t, x(t, θ), u(t), 0; θ).    (2.6)
Now we possess all the tools needed to form the prediction error of a model at time t:

ε(t, θ) = y(t) − ŷ(t, θ).    (2.7)

A measure of the size of the complete sequence of prediction errors at times t = 1, . . . , N is given by:

V(θ, Z^N) = (1/N) ∑_{t=1}^{N} ε²(t, θ).    (2.8)

The process of finding the θ that minimizes (2.8) is called parameter estimation by the non-weighted least-squares method.
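
As an illustration only (not code from the thesis; fd and hd are assumed user-supplied functions implementing the predictor (2.6)), the cost (2.8) could be evaluated as follows:

# Hedged sketch: the prediction-error cost V(theta, Z^N) in (2.8) for a
# discrete-time predictor of the form (2.6), with the noise set to zero.
import numpy as np

def prediction_error_cost(theta, u, y, f_d, h_d, x0):
    x = np.asarray(x0, dtype=float)
    N = len(y)
    V = 0.0
    for t in range(N):
        yhat = h_d(t, x, u[t], theta)    # predictor output yhat(t, theta)
        V += (y[t] - yhat) ** 2          # squared prediction error eps(t, theta)
        x = f_d(t, x, u[t], theta)       # propagate the state one step
    return V / N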
2.2.3 Predictive models
For predictive models, the main objective is to find the parameter vector θ that
minimizes the prediction errors, i.e., find the parameter vector θ̂N that minimizes (2.8). Hence, for predictive models the parameter estimation problem can
be stated as a global optimization problem with the aim to find the global minimum of (2.8). If we again denote the parameter vector as θ, this can be stated
through:
θ̂_N = {θ ∈ D_M : ∀ θ* ∈ D_M : V(θ) ≤ V(θ*)}.    (2.9)
In Figure 2.3 the parameter estimation problem for predictive models is illustrated. As shown by the figure, the intention is to locate one parameter vector θ̂_N that optimizes the objective function.
Figure 2.3: The result of the parameter estimation problem for predictive models is a single point.
Equation (2.9) is a standard nonlinear optimization problem for which several methods exist. A short survey of global nonlinear optimization is presented in Chapter 3.
2.2.4 Descriptive models
As mentioned in Section 1.3, the models that prove to have "enough" predictive power are of interest. For the case of descriptive models we are not searching for one particular value of the parameter set θ, but rather for a subset θ̂*_N such that

θ̂*_N = {θ ∈ D_M : ∀ θ* ∈ D_M : V(θ) ≤ V(θ*) + κ}.    (2.10)

The parameter κ denotes the threshold that determines if a solution is "good enough". Equation (2.10) is illustrated in Figure 2.4. As can be seen, the parameter estimation problem for descriptive models is to find the values of θ that result in models that are "good enough", rather than finding the best model.
For predictive models, the parameter estimation problem can be stated as a global optimization problem through (2.9). The corresponding expression for descriptive models given in (2.10) is rather a description of the set of parameter vectors we would like to find. Hence, we have no method for finding the complete set θ̂*_N. However, instead of finding a complete description of θ̂*_N, it is possible to find parameter vectors θ̂_k ∈ θ̂*_N, k = 1, . . . , K. Since we want the parameter vectors θ̂_k to describe θ̂*_N as well as possible, the resulting parameter vectors should be "significantly different". Figure 2.5 shows how the parameter vectors θ̂_1 and θ̂_2 can be used as an approximation of θ̂*_N.

Figure 2.4: The result of the parameter estimation problem for descriptive models is a set of points below the threshold κ.

Figure 2.5: The parameter vectors θ̂_1 and θ̂_2 are used as an approximation of θ̂*_N.

The problem of how to find suitable θ̂_k can be stated as finding nonlinear optimization methods able to locate suboptimal minima of a function. The main content of Chapter 3 is to propose methods able to handle this problem.
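
To make the idea concrete, the following sketch (an illustration under assumptions, not the thesis implementation; the Euclidean separation test and the min_separation parameter are choices introduced here) keeps candidate parameter vectors whose cost lies within κ of the best cost found and thins them so that the retained vectors are "significantly different":

# Hedged sketch: approximating the set (2.10) from a pool of evaluated candidates.
import numpy as np

def acceptable_set(candidates, costs, kappa, min_separation):
    candidates, costs = np.asarray(candidates), np.asarray(costs)
    mask = costs <= costs.min() + kappa              # "good enough" according to (2.10)
    good, good_costs = candidates[mask], costs[mask]
    representatives = []
    for theta in good[np.argsort(good_costs)]:       # best models first
        if all(np.linalg.norm(theta - r) >= min_separation for r in representatives):
            representatives.append(theta)            # keep well-separated vectors only
    return representatives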
Chapter 3
Theory
The main objective for this thesis is, as stated in Section 1.3, connected with
parameter estimation for descriptive models. As mentioned in Section 2.2.4, the
parameter estimation problem can be stated as a nonlinear global optimization
problem. Hence, this chapter will deal with nonlinear global optimization, with
the main objective to present a toolbox for solving the problem defined by (2.10).
3.1 Global optimization
Global optimization can be described as the task of locating the absolutely best set of parameters that optimizes an objective function. While global optimization aims to find a global minimum, the corresponding aim for local optimization is to locate a local minimum of the objective function.
Several suggestions for how to classify the methods used for global optimization have been made; the most common is to roughly divide the methods into two different approaches [12], [19].
Deterministic or exact methods aim to guarantee that the global minimum
will be located. However, the computational effort for this is high and it grows
rapidly with the problem size [19]. One deterministic approach is covering methods
which include the grid-based search procedure. The grid-based search method
relies on the fact that practical objective functions normally have bounded rate
of change. Hence, a sufficiently dense grid would guarantee the detection of the
global minimum with a prescribed accuracy. However, it is easy to realize that
with growing dimensionality the number of grid points will be very large [27]. This is due to a feared phenomenon referred to as the curse of dimensionality, which means that the number of sample points within a volume of a certain size must be increased exponentially with the number of dimensions to retain the same density of samples.
Stochastic methods are all based on some sort of random generation of feasible trial points followed by a local optimization procedure. In contrast to deterministic methods, many stochastic methods can locate the vicinity of the global minimum with relative efficiency. However, the price paid is that global optimality cannot be assured [19]. The toolbox of stochastic methods includes procedures like random search methods, clustering methods, and evolutionary computation.
The first method examined in this thesis is the Simulated Annealing (SA) algorithm which is a random search method [27]. In its original formulation, the
algorithm determines the location of the global minimum. Hence, the algorithm
must be modified to fulfil the requirements of Section 2.2.4, i.e., to find also suboptimal local minima. A more thorough survey of SA is presented in Section 3.2.
The second stochastic method used is Multi-Level Single Linkage which is a
clustering method. The algorithm finds, at least in theory, the location of all local
minima of the objective function. Section 3.3 surveys clustering methods in general
and also describes the Multi-Level Single Linkage algorithm more extensively.
3.2 Simulated Annealing
As stated in Section 1.3, one idea of this thesis is to modify the “Annealed Downhill
Simplex” algorithm so that it is possible to find suboptimal local minima of a
function. This section contains a short historic review of the algorithm. After
that, the modified algorithm is presented.
3.2.1 Simulated Annealing Algorithm
The Simulated Annealing algorithm is a random search method originally presented by Metropolis, which attempts to simulate the behavior of atoms in equilibrium at a given temperature. It has been named after the procedure of metal
annealing [3]. The algorithm was first used to solve combinatorial optimization
problems, e.g., the traveling salesman problem. Later on, numerous different variations of the algorithm have been developed and some of them are applicable to
continuous as well as constrained problems.
The SA algorithm included in the Systems Biology Toolbox for Matlab [24] is
based on the Nelder-Mead Downhill Simplex algorithm (NMDS) where a simplex
transforms in the function space to find the optimal function value [18]. At high
temperature the simplex moves over the function space almost randomly. However,
as the temperature is lowered the algorithm behaves more and more like a local
search.
NMDS was presented in [22] and has gained in popularity due to its numerical stability. The method requires only function evaluations, not derivatives. A
simplex is the geometrical figure which, in N dimensions, consists of N + 1 corners, i.e., for two dimensions a simplex is a triangle. NMDS performs function
optimization by simplex transformations in the function space. Depending on the function values at the simplex points, the algorithm reflects, expands, or contracts the simplex with the intention of performing a structured examination of the simplex surroundings. The idea is to repeatedly replace the simplex point with the worst function value by a new, better point. In [22] an annealing scheme is applied to NMDS, here referred to by the name "Annealed Downhill Simplex".
In Algorithm 1 the structure of the Nelder-Mead Simulated Annealing is presented.
Algorithm 1 Nelder-Mead Simulated Annealing
Global minimization by the "Nelder-Mead Simulated Annealing". Only the basic structure of the algorithm is presented here; for further details see [22].
Requirements: Initial starting guess X0, start temperature T1, stop temperature Ts, and the number of iterations used for each temperature N.
1: Initiate parameters and put k = 1.
2: while Tk > Ts do
3:   Perform N iterations of the Annealed Downhill Simplex at temperature Tk, starting from the best point available so far. This is done by adding a positive, logarithmically distributed random variable, proportional to Tk, to the function value associated with every vertex of the simplex. For every new point tried as a replacement point according to the Nelder-Mead scheme, a similar random variable is subtracted. This procedure results in a simplex with almost random behavior at higher temperatures and ordinary local-search behavior at lower temperatures.
4:   Set Tk+1 lower than Tk and put k = k + 1.
5: end while
6: return Best point found.
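
The thermal perturbation in step 3 can be sketched as follows (an illustration under assumptions, not the toolbox implementation; the exponential form of the log-distributed noise is one common choice):

# Hedged sketch of the perturbation used in Algorithm 1: vertex values are
# penalized and trial values favored by positive noise of size ~ T, so the
# simplex moves almost randomly at high T and greedily as T approaches zero.
import numpy as np

rng = np.random.default_rng(0)

def perturbed_vertex_value(f_value, T):
    return f_value + T * (-np.log(rng.random()))   # add positive, log-distributed noise

def perturbed_trial_value(f_value, T):
    return f_value - T * (-np.log(rng.random()))   # subtract similar noise

# A trial point replaces the worst vertex if its perturbed value is lower than
# the worst perturbed vertex value; with T = 0 this reduces to plain Nelder-Mead.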
The idea is to modify the algorithm presented in Algorithm 1 from a single-start method, i.e., one point is used to restart the algorithm at each temperature level, into a multistart method, i.e., the algorithm restarts in N points at each temperature level. If the points used to restart the algorithm are chosen correctly, this should make it possible to find sub-optimal local minima of the objective function.
We will define the region of attraction of a local minimum x∗ as the set of
points in X from which a local search procedure converges to x∗ .
An ideal case, which is generic for all multistart global optimization methods,
is to start exactly one local search procedure in every region of attraction. Hence,
the most demanding problem of multistart methods is to recognize the region of
attraction for every local minimum [27].
The first idea is based on the approach that if one could retain only points
with relatively low function values, then these points would form groups around
some of the local minima. This idea is illustrated in Figure 3.1.
Such groups can be located by use of a clustering method, and finally one search
procedure is started from the center of each identified cluster. We have evaluated
an algorithm based on this procedure which is further described in Section 3.2.2.
Another approach, explored in Section 3.2.3, is to estimate the objective function that is to be minimized by using a function regression method. The idea is simply to identify the regions of attraction of the function by exploring how the function behaves between points with relatively low function values.
Figure 3.1: By retaining points with function values lower than a threshold Vγ, groups of points are formed around local minima.
3.2.2 Cluster analysis approach
As mentioned in Section 3.2, one idea is to use clustering analysis to identify
accumulations of points with relatively low function value. For each cluster, the
center is used to initiate a downhill simplex search. Hence, not only the area
close to the best point found so far will be explored, but a more “broad” search
is performed including suboptimal function areas. In Algorithm 2 we present the
basic structure of the algorithm.
Algorithm 2 Multistart Simulated Annealing — The clustering approach (MSAC)
Requirements: Initial starting guess X0, start temperature T1, stop temperature Ts, and the number of iterations used for each temperature N.
1: Initiate parameters and put k = 1.
2: while Tk > Ts do
3:   Perform N iterations of Annealed Downhill Simplex at temperature Tk.
4:   Set Tk+1 lower than Tk and put k = k + 1.
5:   Apply the condition given by (3.1) to the N points and use cluster analysis on the retained points Lγ to find suitable starting points at the next temperature.
6: end while
7: return The best points found.
As illustrated in Figure 3.1, one way to concentrate the points around the local minima is to retain only points with relatively low function values:

Lγ = {θ | V(θ) < Vγ}.    (3.1)

The threshold Vγ is a positive constant defining which points will be included in the set.
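
A minimal sketch of this retention step follows (the quantile-based choice of Vγ below is an assumption for illustration, not a rule from the thesis):

# Hedged sketch of (3.1): keep only points with relatively low function values.
import numpy as np

def retain_low_points(points, values, quantile=0.2):
    points, values = np.asarray(points), np.asarray(values)
    V_gamma = np.quantile(values, quantile)   # threshold choice is an assumption
    return points[values < V_gamma]           # the set L_gamma in (3.1)

The retained points are then passed to the cluster analysis described below, and one downhill simplex search is started from the center of each identified cluster.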
The problem of how to obtain the intrinsic structure of clustered data when no other prior information is available is referred to as unsupervised cluster analysis [9]. The literature within the area is extensive and the number of clustering algorithms is large.
In this thesis we use a method called model-based clustering where the data
is assumed to be generated by some parametric probability distribution, and the
problem is then to identify the most likely distribution for data [23].
Model-based clustering
The model-based clustering method is based on the assumption that the data
can be described as a mixture of models in which each model corresponds to one
cluster. One approach is to use Gaussian distributions for describing the models.
Hence, the method is referred to as Gaussian mixture modelling.
A Gaussian density in a d-dimensional space, with mean µ ∈ R^d and d × d covariance matrix Σ, is given by the probability density function:

φ(x; θ) = (2π)^{−d/2} |Σ|^{−1/2} exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)),    (3.2)

where θ denotes the parameters µ and Σ.
A Gaussian mixture with K components is then given as a convex combination of Gaussian densities:

f(x) = ∑_{j=1}^{K} πj φ(x; θj),  with  ∑_{j=1}^{K} πj = 1 and ∀j : πj ≥ 0,

where the πj are called the mixing weights and φ(x; θj) the components of the mixture.
The mixture likelihood approach aims to maximize

L_M(θ1, . . . , θK; π1, . . . , πK | x) = ∏_{i=1}^{N} f(xi) = ∏_{i=1}^{N} ∑_{j=1}^{K} πj φ(xi; θj).    (3.3)
Equation (3.3) provides a measure of how well the mixture model is adjusted to the sample data. Hence, given sample data, we would like to maximize L_M.
A popular method for maximization of (3.3) is the Expectation-Maximization (EM) algorithm [29]. Given a K-component mixture with respect to a dataset X_N = {x1, . . . , xN}, xi ∈ R^d, the model mixture parameters can be updated by iterative application of the following equations:
P(j; xi) = πj φ(xi; θj) / f(xi)    (3.4)

πj = (1/N) ∑_{i=1}^{N} P(j; xi)    (3.5)

µj = ∑_{i=1}^{N} P(j; xi) xi / (N πj)    (3.6)

Σj = ∑_{i=1}^{N} P(j; xi)(xi − µj)(xi − µj)ᵀ / (N πj).    (3.7)
In (3.4), called the expectation step, the term P(j; xi) is called the responsibility of model j for observation xi. In the expectation step the current estimates for the model mixture are used to find responsibilities according to the point density for each model.
The estimated responsibilities from (3.4) are used in the maximization step to update the estimates of the parameters by weighted maximum-likelihood fits. The update of the mixing weights is given by (3.5), the mean µj for model j is updated in (3.6), and (3.7) describes the estimation of the covariance Σj [11].
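
A compact sketch of one EM iteration implementing (3.4)-(3.7) is given below (assuming NumPy and SciPy; initialization and convergence checks are omitted):

# Hedged sketch: one EM update for a Gaussian mixture, following (3.4)-(3.7).
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    N, d = X.shape
    K = len(pis)
    dens = np.column_stack([multivariate_normal.pdf(X, mus[j], Sigmas[j])
                            for j in range(K)])      # phi(x_i; theta_j)
    f = dens @ pis                                    # mixture density f(x_i)
    P = dens * pis / f[:, None]                       # responsibilities, (3.4)
    Nj = P.sum(axis=0)
    pis_new = Nj / N                                  # mixing weights, (3.5)
    mus_new = (P.T @ X) / Nj[:, None]                 # means, (3.6)
    Sigmas_new = []
    for j in range(K):
        diff = X - mus_new[j]
        Sigmas_new.append((P[:, j, None] * diff).T @ diff / Nj[j])  # covariances, (3.7)
    return pis_new, mus_new, np.array(Sigmas_new)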
It can be shown that each iteration leads to a sequence of mixtures {f^i} with nondecreasing likelihood function L_M^i. The EM algorithm guarantees convergence to a local optimum, i.e., a globally optimal solution is not guaranteed [29]. Due to this, the final mixture distribution is highly dependent on the initial distribution f^0.
Another problem is how to identify the number of clusters within the data,
i.e., how to select the optimal number of components for the mixture. For a fixed
number of components, the EM-algorithm can be used to find (at least in local
sense) the θ that maximize (3.3). However, as the number of components increases
the likelihood function for the mixture will also increase. The whole idea of using
model mixtures is to group close points, but that falls if the number of mixture
models are the same as the number of data points. There are two main ways to
handle this problem.
The most favourable situation is if fresh data with the same properties, i.e., the
same number of clusters, is available. Cross-validation can then be used to select
the number of components to be used in the mixture [15]. However, this approach
is only useful for applications where the number of clusters in the data is known
to be constant. Hence, the method is not applicable to our problem since we do
not know the correct number of clusters.
In the case when no fresh data is available, the problems of how to select the number of components in the mixture and how to estimate the parameters can be phrased as a joint problem. The problem is then to, as before, minimize a cost function, but now with an additional penalty on model complexity. The purpose is to find a model that describes the data sufficiently well, while at the same time not using an unnecessarily complex model [15].
For this thesis, we have used two different criteria and evaluated their performance when applied to our problem. For details about the criteria see [15].
Aikaike’s information criterion (AIC) is the first of the methods. AIC aims at
minimizing:
1
[−LM (x|θ) + dim θ].
(3.8)
AIC = min
θ,k N
The second criterion evaluated is Rissanen’s minimum description length (MDL)
principle. MDL aims at minimizing:
M DL = min
θ,k
dim θ
1
[−LM (x|θ) +
· log N ].
N
2
(3.9)
We used Monte-Carlo simulation applied to a data set generated from Gaussian densities with known properties to compare AIC and MDL. The results are
presented in Chapter 4.
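
The model-selection step can be sketched as follows (an illustration only; loglik is an assumed helper returning the maximized log-likelihood and the number of free parameters of a k-component mixture fitted by EM):

# Hedged sketch: selecting the number of mixture components with AIC (3.8) and MDL (3.9).
import numpy as np

def select_components(loglik, N, k_max):
    scores = {}
    for k in range(1, k_max + 1):
        L, dim_theta = loglik(k)                       # fitted k-component mixture
        aic = (-L + dim_theta) / N                     # (3.8)
        mdl = (-L + 0.5 * dim_theta * np.log(N)) / N   # (3.9)
        scores[k] = (aic, mdl)
    k_aic = min(scores, key=lambda k: scores[k][0])
    k_mdl = min(scores, key=lambda k: scores[k][1])
    return k_aic, k_mdl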
Figure 3.2: An example of an estimated Gaussian mixture with 5 components.
Using the cluster analysis method described above, all components needed to
run Algorithm 2 have now been presented. We have evaluated the algorithm using
test functions presented in Section 3.4 and the results are presented in Chapter 4.
3.2.3 Function estimation approach
Another approach to the problem of how to find suitable points to restart the
algorithm from is to estimate the true objective function between the already
evaluated points using some function regression method.
Locating the regions of attraction by function estimations
The idea is to examine the topology of the function by analysing the function values on straight lines drawn between pairs of points. If no significant function "peak" can be found between the points, the points are assumed to be contained in the same region of attraction. In Figure 3.3 this procedure is explained, where the "peak" between the pair of points is denoted by γ. Consider the situation of examining the topology between a pair of points denoted by x1 and x2:
1. A function estimation method is used to calculate function predictions between the points. In Figure 3.3 the function estimations are dotted.
2. The function predictions are used to calculate two measures we denote by "min from left" (Ml) and "min from right" (Mr). They are formed by stepping between the points using the function predictions. At each point, the smallest value of the function prediction found so far is stored. In Figure 3.3 the Ml measure is dashed and the Mr measure is dash-dotted.
3. For every evaluation point between x1 and x2, find the largest difference between the function prediction and the maximum of Ml and Mr. Denote the largest value found by γ.
4. If γ is larger than a threshold α, both points are used to restart the algorithm from at the next lower temperature. In the following text we present a motivation for the selection of α.
Remember that SA attempts to simulate the behavior of atoms in equilibrium at a given temperature. The probability of making a transition from a current atom state s1 to a new candidate state s2 at temperature T is given by the Boltzmann factor:

P(E(s1)) / P(E(s2)) = e^{−(E(s1) − E(s2))/T},    (3.10)

where E(s1) and E(s2) are the system energy states associated with s1 and s2 [14].
We will use this theory to answer a question consisting of two main parts. First, do the two points x1 and x2 belong to the same region of attraction? If they do, what is the probability, given that the algorithm is only restarted in one of these points, of still finding both function "valleys" at the next temperature level? The reason for stating these questions is that we do not want to use unnecessarily many restart points for the algorithm. If the probability of finding the function "valley" at a lower temperature is high, then we do not have to restart the algorithm from both x1 and x2 at this temperature.
Using this theory as a motivation, we have determined the threshold α through:

P_default = e^{−α/T_{k+1}},    (3.11)

where P_default is an a priori specified parameter. Equation (3.11) can be formulated as

α = T_{k+1} · log(1/P_default),

giving an explicit expression for the threshold α.
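
The whole test can be summarized in the following sketch (an illustration under assumptions; predict is an assumed function estimator such as the LWPR model, and the 50-point evaluation grid on the connecting line is a choice made here):

# Hedged sketch of the region-of-attraction test in Section 3.2.3.
import numpy as np

def same_region_of_attraction(x1, x2, predict, T_next, P_default, n=50):
    lam = np.linspace(0.0, 1.0, n)[:, None]
    line = (1 - lam) * np.asarray(x1, float) + lam * np.asarray(x2, float)
    fhat = np.array([predict(p) for p in line])    # function predictions on the line
    Ml = np.minimum.accumulate(fhat)                # "min from left"
    Mr = np.minimum.accumulate(fhat[::-1])[::-1]    # "min from right"
    gamma = np.max(fhat - np.maximum(Ml, Mr))       # largest "peak"
    alpha = T_next * np.log(1.0 / P_default)        # threshold from (3.11)
    return gamma <= alpha                           # True: one restart point suffices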
Figure 3.3: Function predictions, dotted in the figure, are used to examine the function behavior between a pair of points represented by circles. The distance γ is used to determine if the points belong to the same region of attraction. The "min from left" measure is dashed and the "min from right" measure is dash-dotted.
Locally Weighted Regression
Given sample inputs x ∈ R^m and outputs y ∈ R, consider the mapping f : R^m → R, and assume that the output is generated as noisy measurements of a function:

yi = f(xi) + εi.    (3.12)

The problem of regression analysis is to estimate the function f based on the samples. A common approach to function regression is to build the function estimate using local information, i.e., use adjacent sample points to build a local model. One such method is Locally Weighted Regression, which fits local models to nearby data. The data used for building the model are weighted according to the distance from the query point xq [1].
Suppose that we are given a set of sample data {(xi, yi)}_{i=1}^{N} produced by (3.12).
We will show how an estimate can be formed using linear local models. Locally weighted regression aims to solve a weighted least-squares problem at each query point xq:

min_{α(xq), β(xq)} ∑_{i=1}^{N} wi(xq, xi) [yi − α(xq) − β(xq) xi]²,    (3.13)

with the estimate then given by f̂(xq) = α̂(xq) + β̂(xq) xq.
Equation (3.13) is a quadratic function which can be minimized analytically through the weighted normal equations [15]. Let b(x) = [1, x] be a vector-valued function, and let B be the N × 2 regression matrix with ith row b(xi)ᵀ. Also let W(xq) be the N × N diagonal matrix with ith element wi(xq, xi). Then, the estimate that minimizes (3.13) is given by

f̂(xq) = b(xq)ᵀ (Bᵀ W(xq) B)⁻¹ Bᵀ W(xq) y,    (3.14)

where the measured outputs are gathered in y. Equation (3.14) shows how the estimate f̂(xq) is built from local models [11].
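
As a small illustration (not the thesis implementation; the Gaussian kernel and the fixed bandwidth are choices made here), the estimate (3.14) with a local linear model can be computed as:

# Hedged sketch: locally weighted linear regression at a query point x_q.
import numpy as np

def lwr_predict(xq, X, y, bandwidth):
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    b = lambda x: np.concatenate(([1.0], np.atleast_1d(x)))    # b(x) = [1, x]
    B = np.vstack([b(xi) for xi in X])                         # regression matrix
    d2 = np.sum((X.reshape(len(X), -1) - np.atleast_1d(xq)) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / bandwidth ** 2)                     # Gaussian weights w_i(x_q, x_i)
    W = np.diag(w)
    beta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)           # weighted normal equations
    return b(xq) @ beta                                        # fhat(x_q) as in (3.14)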
For a more extensive explanation of local regression, see [1], which contains a
comprehensive tutorial of the method. Essential contributions to the method are
also presented in [5] and [6].
When using local regression in practice, three basic components must be chosen: the bandwidth, i.e., how many data points are used to build the estimate; the weight function wi, i.e., how the data weights are calculated; and the parametric family to use for the local model [6].
Selection of the parametric family simply amounts to choosing which polynomial degree to use for the local models. This selection is a trade-off between variance and bias. Using polynomials of high degree leads to a less biased, but more variable, estimate [1]. The most common choices are polynomials of first or second degree. The above choices must be tailored to each data set used in practice.
A central issue is how many sample points should be used to build the estimate,
i.e., how to select the bandwidth. Basically two choices for selection of bandwidth
exist. One choice is to use a fixed bandwidth. However, this can lead to a bad
estimate due to large changes in density of the sample data.
Another method is to use an adjustable bandwidth, based on the value of xq. The advantage of this method is that if the part of the input space surrounding the query point xq is very sparse, more sample data can be included in the estimate. One example of an adjustable bandwidth is k-nearest-neighbour selection, where k can be selected using leave-one-out cross-validation [6].
The requirements on a weighting function are straightforward. The function maximum should be located at zero distance, and the function should decay smoothly as the distance increases [6]. Several standard choices for weighting functions exist; some of the most commonly used are Gaussian kernels, quadratic kernels, and tricube kernels.
For this thesis, we have evaluated two different implementations of the Locally Weighted Regression method. First the Lazy Learning algorithm is presented, and then the Locally Weighted Projection Regression algorithm is described.
The Lazy Learning Algorithm
By application of the recursive least-squares (RLS) algorithm, a fast implementation of the Locally Weighted Regression method is presented by the authors of [2]. The method is explained in detail in [15]. Recursive methods process the input-output data one sample at a time as they become available.
For the Lazy Learning algorithm, all data are available at the same time, but in accordance with RLS new data points are included in the estimate recursively, i.e., the estimate is incremented with points one by one.
When building a function estimate at a query point xq, the available data points are sorted with respect to the distance from xq. The data point closest to xq is denoted xd1. Then, the estimate is built by sequentially adding data points ordered by the distance from xq.
The RLS algorithm leaves no possibility of using an arbitrary weight function. It would however be possible to use a weight function given by wi = a^i, 0 < a < 1. This possibility is not used in the Lazy Learning algorithm; instead every data point included in the estimate is given equal weight. This leads to a very unsmooth estimate, particularly if the data is very sparse or noisy. The most obvious advantage of the algorithm is without doubt that it is very time efficient.
In Figure 3.4 the algorithm is used to produce estimates based on samples from a function with additive noise. Note that the function estimate is very jagged, which is a significant problem for our application.
Figure 3.4: A function estimate produced by the RLS-LWR algorithm from noisy samples. Panels: noisy data samples; the fitted function (nMSE = 0.110); the true function.
Incremental Online Learning in High Dimensions
Locally weighted projection regression (LWPR) is an algorithm that forms nonparametric function estimates built from local linear models [30]. The algorithm is incremental, i.e., the local models can be updated whenever fresh data is available.
In accordance with the description of Locally Weighted Regression, the algorithm comprises three basic entities for learning a function estimate.
1. The LWPR algorithm uses linear functions for describing the data locally. Hence, the relation between the output (yi) and the input (xi) is given by:

   yi = βi xi + εi,    (3.15)

   where εi denotes the measurement noise. The function coefficients βk are updated using partial least squares regression (PLS). PLS uses the principal components of xi to predict a score on the corresponding principal component for yi. PLS determines the coefficients β by iteratively maximizing the covariance between the scores for the inputs (xi) and the output (yi) [30].
   LWPR uses PLS to, if possible, reduce the dimension of the input space. This makes it possible to remove possibly redundant inputs. The number of projections to use for each linear model is controlled by keeping track of the mean-squared error (MSE) as a function of the projections included in the local model. Each local model is initiated with two (R = 2) local model projections, and the algorithm stops adding projections if the MSE at the next projection does not decrease by more than a predefined percentage of the previous MSE, i.e., if MSE_{r+1}/MSE_r > φ, where φ ∈ [0, 1].
2. For the RLS algorithm, the selection of weighting function is to assign the included data points equal weight. LWPR includes the possibility to use an arbitrary weighting function, parameterized by a distance metric Dk for the kth local model. Hence, given a query point xq, the weight wk for a local model centered at ck is given by

   wk = h(xq, Dk, ck).    (3.16)

   LWPR only allows a certain degree of overlap, limited by a predefined threshold. A high degree of overlap can result in a better estimate, but also makes the algorithm more computationally expensive.
   The final function approximation evaluated at a query point xq is then given by

   ŷ = ∑_{k=1}^{K} wk ŷk / ∑_{k=1}^{K} wk,    (3.17)

   where ŷk is the prediction of each local model.
3. The question of how many data points to use for each function estimate is referred to as the bandwidth problem. LWPR determines for each local model a region of validity, called a receptive field (RF), parametrized by the distance metric D.
   The shape and size of the local models are determined using a penalized leave-one-out cross-validation error,

      J = ( 1 / Σ_{i=1}^{M} w_i ) Σ_{i=1}^{M} w_i (y_i − ŷ_{i,−i})^2 + (γ/N) Σ_{i,j=1}^{N} D_{i,j}^2,      (3.18)

   where M denotes the number of data points in the training set, N denotes the number of local models used, and w_i are the weights associated with each data point.
   The first term of (3.18) is the mean leave-one-out cross-validation error (indicated by the subscript (i, −i)) of the local model. The second term is a penalty term which ensures that the receptive fields cannot become indefinitely small; that would be statistically correct for building an asymptotically unbiased function approximation, but it would also make the algorithm too computationally expensive.
   Adding a new data point x involves examining whether a new RF has to be created. If the point x has sufficient activation according to (3.16), no new RF is created. Otherwise, a new RF is created with c = x, D = D_default.
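As an illustration of how the pieces above fit together, the sketch below combines the predictions of a set of local models according to (3.16) and (3.17). The Gaussian kernel used for h(·) is one common choice, and the two hard-coded local models in the usage example are made up for the illustration; the sketch is not an implementation of the full LWPR algorithm of [30].

import numpy as np

def lwpr_predict(x_q, centers, metrics, local_models):
    # Weight each local model with w_k = h(x_q, D_k, c_k), here a Gaussian kernel (eq. 3.16),
    # and combine the local predictions as in eq. (3.17).
    num, den = 0.0, 0.0
    for c_k, D_k, predict_k in zip(centers, metrics, local_models):
        d = np.asarray(x_q, float) - np.asarray(c_k, float)
        w_k = np.exp(-0.5 * d @ np.asarray(D_k, float) @ d)
        num += w_k * predict_k(x_q)
        den += w_k
    return num / den

# Two hypothetical one-dimensional local linear models:
centers = [np.array([-0.5]), np.array([0.5])]
metrics = [np.array([[25.0]]), np.array([[25.0]])]
models = [lambda x: 1.0 + 2.0 * x[0], lambda x: 2.0 - 1.0 * x[0]]
print(lwpr_predict(np.array([0.1]), centers, metrics, models))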
As written above, compared to Lazy Learning, any weighting function can be
applied to the LWPR algorithm. Hence, the ability to handle sparse data samples
and noisy data improves considerably. This is demonstrated in Figure 3.5 which
shows an estimate with the same input data as in Figure 3.4. Note that the
estimate produced with the LWPR algorithm is much smoother and the impact of
the noise is reduced compared to Figure 3.4. Further, the RFs used to build the
function estimates are shown in the figure.
We have implemented Algorithm 3 using the LWPR algorithm to estimate the
objective function. Evaluation of the algorithm has been made using the scalable
test functions presented in Section 3.4. The result is presented in Chapter 4.
Algorithm 3 Multistart Simulated Annealing — Function Estimation Approach (MSAF)
Requirements: Initial starting guess X_0, start temperature T_1, stop temperature T_s, and the number of iterations used for each temperature, N.
1: Initiate parameters and put k = 1.
2: while T_k > T_s do
3:    Perform N iterations of Annealed Downhill-Simplex at temperature T_k.
4:    Update the LWPR-model using the N function evaluations performed in step 3.
5:    Set T_{k+1} lower than T_k and put k = k + 1.
6:    Using the method presented in Section 3.2.3 and LWPR, find suitable points to restart the algorithm from at the next lower temperature.
7: end while
8: return The best points found.
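The control flow of Algorithm 3 can be sketched as below. It is a simplified stand-in rather than the algorithm itself: plain Nelder–Mead from SciPy replaces the annealed downhill-simplex step, and the restart points are simply the best points found so far, perturbed by temperature-scaled noise, instead of points proposed by the LWPR estimate of Section 3.2.3. The test function in the usage example is made up.

import numpy as np
from scipy.optimize import minimize

def msaf_skeleton(f, x0, T_start=1.0, T_stop=1e-3, N=50, cooling=0.7, n_restarts=3, seed=0):
    rng = np.random.default_rng(seed)
    archive = []                              # (f(x), x) pairs; plays the role of the surrogate data
    starts = [np.asarray(x0, float)]
    T = T_start
    while T > T_stop:
        for x in starts:
            # Stand-in for step 3: a budget-limited simplex search from each start point.
            res = minimize(f, x, method="Nelder-Mead", options={"maxfev": N})
            archive.append((res.fun, res.x))  # step 4 would refit the LWPR model to these points
        archive.sort(key=lambda t: t[0])
        T *= cooling                          # step 5: lower the temperature
        # Stand-in for step 6: restart around the best points found so far.
        starts = [x + T * rng.standard_normal(x.size) for _, x in archive[:n_restarts]]
    return archive[:n_restarts]

# A made-up two-dimensional function with two local minima:
f = lambda x: min((x[0] - 1.0) ** 2 + x[1] ** 2, 0.5 + (x[0] + 1.0) ** 2 + x[1] ** 2)
print(msaf_skeleton(f, [0.2, 0.2]))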
Figure 3.5: The LWPR algorithm used on a noisy dataset. (Panels: the noisy data samples, the fitted function with nMSE = 0.123, the true function, and an input space view of the RFs.)

3.3 Clustering methods for Global Optimization

Clustering methods for global optimization are based on two simple assumptions: first, that it is possible to obtain a sample of points which is concentrated in the neighbourhood of the local minimizers, and second, that this sample can be analysed with a clustering method giving the vicinity of the local minimizers. The idea is then to start local optimizations from the center of each cluster. If this procedure is carried out successfully, every local minimum, and hence the global minimum, will be found [27].
There are many ways to perform the grouping of sample points around the local minima. One approach is to simply retain the points with relatively good function values; this is the method used in Section 3.2.2. Another approach is to push the sample points closer to the local minima by performing a few steps of local optimization [27].
The other important part of a clustering method is the cluster analysis itself. We have already scratched the surface of this field in Section 3.2.2. It is worth mentioning again, however, that there exist numerous approaches to cluster analysis and that the number of available algorithms is large.
Of course, the property of listing all local minima is in agreement with our demand to find suboptimal minimizers of the objective function given by (2.10).
We will not go deeper into the theory of clustering methods; for a more extensive tutorial we recommend the survey [27]. However, as a comparison to the methods presented above, we have chosen to implement a clustering method known to be more efficient and accurate than Simulated Annealing [21]. The algorithm, named Multi-Level Single Linkage, is described in Algorithm 4. We have implemented Algorithm 4 and used it as a reference for the other two approaches of Multistart Simulated Annealing.
Algorithm 4 The Multilevel Single-Linkage Algorithm
1: Generate N sample points from a uniform distribution over
      S_N = {x | l_i ≤ x_i ≤ h_i, i = 1, . . . , K}
   and calculate the corresponding function values at these N points.
2: Calculate the critical distance r_k given by (3.19).
3: Define a level set S_d such that S_d = {x ∈ S_N | f(x) ≤ d} and sort the points in order of descending function values. A point is chosen as a start point for a local search if no point with a better function value and within the critical distance r_k has already been used as a start point for a local search.
4: Add the found local minimum x̂ to the set of already found minima if it is not located closer than a predefined distance α to any already found minimum.
5: Stop the algorithm if the stopping rule (3.20) is satisfied, otherwise set k = k + 1 and go to 1. The variable k denotes the iteration counter of the algorithm.

The critical distance r_k is given by

      r_k = π^{-1/2} [ Γ(1 + K/2) ψ(S_N) ρ log(kN)/(kN) ]^{1/K},      (3.19)

where Γ denotes the gamma function, ψ(S_N) denotes the Lebesgue measure of S_N, and ρ is a positive constant.
According to [17], when ρ > 0, all local minima of the function will be located in a finite number of iterations with probability one. This result is only valid for a function with a finite number of local minima. Consider, e.g., an objective function with n local minima, n ∈ N. If the algorithm is used to optimize this function, the number of iterations required is at least n, since the rate of locating new local minima is at most one per iteration.
The stopping rule applied in step 5 uses a Bayesian estimate of the total number of local minima. Let W denote the number of local minima found after k iterations, and let D denote the number of points in S_d. A Bayesian estimate of the total number of local minima is then given by

      W_exp = W (kD − 1) / (kD − W − 2).

The search procedure is terminated if

      W_exp < W + 0.5.                                        (3.20)
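The two quantities above translate directly into code. The sketch below is a plain transcription of (3.19) and (3.20); the value ρ = 2 and the arguments in the example calls are arbitrary choices for illustration.

import math

def critical_distance(k, N, K, lebesgue_measure, rho=2.0):
    # r_k = pi^(-1/2) * [Gamma(1 + K/2) * psi(S_N) * rho * log(kN)/(kN)]^(1/K), eq. (3.19)
    kN = k * N
    return (math.pi ** -0.5) * (math.gamma(1.0 + K / 2.0) * lebesgue_measure * rho
                                * math.log(kN) / kN) ** (1.0 / K)

def stop_search(W, k, D):
    # Bayesian estimate of the total number of local minima and the stopping rule (3.20).
    W_exp = W * (k * D - 1) / (k * D - W - 2)
    return W_exp < W + 0.5

# Example: the unit box in K = 2 dimensions, N = 100 points per iteration.
print(critical_distance(k=1, N=100, K=2, lebesgue_measure=1.0))
print(stop_search(W=3, k=5, D=40))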
3.4 Test function generation
Testing is an instrumental part of developing an optimization algorithm. The purpose is twofold: to verify the performance of the algorithm, and to determine an adequate range of problems for which the algorithm is applicable. Hence, the selection of suitable test functions is very important.
Publications presenting nonlinear optimization algorithms are far more numerous than publications concerning the testing of optimization software. It seems to be more fun to construct the algorithms than to test them.
In [20] the authors list 36 different unconstrained optimization problems with
solutions. The article also includes a comparison between two different optimization algorithms. In [8] the authors have gathered a set of constrained optimization
problems. Each includes a known solution and a short description.
However, since most global optimization algorithms aim only to determine the global minimum, the locations and function values of the local minima are seldom listed. This is not acceptable for our case, since the purpose of our testing is to locate also the suboptimal function minima. We would like a set of test functions with completely known properties for all local minima, and it would be even better if we could specify those properties ourselves.
Fortunately, the method for generation of test functions presented in [10] fulfills
these requirements. A framework is presented for generation of tailor-made test
functions with a priori defined: (i) problem dimension, (ii) number of local minima,
(iii) the local minimum points, (iv) the function values of the local minima, and
(v) variable size of the region of attraction for each local minimum. The technique
is to systematically distort a convex quadratic function by introduction of local
minima. Figure 3.6 shows an example of a generated test function.
We have used the tailor-made functions to evaluate the algorithms described above. The results are presented in Chapter 4, but it is worth noting already here that without the method for generating test functions presented in [10], verification of the algorithms would be much harder to perform.
Figure 3.6: A 2-dimensional test function with five local minima.
Chapter 4
Results
In the previous chapters we have presented methods to solve the objective of this
thesis, presented in Section 1.3. In this chapter we will evaluate the methods
when applied to test functions which are produced using the method presented in
Section 3.4.
4.1 Test function suite
As mentioned in Section 3.4, the choice of a suitable test function suite is essential
for development of optimization algorithms. In this section the test functions
selected for evaluation of the optimization algorithms described in Chapter 3 are
presented.
4.1.1 Building a test function
As described in Section 3.4, the method for generation of test functions presented in [10] generates tailor-made functions from a number of user-defined parameters. We will demonstrate the method by generating a 2-dimensional test function.
The local minima for the generated function are given by the minimum for the
undisturbed quadratic function, T , and the additional function minima added by
disturbing the quadratic function.
We will build a test function by disturbing a quadratic function with local minimum defined by T . By this disturbance, three additional local minima defined by
Mk , k = 1, . . . , 3 with corresponding function values fk are added to the function.
The size of the region of attraction for each introduced local minimum is defined
by ρk . In Table 4.1 the selected parameter values are presented.
The parameter values in Table 4.1 are used to generate a test function which
is presented in Figure 4.1. Note that the local minimum represented by T is fully
absorbed by M3 . Hence, the total number of local minima is three for this function.
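The sketch below illustrates the kind of construction involved, using the parameter values that will be listed in Table 4.1. It is only an illustrative stand-in, not the smooth polynomial construction of [10]: it takes a pointwise minimum of paraboloids, so the resulting surface has kinks where the branches meet, and the widths only roughly play the role of the attraction-region sizes ρ_k.

import numpy as np

def make_test_function(T, t, minima, values, widths):
    # A convex quadratic with minimum t at T, distorted so that extra local minima
    # appear at the points M_k with prescribed function values f_k.
    T = np.asarray(T, float)
    minima = [np.asarray(m, float) for m in minima]

    def f(x):
        x = np.asarray(x, float)
        branches = [t + np.sum((x - T) ** 2)]                    # the undisturbed quadratic
        for M_k, f_k, rho_k in zip(minima, values, widths):
            branches.append(f_k + np.sum((x - M_k) ** 2) / rho_k ** 2)
        return min(branches)

    return f

# Parameters of Test function Nr. 1 (see Table 4.1):
f1 = make_test_function(T=[0.0, 0.1], t=1.1,
                        minima=[[-0.7, -0.7], [0.7, 0.7], [0.0, 0.0]],
                        values=[0.1, 0.15, 0.2], widths=[0.3, 0.4, 0.4])
print(f1([-0.7, -0.7]), f1([0.0, 0.0]))   # returns the prescribed minimum values 0.1 and 0.2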
Besides Test function Nr. 1, another two test functions are used. Table 4.2 presents the parameters for a 3-dimensional function with four local minima.
Table 4.1: Test function Nr. 1 (T_1)
Problem dimension n_d = 2 and Nr. local minima m = 3.
Local minima of undisturbed function T = [0 0.1] with function value t = 1.1.
The local minima introduced by function disturbance are given by:
M_1 = [−0.7 −0.7]   f_1 = 0.1    ρ_1 = 0.3
M_2 = [0.7 0.7]     f_2 = 0.15   ρ_2 = 0.4
M_3 = [0 0]         f_3 = 0.2    ρ_3 = 0.4
Figure 4.1: Test function generated by using the parameters defined in Table 4.1. Three of the minima are represented by M_1, M_2, and M_3, while the fourth is given by T.
Table 4.3 contains the parameters for a 4-dimensional test function with three local minima. The local minima of this function are defined by M_1, M_2, and T, giving a total of three local minima.
4.2 Optimization algorithm results

This section contains the evaluation of the optimization algorithms presented in Chapter 3. The algorithms have been evaluated using the test function suite presented in Section 4.1. Table 4.4 presents the results for the algorithms applied to Test function Nr. 1, while the results for Test function Nr. 2 and Test function Nr. 3 are presented in Table 4.5 and Table 4.6. The tables show the rate of finding N correct local minima of the function, where N is given by the column; e.g., MSL found three correct local minima at a rate of 100 % for Test function Nr. 1. The following sections evaluate the results for each algorithm.
Table 4.2: Test function Nr. 2 (T_2)
Problem dimension n_d = 3 and Nr. local minima m = 3.
Local minima of undisturbed function T = [0 0.1 0] with function value t = 1.1.
The local minima introduced by function disturbance are given by:
M_1 = [−0.7 −0.7 −0.7]   f_1 = 0.2   ρ_1 = 0.5
M_2 = [−0.7 0.7 0.7]     f_2 = 0.3   ρ_2 = 0.5
M_3 = [0.7 −0.7 0.5]     f_3 = 0.4   ρ_3 = 0.5
Table 4.3: Test function Nr. 3 (T_3)
Problem dimension n_d = 4 and Nr. local minima m = 2.
Local minima of undisturbed function T = [0 0.1 0 0] with function value t = 2.1.
The local minima introduced by function disturbance are given by:
M_1 = [−0.7 −0.7 −0.7 −0.7]   f_1 = 0.1   ρ_1 = 0.5
M_2 = [0.7 0.7 0.7 0.7]       f_2 = 0.2   ρ_2 = 0.5
Table 4.4: Test function Nr. 1 — Results
Rate of correct minima found on a test cycle of 100 simulations.
Algorithm      1      2      3
MSAC         11 %   27 %   62 %
MSAF          4 %   16 %   80 %
MSL           0 %    0 %  100 %
Table 4.5: Test function Nr. 2 — Results
Rate of correct minima found on a test cycle of 100 simulations.
Algorithm      1      2      3      4
MSAC         12 %   46 %   26 %   16 %
MSAF          4 %   14 %   33 %   49 %
MSL           0 %    0 %    0 %  100 %
Table 4.6: Test function Nr. 3 — Results
Rate of correct minima found on a test cycle of 100 simulations.
Algorithm      1      2      3
MSAC           —      —      —
MSAF         51 %   15 %   34 %
MSL           0 %    8 %   92 %
4.2.1 Gaussian Mixture Modeling Approach (MSAC)

In Section 3.2.2 we presented an algorithm based on Simulated Annealing that uses Gaussian mixture modeling to identify suitable points for restarting the algorithm at the next lower temperature.
Number of model components
A requirement for using the algorithm is to identify a feasible number of mixture model components to describe the data. For this, two different methods were presented: Akaike's criterion (3.8) and Rissanen's MDL measure (3.9).
The criteria can be used to estimate the number of components in a Gaussian mixture; we denote this estimate by K̂. By generating a tailor-made d-dimensional Gaussian mixture consisting of K components, it is possible to evaluate how well the criteria can recover the predefined number of mixture components K. We have used this method to evaluate how well K̂ estimates K for different K and dimensions d.
Table 4.7 presents the mean and variance of K̂ when MDL is used to estimate the number of components K. The corresponding result when using AIC is presented in Table 4.8.
From the tables we can conclude that the estimates produced by AIC, at least for the data in this situation, are of better quality than the estimates produced by MDL. AIC also seems to be more consistent when the dimension of the data is varied. MDL tends to overestimate the number of components for low-dimensional data, and to underestimate it for high-dimensional data.
Note that a correct estimate of the number of data clusters does not imply an optimal distribution of the model components, and the initialization of the EM algorithm remains a problem.
Due to these results, we have chosen to use AIC in Algorithm 2 to determine the number of mixture model components.
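As an illustration of the K̂ experiment described above, the sketch below fits Gaussian mixtures with an increasing number of components by EM and picks the number that minimizes an information criterion. It relies on scikit-learn's built-in aic and bic scores as stand-ins for the criteria (3.8) and (3.9), with BIC playing the role of Rissanen's MDL measure; the example data are made up.

import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_n_components(X, k_max=10, criterion="aic", seed=0):
    # Fit mixtures with K = 1..k_max components and return the K with the lowest score.
    scores = []
    for k in range(1, k_max + 1):
        gm = GaussianMixture(n_components=k, n_init=5, random_state=seed).fit(X)
        scores.append(gm.aic(X) if criterion == "aic" else gm.bic(X))
    return int(np.argmin(scores)) + 1

# Example: samples from a 3-component mixture in d = 2 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([-2, 0], [0, 2], [2, 0])])
print(estimate_n_components(X, criterion="aic"), estimate_n_components(X, criterion="bic"))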
Algorithm Results
We have used the test function suite presented above to evaluate Algorithm 2. From the tables we can conclude that:
• Overall the algorithm has a low rate of finding the location of all local minima.
• The rate of finding all local minima is decreasing when the problem dimension increases. Probably this is due to the fact that the accumulations of
points which are to be identified by the clustering method become more and
more diffuse. This problem is caused by the “curse of dimensionality”.
• Since the algorithm cannot even produce good results for problems with
dimension three and four, it will probably not be a sufficient tool for solving
the benchmark problem. Hence, we have not spent time on evaluation of the
algorithm for problems of higher dimensions.
Table 4.7: Estimated number of cluster components using MDL. The tables present the mean and variance for K̂, evaluated on 35 simulations.

Mean                          Mixture dimension (d)
Mixture components (K)      2        3        4        5        6        7        8
 1                       1.0000   1.0000   1.0000   1.0000   1.0000   1.0000   1.0000
 2                       2.0000   2.0000   2.0000   2.0000   2.0000   2.0000   2.0000
 3                       3.2000   3.4800   3.2400   3.4000   3.4000   3.4000   3.3200
 4                       4.8400   5.0000   4.8400   4.6400   4.5600   4.4800   4.2000
 5                       6.2800   6.3600   6.2400   5.6400   5.0000   5.4800   4.9600
 6                       6.8000   7.1200   7.1200   7.2800   6.2800   6.4400   5.6000
 7                       9.0000   8.5200   7.9600   7.8400   7.4000   6.9200   5.9600
 8                      12.0800  10.5600  10.0800   8.4400   9.2400   7.9200   7.3200
 9                      11.5600   8.7600  10.6000   9.9200   9.3200   8.1600   7.7600
10                      12.9200  12.7200  11.6000  11.6000   9.8000   8.9200   7.8000

Variance                      Mixture dimension (d)
Mixture components (K)      2        3        4        5        6        7        8
 1                       0        0        0        0        0        0        0
 2                       0        0        0        0        0        0        0
 3                       0.1667   0.7600   0.1900   0.3333   0.4167   0.5000   0.2267
 4                       0.8067   1.0000   0.5567   0.8233   0.5900   0.5933   0.4167
 5                       1.3767   1.0733   1.4400   1.6567   1.2500   0.7600   0.6233
 6                       0.5833   2.4433   1.1933   0.9600   1.4600   1.0900   1.0833
 7                       1.5000   1.6767   2.1233   1.2233   1.6667   1.5767   1.6233
 8                       3.5767   2.8400   1.0767   1.4233   2.1067   1.6600   1.0600
 9                       2.0067   2.2733   2.6167   1.9100   2.8100   1.5567   2.6900
10                       2.7433   2.8767   2.9167   2.3333   3.1667   1.4100   1.2500
4.2.2 Local Weighted Regression Approach (MSAF)

Algorithm 3 presents a multistart simulated annealing algorithm that uses a locally weighted regression algorithm to identify suitable points for restarting the algorithm at the next lower temperature.
Function estimation
The second method presented in Chapter 3 uses function estimation to identify
the regions of attraction for the objective function. The idea is then to use that
information when selecting points to restart the algorithm at each temperature
level. In Section 3.2.3 we concluded that the LWPR method, for our application,
is better suited than the RLS-based Lazy Learning Algorithm. This decision was
based on the fact that LWPR is better at handling noisy data, and the ability
to use not only boxcar weighting functions enhances the possibility to produce a good estimate even with sparse data.

Table 4.8: Estimated number of cluster components using AIC. The tables present the mean and variance for K̂, evaluated on 35 simulations.

Mean                          Mixture dimension (d)
Mixture components (K)      2        3        4        5        6        7        8
 1                       1.0000   1.0000   1.0000   1.0000   1.0000   1.0000   1.0000
 2                       2.0000   2.0000   2.0000   2.0000   2.0000   2.0000   2.0000
 3                       3.0000   3.0000   3.0800   3.0800   3.9600   3.0400   3.5600
 4                       5.4400   4.7600   4.6800   4.5600   4.4800   5.0800   5.0800
 5                       6.2800   5.9200   6.4400   5.6400   6.0400   5.8800   6.0800
 6                       7.0400   6.9600   7.4800   7.6000   7.2000   7.0800   7.0800
 7                       8.5600   8.5600   8.5200   8.4000   8.5600   7.8000   8.5600
 8                       9.6400   9.4000   9.9200   9.6000   9.6400   9.4800  10.0000
 9                      10.6400  10.5200  10.7200  10.8800  10.6800  10.6000  10.8000
10                      11.7200  11.7600  11.6000  11.3200  11.3600  11.4000  11.5600

Variance                      Mixture dimension (d)
Mixture components (K)      2        3        4        5        6        7        8
 1                       0        0        0        0        0        0        0
 2                       0        0        0        0        0        0        0
 3                       0        0        0.0767   0.0767   1.2067   0.0400   0.4233
 4                       1.3400   0.7733   0.5600   0.5067   0.5100   0.8267   0.8267
 5                       2.2100   0.5767   1.3400   0.7400   0.7900   0.7767   0.5767
 6                       1.2067   0.7900   1.0100   1.3333   0.8333   0.9933   0.9933
 7                       1.0067   1.1733   2.2600   1.9167   1.3400   2.0000   1.8400
 8                       1.6567   1.0833   1.1600   1.0833   1.7400   1.5933   0.8333
 9                       1.1567   1.5100   1.6267   0.7767   1.2267   1.7500   1.5000
10                       0.9600   1.5233   1.3333   1.8100   1.4067   1.2500   1.3400
High-quality data with rather high density is required to build an estimate good enough to capture the detailed properties of a function. This is intuitive, since only a dense data set can resolve every detail of the function. For problems of low dimension this is not an issue, but for higher dimensions the number of data samples has to be increased exponentially to maintain the same data density. No doubt, this becomes a problem as the problem dimension increases.
Algorithm Results
As mentioned, we have used the test function suite to evaluate Algorithm 3. The results are presented in Tables 4.4–4.6.
• For Test Function Nr. 1 and Test Function Nr. 2 the algorithm finds every local minimum at a quite good rate. It is worth noting that for Test Function Nr. 2 the minimum defined by T is quite tricky to locate, because the undisturbed function is quite flat around it.
• The algorithm can also be applied to Test Function Nr. 3. However, for this problem the complexity is considerably higher due to the one extra unknown parameter compared with Test Function Nr. 2.
• Overall we must conclude that the algorithm most likely is not applicable to the benchmark problem, which has eight unknown parameters; the number of function evaluations required would be too high. However, the concept is interesting and the results for problems of lower dimensions are satisfactory.
4.2.3 Multilevel Single Linkage (MSL)

The MSL algorithm is presented in Algorithm 4. For the evaluation of the algorithm we have used the test function suite presented in Section 4.1.
• The algorithm performs well for every test function.
• The algorithm uses many local searches, which results in an extensive number of function evaluations. This is a drawback if evaluation of the objective function is expensive.
• Using a Euclidean distance measure makes the method very sensitive to the scaling of the coordinates. However, this problem is present for all algorithms presented in this chapter: the Gaussian densities use a Euclidean measure, and the weights used for the function estimation approach are based on Euclidean measures.
4.3 Evaluation of the benchmark problem
The objective of the benchmark problem presented in Example 2.1 is to demonstrate the modeling process presented in Section 1.2.1 and to show how to use the methodology presented in Chapters 2 and 3.
According to the theory given in Section 2.2.4, to present results for the benchmark problem presented in Example 2.2 we must handle a problem consisting of two main parts. The first part is to find the set of models that have “enough” predictive power, and the second part is to identify which of those models show “significantly different” properties.
The former part of the problem is solved through the optimization problem (2.10). To solve the latter part, we analyse the models given by the first part in order to select which models give additional knowledge about the system. In Example 4.1 we apply this methodology to the benchmark problem.
Example 4.1: Insulin Signaling cont’d
As written above, the identification of descriptive models can roughly be divided into two main parts: the problem of finding suboptimal minima, which is solved through (2.10), and the analysis of the found solutions.
1. To solve the optimization problem (2.10) we have used the MSL-algorithm presented in Algorithm 4 and evaluated in Section 4.2.3. The resulting parameter values are presented in Table 4.9.
2. Analysing the solutions listed in Table 4.9, we can divide them into two sets, S_1 and S_2. The horizontal line in the table represents the divider. That S_1 and S_2 represent different properties of the system can be seen in Figure 4.2, which shows simulations of the system given by (2.2) using the parameter values listed in Table 4.9. Only one group of simulations mimics the overshoot that clearly appears in the measured data. Using that observation, we can conclude that only S_1 represents solutions with enough predictive power.
• The set S1 represents parameter values that correspond to models with
“enough” predictive power. Hence, the final set of solutions is S1 .
• For most of the parameters in S1 no clear relations can be seen between the
solutions. However, the value for the reaction velocity kR can be limited to
[0.3392, 0.5225].
• For most of the parameters no obvious relations can be found between the
parameters. One possible explanation for this can be related to the fact
that the problem is over-parameterized, at least with respect to available
measured data.
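One simple way to propose a candidate divider between the two sets is sketched below: sort the solutions by objective value and split at the largest gap. This is an illustration only; in the analysis above the division is ultimately justified by the simulated behaviour (the overshoot in Figure 4.2), not by the objective values alone, and the example values are made up to resemble the two groups in Table 4.9.

import numpy as np

def split_by_fval_gap(fvals):
    # Sort by objective value and split where the gap between consecutive values is largest.
    order = np.argsort(fvals)
    sorted_f = np.asarray(fvals, float)[order]
    gap = int(np.argmax(np.diff(sorted_f)))
    return order[:gap + 1], order[gap + 1:]   # indices of the S1 and S2 candidates

fvals = [153.8, 154.4, 156.7, 206.9, 3044.0, 3045.5, 3046.1]
s1, s2 = split_by_fval_gap(fvals)
print(s1, s2)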
Figure 4.2: Simulations using parameter values from Table 4.9.
Table 4.9: Insulin Signaling — Resulting Parameter Values

S1
   k1        k−1       k2        k3        k−3       kD        kR       kY         FVAL
 12.7122    0.9125   52.9289    0.9628    4.0382   61.1943   0.3392   18.2149   153.8000
 76.8881   13.7346   98.3896   45.3371  222.2496    5.7374   0.3147   19.8630   153.8912
 21.3379   67.7109   69.7877    2.4754  167.6284   97.6481   0.3431   18.3060   153.9990
 37.1172   79.5270   56.2027   15.3588   60.1809    4.4261   0.3307   17.8060   154.4000
 11.4749    0.3605   10.9673    2.5242   12.2177    6.8225   0.3700   16.2152   154.4067
  1.5150   37.6791   52.9884   17.9938  123.7422   79.9269   0.3638  129.1175   155.2468
 45.7914   47.6811    1.8184   15.6874   82.9557   66.2795   0.3653  125.2594   155.2681
  7.9010    0.1313   17.6429    3.6133    5.8290    2.5420   0.3754   15.6443   155.3209
 13.6422  113.4441    8.8086   16.3766  113.1853   74.5885   0.3691  117.9097   156.0062
  9.4384    0.3117    0.8141    9.9312    0.0516   10.3389   0.3976  101.4815   156.6710
 11.2442    0.3470    8.1062    2.4293    3.9089    2.5682   0.3871   14.5333   156.9818
  7.9930   28.2603   61.9113    3.6166   66.8479   23.6146   0.3861   17.4451   159.0000
 25.5096    0.5446    0.9704    4.1115    0.0607    8.8799   0.4492   51.0531   159.0810
  9.2456    0.4771   17.2389    2.3038   14.9478    5.9180   0.3308   13.8702   172.7306
  2.4979    4.4083    3.8308    7.2801    2.2564    5.4400   0.4896   46.5937   173.9222
  8.9191    1.4278    6.9403    1.2512    0.7961    2.5051   0.4581   13.3867   177.9620
  1.6000    3.6097    4.9193   12.4436    1.7531    9.7769   0.4125   97.8695   179.8842
  9.2325    4.3038    3.3635    3.4589    9.0535    7.5687   0.5193   21.1294   190.1176
  9.5359   11.2218    4.7083    9.8515    2.1359    2.1667   0.5225   23.8604   205.1252
  1.8371    2.6143    3.7419    7.4575   11.8739   20.2356   0.4755   67.3766   206.8978
71.2640
59.0103
74.0597
38.8884
15.9872
9.1238
83.7368
15.8926
20.1017
23.9004
64.8329
6.8673
67.4260
36.0190
47.7393
53.9721
51.5867
63.8451
89.9858
40.7388
82.2273
39.6913
82.0409
72.1714
59.7254
54.9758
61.1508
95.1013
49.4788
32.3084
10.9673
29.6947
78.1993
35.2583
96.3209
39.2847
41.4848
13.9950
49.5911
87.4996
93.0341
27.6566
45.2720
17.4046
26.7020
30.1223
55.3228
87.7380
59.1278
46.5307
87.7380
59.3455
16.0248
31.8288
42.9153
90.5991
15.8449
14.3051
81.8379
90.7251
8.1565
21.6421
9.3438
8.3747
35.3065
35.1443
11.9345
11.6103
97.0622
7.0981
21.6116
12.0609
55.4880
15.7554
11.8423
27.1626
17.1421
19.3827
20.2341
20.5098
65.3182
31.0775
14.1162
52.5000
50.1853
35.2084
94.0648
3044.0
3044.5
3045.3
3045.5
3045.5
3045.6
3045.6
3045.6
3045.9
3045.9
3046.0
3046.0
3046.1
3046.1
3046.1
3046.1
3046.1
3046.1
3046.1
3046.1
3046.1
3046.1
3046.1
3046.1
3046.1
3046.1
3046.1
3046.1
3046.2
S2
78.1743
7.8699
23.7106
47.6760
65.1409
21.2405
5.6715
12.2756
83.6243
61.1307
5.3163
78.9665
21.0033
95.9948
5.8629
73.2632
76.5338
36.6972
59.3531
34.8573
30.8818
75.8090
6.4428
38.2266
57.4830
19.4994
18.2321
56.6723
22.1273
62.6142
90.1017
30.9054
62.9425
49.4092
17.4420
83.4845
64.4575
16.0561
87.7380
90.1345
25.6369
96.3211
1.9012
45.2306
45.0510
0.2692
70.5716
54.4532
71.5775
9.9765
76.0282
71.0485
62.9481
53.1481
82.7012
48.5701
72.4792
23.8410
4.0718
32.9018
41.0300
8.5669
53.7606
78.9244
86.4774
24.3079
56.1800
43.5592
43.4199
86.9693
45.6574
64.8499
83.6849
33.3321
95.3703
44.8605
97.1169
97.6377
89.6455
19.7466
71.8637
97.2748
87.7380
18.9923
20.8317
54.2705
2.4404
74.1380
74.6083
3.9363
20.8855
72.2233
40.9115
17.2900
85.8078
52.2323
35.1731
77.5860
95.0964
18.9420
47.2734
76.6181
43.1366
31.1600
70.7737
75.3895
93.3002
99.4068
47.0799
65.9736
71.9665
33.5264
94.1727
59.8667
82.9896
56.2392
68.6549
62.3391
52.0661
41.7194
94.1469
41.0538
19.5047
6.2636
50.3048
33.0070
71.4865
96.1961
21.8728
6.4260
31.1469
57.1467
20.9052
16.0058
23.6198
6.5427
71.7929
17.5489
47.9794
8.6357
20.8175
98.1724
5.9576
32.8710
87.1935
Chapter 5
Conclusions and proposals for future work

5.1 Thesis conclusions
The main objective of this thesis has been to find suitable methods for parameter estimation of descriptive models. The methods have been used to present results for a benchmark problem built upon a hypothesis about insulin signaling in a fat cell. The central problem of the thesis was to find global optimization methods able to locate suboptimal function minima. To solve this problem three algorithms were presented, two (MSAC and MSAF) using new theory and one (MSL) using existing theory. The results from Chapter 4 show that:
• The MSAC-algorithm did not show good results for finding the function minima. A probable explanation is that Gaussian Mixture Modeling does not exhibit the robustness that is necessary to build a multistart algorithm on.
• The MSAF-algorithm showed satisfactory robustness for low-dimensional problems. However, building a function estimate detailed enough requires function evaluations on a dense grid. With growing problem dimension this makes the algorithm too computationally expensive for practical use. Still, the idea of using a function estimate for locating the regions of attraction is appealing.
• The MSL-algorithm is without doubt the simplest algorithm examined in this thesis. Nevertheless, it shows significantly better robustness than MSAC and MSAF.
• As mentioned in Section 4.2.3, one drawback with the MSL-algorithm is that it uses an extensive amount of function evaluations. For a low-dimensional problem with really expensive function evaluations the MSAF-algorithm can be the best choice, since every performed function evaluation is used to update the function estimate.
Using the MSL-algorithm we have shown some results for the benchmark problem which restrict the reaction rate for a certain part of the problem.
A comment on the algorithm results is that the MSAC and MSAF algorithms were developed to solve the joint problem of finding suboptimal solutions and handling variable scaling. The MSL-algorithm, based on a Euclidean distance measure, was never intended to solve the variable scaling problem.
5.2 Proposals for future work
One proposal for future work is to seek more knowledge within the field of global optimization. Associate Professor Dr E.M.T. Hendrix has long experience in global optimization and uniform covering, and has given valuable input to this thesis. An additional algorithm that should be examined is the Topographical Multilevel Single Linkage algorithm, which combines the MSL-algorithm with the Topographical Global Optimization algorithm of Professor Aimo Törn [25].
Bibliography
[1] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning.
Artificial Intelligence Review, 11(1-5):11–73, 1997.
[2] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive
least-squares algorithm. Advances in Neural Information Processing Systems,
11:375–381, 1999.
[3] I. O. Bohachevsky, M. E. Johnson, and M. L. Stein. Generalized simulated
annealing for function optimization. Technometrics, 28(3):209–217, 1986.
[4] G. Cedersund. Core-box Modelling. PhD thesis, Department of Signals and
Systems, Chalmers University of technology, Göteborg, Sweden, 2006.
[5] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829–836, 1979.
[6] W. S. Cleveland and C. L. Loader. Smoothing by Local Regression: Principles
and Methods. In W. Härdle and M. G. Schimek, editors, Statistical Theory
and Computational Aspects of Smoothing, pages 10–49. Springer, New York,
1996.
[7] C. Engström, editor. Nationalencyklopedin. Bra Böcker, 1995.
[8] T. G. W. Epperly and R. E. Swaney. Branch and bound for global NLP:
Iterative LP Algorithm and Results, chapter 2. Kluwer Academic Publishers,
1996.
[9] C. Fraley and A. E. Raftery. How many clusters? which clustering method?
answers via model-based cluster analysis. The Computer Journal, 41(8):578–
588, 1998.
[10] M. Gaviano and D. Lera. Test functions with variable attraction regions for global optimization problems. J. of Global Optimization, 13(2):207–223, 1998.
[11] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical
Learning. Springer, August 2001.
43
44
Bibliography
[12] E. M. T. Hendrix, P. M. Ortigosa, and I. García. On success rates for controlled random search. J. of Global Optimization, 21(3):239–263, 2001.
[13] H. Kitano. Systems biology: A brief overview. Science, 295(5560):1662–1664,
2002.
[14] C. Kittel and H. Kroemer. Thermal Physics. W. H. Freeman and Company,
August 2002.
[15] L. Ljung, editor. System identification (2nd ed.): theory for the user. Prentice
Hall PTR, Upper Saddle River, NJ, USA, 1999.
[16] L. Ljung and T. Glad, editors. Modellbygge och Simulering. Studentlitteratur
AB, 2003.
[17] M. Locatelli. Relaxing the assumptions of the multilevel single linkage algorithm. Journal of Global Optimization, 13:25–42, 1997.
[18] J. H. Mathews and K. D. Fink. Numerical Methods Using MATLAB. Simon
& Schuster, 2004.
[19] C. G. Moles, P. Mendes, and J. R. Banga. Parameter estimation in biochemical pathways: A comparison of global optimization methods. Genome
Research, 13:2467–2474, 2003.
[20] J. J. Moré, B. S. Garbow, and K. E. Hillstrom. Testing unconstrained optimization software. ACM Trans. Math. Softw., 7(1):17–41, 1981.
[21] M. Nakhkash and M. T. C. Fang. Application of the multilevel single-linkage
to one-dimensional electromagnetic inverse scattering problem. IEEE Transactions on Antennas And Propagation, 47(11):1658–1668, 1999.
[22] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical
recipes in C (2nd ed.): the art of scientific computing. Cambridge University
Press, New York, NY, USA, 1992.
[23] S. Roberts. Parametric and non-parametric unsupervised cluster analysis.
Pattern Recognition, 30:261–272, 1996.
[24] H. Schmidt and M. Jirstrand. Systems biology toolbox for matlab: a computational platform for research in systems biology. Bioinformatics, 22(4):514–515,
2006.
[25] A. Törn and S. Viitanen. Topographical global optimization. Princeton University Press, Princeton, NJ, USA, 1992.
[26] K. Tsai and F. Wang. Evolutionary optimization with data collocation for
reverse engineering of biological networks. Bioinformatics, 21(7):1180–1188,
2005.
[27] A. Törn and A. Zilinskas, editors. Global Optimization. Lecture Notes in Computer Science. Springer-Verlag, 1989.
[28] E. Ulfheilm. Modeling of metabolic insulin signaling in adipocytes. MSc thesis, Department of Electrical Engineering, Linköping University, Linköping,
Sweden, 2006.
[29] J. J. Verbeek, N. Vlassis, and B. J. A. Kröse. Efficient greedy learning of
gaussian mixture models. Neural Computation, 15(2):469–485, 2003.
[30] S. Vijayakumar, A. D’souza, and S. Schaal. Incremental online learning in
high dimensions. Neural Comput., 17(12):2602–2634, 2005.