Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete

Global optimization methods for estimation of descriptive models

Master's thesis carried out in Automatic Control
at Linköping Institute of Technology
by
Tobias Pettersson

LITH-ISY-EX--08/3994--SE
Linköping 2008

Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping, Sweden

Supervisors: Henrik Tidefelt, ISY, Linköpings universitet
             Gunnar Cedersund, IBK, Linköpings universitet
Examiner: Jacob Roll, ISY, Linköpings universitet

Linköping, 28 March, 2008

Swedish title: Globala optimeringsmetoder för identifiering av beskrivande modeller

Abstract

Using mathematical models to understand and store knowledge about a system is not a new field in science, with early contributions dating back to, e.g., Kepler's laws of planetary motion. The aim is to obtain such comprehensive predictive and quantitative knowledge about a phenomenon that mathematical expressions, or models, can be used to forecast every relevant detail about that phenomenon.
Such models can be used for reducing pollution from car engines, preventing aviation incidents, or developing new therapeutic drugs. Models used to forecast, or predict, the behavior of a system are referred to as predictive models. For such models, the estimation problem aims to find one model; it is well known and can be handled using standard methods for global nonlinear optimization.
Descriptive models are used to obtain and store quantitative knowledge of a system. Estimation of descriptive models has not received much attention in the literature so far; instead, the methods used for predictive models have been applied. Rather than finding one particular model, the parameter estimation for descriptive models aims to find every model that contains descriptive information about the system. Thus, the parameter estimation problem for descriptive models cannot be stated as a standard optimization problem. The main objective of this thesis is to propose methods for estimation of descriptive models. This is done using methods for nonlinear optimization, including both new and existing theory.

Keywords: systems theory, global optimization, descriptive models

Sammanfattning

Using descriptive models to understand and store knowledge about a system is nothing new in science; early examples date all the way back to Kepler's models of planetary motion. The ultimate aim is to attain such knowledge within a field that mathematical expressions or models can be used to predict every detail within it. Such knowledge can be used to reduce emissions from car engines, to avoid aviation accidents through diagnosis and surveillance, or to develop new drugs. Models used to forecast, or predict, the behavior of a system are called predictive models. For these, the problem of estimating the model's unknown parameters is well treated in the literature and can be solved using known methods for nonlinear optimization. Descriptive models are used to gather and store knowledge about a system's internal composition. So far, estimation of descriptive models has been carried out using methods intended for predictive models. Instead of trying to find a single model, as for predictive models, we want to find a set of models to describe the system. Every model that contains descriptive information about the system is of interest.
The goal of this thesis is to propose methods for estimation of descriptive models. This is done using new as well as already known methods for global optimization.

Acknowledgments

First I want to thank Jacob Roll, Gunnar Cedersund, and Henrik Tidefelt for your support, patience, and helpful insights, and for always giving good answers to my questions. I would also like to thank my father Daniel, my mother Kerstin, Magnus and his wife Linda, David, Josef, and Elias. Your support has been, and will be, very important to me.

Contents

1 Introduction
  1.1 Systems theory
    1.1.1 An overview
    1.1.2 Mathematical modeling
    1.1.3 Model types
    1.1.4 Model applications
    1.1.5 Parameter estimation
  1.2 Systems biology
    1.2.1 Modeling process
  1.3 Thesis objectives
  1.4 Thesis outline

2 Thesis background
  2.1 Nonlinear systems
    2.1.1 Nonlinear State-Space Models
  2.2 Parameter estimation
    2.2.1 Set of possible models
    2.2.2 Prediction error minimization
    2.2.3 Predictive models
    2.2.4 Descriptive models

3 Theory
  3.1 Global optimization
  3.2 Simulated Annealing
    3.2.1 Simulated Annealing Algorithm
    3.2.2 Cluster analysis approach
    3.2.3 Function estimation approach
  3.3 Clustering methods for Global Optimization
  3.4 Test function generation

4 Results
  4.1 Test function suite
    4.1.1 Building a test function
  4.2 Optimization algorithm results
    4.2.1 Gaussian Mixture Modeling Approach (MSAC)
    4.2.2 Local Weighted Regression Approach (MSAF)
    4.2.3 Multilevel Single Linkage (MSL)
  4.3 Evaluation of the benchmark problem

5 Conclusions and proposals for future work
  5.1 Thesis conclusions
  5.2 Proposals for future work

Bibliography

Chapter 1

Introduction

A desirable aim within many science fields is to obtain such comprehensive predictive and quantitative knowledge about a phenomenon that mathematical expressions, or models, can be used to forecast every relevant detail about that phenomenon. Such models can be applied in a wide range of applications, e.g., reducing pollution from car engines, preventing aviation incidents through diagnosis and surveillance of airplanes, or developing new therapeutic drugs. Clearly, this is a vast task that requires knowledge of the specific science field, e.g., biology, chemistry, or economics, as well as general knowledge of how to structure, estimate, and prove the validity of a model [4]. The main problem of this thesis is connected with the model estimation mentioned above.
Throughout the presentation, special focus is on identification of biological systems, a science field named systems biology. This introductory chapter is intended to serve as a motivation for the thesis, including a brief introduction to the problem of system identification and the field of systems biology. After these basic concepts have been introduced, the thesis objectives and an outline of this report are presented.

1.1 Systems theory

1.1.1 An overview

Systems theory truly must be considered an interdisciplinary field of science with applications to, e.g., mechanics, biology, economics, and sociology. There exist many descriptions of a system. One is that a system is a group of elements forming a complex whole [7]. In [15] a more mechanistic view is presented: a system is described as an object in which signals interact and produce observable outputs. The system is affected by external stimuli, which are called inputs if they are controllable and disturbances otherwise. The disturbances are divided into those that can be measured directly and those that cannot. Figure 1.1 shows a graphical representation of a system.

Figure 1.1: A system F with controllable input u, observable output y, disturbances that are possible to measure (w) and those that are not (v).

Clearly, the concept of a system is very broad, and the elements do not even have to be of a physical kind.

The problem of system identification is to obtain knowledge of the relationship between the external stimuli and the observable output of the system. Such knowledge is called a model of the system. Models can be of various kinds and sophistication, adjusted to serve the model's intended use. Knowing how to ride a bike is an example of a mental model. The mental model describing how to ride a bike, and how the maneuvers are affected by, e.g., wind gusts, emerges during the learning phase.
Another type of model is a graphical model, where the relationship between the variables is represented graphically as plots, diagrams, or tables. This thesis will deal with a third kind, mathematical models. A mathematical model aims to describe the interaction between the external stimuli and the observable output by use of mathematical expressions [15].

1.1.2 Mathematical modeling

For all the kinds of models described above, the model is built from a combination of observed data and prior knowledge. E.g., the child learning to ride the bike will experience repeated failures before the model is good enough for its intended use. Identification of mathematical models is developed by mainly two different approaches: modeling and system identification. In modeling, the system is divided into subsystems whose properties are known. The subsystems are then put together, forming a model for the whole system. An introduction to modeling is presented in [16]. The other approach is system identification, in which a model structure is adjusted to fit experimental data. Please note the distinction between the model structure and the identified model. A model structure is a set of possible candidate models, a user's choice built upon a priori knowledge within the specific science field. A model is a certain candidate model included in the model structure. The system identification process consists of three basic entities: a data set of input-output data, a suitable model structure, and an evaluation method to determine the best model within the chosen structure given the data set. A thorough framework for system identification is presented in [15].

1.1.3 Model types

Each of these two different approaches, or a combination of them, gives rise to three different kinds of mathematical models.

Black-box models

Models only used for prediction do not require knowledge about the intrinsic interactions between the variables.
Since it is the relation between the external stimuli and the observable output that is of interest, the system can be considered a "black box" that describes the input-output behavior [4]. An extensive framework for black-box modeling is presented in [15].

White-box models

It can be possible to describe a system by mathematical expressions without the need for measurement data. Such models are referred to as white-box models. The knowledge required to build a white-box model is often application dependent; e.g., mechanical models can be built upon the knowledge of mechanics.

Gray-box models

A common case of system modeling is when the basic model structure can be postulated but some parameters within it are unknown. The problem is then to estimate the unknown parameters so that the model fits the measured data. This kind of mathematical model is called a gray-box model, and gray-box models are the kind of models focused on in this thesis.

1.1.4 Model applications

A model can be used to forecast, or predict, the behavior of a system. Predictive knowledge is useful in many applications, e.g., control, diagnosis, or surveillance. Since only the input-output relation is of interest for prediction, all model types listed above can be used. Models used for prediction of a system are referred to as predictive models [4], [26].

Models can also be used to obtain and store quantitative knowledge of a system [15], [26]. The modeling process, further described in Section 1.2.1, aims to explore the validity of some assumptions or hypotheses about the system. Since the intrinsic structure of the model must be known, only white- and gray-box models can be used for this application. This is because the intrinsic structure is used for the interpretation between the identified model and the resulting quantitative knowledge of the system. Models used for this application are referred to as descriptive models [4], [13].
1.1.5 Parameter estimation

For gray-box models, the problem of finding suitable estimates of the unknown parameters in the model is referred to as the parameter estimation problem, or the inverse problem [19]. For a nonlinear system, the inverse problem can be stated as a nonlinear optimization problem. The problem of parameter estimation is further discussed in Section 2.2.

1.2 Systems biology

Systems biology aims at understanding biological networks at a system level, rather than analyzing isolated parts of a cell or an organism. Identifying and describing all the parts of an organism is like listing all the parts of an airplane. As for the airplane, the assembly of a biological network has great impact on its functionality. Knowledge about the separate parts of the biological network will still be of great importance, but the focus of systems biology is mainly on understanding a system's structure and dynamics [13].

The dynamic modeling of a biological system can be stated as a reverse problem, i.e., from measurements describing the whole system, obtain detailed knowledge about every single element of the system. This problem requires a well-suited model structure of the system, efficient methods to identify the model parameters, and experimental data of good quality [26].

1.2.1 Modeling process

As mentioned above, descriptive modeling aims to reveal and store quantitative knowledge about every element and detail of the system [13]. Consider the modeling process presented in Figure 1.2, describing a modeling framework described in [4]. The framework combines strengths from gray-box modeling based on mechanistic knowledge with statistical methods for system identification.

1. In Phase I, a model structure is built based on mechanistic explanations and assumptions about the system. Parameter estimation is then used to fit the model to estimation data.
In practice, this is done by adjusting the unknown parameters according to a parameter estimation method. One of the methods used for parameter estimation works by minimizing the prediction errors of the model [15], and is further explained in Section 2.2.2. If an acceptable agreement between the model and the estimation data can be achieved, the model validity should be further established by the process of model validation. One way to prove the validity of a model is to compare the model outputs with a set of fresh data, the validation data. Another approach, preferably used if no validation data is available, is to use statistical hypothesis tests to find the likelihood that the data has been generated by the estimated model. Model validation includes a number of aspects, e.g., the subjective intuitive feeling whether the model behaves as expected, whiteness tests of the residuals, or checks of parameter confidence intervals [15].

• If the model turns out to pass the validation process it is accepted, which indicates that the postulated model structure has captured at least some true property of the system.

• If the model exposes considerable inconsistencies with the validation data it must be rejected or modified.

Note that both of these outcomes lead the modeling process forward, and new knowledge of the system is obtained in both cases [4].

2. The result derived in Phase I is translated into a core-box model, which is a model consisting of a minimal core model with all details of the original gray-box models, combined with quality links from the minimal core model to its possible predictions [4]. As shown in Figure 1.2, this procedure is used to reveal necessary properties for the remaining explanations about the system.

Figure 1.2: The modeling process. In Phase I, mechanistic explanations are tested against experimental data by model-based hypothesis testing, leading to rejection of explanations; in Phase II, core-box modeling yields the necessary properties of the remaining explanations.
1.3 Thesis objectives

For predictive models, the parameter estimation problem can be stated as finding a parameter vector that minimizes the prediction errors according to some criterion. Hence, the parameter estimation problem for predictive models can be stated as a standard nonlinear optimization problem, whose main purpose is to locate the global minimum of some cost function.

For estimation of descriptive models, every parameter vector that gives a model predictor that is "good enough" is of interest. Hence, the parameter estimation problem for descriptive models is to find a set of parameter vectors such that the corresponding model predictors are "good enough". For that reason, the parameter estimation problem for descriptive models cannot be stated as a standard nonlinear optimization problem. However, we will in this thesis describe how to use nonlinear optimization methods for estimation of descriptive models. In Section 2.2 a formal description of the problem is formulated, and in Chapter 3 we present the methods used.

One of the methods used for global optimization is the "Annealed Downhill Simplex" algorithm. This algorithm, and global optimization in general, will be further described in Chapter 3. The idea of this thesis is to modify the algorithm so that it can be used to handle the parameter estimation problem for descriptive models. The thesis objectives are:

• Modify the "Annealed Downhill Simplex" algorithm so that it is possible to restart the algorithm at multiple points at each temperature.

• Find methods to solve the problem that arises when the parameters are of different scaling.

• Include the possibility to find sub-optimal parameter sets.

1.4 Thesis outline

As stated in this introductory chapter, the problem of parameter estimation (the inverse problem) can be stated as a nonlinear optimization problem.
Chapter 2 surveys previous research in the fields of parameter estimation in nonlinear systems, nonlinear optimization, and methods for evaluating optimization algorithms. The theory and methods used in the thesis are presented in Chapter 3. Chapter 4 contains the thesis results. Chapter 5 contains a discussion of the results, followed by some conclusions and suggestions for further work.

Chapter 2

Thesis background

The intention of this chapter is to review the most basic concepts used in this thesis. Further, this chapter will describe the parameter estimation problem when applied to descriptive modeling. According to Chapter 1, modeling a dynamic system requires three basic entities: a well-suited model structure, an efficient parameter estimation method, and experimental data of good quality. The last of them, the experimental data, is in this thesis assumed to be given. The first and second entities will be further described in this chapter.

2.1 Nonlinear systems

As stated in Section 1.1.4, descriptive models are used to reveal and store knowledge of a system [4]. Hence, the model structure to use must have the ability to incorporate physical insights.

2.1.1 Nonlinear State-Space Models

A model structure that fulfills those requirements is the state-space form. For a state-space model, the system inputs u, noise, and outputs y are related by differential or difference equations. Using an auxiliary state vector x(t), the system of equations can be written in a very compact and appealing way [15]. A nonlinear state-space model with process noise w(t), measurement noise v(t), and the unknown parameters of the system gathered in θ is in continuous time presented as:

    ẋ(t) = f(t, x(t), u(t), w(t); θ)
    y(t) = h(t, x(t), u(t), v(t); θ).        (2.1)

Note in (2.1) the separation of the system dynamics description from the measurement equation. This is a very natural and appealing way to incorporate physical insights of the system into the model.
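To make the state-space form concrete, the sketch below simulates a model of the form (2.1) with a simple forward-Euler scheme, disregarding the noise terms. The system, the parameter value, and the step size are hypothetical illustrations, not taken from the thesis.

```python
# Minimal sketch: simulating a nonlinear state-space model
#   x'(t) = f(t, x, u; theta),  y(t) = h(t, x, u; theta)
# with forward Euler and the noise terms set to zero.

def simulate(f, h, x0, u, theta, dt, n_steps):
    """Forward-Euler simulation; returns the output sequence y."""
    x, t, ys = x0, 0.0, []
    for _ in range(n_steps):
        ys.append(h(t, x, u(t), theta))
        x = x + dt * f(t, x, u(t), theta)  # Euler step
        t += dt
    return ys

# Hypothetical first-order nonlinear system: x' = -theta*x^2 + u, y = x.
f = lambda t, x, u, th: -th * x**2 + u
h = lambda t, x, u, th: x
y = simulate(f, h, x0=1.0, u=lambda t: 0.0, theta=0.5, dt=0.01, n_steps=500)
```

With zero input, the state decays monotonically from x(0) = 1; the same simulation routine works for any right-hand side f and measurement map h of the form (2.1).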
As mentioned in Chapter 1, setting up a model structure requires detailed knowledge about the science field for which the model is used. In this thesis we will not go further into any specific science field. Instead, we will use a benchmark problem to evaluate the parameter estimation methods. The benchmark problem aims to describe the process of insulin signaling in a fat cell. An extensive discussion is presented in [28] and [4]. In Example 2.1 the model structure and a picture of the signaling pathway are presented.

Example 2.1: Insulin Signaling

The benchmark problem of this thesis is to describe and store knowledge about the biological pathway shown in Figure 2.1. The pathway is an assumption about how to describe insulin signaling in fat cells.

Figure 2.1: The assumed model structure for the problem of insulin signaling in a fat cell, with states IR, IRins, IRP, IRpPTP, and IRPTP, rate constants k1, k−1, k2, k3, k−3, kD, and kR, and the compartments ECM and Cytosol.

The model structure is given by:

    d(IR)/dt      = −k1·IR + kR·IRPTP + k−1·IRins
    d(IRins)/dt   = k1·IR − k2·IRins − k−1·IRins
    d(IRP)/dt     = k2·IRins − k3·IRP + k−3·IRpPTP        (2.2)
    d(IRpPTP)/dt  = k3·IRP − k−3·IRpPTP − kD·IRpPTP
    d(IRPTP)/dt   = kD·IRpPTP − kR·IRPTP
    measIRp       = kY1·(IRP + IRpPTP)

Note that the model structure is presented in nonlinear state-space form. The last of the equations is the measured quantity, in this case a sum of the two variables IRP and IRpPTP. The system input is given by instantly adding a certain amount of insulin to the fat cell. For modeling purposes this can be described by setting initial conditions on the model states:

    IR(0)     = 10
    IRins(0)  = 0
    IRP(0)    = 0
    IRpPTP(0) = 0
    IRPTP(0)  = 0.

In Figure 2.2 the measured data is presented. Clearly, the claim that a large amount of high-quality data is needed for modeling has to be relaxed for this problem.
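As an illustration of how a model structure like (2.2) can be simulated, the sketch below integrates the equations with forward Euler, starting from the initial conditions of Example 2.1. The rate-constant values are hypothetical placeholders (not the thesis's estimates), and the output scaling kY1 is set to 1.

```python
# Sketch: simulating the insulin-signaling model structure (2.2) with
# forward Euler. The rate constants are hypothetical placeholders.

def insulin_rhs(s, k):
    IR, IRins, IRP, IRpPTP, IRPTP = s
    dIR      = -k['k1']*IR + k['kR']*IRPTP + k['km1']*IRins
    dIRins   =  k['k1']*IR - k['k2']*IRins - k['km1']*IRins
    dIRP     =  k['k2']*IRins - k['k3']*IRP + k['km3']*IRpPTP
    dIRpPTP  =  k['k3']*IRP - k['km3']*IRpPTP - k['kD']*IRpPTP
    dIRPTP   =  k['kD']*IRpPTP - k['kR']*IRPTP
    return [dIR, dIRins, dIRP, dIRpPTP, dIRPTP]

k = dict(k1=0.5, km1=0.1, k2=1.0, k3=0.8, km3=0.2, kD=0.3, kR=0.1)  # hypothetical
state = [10.0, 0.0, 0.0, 0.0, 0.0]   # initial conditions from Example 2.1
dt = 0.001
for _ in range(int(15 / dt)):        # simulate t in [0, 15]
    derivs = insulin_rhs(state, k)
    state = [x + dt * d for x, d in zip(state, derivs)]

measIRp = state[2] + state[3]        # kY1*(IRP + IRpPTP), with kY1 = 1 here
```

A useful sanity check on any implementation of (2.2): the right-hand sides sum to zero, so the total amount of receptor, IR + IRins + IRP + IRpPTP + IRPTP, stays at its initial value 10 throughout the simulation.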
Figure 2.2: Available measurement data for the benchmark problem (measIRp plotted against time t for t = 0 to 15).

When observing a system, the signals can only be measured in discrete time. In discrete time, a state-space model has the form:

    x(t + 1) = fd(t, x(t), u(t), w(t); θ)
    y(t) = hd(t, x(t), u(t), v(t); θ).        (2.3)

For brevity, the past input-output measurements available at time t are denoted:

    Z^t = (y^t, u^t) = (y(1), u(1), . . . , y(t), u(t)).

The selected model structure usually includes some unknown parameters; e.g., in Example 2.1 we can find eight unknown parameters kx. It is often favorable to collect those parameters in one vector θ. As mentioned in Chapter 1, the problem of parameter estimation is the problem of, given an experimental data set Z^t, finding values of θ so that the selected model is able to reproduce the experimental results [15].

2.2 Parameter estimation

Section 2.1 described the problem of how to select a suitable model structure built upon physical insight. Hence, two of the three main entities of the system identification problem have been described. The remaining task is to determine suitable values for the unknown parameters of the model, i.e., the parameter estimation.

2.2.1 Set of possible models

Every different choice of θ in (2.3) gives rise to a new model. If we denote the set of possible θ by DM, the set of possible models is written as:

    M* = {M(θ) | θ ∈ DM}.        (2.4)

The parameter estimation problem can be stated differently for predictive and descriptive models, and this will be further explored below. However, for both approaches the problem is to find an adequate mapping from the estimation data Z^N to the set DM:

    Z^N → θ̂N ∈ DM.        (2.5)

2.2.2 Prediction error minimization

There exist several methods for estimation of the unknown parameters θ in the models. The prediction-error identification approach includes procedures such as least-squares and maximum-likelihood for parameter estimation [15].
Prediction-error methods estimate the unknown parameters of the model by minimizing the difference between measurements and model predictions. For that reason, a requirement is that a predictor for the model structure of interest is available. The process of how to build a predictor is treated in [15] for a number of model structures. A possible predictor for the discrete-time state-space model structure (2.3) is given by simply disregarding the process noise w(t):

    x(t + 1, θ) = fd(t, x(t, θ), u(t), 0; θ)
    ŷ(t, θ) = hd(t, x(t, θ), u(t), 0; θ).        (2.6)

Now we possess all the tools needed to form the prediction error of a model at time t:

    ε(t, θ) = y(t) − ŷ(t, θ).        (2.7)

A measure of the size of the complete sequence of prediction errors at times t = 1, . . . , N is given by:

    V(θ, Z^N) = (1/N) · Σ_{t=1}^{N} ε²(t, θ).        (2.8)

The process of finding the θ that minimizes (2.8) is called parameter estimation by the unweighted least-squares method.

2.2.3 Predictive models

For predictive models, the main objective is to find the parameter vector θ that minimizes the prediction errors, i.e., find the parameter vector θ̂N that minimizes (2.8). Hence, for predictive models the parameter estimation problem can be stated as a global optimization problem with the aim to find the global minimum of (2.8). If we again denote the parameter vector by θ, this can be stated through:

    θ̂N = {θ ∈ DM : ∀ θ* ∈ DM : V(θ) ≤ V(θ*)}.        (2.9)

In Figure 2.3 the parameter estimation problem for predictive models is illustrated. As shown by the figure, the intention is to locate one parameter vector θ̂N that optimizes the objective function.

Figure 2.3: The result of the parameter estimation problem for predictive models is a single point θ̂N on the cost function V(θ).

Equation (2.9) is a standard nonlinear optimization problem for which several methods exist. A short survey of global nonlinear optimization is presented in Chapter 3.
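A minimal numerical illustration of the least-squares criterion (2.8) and the predictive estimate (2.9): for a hypothetical one-parameter predictor ŷ(t, θ) = θ·u(t) and synthetic data, the cost is evaluated on a grid and the single best θ is kept. Both the predictor and the data are illustrative, not from the thesis.

```python
# Sketch: the least-squares cost (2.8) for a toy one-parameter predictor
# y_hat(t, theta) = theta * u(t). Data are synthetic.

def cost_V(theta, ys, us):
    """V(theta, Z^N) = (1/N) * sum of squared prediction errors."""
    N = len(ys)
    return sum((y - theta * u) ** 2 for y, u in zip(ys, us)) / N

us = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]          # roughly y = 2*u plus noise

# Predictive estimation (2.9): keep the single best theta on a grid.
grid = [i / 100 for i in range(0, 401)]
theta_hat = min(grid, key=lambda th: cost_V(th, ys, us))
```

For these data the minimizer lands close to θ = 2, the value used to generate them; the predictive problem returns this one point only.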
2.2.4 Descriptive models

As mentioned in Section 1.3, all models that prove to have "enough" predictive power are of interest. For the case of descriptive models, we are not searching for one particular value of the parameter vector θ, but rather for a subset θ̂*N such that

    θ̂*N = {θ ∈ DM : ∀ θ* ∈ DM : V(θ) ≤ V(θ*) + κ}.        (2.10)

The parameter κ denotes the threshold that determines whether a solution is "good enough". Equation (2.10) is illustrated in Figure 2.4. As can be seen, the parameter estimation problem for descriptive models is to find the values of θ that result in models that are "good enough", rather than finding the best model.

Figure 2.4: The result of the parameter estimation problem for descriptive models is a set of points θ̂*N on the cost function V(θ), determined by the threshold κ.

For predictive models, the parameter estimation problem can be stated as a global optimization problem through (2.9). The corresponding expression for descriptive models given in (2.10) is rather a description of the set of parameter vectors we would like to find. Hence, we have no methods for finding the complete set θ̂*N. However, instead of finding a complete description of θ̂*N, it is possible to find parameter vectors θ̂k ∈ θ̂*N, k = 1, . . . , K. Since we want the parameter vectors θ̂k to describe θ̂*N as well as possible, the resulting parameter vectors should be "significantly different". Figure 2.5 shows how the parameter vectors θ̂1 and θ̂2 can be used as an approximation of θ̂*N.

Figure 2.5: The parameter vectors θ̂1 and θ̂2 are used as an approximation of θ̂*N.

The problem of how to find suitable θ̂k can be stated as finding nonlinear optimization methods able to localize suboptimal minima of a function. The main content of Chapter 3 is to propose methods able to handle this problem.

Chapter 3

Theory

The main objective of this thesis is, as stated in Section 1.3, connected with parameter estimation for descriptive models.
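The difference from the predictive case can be sketched numerically: instead of keeping the single minimizer, descriptive estimation (2.10) keeps every θ whose cost is within κ of the global minimum. The cost function below is an illustrative toy with one global and one suboptimal basin, not a model from the thesis.

```python
# Sketch: descriptive estimation (2.10) on a toy cost with two basins.
# We keep every theta whose cost is within kappa of the global minimum,
# so the answer is a set of points, not a single point.

def V(theta):
    # Illustrative cost: global minimum at theta = -1, a suboptimal
    # basin (offset by 0.3) around theta = +1.
    return min((theta + 1) ** 2, (theta - 1) ** 2 + 0.3)

grid = [i / 100 for i in range(-300, 301)]
V_star = min(V(th) for th in grid)
kappa = 0.5
good_set = [th for th in grid if V(th) <= V_star + kappa]
```

Here `good_set` consists of two disjoint intervals, one around each minimum, mirroring Figure 2.5: since the suboptimal basin lies within κ of the global minimum, both basins contribute "good enough" models.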
As mentioned in Section 2.2.4, the parameter estimation problem can be stated as a nonlinear global optimization problem. Hence, this chapter will deal with nonlinear global optimization, with the main objective to present a toolbox for solving the problem defined by (2.10).

3.1 Global optimization

Global optimization can be described as the task of locating the absolutely best set of parameters that optimizes an objective function. While global optimization aims to find a global minimum, the corresponding aim for local optimization is to locate a local minimum of the objective function. Several different suggestions for how to classify the methods used for global optimization have been made, and the most common is to roughly divide the methods into two different approaches [12], [19].

Deterministic or exact methods aim to guarantee that the global minimum will be located. However, the computational effort for this is high, and it grows rapidly with the problem size [19]. One deterministic approach is covering methods, which include the grid-based search procedure. The grid-based search method relies on the fact that practical objective functions normally have a bounded rate of change. Hence, a sufficiently dense grid would guarantee the detection of the global minimum with a prescribed accuracy. However, it is easy to realize that with growing dimensionality the number of grid points will be very large [27]. This is due to a feared phenomenon referred to as the curse of dimensionality, which means that the number of sample points within a volume of a certain size must be increased exponentially with the number of dimensions to retain the same density of samples.

Stochastic methods are all based on some sort of random generation of feasible trial points followed by a local optimization procedure. In contrast to deterministic methods, many stochastic methods can locate the vicinity of the global minimum with relative efficiency.
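The exponential growth behind the curse of dimensionality is easy to quantify: keeping a fixed density of m samples per dimension, a d-dimensional grid needs m^d points. A small sketch:

```python
# Sketch: grid size for grid-based covering methods. Keeping a fixed
# density of m samples per dimension, a d-dimensional grid needs m**d
# points, i.e., exponential growth in the number of dimensions.

def grid_points(samples_per_dim, dims):
    """Grid size needed to keep the per-dimension sample density fixed."""
    return samples_per_dim ** dims

for d in (1, 2, 5, 10):
    print(d, grid_points(10, d))   # 10, 100, 100000, 10000000000
```

Already at ten parameters (roughly the size of the benchmark problem in Example 2.1, were one to grid over all rate constants), ten samples per dimension would require 10^10 cost evaluations.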
However, the price paid is that global optimality cannot be assured [19]. The toolbox of stochastic methods includes procedures like random search methods, clustering methods, and evolutionary computation.

The first method examined in this thesis is the Simulated Annealing (SA) algorithm, which is a random search method [27]. In its original formulation, the algorithm determines the location of the global minimum. Hence, the algorithm must be modified to fulfil the requirements of Section 2.2.4, i.e., to also find suboptimal local minima. A more thorough survey of SA is presented in Section 3.2. The second stochastic method used is Multi-Level Single Linkage, which is a clustering method. The algorithm finds, at least in theory, the locations of all local minima of the objective function. Section 3.3 surveys clustering methods in general and also describes the Multi-Level Single Linkage algorithm more extensively.

3.2 Simulated Annealing

As stated in Section 1.3, one idea of this thesis is to modify the "Annealed Downhill Simplex" algorithm so that it is possible to find suboptimal local minima of a function. This section contains a short historic review of the algorithm. After that, the modified algorithm is presented.

3.2.1 Simulated Annealing Algorithm

The Simulated Annealing algorithm is a random search method originally presented by Metropolis, which attempts to simulate the behavior of atoms in equilibrium at a given temperature. It has been named after the procedure of metal annealing [3]. The algorithm was first used to solve combinatorial optimization problems, e.g., the traveling salesman problem. Later on, numerous different variations of the algorithm have been developed, some of which are applicable to continuous as well as constrained problems.
The SA algorithm included in the Systems Biology Toolbox for Matlab [24] is based on the Nelder-Mead Downhill Simplex algorithm (NMDS), where a simplex transforms in the function space to find the optimal function value [18]. At high temperature the simplex moves over the function space almost randomly. However, as the temperature is lowered, the algorithm behaves more and more like a local search.

NMDS was presented in [22] and has gained in popularity due to its numerical stability. The method requires only function evaluations, not derivatives. A simplex is the geometrical figure which, in N dimensions, consists of N + 1 corners, i.e., for two dimensions a simplex is a triangle. NMDS performs function optimization by simplex transformation in the function space. Depending on the function values at the simplex points, the algorithm reflects, expands, or contracts the simplex with the intention to perform a structured examination of the simplex surroundings. The idea is to repeatedly upgrade the simplex point with the worst function value to a new, better point. In [22] an annealing scheme is applied to NMDS, here referred to by the name "Annealed Downhill Simplex". In Algorithm 1 the structure of the Nelder-Mead Simulated Annealing is presented.

Algorithm 1 Nelder-Mead Simulated Annealing
Global minimization by the "Nelder-Mead Simulated Annealing". Only the basic structure of the algorithm is presented here; for further details see [22].
Requirements: Initial starting guess X0, start temperature T1, stop temperature Ts, and the number of iterations used for each temperature N.
1: Initiate parameters and put k = 1.
2: while Tk > Ts do
3: Perform N iterations of the Annealed Downhill Simplex at temperature Tk, starting from the best point available so far. This is done by adding a positive logarithmically distributed random variable, proportional to T, to each function value associated with every vertex of the simplex.
For every new point tried as a replacement point according to the Nelder-Mead scheme, a similar random variable is subtracted. This procedure results in a simplex with almost random behavior at higher temperatures, and behavior like an ordinary local search at lower temperatures.
4: Set Tk+1 lower than Tk and put k = k + 1.
5: end while
6: return Best point found.

The idea is to modify the algorithm presented in Algorithm 1, which originally is a single-start method, i.e., one point is used to restart the algorithm at each temperature level, into a multistart method, i.e., the algorithm restarts in N points at each temperature level. If the points used to restart the algorithm are chosen correctly, this should make it possible to find suboptimal local minima of the objective function.

We will define the region of attraction of a local minimum x* as the set of points in X from which a local search procedure converges to x*. An ideal case, which is generic for all multistart global optimization methods, is to start exactly one local search procedure in every region of attraction. Hence, the most demanding problem of multistart methods is to recognize the region of attraction of every local minimum [27].

The first idea is based on the approach that if one could retain only points with relatively low function values, then these points would form groups around some of the local minima. This idea is illustrated in Figure 3.1. Such groups can be located by use of a clustering method, and finally one search procedure is started from the center of each identified cluster. We have evaluated an algorithm based on this procedure, which is further described in Section 3.2.2. Another approach, explored in Section 3.2.3, is to estimate the objective function that is to be minimized by using a function regression method. The idea is simply to identify the regions of attraction of the function by exploring how the function behaves between points with relatively low function values.
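The thermal perturbation described in step 3 of Algorithm 1 can be sketched as follows. This is a minimal illustration of the described scheme, under our own assumptions; the helper names `thermal` and `accept` are ours, not the toolbox implementation.

```python
import math
import random

def thermal(f, T, rng):
    """Add a positive, logarithmically distributed fluctuation,
    proportional to the temperature T, to a stored function value."""
    return f + T * (-math.log(rng.random()))

def accept(f_vertex, f_trial, T, rng):
    """Nelder-Mead comparison with thermal noise: the candidate value is
    lowered by a similar fluctuation, so at high T the outcome is almost
    random, while at T -> 0 it reduces to the ordinary comparison."""
    return f_trial - T * (-math.log(rng.random())) < thermal(f_vertex, T, rng)
```

At T = 0 both fluctuations vanish and `accept` performs the plain downhill comparison of the original NMDS.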
Figure 3.1: By retaining points with lower function value than a threshold Vγ, groups of points are formed around local minima.

3.2.2 Cluster analysis approach

As mentioned in Section 3.2, one idea is to use cluster analysis to identify accumulations of points with relatively low function value. For each cluster, the center is used to initiate a downhill simplex search. Hence, not only the area close to the best point found so far will be explored, but a more "broad" search is performed, including suboptimal function areas. In Algorithm 2 we present the basic structure of the algorithm.

Algorithm 2 Multistart Simulated Annealing — The clustering approach (MSAC)
Requirements: Initial starting guess X0, start temperature T1, stop temperature Ts, and the number of iterations used for each temperature N.
1: Initiate parameters and put k = 1.
2: while Tk > Ts do
3: Perform N iterations of Annealed Downhill Simplex at temperature Tk.
4: Set Tk+1 lower than Tk and put k = k + 1.
5: Apply the condition given by (3.1) to the N points and use cluster analysis on the retained points Lγ to find suitable starting points at the next temperature.
6: end while
7: return The best points found.

As illustrated in Figure 3.1, one way to concentrate the points around the local minima is to only retain points with relatively low function value:

Lγ = {θ | V(θ) < Vγ}. (3.1)

Vγ is a constant defining which points will be included in the set. The problem of how to obtain the intrinsic structure of clustered data when no other prior information is available is referred to as unsupervised cluster analysis [9]. The literature within the area is extensive and the number of clustering algorithms is numerous.
In this thesis we use a method called model-based clustering, where the data is assumed to be generated by some parametric probability distribution, and the problem is then to identify the most likely distribution for the data [23].

Model-based clustering

The model-based clustering method is based on the assumption that the data can be described as a mixture of models in which each model corresponds to one cluster. One approach is to use Gaussian distributions for describing the models; the method is then referred to as Gaussian mixture modelling. A Gaussian density in a d-dimensional space, with mean µ ∈ R^d and d × d covariance matrix Σ, is given by the probability density function

φ(x; θ) = (2π)^(−d/2) |Σ|^(−1/2) exp(−(1/2)(x − µ)^T Σ^(−1) (x − µ)), (3.2)

where θ denotes the parameters µ and Σ. A Gaussian mixture with K components is then given as a convex combination of Gaussian densities:

f(x) = Σ_{j=1}^{K} π_j φ(x; θ_j), with Σ_{j=1}^{K} π_j = 1 and ∀ j : π_j ≥ 0,

where the π_j are called the mixing weights and the φ(x; θ_j) the components of the mixture. The mixture likelihood approach aims to maximize

L_M(θ_1, . . . , θ_K; π_1, . . . , π_K | x) = Π_{i=1}^{N} f(x_i) = Π_{i=1}^{N} Σ_{j=1}^{K} π_j φ(x_i; θ_j). (3.3)

Equation (3.3) provides a measure of how well the mixture model is adjusted to the sample data. Hence, given sample data, we would like to maximize L_M. A popular method for maximization of (3.3) is the Expectation-Maximization (EM) algorithm [29]. Given a K-component mixture with respect to a dataset X_N = {x_1, . . . , x_N}, x_i ∈ R^d, the mixture parameters can be updated by iterative application of the following equations:

P(j; x_i) = π_j φ(x_i; θ_j) / f(x_i) (3.4)
π_j = Σ_{i=1}^{N} P(j; x_i) / N (3.5)
µ_j = Σ_{i=1}^{N} P(j; x_i) x_i / (N π_j) (3.6)
Σ_j = Σ_{i=1}^{N} P(j; x_i)(x_i − µ_j)(x_i − µ_j)^T / (N π_j). (3.7)

In (3.4), called the expectation step, the term P(j; x_i) is denoted the responsibility of model j for observation x_i.
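As a concrete illustration, a one-dimensional version of the update equations (3.4)-(3.7) can be sketched as below. The function name `em_step` is our own, and scalar variances stand in for the covariance matrices Σ_j; this is a sketch of one EM iteration, not the thesis implementation.

```python
import math

def em_step(x, pi, mu, var):
    """One EM iteration for a 1-D Gaussian mixture, following (3.4)-(3.7)."""
    N, K = len(x), len(pi)
    phi = lambda xi, m, v: math.exp(-(xi - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
    # Expectation (3.4): responsibilities P(j; x_i), normalized by f(x_i).
    P = [[pi[j] * phi(x[i], mu[j], var[j]) for j in range(K)] for i in range(N)]
    for i in range(N):
        f = sum(P[i])  # mixture density f(x_i)
        P[i] = [p / f for p in P[i]]
    # Maximization (3.5)-(3.7): weighted maximum-likelihood updates.
    pi_new = [sum(P[i][j] for i in range(N)) / N for j in range(K)]
    mu_new = [sum(P[i][j] * x[i] for i in range(N)) / (N * pi_new[j]) for j in range(K)]
    var_new = [sum(P[i][j] * (x[i] - mu_new[j]) ** 2 for i in range(N)) / (N * pi_new[j])
               for j in range(K)]
    return pi_new, mu_new, var_new
```

Run on two well-separated groups of points with starting means near each group, one step already places the component means close to the group centers.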
In the expectation step the current estimates of the mixture are used to find the responsibilities according to the point density for each model. The estimated responsibilities from (3.4) are used in the maximization step to update the estimates of the parameters by weighted maximum-likelihood fits. The update of the mixing weights is given by (3.5), the mean µ_j for model j is updated in (3.6), and (3.7) describes the estimation of the covariance Σ_j [11].

It can be shown that each iteration leads to a sequence of mixtures {f^i} with nondecreasing likelihood function L_M^i. The EM algorithm guarantees convergence to a local optimum, i.e., a globally optimal solution is not guaranteed [29]. Due to this, the final mixture distribution is highly dependent on the initial distribution f^0.

Another problem is how to identify the number of clusters within the data, i.e., how to select the optimal number of components for the mixture. For a fixed number of components, the EM algorithm can be used to find (at least in a local sense) the θ that maximizes (3.3). However, as the number of components increases, the likelihood function for the mixture will also increase. The whole idea of using model mixtures is to group close points, but that fails if the number of mixture components is the same as the number of data points.

There are two main ways to handle this problem. The most favourable situation is if fresh data with the same properties, i.e., the same number of clusters, is available. Cross-validation can then be used to select the number of components to be used in the mixture [15]. However, this approach is only useful for applications where the number of clusters in the data is known to be constant. Hence, the method is not applicable to our problem, since we do not know the correct number of clusters. In the case when no fresh data is available, the problems of how to select the number of components in the mixture and how to estimate the parameters can be phrased as a joint problem.
The problem is then to, as before, minimize a cost function, but now with an additional penalty on model complexity. The purpose is to find a model that describes the data sufficiently well, and at the same time not use an unnecessarily complex model [15].

For this thesis, we have used two different criteria and evaluated their performance when applied to our problem. For details about the criteria, see [15]. Akaike's information criterion (AIC) is the first of the methods. AIC aims at minimizing

AIC = min_{θ,k} (1/N)[−L_M(x|θ) + dim θ]. (3.8)

The second criterion evaluated is Rissanen's minimum description length (MDL) principle. MDL aims at minimizing

MDL = min_{θ,k} (1/N)[−L_M(x|θ) + (dim θ / 2) · log N]. (3.9)

We used Monte Carlo simulation applied to a data set generated from Gaussian densities with known properties to compare AIC and MDL. The results are presented in Chapter 4.

Figure 3.2: An example of an estimated Gaussian mixture with 5 components.

Using the cluster analysis method described above, all components needed to run Algorithm 2 have now been presented. We have evaluated the algorithm using the test functions presented in Section 3.4, and the results are presented in Chapter 4.

3.2.3 Function estimation approach

Another approach to the problem of how to find suitable points to restart the algorithm from is to estimate the true objective function between the already evaluated points using some function regression method.

Locating the regions of attraction by function estimation

The idea is to examine the topology of the function by analysing the function values on straight lines drawn between pairs of points. If no significant function "peak" can be found between the points, the points are assumed to be contained in the same region of attraction. In Figure 3.3 this procedure is explained, where the "peak" between the pair of points is denoted by γ.
Consider the situation of examining the topology between a pair of points denoted by x1 and x2:

1. A function estimation method is used to calculate function predictions between the points. In Figure 3.3 the function estimates are dotted.
2. The function predictions are used to calculate two measures we denote by "min from left" (Ml) and "min from right" (Mr). They are formed by stepping between the points using the function predictions. At each point, the smallest value of the function prediction found so far is stored. In Figure 3.3 the Ml measure is dashed and Mr is dash-dotted.
3. For every evaluation point between x1 and x2, find the largest difference between the function prediction and the maximum of Ml and Mr. Denote the largest value found by γ.
4. If γ is larger than a threshold α, both points are used to restart the algorithm from at the next lower temperature.

In the following text we will present a motivation for the selection of α. Remember that SA attempts to simulate the behavior of atoms in equilibrium at a given temperature. The probability of making a transition from a current atom state s1 to a new candidate state s2 at temperature T is given by the Boltzmann factor:

P(E(s1)) / P(E(s2)) = e^(−(E(s1) − E(s2))/T), (3.10)

where E(s1) and E(s2) are the system energy states associated with s1 and s2 [14]. We will use this theory to answer a question consisting of two main parts. First, do the two points x1 and x2 belong to the same region of attraction? If they do, what is the probability, given that the algorithm is only restarted in one of these points, of still finding both function "valleys" at the next temperature level? The reason for stating these questions is that we do not want to use unnecessarily many restart points for the algorithm. If the probability of finding the function "valley" at a lower temperature is high, then we do not have to restart the algorithm from both x1 and x2 at this temperature.
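The measures Ml, Mr, and γ from the steps above can be sketched as follows; the function name `peak_measure` and the list `fhat` of function predictions along the line are our own illustrative choices.

```python
def peak_measure(fhat):
    """Given function predictions fhat along the line from x1 to x2, form
    the running minima Ml (from the left) and Mr (from the right) and
    return gamma, the largest gap between a prediction and the larger of
    the two running minima at that point."""
    n = len(fhat)
    Ml, m = [], float("inf")
    for v in fhat:                      # running minimum from the left
        m = min(m, v)
        Ml.append(m)
    Mr, m = [None] * n, float("inf")
    for i in range(n - 1, -1, -1):      # running minimum from the right
        m = min(m, fhat[i])
        Mr[i] = m
    return max(fhat[i] - max(Ml[i], Mr[i]) for i in range(n))

# Both points are kept as restart points only when peak_measure(fhat)
# exceeds the threshold alpha discussed in the text.
```

For a profile with two valleys separated by a peak, such as [1, 0, 2, 0.5, 1], γ is the height of the peak above the shallower valley; for a monotone profile, γ is zero and the points are treated as sharing one region of attraction.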
Using this theory as a motivation, we have determined the threshold α through

P_default = e^(−α/T_{k+1}), (3.11)

where P_default is an a priori specified parameter. Equation (3.11) can be reformulated as

α = T_{k+1} · log(1/P_default),

giving an explicit expression for the threshold α.

Figure 3.3: Function predictions, dotted in the figure, are used to examine the function behavior between a pair of points represented by circles. The distance γ is used to determine if the points belong to the same region of attraction. The "min from left" measure is dashed and the "min from right" measure is dash-dotted.

Locally Weighted Regression

Given sample inputs x ∈ R^m and outputs y ∈ R, consider the mapping f : R^m → R, and assume that the output is generated as noisy measurements of a function:

y_i = f(x_i) + ε_i. (3.12)

The problem of regression analysis is to estimate the function f based on the samples. A common approach to function regression is to build the function estimate using local information, i.e., to use adjacent sample points to build a local model. One such method is Locally Weighted Regression, which fits local models to nearby data. The data used for building the model are weighted according to the distance from the query point x_q [1].

Suppose that we are given a set of sample data {(x_i, y_i)}_{i=1}^{N} produced by (3.12). We will show how an estimate can be formed using linear local models. Locally weighted regression solves a weighted least squares problem at each query point x_q:

min_{α(x_q), β(x_q)} Σ_{i=1}^{N} w_i(x_q, x_i)[y_i − α(x_q) − β(x_q)x_i]², (3.13)

with the estimate then given by f̂(x_q) = α̂(x_q) + β̂(x_q)x_q. Equation (3.13) is a quadratic function which can be minimized analytically through the weighted normal equations [15]. Let b(x) = [1, x] be a vector-valued function, and let B be the N × 2 regression matrix with ith row b(x_i)^T. Also let W(x_q) be the N × N diagonal matrix with ith element w_i(x_q, x_i).
Then, the estimate that minimizes (3.13) is given by

f̂(x_q) = b(x_q)^T (B^T W(x_q) B)^(−1) B^T W(x_q) y, (3.14)

where the measured outputs are given by y. Equation (3.14) shows how the estimate f̂(x_q) is built from local models [11]. For a more extensive explanation of local regression, see [1], which contains a comprehensive tutorial on the method. Essential contributions to the method are also presented in [5] and [6].

When using local regression in practice, three basic components must be chosen: the bandwidth, i.e., how many data points are used to build the estimate; the weight function w_i, i.e., how to calculate the data weights; and the parametric family to use for the local model [6].

Selection of the parametric family is simply the choice of which polynomial degree to use for the local models. This selection is a trade-off between variance and bias. Using polynomials of high degree leads to a less biased, but more variable, estimate [1]. The most common choices are polynomials of first or second degree. The above choices must be tailored to each data set used in practice.

A central issue is how many sample points should be used to build the estimate, i.e., how to select the bandwidth. Basically two choices for selection of bandwidth exist. One choice is to use a fixed bandwidth. However, this can lead to a bad estimate due to large changes in the density of the sample data. The other is to use an adjustable bandwidth, based on the values of x_q. The advantage of this method is that if the part of the input space surrounding the query point x_q is very sparse, more sample data can be included in the estimate. One example of adjustable bandwidth is the k-nearest-neighbour selection, where k can be selected using leave-one-out validation [6].

The requirements on a weighting function are straightforward. The function maximum should be located at zero distance and the function should decay smoothly as the distance increases [6].
Several different standard choices for weighting functions exist; among the most commonly used are Gaussian kernels, quadratic kernels, and tricube kernels. For this thesis, we have evaluated two different implementations of the Locally Weighted Regression method. First the Lazy Learning Algorithm is presented, and then the Locally Weighted Projection Regression algorithm is described.

The Lazy Learning Algorithm

By application of the recursive least squares algorithm (RLS), a fast implementation of the Locally Weighted Regression method is presented by the authors of [2]. The method is explained in detail in [15]. Recursive methods process the input-output data one by one as they become available. For the Lazy Learning Algorithm, all data are available at the same time, but in accordance with RLS new data points are included in the estimate recursively, i.e., the estimate is incremented with points one by one.

When building a function estimate at a query point x_q, the available data points are sorted with respect to distance from x_q. The data point closest to x_q is denoted x_{d1}. Then, the estimate is built by sequentially adding data points ordered by their distance from x_q.

The RLS algorithm leaves no possibility of using an arbitrary weight function. It would, however, be possible to use a weight function given by w_i = a^i, 0 < a < 1. This possibility is not used for the Lazy Learning algorithm; instead, every data point included in the estimate is given equal weight. This leads to a very unsmooth estimate, particularly if the data is very sparse or noisy. The most obvious advantage of the algorithm is without doubt that it is very time efficient. In Figure 3.4 the algorithm is used to produce estimates based on samples from a function with additive noise. Note that the function estimate is very notchy, which is a big problem for our application.
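For contrast with the equal-weight Lazy Learning estimate, a one-dimensional sketch of the weighted estimate (3.14) with a Gaussian kernel is given below. The closed-form 2x2 solve replaces the matrix expression, and the function name `lwr_predict` and the bandwidth parameter `h` are illustrative assumptions.

```python
import math

def lwr_predict(xs, ys, xq, h):
    """Locally weighted linear fit at the query point xq (1-D sketch of
    (3.14)): Gaussian weights w_i = exp(-(x_i - xq)^2 / (2 h^2)), with the
    2x2 weighted normal equations solved in closed form."""
    w = [math.exp(-(x - xq) ** 2 / (2 * h * h)) for x in xs]
    S = sum(w)
    Sx = sum(wi * x for wi, x in zip(w, xs))
    Sxx = sum(wi * x * x for wi, x in zip(w, xs))
    Sy = sum(wi * y for wi, y in zip(w, ys))
    Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    det = S * Sxx - Sx * Sx
    alpha = (Sxx * Sy - Sx * Sxy) / det   # local intercept
    beta = (S * Sxy - Sx * Sy) / det      # local slope
    return alpha + beta * xq
```

On exactly linear data the local fit recovers the line regardless of the weights, so querying y = 2x + 1 at any point returns the exact value.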
Figure 3.4: A function estimate produced by the RLS-LWR algorithm from noisy samples (nMSE = 0.110), shown together with the noisy data samples and the true function.

Incremental Online Learning in High Dimensions

Locally Weighted Projection Regression (LWPR) is an algorithm that forms nonparametric function estimates built from local linear models [30]. The algorithm is incremental, i.e., the local models can be updated whenever fresh data is available. In accordance with the description of Locally Weighted Regression, the algorithm comprises three basic components for learning a function estimate.

1. The LWPR algorithm uses linear functions for describing the data locally. Hence, the relation between the output y_i and the input x_i is given by

y_i = β_i x_i + ε_i, (3.15)

where ε_i denotes the measurement noise. The function coefficients β are updated using partial least squares regression (PLS). PLS uses the principal components of x_i to predict a score on the corresponding principal component for y_i. PLS determines the coefficients β by iteratively maximizing the covariance between the scores for the inputs x_i and the output y_i [30]. LWPR uses PLS to, if possible, reduce the dimension of the input space. This makes it possible to remove possibly redundant inputs. The number of projections to use for each linear model is controlled by keeping track of the mean-squared error (MSE) as a function of the number of projections included in the local model. Each local model is initiated with two (R = 2) projections, and the algorithm stops adding projections if the MSE at the next projection does not decrease by more than a predefined fraction of the previous MSE, i.e., if MSE_{r+1}/MSE_r > φ, where φ ∈ [0, 1].

2. For the RLS algorithm, the selection of weighting function is to assign the included data points equal weight.
LWPR includes the possibility to use an arbitrary weighting function, parameterized as a distance metric D_k for the kth local model. Hence, given a query point x_q, the weight w_k for a local model centered at c_k is given by

w_k = h(x_q, D_k, c_k). (3.16)

LWPR only allows a certain degree of overlap, limited by a predefined threshold. A high degree of overlap can result in a better estimate, but also makes the algorithm more computationally expensive. The final function approximation evaluated at a query point x_q is then given by

ŷ = Σ_{k=1}^{K} w_k ŷ_k / Σ_{k=1}^{K} w_k, (3.17)

where ŷ_k is the prediction of each local model.

3. The question of how many data points to use for each function estimate is referred to as the bandwidth problem. LWPR determines for each local model a region of validity, called a receptive field (RF), parametrized using the distance metric D. The shape and size of the local models are determined using a penalized leave-one-out cross-validation error,

J = (1 / Σ_{i=1}^{M} w_i) Σ_{i=1}^{M} w_i (y_i − ŷ_{i,−i})² + (γ/N) Σ_{i,j=1}^{N} D_{i,j}², (3.18)

where M denotes the number of data points in the training set, N denotes the number of local models used, and w_i are the weights associated with each data point. The first term of (3.18) is the mean leave-one-out cross-validation error of the local model (indicated by the subscript (i, −i)). The second term is a penalty term which ensures that the receptive fields cannot be indefinitely small, which would be statistically correct for building an asymptotically unbiased function approximation, but would also make the algorithm too computationally expensive. Adding a new data point x involves examining whether a new RF has to be created. If the point x has sufficient activation given by (3.16), a new RF is not created. Otherwise, a new RF is created with c = x, D = D_default.

As written above, compared to Lazy Learning, any weighting function can be applied to the LWPR algorithm.
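The blend (3.17) of local linear predictions can be sketched as below; the Gaussian receptive-field weight, the scalar metric `D`, and the hand-picked local models are illustrative stand-ins, not the result of LWPR training.

```python
import math

def blend(models, xq, D=5.0):
    """Weight-normalized combination (3.17) of local linear predictions.
    models: list of (c_k, a_k, b_k), where each local model predicts
    yhat_k = a_k + b_k * (xq - c_k); weights follow a Gaussian
    receptive-field profile around each centre c_k."""
    w = [math.exp(-0.5 * D * (xq - c) ** 2) for c, _, _ in models]
    yhat = [a + b * (xq - c) for c, a, b in models]
    return sum(wk * yk for wk, yk in zip(w, yhat)) / sum(w)
```

When every local model agrees with one global line, the blend reproduces that line exactly; disagreement between neighbouring models is smoothed out by the normalized weights.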
Hence, the ability to handle sparse data samples and noisy data improves considerably. This is demonstrated in Figure 3.5, which shows an estimate built from the same input data as in Figure 3.4. Note that the estimate produced with the LWPR algorithm is much smoother and the impact of the noise is reduced compared to Figure 3.4. Further, the RFs used to build the function estimate are shown in the figure.

Figure 3.5: The LWPR algorithm used on a noisy dataset (nMSE = 0.123): the noisy data samples, the fitted function, the true function, and an input-space view of the RFs.

We have implemented Algorithm 3 using the LWPR algorithm to estimate the objective function. Evaluation of the algorithm has been made using the scalable test functions presented in Section 3.4. The result is presented in Chapter 4.

Algorithm 3 Multistart Simulated Annealing — Function Estimation Approach (MSAF)
Requirements: Initial starting guess X0, start temperature T1, stop temperature Ts, and the number of iterations used for each temperature N.
1: Initiate parameters and put k = 1.
2: while Tk > Ts do
3: Perform N iterations of Annealed Downhill Simplex at temperature Tk.
4: Update the LWPR model using the N function evaluations from step 3.
5: Set Tk+1 lower than Tk and put k = k + 1.
6: Using the method presented in Section 3.2.3 and LWPR, find suitable points to restart the algorithm from at the next lower temperature.
7: end while
8: return The best points found.

3.3 Clustering methods for Global Optimization

Clustering methods for global optimization are based on two simple assumptions: first, that it is possible to obtain a sample of points which is concentrated in the neighbourhood of the local minimizers, and further, that this sample can be analysed with a clustering method giving the vicinity of the local minimizers.
The idea is then to start local optimizations from the center of each cluster. If this procedure is employed successfully, every local minimum, and hence the global minimum, will be found [27].

There are many ways to perform grouping of sample points around the local minima. One approach is to simply retain points with relatively good function values. This is the method used in Section 3.2.2. Another approach is to push the sample values closer to the local minima by performing a few steps of local optimization [27]. The other important step of a clustering method is the cluster analysis. We have already scratched the surface of this field in Section 3.2.2. However, it is worth mentioning again that there exist numerous approaches to cluster analysis and the number of algorithms is large.

Of course, this property of listing all local minima is in agreement with our demand to find suboptimal minimizers of the objective function, as expressed by (2.10). We will not go deeper into the theory of clustering methods. For a more extensive tutorial we recommend the survey [27]. However, as a comparison to the methods presented above, we have chosen to implement a clustering method known to be more efficient and accurate than Simulated Annealing [21]. The algorithm, named Multi-Level Single Linkage, is further described in Algorithm 4. As written above, we have implemented Algorithm 4 and used it as a reference for the other two approaches of Multistart Simulated Annealing.

Algorithm 4 The Multilevel Single-Linkage Algorithm
1: Generate N sample points from a uniform distribution over S_N = {x | l_i ≤ x_i ≤ h_i, i = 1, . . . , K} and calculate the corresponding function values at these points.
2: Calculate the critical distance r_k given by (3.19). Define a level set S_d such that S_d = {x ∈ S_N | f(x) ≤ d} and sort the points in order of descending function values.
3: A point is chosen as a start point for a local search if no point with a better function value within the critical distance r_k has already been used as a start point for a local search.
4: Add the found local minimum x̂ to the set of already found minima if it is not located closer than a predefined distance α to any already found minimum.
5: Stop the algorithm if the stopping rule (3.20) is satisfied; otherwise set k = k + 1 and go to 1.

The variable k denotes the iteration counter of the algorithm. The critical distance r_k is given by

r_k = π^(−1/2) (Γ(1 + K/2) ψ(S_N) ρ log(kN) / (kN))^(1/K), (3.19)

where Γ denotes the gamma function, ψ(S_N) denotes the Lebesgue measure of S_N, and ρ is a positive constant. According to [17], when ρ > 0, all local minima of the function will be located in a finite number of iterations with probability one. This result is only valid for a function with a finite number of local minima. Consider, e.g., an objective function with n local minima, n ∈ Z. If the algorithm is used to optimize the function, the number of iterations required would be at least n, since the rate of locating new local minima is at most one per iteration.

The stopping rule applied in step 5 uses a Bayesian estimate of the total number of local minima. W denotes the number of local minima found after k iterations, and D denotes the number of points in S_d. A Bayesian estimate of the total number of local minima is given by

W_exp = W(kD − 1) / (kD − W − 2).

The search procedure is terminated if

W_exp < W + 0.5. (3.20)

3.4 Test function generation

Testing is an instrumental part of developing an optimization algorithm. The purpose is twofold: the first function is to verify the performance of the algorithm, and the other benefit is to determine an adequate range of problems where the algorithm is applicable. Hence, selection of suitable test functions is very important.
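Returning briefly to Algorithm 4, the critical distance (3.19) and the stopping rule (3.20) are straightforward to state in code; the function names below are our own, and the sketch follows the formulas above.

```python
import math

def critical_distance(k, N, K, lebesgue, rho):
    """Critical distance (3.19):
    r_k = pi^(-1/2) * (Gamma(1 + K/2) * psi(S_N) * rho * log(kN) / (kN))^(1/K),
    where lebesgue is psi(S_N) and K the problem dimension."""
    kN = k * N
    inner = math.gamma(1 + K / 2) * lebesgue * rho * math.log(kN) / kN
    return inner ** (1 / K) / math.sqrt(math.pi)

def should_stop(W, k, D):
    """Stopping rule (3.20): terminate when the Bayesian estimate
    W_exp = W(kD - 1)/(kD - W - 2) falls below W + 0.5."""
    W_exp = W * (k * D - 1) / (k * D - W - 2)
    return W_exp < W + 0.5
```

For example, with W = 3 minima found after k = 10 iterations and D = 50 level-set points, W_exp is close to 3 and the rule fires; early on, with few samples, the estimate stays well above W + 0.5 and the search continues.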
Publications presenting nonlinear optimization algorithms are far more numerous than publications concerning testing of optimization software. It seems to be more fun to construct the algorithms than to test them. In [20] the authors list 36 different unconstrained optimization problems with solutions. The article also includes a comparison between two different optimization algorithms. In [8] the authors have gathered a set of constrained optimization problems, each including a known solution and a short description.

However, since most global optimization algorithms only aim to determine the global minimum, the locations and function values of the local minima are seldom listed. Of course, this is not acceptable for our case, since the purpose of our testing is to locate also suboptimal function minima. We would like a set of test functions with completely known properties for all local minima of the function, and it would be even better if we could specify such properties ourselves. Fortunately, the method for generation of test functions presented in [10] fulfills these requirements. A framework is presented for generation of tailor-made test functions with a priori defined: (i) problem dimension, (ii) number of local minima, (iii) local minimum points, (iv) function values of the local minima, and (v) variable size of the region of attraction for each local minimum. The technique is to systematically distort a convex quadratic function by the introduction of local minima. Figure 3.6 shows an example of a generated test function.

We have used the tailor-made functions to evaluate the algorithms described above. The results are presented in Chapter 4, but it is worth noting again that without the method for generating test functions presented in [10], verification of the algorithms would be much harder to perform.
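The construction of [10] is not reproduced here, but the general distortion idea, a convex quadratic with localized negative bumps added at each desired extra minimum, can be illustrated as below. The Gaussian-bump form, the default parameter values, and the name `bumpy` are our own assumptions, not the method of [10]; the parameters play the same roles as T, M_k, f_k, and ρ_k above.

```python
import math

def bumpy(x, y, T=(0.0, 0.1), bumps=(((-0.7, -0.7), 2.0, 0.2),)):
    """Convex quadratic centred at T, distorted by Gaussian bumps.
    Each bump is (centre M, depth f, width rho) and creates a local
    minimum near its centre when it is narrow and deep enough."""
    q = (x - T[0]) ** 2 + (y - T[1]) ** 2
    for (mx, my), depth, rho in bumps:
        q -= depth * math.exp(-((x - mx) ** 2 + (y - my) ** 2) / rho ** 2)
    return q
```

With the default parameters, the bump centre (−0.7, −0.7) sits below its surroundings and below the quadratic's own minimum region, so the distorted function has an extra, deeper local minimum there.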
Figure 3.6: A 2-dimensional test function with five local minima.

Chapter 4

Results

In the previous chapters we have presented methods to solve the objective of this thesis, stated in Section 1.3. In this chapter we evaluate the methods when applied to test functions produced using the method presented in Section 3.4.

4.1 Test function suite

As mentioned in Section 3.4, the choice of a suitable test function suite is essential for the development of optimization algorithms. In this section the test functions selected for evaluation of the optimization algorithms described in Chapter 3 are presented.

4.1.1 Building a test function

As described in Section 3.4, the method for generation of test functions presented in [10] generates tailor-made functions from a number of user-defined parameters. We will demonstrate the method by generating a 2-dimensional test function. The local minima of the generated function are given by the minimum of the undisturbed quadratic function, T, and the additional minima added by disturbing the quadratic function. We build a test function by disturbing a quadratic function with local minimum defined by T. By this disturbance, three additional local minima defined by Mk, k = 1, . . . , 3, with corresponding function values fk, are added to the function. The size of the region of attraction of each introduced local minimum is defined by ρk. The selected parameter values are presented in Table 4.1, and the resulting test function in Figure 4.1. Note that the local minimum represented by T is fully absorbed by M3. Hence, the total number of local minima is three for this function. Besides Test function Nr. 1, another two test functions are used. Table 4.2 presents the parameters for a 3-dimensional function with four local minima.
Table 4.1: Test function Nr. 1 (T1). Problem dimension nd = 2 and number of local minima m = 3. Local minimum of the undisturbed function: T = [0 0.1] with function value t = 1.1. The local minima introduced by the function disturbance are given by:

    M1 = [−0.7 −0.7]   f1 = 0.1    ρ1 = 0.3
    M2 = [0.7 0.7]     f2 = 0.15   ρ2 = 0.4
    M3 = [0 0]         f3 = 0.2    ρ3 = 0.4

Figure 4.1: Test function generated by using the parameters defined in Table 4.1. Three of the minima are represented by M1, M2, and M3, while the fourth is given by T.

Table 4.3 contains the parameters for a 4-dimensional test function with three local minima. The local minima of this function are defined by M1, M2, and T, giving a total of three local minima for the function.

4.2 Optimization algorithm results

This section contains the evaluation of the optimization algorithms presented in Chapter 3. The algorithms have been evaluated using the test function suite presented in Section 4.1. Table 4.4 presents the results for the algorithms applied to Test function Nr. 1, while the results for Test function Nr. 2 and Test function Nr. 3 are presented in Table 4.5 and Table 4.6. The tables show the rate of finding N correct local minima of the function, where N is given by the column; e.g., MSL found three correct local minima at a rate of 100% for Test function Nr. 1. The following sections evaluate the results for each algorithm.

Table 4.2: Test function Nr. 2 (T2). Problem dimension nd = 3 and number of local minima m = 3. Local minimum of the undisturbed function: T = [0 0.1 0] with function value t = 1.1. The local minima introduced by the function disturbance are given by:

    M1 = [−0.7 −0.7 −0.7]   f1 = 0.2   ρ1 = 0.5
    M2 = [−0.7 0.7 0.7]     f2 = 0.3   ρ2 = 0.5
    M3 = [0.7 −0.7 0.5]     f3 = 0.4   ρ3 = 0.5

Table 4.3: Test function Nr. 3 (T3). Problem dimension nd = 4 and number of local minima m = 2.
Local minimum of the undisturbed function: T = [0 0.1 0 0] with function value t = 2.1. The local minima introduced by the function disturbance are given by:

    M1 = [−0.7 −0.7 −0.7 −0.7]   f1 = 0.1   ρ1 = 0.5
    M2 = [0.7 0.7 0.7 0.7]       f2 = 0.2   ρ2 = 0.5

Table 4.4: Test function Nr. 1 — Results. Rate of correct minima found over a test cycle of 100 simulations.

    Algorithm    1      2      3
    MSAC         11%    27%    62%
    MSAF         4%     16%    80%
    MSL          0%     0%     100%

Table 4.5: Test function Nr. 2 — Results. Rate of correct minima found over a test cycle of 100 simulations.

    Algorithm    1      2      3      4
    MSAC         12%    46%    26%    16%
    MSAF         4%     14%    33%    49%
    MSL          0%     0%     0%     100%

Table 4.6: Test function Nr. 3 — Results. Rate of correct minima found over a test cycle of 100 simulations.

    Algorithm    1      2      3
    MSAC         -      -      -
    MSAF         51%    15%    34%
    MSL          0%     8%     92%

4.2.1 Gaussian Mixture Modeling Approach (MSAC)

In Section 3.2.2 we presented an algorithm, based on the simulated annealing algorithm, that uses Gaussian mixture modeling to identify suitable points for restarting the algorithm at the next lower temperature.

Number of model components

A requirement for using the algorithm is to identify a feasible number of mixture model components to describe the data. For this, two different methods were presented: Akaike's criterion (3.8) and Rissanen's MDL measure (3.9). The criteria can be used to estimate the number of components in a Gaussian mixture, an estimate we denote by K̂. By generating a tailor-made d-dimensional Gaussian mixture consisting of K components, it is possible to evaluate how well the criteria can recover the predefined number of mixture components K. We have used this method to evaluate how well K̂ estimates K for different K and dimensions d. Table 4.7 presents the mean and variance of K̂ when MDL is used to estimate the number of components K. The corresponding results when using AIC are presented in Table 4.8.
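Both criteria penalize model complexity against fit. A compact sketch of how K̂ can be obtained from them follows; the standard AIC and MDL/BIC forms are used here, which may differ from (3.8)–(3.9) by constants, and the log-likelihood values are assumed to come from an EM fit:

```python
import math

def n_params(K, d):
    """Free parameters of a d-dimensional, K-component Gaussian mixture:
    K - 1 mixing weights, K mean vectors, K full covariance matrices."""
    return (K - 1) + K * d + K * d * (d + 1) // 2

def aic(loglik, K, d):
    """Akaike's criterion: -2 log L + 2p (smaller is better)."""
    return -2.0 * loglik + 2.0 * n_params(K, d)

def mdl(loglik, K, d, N):
    """Rissanen's MDL (same form as BIC): -2 log L + p log N."""
    return -2.0 * loglik + n_params(K, d) * math.log(N)

def estimate_K(logliks, score):
    """K-hat: the K in 1..len(logliks) minimizing the criterion, where
    logliks[K-1] is the maximized log-likelihood of a K-component fit."""
    scores = [score(L, K) for K, L in enumerate(logliks, start=1)]
    return 1 + scores.index(min(scores))
```

For instance, given maximized log-likelihoods (−120, −100, −99) for K = 1, 2, 3 in dimension d = 2, both criteria select K̂ = 2: the small gain in fit from a third component does not pay for its six extra parameters.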
According to the tables, we can conclude that the estimates produced by AIC, at least for the data in this situation, are of better quality than the estimates produced by MDL. AIC also seems to be more consistent when the dimension of the data is varied, whereas MDL tends to overestimate the number of components for low-dimensional data and underestimate it for high-dimensional data. Note that a correct estimate of the number of data clusters does not imply an optimal distribution of model components. Moreover, the initialization of the EM algorithm remains a problem. Due to these results, we have chosen to use AIC to determine the number of mixture model components in Algorithm 2.

Algorithm Results

We have used the suite of test functions presented above for evaluation of Algorithm 2. From the tables we can conclude that:

• Overall, the algorithm has a low rate of finding the locations of all local minima.

• The rate of finding all local minima decreases when the problem dimension increases. This is probably because the accumulations of points that are to be identified by the clustering method become more and more diffuse, a problem caused by the "curse of dimensionality".

• Since the algorithm cannot produce good results even for problems of dimension three and four, it will probably not be a sufficient tool for solving the benchmark problem. Hence, we have not spent time evaluating the algorithm on problems of higher dimensions.

Table 4.7: Estimated number of cluster components using MDL. The tables present the mean and variance of K̂, evaluated over 35 simulations.
Mean of K̂ (mixture components K vs. mixture dimension d):

    K     d=2       d=3       d=4       d=5       d=6       d=7       d=8
    1     1.0000    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000
    2     2.0000    2.0000    2.0000    2.0000    2.0000    2.0000    2.0000
    3     3.2000    3.4800    3.2400    3.4000    3.4000    3.4000    3.3200
    4     4.8400    5.0000    4.8400    4.6400    4.5600    4.4800    4.2000
    5     6.2800    6.3600    6.2400    5.6400    5.0000    5.4800    4.9600
    6     6.8000    7.1200    7.1200    7.2800    6.2800    6.4400    5.6000
    7     9.0000    8.5200    7.9600    7.8400    7.4000    6.9200    5.9600
    8     12.0800   10.5600   10.0800   8.4400    9.2400    7.9200    7.3200
    9     11.5600   8.7600    10.6000   9.9200    9.3200    8.1600    7.7600
    10    12.9200   12.7200   11.6000   11.6000   9.8000    8.9200    7.8000

Variance of K̂:

    K     d=2       d=3       d=4       d=5       d=6       d=7       d=8
    1     0         0         0         0         0         0         0
    2     0         0         0         0         0         0         0
    3     0.1667    0.7600    0.1900    0.3333    0.4167    0.5000    0.2267
    4     0.8067    1.0000    0.5567    0.8233    0.5900    0.5933    0.4167
    5     1.3767    1.0733    1.4400    1.6567    1.2500    0.7600    0.6233
    6     0.5833    2.4433    1.1933    0.9600    1.4600    1.0900    1.0833
    7     1.5000    1.6767    2.1233    1.2233    1.6667    1.5767    1.6233
    8     3.5767    2.8400    1.0767    1.4233    2.1067    1.6600    1.0600
    9     2.0067    2.2733    2.6167    1.9100    2.8100    1.5567    2.6900
    10    2.7433    2.8767    2.9167    2.3333    3.1667    1.4100    1.2500

4.2.2 Local Weighted Regression Approach (MSAF)

Algorithm 3 presents a multistart simulated annealing algorithm that uses a local weighted regression algorithm to identify suitable points for restarting the algorithm at the next lower temperature.

Function estimation

The second method presented in Chapter 3 uses function estimation to identify the regions of attraction of the objective function. The idea is then to use that information when selecting points to restart the algorithm at each temperature level. In Section 3.2.3 we concluded that the LWPR method is, for our application, better suited than the RLS-based Lazy Learning algorithm. This decision was based on the fact that LWPR is better at handling noisy data, and the ability to use not only boxcar weighting functions enhances the possibility of producing a good estimate even with sparse data. High-quality data of quite high density is required to build an estimate good enough to mimic detailed properties of a function. This is intuitively correct, since only a data set of high density can explore every detail of a function. For problems of low dimension this is no problem. However, for higher dimensions the number of data samples has to be increased exponentially to maintain the same data density. No doubt, this becomes a problem as the problem dimension increases.

Table 4.8: Estimated number of cluster components using AIC. The tables present the mean and variance of K̂, evaluated over 35 simulations.

Mean of K̂ (mixture components K vs. mixture dimension d):

    K     d=2       d=3       d=4       d=5       d=6       d=7       d=8
    1     1.0000    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000
    2     2.0000    2.0000    2.0000    2.0000    2.0000    2.0000    2.0000
    3     3.0000    3.0000    3.0800    3.0800    3.9600    3.0400    3.5600
    4     5.4400    4.7600    4.6800    4.5600    4.4800    5.0800    5.0800
    5     6.2800    5.9200    6.4400    5.6400    6.0400    5.8800    6.0800
    6     7.0400    6.9600    7.4800    7.6000    7.2000    7.0800    7.0800
    7     8.5600    8.5600    8.5200    8.4000    8.5600    7.8000    8.5600
    8     9.6400    9.4000    9.9200    9.6000    9.6400    9.4800    10.0000
    9     10.6400   10.5200   10.7200   10.8800   10.6800   10.6000   10.8000
    10    11.7200   11.7600   11.6000   11.3200   11.3600   11.4000   11.5600

Variance of K̂:

    K     d=2       d=3       d=4       d=5       d=6       d=7       d=8
    1     0         0         0         0         0         0         0
    2     0         0         0         0         0         0         0
    3     0         0         0.0767    0.0767    1.2067    0.0400    0.4233
    4     1.3400    0.7733    0.5600    0.5067    0.5100    0.8267    0.8267
    5     2.2100    0.5767    1.3400    0.7400    0.7900    0.7767    0.5767
    6     1.2067    0.7900    1.0100    1.3333    0.8333    0.9933    0.9933
    7     1.0067    1.1733    2.2600    1.9167    1.3400    2.0000    1.8400
    8     1.6567    1.0833    1.1600    1.0833    1.7400    1.5933    0.8333
    9     1.1567    1.5100    1.6267    0.7767    1.2267    1.7500    1.5000
    10    0.9600    1.5233    1.3333    1.8100    1.4067    1.2500    1.3400

Algorithm Results

As mentioned, we have used the suite of test functions to evaluate Algorithm 4. The results are presented in Table 4.4.

• For Test function Nr. 1 and Test function Nr. 2 the algorithm can find every local minimum at a quite good rate.
It is worth noting that for Test function Nr. 2 the minimum defined by T is quite tricky to locate, since the undisturbed function is quite flat there.

• The algorithm can also be applied to Test function Nr. 3. However, for this problem the complexity is considerably higher due to the one extra unknown parameter compared with Test function Nr. 2.

• Overall, we must conclude that the algorithm is most likely not applicable to the benchmark problem, which has eight unknown parameters; the number of function evaluations required would be too high. However, the concept is interesting and the results for problems of lower dimension are satisfactory.

4.2.3 Multilevel Single Linkage (MSL)

In Algorithm 4 the MSL algorithm is presented. For evaluation of the algorithm we have used the test function suite presented in Section 4.1.

• For every test function the algorithm performs well.

• The algorithm uses many local searches, which results in an extensive number of function evaluations. This is a drawback if evaluation of the objective function is expensive.

• Using a Euclidean distance measure makes the method very sensitive to the scaling of the coordinates. However, this problem is present for all algorithms presented in this chapter: the Gaussian densities use a Euclidean measure, and the weights used in the function estimation approach are based on Euclidean measures.

4.3 Evaluation of the benchmark problem

The objective of the benchmark problem presented in Example 2.1 is to demonstrate the modeling process presented in Section 1.2.1 and to show how to use the methodology presented in Chapter 2 and Chapter 3. According to the theory given in Section 2.2.4, to present results for the benchmark problem presented in Example 2.2 we must handle a problem consisting of two main parts.
The first part is to find the set of models that have "enough" predictive power, and the second part is to identify which of those models show "significantly different" properties. The former part of the problem is solved through the optimization problem (2.10). To solve the latter part, we analyse the models given by the first part to select the models that give additional knowledge about the system. In Example 4.1 we apply this methodology to the benchmark problem.

Example 4.1: Insulin Signaling cont'd

As written above, the identification of descriptive models can roughly be divided into two main parts: the problem of finding suboptimal minima is solved through (2.10), and the second problem is handled by analysing the found solutions.

1. To solve the optimization problem (2.10) we have used the MSL algorithm presented in Algorithm 4 and evaluated in Section 4.2.3. The resulting parameter values are presented in Table 4.9.

2. Analysing the solutions listed in Table 4.9, we can divide them into two sets, S1 and S2; the horizontal line in the table represents the divider. That S1 and S2 represent different properties of the system can be studied in Figure 4.2, which shows simulation results for the system given by (2.2) using the parameter values listed in Table 4.9. It can be seen that only one group of simulations seems to mimic the overshoot that clearly appears in the measured data. Using that observation, we can conclude that only S1 represents solutions with enough predictive power.

• The set S1 represents parameter values that correspond to models with "enough" predictive power. Hence, the final set of solutions is S1.

• The value of the reaction velocity kR can be limited to [0.3392, 0.5225].

• For most of the parameters in S1, however, no obvious relations can be found between the solutions.
One possible explanation for this can be related to the fact that the problem is over-parameterized, at least with respect to the available measured data.

Figure 4.2: Simulations using parameter values from Table 4.9.

Table 4.9: Insulin Signaling — Resulting Parameter Values

S1:

    k1        k−1        k2        k3        k−3        kD        kR       kY         FVAL
    12.7122   0.9125     52.9289   0.9628    4.0382     61.1943   0.3392   18.2149    153.8000
    76.8881   13.7346    98.3896   45.3371   222.2496   5.7374    0.3147   19.8630    153.8912
    21.3379   67.7109    69.7877   2.4754    167.6284   97.6481   0.3431   18.3060    153.9990
    37.1172   79.5270    56.2027   15.3588   60.1809    4.4261    0.3307   17.8060    154.4000
    11.4749   0.3605     10.9673   2.5242    12.2177    6.8225    0.3700   16.2152    154.4067
    1.5150    37.6791    52.9884   17.9938   123.7422   79.9269   0.3638   129.1175   155.2468
    45.7914   47.6811    1.8184    15.6874   82.9557    66.2795   0.3653   125.2594   155.2681
    7.9010    0.1313     17.6429   3.6133    5.8290     2.5420    0.3754   15.6443    155.3209
    13.6422   113.4441   8.8086    16.3766   113.1853   74.5885   0.3691   117.9097   156.0062
    9.4384    0.3117     0.8141    9.9312    0.0516     10.3389   0.3976   101.4815   156.6710
    11.2442   0.3470     8.1062    2.4293    3.9089     2.5682    0.3871   14.5333    156.9818
    7.9930    28.2603    61.9113   3.6166    66.8479    23.6146   0.3861   17.4451    159.0000
    25.5096   0.5446     0.9704    4.1115    0.0607     8.8799    0.4492   51.0531    159.0810
    9.2456    0.4771     17.2389   2.3038    14.9478    5.9180    0.3308   13.8702    172.7306
    2.4979    4.4083     3.8308    7.2801    2.2564     5.4400    0.4896   46.5937    173.9222
    8.9191    1.4278     6.9403    1.2512    0.7961     2.5051    0.4581   13.3867    177.9620
    1.6000    3.6097     4.9193    12.4436   1.7531     9.7769    0.4125   97.8695    179.8842
    9.2325    4.3038     3.3635    3.4589    9.0535     7.5687    0.5193   21.1294    190.1176
    9.5359    11.2218    4.7083    9.8515    2.1359     2.1667    0.5225   23.8604    205.1252
    1.8371    2.6143     3.7419    7.4575    11.8739    20.2356   0.4755   67.3766    206.8978

S2: 29 solutions with FVAL values between 3044.0 and 3046.2 (individual parameter values omitted).

Chapter 5

Conclusions and proposals for future work

5.1 Thesis conclusions

The main objective of this thesis has been to find suitable methods for parameter estimation of
descriptive models. The methods have been used to present results for a benchmark problem built upon a hypothesis about insulin signaling in a fat cell. Connected to the main objective was the problem of finding global optimization methods able to locate suboptimal function minima. To solve this problem, three algorithms were presented: two (MSAC and MSAF) using new theory and one (MSL) using existing theory. The results from Chapter 4 show that:

• The MSAC algorithm did not show good results for finding the function minima. A probable explanation is that Gaussian mixture modeling does not exhibit the robustness necessary for building a multistart algorithm.

• The MSAF algorithm showed satisfactory robustness for low-dimensional problems. However, building the sufficiently detailed estimate of the function that is needed requires a function evaluation grid of high density. With growing problem dimension this makes the algorithm too computationally expensive for practical use. Still, the idea of using a function estimate to locate the regions of attraction is appealing.

• The MSL algorithm is without doubt the simplest algorithm examined in this thesis. Nevertheless, it shows significantly better robustness than MSAC and MSAF.

• As mentioned in Section 4.2.3, one drawback of the MSL algorithm is that it uses an extensive number of function evaluations. For a low-dimensional problem with really expensive function evaluations, the MSAF algorithm can be the best choice, since every performed function evaluation is used to update the function estimate.

Using the MSL algorithm we have shown some results for the benchmark problem which restrict the reaction rate for a certain part of the problem. A comment about the algorithm results is that the MSAC and MSAF algorithms were developed to solve the joint problem of finding suboptimal solutions and handling variable scaling.
The MSL algorithm, based on a Euclidean distance measure, was never intended to solve the variable scaling problem.

5.2 Proposals for future work

One proposal for future work is to seek more knowledge within the field of global optimization. Associate Professor Dr. E. M. T. Hendrix has long experience in the field of global optimization and uniform covering, and has given valuable input to this thesis. An additional algorithm that should be examined is the Topographical Multilevel Single Linkage algorithm, which combines the MSL algorithm with the Topographical Global Optimization algorithm of Professor Aimo Törn [25].

Bibliography

[1] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1-5):11–73, 1997.
[2] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least-squares algorithm. Advances in Neural Information Processing Systems, 11:375–381, 1999.
[3] I. O. Bohachevsky, M. E. Johnson, and M. L. Stein. Generalized simulated annealing for function optimization. Technometrics, 28(3):209–217, 1986.
[4] G. Cedersund. Core-box Modelling. PhD thesis, Department of Signals and Systems, Chalmers University of Technology, Göteborg, Sweden, 2006.
[5] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829–836, 1979.
[6] W. S. Cleveland and C. L. Loader. Smoothing by local regression: Principles and methods. In W. Härdle and M. G. Schimek, editors, Statistical Theory and Computational Aspects of Smoothing, pages 10–49. Springer, New York, 1996.
[7] C. Engström, editor. Nationalencyklopedin. Bra Böcker, 1995.
[8] T. G. W. Epperly and R. E. Swaney. Branch and bound for global NLP: Iterative LP algorithm and results, chapter 2. Kluwer Academic Publishers, 1996.
[9] C. Fraley and A. E. Raftery. How many clusters? Which clustering method? Answers via model-based cluster analysis.
The Computer Journal, 41(8):578–588, 1998.
[10] M. Gaviano and D. Lera. Test functions with variable attraction regions for global optimization problems. Journal of Global Optimization, 13(2):207–223, 1998.
[11] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, August 2001.
[12] E. M. T. Hendrix, P. M. Ortigosa, and I. García. On success rates for controlled random search. Journal of Global Optimization, 21(3):239–263, 2001.
[13] H. Kitano. Systems biology: A brief overview. Science, 295(5560):1662–1664, 2002.
[14] C. Kittel and H. Kroemer. Thermal Physics. W. H. Freeman and Company, August 2002.
[15] L. Ljung. System Identification: Theory for the User (2nd ed.). Prentice Hall PTR, Upper Saddle River, NJ, USA, 1999.
[16] L. Ljung and T. Glad, editors. Modellbygge och Simulering. Studentlitteratur AB, 2003.
[17] M. Locatelli. Relaxing the assumptions of the multilevel single linkage algorithm. Journal of Global Optimization, 13:25–42, 1997.
[18] J. H. Mathews and K. D. Fink. Numerical Methods Using MATLAB. Simon & Schuster, 2004.
[19] C. G. Moles, P. Mendes, and J. R. Banga. Parameter estimation in biochemical pathways: A comparison of global optimization methods. Genome Research, 13:2467–2474, 2003.
[20] J. J. Moré, B. S. Garbow, and K. E. Hillstrom. Testing unconstrained optimization software. ACM Transactions on Mathematical Software, 7(1):17–41, 1981.
[21] M. Nakhkash and M. T. C. Fang. Application of the multilevel single-linkage to one-dimensional electromagnetic inverse scattering problem. IEEE Transactions on Antennas and Propagation, 47(11):1658–1668, 1999.
[22] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing (2nd ed.). Cambridge University Press, New York, NY, USA, 1992.
[23] S. Roberts. Parametric and non-parametric unsupervised cluster analysis. Pattern Recognition, 30:261–272, 1996.
[24] H. Schmidt and M. Jirstrand.
Systems biology toolbox for MATLAB: A computational platform for research in systems biology. Bioinformatics, 22(4):514–515, 2006.
[25] A. Törn and S. Viitanen. Topographical global optimization. Princeton University Press, Princeton, NJ, USA, 1992.
[26] K. Tsai and F. Wang. Evolutionary optimization with data collocation for reverse engineering of biological networks. Bioinformatics, 21(7):1180–1188, 2005.
[27] A. Törn and A. Žilinskas. Global Optimization. Lecture Notes in Computer Science. Springer-Verlag, 1989.
[28] E. Ulfheilm. Modeling of Metabolic Insulin Signaling in Adipocytes. MSc thesis, Department of Electrical Engineering, Linköping University, Linköping, Sweden, 2006.
[29] J. J. Verbeek, N. Vlassis, and B. J. A. Kröse. Efficient greedy learning of Gaussian mixture models. Neural Computation, 15(2):469–485, 2003.
[30] S. Vijayakumar, A. D'Souza, and S. Schaal. Incremental online learning in high dimensions. Neural Computation, 17(12):2602–2634, 2005.
