Mining continuous classes using evolutionary computing

© University of Pretoria

Data mining is the term given to knowledge discovery paradigms that attempt to infer knowledge, in the form of rules, from structured data using machine learning algorithms. Specifically, data mining attempts to infer rules that are accurate, crisp, comprehensible and interesting. There are not many data mining algorithms for mining continuous classes. This thesis develops a new approach for mining continuous classes. The approach is based on a genetic program, which utilises an efficient genetic algorithm approach to evolve the non-linear regressions described by the leaf nodes of individuals in the genetic program's population. The approach also optimises the learning process by using an efficient, fast data clustering algorithm to reduce the training pattern search space. Experimental results from both algorithms are compared with results obtained from a neural network. The experimental results of the genetic program are also compared against a commercial data mining package (Cubist). These results indicate that the genetic algorithm technique is substantially faster than the neural network, and produces comparable accuracy. The genetic program produces substantially less complex rules than those of both the neural network and Cubist.

Thesis supervisor: Prof. A. P. Engelbrecht
Department of Computer Science
Degree: Magister Scientiae

I would like to thank the following people for their assistance during the production of this thesis:
• Andrew du Toit, Andrew Cooks and Jacques van Greunen, UP Techteam members, for maintaining the computer infrastructure used to perform my research;
• Frans van den Bergh and Edwin Peer, who listened patiently when I discussed some of my ideas with them, for their feedback and insight.
Felix qui potuit rerum cognoscere causas
Happy is he who has been able to understand the cause of things

Contents

1 INTRODUCTION
2 BACKGROUND
  2.1 KNOWLEDGE DISCOVERY
    2.1.1 Introduction
    2.1.2 Definitions
    2.1.3 Past usage
  2.2 EVOLUTIONARY COMPUTING
    2.2.1 Introduction
    2.2.2 Definitions
    2.2.3 Paradigms
    2.2.4 Performance issues
    2.2.5 Past usage
  2.3 ARTIFICIAL NEURAL NETWORKS
    2.3.1 Introduction
    2.3.2 Definitions
    2.3.3 Performance issues
    2.3.4 Past usage
  2.4 CLUSTERING
    2.4.1 Introduction
    2.4.2 K-means clustering
    2.4.3 Learning vector quantisers
    2.4.4 Self-organising maps
    2.4.5 Split-and-merge
  2.5 CONCLUSION
3 THE GASOPE METHOD
  3.1 INTRODUCTION
  3.2 THEORY OF FUNCTION APPROXIMATION
    3.2.1 Discrete least squares approximation
    3.2.2 Regression
    3.2.3 Taylor polynomials
    3.2.4 Lagrange polynomials
    3.2.5 Selecting the correct approximating polynomial order
    3.2.6 Artificial neural networks
    3.2.7 Evolutionary computing
  3.3 GASOPE STRUCTURE
    3.3.1 K-means clustering
    3.3.2 Genetic algorithm for function approximation
    3.3.3 Hall-of-fame
  3.4 EXPERIMENTAL RESULTS
    3.4.1 Functions
    3.4.2 Experimental procedure
    3.4.3 Results
  3.5 CONCLUSION
4 THE GPMCC METHOD
  4.1 INTRODUCTION
  4.2 BACKGROUND
    4.2.1 Artificial neural networks
    4.2.2 Genetic programming
  4.3 GPMCC STRUCTURE
    4.3.1 Overview
    4.3.2 Iterative learning strategy
    4.3.3 The fragment pool
    4.3.4 Genetic program for model tree induction
  4.4 EXPERIMENTAL RESULTS
    4.4.1 Datasets
    4.4.2 Parameter influence
    4.4.3 Method comparison
    4.4.4 Rule quality
  4.5 CONCLUSION
5 CONCLUSION
  5.1 SUMMARY
  5.2 FUTURE RESEARCH
List of Figures

2.1 Decision tree: The lazy student's guide to swimming pool additives
2.2 Model tree: The lazy student's guide to filling up the car
2.3 Illustration of crossover operators for binary strings
2.4 Illustration of mutation operators for binary strings
2.5 Feed-forward artificial neural network illustration
2.6 Self-organising map architecture
2.7 Split-and-merge illustration, $E \le 1$
3.1 K-means output for $y = \sin(x) + U(-1, 1)$, $x \in [0, 2\pi]$
3.2 Illustration of GASOPE chromosome initialisation for an individual $I_\omega$
3.3 Illustration of the GASOPE shrink operator for an individual $I_\omega$
3.4 Illustration of the GASOPE expand operator for an individual $I_\omega$
3.5 Illustration of the GASOPE perturb operator for an individual $I_\omega$
3.6 Illustration of the GASOPE crossover operator for individuals $I_\alpha$, $I_\beta$ and $I_\gamma$
3.7 Function f1 actual vs. GASOPE and NN predicted
3.8 Function f2 actual vs. GASOPE and NN predicted
3.9 Function f3 actual vs. GASOPE and NN predicted
3.10 Henon map
3.11 Rossler attractor
3.12 Rossler attractor: x component
3.13 Rossler attractor: y component
3.14 Rossler attractor: z component
3.15 Lorenz attractor
3.16 Lorenz attractor: x component
3.17 Lorenz attractor: y component
3.18 Lorenz attractor: z component
4.1 An arbitrary graph
4.2 A 3-piece linear approximation of the hidden unit activation function tanh(net) given 20 training samples (o)
4.3 An example chromosome for a regression problem
4.4 Overview of the GPMCC learning process
4.5 Illustration of GPMCC chromosome for an individual $I_x$
4.6 Illustration of the expand-worst-terminal-node operator for an individual $I_x$
4.7 Illustration of the shrink operator for an individual $I_x$
4.8 Illustration of the perturb-worst-non-terminal-node operator for an individual $I_x$
4.9 Illustration of the perturb-worst-terminal-node operator for an individual $I_x$
4.10 Illustration of the crossover operator for individuals $I_\alpha$, $I_\beta$ and $I_\gamma$

List of Tables

3.1 Function definitions
3.2 GASOPE initialisation
3.3 GASOPE vs. NN pattern presentations
3.4 Comparison of GASOPE and NN on noiseless data
3.5 Comparison of GASOPE and NN on noisy data
4.1 Databases obtained from the UCI machine learning repository
4.2 GPMCC initialisation parameters
4.3 GPMCC method: Fragment lifetime 5
4.4 GPMCC method: Fragment lifetime 50
4.5 GPMCC method: Fragment lifetime 100
4.6 GPMCC method: Fragment lifetime 200
4.7 GPMCC method: Fragment lifetime 400
4.8 GPMCC method: Leaf optimisation rate 0.05
4.9 GPMCC method: Leaf optimisation rate 0.1
4.10 GPMCC method: Leaf optimisation rate 0.2
4.11 GPMCC method: Leaf optimisation rate 0.4
4.12 GPMCC method: Initial window 0.05, window acceleration 0.005
4.13 GPMCC method: Initial window 0.05, window acceleration 0.01
4.14 GPMCC method: Initial window 0.05, window acceleration 0.02
4.15 GPMCC method: Initial window 0.05, window acceleration 0.04
4.16 GPMCC method: Initial window 0.1, window acceleration 0.005
4.17 GPMCC method: Initial window 0.1, window acceleration 0.01
4.18 GPMCC method: Initial window 0.1, window acceleration 0.02
4.19 GPMCC method: Initial window 0.1, window acceleration 0.04
4.20 GPMCC method: Initial window 1, window acceleration 0
4.21 GPMCC method: Initial clusters 100, split factor 2 (196 initial fragments)
4.22 GPMCC method: Initial clusters 100, split factor 3 (143 initial fragments)
4.23 GPMCC method: Initial clusters 30, split factor 2 (55 initial fragments)
4.24 GPMCC method: Initial clusters 30, split factor 3 (43 initial fragments)
4.25 GPMCC method: Initial clusters 20, split factor 2 (37 initial fragments)
4.26 GPMCC method: Initial clusters 20, split factor 3 (28 initial fragments)
4.27 GPMCC method: Initial clusters 10, split factor 2 (17 initial fragments)
4.28 GPMCC method: Initial clusters 10, split factor 3 (13 initial fragments)
4.29 Changes in GPMCC initialisation parameters from table 4.2
4.30 Comparison of Cubist, GPMCC and NeuroLinear
4.31 Comparison of Cubist and GPMCC
A.1 Table of symbols
A.2 Table of symbols (cont.)

Chapter 1 INTRODUCTION

Knowledge discovery is the process of obtaining useful knowledge from raw data or facts. Knowledge can be inferred from data by a computer using a variety of machine learning paradigms. Data mining is the generic term given to knowledge discovery paradigms that attempt to infer knowledge in the form of rules from structured data using machine learning.

In an article in the Boston Sunday Globe of August 11, 2002, Vest identified the biotechnology industry as the largest growing industry in Boston [97]. This follows the successful conclusion of the human genome project. The result of the human genome project is essentially a catalogue and guide to the structure of human DNA. Yet, very little is known about the effects of individual genes in DNA. In order to decipher and analyse the vast amount of information stored in DNA, data mining will have to be performed on a massive scale. As an industry, data mining and data warehousing threaten to replace e-commerce as the major information technology driving force of the 21st century.
Recent events in the media also illustrate the need for data mining:
• The September 11, 2001 attacks on the World Trade Centre in New York highlighted major deficiencies in the method of information gathering used by US agencies such as the FBI and the CIA. Data mining could be used as an effective tool to combat criminal and terrorist activities, by tracing the activities, communications and purchases of individuals and identifying possible suspects.
• The recent scandals involving CEOs of large multi-national corporations such as Enron, Worldcom, etc., illustrate the need for the development of data mining methods in order [...]
• The recent world-wide recession and collapse of the airline industry illustrate the need for companies to improve their market share as a means of survival. Access to accurate and reliable information is crucial for decision making. Data mining could be used to provide a competitive edge over other companies.

Thus, knowledge discovery in the form of data mining is becoming increasingly important in today's world. However, in order to perform some of the mammoth tasks described above, it is necessary to develop fast, efficient knowledge discovery algorithms.

Knowledge discovery algorithms can be divided into two main categories according to their learning strategies:
• Supervised learning algorithms attempt to minimise the error between their predicted outputs and the target outputs of a given dataset. The target outputs can either be
  - discrete, i.e. the supervised learning algorithm attempts to predict the class of a problem, e.g. whether it will be sunny, rainy or overcast in tomorrow's forecast,
  - or continuous, i.e. the supervised learning algorithm attempts to predict the value associated with a class, e.g. determining the price of a VCR.
• Unsupervised learning algorithms attempt to cluster a dataset into homogeneous regions, according to some characteristic present in the data.
Many knowledge discovery algorithms have been developed which utilise machine learning and artificial intelligence paradigms. The main classes of paradigms include: artificial neural networks [13][106], classification systems (like ID3 [81], CN2 [20][21] and C4.5 [83]), evolutionary computing [7][8][49], regression systems (like M5 [82]) etc.

One of the primary problems with current data mining algorithms is the scaling of these algorithms for use on large databases. Additionally, very little attention has been paid to algorithms for mining with continuous target outputs, which requires non-linear regression.

This thesis presents and fully discusses a number of evolutionary computing algorithms suitable for non-linear regression. The primary algorithm developed by this thesis is a genetic program for the mining of continuous classes (GPMCC), which utilises a genetic algorithm that evolves structurally optimal polynomial expressions (GASOPE) as the non-linear regressions for the terminal nodes of the individuals in the genetic program. Additionally, a number of algorithms are developed which optimise the learning process. These algorithms utilise a fast, efficient data clustering algorithm. Methods for coping with large databases are also covered by this thesis.

The remainder of this thesis is organised as follows: Chapter 2 provides a taxonomy of data mining methods in use today. It also covers many of the principles used throughout this thesis. Chapter 3 presents the GASOPE method and details other methods employed by or related to the GASOPE method. Chapter 4 presents the GPMCC method, fully discusses the GPMCC method's interaction with the GASOPE method and provides an overview of other associated methods. Finally, chapter 5 concludes this thesis.

Chapter 2 BACKGROUND

This chapter provides a taxonomy of the methods employed by many data mining applications in use today. An overview of the methods utilised in later sections is also presented.
The paradigms of knowledge discovery, evolutionary computing, neural networks and clustering are discussed in broad terms. Specific methods are discussed in detail as and when they are required by later chapters.

2.1 KNOWLEDGE DISCOVERY

This section discusses knowledge discovery and, more specifically, knowledge discovery through machine learning. Section 2.1.1 introduces the reader to the fundamentals of knowledge discovery. Some of the more common types of rule induction algorithms are discussed in section 2.1.2. Finally, section 2.1.3 provides examples of the past usage of various types of rule induction algorithms and is not confined to the induction algorithms presented in section 2.1.2.

2.1.1 Introduction

Knowledge discovery is the process of obtaining useful knowledge from raw data or facts. Such data can be structured, e.g. relational databases, tables and spreadsheets, or unstructured, e.g. text, video and audio. Knowledge discovery applied to structured data is termed data mining and knowledge discovery applied to unstructured data is usually termed text mining.

Knowledge can be defined, for the purpose of this thesis, as an organised body of information. Such an organised body of information can be represented by rules of the form

IF antecedent THEN consequent

where the antecedent describes a test on the state of a set of attributes and the consequent describes a response to that state. The goal of knowledge discovery is thus to infer rules from data that are accurate, crisp, comprehensible and interesting.

One way in which knowledge can be inferred from data is through machine learning, by an algorithmic process known as empirical learning. Empirical learning includes algorithms that reason from supplied examples in order to produce general theories about a specific problem domain, which is characterised by a dataset. These general theories are used to make predictions about further points in the problem domain, through both extrapolation and interpolation.
Examples of empirical learning algorithms include artificial neural networks (discussed in section 2.3), evolutionary computing (discussed in section 2.2) and the large variety of rule induction algorithms presented later in this section.

If training examples are supplied with known labels or targets, the empirical learning strategy is called supervised learning. Otherwise, the empirical learning strategy is known as unsupervised learning. Supervised learning can solve two types of problems: classification problems, where the training example labels are categorical and are called classes, and regression problems, where the training example labels are continuous.

2.1.2 Definitions

This section provides definitions of terms used throughout this thesis. A number of different rule induction algorithms are also discussed. The rule induction algorithms are divided into two categories according to the types of problems they solve:
• Classification systems, which classify training examples according to their discrete target outputs.
• Regression systems, which predict the numeric value associated with a class, i.e. regression systems predict the continuous target outputs of training examples.

A decision tree is one representation of an organised body of information. A decision tree can recursively be defined as either a leaf node (terminal node) that names a class, or as an internal node (non-terminal node) that represents an attribute-based test, with a branch to another decision tree for each outcome of that test. A node that is linked to a node higher up in the hierarchy is called a child node. The sequence of nodes from the root of the tree to a leaf node is called a path. All paths of a decision tree are mutually exclusive and exhaustive, i.e. all regions defined by all paths completely cover the instance space (or pattern space). An instance space is an n-dimensional attribute space where each instance describes a point in that space.
The size of a decision tree is the number of nodes in it, including the leaf nodes. Each attribute in a decision tree can either be nominal, ordinal or continuous. An attribute test has an attribute, relational operator and threshold. An example is classified using these attribute tests by moving the example down from the root of the decision tree, along the path that covers the example, i.e. all the attribute tests along the path of the decision tree are true for that example. When an example reaches a leaf, it is asserted to belong to the class labelled by that leaf. Figure 2.1 illustrates an example of a decision tree.

A production rule consists of an antecedent and a consequent. The antecedent is formed by a conjunction of conditions. For zero-order classifier learning, which represents a number of axis-orthogonal splits in the attribute space, these conditions take the form of $(A \le v)$, $(A > v)$, $(A = v)$ or $(A \in \{v_1, \ldots, v_n\})$, where $A$ represents an attribute and $v, v_1, \ldots, v_n$ are possible values of $A$. The consequent, in the case of classification systems, names a class.

An alternative representation of an organised body of information is a production system [77]. A production system represents an ordered list of production rules. An example is classified using these rules by moving the example from the top to the bottom of the production system, comparing the attributes of the example against the antecedents of each rule. If an example is covered by the antecedent of a production rule it is asserted to belong to the class labelled by the consequent. Table 2.1 shows the production system for the example of figure 2.1.
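The top-to-bottom matching of a production system described above can be sketched in a few lines of Python. This is only an illustration, not code from the thesis: the names `Rule`, `RULES` and `classify` are hypothetical, and the rule list is loosely modelled on the swimming pool example of figure 2.1, with the pairing of conditions and additives being an assumption of this sketch.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Example = Dict[str, Any]
# A production rule: (antecedent predicate, consequent class label).
Rule = Tuple[Callable[[Example], bool], str]

# Hypothetical ordered rule list in the spirit of the swimming pool example.
# The exact pairing of conditions and additives is an assumption.
RULES: List[Rule] = [
    (lambda e: e["Colour"] == "green", "chlorine"),
    (lambda e: e["Colour"] == "clear" and e["pH"] < 7, "chlorine"),
    (lambda e: e["Colour"] == "clear" and e["pH"] == 7, "nothing"),
    (lambda e: e["Colour"] == "clear" and e["pH"] > 7, "acid"),
    (lambda e: e["Colour"] == "milky" and e["Debris"] == "yes", "nothing"),
    (lambda e: e["Colour"] == "milky" and e["Debris"] == "no", "acid"),
]


def classify(example: Example, rules: List[Rule]) -> Optional[str]:
    """Return the consequent of the first rule whose antecedent covers the example."""
    for antecedent, consequent in rules:
        if antecedent(example):
            return consequent
    return None  # no rule covers the example


if __name__ == "__main__":
    pool = {"Colour": "clear", "pH": 8, "Debris": "no"}
    print(classify(pool, RULES))
```

Because the rules are scanned in order and the first match wins, the ordering of the list matters whenever antecedents overlap; in this sketch the antecedents happen to be mutually exclusive.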
Table 2.1: Production system for the decision tree of figure 2.1

IF (Colour = {green}) THEN chlorine
IF (Colour = {clear}) $\wedge$ (pH < 7) THEN chlorine
IF (Colour = {clear}) $\wedge$ (pH = 7) THEN nothing
IF (Colour = {clear}) $\wedge$ (pH > 7) THEN acid
IF (Colour = {milky}) $\wedge$ (Debris = yes) THEN nothing
IF (Colour = {milky}) $\wedge$ (Debris = no) THEN acid

Two induction strategies are used to generate rules for classification systems: selective induction and constructive induction. These two approaches are described next.

Selective induction induces rules from a training set by selecting attributes from the supplied training examples, upon which the training set is split. Essentially, these attributes partition the search space into regions that have the same class membership. In the case of decision trees, the induction algorithm utilises axis-orthogonal splits, each of which is based on a single attribute. If the supplied attributes are appropriate for representing target theories, selective induction algorithms perform well in terms of prediction accuracy and theory complexity. Some popular examples of selective induction algorithms include ID3 [81], C4.5 [83] and CN2 [21][20], and are summarised below:

The ID3 algorithm uses decision trees to generate rules. ID3 generates these decision trees using a divide-and-conquer approach. The algorithm recursively splits a training set, $P$, if the training set labels consist of heterogeneous classes. The heuristic utilised to decide on a test, upon which to split the training set, is called the gain criterion, which is based on information theory. The gain criterion is defined as follows:

\[
\begin{aligned}
\mathit{info}(P) &= -\sum_{\psi=1}^{c} \frac{\mathit{freq}(C_\psi, Q)}{|Q|} \cdot \log_2\left(\frac{\mathit{freq}(C_\psi, Q)}{|Q|}\right) \\
\mathit{info}_X(P) &= \sum_{\phi=1}^{o} \frac{|P_\phi|}{|P|} \cdot \mathit{info}(P_\phi) \\
\mathit{gain}(X) &= \mathit{info}(P) - \mathit{info}_X(P)
\end{aligned}
\tag{2.1}
\]

where $o$ is the number of outcomes of test $X$, $Q$ is a set of cases belonging to some class $C_\psi$ and $c$ is the number of classes in the domain. The information content of the domain before splitting occurs is calculated by $\mathit{info}(P)$.
The sum of the weighted information content of each outcome after splitting occurs is calculated by $\mathit{info}_X(P)$. The objective of the ID3 algorithm is to maximise the gain criterion $\mathit{gain}(X)$, i.e. to maximise the decrease in entropy of performing a split. ID3 assumes clean data and is therefore unable to deal with outliers or missing values. The trees generated by ID3 completely cover the training set. Thus, ID3 is particularly susceptible to over-fitting.

The C4.5 algorithm (latest version C5.0 [84]) is based on ID3, but makes no assumptions about the purity of the data. The C4.5 algorithm uses decision trees and heuristics to generate simplified, comprehensible production rules. C4.5, like ID3, generates these decision trees using a divide-and-conquer approach. The heuristic utilised to perform the divide-and-conquer approach is similar to ID3's gain criterion, and is called the gain ratio criterion. The gain ratio criterion is defined as follows:

\[
\begin{aligned}
\mathit{split\_info}(X) &= -\sum_{\phi=1}^{o} \frac{|P_\phi|}{|P|} \cdot \log_2\left(\frac{|P_\phi|}{|P|}\right) \\
\mathit{gain\_ratio}(X) &= \mathit{gain}(X) \, / \, \mathit{split\_info}(X)
\end{aligned}
\tag{2.2}
\]

where $\mathit{gain}(X)$ is obtained using equation (2.1) and all variables are defined as above. The maximum information content of each outcome after splitting occurs is calculated by $\mathit{split\_info}(X)$. The objective of C4.5 is to maximise the gain ratio criterion $\mathit{gain\_ratio}(X)$, i.e. to maximise the relative decrease in entropy. The gain ratio criterion prevents the C4.5 algorithm from biasing toward tests with a large number of outcomes.

ID3 and C4.5 inherently apply to discrete data. However, in the case of continuous-valued attributes the training examples are sorted (according to the desired attribute) and a threshold is chosen that lies between any two adjacent training examples (usually the midpoint between the two examples). This threshold then acts to split the data into two subsets, according to the test $A_i < \frac{a_{i,1} + a_{i,2}}{2}$.
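As a rough illustration of equations (2.1) and (2.2), the following Python sketch computes the information content, gain and gain ratio of a candidate split, together with the midpoint thresholds used for a continuous attribute. The function names are hypothetical, class frequencies are taken over the set being measured, and the handling of a zero split information (returning 0) is an assumption of this sketch rather than part of ID3 or C4.5 as published.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def info(labels: Sequence[Hashable]) -> float:
    """Entropy (in bits) of the class labels in a set of cases."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def info_x(partitions: Sequence[Sequence[Hashable]]) -> float:
    """Weighted information content of the subsets produced by a test."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * info(p) for p in partitions if p)


def gain(labels: Sequence[Hashable], partitions: Sequence[Sequence[Hashable]]) -> float:
    """Gain criterion: decrease in entropy obtained by the split."""
    return info(labels) - info_x(partitions)


def split_info(partitions: Sequence[Sequence[Hashable]]) -> float:
    """Information content of the split itself, ignoring class labels."""
    total = sum(len(p) for p in partitions)
    return -sum(len(p) / total * math.log2(len(p) / total) for p in partitions if p)


def gain_ratio(labels: Sequence[Hashable], partitions: Sequence[Sequence[Hashable]]) -> float:
    """Gain ratio criterion; returns 0.0 for a degenerate split (assumption)."""
    si = split_info(partitions)
    return gain(labels, partitions) / si if si > 0 else 0.0


def midpoint_thresholds(values: Sequence[float]) -> List[float]:
    """Candidate thresholds halfway between adjacent distinct attribute values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


if __name__ == "__main__":
    labels = ["sunny", "sunny", "rainy", "rainy"]
    split = [["sunny", "sunny"], ["rainy", "rainy"]]
    print(gain(labels, split), gain_ratio(labels, split))
    print(midpoint_thresholds([1.0, 2.0, 4.0]))
```

For an attribute with n distinct continuous values, `midpoint_thresholds` yields n - 1 candidates, each of which can be scored with `gain_ratio` by partitioning the labels on either side of the threshold.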
If an attribute has $n$ continuous values in the domain, $n - 1$ splits are tested using equation (2.2) and the split with the largest gain ratio criterion is selected.

C4.5 provides a windowing strategy to reduce the number of training examples presented to the algorithm. The windowing strategy initially selects a random subset of training examples (the window) from which an initial tree is built. This tree is used to classify training examples not included in the window. The training examples that are misclassified by the initial tree are then added to the window, from which another tree is built. Quinlan's initial reason for utilising a windowing strategy was to attempt to reduce the time required to construct trees. After experimentation, however, he discovered that the windowing strategy also resulted in more accurate decision trees [83].

CN2 uses a beam search algorithm in order to find rules, and a control algorithm for repeatedly executing the search. The rules induced by CN2 are represented in the form of a production system. CN2 uses an entropy measure to decide on a set of conditions from which to induce a rule. Instead of using classification accuracy or information content as a measure of rule quality, CN2 uses the Laplace Error Estimate to test the significance of rules. The Laplace Error Estimate is defined as follows:

\[
\mathit{Laplace\_Accuracy} = \frac{|Q| + 1}{|R| + c}
\tag{2.3}
\]

where $c$ is the number of classes in the domain, $Q$ is a set of examples belonging to the class covered by the rule and $R$ is the set of examples covered by the rule. CN2 also uses significance testing in order to prune the induced rules.

Constructive induction algorithms consist of two steps. One step constructs new attributes, the other generates theories. Attribute construction can be visualised as the application of constructive operators, such as $\wedge$, $\vee$ and $\neg$, to the set of existing attributes, in order to reduce the attribute space of the problem domain.
The constructive operators thus provide a mechanism for mapping an $N$-dimensional attribute space into an $X$-dimensional attribute space such that $X \le N$. Theories are generated from the $X$-dimensional attribute space by using any of the previously mentioned selective induction algorithms. In essence, constructive induction attempts to provide a mechanism for generating complex orthogonal splits in the decision space. Zheng's X-of-N algorithm is an example of a constructive induction algorithm [105].

A regression tree is a decision tree that has numeric values at its terminal nodes. A model tree, on the other hand, is a decision tree that has multi-variate linear models at its terminal nodes. Both of these classifiers attempt to predict a numeric value associated with a class, rather than the class to which an example belongs. The objective of both model and regression trees is to perform a piecewise approximation of the instance space. As with decision trees, an example moves along a path toward a leaf and is compared with each attribute test along the way. Once the example reaches a leaf, the example is asserted to have the target output defined by the numeric value or the multi-variate linear model. Figure 2.2 illustrates an example of a model tree.

Compared with the large number of classification systems, very few regression systems are currently in existence. One of the few methods in existence is Quinlan's M5 algorithm [82]. M5 utilises model trees and is a descendant of C4.5. Like C4.5, M5 uses a heuristic in order to decide on a test. The heuristic is defined as follows:

\[
\Delta\mathit{error} = \mathit{std}(P) - \sum_{\phi=1}^{o} \frac{|P_\phi|}{|P|} \cdot \mathit{std}(P_\phi)
\]

where $P$, once again, represents the set of training cases, $o$ is the number of outcomes for a test, $\mathit{std}(P_\phi)$ represents the standard deviation of the target values of cases in $P_\phi$, and $P_\phi$ represents the set of patterns belonging to outcome $\phi$. After examining all tests, M5 chooses the test that maximises the expected error reduction. The multi-variate linear regressions at each leaf are constructed using standard regression techniques (such as the least squares method) [16].

The M5 algorithm also includes heuristics for smoothing and model tree pruning. Pruning is used to remove regressions that cover outliers. Smoothing is used to adjust the values of an appropriate leaf model to reflect the predicted values at nodes along the path from the root to that leaf. Smoothing can be visualised as a process of adjusting the output of models that have few training cases or that lie on either side of an orthogonal split, where the models have very different values.

2.1.3 Past usage

Many successful applications of knowledge discovery techniques exist in the literature. This section presents a small, interesting subset of these examples which are not constrained to include only the algorithms discussed in section 2.1.2.

Spertus used C4.5 in a system called "Smokey" to dispose of abusive emails (flame mail) [95]. Smokey built a 47-element feature vector based on the syntax and semantics of each sentence, combining the vectors for the sentences in each message. The system was able to correctly classify 64% of abusive emails and 98% of normal emails.

Daelemans et al. applied C4.5 to the problem of natural language processing [26]. Specifically, C4.5 was applied to the formation of diminutive forms in Dutch. C4.5 proved a useful tool in corroborating or falsifying existing linguistic theories.

Adomavicius and Tuzhilin used a system called "1:1Pro" to build customer profiles [2]. Specifically, the system mined the non-alcoholic beverage sales of a number of households. As anticipated, the system discovered that most rules pertained only to a small number of households, meaning that the system captured the idiosyncratic behaviour of individual households.
The system also discovered interesting seasonal behavioural characteristics of consumers.

2.2 EVOLUTIONARY COMPUTING

This section presents a brief introduction to evolutionary computing. Section 2.2.1 introduces the reader to evolutionary computing by drawing parallels with natural evolution. The evolutionary computing definitions used throughout this thesis are discussed in section 2.2.2. A number of popular evolutionary computing paradigms are presented in section 2.2.3. Section 2.2.4 discusses a number of performance issues relating to the use of evolutionary computing. Finally, section 2.2.5 presents some of the past uses of a number of evolutionary computing paradigms. Interested readers should consult the authoritative works of Goldberg [48] and Bäck et al. [7][8].

2.2.1 Introduction

Evolutionary computing simulates the Darwinian principle of natural selection. Darwin discovered that a certain species of birds (finches) native to the Galapagos islands differed from each other in terms of beak shape [27]. He also noted that the beak varieties were associated with diets based on different foods. He concluded that when the original South American finches reached the islands, they dispersed to different environments where they had to adapt to different conditions. Over many generations, they changed anatomically in ways that allowed them to get enough food and survive to reproduce. This illustrates one of the handful of ways that species of plant, animal and bird-life evolve over long periods of time. The main concept of natural selection is that the fittest individuals in a population survive and the weakest individuals perish.

2.2.2 Definitions

Evolutionary computing (EC) utilises a population of individuals, where each individual represents a candidate solution to the optimisation problem. The chromosome associated with each individual encodes the genotypes and phenotypes of that individual.
A genotype describes the genetic or factional constitution of an individual and thus provides a mechanism to store experiential evidence. A phenotype describes the observable characteristics of an individual produced by the interaction between the genes and the environment. A gene represents a characteristic of the individual. The value of a gene is referred to as an allele.

In EC, the design of a chromosome (or individual) is paramount. The efficiency and complexity of an EC search algorithm greatly depend on the chromosome representation scheme. Traditional EC methods encode the chromosome as a binary string. More modern methods encode the chromosome as any combination of available machine types, e.g. real numbers, integers, characters etc.

The fitness function is the most important component of any EC paradigm. The fitness function maps a chromosome representation into a floating-point value, which quantifies the quality of each individual. The quality of an individual can be defined as the distance between the individual and the optimal solution. The fitness function is used to guide the other EC operators such as selection, elitism, crossover and mutation. It is important that the fitness function accurately models the optimisation problem, i.e. the fitness function should adequately model the solution space to a problem. A fitness function that inaccurately models the optimisation problem could lead to sub-optimal solutions. Also, the fitness function should reflect all the criteria to be optimised, e.g. certain problems require the optimisation of both the output and the internal architecture of a solution. Constraints on the problem space can also be incorporated into the fitness function through penalisation of those solutions that violate the constraints. However, constraints can also be directly applied to the EC operators.

Before beginning the evolutionary process, an initial population of individuals must be generated.
A standard way of generating these individuals is to randomly set each gene in a chromosome. The goal of random selection is to uniformly represent the entire search space. If prior knowledge is available on the search space, heuristics can be used to bias the initial population toward potentially good solutions. This, however, may lead to premature convergence because the entire search space is not covered.

As was mentioned earlier, there are a number of other operators involved in an EC optimisation algorithm. Selection operators emphasise better solutions in a population. Many selection schemes were compared by Goldberg and Deb [49]. Random selection selects individuals with no reference to fitness at all. Proportional selection selects an individual proportionately according to its fitness value, using roulette wheel sampling. Tournament selection selects a group of n individuals to take part in a tournament and the best individual is selected. Rank-based selection uses the rank ordering of fitness values to determine the probability of selection and not the fitness itself.

Elitism involves the selection of a set of individuals from the current generation to survive to the next. The individuals that survive to the next generation can be selected using the previous selection operators or as the best individuals from the current generation.

Crossover operators model reproduction in nature. Superior individuals should have more opportunities to reproduce. An extreme example of this can be seen in nature in wolf packs, where only the alpha male and the alpha female have pups, because the alpha male and alpha female actively terminate pups produced by other individuals. Each generation is thus strongly influenced by the genes of the fitter individuals. Crossover is achieved by combining genetic material from a number of parents (usually two) to create a new individual.
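The proportional, tournament and rank-based selection schemes described above can be illustrated with a minimal Python sketch. The function names, the use of plain lists, and the assumption that fitness values are non-negative with larger values being better are illustrative choices only and are not taken from the thesis.

```python
import random

def proportional_selection(population, fitness, rng=random):
    """Roulette wheel sampling: the chance of selection is proportional to fitness.
    Assumes non-negative fitness values where larger means better."""
    total = sum(fitness)
    if total == 0:
        return rng.choice(population)  # no fitness information: pick at random
    spin = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitness):
        running += fit
        if spin < running:
            return individual
    return population[-1]  # guard against floating-point round-off

def tournament_selection(population, fitness, n, rng=random):
    """Pick n distinct contestants at random and return the fittest of them."""
    contestants = rng.sample(range(len(population)), n)
    best = max(contestants, key=lambda idx: fitness[idx])
    return population[best]

def rank_based_selection(population, fitness, rng=random):
    """Selection probability depends on the rank of an individual, not on the
    fitness value itself (linear ranking is an assumption made here)."""
    order = sorted(range(len(population)), key=lambda idx: fitness[idx])
    ranks = [0] * len(population)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank  # the worst individual has rank 1, the best rank len(population)
    return rng.choices(population, weights=ranks, k=1)[0]
```

A tournament that includes the whole population always returns the fittest individual, whereas small tournaments leave room for weaker individuals to be selected.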
A number of crossover strategies exist for binary string representations and are also applicable to other representations. Uniform crossover begins by creating a random binary mask. This mask is applied to both parents in order to generate a new individual, where $a$, $b$, $c$ and $m$ respectively represent the vectors of parent A, parent B, the new individual C and the mask. One-point crossover selects a random bit position. All bits in the first parent before the bit position and all bits in the second parent after the bit position are placed in the target chromosome. Two-point crossover is similar to one-point crossover, except that two bit positions are chosen. The three crossover strategies mentioned above are shown in figure 2.3.

Mutation operators serve to expand the search space of the EC optimisation algorithm by injecting new information into the search space. Mutation is thought to occur in nature due to cosmic and other types of radiation, which damages the molecules found in DNA [87]. Mutation in EC is performed by randomly injecting new genetic material into an individual, thus damaging the genes of the chromosome. Mutation is performed on an individual C by randomly adjusting a number of the genes in the individual's chromosome. Two popular mutation strategies for binary string representations are random mutation, which randomly selects a number of bits and negates them, and inorder mutation, which selects pairs of bit positions and randomly negates bits between pairs of bit positions. The two mutation strategies mentioned above are shown in figure 2.4.

There are many evolutionary computing paradigms. This section overviews some of the most important ones.

Genetic algorithms (GAs) model genetic evolution. The original GAs, introduced by Holland, used a bit string representation, proportional selection and crossover as the primary method to produce more individuals [55].
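The crossover and mutation strategies for binary strings described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis's implementation: chromosomes are lists of 0/1 integers, a mask bit of 1 copies the gene from parent A, the middle segment in two-point crossover comes from parent B, and inorder mutation is shown for a single pair of positions with a negation probability of 0.5.

```python
import random

def uniform_crossover(a, b, rng=random):
    """Build a random binary mask and take each gene from A or B accordingly."""
    mask = [rng.randint(0, 1) for _ in a]
    # Assumption: mask bit 1 selects parent A, mask bit 0 selects parent B.
    return [x if m == 1 else y for x, y, m in zip(a, b, mask)]

def one_point_crossover(a, b, rng=random):
    """Bits before the crossover point come from A, the rest from B."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:]

def two_point_crossover(a, b, rng=random):
    """As one-point crossover, but with two points (needs at least 3 genes).
    Assumption: the segment between the points is taken from B."""
    first, second = sorted(rng.sample(range(1, len(a)), 2))
    return a[:first] + b[first:second] + a[second:]

def random_mutation(c, n_bits, rng=random):
    """Negate n_bits randomly chosen bits."""
    child = list(c)
    for pos in rng.sample(range(len(child)), n_bits):
        child[pos] = 1 - child[pos]
    return child

def inorder_mutation(c, rng=random):
    """Choose a pair of positions and randomly negate the bits between them."""
    first, second = sorted(rng.sample(range(len(c) + 1), 2))
    child = list(c)
    for pos in range(first, second):
        if rng.random() < 0.5:  # assumed negation probability
            child[pos] = 1 - child[pos]
    return child
```

Using an all-zero parent A and an all-one parent B makes the origin of every gene in the offspring visible, which is a convenient way to check each operator.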
A number of changes have been made to the original GA including representation, selection, elitism, mutation and crossover changes.

Island genetic algorithms model the migration of individuals to other subpopulations/islands [18]. Essentially, each island represents a subpopulation of the entire population of individuals. Individuals compete for survival on each of these subpopulations, but are also allowed to migrate to other subpopulations. The subpopulations can also represent subsets of the search space or different optimisation criteria.

Genetic programming (GP), introduced by Koza [68], is a specialisation of GAs that utilises a tree-based chromosome representation scheme. This tree-based representation can be used to represent programs or expressions. Each individual in a GP population, therefore, represents a possible program or expression, which is an element of the program space formed from all possible programs that can be created from a given grammar.

Evolutionary programming, introduced by Fogel [39], differs from the GA and GP paradigms, in that evolutionary programming models phenotypic evolution. The evolutionary process attempts to find a set of optimal behaviours from the space of observable behaviours.

Evolutionary strategies, introduced by Rechenberg [85], model the evolution of evolution. The evolutionary process attempts to evolve both the genetic characteristics of an individual as well as a set of strategy parameters. The strategy parameters control the evolutionary characteristics of the individuals.

Coevolution models the complementary evolution of closely related species [7]. Two coevolutionary processes are symbiotic relationships, where two species co-operate for their mutual benefit, and predator-prey relationships, where each species attempts to out-perform the other in their environment.
Coevolution does not define optimality through the use of a fitness function, but defines optimality as the overall success of one population over another, i.e. a relative fitness. Coevolution has mostly been applied to two-agent games, where the objective is to evolve a game strategy.

Cultural evolution, introduced by Reynolds [86], models societal evolution. Two search spaces are maintained by cultural evolution: the population space and the belief space. The belief space models the cultural behaviours of a population. An acceptance function is used to determine which individuals have an influence on the current beliefs. The belief space is then adjusted with the experiential knowledge of influential individuals. The belief space is then used to influence the individuals in population space.

The representation of a chromosome or solution in an evolutionary computing (EC) algorithm has implications for the performance of that algorithm. Binary coding, although frequently used, introduces Hamming cliffs into the search space. A Hamming cliff is formed when two numerically adjacent values have bit representations that lie far apart. This represents a problem when a small change in variables should result in a small change in fitness. An alternative representation is Gray coding, where the Hamming distance between successive numerical values is 1.

Tree-based representation schemes, as found in genetic programs, can lead to discontinuous or inefficient results, e.g. if a grammar contains the natural logarithm function, values in the domain $(-\infty, 0]$ will have no meaning. Alternatively, if a grammar consists of boolean connectives, expressions such as $\neg\neg a$ will not be unlikely. Additional semantic tests need to be included to ensure the semantic correctness of each individual tree.

Mutation operators inject new genetic material into a population. However, Hinterding et al.
argued that mutation operators are more than just background operators and are crucial to the success of any EC algorithm [53]. Mutation operators should, in fact, utilise domain specific knowledge. Domain specific knowledge is any previously discovered knowledge that pertains to the specific optimisation problem of the EC, e.g. if the optimisation problem is the design of wings for a fixed wing aircraft, domain specific knowledge would include the known strengths of various compounds from which a wing can be manufactured. The argument for the use of domain specific knowledge is supported by the No Free Lunch theorem of Wolpert and Macready [103], which states that "... the average performance of any pair of algorithms across all possible problems is exactly identical." "... if some algorithm $a_1$'s performance is superior to that of another algorithm $a_2$ over some set of optimisation problems, then the reverse must be true over the set of all other optimisation problems." From the above example, if an algorithm has specifically been engineered for the design of wings for a fixed wing aircraft, the algorithm may not necessarily perform well at optimising the architecture of an artificial neural network. However, because the algorithm was only engineered to solve the wing design problem, it is not expected to also optimise artificial neural network architectures.

The mutation rate of an EC algorithm directly controls the injection rate of new material into the population and therefore controls how rapidly the search space is broadened. A large mutation rate causes the algorithm to behave more like a random search, because experiential knowledge gained from reproduction is lost. When the experiential knowledge is lost, the algorithm continually searches through parts of the search space it has already visited, leading to increased training time.
A small mutation rate can cause the algorithm to stagnate in a local minimum, because all of the individuals in a population become homogeneous. Stagnation thus results in poor training accuracy and sub-optimal convergence. The choice of the initial population size has implications for the performance of an EC algorithm. Large populations may be more computationally complex than small populations, but they cover more of the search space. Elitism can ensure that an EC algorithm retains good genetic material. However, the elite group can lead to stagnation, because the elite individuals are more likely to be involved in crossover. If an elite group survives for many generations, the population will become homogeneous, leading to sub-optimal convergence. The crossover rate, on the other hand, controls how rapidly the search space is reduced. A large crossover rate leads to a large number of individuals being involved in reproduction. This can cause slower convergence, because increases in fitness in one individual take a long time to filter through to other individuals in a population. A small crossover rate, on the other hand, leads to a small group of individuals being involved in reproduction. This can cause the algorithm to stagnate in a local minimum, because, as with a low mutation rate, all the individuals in a population rapidly become homogeneous. Additionally, the population size, generation gap (elitism), mutation rate, and crossover rate all exhibit some or other level of dependence on one another and on the problem domain. The correct parameter choices are thus fairly problem specific. Evolutionary computing paradigms are increasingly being applied to situations that are computationally complex or situations where humans have very little prior knowledge of the problem domain. Examples include routing, network optimisation, design, game strategies etc. This section overviews just some of the applications. 
Biles used a genetic algorithm for generating improvised jazz solos [12]. The method primarily received an input chord progression, as well as other cues, from which the algorithm attempted to improvise a melody. The genetic algorithm used a multi-objective fitness function in order to reward solutions that were considered musically correct. Although Biles' attempts were successful, he did concede that the method was far from musically perfect.

Beretta et al. used a genetic algorithm to perform fuzzy compression and decompression of an input image [11]. Specifically, the method optimised the fuzzy membership boundaries that described a patch of pixels. The method was found to perform well on a large database of images and had very good generalisation ability, specifically with regard to preserving important features within a picture.

Gockel et al. used a hybrid genetic algorithm to solve the channel routing problem [47]. The algorithm used domain specific knowledge during initialisation of the population to improve the efficiency of the algorithm. The algorithm provided improved results over similar attempts and could feasibly be implemented on very large channels.

Koza used a coevolutionary genetic program to evolve game strategies for a number of types of strategy games [67]. Specifically, the method attempted to evolve a strategy equivalent to the minimax strategy, without prior knowledge of the minimax strategy. The algorithm was successful in discovering the minimax strategy.

Schweitzer et al. used evolutionary strategies to optimise the layout of a road network [89]. The method minimised the cost of road construction and maintenance, while providing direct connections between nodes in order to avoid detours. Evolutionary strategies proved to be a capable tool in solving this frustrating problem.

This section presents a brief introduction to the machine learning paradigm of artificial neural networks.
Section 2.3.1 introduces the reader to artificial neural networks by drawing parallels with biological neural systems. The fundamentals of artificial neural networks are presented in section 2.3.2. Performance issues pertaining to the use of artificial neural networks are discussed in section 2.3.3. Finally, section 2.3.4 presents the past usage of artificial neural networks. The interested reader is encouraged to review the works of Zurada [106] and Bishop [13].

Artificial neural networks are an attempt to model biological neural systems. One of the primary features of biological neural systems is that they can learn from their environments. For example, an animal can learn to identify possible sources of food and water, and also to identify potential threats. Also, many types of birds in infancy have the ability to learn to recognise the individual calls of their parents, particularly through the vast cacophony created by other members of their species in breeding areas.

The primary building blocks of biological neural systems are called neurons. A neuron consists of a cell body which contains a nucleus and cytoplasm, from which threadlike processes called dendrites extend. Electrochemical impulses travel from the cell body along a single fibre, called an axon, to other neurons. The processes of one neuron never touch those of another neuron and are separated by a space called a synapse. A synapse serves to either inhibit or propagate an electrochemical signal.

An artificial neuron (AN) is a model of a biological neuron. An AN represents a non-linear mapping of $I$ real-valued inputs ($\mathbb{R}^I$) to a real-valued output ($[0, 1]$, $[-1, 1]$ or $(-\infty, \infty)$). Each input $x_i$ is associated with a weight $w_i$, which serves to amplify the aforementioned input. Each AN has an associated activation function $f_{AN}$ and threshold value $\theta$, both of which control the output of the AN. The output of an AN is computed as $f_{AN}(net)$, where

$$net = \sum_{i=1}^{I} x_i w_i - \theta$$

for a summation unit, and

$$net = \prod_{i=1}^{I} x_i^{w_i} - \theta$$

for a product unit.
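The computation performed by a single AN can be sketched in Python. The sketch assumes that the net input of a summation unit is the weighted sum of the inputs less the threshold, and that a product unit multiplies the inputs raised to the power of their weights before subtracting the threshold; the function names are illustrative only, and the activation function is passed in by the caller.

```python
import math

def summation_unit_net(inputs, weights, threshold):
    """Net input of a summation unit: weighted sum of the inputs minus the threshold."""
    return sum(x * w for x, w in zip(inputs, weights)) - threshold

def product_unit_net(inputs, weights, threshold):
    """Net input of a product unit: product of x_i ** w_i minus the threshold.
    Assumes positive inputs, since a negative base with a fractional weight
    is not real-valued."""
    return math.prod(x ** w for x, w in zip(inputs, weights)) - threshold

def neuron_output(net, activation):
    """Apply the activation function f_AN to the net input."""
    return activation(net)
```

For example, `summation_unit_net([1.0, 2.0], [0.5, 0.25], 0.5)` evaluates to 0.5, and `product_unit_net([2.0, 3.0], [2.0, 1.0], 1.0)` evaluates to 11.0.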
Different types of activation functions $f_{AN}$ can be used. In general, activation functions are monotonically increasing mappings, where $f_{AN}(-\infty) = 0$ or $f_{AN}(-\infty) = -1$, and $f_{AN}(\infty) = 1$. Frequently used activation functions include the linear, step, ramp, sigmoid, hyperbolic tangent and Gaussian functions, of which the sigmoid activation function is the most common. The sigmoid activation function is defined as follows:

$$f_{AN}(net) = \frac{1}{1 + e^{-\lambda\, net}}$$

where $\lambda$ controls the slope of the function. Usually $\lambda = 1$, which gives an active domain (where $f'_{AN} \not\approx 0$) for the sigmoid function of $[-\sqrt{3}, \sqrt{3}]$.

The weights and threshold values of an AN are obtained through learning. A learning rule, such as the popular gradient descent learning rule, is used to train an AN [99]. Other learning rules include Widrow-Hoff [17], generalised delta [74], etc. The gradient descent learning rule requires the definition of an error metric to measure the AN's error in approximating a target output. The sum squared error

$$E_{SS} = \sum_{l=1}^{|P|} \left( y_{P_l} - \hat{y}_{P_l} \right)^2$$

is usually used, where $y_{P_l}$ and $\hat{y}_{P_l}$ represent the target and predicted outputs for pattern $P_l$, and $|P|$ is the size of the training set.

An AN essentially represents an $I$-dimensional hyperplane, where the curvature of the hyperplane is described by the AN activation function. This hyperplane is of importance, because it can be used either to describe a decision boundary between two linearly separable classes or to fit a regression curve through a number of data-points.

An artificial neural network (ANN) is a layered interconnection of ANs. A wide variety of ANN architectures are in use today [13][106]. A feed-forward ANN, illustrated in figure 2.5, consists of an input layer, a number of hidden layers and an output layer, where each layer is connected to the next in the aforementioned order. A feed-forward ANN can also implement direct connections between the input and output layer.
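The sigmoid activation function and the sum squared error defined earlier in this section can be written out as a short Python sketch. The parameter name `lam` stands for the slope parameter, and the derivative helper is an addition for illustration (it follows from the sigmoid definition and is what a gradient descent learning rule would use); moderate net input values are assumed, since very large negative values would overflow `math.exp`.

```python
import math

def sigmoid(net, lam=1.0):
    """Sigmoid activation: 1 / (1 + exp(-lam * net))."""
    return 1.0 / (1.0 + math.exp(-lam * net))

def sigmoid_derivative(net, lam=1.0):
    """Derivative of the sigmoid with respect to net: lam * f * (1 - f)."""
    out = sigmoid(net, lam)
    return lam * out * (1.0 - out)

def sum_squared_error(targets, predictions):
    """Sum over all patterns of the squared target-minus-predicted difference."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions))
```

At `net = 0` the sigmoid returns 0.5 and its derivative reaches its maximum of 0.25 (for a slope of 1); the derivative shrinks towards zero as the net input moves away from the active domain.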
Feed-forward ANNs are important, because it has been proved that feed-forward ANNs with monotonically increasing differentiable functions can approximate any continuous function with one hidden layer, provided that the hidden layer has enough hidden neurons [57][58]. Functional link ANNs are feed-forward ANNs that implement activation functions in all layers including the input layer [79]. Recurrent ANNs have feedback connections, which allow the ANN to learn the temporal characteristics of a given dataset [15]. The Elman recurrent ANN copies hidden units into a context layer, which is then fed back to the hidden layer. The Jordan recurrent ANN copies output units into a state layer, which is then fed back to the hidden layer.

Generally, ANNs (like other empirical learning algorithms) can be categorised into two types according to their learning strategies: supervised and unsupervised. Supervised ANNs attempt to predict the target response of a given input pattern. Input patterns can either form part of a classification or regression problem. Unsupervised ANNs attempt to discover patterns, features or relationships between input patterns, without supervision, and are mainly used to perform clustering. Unsupervised ANNs will be discussed further in section 2.4.

Supervised ANNs utilise a training set consisting of a number of input vectors, where each input vector has an associated target vector. The ANN uses the target vector to determine the overall error of the network. This overall error is employed by a learning rule to either directly adjust the ANN's weights and thresholds or as an error estimate for some optimisation method. ANN optimisation methods can either be classed as local or global. Examples of local optimisation methods include the popular gradient descent with back-propagation [99] and scaled conjugate gradient [75].
Examples of global optimisation methods include leapfrog optimisation [60], genetic algorithms [3] and particle swarm optimisation [61]. Local optimisation methods can, in turn, be categorised according to their weight update strategies:
• Stochastic learning adjusts the weight and threshold values of an ANN after each training pattern presentation.
• Batch learning adjusts the weight and threshold values of an ANN only after the presentation of all training patterns.

The gradient descent optimisation algorithm can be divided into two phases. The feed-forward phase calculates the output of the ANN. The back-propagation phase propagates an error signal back from the output layer, through the hidden layer, to the input layer. Essentially, the back-propagation phase adjusts the weights of each AN in the ANN proportionately according to the amount of error introduced by that AN. The weight update equations of an ANN have to be derived separately for each type of activation function employed by the ANN and each optimisation algorithm. An important aspect to the gradient descent algorithm is the introduction of a momentum term $\alpha$ and learning rate $\eta$ to the weight update equations. These terms control the convergence speed of an ANN and will be discussed in section 2.3.3.

Each training iteration (referred to as an epoch) of an ANN represents one pass through the training set. After each epoch the mean squared error is calculated as

$$E_{MS} = \frac{\sum_{l=1}^{|P|} \sum_{t=1}^{T} \left( y_{t,P_l} - \hat{y}_{t,P_l} \right)^2}{|P|\, T}$$

where $T$ represents the number of target outputs for the ANN, and $y_{t,P_l}$ and $\hat{y}_{t,P_l}$ are the target and predicted values of output $t$ for pattern $P_l$. Stochastic learning can be summarised with the following pseudo-code algorithm, assuming gradient descent:

1. Initialise the weights

$$w_{ih} := U\left( \frac{-1}{\sqrt{I+1}}, \frac{1}{\sqrt{I+1}} \right), \quad i \in \{1, \dots, I+1\},\ h \in \{1, \dots, H\}$$

$$w_{ht} := U\left( \frac{-1}{\sqrt{H+1}}, \frac{1}{\sqrt{H+1}} \right), \quad h \in \{1, \dots, H+1\},\ t \in \{1, \dots, T\}$$

where $I$, $H$ and $T$ are the number of input, hidden and output units respectively (the number of weights for each AN is increased by one to cater for the bias). Also initialise the learning rate $\eta$, the momentum $\alpha$ and the number of epochs $\xi$.
2. Repeat steps 3 to 5 for each epoch.
3. Set $E_{SS} := 0$.
4. For each training pattern $P_l$:
(a) Perform the feed-forward phase to calculate $\hat{y}_{t,P_l}$ for each output AN, $t$.
(b) Calculate the error signal $\delta_{o_t,P_l} = (y_{t,P_l} - \hat{y}_{t,P_l})$ for each output AN.
(c) Perform back-propagation of the error signal $\delta_{o_t,P_l}$, by adjusting the output layer weights $w_{ht}$, computing the hidden layer error signals $\delta_{y_h,P_l}$ and adjusting the hidden layer weights $w_{ih}$.
(d) Set $E_{SS} := E_{SS} + \sum_{t=1}^{T} (y_{t,P_l} - \hat{y}_{t,P_l})^2$.
5. Set $E_{MS} := \frac{E_{SS}}{|P|\, T}$.

As with all algorithms, a number of factors influence the performance of an artificial neural network (ANN). The performance of an ANN is influenced by three competing objectives, i.e. accuracy, complexity and convergence. Accuracy is concerned with those aspects of an ANN that impede its generalisation ability. Complexity is concerned with such issues as the ANN architecture, the size of the training set and the complexity of the optimisation method. Convergence is concerned with the stability of an ANN, i.e. whether the variability in the ANN outcome is within acceptable levels. This section discusses ANN data preparation, ANN weight initialisation, the learning rate, momentum and optimisation method of an ANN, ANN architecture selection and active learning with respect to how they relate to the above-mentioned objectives.

ANNs usually require the scaling of their input data. Incorrect scaling leads to decreased ANN accuracy, when the scaled inputs do not cover the entire active domain of the ANN activation functions. Patterns should thus be scaled to the active domains of the activation functions utilised in the hidden and output layers. Also, depending on the activation functions utilised in the output layer, the outputs of an ANN will have to be scaled. It is also important to ensure that the data is numeric, i.e. nominal attributes need to be converted into a continuous representation.
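The scaling requirement described above can be illustrated with a small Python sketch. Linear min-max scaling is an assumption made here for illustration, since the text does not prescribe a particular scaling method, and the function name is hypothetical. The target range in the usage note is the active domain of the sigmoid function quoted in section 2.3.2.

```python
import math

def scale_to_range(values, low, high):
    """Linearly rescale one attribute so that its minimum maps to low and its
    maximum maps to high (min-max scaling, assumed here for illustration)."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant attribute carries no information; place it mid-range.
        return [(low + high) / 2.0 for _ in values]
    span = (high - low) / (v_max - v_min)
    return [low + (v - v_min) * span for v in values]
```

For a sigmoid hidden layer, an attribute could be scaled with `scale_to_range(column, -math.sqrt(3), math.sqrt(3))`, so that the scaled inputs cover the active domain of the activation function.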
A nominal attribute with n different values is recoded as n binary-valued inputs, where the input parameter corresponding to a particular nominal value is assigned the value of 1 and the remainder are assigned the value of 0.

Outliers can significantly affect the weights of an ANN and can lead to decreased ANN accuracy in terms of generalisation. Essentially, an outlier has a large effect on the sum squared error metric, utilised by many ANN optimisation algorithms. Such optimisation algorithms will adjust weights so as to minimise the sum squared error. Thus, a single pattern can exert a disproportionate influence on the weights of an ANN. This leads to a case where the training error of an ANN might be good, but the generalisation error is poor. Thus, outliers result in a bias of the weights of the ANN toward the outlier. A number of strategies exist for handling the outlier problem. One solution is to remove the outliers before training, using statistical techniques. Another is to use a robust objective function, unlike the sum squared error, that is not influenced by outliers.

Training patterns with missing attribute values may contain useful information. Removal of these training patterns could result in decreased ANN accuracy, particularly if the pattern occurs in a region of low data point density. A common solution is to replace the missing value with the average value of the attribute.

ANN complexity ultimately depends on the size of the training set utilised during ANN training. A number of strategies have emerged to control large and, conversely, small training sets. The introduction of noise, sampled from a normal distribution, in small training sets has been shown to result in reduced training time and increased accuracy [56]. Several researchers have also developed strategies for coping with large training sets that involve the presentation of various subsets of the data to the ANN training process.
This section differs from the later active learning section in that the ANN has no active control over the subsets presented to it. These training set manipulation strategies include selective presentation [78], incremental training strategies [24], increased complexity training [22] and delta subset training [23].

The selective presentation strategy divides the original training set into two other training sets: one set containing "typical" patterns, and the other containing "confusing" patterns. Typical patterns refer to patterns that lie far away from decision boundaries and, conversely, confusing patterns refer to patterns that lie close to decision boundaries. The two new training sets are alternately presented to the training algorithm. In practice, this strategy is impractical because prior knowledge of the search space is required in order to divide the data into subsets.

The incremental training strategies start with a small random initial subset. During training additional patterns from the original set are added to the actual training set. This algorithm is practical because it assumes no prior knowledge of the search space.

The increased complexity training strategy splits the original training set into subsets of increased difficulty. The strategy starts by presenting easy problems to the learning algorithm, and gradually increases the level of difficulty. A drawback of the method is that the complexity measure of the patterns is problem dependent.

The delta subset training strategy orders the training patterns according to their interpattern distance. The metric used to perform the ordering can either be the Hamming distance or the Euclidean distance. The learning algorithm is then presented with either the smallest difference patterns first or the largest difference patterns first.

Gradient-based optimisation methods are very sensitive to the initial weight vectors. This sensitivity can lead to poor convergence. A particularly poor strategy is the initialisation of these vectors to 0, which can be shown to be equivalent to an ANN with only one hidden unit. A good strategy is to set the initial weight vectors to random weights in the range $\left[ \frac{-1}{\sqrt{connections}}, \frac{1}{\sqrt{connections}} \right]$, where the connections parameter represents the number of connections leading to an artificial neuron (AN) [100].

The learning rate $\eta$ controls the step size of the ANN optimisation method. A small learning rate results in small weight adjustments and a large learning rate, conversely, results in large weight adjustments. The learning rate parameter introduces a trade-off between the speed of convergence and the accuracy of convergence. This trade-off exists, because small weight adjustments cause the ANN to take longer to stabilise. Large weight adjustments, on the other hand, may cause the ANN to wildly oscillate between weight values and jump over a local minimum.

The momentum term $\alpha$ prevents unlearning in stochastic learning. Unlearning occurs when two successive weight updates result in no change in the state of the ANN weights, i.e. when the second weight update negated the effect of the first weight update. Momentum in ANNs is similar to inertia in physics, in that updates to the weights have little effect, unless they are sustained.

The optimisation method employed by an ANN has a significant influence on performance. While gradient descent is very popular, it suffers from slow convergence and susceptibility to local minima. However, optimisation methods such as particle swarm optimisation and genetic algorithms are significantly more computationally expensive. Thus, a trade-off exists among various types of optimisation methods.

Learning in ANNs does not just include finding the optimal weight values, it also includes finding the optimal ANN architecture.
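The weight initialisation, learning rate and momentum discussed above come together in stochastic gradient descent training. The following Python sketch illustrates one way this could look for a feed-forward ANN with a single hidden layer. It rests on several assumptions that are not taken from the thesis: sigmoid activations in both layers, a bias input of -1 so that the last weight of each AN acts as its threshold, error signals that include the sigmoid derivative, patterns presented in a fixed order, and hypothetical function and parameter names.

```python
import math
import random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def init_weights(n_from, n_to, rng):
    """One weight row per receiving AN, drawn uniformly from
    [-1/sqrt(connections), 1/sqrt(connections)], where connections
    counts the incoming units plus one for the bias."""
    bound = 1.0 / math.sqrt(n_from + 1)
    return [[rng.uniform(-bound, bound) for _ in range(n_from + 1)]
            for _ in range(n_to)]

def forward(pattern, w_ih, w_ht):
    """Feed-forward phase; returns the augmented input, augmented hidden
    activations and the outputs."""
    x = list(pattern) + [-1.0]  # a bias input of -1 implements the threshold
    h = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in w_ih] + [-1.0]
    out = [sigmoid(sum(w * v for w, v in zip(row, h))) for row in w_ht]
    return x, h, out

def train(patterns, targets, n_hidden, eta=0.5, alpha=0.5, epochs=500, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(patterns[0]), len(targets[0])
    w_ih = init_weights(n_in, n_hidden, rng)   # w_ih[h][i]
    w_ht = init_weights(n_hidden, n_out, rng)  # w_ht[t][h]
    d_ih = [[0.0] * (n_in + 1) for _ in range(n_hidden)]  # previous updates,
    d_ht = [[0.0] * (n_hidden + 1) for _ in range(n_out)]  # kept for momentum
    e_ms = None
    for _ in range(epochs):
        e_ss = 0.0
        for pattern, target in zip(patterns, targets):
            x, h, out = forward(pattern, w_ih, w_ht)
            # Output and hidden error signals (sigmoid derivative included).
            delta_o = [(y - o) * o * (1.0 - o) for y, o in zip(target, out)]
            delta_h = [h[j] * (1.0 - h[j]) *
                       sum(delta_o[t] * w_ht[t][j] for t in range(n_out))
                       for j in range(n_hidden)]
            # Stochastic learning: adjust the weights after every pattern.
            for t in range(n_out):
                for j in range(n_hidden + 1):
                    d_ht[t][j] = eta * delta_o[t] * h[j] + alpha * d_ht[t][j]
                    w_ht[t][j] += d_ht[t][j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    d_ih[j][i] = eta * delta_h[j] * x[i] + alpha * d_ih[j][i]
                    w_ih[j][i] += d_ih[j][i]
            e_ss += sum((y - o) ** 2 for y, o in zip(target, out))
        e_ms = e_ss / (len(patterns) * n_out)  # mean squared error of the epoch
    return w_ih, w_ht, e_ms
```

Setting `alpha` to 0 removes the momentum contribution, which makes it easy to compare how quickly the mean squared error falls with and without momentum on a small problem such as the logical OR function.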
Finding the optimal architecture is crucial, because a large number of weights, trained for too long, with noise in the training data causes an ANN to "memorise" that data. Memorisation results in poor ANN generalisation ability. Finding the optimal architecture ultimately requires a search over all possible architectures. The optimal architecture of an ANN is thus that architecture which results in the best generalisation performance. Architecture selection can be divided into three categories: regularisation, network construction and network pruning.

Regularisation involves the addition of penalty terms to the ANN objective function, which penalise network complexity (network size). A large number of regularisation strategies exist in the literature [46][63][98][101].

Network construction involves the growth of a small network by dynamically adding hidden ANs during training. This method requires the ANN to analyse and decide on an appropriate time to add new hidden ANs. Deciding on the appropriate time is, however, not trivial and can result in over-fitting and increased training time [42][54].

Starting with an overly large architecture, network pruning involves the removal of unnecessary ANs either during or after training. The decision to prune an AN depends on some measure of the relevance of that AN. A large number of pruning techniques exist in the literature. Optimal brain damage is a popular technique that uses sensitivity analysis to remove redundant weights [70]. Variants of optimal brain damage include optimal brain surgeon [52] and optimal cell damage [19]. Fletcher et al. utilise the Fisher information matrix as well as statistical hypothesis testing to determine the optimal number of hidden units and weights [38]. Engelbrecht used sensitivity analysis in order to prune irrelevant weights, hidden units and input units [31].

Most ANNs are passive learners, i.e. they have no control over the training data presented to them.
Active learning, on the other hand, allows ANNs to make optimal use of the training data. The ANN trains on the patterns it regards as most informative, by automatically removing patterns that are redundant or ambiguous from the training set. This section differs from training set manipulation in that the ANN has active control over the data. There are two main approaches to active learning: incremental learning and selective learning.

Incremental learning starts with an initial subset of the training data. At specified intervals during training, further patterns are selected from the training set using some heuristic. As training progresses the size of the actual training set increases [25][35][43][71].

Selective learning differs from incremental learning in that training starts with the full training set and patterns are discarded as they are found to be redundant [33][34]. Engelbrecht and Brits present an interesting active learning strategy that makes use of clustering to select the most informative patterns for training [32].

2.3.4 Past usage

Artificial neural networks (ANNs) are increasingly being applied to diverse fields such as feature recognition, compression, design, forecasting, classification etc. This section presents a small subset of these applications.

Le Cun et al. used an ANN for the recognition of hand-written digits on a database of zipcode examples provided by the U.S. Postal Service [69]. The algorithm was fairly accurate (1% error rate) and required minimal preprocessing of the input data. The results indicated that the method was extensible to alphanumeric characters. Indications were that the method was also comparatively fast, with the ability to recognise over 10 digits per second.

Fanghanel et al. used an ANN in the field of data compression to find the optimal parameters for Wavelet Transform Coding [37]. Specifically, the ANN learnt the optimal Daubechies filters for different one dimensional signals.
Their findings indicated that the ANN resulted in better compressed outputs than the standard decompositions under noisy circumstances. Sellar et al. used an ANN approach to assist in the design of a "hovercraft" [90]. The method was required to select an appropriate combination of design variables, in order to come up with a feasible design. The design variables consisted of a number of discipline-specific local optimisations. An ANN-based optimisation algorithm was employed in order to optimise the global combination of design variables. Sellar et al. termed these combinations response surface approximations. The method was shown to reduce the cost and time associated with designing a system, in comparison to the traditional methods of designing such a system. Yao et al. used a standard back-propagation ANN in order to forecast exchange rates between the Swiss Franc and the U.S. Dollar [104]. The method modelled different trading strategies and computed the average paper profits for adopting any given strategy. Depending on the strategy employed, they showed that average paper profits between 11.36% and 27.59% could be achieved over different time horizons.

This section presents a brief overview of clustering and various clustering techniques. Section 2.4.1 introduces the goals of clustering, as well as the main categories into which all clustering algorithms fall. K-means clustering is discussed in section 2.4.2. Section 2.4.3 presents the learning vector quantiser. Self-organising maps are discussed in section 2.4.4. Finally, section 2.4.5 presents the split-and-merge algorithm.

Clustering is the process of finding groups of similar data points in a given dataset, i.e. finding regions in a dataset with a high data point density. Essentially, any clustering algorithm attempts to reduce the variance among data points in each of its constituent clusters.
Clustering can thus be seen as a process of partitioning a set of data points, such that each of the subsets of those data points is homogeneous with respect to some characteristic. Each cluster represents knowledge about its constituent data points. Clustering methods can broadly be classified into two main categories [5][51][66]:
• Partitional clustering aims to directly partition a given dataset into disjoint subsets (clusters) such that specific clustering criteria are optimised.

Clustering is primarily used to reduce the amount of data to be presented to machine learning algorithms. Clustering is also important for exploratory data analysis [28][64], where little is known about a dataset or problem. Exploratory data analysis attempts to make a set of data points more accessible by using simple analysis methods to provide a preliminary overview of the content of those data points.

K-means clustering is a fast, rough, partitional clustering algorithm [72]. K-means clustering treats each training pattern in a dataset as a real-valued input vector $p_l$, $l \in \{1, \dots, |P|\}$, where $|P|$ is the size of the dataset. Nominal attributes need to be converted to binary-valued features as described in section 2.3.3. K-means clustering initialises $k$ buckets (or clusters) $C_\theta$, $\theta \in \{1, \dots, k\}$, where each bucket is associated with a centroid vector $w_\theta$. Each centroid vector $w_\theta$ represents the average of all of the input vectors in $C_\theta$. The algorithm starts by randomly removing $k$ training patterns from the dataset and inserting one pattern into each bucket. The algorithm proceeds by placing each of the remaining training patterns in the dataset into the bucket whose centroid lies closest to that training pattern. The distance metric used by the algorithm can either be the Manhattan distance or the Euclidean distance.
After all training patterns have initially been clustered, the algorithm repeatedly iterates over all the training patterns in all buckets, moving each training pattern to the bucket with the closest centroid, until a stopping criterion is reached. The k-means clustering algorithm proceeds as follows:
(a) Update the cluster centroid $w_\theta$ to be the centroid of all the samples currently in $C_\theta$.
4. Calculate the quantisation error using
$$E_Q = \sum_{\theta=1}^{k} \sum_{p_l \in C_\theta} \| p_l - w_\theta \|^2$$
5. Return to step 2 until $E_Q$ does not change significantly, cluster membership does not change or a maximum number of iterations is reached.

Many k-means clustering variants exist in the literature. Alsabti et al. present an interesting k-means clustering algorithm that makes use of a tree structure in order to reduce the number of prototype comparisons during each training cycle [4]. Basically, the tree structure represents a nearest neighbour ordering of the training patterns. ISODATA is another k-means clustering variant suitable for image clustering [9].

2.4.3 Learning vector quantisers

A learning vector quantiser (LVQ) is a neural network-based, unsupervised, partitional clustering algorithm [65]. An LVQ has two layers: an input layer and an output layer. The LVQ training process constructs clusters based on competition between output artificial neurons. Each output unit of the LVQ represents a cluster (not to be confused with a classification). During training, the output unit whose weight vector is closest to the current input pattern is declared the winner. The weights of the winning output unit and those of its neighbours are adjusted to better resemble the training pattern. The distance between an input pattern and a weight vector is measured using the Euclidean distance. In essence, the LVQ is very similar to the k-means clustering algorithm. Each output unit's weights can be viewed as a centroid vector, and the update equations cause those centroid vectors to move in order to cover or describe a number of patterns.
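The direct k-means procedure of section 2.4.2 can be sketched in Python as follows. This is a minimal illustration only, assuming the Euclidean distance, random seeding from the training patterns and a stop on unchanged cluster membership; the function names, the `seed` parameter and the default `max_iterations` are hypothetical and do not come from the thesis.

```python
import random


def squared_distance(p, w):
    """Squared Euclidean distance between pattern p and centroid w."""
    return sum((pi - wi) ** 2 for pi, wi in zip(p, w))


def k_means(patterns, k, max_iterations=100, seed=0):
    """Direct k-means: returns (centroids, assignments, quantisation error)."""
    rng = random.Random(seed)
    # Seed each bucket with one randomly chosen training pattern.
    centroids = [list(p) for p in rng.sample(patterns, k)]
    assignments = None
    for _ in range(max_iterations):
        # Place every pattern in the bucket with the closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in patterns
        ]
        if new_assignments == assignments:
            break  # cluster membership did not change
        assignments = new_assignments
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(patterns, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Quantisation error E_Q: summed squared distance to the owning centroid.
    error = sum(
        squared_distance(p, centroids[a]) for p, a in zip(patterns, assignments)
    )
    return centroids, assignments, error
```

Calling `k_means(patterns, k=2)` on two well-separated groups of points should place each group in its own bucket; an empty bucket simply keeps its previous centroid in this sketch, which is a design choice rather than part of the described algorithm.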
2.4.4 Self-organising maps

Self-organising maps (SOMs) were motivated by the self-organisation characteristics of the human cerebral cortex, such as the visual cortex and the auditory cortex [66]. The SOM is a multi-dimensional scaling method to project an $I$-dimensional input space to a discrete output space of lower dimension. This discrete output space is usually a two-dimensional grid, but can be toroidal. The SOM uses the grid to approximate the probability density function of the input space, while still maintaining the topological structure of the input space. Therefore, if two input patterns are close to one another in the input space, then the patterns will also be close together in the SOM. SOM training is based on a competitive learning strategy. Each artificial neuron (AN) $o_t$ on the map is associated with an $I$-dimensional weight vector $w_t$ (shown by figure 2.6). Two types of clustering occur for SOMs:
• The first is during training, where input patterns are mapped to the closest AN, so that the AN serves as a centroid of a cluster of geometrically similar patterns.
• Then, after training, centroids/neuron vectors are further clustered together to form groupings of similar ANs.

Training starts by initialising each AN's weight vector. The weight vector can be initialised to random values, to a randomly selected input pattern or using any other suitable strategy. Each training pattern is presented to the SOM and the AN with the shortest Euclidean distance to this pattern is adjusted to more accurately reflect the training pattern. Also, ANs in the neighbourhood of the winning AN are proportionately adjusted to reflect the training pattern. The learning process is iterative and continues until a good enough map has been found.
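The competitive training loop described above can be sketched as follows. This is a minimal illustration only, assuming a one-dimensional map, a Gaussian neighbourhood function and a linearly decaying learning rate and neighbourhood radius. None of these choices, nor the names `train_som`, `lr` and `radius`, come from the text, and a fixed number of epochs stands in for the "good enough map" stopping condition.

```python
import math
import random


def train_som(patterns, n_neurons, epochs=50, lr=0.5, radius=1.0, seed=0):
    """Train a one-dimensional SOM and return its list of weight vectors."""
    rng = random.Random(seed)
    # Initialise each AN's weight vector to a randomly selected input pattern.
    weights = [list(rng.choice(patterns)) for _ in range(n_neurons)]
    for epoch in range(epochs):
        # Assumed schedule: learning rate and radius shrink linearly over time.
        decay = 1.0 - epoch / epochs
        sigma = radius * decay + 1e-9
        for p in patterns:
            # The winner is the AN with the shortest Euclidean distance to p.
            winner = min(range(n_neurons), key=lambda j: math.dist(p, weights[j]))
            for j in range(n_neurons):
                # ANs close to the winner on the grid are adjusted more strongly.
                h = math.exp(-((j - winner) ** 2) / (2.0 * sigma ** 2))
                step = lr * decay * h
                weights[j] = [w + step * (x - w) for w, x in zip(weights[j], p)]
    return weights
```

Because every update moves a weight vector part of the way towards a training pattern, the weights remain inside the region spanned by the data.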
The quantisation error is a measure of the goodness of the map and can be defined as the sum of the squared Euclidean distances of all patterns to their corresponding winning ANs:
$$E_Q = \sum_{l=1}^{|P|} \| p_l - w_t \|^2$$
where $p_l$ is the training pattern and $w_t$ is the weight vector of the winning AN $o_t$. The main advantage of a SOM is the easy visualisation and interpretation of clusters formed by the map. Map visualisation is achieved by computing the unified distance matrix, which expresses the distances between the codebook vectors of adjacent neurons, or by using some colour scale representation. Large distances represent cluster boundaries and small distances represent clusters. SOMs can also be used for classification by labelling each AN according to the most likely outcome of that AN, and using the label as a classification for input patterns. Other applications of SOMs include prediction and interpolation [28].

2.4.5 Split-and-merge

[Figure 2.7: The split-and-merge algorithm applied to a grid of pixel intensity values; panels: split 1, split 2, merge]

Split-and-merge is a hierarchical clustering algorithm that attempts to discover regions in a dataset, within which some property does not change abruptly [59]. Split-and-merge is mostly used for finding regions of homogeneous intensity in images and is important for robot vision.
An image region is a set of connected pixels such that:
1. The region is homogeneous, for example:
• The difference in intensity values of pixels in the region is no more than some error $\epsilon$.
• A polynomial surface of degree $n$ can be fitted to the intensity values of pixels in the region with largest error less than $\epsilon$.
2. For no two adjacent regions is it the case that the union of all the pixels in these two regions satisfies the homogeneous property.

Split-and-merge starts with one image region. If the region does not satisfy property 1 of the image region definition, then the image is partitioned into four image regions. Each of those image regions that does not satisfy the homogeneous property is, in turn, partitioned into four image regions. This process iterates until each image region is homogeneous. Once all splitting has been performed, the algorithm merges image regions until property 2 of the image region definition is satisfied. Figure 2.7 illustrates the split-and-merge algorithm with $\epsilon \le 1$.

This chapter provided a taxonomy of the underlying methods employed by many data mining applications in use today. The paradigms of knowledge discovery, evolutionary computing, artificial neural networks and clustering were broadly discussed. This chapter provided the fundamental grounding for the algorithms and methods presented in the rest of this thesis. The next chapter discusses the GASOPE method, which performs function approximation by obtaining structurally optimal polynomial expressions.

Chapter 3

THE GASOPE METHOD

The previous chapter presented a number of machine learning paradigms used in this chapter, and throughout the rest of this thesis. This chapter presents a function approximation method that uses a genetic algorithm to evolve structurally optimal polynomial expressions (GASOPE). This genetic algorithm is one of the core aspects of the algorithm discussed in the next chapter.
The study of function approximation can be broken up into two classes of problems. One class deals with a function being explicitly stated, where the objective is to find a computationally simpler type of function, such as a polynomial, that can be used to approximate a given function. The other class deals with finding the best function to represent a given set of data points (or patterns). The latter class of function approximation plays a very important role in the prediction of continuous-valued outcomes, e.g. subscriber growth forecasts, time-series modelling, etc. This chapter concentrates on methods to construct functions that accurately represent a series of data points, at minimal processing cost. Traditional methods to perform this type of function approximation include the frequently used discrete least-squares method, regression, Taylor polynomials and Lagrange polynomials. Other methods include neural networks and some evolutionary algorithm paradigms.

This chapter develops a genetic algorithm approach to evolve structurally optimal polynomial expressions (GASOPE) in order to describe a given data set. A fast clustering algorithm is used to reduce the pattern space and thereby reduce the training time of the algorithm. Highly specialised mutation and crossover operators are used to directly optimise the polynomial expressions, and to exploit similarities between the various polynomial expressions in the search space.

The remainder of this chapter is organised as follows: Section 3.2 presents an overview of various function approximation techniques. The implementation of the genetic algorithm polynomial approximator is presented in detail in section 3.3. The section presents the clustering algorithm used to cluster the training data, and introduces the representation, specialised mutation and crossover operators, and the hall-of-fame, all of which ensure the structural optimality of the evolved polynomials.
The experimental procedure, data sets and results are discussed in section 3.4. Finally, section 3.5 presents the summarised findings and envisioned future developments to the method.

As was mentioned earlier, function approximation can be broken up into two classes of problems: one class dealing with the simplification of a defined function in order to determine approximate values for that defined function, and the other class dealing with finding a function that best describes a set of data points. This section discusses, in detail, traditional methods to perform both classes of function approximation, such as the discrete least squares approximation, regression, Taylor polynomials and Lagrange polynomials. This section also discusses other approaches, such as neural networks and evolutionary computing.

3.2.1 Discrete least squares approximation

The following is a brief adaptation of Fraleigh and Beauregard [40], and Burden and Faires [16]. The method of least squares involves determining the best linear approximation to an arbitrary set of $m = |P|$ data points $\{(a_1, b_1), \dots, (a_m, b_m)\}$, by minimising the least squares error
$$E_{SS} = \sum_{i=1}^{m} [b_i - \hat{b}_i]^2$$
where
$$\hat{b}_i = \sum_{j=0}^{n} r_j a_i^j$$
and $n + 1$ represents the maximum number of terms. The coefficients $r_0$ through $r_n$ of the polynomial function mentioned above can be determined by solving the linear system:
$$(A^T A)\, r = A^T b$$
where $r^T = [r_0 \; r_1 \; \dots \; r_n]$, which yields the least-squares solution of the overdetermined linear system $A r = b$ with
$$A = \begin{bmatrix} 1 & a_1 & a_1^2 & \cdots & a_1^n \\ 1 & a_2 & a_2^2 & \cdots & a_2^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & a_m & a_m^2 & \cdots & a_m^n \end{bmatrix} \quad (3.4)$$
and vector $b^T = [b_1 \; b_2 \; \dots \; b_m]$.

Obviously, the above method requires a decision as to what function to use to calculate the least-squares fit. Many types of functions can be fitted, e.g. polynomial, exponential, logarithmic, etc. In the simple polynomial case, however, at least a decision needs to be taken as to the value of $n$. Because a least-squares fit is empirically obtained from a set of data points, the interpolation characteristics of such a fit are reasonably good.
However, for the same reason, a least-squares fit has poor extrapolation properties, particularly when an extrapolated point lies far away from the data set. The least squares method is relatively fast, because it only requires the reduction of one linear system in order to obtain the best approximation of the problem space for a given architecture (approximating function). However, outliers can exercise considerable influence on the $r_j$ coefficients and can lead to poor generalisation.

Regression can be described as the process of discovering the most plausible and most easily understandable relationship between a set of independent variables and a dependent variable [96]. A dependent variable and an independent variable are correlated when there is a relationship between them, i.e. an increase in the independent variable results in a decrease in the dependent variable or an increase in the independent variable results in an increase in the dependent variable. However, it is important to note that correlation between variables does not imply causality, i.e. if there is a relationship between an independent and a dependent variable, it is not necessarily the case that changes in the dependent variable are directly caused by the independent variable in the real world. The coefficient of determination $R^2$ provides an indication of how well a discrete least squares approximation (model) of section 3.2.1 fits the observed data. The coefficient of determination $R^2$ is defined as:
$$R^2 = 1 - \frac{\sum_{i=1}^{m} (b_i - \hat{b}_i)^2}{\sum_{i=1}^{m} (b_i - \bar{b})^2}$$
where $m$ is the size of the training set, $b_i$ is the target output of pattern $i$, $\bar{b}$ is the mean of the target values and $\hat{b}_i$ is the predicted output of pattern $i$. The coefficient of determination has the range [0,1], where values closer to 1 indicate a strong correlation between the data set and the model and values closer to 0 indicate no relation between the data set and the model.
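The least-squares fit of section 3.2.1 and the coefficient of determination can be illustrated with a short Python sketch. It assumes that the coefficients are obtained from the normal equations $(A^T A) r = A^T b$, solved here by Gaussian elimination with partial pivoting; the function names are hypothetical, and a production implementation would more likely rely on a numerical library.

```python
def fit_polynomial(a, b, n):
    """Least-squares coefficients r_0..r_n from the normal equations."""
    size = n + 1
    # Each row of the design matrix A holds the powers a_i^0 .. a_i^n.
    rows = [[x ** j for j in range(size)] for x in a]
    ata = [[sum(row[i] * row[j] for row in rows) for j in range(size)]
           for i in range(size)]
    atb = [sum(row[i] * y for row, y in zip(rows, b)) for i in range(size)]
    # Gaussian elimination with partial pivoting on the augmented system.
    aug = [ata[i] + [atb[i]] for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    # Back substitution.
    coeffs = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(aug[i][j] * coeffs[j] for j in range(i + 1, size))
        coeffs[i] = (aug[i][size] - tail) / aug[i][i]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(r * x ** j for j, r in enumerate(coeffs))


def r_squared(a, b, coeffs):
    """Coefficient of determination of the fit on the data (a, b)."""
    mean_b = sum(b) / len(b)
    ss_res = sum((y - predict(coeffs, x)) ** 2 for x, y in zip(a, b))
    ss_tot = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - ss_res / ss_tot
```

For data generated exactly by a quadratic, fitting with `n = 2` should recover the generating coefficients and give an $R^2$ very close to 1, whereas fitting the same data with `n = 1` gives a lower value.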
The correlation coefficient is defined as $R = \sqrt{R^2}$ for simple linear regression, where there is only one independent variable. The adjusted coefficient of determination $R_a^2$ is used as an indication of how well a discrete least squares approximation fits the observed data, factoring in the complexity of the discrete least squares approximation. The adjusted coefficient of determination $R_a^2$ is defined as:
$$R_a^2 = 1 - \frac{\sum_{i=1}^{m} (b_i - \hat{b}_i)^2}{\sum_{i=1}^{m} (b_i - \bar{b})^2} \cdot \frac{m - 1}{m - k}$$
where $k$ is the number of coefficients (free variables) of the least squares approximation and all other variables are defined as before. Essentially, the adjusted coefficient of determination $R_a^2$ penalises fits that have larger numbers of free variables, i.e. two models may have the same or similar accuracy, but the model with the smaller number of free variables will be preferred. Thus, the adjusted coefficient of determination attempts to maximise the correlation between a model and the data set, while minimising the architecture of the model.

The following is a brief adaptation of Haggarty [50]. Without loss of generality, assume $x$ is a scalar. If $f$ is an $n$-times differentiable function at a point $a$, then the Taylor polynomial of degree $n$ for $f$ at $a$ is defined by:
$$T_{n,a}f(x) = f(a) + \frac{f'(a)}{1!}(x - a) + \dots + \frac{f^{(n)}(a)}{n!}(x - a)^n$$
The importance of Taylor polynomials is that they only involve simple addition and multiplication. Moreover, given $x$ to any specified degree of accuracy, it is straightforward to evaluate such polynomial expressions to a comparable degree of accuracy. Taylor's theorem provides an important result: Let $f$ be $(n + 1)$-times continuously differentiable on an open interval containing the points $a$ and $b$. Then the difference between $f$ and $T_{n,a}f$ at $b$ is given by:
$$f(b) - T_{n,a}f(b) = \frac{(b - a)^{n+1}}{(n + 1)!} f^{(n+1)}(c)$$
for some $c$ between $a$ and $b$. The error in approximating $f(x)$ by the polynomial $T_{n,a}f(x)$ is the term to the right of the equality in the above.
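As a small worked illustration of the Taylor polynomial and its error term, the following Python sketch approximates the exponential function, chosen here only because every derivative of $e^x$ is again $e^x$. The function names are hypothetical.

```python
import math


def taylor_exp(x, n, a=0.0):
    """Degree-n Taylor polynomial of exp about a, evaluated at x."""
    # Every derivative of exp at a equals exp(a).
    return sum(
        math.exp(a) * (x - a) ** j / math.factorial(j) for j in range(n + 1)
    )


def remainder_bound(x, n, a=0.0):
    """Upper bound on |exp(x) - T_{n,a}(x)| from Taylor's theorem."""
    # |f^(n+1)(c)| = exp(c) <= exp(max(a, x)) for any c between a and x.
    return abs(x - a) ** (n + 1) / math.factorial(n + 1) * math.exp(max(a, x))
```

Comparing `abs(math.exp(1.0) - taylor_exp(1.0, 5))` with `remainder_bound(1.0, 5)` shows the actual error falling inside the bound given by the theorem, and the bound shrinks rapidly as `n` grows.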
The error at a point between the Taylor polynomial $T_{n,a}f(x)$ and any function $f$ can be determined to any degree of accuracy. This, in turn, means that Taylor polynomials can be used to approximate any $n$-times differentiable, continuous function at any specific point on that curve. Taylor polynomials are not appropriate for interpolation [16].

The $n$th Lagrange interpolating polynomial is an alternative to Taylor polynomials that allows the approximation of a defined function over an interval [16]. The definition of an $n$th Lagrange interpolating polynomial is as follows: If $x_0, x_1, \dots, x_n$ are $n + 1$ distinct numbers and $f$ is a function whose values are given at these numbers, then there exists a unique polynomial $P(x)$ of degree at most $n$ with
$$P(x) = f(x_0)L_{n,0}(x) + \dots + f(x_n)L_{n,n}(x) = \sum_{k=0}^{n} f(x_k)L_{n,k}(x)$$
where
$$L_{n,k}(x) = \frac{(x - x_0)(x - x_1) \cdots (x - x_{k-1})(x - x_{k+1}) \cdots (x - x_n)}{(x_k - x_0)(x_k - x_1) \cdots (x_k - x_{k-1})(x_k - x_{k+1}) \cdots (x_k - x_n)} = \prod_{i=0, i \neq k}^{n} \frac{(x - x_i)}{(x_k - x_i)} \quad (3.8)$$
for each $k = 0, 1, \dots, n$. In order to fit an $n$-degree Lagrange polynomial, the method should be provided with $n + 1$ control points that are uniformly distributed throughout the interval of the defined function. It can be proved that the error between the Lagrange polynomial and the defined function is bounded over an interval. Suppose $x_0, x_1, \dots, x_n$ are distinct numbers in the interval $[a, b]$ and $f \in C^{n+1}[a, b]$. Then, for each $x$ in $[a, b]$, a number $\xi(x)$ in $(a, b)$ exists with
$$f(x) = P(x) + \frac{f^{(n+1)}(\xi(x))}{(n + 1)!}(x - x_0)(x - x_1) \cdots (x - x_n)$$

3.2.5 Selecting the correct approximating polynomial order

Any function with $n$ turning points can be reasonably described by a polynomial of degree $n + 1$. Assume a continuous function $f$ with $n$ turning points $(p_1, \dots, p_n)$, then the derivative of $f$ must necessarily be 0 at those turning points. A polynomial expression with $f'(x) = 0$ at all points $(p_1, \dots, p_n)$ has the factorised form
$$f'(x) = (x - p_1)(x - p_2) \cdots (x - p_n)$$
which simplifies to
$$f'(x) = r_n x^n + r_{n-1} x^{n-1} + \dots + r_0$$
Integration of the simplified form yields
$$f(x) = \frac{r_n}{n + 1} x^{n+1} + \frac{r_{n-1}}{n} x^{n} + \dots + \frac{r_0}{1} x^{1} + C$$
which is of degree $n + 1$.
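Returning to the Lagrange interpolating polynomial defined earlier in this section, the product form of $L_{n,k}(x)$ translates directly into code. The following Python sketch evaluates $P(x)$ at a single point; the function name is hypothetical, and the interpolation nodes are assumed to be distinct.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        # Basis polynomial L_{n,k}(x): product over all nodes except x_k.
        basis = 1.0
        for i, xi in enumerate(xs):
            if i != k:
                basis *= (x - xi) / (xk - xi)
        total += yk * basis
    return total
```

With three nodes sampled from a quadratic, the interpolating polynomial of degree at most 2 coincides with that quadratic, so evaluating between the nodes reproduces the original function.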
Thus, if a polynomial is used to approximate a defined function or dataset, the degree of the polynomial approximation should always be selected as one more than the number of turning points of the original function or dataset. Higher order approximations can lead to over-fitting, because too many free variables (coefficients and terms) are available to any given polynomial approximation technique.

Artificial neural networks with at least one hidden layer have been proved to be universal approximators [57][58]. This means that a neural network can approximate any nonlinear mapping to a desired degree of accuracy, provided that enough hidden units are provided in the hidden layer. With reference to the previous section, Basson and Engelbrecht showed that for any function, the optimal number of hidden units that should be provided in the hidden layer should be one more than the number of turning points of the function [10]. However, this result assumes prior knowledge about the input space. With reference to section 2.3, neural networks suffer from a number of problems when performing function approximation:
• The training of neural networks is computationally time consuming, especially when a large number of data points are used and for large architectures.
• Finding the optimal architecture is crucial to ensure optimal interpolation (generalisation) performance. Architecture selection further adds to the complexity of training.
• Depending on the training algorithm used, neural networks are susceptible to local minima.
• While neural networks do have extrapolation capabilities, these deteriorate the further extrapolation points lie from the training set.

3.2.7 Evolutionary computing

A number of evolutionary computing approaches have been applied to function optimisation. Wilson describes a genetic algorithm that performs a piecewise-linear approximation to any arbitrary function [102].
His findings yielded arbitrarily close approximations that were efficiently distributed over a function's domain. Angeline discusses a model that uses genetic programming to select a system of equations that are optimised in a neural network-like fashion, in order to predict chaotic time-series [6]. His goal, specifically, was to evolve task-specific activation functions for neural network-like systems of equations, i.e. activation functions that were not sigmoidal in nature. Nikolaev and Iba discuss a genetic program that uses Chebyshev polynomials as building blocks for a tree-structured polynomial expression [76]. Their findings indicate that the tree-structured polynomial representation produced superior results on several benchmark and real-world time-series prediction problems, both in terms of training and generalisation accuracy. With reference to section 2.2, evolutionary computing approaches have a number of factors that need to be considered before application:
• The training of evolutionary computing paradigms is computationally time consuming, especially when a large number of data points are used and for a large number of individuals.
• A suitable representation scheme must be chosen in order to adequately describe the problem space.
• A suitable fitness function must be selected in order to adequately describe the problem space.
• Depending on the problem, evolutionary computing paradigms could be sensitive to initial conditions and training parameters.
• Using a suitable chromosome representation scheme, the interpolation ability of an evolutionary computing algorithm can be engineered to be as good as any numerical analysis technique, particularly if the evolutionary computing algorithm utilises numerical methods. However, the evolutionary computing algorithm's extrapolation abilities will only be as good as the underlying numerical methods.
This section discusses the implementation specifics of a genetic algorithm that evolves structurally optimal polynomial expressions (GASOPE) and heavily borrows from the ideas presented in section 3.2. Essentially, the algorithm is a three stage process that consists of:

One of the primary problems with all machine learning paradigms is the need to iterate over each training pattern in order to calculate an error metric (in the case of a neural network) or to calculate the fitness (in the case of an evolutionary algorithm) of an individual. This is especially a problem with very large data sets. Special strategies have to be utilised, such as active learning for artificial neural networks (section 2.3.3), to solve the large data set problem. Clustering has been used in order to try to break the aforementioned restriction. The idea is to perform a fast, rough clustering of the training data, and then to draw a stratified random sample from the clusters, to be used in a manner similar to that of Engelbrecht and Brits [32]. A stratified random sample [96], of size $s$, is drawn from $k$ clusters of training patterns, where each cluster represents a stratum of homogeneous training patterns (according to some characteristic inherent in the data), instead of using all of the available training patterns. The stratified random sample is drawn proportionally from each of the $k$ clusters, i.e.:
$$C'_\theta \subset C_\theta : |C'_\theta| = \frac{|C_\theta| \, s}{|P|}$$
where $|C_\theta|$ is the (stratum) size of cluster $C_\theta$ and $|P|$ is the size of the data set. Obviously, a proportional sample would be meaningless unless each stratum was homogeneous with respect to some characteristic. Also, if a cluster consists of just outliers, a proportional sample will prevent these outliers from skewing the data distribution. The GASOPE method uses the k-means clustering algorithm (of section 2.4.2) to obtain the homogeneous strata mentioned above.
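The proportional stratified sample described above can be sketched in a few lines of Python. The sketch assumes that the clusters have already been formed and that each stratum's share is rounded to the nearest integer; the rounding rule and the function name are assumptions, since the text does not state how fractional shares are handled.

```python
import random


def stratified_sample(clusters, s, seed=0):
    """Draw roughly s patterns, proportionally from each cluster (stratum)."""
    rng = random.Random(seed)
    total = sum(len(cluster) for cluster in clusters)  # |P|
    sample = []
    for cluster in clusters:
        # Stratum share |C| * s / |P|; rounding to nearest is an assumption.
        quota = round(len(cluster) * s / total)
        sample.extend(rng.sample(cluster, min(quota, len(cluster))))
    return sample
```

For clusters of 60, 30 and 10 patterns and a sample size of 10, the sample contains 6, 3 and 1 patterns from the respective clusters, so larger clusters contribute proportionally more patterns. With rounding, the total can differ slightly from `s` when the shares are not whole numbers.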
The GASOPE method uses a simple heuristic in order to increase the performance of the k-means clustering algorithm, which requires the inclusion of a standard deviation measure for each cluster centroid. The algorithm proceeds as follows:
1. Initialise $k$ centroids $(w_1, \dots, w_k)$ such that each centroid is initialised to one input vector, $w_\theta = p_l$, $\theta \in \{1, \dots, k\}$, and initialise $k$ centroid deviation vectors $(\sigma_1, \dots, \sigma_k)$ such that $\sigma_\theta = 0$, $\theta \in \{1, \dots, k\}$, where each cluster $C_\theta$ is associated with the centroid $w_\theta$ and the centroid deviation vector $\sigma_\theta$.
2. (a) If $(p_l \in C_{\theta'}, \theta' \in \{1, \dots, k\}) \wedge ((p_l < w_{\theta'} - \sigma_{\theta'}) \vee (p_l > w_{\theta'} + \sigma_{\theta'}))$, then find the nearest centroid $w_{\theta^*}$ and move $p_l$ to cluster $C_{\theta^*}$.
3. (a) Update the cluster centroid $w_\theta$ to be the centroid of all the samples currently in $C_\theta$ so that
$$w_\theta = \frac{\sum_{p_l \in C_\theta} p_l}{|C_\theta|}$$
(b) Update the cluster centroid deviation vector $\sigma_\theta$ to be the standard deviation of all the samples currently in $C_\theta$.
4. Calculate the quantisation error
$$E_Q = \sum_{\theta=1}^{k} \sum_{p_l \in C_\theta} \| p_l - w_\theta \|^2$$
5. Return to step 2 until $E_Q$ does not change significantly or cluster membership does not change.

Essentially, the centroid $w_\theta$ and the centroid standard deviation $\sigma_\theta$ in the above algorithm allow the k-means clustering algorithm to fit $k$ hyper-cubes over an $n$-dimensional pattern space. Any pattern not within the bounds of the hyper-cube of the cluster of which the pattern is a member becomes eligible for selection in step (2a) of the algorithm. This pattern selection strategy drastically reduces the number of comparisons that need to be made for each training iteration of the algorithm, resulting in improved performance over the normal direct k-means clustering algorithm. Figure 3.1 demonstrates a typical output of the above k-means clustering algorithm for 15 cluster centroids. What is interesting to note is that larger clusters form around the turning points of the function.
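The hyper-cube test that decides whether a pattern is eligible for reassignment can be sketched as follows. The sketch assumes that the vector comparisons in the eligibility condition are applied component-wise, so that a pattern is eligible as soon as any one component falls outside the interval centroid ± deviation, and that the population standard deviation is used. Both are interpretations rather than statements from the text, and the function names are hypothetical.

```python
import statistics


def centroid_and_deviation(members):
    """Per-dimension mean and standard deviation of a cluster's patterns."""
    columns = list(zip(*members))
    centroid = [statistics.fmean(col) for col in columns]
    # Population standard deviation is an assumption of this sketch.
    deviation = [statistics.pstdev(col) for col in columns]
    return centroid, deviation


def outside_hypercube(pattern, centroid, deviation):
    """True if any component lies outside centroid +/- deviation."""
    return any(
        x < w - d or x > w + d
        for x, w, d in zip(pattern, centroid, deviation)
    )
```

Only patterns for which `outside_hypercube` returns true need to be compared against all $k$ centroids in a given iteration, which is the source of the saving over direct k-means.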
These larger clusters demonstrate that the information content of the function is greatest near the turning points of a function, where the derivative of the function is changing more rapidly. This, in turn, indicates that the efforts of the function approximation technique should be concentrated on the regions near the turning points of the function. Proportional sampling will select more patterns from larger clusters, and will therefore concentrate the efforts of the function approximation technique on the regions near the turning points. The idea is similar to an incremental learning approach used by Engelbrecht for artificial neural networks [35]. With regards to the bias-variance dilemma [44], clustering minimises the variance component of the GASOPE method. The genetic algorithm of the next section minimises the bias component of the GASOPE method.

The following section discusses the core algorithms employed by the genetic algorithm component of the GASOPE method. The introduction introduces the reader to the complexity of the technique used by the genetic algorithm. The representation of each individual in a population is then discussed. The initialisation of each individual is presented. The mutation and crossover operators employed by the genetic algorithm are discussed. The fitness function employed by the genetic algorithm is discussed. Finally, the algorithm to guide the genetic algorithm optimisation process is presented.

In sections 3.2.1 and 3.2.3, discrete least squares approximation and Taylor polynomials were discussed in terms of their relevance to function approximation. The definition of the linear function presented in equation (3.1) is extended to the non-linear form from:
$$\hat{b}_i = \sum_{j=0}^{n} r_j a_i^j$$
to:
$$\hat{b}_i = \sum_{\lambda_1 + \dots + \lambda_m \le n} r_{(\lambda_1, \lambda_2, \dots, \lambda_m)} \prod_{q=1}^{m} a_{i,q}^{\lambda_q} \quad (3.10)$$
where $m$ is the dimensionality of the input space, $n$ is the maximum polynomial order, $\lambda_q$ is the order of attribute $a_{i,q}$ and $r_{(\lambda_1, \lambda_2, \dots, \lambda_m)}$ is a real-valued coefficient. This definition allows the representation of functions such as:
This definition allows the representation of functions such as: + + for m = 2 and n = 2. r(1,O)ai,l 2 r(2,O)ai,1 + + r(O,1)ai,2 2 r(O,2)ai,2 + If the value of the coefficients r(A] ,Az,"',Am) are efficiently determined using the least squares approximation (from equation (3.3)), then all the genetic algorithm is required to do is to algorithmically determine the optimal approximating polynomial structure. The definition of optimality used throughout this thesis is bimodal: both the smallest polynomial structure and the best possible function approximation are required. A simplistic approach would be to generate every possible combination of a function, and test the predicted result of the function against the data set. However, such an exhaustive search approach is prohibitive. This section does, however, continue to derive an upper bound on the genetic algorithm search space, should an exhaustive search be undertaken. First, the total number of unique terms t generated by equation (3.10) is determined. Select, with repetition, p inputs from a set of m inputs. This problem is similar to determining the number of n-multi-sets of size p, where p E {O"", n}. There are t=t(p+m-1)P p=O such multi-sets (terms) [36]. By applying induction and Pascal's formula, equation (3.11) is simplified to The number of function choices, u, is calculated by choosing, without repetition, q terms from the set of t terms: From equations (3.12) and (3.14), the number of terms and function choices can be determined. A lO-dimensional input space with a maximum polynomial order of 3, has 286 possible terms and 2286 function choices. To iteratively calculate and test each function choice against a set of S training patterns is thus computation ally difficult. Genetic algorithms, however, can be used to determine solutions to such difficult problems, because genetic algorithms implement a highly parallel search. 
The representation used by the algorithm is fairly simple and is, in fact, a representation of equation (3.10). Each individual is made up of a set $I_\omega$ of unique term-coefficient mappings, e.g.

\[ I_\omega = \{ (T_0 \to r_0), (T_1 \to r_1), \ldots, (T_{p-1} \to r_{p-1}) \} \]

where $p$ is the maximum set size (maximum number of terms) and $r_\xi$, for $\xi \in \{0,\ldots,p-1\}$, is a real-valued coefficient (re-mapped from equation (3.10) for the sake of simplicity). Each term is made up of a set $T_\xi$ of unique variable-order mappings, e.g.

\[ T_\xi = \{ (a_{\xi,1} \to \lambda_{\xi,1}), (a_{\xi,2} \to \lambda_{\xi,2}), \ldots, (a_{\xi,m} \to \lambda_{\xi,m}) \} \]

where $m$ is the number of inputs (variables), $a_{\xi,\tau}$, $\tau \in \{1,\ldots,m\}$, is an integer representing an input and $\lambda_{\xi,\tau}$ is a natural-valued order. In practice, these two sets are maintained as a two-dimensional, variable-length, sorted array, which allows only unique insertion.

Each individual $I_\omega$ in a population $G_{GA}$ is initialised by randomly selecting variable-order pairs, in order to build a term $T_\xi$ up to a maximum polynomial order $e \in \{0,\ldots,n\}$. This process is repeated until the number of terms equals the maximum number of terms $p$. The construction of each term is described by the following pseudo-code:

(a) Set $T_\xi = \{\}$
(b) Select $e \in \{0,\ldots,n\}$ uniformly from the maximum polynomial order $n$.
(c) While $e > 0$ do
    i. Select $1 \le f \le e$ uniformly from the available orders $e$.
    ii. Select $1 \le g \le m$ uniformly from the set of $m$ inputs.
    iii. If $|T_\xi| < |T_\xi \cup \{(g \to f)\}|$ then $e := e - f$, i.e. decrease the number of available orders $e$.
    iv. Set $T_\xi = T_\xi \cup \{(g \to f)\}$
(d) Set $I_\omega = I_\omega \cup \{(T_\xi \to 0)\}$ as shown in figure 3.2.

The mutation operators serve to inject new genetic material into a population of individuals; mutation thus broadens the search space.
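The initialisation procedure given earlier in this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: a term is stored as a sorted tuple of (input, order) pairs, an individual as a dictionary from terms to coefficients, and all function names are hypothetical. One liberty is taken and should be read as an assumption: an input is only drawn from those not yet present in the term, and term construction stops once every input has been used, so that the loop is guaranteed to terminate.

```python
import random
from math import comb


def random_term(m, n, rng):
    """Build one term as a sorted tuple of (input, order) pairs."""
    term = {}
    e = rng.randint(0, n)              # step (b): orders available, 0..n inclusive
    while e > 0:                       # step (c)
        unused = [g for g in range(1, m + 1) if g not in term]
        if not unused:                 # assumption: stop when all inputs are used
            break
        f = rng.randint(1, e)          # step i: order for this input
        g = rng.choice(unused)         # step ii: input, unique within the term
        term[g] = f                    # step iv: insert the mapping g -> f
        e -= f                         # step iii: fewer orders remain
    return tuple(sorted(term.items()))


def initialise_individual(m, n, p, rng=None):
    """Create an individual holding p unique terms, each with coefficient 0."""
    rng = rng or random.Random()
    p = min(p, comb(n + m, n))         # no more unique terms exist than this
    individual = {}
    while len(individual) < p:
        # step (d): unique insertion of the mapping T -> 0
        individual.setdefault(random_term(m, n, rng), 0.0)
    return individual


example = initialise_individual(m=3, n=3, p=5, rng=random.Random(1))
for term, coefficient in example.items():
    print(term, coefficient)
```

The empty tuple represents the constant term (all orders zero). Because insertion is keyed on the term itself, duplicate terms are discarded automatically, mirroring the unique-insertion property of the sorted array described above.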
Four mutation operators are used by the genetic algorithm, namely shrink, expand, perturb and reinitialise:

• Shrink operator: The shrink operator is simple to implement. Its objective is to remove, arbitrarily, one of the term-coefficient pairs $T_\xi$ from the individual $I_\omega$, as indicated by the crossed out section in figure 3.3. The pseudo-code for the shrink operator is as follows:
  1. Select $T_\xi \in I_\omega$ uniformly from the set of terms $I_\omega$.
  2. Set $I_\omega = I_\omega \setminus \{T_\xi\}$ as shown in figure 3.3.

• Expand operator: The expand operator adds a new random term-coefficient pair to the individual $I_\omega$ (only if $|I_\omega| < p$). The pseudo-code for the expand operator is as follows:
  1. If $|I_\omega| < p$ then
     (a) Set $T_\xi = \{\}$
     (b) Select $e \in \{0,\ldots,n\}$ uniformly from the maximum polynomial order $n$.
     (c) While $e > 0$ do
         i. Select $f \in \{1,\ldots,e\}$ uniformly from the available orders $e$.
         ii. Select $g \in \{1,\ldots,m\}$ uniformly from the set of $m$ inputs.
         iii. If $|T_\xi| < |T_\xi \cup \{(g \to f)\}|$ then $e := e - f$, i.e. decrease the number of available orders $e$.
         iv. Set $T_\xi = T_\xi \cup \{(g \to f)\}$
     (d) Set $I_\omega = I_\omega \cup \{(T_\xi \to 0)\}$ as shown in figure 3.4.

• Perturb operator: The perturb operator is more complicated. It requires the algorithm to select a term $T_\xi$ from the individual $I_\omega$, and to adjust one of its variable-order mappings. This adjustment can either add, remove or adjust an order in a variable-order mapping, and the three actions are applied with equal probability. The pseudo-code for the perturb operator is as follows:
  1. Select $T_\xi \in I_\omega$ uniformly from the set of terms $I_\omega$.
  2. Calculate the number of orders available, $e := n - \sum_{v=1}^{|T_\xi|} \lambda_{\xi,v}$.
  3. Select $g \in \{1,\ldots,m\}$ uniformly from the set of $m$ inputs.
  4. Select $h \in U(0,1)$ as a uniformly distributed random number.
  5. If $h < 0.333$ then
     (a) Set $T_\xi = T_\xi \setminus \{(g \to \lambda_g)\}$, i.e. remove the $g$th variable-order mapping from set $T_\xi$ as shown by the crossed out section in figure 3.5.
  6. Else if $h < 0.666$ then
     (a) Select $f \in \{1,\ldots,e\}$ uniformly from the available orders $e$.
     (b) Set $T_\xi = T_\xi \cup \{(g \to f)\}$ as shown by the large box in figure 3.5.
  7. Else
     (a) Set $T_\xi = T_\xi \setminus \{(g \to \lambda_g)\}$, i.e. remove the $g$th variable-order mapping from set $T_\xi$.
     (b) Set $e := e + \lambda_g$, i.e. increase the number of available orders $e$.
     (c) Select $f \in \{1,\ldots,e\}$ uniformly from the available orders $e$ as shown by the small box in figure 3.5.
     (d) Set $T_\xi = T_\xi \cup \{(g \to f)\}$

The crossover operator serves to retain genetic material from one generation of individuals to the next; crossover thus directs the search toward a particular solution. The crossover operator used by the genetic algorithm selects two individuals in order to construct a new chromosome. Term-coefficient mappings are selected at random to construct the new chromosome, with a higher probability of selection given to term-coefficient mappings that are present in both individuals. A ratio of 80:20 was used (shown below) because, on average, the new individual generated by these parameters was found to be roughly the same length as its longer parent. The pseudo-code for the crossover operator is as follows:

  1. Set $I_\alpha = \{\}$.
  2. Let $I_\beta \in G_{GA}$ be any term-coefficient set in the population of individuals $G_{GA}$ as shown in figure 3.6.
  3. Let $I_\gamma \in G_{GA}$ be any term-coefficient set in the population of individuals $G_{GA}$ as shown in figure 3.6.
  4. Let $A = I_\beta \cap I_\gamma$ be the intersection of the two individuals.
  5. Let $B = (I_\beta \setminus I_\gamma) \cup (I_\gamma \setminus I_\beta)$ be the union of the exclusions as shown in figure 3.6.
  6. Set $e := 0$. While $e < |A|$ do
     (a) Select $h \in U(0,1)$ as a uniformly distributed random number.
     (b) If $h < 0.8$ then $I_\alpha = I_\alpha \cup \{A_e\}$
     (c) Set $e := e + 1$
  7. Set $e := 0$. While $e < |B|$ do
     (a) Select $h \in U(0,1)$ as a uniformly distributed random number.
     (b) If $h < 0.2$ then $I_\alpha = I_\alpha \cup \{B_e\}$
     (c) Set $e := e + 1$

The fitness function is an important aspect of a genetic algorithm, in that it serves to direct the algorithm toward optimal solutions. The fitness function used by the genetic algorithm is similar to the adjusted coefficient of determination (from equation (3.6)).
The fitness function is defined as:

\[ R_a^2 = 1 - \frac{\sum_{i=1}^{s} (b_i - \hat{b}_{I_\omega,i})^2}{\sum_{i=1}^{s} (b_i - \bar{b})^2} \cdot \frac{s-1}{s-d} \tag{3.15} \]

where $s$ is the sample size, $b_i$ is the actual output of pattern $i$, $\bar{b}$ is the mean of the actual outputs, $\hat{b}_{I_\omega,i}$ is the predicted output of individual $I_\omega$ for pattern $i$, and the model complexity $d$ is calculated as follows:

\[ d = \sum_{T_\xi \in I_\omega} \sum_{\tau=1}^{|T_\xi|} \lambda_{\xi,\tau} \]

where $I_\omega$ is an individual in the set $P$ of individuals, $T_\xi$ is a term of $I_\omega$ and $\lambda_{\xi,\tau}$ is the order of term $T_\xi$. This fitness function penalises the complexity of an individual $I_\omega$ by penalising the number of multiplications needed to calculate the predicted output of that individual, i.e. the number of terms and their order.

In order to calculate the fitness of an individual, however, the algorithm requires the coefficients in the set $I_\omega$ to be calculated. Matrix $A$ (from equation (3.4)) is populated with the combination of terms represented by each term-coefficient mapping, e.g. if the term is $a_0 a_1$, the algorithm multiplies out each of the input attributes for a particular pattern. Matrix $A$ is thus populated from left to right with terms from each pattern in the sample space (where the patterns proceed from top to bottom). The vector $b$ is made up of the target output for each pattern. After reducing the linear system shown by equation (3.3), vector $r$ represents the coefficients of each of the term-coefficient mappings.

The algorithm that guides the optimisation process performs the following steps for each generation $g$:

(a) Sample $S \subset \bigcup_{o=1}^{k} C_o$, where $C_o$ is one of the $k$ clusters (strata) of patterns
(b) Determine the coefficients of each of the term-coefficient mappings in an individual $I_\omega$ by reducing $b \approx Ar$ in terms of the sample $S$
(c) Evaluate the fitness $R_a^2(I_\omega)$ of each individual in population $G_{GA,g}$ using the patterns in $S$
(d) Let $G'_{GA,g} \subset G_{GA,g}$ be the top x% of the individuals to be involved in elitism
(e) Install the members of $G'_{GA,g}$ into $G_{GA,g+1}$
(f) Let $G''_{GA,g} \subset G_{GA,g}$ be the top n% of the individuals to be involved in crossover
(g) Perform crossover:
    i. Randomly select two individuals $I_\alpha$ and $I_\beta$ from $G''_{GA,g}$
    ii. Produce offspring $I_\gamma$ from $I_\alpha$ and $I_\beta$
    iii. Install $I_\gamma$ into $G''_{GA,g}$
(h) Perform mutation:
    i. Select an individual $I_\omega$ from $G''_{GA,g}$
    ii. Mutate $I_\omega$
    iii. Install $I_\omega$ into $G_{GA,g+1}$
(i) Evolve the next generation $g := g + 1$

It is important to note that each generation in the above algorithm draws a new stratified random sample from the training set. The entire process is thus not based on just one sample of the training set.

This section discusses the need for a hall-of-fame. The hall-of-fame works like the hall-of-fame in classic arcade games, where a player who achieves a better score than any of the players in the hall-of-fame takes his/her rightful place (by entering his/her initials) and knocks the worst score off the list. For GASOPE, the hall-of-fame is essentially a set of unique, individual solutions, ranked according to their fitness value. After every generation the best individual is given the opportunity to enter the hall-of-fame. Entry into the hall-of-fame is determined as follows:

• If the best individual of a generation is structurally equivalent to an individual in the hall-of-fame, the fitness values of the individual in the hall-of-fame and the best individual are compared. The individual with the better fitness then takes the place in the hall-of-fame.
• Otherwise, if the best individual of a generation is not structurally equivalent to any individual in the hall-of-fame, the best individual is inserted relative to its fitness (or possibly not at all if its fitness is worse than that of every individual in the hall-of-fame).

The hall-of-fame is not an elitism method; the individuals in the hall-of-fame do not participate further in the evolutionary process. The hall-of-fame simply keeps track of the best solutions for any given architecture. In the end the solution taken is not necessarily the best solution of the last generation, but the best over all generations. Ultimately, the purpose of the hall-of-fame is to ensure that the best, general solution is selected as the solution to the optimisation process.
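The two entry rules above can be sketched as a small update routine. This is a minimal sketch under stated assumptions rather than the thesis implementation: individuals are assumed to be dictionaries mapping terms to coefficients, two individuals are treated as structurally equivalent when they contain the same set of terms, higher fitness is better, and the `capacity` parameter and all function names are hypothetical.

```python
def structure_of(individual):
    """Structural key of an individual: its set of terms, ignoring coefficients."""
    return frozenset(individual)


def update_hall_of_fame(hall, individual, fitness, capacity=10):
    """Offer the best individual of a generation to the hall-of-fame.

    `hall` is a list of (fitness, individual) pairs kept sorted best first.
    """
    key = structure_of(individual)
    for i, (old_fitness, old) in enumerate(hall):
        if structure_of(old) == key:
            # Rule 1: same structure, so keep whichever has the better fitness.
            if fitness > old_fitness:
                hall[i] = (fitness, individual)
            break
    else:
        # Rule 2: new structure, so insert it relative to its fitness.
        hall.append((fitness, individual))
    hall.sort(key=lambda entry: entry[0], reverse=True)
    del hall[capacity:]                # the worst entries drop off the list
    return hall


hall = []
update_hall_of_fame(hall, {((1, 1),): 2.0}, 0.50, capacity=2)
update_hall_of_fame(hall, {((1, 1),): 2.1}, 0.70, capacity=2)  # same structure
update_hall_of_fame(hall, {((1, 2),): 1.0}, 0.60, capacity=2)  # new structure
update_hall_of_fame(hall, {(): 1.0}, 0.10, capacity=2)         # too weak to stay
print([fitness for fitness, _ in hall])
```

Keying the comparison on structure alone means that the same polynomial form, refitted on a different sample, competes only with its own earlier fit instead of occupying several places in the list.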
Because the genetic algorithm of section 3.3.2 works with only a sample of the available patterns, and draws a new sample at each generation, certain function fits may not represent the true nature of the data set, particularly when the data set is extremely noisy. For example, with a particularly poor sample selection from a noisy data set, the genetic algorithm may decide that a straight line is the optimal fit for the data set when, in fact, a cubic function would have performed better on the whole. Following that, the genetic algorithm may decide that such a fit was the best fit seen so far in the optimisation process and retain that solution, ultimately leading to sub-optimal convergence. The hall-of-fame prevents this scenario, because all best solutions compete for a place in the hall-of-fame. At the end of the optimisation process, the solutions in the hall-of-fame are tested against a validation set (a subset of the patterns withheld from training) to determine the ultimate solution.

This section discusses the experimental procedure and results of the GASOPE method vs. an artificial neural network, applied to various data sets generated from a range of generating functions. Section 3.4.1 presents the generating functions of the various data sets. The experimental procedure and program initialisation are explained in section 3.4.2. Section 3.4.3 presents the experimental results for the functions listed in section 3.4.1 and discusses the findings.

Table 3.1 presents a range of functions, f1 to f5, used to test the GASOPE algorithm. Some of these functions are illustrated by figures 3.7 to 3.9.
These functions are all continuous over their domains and have been injected with noise to illustrate the characteristics of the GASOPE method on noisy data sets.

Name   Function
f1     $f(x_0) = \sin(x_0) + U(-1,1)$;  $x_0 \in [0, 2\pi]$
f2     $f(x_0) = \sin(x_0) + \cos(x_0) + U(-1,1)$;  $x_0 \in [0, 2\pi]$
f3     $f(x_0) = x_0^5 - 5x_0^3 + 4x_0 + U(-1,1)$;  $x_0 \in [-2, 2]$
f4     $f(x_0, x_1) = \sin(x_0) + \sin(x_1) + U(-1,1)$;  $\{x_0, x_1\} \in [0, 2\pi]$
f5     $f(x_0, x_1) = x_0^5 - 5x_0^3 + 4x_0 + x_1^5 - 5x_1^3 + 4x_1 + U(-1,1)$;  $x_0, x_1 \in [-2, 2]$

Additionally, the method has been tested on a number of interesting chaotic time-series problems:

• Logistic map: The first, and most basic, chaotic time-series problem is the Logistic map, whose generating function can be described in the following manner:

\[ \frac{dx}{dt} = a \cdot x_n (1 - x_n) \]

• Henon map: The next chaotic time-series problem is the Henon map, which can be described in the following manner:

\[ \frac{dx}{dt} = 1 - a \cdot x_n^2 + b \cdot y_n \]
\[ \frac{dy}{dt} = x_n \]

where $a = 1.4$, $b = 0.3$, $x(0) \sim U(-1,1)$ and $y(0) \sim U(-1,1)$. The Henon map is illustrated by figure 3.10.

• Rossler attractor: The next chaotic time-series problem is the Rossler attractor, whose generating function is:

\[ \frac{dx}{dt} = -y_n - z_n \]
\[ \frac{dy}{dt} = x_n - a \cdot y_n \]
\[ \frac{dz}{dt} = b + z_n (x_n - c) \]

where $a = 0.2$, $b = 0.2$, $c = 5.7$, $x_0 = 1.0$, $y_0 = 0$ and $z_0 = 0$. This function should be generated using the Runge-Kutta order 4 method [16]. The Rossler attractor is illustrated in figures 3.11 to 3.14.

• Lorenz attractor: The last chaotic time-series problem is the Lorenz attractor, whose generating function is:

\[ \frac{dx}{dt} = \sigma (y_n - x_n) \]
\[ \frac{dy}{dt} = r \cdot x_n - y_n - x_n z_n \]
\[ \frac{dz}{dt} = x_n y_n - b \cdot z_n \]

where $\sigma = 10$, $r = 28$, $b = 8/3$, $x_0 = 1.0$, $y_0 = 0$ and $z_0 = 0$. Once again, this function should be generated using the Runge-Kutta order 4 method [16]. The Lorenz attractor is illustrated in figures 3.15 to 3.18.

3.4.2 Experimental procedure

Each of the generating functions listed in section 3.4.1 was used to create a corresponding data set consisting of 12000 patterns. Each pattern consisted of the inputs and target outputs for the specific generating function, e.g.
for the Rossler attractor, each pattern consisted of the three input components and one target output component. Each data set was scaled to create another data set, which consisted of scaled input components (to the range [-1,1]), to be used by a neural network for function approximation.

The neural network implementation used for comparison with the GASOPE method was trained, using stochastic gradient descent, until the maximum number of epochs was exceeded. Only the hidden layer of the neural network used sigmoidal activation functions; the other layers used linear activation functions. The use of linear activation functions in the output layer removes the need to scale the neural network outputs. The neural network was initialised with an initial learning rate of 0.15 (a linearly decreasing learning rate was used), a momentum of 0.9 and a maximum number of epochs of 500. No checks were used to determine whether the neural network over-fitted the data.

Table 3.2:
Variable                        Value
Clusters                        15
ClusterEpochs                   10
FunctionMutationRate            0.1
FunctionCrossoverRate           0.2
FunctionGenerations             100
FunctionIndividuals             30
FunctionPercentageSampleSize    0.01
FunctionMaximumComponents       20
FunctionElite                   0.1
FunctionCutOff                  0.001

The GASOPE method was initialised using the values shown in table 3.2. The number of clusters ("Clusters") parameter sets the number of clusters to be used in the k-means optimisation process. A larger value results in increased training time due to the increase in comparisons between patterns and cluster centroid vectors. A smaller value may have implications for accuracy in that, strictly speaking, the number of cluster centroids should be at least as large as the number of turning points of the underlying function that describes the data points. The number of clusters was chosen as 15 because this value is larger than the number of turning points of any of the functions defined above.
The number of cluster epochs ("ClusterEpochs") determines the number of times the training set is presented to the k-means clustering algorithm. A small number of cluster epochs results in rough cluster membership, whereas a large number of cluster epochs results in crisp cluster membership. However, the larger the number of cluster epochs, the longer the genetic algorithm takes to optimise. A value of 10 was chosen because the GASOPE method requires speed more than precision.

The linear system $(A^T A) r = A^T b$ (refer to equation (3.3)) is consistent. This means that whether the linear system $b \approx Ar$ is overdetermined or underdetermined, the first linear system will still be reducible. However, the first linear system could potentially return a poor least squares approximation if Gauss-Jordan reduction is used to solve the linear system. Gauss-Jordan reduction can introduce discontinuities into the reduction of the linear system when the data points used to populate that linear system lie too close to one another, or when a pivot in the system lies close to 0. These discontinuities typically occur when a value in the matrix is less than the minimum granularity of the floating point representation of the target platform.

An alternative method of matrix reduction that does not introduce discontinuities is singular-value decomposition [80]. Singular-value decomposition can always be performed, no matter how singular (non-invertible) a matrix is. Although both Gauss-Jordan reduction and singular-value decomposition have a maximum complexity of $O(n^3)$, singular-value decomposition requires the execution of more depth-3 nested loops than Gauss-Jordan reduction and is therefore slower than Gauss-Jordan reduction.

An alternative way to solve the poor least squares approximation problem, as outlined above, is based on data point clustering. The goal of clustering (from section 2.4) is to find homogeneous strata in a set of data points.
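The pivot problem described above can be made concrete with a small, pure-Python Gauss-Jordan solver for the normal equations. This is an illustrative sketch only: the function name, the `tol` threshold and the choice to raise an error on a near-zero pivot are assumptions, and partial pivoting is added here as a standard safeguard that the text does not mention.

```python
def solve_normal_equations(A, b, tol=1e-12):
    """Solve (A^T A) r = A^T b by Gauss-Jordan reduction with partial pivoting.

    A is a list of rows (one row per pattern, one column per term) and b is
    the list of target outputs.
    """
    rows, cols = len(A), len(A[0])
    # Build the augmented matrix [A^T A | A^T b].
    M = [
        [sum(A[k][i] * A[k][j] for k in range(rows)) for j in range(cols)]
        + [sum(A[k][i] * b[k] for k in range(rows))]
        for i in range(cols)
    ]
    for c in range(cols):
        # Choose the largest available pivot in column c.
        pivot_row = max(range(c, cols), key=lambda r: abs(M[r][c]))
        pivot = M[pivot_row][c]
        if abs(pivot) < tol:
            raise ValueError("near-zero pivot: patterns too similar or terms redundant")
        M[c], M[pivot_row] = M[pivot_row], M[c]
        M[c] = [value / pivot for value in M[c]]
        # Eliminate column c from every other row.
        for r in range(cols):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][cols] for i in range(cols)]


# Four distinct patterns of the line b = 2 + 3a, with terms (1, a).
inputs = (0.0, 1.0, 2.0, 3.0)
A = [[1.0, a] for a in inputs]
b = [2.0 + 3.0 * a for a in inputs]
print(solve_normal_equations(A, b))
```

Feeding the same routine two identical patterns, for example `[[1.0, 1.0], [1.0, 1.0]]`, produces a zero pivot and raises the error, which is the situation that sampling dissimilar patterns from different clusters is meant to avoid.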
A pattern in one cluster is dissimilar to all patterns in all other clusters. If the training set is sampled by selecting at least one pattern from each cluster, the patterns should be dissimilar enough to return a good least squares approximation. The sample size ("FunctionPercentageSampleSize") parameter should thus be chosen large enough to include at least one pattern from each cluster. A large sample size results in a slower optimisation process because it directly increases the number of patterns to be presented to the GASOPE optimisation algorithm.

The accuracy of the GASOPE method depends on both the sample size and the number of clusters. An increase in the number of clusters leads to patterns being selected more uniformly throughout the domain of the search space. An increase in the sample size results in more patterns being selected from the turning points of the search space. The implication is that the sample size and the number of clusters are equally important factors in ensuring the accuracy of the generated solutions.

The function cutoff ("FunctionCutOff") parameter controls a heuristic that prevents a term from appearing in a polynomial expression when that term's coefficient tends to 0. This cutoff results in a reduced set of terms and is thus used to prune each individual solution. The cutoff is applied after the calculation of each term's coefficient by checking whether the coefficient lies within the bounds [-FunctionCutOff, FunctionCutOff]. If this is the case, the term is removed from the polynomial expression.

The other genetic algorithm parameters are fairly self-explanatory and suffer from the problems discussed in section 2.2.4.

Both the neural network and the GASOPE method made use of three distinct sets of patterns. For each problem the 12000 data patterns were split up into a training set of 10000 patterns, a validation set of 1000 patterns and a generalisation set of 1000 patterns.
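The cluster-based sampling discussed in this section can be sketched as follows. This is a minimal sketch, assuming that the clusters are already available as lists of patterns and that the sample is drawn proportionally to cluster size with a floor of one pattern per cluster; the function name and its parameters are hypothetical.

```python
import random


def stratified_sample(clusters, fraction, rng=None):
    """Draw a proportional sample with at least one pattern per non-empty cluster.

    `clusters` is a list of lists of patterns and `fraction` plays the role of
    the sample size parameter (for example 0.01).
    """
    rng = rng or random.Random()
    sample = []
    for cluster in clusters:
        if not cluster:
            continue
        # Larger clusters contribute more patterns, but never fewer than one.
        size = max(1, round(fraction * len(cluster)))
        sample.extend(rng.sample(cluster, size))
    return sample


clusters = [list(range(0, 500)), list(range(500, 530)), [530]]
sample = stratified_sample(clusters, fraction=0.01, rng=random.Random(3))
print(len(sample))  # 5 + 1 + 1 = 7 patterns
```

With a fraction of 0.01, the two small clusters would contribute nothing under purely proportional sampling; the floor of one pattern keeps every stratum represented in the least squares fit.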
The purpose of the training set is to train the two methods; the fitness function of the genetic algorithm and the forward- and back-propagation phases of the neural network use the training set as the driving force of their algorithms. The validation set is used to validate the interpolation ability of the genetic algorithm; the genetic algorithm uses the validation set to select the best individual from the hall-of-fame. The generalisation set is used to compare the two algorithms on unseen data patterns, i.e. their generalisation ability.

The results for each of the functions listed in section 3.4.1 were obtained by running 100 simulations on the corresponding data sets. Note, though, that before each simulation was run, the data patterns were shuffled randomly among the three sets (training, validation and generalisation) and presented in this form to both the neural network and the GASOPE method. Results reported are averages over the 100 simulations together with standard deviations.

Table 3.3:
Method    Simulations    Epochs/Generations    Individuals    Training patterns    Pattern presentations
GASOPE    100            100                   30             0.01 x 10000         30000000
NN        100            500                   1              10000                500000000

One criticism of comparing a neural network with the GASOPE method revolves around the complexity of each method. Using a training set of 10000 patterns, table 3.3 shows the calculated number of pattern presentations per method. Because the GASOPE method only samples 100 training patterns per generation (FunctionPercentageSampleSize x SampleSize = 0.01 x 10000 = 100), the number of pattern presentations that have to be performed is considerably smaller than that of the neural network. If the number of training patterns had not been included in this complexity measure, many would have argued that the experimental results are unfairly weighted against the neural network. Bearing in mind that the neural network sees more of the training data during optimisation, this is not the case.
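The pattern presentation counts in table 3.3 follow directly from the experimental settings, as the short calculation below shows. The function name is hypothetical; the figures are those of table 3.3.

```python
def pattern_presentations(simulations, iterations, individuals, patterns):
    """Total number of times a pattern is presented over all simulations."""
    return simulations * iterations * individuals * patterns


# GASOPE: 100 generations, 30 individuals, a 1% sample of 10000 patterns.
gasope = pattern_presentations(100, 100, 30, round(0.01 * 10000))
# NN: 500 epochs, a single network, all 10000 training patterns.
nn = pattern_presentations(100, 500, 1, 10000)

print(gasope)                 # 30000000
print(nn)                     # 500000000
print(round(nn / gasope, 3))  # 16.667
```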
Also, it is expected that the NN optimisation process will take on average 16.667 times longer than the GASOPE method to complete. The breakdown of the total execution time of the GASOPE method for an average simulation run is as follows:

This section discusses the experimental results of both the neural network mentioned in section 3.4.2 and the genetic algorithm presented in this chapter for the chosen function approximation tasks. The section is divided into three categories, namely noiseless data sets, noisy data sets and polynomial structure.

Table 3.4 summarises the results of the noiseless application of the Henon map, Logistic map, Rossler attractor and Lorenz attractor, respectively. All experiments utilising these functions used the GASOPE method with a maximum polynomial order of 3 and a neural network (NN) with a hidden layer size of 3, i.e. there were 3 hidden units. As mentioned in section 3.4.2, all other initialisation values were set according to table 3.2 and the initialisation criteria specified in section 3.4.2. For each of the experiments shown in table 3.4, the GASOPE method performed significantly better, on average, than the neural network. This improvement was both in terms of training and generalisation accuracy and in terms of the average completion time of each simulation run. It is interesting to note that the GA performed better than the NN on a range of chaotic time-series problems.
Also, the GASOPE method is more robust than the NN, because the standard deviations are smaller. Additionally, the NN average simulation completion time was more than 16.667 times the average completion time of the GASOPE method. Thus, the GASOPE method is more than an order of magnitude faster than the NN. Figures 3.10 to 3.18 show the function plots of most of the functions used in this section.

Table 3.4: Comparison of GASOPE and NN on noiseless data (TMSE = mean squared error on training set, GMSE = mean squared error on test set (generalisation), σ indicates the standard deviation, t is average simulation completion time in seconds)

Function              Method    TMSE        σTMSE       GMSE        σGMSE       t
Henon                 GASOPE    0.000000    0.000000    0.000000    0.000000    0.83s
                      NN        0.001258    0.011678    0.001266    0.011730    29.46s
Logistic              GASOPE    0.000000    0.000000    0.000000    0.000000    0.66s
                      NN        0.055384    0.061578    0.055628    0.061842    20.50s
Rossler x component   GASOPE    0.000000    0.000000    0.000000    0.000000    0.91s
                      NN        0.000754    0.000562    0.002488    0.002444    32.16s
Rossler y component   GASOPE    0.000000    0.000000    0.000000    0.000000    0.87s
                      NN        0.000454    0.000348    0.001416    0.001640    32.15s
Rossler z component   GASOPE    0.000004    0.000000    0.000004    0.000000    0.90s
                      NN        0.000388    0.000054    0.018146    0.041082    32.10s
Lorenz x component    GASOPE    0.000434    0.000004    0.000432    0.000028    0.88s
                      NN        0.002064    0.001108    0.655078    0.634666    32.50s
Lorenz y component    GASOPE    0.000900    0.000014    0.000900    0.000086    0.95s
                      NN        0.087084    0.034538    2.588240    2.377820    32.51s
Lorenz z component    GASOPE    0.000404    0.000028    0.000396    0.000050    0.96s
                      NN        0.172464    0.193974    2.464340    2.326490    32.54s

Table 3.5 presents the experimental results for the GASOPE method and the neural network (NN) as applied to a number of noisy data sets.

• f1: Function f1 represents the results of a noisy application of sin(x) over a domain of one period. The NN was trained using 3 hidden units in the hidden layer and the GASOPE method was trained with a maximum polynomial order of 3.
All other initialisation parameters were set as shown by table 3.2 and specified in section 3.4.2. The GASOPE method performed slightly better than the NN in terms of training and generalisation error, and performed substantially better than the NN in terms of the average simulation completion time. A plot of the best GASOPE and NN output, evaluated according to the generalisation ability of the end state of each simulation run, against the plot of the generalisation data set is shown in figure 3.7.

• f2: Function f2 represents the results of a noisy application of sin(x) + cos(x) over a domain of one period. The NN was, once again, trained using 3 hidden units and the GASOPE method was trained with a maximum polynomial order of 3. The NN, in this case, performed slightly better than the GASOPE method in terms of accuracy. The GASOPE method, however, was still reasonably competitive against the NN. The GASOPE method performed substantially better than the NN in terms of the average simulation completion time. Figure 3.8 shows a plot of the best NN and GASOPE output for the function f2 against a plot of the generalisation data set.

• f3: Function f3 represents the results of a noisy application of a fifth order polynomial over the interval [-2,2]. The GASOPE method used a maximum polynomial order of 5 and the NN used 5 hidden units. The GASOPE method performed substantially better than the NN, both in terms of accuracy and the average simulation completion time. A plot of the best GASOPE and NN output against the plot of the generalisation data set is shown in figure 3.9.

• f4: Function f4 represents the results of a 2-dimensional application of function f1. The GASOPE method used a maximum polynomial order of 3 and the NN used 6 hidden units (2 dimensions, 2 turning points in each dimension).
The NN performed slightly better than the GASOPE method in terms of training and generalisation accuracy, but performed significantly worse than the GASOPE method in terms of the average simulation completion time.

• f5: Function f5 represents the results of a 2-dimensional application of function f3. The GASOPE method used a maximum polynomial order of 5 and the NN used 10 hidden units (2 dimensions, 4 turning points in each dimension). The GASOPE method performed significantly better than the NN both in terms of training and generalisation accuracy, and in terms of the average simulation completion time.

• Henon: The Henon function represents the results of a noisy application of the Henon map. Uniformly distributed noise in the range [-1,1] was injected into the output component of the data set. The GASOPE method used a maximum polynomial order of 3 and the NN used 3 hidden units. The NN performed slightly better than the GASOPE method in terms of training and generalisation accuracy, but performed significantly worse than the GASOPE method in terms of the average simulation completion time. Figure 3.10 shows a plot of the Henon map.

• Lorenz: The Lorenz functions represent the results of a noisy application of the Lorenz attractor in all three components. Uniformly distributed noise in the range [-10,10] was injected into each of the input and output components of the data set. The GASOPE method used a maximum polynomial order of 3 and the NN used 3 hidden units. For all three experiments of the Lorenz attractor, the GASOPE method performed significantly better than the NN both in terms of training and generalisation accuracy, and in terms of the average simulation completion time. Figures 3.15 to 3.18 show the plots for the Lorenz attractor.

Once again, it is interesting to note that the NN's average simulation completion time was more than 16.667 times the average simulation completion time of the GASOPE method.
This shows that the GASOPE method is more than an order of magnitude faster than the NN.

Table 3.5: Comparison of GASOPE and NN on noisy data (TMSE = mean squared error on training set, GMSE = mean squared error on test set (generalisation), σ indicates the standard deviation, t is average simulation completion time in seconds)

Function              Method    TMSE         σTMSE       GMSE         σGMSE       t
f1                    GASOPE    0.339252     0.001538    0.339346     0.010278    0.75s
                      NN        0.341698     0.003444    0.341108     0.009948    19.54s
f2                    GASOPE    0.387740     0.002300    0.386500     0.011378    0.75s
                      NN        0.341504     0.002420    0.340432     0.009524    19.87s
f3                    GASOPE    0.337418     0.001496    0.336534     0.008804    0.78s
                      NN        0.446104     0.034328    0.446128     0.037590    29.35s
f4                    GASOPE    0.351866     0.002412    0.351748     0.010270    1.25s
                      NN        0.342220     0.003066    0.343354     0.009826    47.36s
f5                    GASOPE    0.336980     0.001826    0.337030     0.008900    1.36s
                      NN        0.437576     0.090354    0.440616     0.085214    72.06s
Henon                 GASOPE    0.745516     0.021636    0.748390     0.034662    0.80s
                      NN        0.724794     0.027760    0.729226     0.038624    29.58s
Lorenz x component    GASOPE    46.663200    0.576780    46.760000    1.665810    0.93s
                      NN        49.093600    1.640780    51.442600    2.628370    32.10s
Lorenz y component    GASOPE    52.554100    1.200120    53.048800    2.334640    0.87s
                      NN        56.833200    2.887430    59.376100    4.060700    32.46s
Lorenz z component    GASOPE    55.493400    0.958328    55.562000    2.087360    0.90s
                      NN        59.959000    2.625930    62.317100    3.420240    32.43s

This section discusses the structural optimisation ability of the GASOPE method. Using Lagrange interpolating polynomials (section 3.2.4), a defined function can be approximated to any required degree of accuracy. If the function

\[ y = x e^{x} \]

is approximated with a Lagrange polynomial of degree 3 (with interpolation points $x = 0$, $x = 0.5$, $x = 1$, $x = 1.5$), the approximation

\[ y = 1.3875x^3 + 0.057570x^2 + 1.2730x \]

is obtained. Using the GASOPE method, the result over 100 simulations (initialised using table 3.2 and a maximum polynomial order of 3) is consistently

\[ y = 1.38552x^3 + 1.35944x - 0.0280584 \]

The difference in MSE between these two methods is 0.000174, with the Lagrange polynomial being the more accurate of the two. The GASOPE method replaced the $x^2$ term with the constant -0.0280584 (one less multiplication in the simplified form used by the GASOPE method) because the loss in accuracy was acceptable and it resulted in a smaller architecture.

Similarly, approximation of

\[ y = \sin(x) \]

with a Lagrange polynomial of degree 3 (with interpolation points $x = 0$, $x = \frac{2}{3}\pi$, $x = \frac{4}{3}\pi$, $x = 2\pi$) yields the approximation

\[ y = 0.094266x^3 - 0.888436x^2 + 1.860735x \]

Using the GASOPE method, the result over 100 simulations (initialised using table 3.2 and a maximum polynomial order of 3) is consistently

[equation missing in source]

The difference in MSE between these two methods is 0.01552, with the GASOPE method being the more accurate of the two. The GASOPE method decided, in this case, to include the constant -0.226775 because the gain in accuracy was significant. Both of the above results illustrate that the GASOPE method finds polynomial approximations that are optimal in the sense defined in section 3.3.2, i.e. that balance a small structure against a good fit.

This chapter presented and discussed a genetic algorithm approach to evolve structurally optimal polynomial expressions (GASOPE) to represent a given data set. The genetic algorithm was shown to be significantly faster than a neural network approach, and it produced comparable results to the neural network approach in terms of generalisation ability for most of the functions used in this chapter, on both clean and noisy data sets (which included chaotic time-series).
The success of the genetic algorithm approach is mainly due to the specialised mutation and crossover operators, and can also be attributed to the fast k-means clustering algorithm, both of which lead to a significant reduction of the search space. Performance gains in terms of speed can also be attributed to the highly parallel search behaviour of genetic algorithms, i.e. the increase in performance of the GASOPE method over a neural network is not just a function of the number of patterns presented.

Although the genetic algorithm discussed in this chapter appears to be fairly effective both in terms of accuracy and speed, there is one serious drawback: polynomial structures are poor predictors of periodic data. In order to predict periodic data over an interval with polynomials, it is necessary to increase the order of the interpolating polynomial. However, such a prediction deteriorates rapidly when the polynomial predictor is used to extrapolate outside of that interval. This problem can be solved in one of two ways: build an expression that utilises a periodic function such as cosine or sine, or use only linear predictors at the ends of the approximation interval.

The use of cosine or sine as a periodic function would require a substantial rework of the data structure employed to house term-coefficient pairs. The data structure would have to be changed from a list to a tree, which would require the operators to be changed. The use of linear predictors at the ends of the approximation interval is fairly simple to implement: construct an expression that represents a hyper-plane in the attribute space, e.g. if two inputs $(x_0, x_1)$ and one target output $z$ are used, construct the expression $z = r_0 x_0 + r_1 x_1 + c$ and use the linear system mentioned in this chapter to solve for $r_0$, $r_1$ and $c$.
This hyper-plane can then be used in conjunction with an interval measure to extrapolate any unseen data outside the training interval.

The next chapter presents a genetic program for the mining of continuous-valued classes (GPMCC) that utilises the expressions generated by the GASOPE method to provide multivariate models for the leaf nodes of a model tree.

[Figure 3.16: Lorenz attractor: x component (GASOPE and NN plotted against t)]

[Figure 3.17: Lorenz attractor: y component (GASOPE and NN plotted against t)]

Chapter 4 THE GPMCC METHOD

The previous chapter presented a genetic algorithm for evolving structurally optimal polynomial expressions (GASOPE). This chapter discusses a genetic program for the mining of continuous-valued classes (GPMCC). The GPMCC method relies heavily on the algorithms presented earlier in this thesis.

Knowledge discovery algorithms like C4.5 [83] and M5 [82] utilise metrics based on information theory to partition the problem domain and to generate rules. However, these algorithms implement a greedy search algorithm to partition the problem domain. For a given attribute space, C4.5 and M5 attempt to select a test that minimises the relative entropy of the subsets resulting from the split. This process is applied recursively until all subsets are homogeneous or some accuracy threshold is reached.

Figure 4.1 illustrates the partitioning problem by drawing parallels with graph theory. The minimal spanning tree of the graph of figure 4.1 illustrates how a greedy algorithm reaches point D from point A through point B, which is clearly not the optimal path. The optimal path from point A to point D traverses point C.
This clearly shows that knowledge discovery algorithms such as C4.5 and M5 may not generate the smallest possible number of rules for a given problem. A large number of rules results in decreased comprehensibility, which violates one of the prime objectives of data mining.

[Figure 4.1: graph over the points A, B, C and D, showing the shortest path from A to D]

This chapter discusses a regression technique that does not implement a greedy search algorithm. The regression technique utilises a genetic program for the mining of continuous-valued classes (GPMCC) which is suitable for mining large databases. Although the majority of continuous data is linear, there are cases for which a non-linear approximation technique could be useful, e.g. time-series. Therefore, the GPMCC method utilises the GASOPE method of chapter 3 to provide non-linear approximations (models) to be used as the leaf nodes (terminal symbols) of a model tree.

The remainder of this chapter is organised as follows: Section 4.2 provides a background to techniques that do not implement greedy search algorithms to generate rules. The structure and implementation specifics of the GPMCC method are discussed in section 4.3. Section 4.4 presents the experimental findings of the GPMCC method for a number of real-world and artificial databases. Finally, section 4.5 presents the summarised findings and envisioned future developments to the GPMCC method.

The previous section presented a brief introduction to the GPMCC method. This section presents a detailed discussion of two existing methods suitable for non-linear regression. A novel method of generating comprehensible regression rules from a trained artificial neural network is discussed in section 4.2.1. Finally, section 4.2.2 presents genetic programming approaches for non-linear regression and model tree induction.

Artificial neural networks (ANNs) are widely used as a tool for solving regression problems.
However, ANNs have one critical drawback: the complex input to output mapping of the ANN is almost impossible for a human user to comprehend. ANNs are thus one of a handful of black-box methods that do not satisfy the comprehensibility requirement of knowledge discovery. This section discusses a recent work by Setiono, called NeuroLinear, that allows decision rules to be generated for regression problems from a trained ANN [91][94]. Setiono's findings indicated that rules extracted from ANNs were more accurate than those extracted by various discretisation methods.

This section is divided into three parts. Network training and pruning discusses the training and pruning strategy of the ANN used by Setiono. Activation function approximation describes how a piecewise linear approximation of the activation function is obtained. Generation of rules discusses the algorithm for generating rules from a trained ANN.

The method starts by training an ANN that utilises hyperbolic tangent activation functions in the hidden layer (of size $H$). Training is performed on a training set of $|P|$ training points $(\mathbf{x}_i, y_i)$, $i = 1, \dots, |P|$, where $\mathbf{x}_i \in \mathbb{R}^N$ and $y_i \in \mathbb{R}$. Training, in this case, minimises the sum of squares error $E_{SS}(\mathbf{w}, \mathbf{v})$ augmented with a penalty term $P(\mathbf{w}, \mathbf{v})$:

$$E_{SS}(\mathbf{w}, \mathbf{v}) = \sum_{i=1}^{|P|} (y_i - \tilde{y}_i)^2 + P(\mathbf{w}, \mathbf{v})$$

$$P(\mathbf{w}, \mathbf{v}) = \epsilon_1 \left( \sum_{m=1}^{H} \left( \sum_{l=1}^{N} \frac{\eta w_{ml}^2}{1 + \eta w_{ml}^2} + \frac{\eta v_m^2}{1 + \eta v_m^2} \right) \right) + \epsilon_2 \left( \sum_{m=1}^{H} \left( \sum_{l=1}^{N} w_{ml}^2 + v_m^2 \right) \right)$$

where $\epsilon_1$, $\epsilon_2$ and $\eta$ are positive penalty terms and $\tilde{y}_i$ is the predicted output for input sample $\mathbf{x}_i$, i.e.

$$\tilde{y}_i = \sum_{m=1}^{H} v_m \tanh(\mathbf{x}_i^T \mathbf{w}_m) + \tau$$

Here $\mathbf{w}_m \in \mathbb{R}^N$ is the vector of network weights from the input units to hidden unit $m$, $w_{ml}$ is the $l$-th component of $\mathbf{w}_m$, $v_m \in \mathbb{R}$ is the network weight from hidden unit $m$ to the output unit and $\tau$ is the output unit's bias.

Setiono performed training using the BFGS optimisation algorithm, due to its faster convergence than gradient descent [29][92].
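As a rough illustration of the training objective described above, the following Python sketch evaluates the penalised sum of squares error for a single-output network with tanh hidden units. It is a sketch under assumptions, not Setiono's implementation: the list-based weight layout, the function names and the argument name `eta` for the third penalty parameter are all hypothetical, and the penalty is assumed to combine a saturating term (weighted by `eps1`) with a plain quadratic term (weighted by `eps2`).

```python
import math


def predict(x, w, v, tau):
    """Network output: sum over hidden units of v_m * tanh(x . w_m), plus bias tau."""
    return sum(
        v_m * math.tanh(sum(x_l * w_ml for x_l, w_ml in zip(x, w_m)))
        for w_m, v_m in zip(w, v)
    ) + tau


def penalty(w, v, eps1, eps2, eta):
    """Assumed penalty: eps1 * saturating term + eps2 * quadratic term."""
    saturating = 0.0
    quadratic = 0.0
    for w_m, v_m in zip(w, v):
        for weight in list(w_m) + [v_m]:
            sq = weight * weight
            saturating += eta * sq / (1.0 + eta * sq)
            quadratic += sq
    return eps1 * saturating + eps2 * quadratic


def penalised_sse(data, w, v, tau, eps1, eps2, eta):
    """Sum of squared errors over (x, y) pairs, augmented with the penalty."""
    sse = sum((y - predict(x, w, v, tau)) ** 2 for x, y in data)
    return sse + penalty(w, v, eps1, eps2, eta)


# One input, one hidden unit: w = [[1.0]], v = [1.0]
example_penalty = penalty([[1.0]], [1.0], eps1=1.0, eps2=1.0, eta=1.0)
```

The saturating term approaches a constant for large weights, so it mainly discourages many small non-zero weights, whereas the quadratic term discourages large ones; an optimiser such as BFGS would minimise `penalised_sse` over `w`, `v` and `tau`.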
After training, irrelevant and redundant neurons were removed from the ANN using the N2PFA (Neural Network Pruning for Function Approximation) algorithm [93]. ANN pruning prevents the ANN from over-fitting the training data (discussed in section 2.3.3) and also reduces the length of the rules extracted from the ANN. The length of the extracted rules is reduced because the number of variables (weights, input units and hidden units) affecting the outcome of the ANN is reduced.

The complex input to output mapping of an ANN is a direct consequence of using either the hyperbolic tangent or the sigmoid function as artificial neuron activation functions. However, as was discussed in section 2.3.2, the importance of these functions in ANN training is that they are monotonically increasing, differentiable and continuous throughout their domains.

[Figure 4.2: A 3-piece linear approximation of the hidden unit activation function tanh(net) given 20 training samples (o); the plot shows L(net), tanh(net) and the sample points against net]

In order to generate comprehensible rules, a 3-piece linear approximation of the hyperbolic tangent activation function is constructed. This entails finding the cut-off points ($net_0$ and $-net_0$), the slopes ($\beta_0$ and $\beta_1$) and the intercepts ($0$, $\alpha_1$ and $-\alpha_1$) of each of the three line segments. The sum squared error between the 3-piece linear approximation and the activation function is minimised to obtain the values for each of these parameters:

$$\min_{net_0, \beta_0, \beta_1, \alpha_1} \sum_{i=1}^{|P|} (\tanh(net_i) - L(net_i))^2$$

where $net_i = \mathbf{x}_i^T \mathbf{w}$ is the weighted input of sample $i$ and

$$L(net) = \begin{cases} -\alpha_1 + \beta_1 net & \text{if } net < -net_0 \\ \beta_0 net & \text{if } -net_0 \leq net \leq net_0 \\ \alpha_1 + \beta_1 net & \text{if } net > net_0 \end{cases}$$

The intercept and slopes which minimise the sum squared error are calculated as follows:

$$\beta_0 = \frac{\sum_{|net_i| \leq net_0} net_i \tanh(net_i)}{\sum_{|net_i| \leq net_0} net_i^2}$$

$$\beta_1 = \frac{\sum_{|net_i| > net_0} (net_i - net_0)(\tanh(net_i) - \tanh(net_0))}{\sum_{|net_i| > net_0} (net_i - net_0)^2}$$

$$\alpha_1 = (\beta_0 - \beta_1) net_0$$

The weighted input $net_i$ of each sample is checked as a possible optimal value for $net_0$, starting from the one that has the smallest magnitude. Figure 4.2 illustrates how the 3-piece linear approximation is constructed.

Linear regression rules are generated from a pruned ANN once the network hidden unit activation functions $\tanh(net)$ have been approximated by the 3-piece linear function described above. The regression rules are generated as follows:

1. For each hidden unit $m = 1, \dots, H$:
   (a) Generate a 3-piece linear approximation $L_m(net)$.
   (b) Using the points $-net_{m0}$ and $net_{m0}$ from function $L_m(net)$, divide the input space into 3 subregions.
2. For each resulting subregion $r$ of the input space:
   (a) Define a linear equation that approximates the ANN's output for an input pattern $i$ in subregion $r$ as the rule consequent:

   $$\tilde{y}_i = \sum_{m=1}^{H} v_m L_m(net_{mi}) + \tau$$

   where $net_{mi} = \mathbf{x}_i^T \mathbf{w}_m$, $v_m \in \mathbb{R}$ is the network weight from the hidden unit $m$ to the output unit and $\tau$ is the output unit's bias.
   (b) Generate the rule antecedent $((C_1) \wedge \dots \wedge (C_m) \wedge \dots \wedge (C_H))$, where $C_m$ is either $net_{mi} < -net_{m0}$, $net_{mi} > net_{m0}$ or $-net_{m0} \leq net_{mi} \leq net_{m0}$. Each $C_m$ represents an attribute test. The antecedent is formed by the conjunction of the appropriate tests from each of the hidden units.

The rule antecedent $((C_1) \wedge \dots \wedge (C_m) \wedge \dots \wedge (C_H))$ defines the intersection of each subspace in the input space. For a large number of hidden artificial neurons, this antecedent becomes large. If this antecedent is simplified, using logic or by using an algorithm such as C4.5 (discussed in section 2.1.2), then the rules generated by the above algorithm will be much easier to comprehend.
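The construction of the 3-piece linear approximation can be sketched in Python as follows. This is a minimal sketch rather than Setiono's code: the function names are hypothetical, and it assumes that the sums are taken over the magnitudes $|net_i|$ (which is harmless because tanh is an odd function) and that candidate cut-offs for which a slope is undefined are simply skipped.

```python
import math


def piecewise(net, net0, beta0, beta1, alpha1):
    """Evaluate the 3-piece linear function L(net)."""
    if net < -net0:
        return -alpha1 + beta1 * net
    if net > net0:
        return alpha1 + beta1 * net
    return beta0 * net


def fit_for_cutoff(nets, net0):
    """Closed-form slopes and intercept for a fixed cut-off net0."""
    inner = [abs(n) for n in nets if abs(n) <= net0]
    outer = [abs(n) for n in nets if abs(n) > net0]
    den0 = sum(n * n for n in inner)
    den1 = sum((n - net0) ** 2 for n in outer)
    if den0 == 0.0 or den1 == 0.0:
        return None  # assumption: skip cut-offs with an undefined slope
    beta0 = sum(n * math.tanh(n) for n in inner) / den0
    beta1 = sum((n - net0) * (math.tanh(n) - math.tanh(net0)) for n in outer) / den1
    alpha1 = (beta0 - beta1) * net0
    return beta0, beta1, alpha1


def sum_squared_error(nets, net0, params):
    return sum((math.tanh(n) - piecewise(n, net0, *params)) ** 2 for n in nets)


def fit_three_piece(nets):
    """Try every |net_i| as cut-off, smallest magnitude first; keep the best."""
    best = None
    for net0 in sorted({abs(n) for n in nets}):
        params = fit_for_cutoff(nets, net0)
        if params is None:
            continue
        err = sum_squared_error(nets, net0, params)
        if best is None or err < best[0]:
            best = (err, net0, params)
    return best


# 20 evenly spaced weighted inputs in [-3, 3]
nets = [-3.0 + 6.0 * i / 19 for i in range(20)]
err, net0, (beta0, beta1, alpha1) = fit_three_piece(nets)
```

Because $\alpha_1 = (\beta_0 - \beta_1) net_0$, the middle and outer segments meet at the cut-off points, so the resulting approximation is continuous. The test that selects the branch in `piecewise` is the same test that appears as $C_m$ in the rule antecedent.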
The evolutionary computing paradigm of genetic programming (GP) can be used to solve a wide variety of data mining problems. This section discusses GP methods for symbolic regression; decision, regression and model tree induction; and scaling problems associated with GP.

GP satisfies the comprehensibility requirement of knowledge discovery, because the representation of an individual (or chromosome) can be engineered to provide easily understandable results. Unlike artificial neural networks, GP can be used to perform symbolic regression without the need for data transformations. GP is also capable of regression analysis on variables that exhibit non-linear relationships, as opposed to the linear regression techniques presented in section 3.2. GP is thus a useful tool for regression analysis of non-linear data. However, because most data is in fact linear, a more conventional form of regression should always be considered first.

Regression problems are solved by fitting a function, represented by a chromosome, to the dataset using a fitness function that minimises the error between them. The terminal set is defined as a number of constants and attributes, e.g. {32, 2.5, 0.833, x, y, z}, and describes a number of valid states for the leaf nodes of a chromosome (in the form of a tree). The function set is defined by a number of operators, e.g. {+, -, *, /, cos, sin}, and describes a number of valid states for the internal nodes of a chromosome. Figure 4.3 demonstrates an example chromosome for the aforementioned terminal set and function set.

Figure 4.3: An example chromosome for a regression problem: terminal set {32, 2.5, 0.833, x, y, z} and function set {+, -, *, /, cos, sin}

The constants used in the terminal set are an Achilles' heel of a symbolic regression genetic program. If a population is liberally scattered with constants chosen from a preset range, e.g. [-1, 1], it may be difficult for a genetic program to evolve the expression 300x. Abass et al.
present a concise overview of methods to correct this problem [1].

A number of different classification systems that utilise genetic programming (GP) have been developed. This section discusses two interesting ones. Additionally, this section shows how a genetic program can be developed to directly evolve decision, regression and model trees.

Eggermont et al. present a GP approach that utilises a stepwise adaptation of weights (SAW) technique in order to learn the optimal penalisation factor for the GP fitness function [30]. Each individual in the population represents a classification rule, and utilises a function set of boolean connectives and a terminal set of attribute tests, i.e. either the rule condition covers a training pattern, in which case it is asserted to belong to a class, or it does not. The approach was shown to have increased accuracy over the fixed penalisation factor case.

Freitas presents an interesting GP framework that evolves SQL queries in order to increase scalability, security and portability [41]. Each individual consists of two parts: a tuple-set descriptor and a goal attribute. The GP approach utilises a niching strategy in order to force the method to produce innovative rules.

GP can also be used to directly build decision, regression and model trees. As was mentioned in section 4.2.1, artificial neural networks provide no comprehensible explanation of how they classify or approximate a dataset. On the other hand, classification systems, such as C4.5 (from section 2.1.2), and regression systems, such as M5 (from section 2.1.2), generate overly complex trees. GP is potentially capable of providing a compromise between these two extremes.

For zero-order learning, the function set of the chromosome consists of a number of attribute tests. The terminal set for the chromosome depends on the kind of tree being induced; for model trees it consists of models. The models for model trees can be obtained by linear regression, symbolic regression or the GASOPE method of the previous chapter.
One option for the fitness function is to minimise the error between the predicted response of the individual and the target response of a number of training patterns. Additionally, the fitness function should implement a penalisation factor in order to penalise the complexity of a chromosome. In this manner, genetic programs can be used to minimise both the bias and the variance of the model described by a chromosome.

GP has shown considerable promise in its problem solving ability over a wide range of applications including data mining [45][62]. However, problems exist in scaling GP to larger problems such as data mining. Marmelstein and Lamont summarise many of these difficulties [73]. Some of the most important scaling problems are:

• The size and complexity of GP solutions can make them difficult to understand. Furthermore, solutions can become bloated with extraneous code (also known as introns).

Of the above difficulties, the most difficult to control is the complexity of GP solutions, otherwise known as code growth. Abass describes many methods for the removal of introns, e.g. chromosome parsing, alternative selection methods and alternative crossover methods [1].

This section discusses a genetic program for the mining of continuous-valued classes (GPMCC). Section 4.3.1 presents an overview of the GPMCC method and its various components. An iterative learning strategy used by the GPMCC method is discussed in section 4.3.2. Section 4.3.3 describes, in detail, the fragment pool utilised by the GPMCC method. Finally, the core genetic program for model tree induction is discussed in section 4.3.4.

The genetic program for the mining of continuous-valued classes (GPMCC) consists of three parts:

1. An iterative learning strategy to reduce the number of patterns that are presented to the genetic program.
2. A pool of GASOPE fragments, which serve as a terminal set for the terminal nodes of a chromosome in a genetic program.
This pool of fragments is evolved using mutation and crossover operators.

[Figure: overview of the GPMCC learning process; recoverable labels: stratified random sample, GP tree initialisation]

Figure 4.3.1 shows an overview of the GPMCC learning process. In addition, the GPMCC learning process is summarised below:

(a) Sample $S$, using an incremental training strategy, from the remaining training patterns $P$, i.e. $P' \subset P$, $S = S \cup P'$.
(b) Remove the sampled patterns from $P$, i.e. $P = P \setminus S$.
(c) Evaluate the fitness $R_a^2(I_x)$ of each individual in population $G_{GP,g}$ using the patterns in $S$.
(d) Let $G'_{GP,g} \subset G_{GP,g}$ be the top x% of the individuals, based on fitness, to be involved in elitism.
(e) Install the members of $G'_{GP,g}$ into $G_{GP,g+1}$.
(f) Let $G''_{GP,g} \subset G_{GP,g}$ be the top n% of the individuals, based on fitness, to be involved in crossover.
(g) Run the fragment pool optimisation algorithm once.
(h) Perform crossover:
    i. Randomly select two individuals $I_\alpha$ and $I_\beta$ from $G''_{GP,g}$.
    ii. Produce offspring $I_\gamma$ from $I_\alpha$ and $I_\beta$.
    iii. Install $I_\gamma$ into $G''_{GP,g}$.
(i) Perform mutation:
    i. Select an individual $I_\omega$ from $G''_{GP,g}$.
    ii. Mutate $I_\omega$ by randomly selecting a mutation operator to perform.
    iii. Install $I_\omega$ into $G_{GP,g+1}$.
(j) Evolve the next generation $g := g + 1$.

4.3.2 Iterative learning strategy

The GPMCC method utilises an iterative learning strategy to reduce the number of training pattern presentations per generation. Additionally, the iterative learning strategy should result in more accurate rules being generated [32][83].

The strategy utilises the k-means clustering of section 3.3.1 to cluster the training data. Clustering finds regions of high pattern density. As was discussed in section 3.3.1, larger clusters form around the turning points of a function. Using a proportional sampling strategy, the most informative patterns can be selected for training.

As with the GASOPE method, a stratified random sample is selected from the available training patterns during each generation of the genetic program.
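The proportional, stratified sampling used by the iterative learning strategy can be illustrated with a short Python sketch. The function name `stratified_sample`, the representation of clusters as plain lists and the rule of drawing at least one pattern from every non-empty cluster are assumptions made for this illustration; the clusters themselves would come from the k-means clusterer of section 3.3.1.

```python
import random


def stratified_sample(clusters, fraction, rng):
    """Draw the same fraction of patterns, without replacement, from every cluster.

    clusters: list of lists of patterns (one list per k-means cluster)
    fraction: proportion of each cluster (stratum) to draw, in (0, 1]
    rng:      a random.Random instance, passed in for reproducibility
    """
    sample = []
    for members in clusters:
        if not members:
            continue
        # Assumption: every non-empty cluster contributes at least one pattern.
        size = min(len(members), max(1, round(fraction * len(members))))
        sample.extend(rng.sample(members, size))
    return sample


# Three clusters of 60, 30 and 10 patterns (patterns are just integers here)
clusters = [list(range(0, 60)), list(range(60, 90)), list(range(90, 100))]
sample = stratified_sample(clusters, 0.1, random.Random(0))
```

Large clusters contribute proportionally more patterns, so dense regions of the training data remain well represented in the sample. Raising `fraction` from one generation to the next, until it reaches 1, gives the growing sample that the strategy calls for.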
The size, $s$, of the initial sample is chosen as a percentage of the total number of training patterns $|P|$. The sampling strategy utilises an acceleration rate to increase the size of the sample during every generation of the genetic program. The size of the sample is increased until it equals the total number of training patterns. The stratified random sample is drawn proportionally from each of the $k$ clusters of the k-means clusterer, i.e.:

$$|C'_o| = |C_o| \cdot acceleration \cdot \frac{s}{|P|}$$

where $C'_o \subset C_o$ is the sample drawn from cluster $C_o$, $|C_o|$ is the (stratum) size of cluster $C_o$, $|P|$ is the size of the data set, $s$ is the initial sample size and $acceleration$ is the acceleration rate. The acceleration rate increases the size of the sample at each generation.

4.3.3 The fragment pool

From section 2.1.2, a model tree is a decision tree that implements multi-variate linear models at its terminal nodes. However, linear models may not adequately describe time-series data or non-linear data. In these cases, a model tree that implements multi-variate linear models at its terminal nodes will perform a piecewise approximation of the problem space. Although such a piecewise approximation may be accurate, the number of rules induced by the approximation will be large, as indicated by the results in section 4.4.3.

A number of techniques exist to perform non-linear regression. Two obvious non-linear regression techniques include

• using a genetic program to perform a symbolic regression (from section 4.2.2) of the data covered by the terminal nodes,
• or using the GASOPE method (from section 3.3.2) to perform a non-linear regression of the data covered by the terminal nodes.

However, the use of either of these methods can be shown to have a severe impact on the time taken to construct a model tree. From section 3.4.3, the largest training time of the GASOPE method was approximately 1.5 seconds.
Assuming a genetic program is used to construct a model tree, if a model is generated by a mutation operator, a 1.5 second time penalty will be incurred every time that mutation operator is called. If, for example, there is a 1% chance of the mutation operator being run on an individual in a population of 100 individuals with, on average, 5 terminal nodes per individual, a 7.5 second time penalty will be incurred per generation ($0.01 \times 100 \times 5 \times 1.5 = 7.5$). For 1000 generations this performance penalty is 7500 seconds (2 hours and 5 minutes). In other words, a very large proportion of the genetic program's training time will be spent optimising the models.

This section discusses the fragment pool. The fragment pool is an evolutionary algorithm, based on context modelling, for reducing the time taken for a model to be generated. The fragment pool represents a belief space of terminal symbols for a model tree. The remainder of this section discusses the representation of the fragment pool, the initialisation of a fragment, the fragment mutation and crossover operators, and the fitness function.

The implementations of the fragment pool and the genetic program of section 4.3.4 to evolve model trees are heavily intertwined. Therefore, for the remainder of this section assume that there is a genetic program that evolves model trees, whose terminal nodes are models that are obtained from the fragment pool.

A fragment $F_\omega$ has the form $F_\omega = \{I_\omega \rightarrow \pi_\omega\}$, where $I_\omega$ is a GASOPE individual from section 3.3.2 (model) and $\pi_\omega$ is the lifetime of $I_\omega$. The lifetime $\pi_\omega$ represents the age of a fragment in the fragment pool. When a fragment's lifetime expires, it is removed from the pool. The lifetime of a fragment can, however, be reset if the fragment is deemed "useful". By counting the number of times a model appears as a terminal node in the members of the crossover group of the genetic program, the fragment usefulness can be determined.
A model is more likely to appear as a terminal node of members of the crossover group if the model closely approximates the sub-space described by the training and validation patterns covered by the path to the terminal node. Thus, the fragment pool implements a kind of context modelling [88], because fragments that result in sub-optimal approximations for these sub-spaces are eventually removed and can no longer contaminate the pool.

(a) Use the GASOPE method (from section 3.3.2) to obtain a non-linear regression $I_\omega$ of the patterns in each cluster.
4. Divide $k$ by the split factor, i.e. $k := k / \text{split-factor}$.

Essentially, the above algorithm performs multiple piecewise approximations of the problem space to build an initial set of fragments. The number of initial clusters, initial-clusters, and the split factor, split-factor, ultimately control the total number of piecewise approximations (models) that are generated. The algorithm starts by fitting many highly specific approximations and then increases the approximation generality by decreasing the available number of clusters $k$. Decreasing the available number of clusters results in an increase in the number of patterns covered by each cluster centroid. This, in turn, results in a more general function approximation per cluster.

The models contained within the fragments form part of a terminal set for the genetic program. Whenever the genetic program requires a model, the fragment pool randomly selects a fragment and passes the fragment's model to the genetic program.

The fragment pool mutation operators serve to inject additional fragments into the fragment pool. The addition of fragments to the fragment pool results in an increased number of models in the terminal set of the genetic program. Additional models in the terminal set prevent the stagnation of the genetic program, by allowing the introduction of models that approximate regions not covered by the initialisation of the fragment pool.
Additionally, the mutation operators also serve to fine-tune the models at the terminal nodes of the genetic program. Two mutation operators exist for the fragment pool:

• The shrink operator duplicates an arbitrary fragment in the fragment pool, and applies the shrink operator of section 3.3.2 to the duplicate.
• The introduce operator calls the GASOPE optimisation algorithm of section 3.3.2 on the training and validation patterns covered by the path of a terminal node of an arbitrary individual of the genetic program. The path is defined as the nodes traversed from the root of a tree to a specific node in the tree. The model obtained from the GASOPE optimisation algorithm is given a lifetime of 0 and is inserted into the fragment pool.

The other mutation operators of the GASOPE method are not used by the fragment pool optimisation process, because they require the coefficients for a term to be calculated. The coefficients can only be calculated if the training and validation sets are kept for each fragment. This, however, is not feasible because a fragment in the pool may be utilised by more than one individual in the genetic program, i.e. an individual could describe multiple sub-spaces of the problem domain. All the operators of the GASOPE method are, however, used in the initialisation of the fragment pool.

The crossover operator of the fragment pool is an invocation of the GASOPE crossover operator of section 3.3.2. A fragment with a non-zero usefulness factor and an arbitrary fragment are randomly chosen from the fragment pool and their models are given to the GASOPE crossover operator. The model obtained from the GASOPE crossover operator is given a lifetime of 0 and is inserted into the fragment pool.

A culling operator removes fragments from the fragment pool whose lifetimes have expired, i.e. when the fragment lifetimes are larger than some upper bound.
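The interplay between fragment usefulness, fragment lifetime and culling can be sketched in Python. The sketch makes several simplifying assumptions that are not taken from the GPMCC implementation: a model is represented by a plain string, an individual of the crossover group is represented by the list of models at its terminal nodes, and the names `Fragment`, `count_usefulness` and `age_and_cull` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    model: str         # stand-in for a GASOPE model
    lifetime: int = 0  # age of the fragment in the pool


def count_usefulness(pool, crossover_group):
    """For each fragment, count how often its model appears as a terminal
    node of the individuals in the crossover group."""
    return [
        sum(individual.count(fragment.model) for individual in crossover_group)
        for fragment in pool
    ]


def age_and_cull(pool, crossover_group, max_lifetime):
    """Reset the lifetime of useful fragments, age the others by one and
    drop fragments whose lifetime exceeds max_lifetime."""
    survivors = []
    usefulness = count_usefulness(pool, crossover_group)
    for fragment, times_used in zip(pool, usefulness):
        fragment.lifetime = 0 if times_used > 0 else fragment.lifetime + 1
        if fragment.lifetime <= max_lifetime:
            survivors.append(fragment)
    return survivors


pool = [Fragment("a", 2), Fragment("b", 2), Fragment("c", 0)]
crossover_group = [["a", "c"], ["a"]]  # terminal-node models per individual
usefulness = count_usefulness(pool, crossover_group)
pool = age_and_cull(pool, crossover_group, max_lifetime=2)
```

In this toy run the unused fragment "b" ages past the upper bound and is removed, while "a" and "c" have their lifetimes reset because they appear in the crossover group.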
The culling operator removes fragments from the fragment pool that have not been useful for a number of generations. This operator ensures that the fragment pool does not become uncontrollably large. A large fragment pool results in a large number of terminal symbols, which may increase the time taken to optimise the genetic program.

The shrink operator and the crossover operator are each applied once, uniformly at random, after the completion of a generation of the genetic program. This ensures that the size of the fragment pool increases at least once per generation, in order to counteract the effects of the fragment lifetime, i.e. the removal of useless fragments. The introduce operator is applied with a statistical probability whenever the genetic program requires a model, i.e. when the relevant genetic program mutation operator is invoked.

If the fragment lifetime is too large or the introduce operator is applied too often, the fragment pool will grow too quickly. A large fragment pool results in a large number of terminal symbols, which may have a negative consequence on the convergence properties of the genetic program. Conversely, a small fragment pool may lead to the stagnation of the genetic program, because there may not be enough variation in the terminal symbols described by the fragment pool.

The fitness function rewards the usefulness of a fragment. As was mentioned earlier, the fragment usefulness is determined by counting the number of times a fragment appears as a terminal node of individuals in the crossover group of the genetic program. The crossover group consists of the top individuals in the genetic program, obtained through tournament selection. Thus, the usefulness of a fragment is determined by the number of times it appears as well as where it appears, i.e. fragments not used as terminal symbols of the crossover group are useless, because the overall fitness of the individuals using those fragments is poor.

1. Obtain a set $G''_{GP,g}$ of individuals from the crossover group of a genetic program.
2. Evaluate the fitness of each fragment $F_\omega$ in the pool $G_{FP}$ using $G''_{GP,g}$, i.e. for each fragment $F_\omega \in G_{FP}$:
   (a) Count the number of times, $n$, the individual $I_\omega$ appears as a terminal symbol in $G''_{GP,g}$.
   (b) Set the fitness of the individual $F_{FP}(F_\omega) = n$.
3. Select a crossover group from the pool $G'_{FP} \subset G_{FP}$, where $F_{FP}(F_\omega) > 0$, $F_\omega \in G'_{FP}$ (the fragments of $G'_{FP}$ still reside in $G_{FP}$).
4. Reset the fragment lifetime of all fragments in $G'_{FP}$ to 0, as they were deemed useful.
5. Increase the fragment lifetime of all fragments in $G_{FP} \setminus G'_{FP}$ by one.

Crossover:
   (a) Select a fragment $F_\alpha$ from $G'_{FP}$ and a fragment $F_\beta$ from $G_{FP}$.
   (b) Perform crossover to obtain a fragment $F_\gamma = \{I_\gamma \rightarrow 0\}$.
   (c) Insert $F_\gamma$ into $G_{FP}$.

Mutation:
   (a) Select a fragment $F_\omega$ from $G_{FP}$.
   (b) Duplicate $F_\omega$ to get $F'_\omega$.
   (c) Perform mutation on $F'_\omega$.
   (d) Insert $F'_\omega$ into $G_{FP}$.

This section discusses, in detail, a genetic program for inducing model trees. The models for this genetic program are obtained from the fragment pool discussed in section 4.3.3.

NODE: (CONSEQUENT) | ((ANTECEDENT $\rightarrow$ NODE) $\vee$ ($\neg$ANTECEDENT $\rightarrow$ NODE))
ANTECEDENT: (NOMINAL_ANTECEDENT) | (CONTINUOUS_ANTECEDENT)
NOMINAL_ANTECEDENT: $(A_\xi = v_\xi)$
CONTINUOUS_ANTECEDENT: $(A_\xi < v_\xi)$ | $(A_\xi > v_\xi)$ | $(A_\xi = v_\xi)$ | $(A_\xi \neq v_\xi)$
CONSEQUENT: $(I_\omega)$

A consequent, $I_\omega$, represents a GASOPE model from the fragment pool, $A_\xi$ represents a nominal-, continuous- or discrete-valued attribute and $v_\xi$ represents a possible value of $A_\xi$. For the continuous antecedents, operators such as $\leq$ and $\geq$ are obtained by adjusting the attribute value $v_\xi$.

The GPMCC initialise operator creates an individual $I_x$ by recursively adding nodes to that individual, up to a maximum depth bound.
The pseudo-code algorithm for the initialisation is as follows, where CALLER is a calling node initially set to Nil and depth is the maximum required depth of the tree:

Non-terminal node case:
   (a) Select an attribute $A_\xi$, $\xi = 1, \dots, I$ from the attribute space of dimension $I$.
   (b) If $A_\xi$ is a continuous-valued attribute, select an operator $op(\xi) \in \{<, >, =, \neq\}$.
   (c) Otherwise, $op(\xi) \in \{=\}$.
   (d) Select an attribute value $a_{\xi,i}$ for attribute $A_\xi$ from a training pattern $i$, such that $v_\xi = a_{\xi,i}$, $i \in \{1, \dots, |P|\}$.
   (e) Create a node $N_x$ with antecedent $ant_x = (A_\xi \; op(\xi) \; v_\xi)$ and consequent $con_x = Nil$.
   (f) Call Initialise with the node covered by the antecedent of $N_x$ (the left node), $N_{ant_x}$, and depth $depth + 1$.
   (g) Call Initialise with the node covered by the negation of the antecedent of $N_x$ (the right node), $N_{\neg ant_x}$, and depth $depth + 1$.

Terminal node case:
   (a) Select an individual $I_\omega$ from the fragment pool.
   (b) Create a node $N_x$ with antecedent $ant_x = Nil$ and consequent $con_x = I_\omega$.
   (c) Set the node covered by the antecedent of $N_x$:

Once the procedure terminates, CALLER returns with the head of the tree. Figure 4.5 illustrates one outcome of the initialisation of an individual $I_x$. Each node in the diagram is recursively initialised as $N_x$ and the arrows show the path taken by the initialisation method.

The mutation operators serve to inject new genetic material into the population. Additionally, the mutation operators utilise domain specific knowledge (for reasons described in section 2.2.4) in order to improve the quality of the individuals in the population. A large number of mutation operators exist for the GPMCC method.

• Expand-worst-terminal-node operator: The expand-worst-terminal-node operator locates and partitions the sub-space for which a terminal node has a higher relative error than all other terminal nodes in the individual $I_x$.
The relative error of the terminal node is determined by using the mean squared error $E_{MS}$ between the model described by the terminal node and the training set covered by the path of that terminal node. The operator attempts to maximise the adjusted coefficient of determination $R_a^2$ (fitness) of the individual, by partitioning the sub-space described by a terminal node into smaller sub-spaces. The pseudo-code for the expand-worst-terminal-node operator is as follows:

1. Select $N_x$ (shown in figure 4.6) such that $\forall N_i \in I_x : (con_x \neq Nil) \wedge (E_{MS}(N_x) \geq E_{MS}(N_i))$, i.e. select the worst terminal node.
2. Select an attribute $A_\xi$, $\xi = 1, \dots, I$ from the attribute space of size $I$ (in order to turn the consequent into an antecedent).
3. If $A_\xi$ is a continuous-valued attribute, select an operator $op(\xi) \in \{<, >, =, \neq\}$.
4. Otherwise, $op(\xi) \in \{=\}$.
5. Select an attribute value $a_{\xi,i}$ for attribute $A_\xi$ from a training pattern $i$, such that $v_\xi = a_{\xi,i}$, $i \in \{1, \dots, |P|\}$.
6. Set the antecedent $ant_x$ of node $N_x$ to $(A_\xi \; op(\xi) \; v_\xi)$.
7. Set the consequent $con_x$ of node $N_x$ to Nil (to satisfy the termination criteria).
8. Create a node covered by the antecedent of $N_x$ (the left node), $N_{ant_x}$, with antecedent $ant_{ant_x} = Nil$ and consequent $con_{ant_x} = I_\omega$.
   $N_{ant_{ant_x}} := Nil$
11. Create a node covered by the negation of the antecedent of $N_x$ (the right node), $N_{\neg ant_x}$, with antecedent $ant_{\neg ant_x} = Nil$ and consequent $con_{\neg ant_x} = I_\omega$.
   $N_{\neg ant_{ant_x}} := Nil$

Intuitively, the expand-worst-terminal-node operator attempts to increase the fitness of an individual, by partitioning the subspace covered by the worst terminal node (in terms of mean squared error) into two more subspaces. A high mean squared error is an indication of a poor function approximation. It is possible that by partitioning the subspace the accuracy of the function approximation can be increased. Thus, this operator caters for discontinuities in the input space.
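The expand-worst-terminal-node operator can be sketched in Python on a toy model tree. The sketch is illustrative only and rests on several assumptions: all attributes are treated as continuous-valued, models are plain Python callables drawn at random from a list that stands in for the fragment pool, and the names `Node`, `terminals`, `mse` and `expand_worst_terminal` are hypothetical.

```python
import operator
import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq, "!=": operator.ne}


@dataclass
class Node:
    model: Optional[Callable[[Sequence[float]], float]] = None  # consequent
    attr: int = 0                   # antecedent: (A_attr op value)
    op: str = "<"
    value: float = 0.0
    left: Optional["Node"] = None   # node covered by the antecedent
    right: Optional["Node"] = None  # node covered by its negation


def terminals(node, patterns):
    """Yield (terminal node, patterns covered by the path to that node)."""
    if node.model is not None:
        yield node, patterns
        return
    test = OPS[node.op]
    covered = [p for p in patterns if test(p[0][node.attr], node.value)]
    rest = [p for p in patterns if not test(p[0][node.attr], node.value)]
    yield from terminals(node.left, covered)
    yield from terminals(node.right, rest)


def mse(model, patterns):
    if not patterns:
        return 0.0
    return sum((model(x) - y) ** 2 for x, y in patterns) / len(patterns)


def expand_worst_terminal(root, patterns, pool, rng):
    """Turn the terminal node with the largest MSE into a test with two
    new terminal children whose models are drawn from the pool."""
    worst, _ = max(terminals(root, patterns), key=lambda t: mse(t[0].model, t[1]))
    x, _ = rng.choice(patterns)            # value taken from a training pattern
    worst.attr = rng.randrange(len(x))
    worst.op = rng.choice(list(OPS))
    worst.value = x[worst.attr]
    worst.left = Node(model=rng.choice(pool))
    worst.right = Node(model=rng.choice(pool))
    worst.model = None                     # the consequent becomes Nil
    return root


# Target: y = x for x < 5, y = 10 otherwise; the right-hand model is poor.
patterns = [((float(i),), float(i) if i < 5 else 10.0) for i in range(10)]
good = lambda x: x[0]
bad = lambda x: 0.0
root = Node(attr=0, op="<", value=5.0, left=Node(model=good), right=Node(model=bad))
expand_worst_terminal(root, patterns, [good, bad], random.Random(1))
```

After the call, the poorly fitting right-hand terminal node has become a non-terminal node with two terminal children. Whether the new partition improves the fitness is left to selection, as with any mutation.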
• Expand-any-terminal-node operator: The expand-any-terminal-node operator partitions the sub-space of a random terminal node in an individual $I_x$. The pseudo-code for the expand-any-terminal-node operator is identical to the expand-worst-terminal-node operator, except that step (1) should read:

If the high mean squared error in the worst terminal node is due to a large variance in the data, the expand-worst-terminal-node operator will continually attempt to partition the subspace of the worst terminal node to no avail. This could lead to extremely slow convergence of the GPMCC method. The expand-any-terminal-node operator prevents this scenario from occurring by allowing any terminal node to be expanded.

• Shrink operator: The shrink operator replaces a non-terminal node of an individual $I_x$ with one of the non-terminal node's children. The pseudo-code for the shrink operator is as follows:
1. Select $N_x$ (as shown in figure 4.7) from $I_x$ such that $(ant_x \neq Nil)$.
2. If $U(0,1) < 0.5$ then the current node becomes the node covered by the antecedent (the left node), $N_x := N_{ant_x}$.
3. Otherwise, the current node becomes the node covered by the negation of the antecedent (the right node), $N_x := N_{\neg ant_x}$ (as shown in figure 4.7).

The shrink operator is responsible for removing introns from an individual, which is necessary to prevent code bloat.

• Perturb-worst-non-terminal-node operator: The perturb-worst-non-terminal-node operator selects and perturbs a non-terminal node which has a higher relative error than all other non-terminal nodes in an individual $I_x$. Once again, the relative error is determined using the mean squared error $E_{MS}$ on the training set. This operator gives the GPMCC method an opportunity to optimise the partitions described by the non-terminal nodes of an individual. The pseudo-code for the perturb-worst-non-terminal-node operator is as follows:
1. Select $N_x$ (as shown in figure 4.8) such that $\forall N_i \in I_x : (ant_x \neq Nil) \wedge (E_{MS}(N_x) \geq E_{MS}(N_i))$.
2. If $U(0,1) < U_1$, where $U_1 \in [0,1]$ is a user-defined parameter:
(a) If $A_\xi$ is a continuous-valued attribute:
i. If $U(0,1) < U_2$, where $U_2 \in [0,1]$ is a user-defined parameter, select an operator $op(\xi) \in \{<, >, =, \neq\}$ (as shown in figure 4.8).
ii. Otherwise, adjust the attribute value $v_\xi$ according to a Gaussian distribution with standard deviation 0.3, scaled using the attribute range $(max - min)$ and a user-defined parameter $U_3 \in \Re$, where $\forall a_{\xi,i}, i = 1, \ldots, |P| : (max \geq a_{\xi,i}) \wedge (min \leq a_{\xi,i})$, i.e. min is the minimum value for an attribute $A_\xi$ and max is the maximum value for an attribute $A_\xi$. The standard deviation 0.3 of the Gaussian distribution provides an even distribution of the Gaussian function in the domain [0,1].
(b) Otherwise:
i. Randomly select an attribute value $a_{\xi,i}$ for attribute $A_\xi$ from a training pattern $i$, and let $v_\xi = a_{\xi,i}$, $i \in \{1, \ldots, |P|\}$.
3. Otherwise:
(a) Select an attribute $A_\xi$, $\xi = 1, \ldots, I$, from the attribute space of size $I$.
(b) If $A_\xi$ is a continuous-valued attribute, select an operator $op(\xi) \in \{<, >, =, \neq\}$.
(c) Otherwise, $op(\xi) \in \{=\}$.
(d) Randomly select an attribute value $a_{\xi,i}$ for attribute $A_\xi$ from a training pattern $i$, such that $v_\xi = a_{\xi,i}$, $i \in \{1, \ldots, |P|\}$.
4. Set the antecedent $ant_x$ of node $N_x$ to $(A_\xi \; op(\xi) \; v_\xi)$.

Intuitively, the partition described by a non-terminal node of an individual may not correctly partition the subspace, e.g. if a test should have been $A_1 < 5$, but is actually $A_1 < 4.6$. The perturb-worst-non-terminal-node operator specifically attempts to adjust the test described by the worst non-terminal node (indicated by the largest mean squared error).

• Perturb-any-non-terminal-node operator: The perturb-any-non-terminal-node operator selects and perturbs a non-terminal node in an individual $I_x$. The perturb-any-non-terminal-node operator is identical to the perturb-worst-non-terminal-node operator, except that step (1) should read:

This operator allows for the perturbation of any non-terminal node, in order to prevent slow convergence in the case that the high mean squared error of the worst non-terminal node is due to high variation in the dataset.

• Perturb-worst-terminal-node operator: The perturb-worst-terminal-node operator selects and perturbs a terminal node which has a higher relative error than all other terminal nodes in an individual $I_x$. This operator gives the GPMCC method an opportunity to optimise the non-linear approximations for the sub-space covered by the training and validation patterns described by the path to a terminal node. The pseudo-code for the perturb-worst-terminal-node operator is as follows:
1. Select $N_x$ (as shown in figure 4.9) such that $\forall N_i \in I_x : (con_x \neq Nil) \wedge (E_{MS}(N_x) \geq E_{MS}(N_i))$.
2. Select an individual $I_\omega$ from the fragment pool.
3. Set the consequent $con_x$ of node $N_x$ to $I_\omega$.

Intuitively, the model described by the terminal node of an individual may be a poor fit of the data covered by the path of that terminal node. The perturb-worst-terminal-node operator randomly selects a new individual from the fragment pool to replace the current model.

• Perturb-any-terminal-node operator: The perturb-any-terminal-node operator selects and perturbs a terminal node in an individual $I_x$. The perturb-any-terminal-node operator is identical to the perturb-worst-terminal-node operator, except for the selection in step (1). This operator allows for the perturbation of any terminal node, in order to prevent slow convergence in the case that the high mean squared error of the worst terminal node is due to high variation in the dataset.

• Reinitialise operator: The reinitialise operator is a re-invocation of the initialisation operator.

The crossover operator implements a standard genetic program crossover strategy.
Two individuals ($I_\alpha$ and $I_\beta$) are chosen by tournament selection from the population and a crossover point is chosen for each individual. The two crossover points are spliced together to create a new individual $I_\gamma$. The pseudo-code for the crossover operator is as follows:

Like all evolutionary computing paradigms, the fitness function is the most important aspect of a genetic program, in that it serves to direct the algorithm toward optimal solutions. The fitness function used by the genetic program is an extended form of the adjusted coefficient of determination (the GASOPE fitness function) of equations (3.15) and (3.16). In this function, $s$ is the size of the sample set, $b_i$ is the actual output of pattern $i$, $\hat{b}_{I_x,i}$ is the predicted output of individual $I_x$ for pattern $i$, and the model complexity $d$ is calculated as follows:

$$d = \sum_{u=1}^{|I_x|} \begin{cases} \sum_{\xi=1}^{|I_\omega|} \sum_{\tau=1}^{|T_\xi|} \lambda_{\xi,\tau} & \text{if } con_u \neq Nil \\ 1 & \text{if } con_u = Nil \end{cases}$$

where $I_x$ is an individual in the set $G_{GP}$ of individuals, $|I_x|$ is the number of nodes in $I_x$, $I_\omega$ is the model at a terminal node, $T_\xi$ is a term of $I_\omega$ and $\lambda_{\xi,\tau}$ is the order of term $T_\xi$. This fitness function penalises the complexity of an individual $I_x$ by penalising the size of the individual and the complexity of each of its terminal nodes, i.e. the number of nodes and the complexity of each leaf node, as shown in equation (3.16).

This section discusses the experimental procedure and results of the GPMCC method vs. NeuroLinear and Cubist, applied to various data sets obtained from the UCI machine learning repository and a number of artificially created datasets. Section 4.4.1 presents the various data sets. The influence of a number of key GPMCC parameters is discussed in section 4.4.2. Section 4.4.3 presents the experimental results for the functions listed in section 4.4.1. The quality of the generated rules is discussed in section 4.4.4.
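The complexity penalty $d$ of the fitness function described above can be illustrated with a short Python sketch. Several things here are assumptions rather than details confirmed by the text: the `Node` class and the encoding of a leaf model as a list of terms (each term a list of per-variable orders) are hypothetical, $\lambda_{\xi,\tau}$ is read as the order of the $\tau$-th variable in term $\xi$, and `adjusted_r2` uses the common textbook form $1 - \frac{SSE}{SST} \cdot \frac{s-1}{s-d}$, which may differ in detail from equations (3.15) and (3.16).

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    # Hypothetical encoding: a terminal node carries `model`, a list of terms,
    # where each term is a list of variable orders; a non-terminal node has
    # model = None (its consequent is Nil) and two children.
    model: Optional[List[List[int]]] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def complexity(node: Optional[Node]) -> int:
    """Model complexity d: summed term orders per leaf, plus 1 per non-terminal node."""
    if node is None:
        return 0
    if node.model is not None:  # con_u != Nil
        return sum(sum(term) for term in node.model)
    return 1 + complexity(node.left) + complexity(node.right)  # con_u == Nil


def adjusted_r2(actual: Sequence[float], predicted: Sequence[float], d: int) -> float:
    """Assumed textbook form of the adjusted coefficient of determination."""
    s = len(actual)
    mean = sum(actual) / s
    sse = sum((b - b_hat) ** 2 for b, b_hat in zip(actual, predicted))
    sst = sum((b - mean) ** 2 for b in actual)
    return 1.0 - (sse / sst) * (s - 1) / (s - d)


# Usage: a root test with two leaf models, x^2 + x on the left and x*y on the right.
tree = Node(left=Node(model=[[2], [1]]), right=Node(model=[[1, 1]]))
d = complexity(tree)
```

For the same prediction error, a larger `d` yields a smaller fitness under this form, which is the penalising behaviour the text describes for large trees and high-order leaf models.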
The GPMCC method was evaluated on a number of benchmark approximation databases from the UCI machine learning repository as well as a number of artificial databases [14]. Table 4.1 describes each of the UCI databases used in this thesis. The artificial databases used in this thesis were created to analyse various approximation problems not sufficiently covered by the UCI machine learning repository, e.g. time-series, and to provide a number of large databases, exceeding 10000 patterns, with which to analyse the performance of the GPMCC method.

Table 4.1: Databases obtained from the UCI machine learning repository (Attributes: N = Nominal, C = Continuous)
Dataset | Samples | Attributes | Prediction task
Abalone | 4177 | 1N, 7C | Age of abalone specimens
Auto-mpg | 392 | 7C | Car fuel consumption in miles per gallon
Elevators | 16599 | 18C | Action taken for controlling an F16 aircraft
Federal Reserve Economic Data | 1049 | 16C | 1-Month credit deficit rate
House-16H | 22784 | 16C | Median value of homes in 50 US states
Housing | 506 | 13C | Median value of homes in Boston suburbs
Machine | 209 | 6C | Relative CPU performance
Servo | 167 | 2N, 2C | Response time of a servo mechanism

The House-16H data set in table 4.1 is a particularly difficult problem. Apart from being large, the data set also has very large values for its attributes. The performance in terms of rule accuracy of the various data mining algorithms used later in this thesis is expected to be poor, as is shown by this and the next section. Because the adjusted coefficient of determination is used as a fitness function for the GPMCC method, the scaling of attributes will not improve the accuracy of the GPMCC method.

The function describing the Machine data set in table 4.1 is piecewise linear. Therefore, the GPMCC method is not expected to perform substantially better, in terms of the number of rules generated, than other methods.
In addition to the problems listed in table 4.1, the following problems were also used:

• The function example database represents a discontinuous function, defined by a number of standard polynomial expressions:

$$y = U(-1,1) + \begin{cases} 1.5x^2 - 7x + 2 & \text{if } (type = \text{'C'}) \wedge (x > 1) \\ x^3 + 400 & \text{if } (type = \text{'C'}) \wedge (x \leq 1) \\ 2x & \text{if } (type = \text{'A'}) \\ -2x & \text{if } (type = \text{'B'}) \end{cases}$$

where $x \in [-10, 10]$ and $type \in \{\text{'A'}, \text{'B'}, \text{'C'}\}$. The database consists of 1000 patterns, with 3 attributes per pattern.

• The Lena Image database represents a 128 x 128 grey-scale version of the famous "Lena" image, which is used for comparing different image compression techniques [88]. The data consists of 11 continuous-valued attributes, which represent a context (or footprint) for a given pixel. The objective of an approximation method is to infer rules between the context pixels and the target pixel. The database is large and consists of 16384 patterns.

• The Mono Sample database represents an approximately 4 second long sound clip, sampled in mono at 8000 hertz (the chorus of U2's "Pride (In the name of love)"). The database consists of 31884 patterns, with 5 attributes per pattern. The objective of an approximation method is to infer rules between a sample and a context of previous samples.

• The Stereo Sample database represents an approximately 4 second long sound clip, sampled in stereo at 8000 hertz (the chorus of U2's "Pride (In the name of love)"). The database consists of 31430 patterns, with 9 attributes per pattern. The objective of an approximation method is to infer rules between a sample in the left channel and a number of sample points in the left and right channels.

• The Time-series database represents a discontinuous application of components of the Rössler and Lorenz attractors (from section 3.4.1), the Hénon map and a polynomial term. The target value is:

$$\begin{cases} x_{n+1} & \text{if } (x_n \geq 0) \\ y_{n+1} & \text{if } (x_n < 0) \wedge (y_n \geq 0) \\ z_{n+1} & \text{if } (x_n < 0) \wedge (y_n < 0) \wedge (z_n \geq 0) \\ w_{n+1} & \text{if } (x_n < 0) \wedge (y_n < 0) \wedge (z_n < 0) \end{cases}$$

where $x_0, y_0, z_0, w_0 \sim U(-5, 5)$. The database consists of 1000 patterns, with 4 attributes per pattern, generated using the Runge-Kutta method with order 4 [16].

• the fragment lifetime, which controls how long unused fragments remain in the fragment pool,
• the leaf optimisation rate, which controls how often the GASOPE method is called to obtain a model for a terminal node,
• and fragment pool initialisation, which controls the initial terminal set of the GPMCC method.

The databases used to test the influence of key GPMCC parameters were the artificial databases of the previous section. The artificial databases were used because they are well defined, easily understandable and diverse. Four databases from the UCI machine learning repository have been selected, i.e. Abalone, Elevators, Federal Reserve Economic Data and House-16H, to provide additional problem diversity.

Each of the databases was split up into a training set, a validation set and a generalisation set. The training set was used to train the GPMCC method, the validation set was used to validate the models of the GASOPE method and the generalisation set was used to test the performance of the GPMCC method on unseen data. The training set for each database consisted of roughly 80% of the patterns, with the remainder of the patterns split evenly among the validation and generalisation sets (80% : 10% : 10%).

The GPMCC initialisation for each of the databases is shown by table 4.2. For all datasets the maximum polynomial order was set to 5, except for the House-16H dataset, for which the maximum polynomial order was set to 10. Generally speaking, the parameters prefixed by "Function" in table 4.2 have the same meaning as that of section 3.4.2 and control the behaviour of the model optimisation algorithm of the fragment pool.
These parameters are soft options for the GPMCC method, because they are automatically adjusted if they violate any of the restrictions of section 3.4.2. The parameters prefixed by "Decision" control the behaviour of the genetic program for generating model trees. In general, "consequent" refers to terminal nodes and "antecedent" refers to non-terminal nodes. The mnemonic "CA" stands for continuous antecedent, "NA" stands for nominal antecedent, "ME" stands for mutate expand and "MN" stands for mutate node (perturb).

Table 4.2
Parameter | Value
SyntaxMode | 1
Clusters | 30
ClusterEpochs | 10
FunctionMutationRate | 0.1
FunctionCrossoverRate | 0.2
FunctionGenerations | 100
FunctionIndividuals | 30
PolynomialOrder | 5
FunctionPercentageSampleSize | 0.01
FunctionMaximumComponents | 10
FunctionElite | 0.1
FunctionCutOff | 0.001
DecisionMaxNodes | 30
DecisionMEWorstVsAnyConsequent | 0.5
DecisionMECreateVsRedistributeLeafNodes | 0.5
DecisionMNAntecedentVsConsequent | 0.5
DecisionMNWorstVsAnyAntecedent | 0.5
DecisionMNWorstVsAnyConsequent | 0.5
DecisionReoptimizeVsSelectLeaf | 0.1
DecisionMutateExpand | 0.3
DecisionMutateShrink | 0.3
DecisionMutateNode | 0.3
DecisionMutateReinitialize | 0.1
DecisionNAAttributeVsClassOptimize | 0.2
DecisionCAAttributeOptimize | 0.1
DecisionCAClassOptimize | 0.6
DecisionCAConditionOptimize | 0.3
DecisionCAClassVsGaussian | 0.1
DecisionCAClassPartition | 0.1
DecisionCAConditionalPartition | 0.1
DecisionPoolNoClustersStart | 30
DecisionPoolNoClustersDivision | 2
DecisionPoolNoClusterEpochs | 1000
DecisionPoolFragmentLifeTime | 50
DecisionInitialPercentageSampleSize | 0.1
DecisionSampleAcceleration | 0.005
DecisionNoIndividuals | 100
DecisionNoGenerations | 10
DecisionElite | 0.0
DecisionMutationRateInitial | 0.2
DecisionMutationRateIncrement | 0.01
DecisionMutationRateMax | 0.6
DecisionCrossoverRate | 0.1
CrossValidation | 0

The GPMCC method utilises a variable mutation rate. The mutation rate is initially set to a default parameter ("DecisionMutationRateInitial").
Every time the accuracy of the best individual in a generation does not increase, the mutation rate is increased (by the amount specified by "DecisionMutationRateIncrement") up to a maximum mutation rate (given by "DecisionMutationRateMax"). This variable mutation rate helps to prevent stagnation. If the accuracy of the best individual does not improve over a number of generations, the increase in mutation rate injects more new genetic material into the population. However, if the accuracy of the best individual does improve, then the mutation rate is reset to the initial mutation rate.

A dagger, †, on the right hand side of results listed in tables 4.3 to 4.28 indicates the best result for a particular experiment set. In the event of ties between results, two other types of daggers are used: †† indicates that a particular result was judged the best of the experiment set using the adjusted coefficient of determination, and †o indicates that a particular result was judged the best of the experiment set using the mean squared error.

The fragment lifetime ("DecisionPoolFragmentLifeTime" in table 4.2) controls how long an unused GASOPE fragment remains in the fragment pool. Intuitively, a longer fragment lifetime results in a larger memory overhead to store unused fragments. A larger pool may result in slower convergence due to the increased number of fragments that could be selected at a leaf node. A smaller pool could result in sub-optimal convergence due to over-fitting.

Tables 4.3 - 4.7 show the effect of the fragment lifetime on the outcomes of the GPMCC method. Generally speaking, the fragment lifetime has a definite effect on the average simulation completion time $t$. A decrease in the fragment lifetime results in a decrease in the average simulation completion time. The fragment lifetime has no significant effect on any of the other outcomes (MSE, number of nodes, $R^2$, $R_a^2$ etc.). A fragment lifetime of 5 seems to result in over-fitting for some of the databases.
Abalone, House-16H, and Function Example (shown in table 4.3) show a significant increase in both the average generalisation mean squared error GMSE and the standard deviation of the mean squared error $\sigma_{GMSE}$. Even though the fragment lifetime has no significant effect on generalisation accuracy, four of the nine datasets obtained their best generalisation performance with a fragment lifetime of 50. Additionally, eight of the datasets were distributed between a fragment lifetime of 5 and 100. Therefore, a fragment lifetime of 50 appears to be the best choice for this parameter, since it is both consistent and computationally less expensive than the larger fragment lifetimes.

Table 4.3
Dataset | TMSE | $\sigma_{TMSE}$ | $R^2_T$ | $\sigma_{R^2_T}$ | $R^2_{a,T}$ | $\sigma_{R^2_{a,T}}$ | Nodes | $\sigma_{Nodes}$
Abalone | 4.5854 | 0.0275 | 0.5614 | 0.0026 | 0.5571 | 0.0025 | 4.0000 | 1.4622
Elevators | 0.0000 | 0.0000 | 0.8832 | 0.0065 | 0.8822 | 0.0063 | 7.8667 | 2.2087
Federal Reserve Economic Data | 0.0441 | 0.0032 | 0.9961 | 0.0003 | 0.9960 | 0.0003 | 5.5333 | 1.7367
Function Example | 0.4687 | 1.4012 | 1.0000 | 0.0000 | 1.0000 | 0.0000 | 7.8000 | 1.4480
House-16H | 1691964000.0000 | 63812800.0000 | 0.3916 | 0.0229 | 0.3887 | 0.0224 | 11.0667 | 3.3418
Lena Image | 121.2942 | 3.9576 | 0.9402 | 0.0020 | 0.9398 | 0.0019 | 13.8000 | 3.7729
Mono Sample | 301.4220 | 6.7250 | 0.6873 | 0.0070 | 0.6868 | 0.0070 | 10.2000 | 2.0745
Stereo Sample | 200.4300 | 1.0665 | 0.8241 | 0.0009 | 0.8239 | 0.0009 | 5.1333 | 1.2794
Timeseries | 1.1065 | 1.7963 | 0.9912 | 0.0143 | 0.9905 | 0.0155 | 7.0000 | 1.8937
Dataset | GMSE | $\sigma_{GMSE}$ | $R^2_G$ | $\sigma_{R^2_G}$ | $R^2_{a,G}$ | $\sigma_{R^2_{a,G}}$ | TMSE/GMSE | $t$
Abalone | 7.3979 | 10.1001 | 0.5126 | 0.1065 | 0.4765 | 0.1009 | 0.6198 | 1264.3700
Elevators | 0.0000 | 0.0000 | 0.8796 | 0.0076 | 0.8705 | 0.0078 | 1.0000 | 1873.5300
Federal Reserve Economic Data | 0.0521 | 0.0189 | 0.9954 | 0.0017 | 0.9928 | 0.0027 | 0.8477 | 636.3330
Function Example | 0.4804 | 1.4807 | 1.0000 | 0.0001 | 1.0000 | 0.0001 | 0.9757 | 953.1670
House-16H | 2320560000.0000 | 2949560000.0000 | 0.3606 | 0.0743 | 0.3375 | 0.0699 | 0.7291 | 2940.3000
Lena Image | 127.1936 | 6.7631 | 0.9377 | 0.0033 | 0.9348 | 0.0035 | 0.9536 | 1671.2700
Mono Sample | 302.9520 | 8.6692 | 0.6799 | 0.0092 | 0.6761 | 0.0093 | 0.9950 | 2437.8300
Stereo Sample | 200.3480 | 1.1636 | 0.8191 | 0.0011 | 0.8174 | 0.0010 | 1.0004 | 4584.1300
Timeseries | 1.0884 | 1.6558 | 0.9894 | 0.0162 | 0.9396 | 0.1840 | 1.0167 | 687.4000

Table 4.4
Dataset | TMSE | $\sigma_{TMSE}$ | $R^2_T$ | $\sigma_{R^2_T}$ | $R^2_{a,T}$ | $\sigma_{R^2_{a,T}}$ | Nodes | $\sigma_{Nodes}$
Abalone | 4.5661 | 0.0304 | 0.5632 | 0.0029 | 0.5588 | 0.0028 | 4.1333 | 1.2521
Elevators | 0.0000 | 0.0000 | 0.8811 | 0.0069 | 0.8802 | 0.0068 | 6.9333 | 2.1961
Federal Reserve Economic Data | 0.0450 | 0.0022 | 0.9961 | 0.0002 | 0.9959 | 0.0002 | 5.3333 | 1.0613
Function Example | 0.1004 | 0.0032 | 1.0000 | 0.0000 | 1.0000 | 0.0000 | 7.8667 | 1.3578
House-16H | 1706552000.0000 | 68102200.0000 | 0.3864 | 0.0245 | 0.3836 | 0.0243 | 8.7333 | 2.4486
Lena Image | 119.9730 | 3.6204 | 0.9408 | 0.0018 | 0.9405 | 0.0017 | 13.3333 | 3.6040
Mono Sample | 301.6280 | 5.3602 | 0.6871 | 0.0056 | 0.6866 | 0.0055 | 9.3333 | 2.1064
Stereo Sample | 201.4300 | 2.1446 | 0.8233 | 0.0019 | 0.8231 | 0.0019 | 5.0000 | 1.7420
Timeseries | 1.1449 | 1.4470 | 0.9909 | 0.0115 | 0.9902 | 0.0124 | 7.2000 | 1.8458
Dataset | GMSE | $\sigma_{GMSE}$ | $R^2_G$ | $\sigma_{R^2_G}$ | $R^2_{a,G}$ | $\sigma_{R^2_{a,G}}$ | TMSE/GMSE | $t$
Abalone | 5.1339 | 0.4080 | 0.5660 | 0.0345 | 0.5279 | 0.0390 | 0.8894 | 1376.5300
Elevators | 0.0000 | 0.0000 | 0.8771 | 0.0088 | 0.8696 | 0.0084 | 1.0000 | 2045.0700
Federal Reserve Economic Data | 0.0477 | 0.0034 | 0.9958 | 0.0003 | 0.9938 | 0.0008 | 0.9418 | 743.3330
Function Example | 0.0915 | 0.0034 | 1.0000 | 0.0000 | 1.0000 | 0.0000 | 1.0974 | 967.4000
House-16H | 1775414000.0000 | 89054600.0000 | 0.3755 | 0.0313 | 0.3529 | 0.0330 | 0.9612 | 3442.2700
Lena Image | 125.8572 | 4.6576 | 0.9383 | 0.0023 | 0.9354 | 0.0025 | 0.9532 | 1894.8000
Mono Sample | 302.7020 | 6.9376 | 0.6802 | 0.0073 | 0.6765 | 0.0073 | 0.9965 | 2740.8000
Stereo Sample | 201.5580 | 2.5642 | 0.8180 | 0.0023 | 0.8163 | 0.0024 | 0.9994 | 4675.4700
Timeseries | 0.9168 | 1.0918 | 0.9910 | 0.0107 | 0.9691 | 0.0546 | 1.2489 | 795.3670

Table 4.5
Dataset | TMSE | $\sigma_{TMSE}$ | $R^2_T$ | $\sigma_{R^2_T}$ | $R^2_{a,T}$ | $\sigma_{R^2_{a,T}}$ | Nodes | $\sigma_{Nodes}$
Abalone | 4.5658 | 0.0355 | 0.5632 | 0.0034 | 0.5587 | 0.0032 | 3.9333 | 1.1427
Elevators | 0.0000 | 0.0000 | 0.8827 | 0.0078 | 0.8819 | 0.0076 | 6.7333 | 2.3916
Federal Reserve Economic Data | 0.0452 | 0.0019 | 0.9961 | 0.0002 | 0.9959 | 0.0002 | 5.1333 | 1.0417
Function Example | 0.1001 | 0.0025 | 1.0000 | 0.0000 | 1.0000 | 0.0000 | 7.4000 | 0.8137
House-16H | 1708182000.0000 | 59607800.0000 | 0.3858 | 0.0214 | 0.3831 | 0.0214 | 9.1333 | 3.0596
Lena Image | 121.5576 | 3.4550 | 0.9400 | 0.0017 | 0.9397 | 0.0017 | 12.4000 | 3.3280
Mono Sample | 300.2160 | 5.0925 | 0.6885 | 0.0053 | 0.6881 | 0.0053 | 10.0000 | 1.9476
Stereo Sample | 201.3880 | 2.4986 | 0.8233 | 0.0022 | 0.8231 | 0.0022 | 4.6667 | 2.0398
Timeseries | 1.3721 | 1.4380 | 0.9891 | 0.0115 | 0.9882 | 0.0124 | 6.6667 | 1.4933
Dataset | GMSE | $\sigma_{GMSE}$ | $R^2_G$ | $\sigma_{R^2_G}$ | $R^2_{a,G}$ | $\sigma_{R^2_{a,G}}$ | TMSE/GMSE | $t$
Abalone | 16.0463 | 59.4182 | 0.5418 | 0.1074 | 0.5046 | 0.1009 | 0.2846 | 1471.9700
Elevators | 0.0000 | 0.0000 | 0.8776 | 0.0092 | 0.8700 | 0.0081 | 1.0000 | 2062.2000
Federal Reserve Economic Data | 0.0476 | 0.0051 | 0.9958 | 0.0005 | 0.9936 | 0.0013 | 0.9501 | 845.3670
Function Example | 0.0908 | 0.0030 | 1.0000 | 0.0000 | 1.0000 | 0.0000 | 1.1017 | 1008.2700
House-16H | 1816690000.0000 | 257592000.0000 | 0.3642 | 0.0737 | 0.3434 | 0.0707 | 0.9403 | 3571.0700
Lena Image | 127.4858 | 5.1076 | 0.9375 | 0.0025 | 0.9350 | 0.0027 | 0.9535 | 2024.3700
Mono Sample | 300.9340 | 6.2382 | 0.6821 | 0.0066 | 0.6783 | 0.0065 | 0.9976 | 2773.1000
Stereo Sample | 201.5300 | 2.9388 | 0.8181 | 0.0027 | 0.8165 | 0.0027 | 0.9993 | 4819.2300
Timeseries | 1.2856 | 1.2400 | 0.9874 | 0.0121 | 0.9301 | 0.1829 | 1.0673 | 859.4000

Table 4.6
Dataset | TMSE | $\sigma_{TMSE}$ | $R^2_T$ | $\sigma_{R^2_T}$ | $R^2_{a,T}$ | $\sigma_{R^2_{a,T}}$ | Nodes | $\sigma_{Nodes}$
Abalone | 4.5606 | 0.0409 | 0.5637 | 0.0039 | 0.5592 | 0.0034 | 4.2000 | 1.6274
Elevators | 0.0000 | 0.0000 | 0.8823 | 0.0054 | 0.8813 | 0.0053 | 7.6667 | 2.1867
Federal Reserve Economic Data | 0.0447 | 0.0021 | 0.9961 | 0.0002 | 0.9959 | 0.0002 | 5.4000 | 1.4288
Function Example | 0.1005 | 0.0028 | 1.0000 | 0.0000 | 1.0000 | 0.0000 | 7.4667 | 1.2521
House-16H | 1730830000.0000 | 49356800.0000 | 0.3777 | 0.0177 | 0.3750 | 0.0174 | 9.0667 | 2.7029
Lena Image | 121.6324 | 3.4741 | 0.9400 | 0.0017 | 0.9397 | 0.0017 | 12.2667 | 2.8031
Mono Sample | 300.9200 | 5.6947 | 0.6878 | 0.0059 | 0.6873 | 0.0059 | 9.5333 | 2.0965
Stereo Sample | 202.2480 | 2.6624 | 0.8225 | 0.0023 | 0.8223 | 0.0023 | 4.6667 | 1.4933
Timeseries | 1.1618 | 1.4913 | 0.9907 | 0.0119 | 0.9901 | 0.0128 | 6.7333 | 1.5522
Dataset | GMSE | $\sigma_{GMSE}$ | $R^2_G$ | $\sigma_{R^2_G}$ | $R^2_{a,G}$ | $\sigma_{R^2_{a,G}}$ | TMSE/GMSE | $t$
Abalone | 5.3216 | 0.8711 | 0.5501 | 0.0736 | 0.5088 | 0.0866 | 0.8570 | 1483.3700
Elevators | 0.0000 | 0.0000 | 0.8777 | 0.0068 | 0.8693 | 0.0061 | 1.0000 | 2117.2000
Federal Reserve Economic Data | 0.0591 | 0.0373 | 0.9948 | 0.0033 | 0.9919 | 0.0050 | 0.7555 | 826.8670
Function Example | 0.0919 | 0.0040 | 1.0000 | 0.0000 | 1.0000 | 0.0000 | 1.0933 | 1147.8700
House-16H | 1826146000.0000 | 124870200.0000 | 0.3576 | 0.0439 | 0.3351 | 0.0484 | 0.9478 | 3542.1300
Lena Image | 129.2850 | 9.8847 | 0.9366 | 0.0048 | 0.9339 | 0.0050 | 0.9408 | 2066.8000
Mono Sample | 302.1420 | 7.3790 | 0.6808 | 0.0078 | 0.6771 | 0.0077 | 0.9960 | 2845.7000
Stereo Sample | 203.5020 | 5.1323 | 0.8163 | 0.0046 | 0.8147 | 0.0048 | 0.9938 | 5001.7700
Timeseries | 0.9247 | 1.2046 | 0.9910 | 0.0118 | 0.9612 | 0.1351 | 1.2564 | 915.8330

Table 4.7
Dataset | TMSE | $\sigma_{TMSE}$ | $R^2_T$ | $\sigma_{R^2_T}$ | $R^2_{a,T}$ | $\sigma_{R^2_{a,T}}$ | Nodes | $\sigma_{Nodes}$
Abalone | 4.5682 | 0.0340 | 0.5630 | 0.0033 | 0.5588 | 0.0030 | 3.6667 | 1.2130
Elevators | 0.0000 | 0.0000 | 0.8819 | 0.0074 | 0.8810 | 0.0072 | 7.0667 | 2.4344
Federal Reserve Economic Data | 0.0449 | 0.0028 | 0.9961 | 0.0002 | 0.9959 | 0.0002 | 5.0000 | 1.3896
Function Example | 0.1004 | 0.0030 | 1.0000 | 0.0000 | 1.0000 | 0.0000 | 7.4000 | 1.1017
House-16H | 1705792000.0000 | 65710600.0000 | 0.3867 | 0.0236 | 0.3839 | 0.0230 | 9.5333 | 3.5982
Lena Image | 122.2352 | 3.7588 | 0.9397 | 0.0019 | 0.9394 | 0.0018 | 12.1333 | 3.2242
Mono Sample | 299.7100 | 4.0271 | 0.6890 | 0.0042 | 0.6886 | 0.0042 | 9.2667 | 1.5522
Stereo Sample | 200.5820 | 1.7623 | 0.8240 | 0.0015 | 0.8238 | 0.0015 | 4.6667 | 1.8257
Timeseries | 1.3141 | 1.5510 | 0.9895 | 0.0124 | 0.9889 | 0.0130 | 6.2667 | 1.8557
Dataset | GMSE | $\sigma_{GMSE}$ | $R^2_G$ | $\sigma_{R^2_G}$ | $R^2_{a,G}$ | $\sigma_{R^2_{a,G}}$ | TMSE/GMSE | $t$
Abalone | 5.5379 | 1.6687 | 0.5383 | 0.1071 | 0.5023 | 0.1012 | 0.8249 | 1575.8000
Elevators | 0.0000 | 0.0000 | 0.8782 | 0.0094 | 0.8702 | 0.0086 | 1.0000 | 2172.7300
Federal Reserve Economic Data | 0.0518 | 0.0218 | 0.9954 | 0.0019 | 0.9932 | 0.0032 | 0.8667 | 904.9330
Function Example | 0.0912 | 0.0036 | 1.0000 | 0.0000 | 1.0000 | 0.0000 | 1.1007 | 1198.7000
House-16H | 1856478000.0000 | 390526000.0000 | 0.3592 | 0.0728 | 0.3374 | 0.0685 | 0.9188 | 3578.9000
Lena Image | 130.0164 | 9.8415 | 0.9363 | 0.0048 | 0.9336 | 0.0051 | 0.9402 | 2017.7300
Mono Sample | 300.2380 | 4.6945 | 0.6828 | 0.0050 | 0.6792 | 0.0050 | 0.9982 | 2799.3000
Stereo Sample | 200.5640 | 2.1152 | 0.8189 | 0.0019 | 0.8173 | 0.0020 | 1.0001 | 5066.8300
Timeseries | 1.1545 | 1.1890 | 0.9887 | 0.0116 | 0.9740 | 0.0307 | 1.1382 | 982.1000

The leaf optimisation rate ("DecisionReoptimizeVsSelectLeaf" in table 4.2) controls the rate at which a leaf node is optimised using the GASOPE method. The GASOPE method takes approximately 1.5 seconds to perform an optimisation (refer to section 3.4.3). A larger leaf optimisation rate should thus result in the GPMCC method taking longer to complete each simulation. Tables 4.8 - 4.11 show the effect of the leaf optimisation rate on the outcomes of the GPMCC method. As expected, an increase in the leaf optimisation rate resulted in an increase in the average simulation completion time $t$.
An increase in the leaf optimisation rate also resulted in an increase in training accuracy (TMSE, $R^2_T$ and $R^2_{a,T}$). However, there was no significant increase in generalisation accuracy (GMSE, $R^2_G$ and $R^2_{a,G}$), except in the case of the Abalone data set, where the generalisation accuracy actually decreased. In table 4.11, over-fitting was observed in some instances (Abalone, Federal Reserve Economic Data and Timeseries), but not in others (House-16H, Function Example, Lena Image, Mono Sample and Stereo Sample). Interestingly, seven of the nine databases achieved the best training-generalisation ratio TMSE/GMSE for a leaf optimisation value of 0.05 (shown in table 4.8). This clearly shows that the generalisation accuracy is not significantly affected by an increase in the leaf optimisation rate. A low leaf optimisation rate is clearly desired, because the performance gain in terms of the average simulation completion time outweighs any increase in training performance, particularly if there was no significant increase in generalisation performance. For this reason, a leaf optimisation rate of 0.05 appears to be the best choice for this parameter.
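The time cost of the leaf optimisation rate discussed above can be illustrated with a small Python sketch. It rests on assumptions: the function names are hypothetical, the rate is treated as the probability that a leaf perturbation calls the optimiser rather than selecting an existing fragment from the pool (a reading suggested by the parameter name "DecisionReoptimizeVsSelectLeaf"), and the 1.5 seconds per call is the approximate figure quoted in the text.

```python
import random


def choose_leaf_action(rate: float, rng: random.Random) -> str:
    """Return 'reoptimise' with probability `rate`, otherwise 'select'.

    'reoptimise' stands for a (hypothetical) call to the leaf optimiser,
    'select' for picking an existing fragment from the pool.
    """
    return "reoptimise" if rng.random() < rate else "select"


def expected_optimisation_seconds(rate: float,
                                  leaf_perturbations: int,
                                  seconds_per_call: float = 1.5) -> float:
    """Expected time spent in the optimiser over a number of leaf perturbations."""
    return rate * leaf_perturbations * seconds_per_call


# Usage: compare the optimiser overhead of two rates over 1000 leaf perturbations.
low = expected_optimisation_seconds(0.05, 1000)
high = expected_optimisation_seconds(0.5, 1000)
```

Under these assumptions the expected overhead grows linearly with the rate, which is consistent with the observation that a lower leaf optimisation rate shortens the average simulation completion time.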
Windowing is controlled by two parameters; the initial window size as a percentage of the total number of training patterns ("DecisionInitialPercentageSampleSize" in table 4.2) and the acceleration rate by which patterns are injected into the window ("DecisionSampleAccelera- Dataset TMSE crTMSE R2T R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5836 0.5615 0.5572 4.0000 0.0000 0.8805 0.8797 6.5333 0.0452 0.9960 0.9959 5.0000 0.1000 1.0000 1.0000 7.6000 1698136000.0000 0.3894 0.3865 10.2667 122.2988 0.9397 0.9394 12.8000 301.5540 0.6871 0.6867 9.6667 201.3500 0.8233 0.8231 4.9333 1.4509 0.9884 0.9874 6.4000 t t t GMSE crGMSE crR} R2G crR~T R~G crNodes t 0.0308 0.0029 0.0028 1.5536 0.0000 0.0062 0.0061 1.6344 0.0034 0.0003 0.0003 1.2865 0.0034 0.0000 0.0000 1.1919 68116800.0000 0.0245 0.0242 3.7318 3.4490 0.0017 0.0017 3.2947 6.3549 0.0066 0.0066 2.1227 1.6139 0.0014 0.0014 1.7006 2.0606 0.0164 0.0181 1.4044 6.4099 0.5426 0.5071 991.6330 0.0000 0.8783 0.8709 1648.6700 0.0480 0.9957 0.9935 485.3330 0.0910 1.0000 1.0000 539.4330 1807944000.0000 0.3640 0.3395 2762.8700 128.1206 0.9372 0.9346 1681.1700 302.2460 0.6807 0.6769 2414.7300 201.4480 0.8181 0.8165 3953.2300 1.0484 0.9897 0.8855 474.0330 crR~ crR~G TMSE t t t t t t t t t t t t "f'fMS'F. 
6.7010 0.1079 0.1027 0.7151 0.0000 0.0084 0.0077 1.0000 0.0068 0.0006 0.0014 0.9428 0.0035 0.0000 0.0000 1.0997 113766200.0000 0.0400 0.0463 0.9393 4.3605 0.0021 0.0012 0.9546 8.1705 0.0086 0.0086 0.9977 2.1278 0.0019 0.0021 0.9995 1.4091 0.0138 0.3011 1.3840 tt t t t t t t Dataset TMSE R2T R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5705 0.5628 0.5580 4.2000 0.0000 0.8799 0.8790 6.3333 0.0447 0.9961 0.9959 5.7333 0.0997 1.0000 1.0000 7.7333 1715448000.0000 0.3832 0.3807 8.6000 120.0606 0.9408 0.9404 13.0667 301.0520 0.6877 0.6872 8.8667 201.6580 0.8231 0.8229 5.0667 1.4052 0.9888 0.9880 6.7333 t t t CJTMSE GMSE CJGMSE CJR} R2 CJRb CJR~T R~G G CJ 2 CJNodes t 0.0355 0.0034 0.0033 1.7100 0.0000 0.0053 0.0051 2.1867 0.0029 0.0003 0.0003 1.3374 0.0035 0.0000 0.0000 1.4368 61350600.0000 0.0221 0.0218 2.5407 3.4722 0.0017 0.0017 3.5422 5.7343 0.0060 0.0059 1.8889 2.9872 0.0026 0.0026 1.7006 1.9056 0.0152 0.0164 2.0833 5.2330 0.5576 0.5156 1436.9000 0.0000 0.8748 0.8674 2128.3300 0.0485 0.9957 0.9932 794.2670 0.0915 1.0000 1.0000 996.5330 1787912000.0000 0.3711 0.3511 3481.7300 125.8182 0.9383 0.9354 2017.5700 302.2340 0.6807 0.6771 2803.7000 202.6600 0.8170 0.8153 4757.1300 1.0292 0.9899 0.9694 819.9670 ~ R TMSE ~ t t t t t t 0.6863 0.0580 0.0680 0.8734 0.0000 0.0050 0.0048 1.0000 0.0055 0.0005 0.0013 0.9223 0.0040 0.0000 0.0000 1.0918 90663800.0000 0.0319 0.0314 0.9595 5.7197 0.0028 0.0028 0.9542 7.2463 0.0077 0.0075 0.9961 5.4989 0.0050 0.0050 0.9951 1.3389 0.0131 0.0696 1.3654 t t Dataset TMSE R2T R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5632 0.5635 0.5592 3.7333 0.0000 0.8847 0.8837 7.9333 0.0443 0.9961 0.9960 5.3333 0.0999 1.0000 1.0000 7.4667 1681026000.0000 0.3956 0.3929 8.7333 118.9782 0.9413 0.9410 12.9333 300.5480 0.6882 0.6877 9.4000 201.2660 0.8234 0.8232 4.3333 1.3655 0.9891 
0.9883 6.4667 aTMSE GMSE aGMSE aR} R2 aRb aR~T R~G aNodes t t t t t t 0.0300 0.0029 0.0025 1.2299 0.0000 0.0071 0.0068 3.0050 0.0028 0.0002 0.0002 1.5830 0.0031 0.0000 0.0000 1.0080 63924800.0000 0.0230 0.0228 2.8154 3.0875 0.0015 0.0015 3.5809 5.1244 0.0053 0.0053 1.8495 1.8135 0.0016 0.0016 1.2130 1.6599 0.0132 0.0141 1.5698 G aR~G TMSE t 6.8353 0.5367 0.5010 2480.1300 0.0000 0.8809 0.8723 2898.1700 0.0496 0.9956 0.9934 1494.1000 0.0909 1.0000 1.0000 2015.8300 2019260000.0000 0.3654 0.3437 4715.7700 126.2072 0.9382 0.9353 2626.0700 301.2480 0.6817 0.6780 3468.6000 201.4240 0.8182 0.8166 6513.7000 1.2416 0.9879 0.9682 1682.8300 tT t t r.MSF 8.6463 0.1142 0.1089 0.6676 0.0000 0.0089 0.0078 1.0000 0.0078 0.0007 0.0013 0.8927 0.0038 0.0000 0.0000 1.0990 1377860000.0000 0.0762 0.0729 0.8325 5.8012 0.0028 0.0029 0.9427 6.7504 0.0071 0.0072 0.9977 1.9861 0.0018 0.0019 0.9992 1.2834 0.0126 0.0380 1.0998 Dataset TMSE crTMSE R2 T crR} R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5585 0.5640 0.5594 4.0000 0.0000 0.8847 0.8837 7.6000 0.0427 0.9963 0.9961 5.3333 0.0988 1.0000 1.0000 7.8667 1651170000.0000 0.4063 0.4034 10.3333 117.2918 0.9421 0.9418 14.7333 298.7060 0.6901 0.6896 10.5333 200.4060 0.8242 0.8240 4.7333 1.4194 0.9887 0.9880 5.8667 crR~T crNodes t t t tt t t t t t t to to t t t t t t t t t t t t 0.0221 0.0021 0.0018 1.1447 0.0000 0.0058 0.0057 2.2376 0.0031 0.0003 0.0003 1.5830 0.0031 0.0000 0.0000 1.3578 79038000.0000 0.0284 0.0280 2.6436 4.8566 0.0024 0.0023 4.6307 3.8851 0.0040 0.0041 2.5015 0.8277 0.0007 0.0007 1.7991 1.6810 0.0134 0.0141 1.7953 GMSE crGMSE R2 crRb G R~G crR~G TMSE t 25.8332 0.5183 0.4785 4367.6300 0.0000 0.8798 0.8707 5008.1700 0.0500 0.9956 0.9930 3115.8300 0.0904 1.0000 1.0000 3809.7000 1750206000.0000 0.3843 0.3596 6268.5300 123.7744 0.9393 0.9361 3968.4000 299.6060 0.6835 0.6793 4645.1700 200.6060 0.8189 0.8172 9672.4300 1.3092 0.9872 0.9740 
3558.8300 ~ 111.4494 0.1117 0.1070 0.1765 t to to t t t t t t t t t t t t 0.0000 0.0077 0.0080 1.0000 0.0057 0.0005 0.0013 0.8537 0.0030 0.0000 0.0000 1.0925 119859600.0000 0.0422 0.0411 0.9434 7.4047 0.0036 0.0035 0.9476 5.1041 0.0054 0.0059 0.9970 0.8389 0.0008 0.0006 0.9990 1.3016 0.0127 0.0234 1.0842 Dataset TMSE crTMSE R2T R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries GMSE crGMSE crR} R2G crR~T R~G Nodes crNodes t 4.5765 0.5622 0.5581 3.5333 0.0000 0.8819 0.8809 7.1333 0.0457 0.9960 0.9958 4.9333 0.1008 1.0000 1.0000 7.3333 1719960000.0000 0.3816 0.3790 8.4667 121.0116 0.9403 0.9400 12.8667 301.8660 0.6868 0.6864 9.2667 202.0540 0.8227 0.8225 4.1333 1.4959 0.9881 0.9871 6.7333 0.0363 0.0035 0.0033 0.8996 0.0000 0.0060 0.0059 1.8144 0.0026 0.0002 0.0002 1.4368 0.0027 0.0000 0.0000 0.7581 58444000.0000 0.0210 0.0207 3.0596 3.4832 0.0017 0.0017 3.9631 6.6957 0.0070 0.0069 2.2118 2.9447 0.0026 0.0026 1.5477 1.8987 0.0151 0.0164 1.7207 10.6855 0.5241 0.4907 1322.4300 0.0000 0.8783 0.8701 1896.2700 0.0492 0.9956 0.9935 745.0670 0.0926 1.0000 1.0000 836.9670 2192540000.0000 0.3427 0.3232 3128.1000 126.7218 0.9379 0.9351 1798.3300 302.8780 0.6800 0.6765 2390.7300 202.2560 0.8174 0.8160 4334.7700 1.1953 0.9883 0.9532 757.9330 t t t t t crR~ crR~G TMSE r.M"W t t t t t t t t t 25.6446 0.1451 0.1368 0.4283 0.0000 0.0075 0.0070 1.0000 0.0078 0.0007 0.0013 0.9272 0.0033 0.0000 0.0000 1.0886 1932294000.0000 0.0954 0.0903 0.7845 6.5353 0.0032 0.0032 0.9549 8.5772 0.0091 0.0090 0.9967 3.6260 0.0033 0.0033 0.9990 1.3302 0.0130 0.0883 1.2515 t t t Dataset TMSE R2T R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5640 0.5634 0.5589 4.2000 0.0000 0.8845 0.8836 7.0000 0.0435 0.9962 0.9960 5.7333 0.0999 1.0000 1.0000 7.6667 1694530000.0000 0.3907 0.3880 9.2667 119.4946 0.9411 0.9407 13.9333 300.2680 0.6885 0.6880 
9.7333 200.9620 0.8237 0.8235 4.4667 0.9358 0.9925 0.9920 7.1333 crTMSE GMSE crGMSE crR} R2 crRb crR~T R~G crNodes 0.0382 0.0037 0.0036 1.3493 tT t t 0.0000 0.0084 0.0082 2.4635 0.0025 0.0002 0.0002 1.3374 0.0033 0.0000 0.0000 1.0933 77536200.0000 0.0279 0.0274 3.8141 4.0812 0.0020 0.0019 3.9908 5.3172 0.0055 0.0055 1.8557 1.1458 0.0010 0.0010 1.2794 1.3316 0.0106 O.oI14 1.8144 G cr 2 t ~R TMSE m:.rsF. 5.2366 0.5573 0.5192 1595.7700 0.0000 0.8803 0.8726 2151.3000 0.0567 0.9950 0.9922 849.7670 0.0912 1.0000 1.0000 1157.3300 1791446000.ססOO 0.3698 0.3468 3746.1000 127.8456 0.9374 0.9344 1990.3000 301.3240 0.6816 0.6779 2686.7700 201.0280 0.8185 0.8170 5559.3700 0.7544 0.9926 0.9697 855.1000 0.6709 0.0567 0.0598 0.8716 0.0000 0.0104 0.0094 1.0000 0.0264 0.0023 0.0033 0.7679 0.0041 0.0000 0.0000 1.0954 120896200.0000 0.0425 0.0403 0.9459 8.5962 0.0042 0.0043 0.9347 6.7600 0.0071 0.0073 0.9965 1.6687 0.0015 0.0018 0.9997 1.0153 0.0099 0.0792 1.2406 Dataset TMSE crTMSE R2T crR} R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5545 0.5643 0.5596 4.2667 0.0000 0.8811 0.8803 7.2000 0.0441 0.9961 0.9960 5.0667 0.1004 1.0000 1.0000 7.7333 1699556000.0000 0.3889 0.3862 9.9333 119.3798 0.9411 0.9408 14.2667 300.4640 0.6883 0.6878 10.6000 200.2760 0.8243 0.8241 4.8667 1.1903 0.9905 0.9897 6.8667 crR~T crNodes t t t 0.0264 0.0025 0.0024 1.3374 0.0000 0.0078 0.0077 2.4269 0.0024 0.0002 0.0002 1.6174 0.0026 0.0000 0.0000 1.2299 63065400.0000 0.0227 0.0221 3.0505 3.8368 0.0019 0.0019 4.1184 5.1975 0.0054 0.0054 2.0611 1.3597 0.0012 0.0011 2.161310 1.3186 0.0105 0.0117 1.2794 GMSE crGMSE R2G crRb cr 2 ~R TMSE ~ R~G t 11.7964 0.5382 0.4987 1741.9000 0.0000 0.8764 0.8687 2309.2300 0.0497 0.9956 0.9935 931.5000 0.0900 1.0000 1.0000 1367.6000 1897644000.0000 0.3536 0.3322 3856.1700 125.1458 0.9387 0.9356 2085.0700 301.5520 0.6814 0.6774 2833.8700 200.3160 0.8192 0.8175 5928.9700 1.0331 0.9899 0.9157 
954.6000 t t t 35.9028 0.1113 0.1051 0.3861 0.0000 0.0085 0.0077 1.0000 0.0094 0.0008 0.0014 0.8879 0.0031 0.0000 0.0000 1.1159 532012000.0000 0.0782 0.0744 0.8956 6.5550 0.0032 0.0034 0.9539 6.2393 0.0066 0.0066 0.9964 1.3913 0.0013 0.0011 0.9980 1.0850 0.0106 0.2131 1.1521 Dataset Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries OTMSE GMSE °GMSE R2T OR} R2 G ORb R~T °R~T R~G Nodes 4.5529 0.5645 0.5600 4.1333 0.0000 0.8804 0.8796 6.6000 0.0436 0.9962 0.9960 5.6667 0.1001 1.0000 1.0000 8.2000 1687488000.0000 0.3932 0.3906 10.0000 119.0428 0.9413 0.9409 13.8000 300.5620 0.6882 0.6877 10.0000 201.5460 0.8232 0.8230 4.8000 0.4407 0.9965 0.9963 7.1333 ONodes t °R~G TMSE r.MIT 0.0370 0.0035 0.0030 1.2521 0.0000 0.0078 0.0076 2.429700 0.0021 0.0002 0.0002 1.4223 0.0033 0.0000 0.0000 1.7889 70731200.0000 0.0254 0.0252 3.2270 4.1056 0.0020 0.0020 3.6237 5.7319 0.0059 0.0059 2.6130 3.0197 0.0027 0.0026 1.4239 0.6411 0.0051 0.0054 1.0417 5.1910 0.5611 0.5225 1749.6700 0.0000 0.8752 0.8678 2514.7700 0.0522 0.9954 0.9930 927.9000 0.0893 1.0000 1.0000 1308.5000 1761580000.0000 0.3803 0.3594 3861.4000 127.3112 0.9376 0.9347 2102.4700 302.1180 0.6808 0.6767 3106.6300 201.8520 0.8178 0.8161 6203.1700 0.3631 0.9964 0.9913 904.2330 0.4104 0.0347 0.0339 0.8771 0.0000 0.0087 0.0082 1.0000 0.0111 0.0010 0.0017 0.8356 0.0046 0.0000 0.0000 1.1204 86005600.0000 0.0303 0.0295 0.9579 8.0203 0.0039 0.0037 0.9351 7.1440 0.0075 0.0077 0.9948 3.7076 0.0033 0.0033 0.9985 0.4587 0.0045 0.0171 1.2137 TMSE t t t t t t t t t t t to to t t t t t t Dataset TMSE R2T R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5658 0.5632 0.5586 4.0000 0.0000 0.8834 0.8825 7.0667 0.0438 0.9962 0.9960 5.5333 0.1004 1.0000 1.0000 7.5333 1708888000.0000 0.3855 0.3829 9.6000 121.3112 0.9402 0.9398 13.4667 301.8080 0.6869 0.6864 9.5333 200.8280 0.8238 0.8236 
4.4000 1.2285 0.9902 0.9894 6.3333 <rTMSE GMSE <rGMSE erR} R2 <rRb <rR~T R~G <rNodes t 0.0338 0.0032 0.0030 1.1447 0.0000 0.0062 0.0060 1.9989 0.0026 0.0002 0.0002 1.4794 0.0026 0.0000 0.0000 1.0417 74731200.0000 0.0269 0.0267 2.3577 3.4365 0.0017 0.0017 3.8483 6.3805 0.0066 0.0066 1.8889 1.7459 0.0015 0.0015 1.5888 1.5279 0.0122 0.0133 1.3218 G <r 2 ~R TMSE ~ t 10.1196 0.5315 0.4922 1455.6300 0.0000 0.8790 0.8711 2100.6700 0.0496 0.9956 0.9933 771.4000 0.0915 1.0000 1.0000 1004.9700 1769664000.0000 0.3775 0.3564 3297.7000 126.8750 0.9378 0.9348 1951.9700 303.3280 0.6795 0.6758 2722.3000 200.9520 0.8186 0.8171 4850.8300 1.0462 0.9898 0.9400 839.3330 t 26.2766 0.1207 0.1191 0.4512 0.0000 0.0079 0.0077 1.0000 0.0068 0.0006 0.0012 0.8815 0.0030 0.0000 0.0000 1.0975 104933800.0000 0.0369 0.0364 0.9657 4.8966 0.0024 0.0024 0.9561 8.5357 0.0090 0.0090 0.9950 2.1067 0.0019 0.0019 0.9994 1.2766 0.0125 0.1847 1.1742 t Dataset TMSE R2T R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5643 0.5634 0.5588 4.0000 0.0000 0.8810 0.8801 6.7333 0.0431 0.9962 0.9961 6.0000 2.3835 1.0000 1.0000 8.1333 1705032000.0000 0.3869 0.3842 9.3333 119.9782 0.9408 0.9405 12.5333 302.7440 0.6859 0.6855 9.4000 200.5580 0.8240 0.8238 5.0667 0.9743 0.9922 0.9917 6.8000 C5TMSE GMSE C5R} R2 C5 R~G C5R2 C5R~T C5Nodes t 0.0285 0.0027 0.0024 1.0171 0.0000 0.0074 0.0073 1.8742 0.0033 0.0003 0.0003 1.2595 11.4882 0.0004 0.0004 1.5477 68749600.0000 0.0247 0.0244 3.0663 4.1011 0.0020 0.0020 3.4314 6.6125 0.0069 0.0068 2.5407 1.0425 0.0009 0.0009 1.6174 1.3699 0.0109 0.0116 1.5177 C5GMSE G Rb ~ TMSE ~ t 6.5100 0.5402 0.5019 1654.7300 0.0000 0.8774 0.8698 2291.3000 0.0482 0.9957 0.9933 902.0670 0.3013 1.0000 1.0000 1228.1700 1767764000.0000 0.3782 0.3558 3954.6000 203.0900 0.9053 0.9025 2107.5000 304.9620 0.6778 0.6741 3180.5700 200.5260 0.8190 0.8173 5454.8000 0.9567 0.9906 0.9761 915.3000 t t t 7.0911 0.1087 0.1038 0.7011 
0.0000 0.0094 0.0087 1.0000 0.0049 0.0004 0.0010 0.8937 1.1123 0.0000 0.0000 7.9112 90879600.0000 0.0320 0.0302 0.9645 403.0240 0.1711 0.1705 0.5908 8.3493 0.0088 0.0086 0.9927 1.2339 0.0011 0.0012 1.0002 1.3130 0.0128 0.0315 1.0184 t t Dataset TMSE R2T R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timesenes Nodes 4.5585 0.5639 0.5594 3.9333 0.0000 0.8841 0.8832 7.4000 0.0441 0.9962 0.9960 5.4667 0.0999 1.0000 1.0000 8.1333 1698038000.0000 0.3894 0.3866 10.1333 119.2130 0.9412 0.9409 14.3333 300.2740 0.6885 0.6880 10.1333 201.4200 0.8233 0.8231 5.4667 1.1575 0.9908 0.9900 6.8667 aTMSE aR} aR~T aNodes 0.0377 0.0036 0.0029 1.1427 0.0000 0.0087 0.0085 2.3134 0.0029 0.0003 0.0002 1.5477 0.0030 0.0000 0.0000 1.2521 73046600.0000 0.0263 0.0259 3.0027 3.8666 0.0019 0.0019 3.6515 5.1783 0.0054 0.0053 2.3887 2.3580 0.0021 0.0021 1.5477 1.5759 0.0126 0.0138 1.4794 GMSE aGMSE R2 G aRb aR2 R~G t ~ TMSE r.mF 5.1005 0.5688 0.5310 1733.3700 0.0000 0.8807 0.8731 2400.4000 0.0811 0.9928 0.9879 933.7000 0.0900 1.0000 1.0000 1372.0700 1818504000.0000 0.3605 0.3375 3985.0300 123.7320 0.9394 0.9365 2153.5300 301.5820 0.6814 0.6774 3157.3700 201.7480 0.8179 0.8161 5540.6700 1.0481 0.9897 0.9689 956.7670 0.3779 0.0320 0.0309 0.8937 0.0000 0.0106 0.0094 1.0000 0.1728 0.0153 0.0299 0.5437 0.0036 0.0000 0.0000 1.1108 229100000.0000 0.0796 0.0762 0.9338 6.0586 0.0030 0.0031 0.9635 6.6504 0.0070 0.0069 0.9957 2.9485 0.0027 0.0027 0.9984 1.3298 0.0130 0.0463 1.1044 tT t t t t t tt t Dataset TMSE C5TMSE R2T C5R} R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5607 0.5637 0.5591 4.2667 0.0000 0.8834 0.8825 6.8667 0.0428 0.9963 0.9961 5.6667 0.0995 1.0000 1.0000 8.1333 1675820000.0000 0.3974 0.3944 1Ll333 119.6078 0.9410 0.9406 14.6000 300.3680 0.6884 0.6879 9.7333 201.5320 0.8232 0.8230 5.4000 0.8319 0.9934 0.9928 7.2667 C5R~T C5Nodes t t t 
t to to t t t 0.0280 0.0027 0.0022 1.2299 0.0000 0.0074 0.0073 1.8144 0.0023 0.0002 0.0002 1.6046 0.0028 0.0000 0.0000 1.3578 71386200.0000 0.0257 0.0254 2.7258 3.1367 0.0015 0.0015 3.8739 5.5149 0.0057 0.0057 2.2581 2.8834 0.0025 0.0025 2.3723 1.2699 0.0101 0.0111 1.6386 GMSE C5GMSE R2 G C5 Rb C5R2 R~G ~ TMSE ~ t 5.079160 0.570595 0.531629 1737.4000 0.0000 0.8791 0.8712 2456.1300 0.0504 0.9955 0.9930 947.6000 0.0901 1.0000 1.0000 1358.8300 34665600000.0000 0.3566 0.3346 3942.6000 128.6118 0.9370 0.9340 2175.1700 301.5000 0.6815 0.6777 3296.1000 201.7980 0.8178 0.8160 5862.5000 0.8069 0.9921 0.9447 943.7000 t t t 0.3547 0.0300 0.0305 0.8979 0.0000 0.0085 0.0078 1.0000 0.0078 0.0007 0.0015 0.8486 0.0036 0.0000 0.0000 Ll045 179898600000.0000 0.1065 0.1005 0.0483 10.1046 0.0050 0.0051 0.9300 6.9080 0.0073 0.0073 0.9962 3.3606 0.0030 0.0030 0.9987 0.9965 0.0098 0.1809 1.0310 t Dataset TMSE (JTMSE R2T R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5575 0.5640 0.5596 4.2000 0.0000 0.8827 0.8818 7.1333 0.0446 0.9961 0.9959 5.8667 0.0999 1.0000 1.0000 7.6000 1696134000.0000 0.3901 0.3876 9.8000 120.1524 0.9407 0.9404 14.6667 299.0820 0.6897 0.6892 11.0667 201.8060 0.8229 0.8227 5.2667 0.6623 0.9947 0.9944 6.8000 t t t GMSE (JGMSE (JR} R2G (JR~T R~G (JNodes t 0.0314 0.0030 0.0028 1.1265 0.0000 0.0077 0.0075 2.4598 0.0019 0.0002 0.0002 1.1366 0.0023 0.0000 0.0000 1.0700 63944000.0000 0.0230 0.0226 2.6050 3.7302 0.0018 0.0018 3.9683 4.0010 0.0042 0.0041 3.2156 3.1868 0.0028 0.0028 1.7991 1.2698 0.0101 0.0108 1.3235 5.1746 0.5625 0.5248 1779.1700 0.0000 0.8791 0.8712 2490.6300 0.0494 0.9956 0.9933 961.1330 0.0907 1.0000 1.0000 1471.8700 2047700000.0000 0.3426 0.3230 4036.0700 132.8084 0.9349 0.9316 2269.8000 300.1520 0.6829 0.6788 3149.5700 202.1960 0.8175 0.8157 6296.5000 0.5565 0.9946 0.9879 930.4670 (JRb (JR2 ~ TMSE ~ t t t 0.3875 0.0328 0.0335 0.8807 0.0000 O.oI05 0.0097 1.0000 0.0063 
0.0006 0.0011 0.9022 0.0034 0.0000 0.0000 1.1014 1140178000.0000 0.0994 0.0944 0.8283 33.1722 0.0163 0.0176 0.9047 5.1448 0.0054 0.0055 0.9964 3.7328 0.0034 0.0034 0.9981 0.8542 0.0084 0.0197 1.1902

tion"). A larger injection rate intuitively leads to a longer optimisation time, because more patterns are iterated over when calculating fitness values. Similarly, a larger initial window size leads to a longer optimisation time. With a window acceleration of less than 0.005, not all patterns are presented to the GPMCC method before the maximum number of generations is reached.

Tables 4.12 - 4.20 show the effect of various initial window sizes and window acceleration rates on the outcomes of the GPMCC method. As expected, a larger initial window and a larger window acceleration lead to an increase in the average simulation completion time t. However, there is no general relationship between the two window parameters and any of the other outcomes (MSE, nodes, R², adjusted R², etc.). The windowing parameters are problem specific. It is interesting to note, however, that the continual presentation of all training patterns (shown in table 4.20) was not the strategy that resulted in any optimal outcomes (with the exception of the mono dataset). Therefore, the initial window parameter should be chosen as 0.05 and the window acceleration parameter as 0.005, because these values ensure that all training patterns are presented in a timely manner (this would not be the case if the initial window parameter were less than 0.05). These parameter choices also result in the smallest training times.

The fragment pool initialisation is controlled by two parameters: the number of initial clusters ("DecisionPoolNoClustersStart" in table 4.2) and the split factor ("DecisionPoolNoClustersDivision"). The initial size of the fragment pool is determined by the two parameters, e.g.
if the initial number of clusters is 30 and the split factor is 2, the initial size of the fragment pool is 30 + 15 + 7 + 3 = 55, where each term is the previous term divided by the split factor and rounded down. As was discussed in section 4.3.3, the parameters also control the number of piecewise approximations fitted over the training patterns. These piecewise approximations are then used as terminal nodes for the GPMCC method.

Tables 4.21 - 4.28 show the effect of various initial cluster numbers and split factors on the outcomes of the GPMCC method. There is no general relationship between the two fragment pool initialisation parameters and any of the outcomes (MSE, nodes, R², adjusted R², etc.). The initialisation parameters are thus fairly problem specific. Therefore, the two parameters can be chosen arbitrarily. For the remainder of this thesis the number of initial clusters

Dataset Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries TMSE °TMSE GMSE OGMSE R2T OR} R2G R~T °R~T R~G Nodes 4.5595 0.5638 0.5589 4.5333 0.0000 0.8818 0.8809 6.5333 0.0447 0.9961 0.9959 5.0000 0.1005 1.0000 1.0000 7.4000 1719294000.0000 0.3818 0.3792 9.6000 121.3478 0.9401 0.9398 12.0667 304.8100 0.6838 0.6833 10.2000 202.5780 0.8223 0.8220 4.8667 1.1491 0.9908 0.9900 6.5333 t t ONodes t 0.0348 0.0033 0.0027 1.6344 0.0000 0.0069 0.0067 2.1453 0.0030 0.0003 0.0003 1.3896 0.0025 0.0000 0.0000 0.8137 59086400.0000 0.0212 0.0209 3.0240 3.7463 0.0018 0.0018 4.4793 7.1109 0.0074 0.0073 2.9989 2.9978 0.0026 0.0026 1.8144 1.6181 0.0129 0.0143 1.4559 5.1538 0.5643 0.5215 1556.9700 0.0000 0.8763 0.8684 2275.1300 0.0472 0.9958 0.9939 879.2000 0.0920 1.0000 1.0000 1078.2300 1806200000.0000 0.3646 0.3432 3538.8000 126.8866 0.9378 0.9353 2212.6300 307.0360 0.6756 0.6716 3060.1700 203.1580 0.8166 0.8149 5669.7000 1.0057 0.9902 0.9142 906.6670 ORb 2 ~°R TMSE ~ t t t t t 0.5150 0.0435 0.0476 0.8847 0.0000 0.0078 0.0071 1.0000 0.0054 0.0005 0.0011 0.9467 0.0028 0.0000 0.0000 1.0931 76553400.0000 0.0269 0.0257
0.9519 6.0072 0.0029 0.0031 0.9563 9.0024 0.0095 0.0093 0.9928 3.4670 0.0031 0.0031 0.98 1.3210 0.0130 0.2508 1.1425 t t Dataset TMSE (JTMSE R2T (JR} R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5804 0.5618 0.5574 3.8667 0.0000 0.8820 0.8811 6.8000 0.0446 0.9961 0.9959 5.6000 0.1009 1.0000 1.0000 7.8000 1726956000.0000 0.3790 0.3766 9.2667 121.3546 0.9401 0.9398 12.6667 306.3280 0.6822 0.6818 8.5333 202.4640 0.8224 0.8222 4.4000 0.8853 0.9929 0.9925 6.8667 (JR~T (JNodes t 0.0278 0.0027 0.0025 1.0080 0.0000 0.0067 0.0066 1.8458 0.0030 0.0003 0.0003 1.4994 0.0027 0.0000 0.0000 1.3493 58523200.0000 0.0210 0.0207 3.3930 3.7759 0.0019 0.0018 3.0663 7.4583 0.0077 0.0077 2.5560 2.3244 0.0020 0.0021 1.1919 0.9890 0.0079 0.0084 1.2794 GMSE (JGMSE R2G (JRb (JR~G R~G TMSE t 5.2324 0.5576 0.5186 1531.5000 0.0000 0.8783 0.8703 2218.2700 0.0490 0.9957 0.9934 816.4330 0.0913 1.0000 1.0000 1051.6000 1921038000.0000 0.3344 0.3149 3597.3300 130.8084 0.9359 0.9332 2083.3000 308.5400 0.6740 0.6706 3178.8700 202.8180 0.8169 0.8154 5500.2000 0.7180 0.9930 0.9514 847.0670 Ti'i'<'i' t 0.4345 0.0367 0.0386 0.8754 0.0000 0.0086 0.0082 1.0000 0.0068 0.0006 0.0014 0.9102 0.0039 0.0000 0.0000 1.1057 370382000.0000 0.0831 0.0822 0.8990 23.9748 0.0117 0.0124 0.9277 9.5733 0.0101 0.0099 0.9928 3.0523 0.0028 0.0029 0.9983 0.6632 0.0065 0.1807 1.2329 t Dataset TMSE R2T R~T Nodes Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries 4.5750 0.5624 0.5576 4.1333 0.0000 0.8821 0.8811 7.0000 0.0438 0.9962 0.9960 5.2000 0.0991 1.0000 1.0000 7.6667 1715464000.0000 0.3832 0.3806 8.8000 120.5956 0.9405 0.9402 13.5333 301.5000 0.6872 0.6868 9.0667 201.4420 0.8232 0.8231 4.6667 1.1043 0.9912 0.9906 6.6667 (JTMSE GMSE (JGMSE (JR} R2 (JR~ (JR~T R~G (JNodes t t t t to r 0.0295 0.0028 0.0026 1.1366 0.0000 0.0058 0.0057 2.1656 0.0028 0.0002 0.0002 1.5177 
0.0040 0.0000 0.0000 1.0933 58716400.0000 0.0211 0.0209 2.7966 3.2237 0.0016 0.0016 3.9977 5.6420 0.0059 0.0058 1.5298 2.1989 0.0019 0.0019 1.4933 1.4290 0.0114 0.0122 1.4933 G t 17.6606 0.5365 0.4963 1629.5000 0.0000 0.8783 0.8702 2090.2300 0.0485 0.9957 0.9935 842.9670 0.0906 1.0000 1.0000 1047.4000 1799968000.0000 0.3668 0.3457 3466.3000 126.3778 0.9381 0.9352 2006.9700 302.7800 0.6801 0.6766 2751.3700 201.5840 0.8180 0.8164 4860.7700 1.0168 0.9901 0.9749 807.4330 (JR2 ~ TMSE 'TAAW 67.9024 0.1071 0.1018 0.2591 0.0000 0.0079 0.0081 1.0000 0.0073 0.0006 0.0016 0.9037 0.0045 0.0000 0.0000 1.0931 152635400.0000 0.0537 0.0538 0.9531 4.7560 0.0023 0.0026 0.9542 6.9198 0.0073 0.0073 0.9958 2.5408 0.0023 0.0023 0.9993 1.2497 0.0122 0.0405 1.0860 Dataset TMSE crTMSE R2T R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5708 0.5628 0.5586 3.6667 0.0000 0.8831 0.8822 7.2667 0.0441 0.9961 0.9960 5.5333 0.0999 1.0000 1.0000 7.6667 1701230000.0000 0.3883 0.3858 8.8000 120.4616 0.9406 0.9403 12.9333 301.0580 0.6876 0.6872 10.3333 201.9240 0.8228 0.8226 4.5333 1.0636 0.9915 0.9909 7.1333 crR} R2 crR~T R~G crNodes t GMSE 0.0315 0.0030 0.0027 1.0933 0.0000 0.0049 0.0048 2.3332 0.0024 0.0002 0.0002 1.3830 0.0033 0.0000 0.0000 1.5162 72622800.0000 0.0261 0.0258 2.4269 3.0208 0.0015 0.0015 3.8768 5.6103 0.0058 0.0058 2.6436 2.3849 0.0021 0.0021 1.6344 1.4651 0.0117 0.0124 2.0297 crGMSE crRb crR2 G -5& TMSE t 5.3082 0.5512 0.5147 1654.1300 0.0000 0.8789 0.8705 2065.1300 0.0541 0.9952 0.9929 764.8000 0.0901 1.0000 1.0000 1029.6700 1948646000.0000 0.3484 0.3278 3403.9300 125.6458 0.9384 0.9358 1928.7700 302.0440 0.6809 0.6768 2722.6000 202.2740 0.8174 0.8158 4850.4000 0.9553 0.9907 0.9779 810.4670 ~ t to to t t 0.5044 0.0426 0.0436 0.8611 0.0000 0.0055 0.0058 1.0000 0.0109 0.0010 0.0017 0.8151 0.0038 0.0000 0.0000 1.1085 735114000.0000 0.0893 0.0873 0.8730 5.0534 0.0025 0.0024 0.9587 6.8308 0.0072 
0.0070 0.9967 2.7546 0.0025 0.0025 0.9983 1.3346 0.0131 0.0316 1.1133 t Dataset TMSE crTMSE R2T crR} R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5531 0.5645 0.5598 4.2000 0.0000 0.8830 0.8820 7.1333 0.0445 0.9961 0.9959 5.6000 0.0997 1.0000 1.0000 7.4667 1697054000.0000 0.3898 0.3871 9.0667 120.1176 0.9407 0.9404 13.8000 300.0740 0.6887 0.6882 9.2667 201.0540 0.8236 0.8234 4.8000 1.3801 0.9890 0.9883 6.4000 crR~T crNodes t t t t t t t 0.0359 0.0034 0.0032 1.4480 0.0000 0.0071 0.0071 1.8889 0.0024 0.0002 0.0002 1.6733 0.0031 0.0000 0.0000 1.0080 72796600.0000 0.0262 0.0259 2.0667 4.2472 0.0021 0.0020 4.1223 5.3272 0.0055 0.0055 2.0160 1.9399 0.0017 0.0017 1.6060 1.6197 0.0129 0.0138 1.6733 GMSE crGMSE R2 G crRb crR2 -5J!i TMSE R~G i 5.3183 0.5504 0.5081 1641.3000 0.0000 0.8798 0.8714 2032.7700 0.0502 0.9956 0.9931 781.1000 0.0914 1.0000 1.0000 996.4330 1785874000.0000 0.3718 0.3492 3447.9000 123.9618 0.9393 0.9364 1945.0700 301.0620 0.6819 0.6783 2756.1300 201.2580 0.8183 0.8167 4654.4000 1.1452 0.9888 0.9739 825.5330 ~ tT t t t t t t t t t 0.6657 0.0563 0.0649 0.8561 0.0000 0.0088 0.0088 1.0000 0.0074 0.0007 0.0018 0.8870 0.0046 0.0000 0.0000 1.0915 123829600.0000 0.0436 0.0435 0.9503 7.9181 0.0039 0.0037 0.9690 6.6463 0.0070 0.0071 0.9967 2.3755 0.0021 0.0022 0.9990 1.2081 0.0118 0.0296 1.2051 tt t Dataset TMSE crTMSE R2T crR} R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5626 0.5636 0.5587 4.3333 0.0000 0.8832 0.8823 7.0000 0.0443 0.9961 0.9960 5.2667 0.1006 1.0000 1.0000 7.8667 1689180000.0000 0.3926 0.3898 9.7333 119.8186 0.9409 0.9406 12.6667 300.6600 0.6881 0.6876 9.8667 201.7160 0.8230 0.8228 4.8000 1.4982 0.9881 0.9872 6.2667 crR~T crNodes 0.0367 0.0035 0.0029 1.4223 tT t t t 0.0000 0.0076 0.0075 2.0342 0.0024 0.0002 0.0002 1.1427 0.0027 0.0000 0.0000 1.3578 76237000.0000 0.0274 
0.0271 2.8519 3.2694 0.0016 0.0016 3.6797 6.1792 0.0064 0.0064 2.3302 2.4743 0.0022 0.0022 1.3235 1.7086 0.0136 0.0145 1.7006 GMSE crGMSE R2G crR~ crR2 R~G ~ TMSE t 5.4460 0.5426 0.5016 1610.3300 0.0000 0.8792 0.8713 2029.2000 0.0510 0.9955 0.9933 772.4670 0.0911 1.0000 1.0000 1041.3700 1763444000.0000 0.3797 0.3565 3276.7700 128.9602 0.9368 0.9339 2015.2000 301.5380 0.6814 0.6778 2787.3300 202.0160 0.8176 0.8161 4686.8700 1.2820 0.9875 0.9581 845.2000 TMSF t 1.4684 0.1087 0.1042 0.8378 0.0000 0.0091 0.0086 1.0000 0.0093 0.0008 0.0014 0.8681 0.0034 0.0000 0.0000 1.1033 91387200.0000 0.0321 0.0311 0.9579 12.2305 0.0060 0.0061 0.9291 7.8649 0.0083 0.0082 0.9971 3.3177 0.0030 0.0031 0.9985 1.4768 0.0145 0.0639 1.1686 t Dataset TMSE R2T R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5563 0.5642 0.5596 4.0667 0.0000 0.8830 0.8821 7.0000 0.0446 0.9961 0.9959 4.9333 0.1003 1.0000 1.0000 7.6667 1681064000.0000 0.3955 0.3930 8.8000 119.7560 0.9409 0.9406 12.8667 301.5620 0.6871 0.6867 9.1333 201.7360 0.8230 0.8228 4.5333 0.7055 0.9944 0.9939 7.4667 crTMSE GMSE crR} R2 crR~T crNodes t t t t t t t t t t t t 0.0303 0.0029 0.0025 1.2576 0.0000 0.0062 0.0060 2.1656 0.0022 0.0002 0.0002 1.229900 0.0029 0.0000 0.0000 1.0933 71950200.0000 0.0259 0.0256 2.9408 3.0794 0.0015 0.0015 3.6741 6.5922 0.0068 0.0068 2.2854 2.7894 0.0024 0.0024 1.4559 1.0167 0.0081 0.0087 1.2521 crGMSE crRb G crR2 ----5ill TMSE R~G t 5.1816 0.5619 0.5218 1590.4700 0.0000 0.8789 0.8711 2030.0300 0.0478 0.9958 0.9937 814.633000 0.0920 1.0000 1.0000 1007.6700 1749322000.0000 0.3846 0.3636 3485.8700 124.9156 0.9388 0.9361 1972.9000 302.7680 0.6801 0.6765 2825.9300 202.0880 0.8176 0.8160 4715.5700 0.6232 0.9939 0.9749 750.5000 ~ t t t t t t t t 0.3516 0.0297 0.0323 0.8793 0.0000 0.0074 0.0074 1.0000 0.0069 0.0006 0.0011 0.9322 0.0034 0.0000 0.0000 0.8793 91357200.0000 0.0321 0.0309 0.9610 5.7129 0.0028 0.0027 0.9587 8.1946 
0.0087 0.0084 0.9960 3.4211 0.0031 0.0032 0.9983 0.8189 0.0080 0.0524 1.1321 t Dataset TMSE aTMSE R2 T aR} R~T Abalone Elevators Federal Reserve Economic Data Function Example House-16H Lena Image Mono Sample Stereo Sample Timeseries Nodes 4.5692 0.5629 0.5585 3.7333 0.0000 0.8819 0.8810 7.1333 0.0442 0.9961 0.9960 5.6000 0.0999 1.0000 1.0000 8.0667 1708460000.0000 0.3857 0.3830 9.6000 120.2276 0.9407 0.9404 13.4667 303.2560 0.6854 0.6849 9.4667 200.6680 0.8239 0.8237 4.8000 1.0371 0.9917 0.9912 6.6667 aR~T aNodes t t t 0.0342 0.0033 0.0030 1.1121 0.0000 0.0075 0.0073 2.5695 0.0024 0.0002 0.0002 1.5888 0.0031 0.0000 0.0000 1.6386 77 513400.0000 0.0279 0.0276 2.5271 3.6120 0.0018 0.0018 2.8616 6.3768 0.0066 0.0066 2.3887 1.7839 0.0016 0.0015 1.9191 1.5863 0.0127 0.0136 1.2954 GMSE aGMSE R2 G aRb R~G aR~G TMSE t 5.2241 0.5583 0.5204 1617.7000 0.0000 0.8783 0.8701 1972.7000 0.0488 0.9957 0.9933 751.1000 0.0907 1.0000 1.0000 948.8670 1779002000.0000 0.3742 0.3521 3325.8700 125.0488 0.9387 0.9359 1976.2000 305.3640 0.6774 0.6734 2876.6000 200.7440 0.8188 0.8170 4771.5700 0.8519 0.9917 0.9803 786.6330 ~ t t t t t t 0.4925 0.0416 0.0447 0.8746 0.0000 0.0090 0.0085 1.0000 0.0060 0.0005 0.0015 0.9043 0.0042 0.0000 0.0000 1.1013 128243800.0000 0.0451 0.0444 0.9603 5.6104 0.0027 0.0027 0.9614 8.0327 0.0085 0.0086 0.9931 2.1620 0.0020 0.0019 0.9996 1.2269 0.0120 0.0358 1.2175 t is set to 30 and the split factor is set to 2. This should provide enough clusters to cover the turning points of a dataset. Although a large number of initial parameters were introduced in the previous section, the GPMCC method appears to be fairly robust in that different values for parameters do not have a significant effect on accuracy. This section compares the GPMCC method to two other methods discussed earlier in this thesis. The first comparison method is Setiono's NeuroLinear method from section 4.2.1. 
Setiono presents a table of results, obtained by running NeuroLinear on 5 databases from the UCI machine learning repository [91]. The databases used by Setiono are the Abalone, Auto-Mpg, Housing, Machine and Servo databases discussed in section 4.4.1. Setiono performed one 10-fold cross validation evaluation on each of these datasets. The predictive accuracy of NeuroLinear was tested in terms of the generalisation mean absolute error:

\[ E_{MA} = \frac{1}{|P|} \sum_{i=1}^{|P|} \left| y_i - \hat{y}_i \right| \]

where |P| is the number of patterns, y_i is the target output and \hat{y}_i is the predicted output for pattern i. Setiono also provided the average number of rules generated for each dataset.

The second comparison method is a commercial version of the M5 algorithm (successor to C4.5) called Cubist [82]. Cubist internally utilises model trees with linear regression models, and presents these model trees in the form of a production system. Both Cubist and the GPMCC method were used to perform 10 10-fold cross validation evaluations (equivalent to 100 simulation runs) on the datasets mentioned previously, in order to determine the generalisation mean absolute error. The average number of generated rules, rule conditions (the length of the path from the root to the terminal node) and rule terms (the number of terms in the model) were also obtained for each dataset.

Table 4.29 shows the initialisation parameters for the GPMCC method as determined by the findings of the previous section. For all datasets the maximum polynomial order was set to 5, except for the house-16H dataset, for which it was set to 10. Table 4.30 shows the results for Cubist, the GPMCC method and NeuroLinear for the 5 databases mentioned above.
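The generalisation mean absolute error used throughout this comparison is simple to compute from a set of target and predicted outputs. The following Python sketch illustrates the calculation; the function name and the list-based interface are assumptions made for illustration and are not taken from the GPMCC implementation, NeuroLinear or Cubist.

```python
from typing import Sequence


def mean_absolute_error(targets: Sequence[float],
                        predictions: Sequence[float]) -> float:
    """Return (1/|P|) * sum(|y_i - yhat_i|) over all patterns.

    `targets` holds the y_i values and `predictions` the yhat_i values;
    both names are illustrative.
    """
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    if len(targets) == 0:
        raise ValueError("at least one pattern is required")
    total = sum(abs(y - y_hat) for y, y_hat in zip(targets, predictions))
    return total / len(targets)


# Absolute errors are 1, 0 and 2, so the mean absolute error is 1.0.
print(mean_absolute_error([10.0, 12.0, 9.0], [11.0, 12.0, 7.0]))
```

In a 10-fold cross validation setting, a function of this kind would be applied to the held-out fold of each run, and the resulting values averaged to give the GMAE figures reported in the tables.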
Parameter                              Value
DecisionReoptimizeVsSelectLeaf         0.05
DecisionPoolNoClustersStart            30
DecisionPoolNoClustersDivision         2
DecisionPoolFragmentLifeTime           50
DecisionInitialPercentageSampleSize    0.05
DecisionSampleAcceleration             0.005
DecisionCrossoverRate                  0.1
Cross Validation                       1

Table 4.30: Comparison of Cubist, GPMCC and NeuroLinear (GMAE is the average generalisation mean absolute error, σGMAE is the standard deviation of the generalisation mean absolute error, rules is the average number of rules, σrules is the standard deviation of the number of rules, conds is the average number of rule conditions, σconds is the standard deviation of the number of rule conditions, terms is the average number of terms per rule and σterms is the standard deviation of the number of terms per rule)

Dataset Abalone Auto-Mpg Housing Machine Servo
Method Cubist GPMCC NL Cubist GPMCC NL Cubist GPMCC NL Cubist GPMCC NL Cubist GPMCC NL
GMAE 1.4950 1.6051 1.5700 1.8676 2.1977 1.9600 1.7471 2.8458 2.5300 26.9280 34.3228 20.9900 0.3077 0.4496 0.3400
σGMAE 0.0609 0.0156 0.0600 0.2536 0.3703 0.3200 0.2633 0.5217 0.4600 7.5873 17.0849 11.3800 0.1252 0.1755 0.0800
rules 12.6500 2.5400 4.1000 5.1000 2.1800 7.5000 12.6700 2.8200 25.3000 4.8800 3.4600 3.0000 9.6100 5.1500 4.7000
σrules 4.1056 0.6264 1.4500 1.3890 0.6724 5.0800 3.3727 0.7962 17.1300 1.4162 0.9773 3.0000 1.9638 1.9456 2.3100
conds σconds terms σterms 2.6545 1.3407 0.5700 0.3688 n/a 0.4225 0.4985 n/a 0.4542 0.4534 n/a 0.4106 0.4983 4.7560 8.3213 0.4978 1.0634 4.6818 3.4095 0.7785 0.9773 n/a n/a n/a 0.3027 0.8876 2.1033 2.7935 0.2890 0.8482 n/a n/a n/a n/a 2.0816 1.0765 n/a 3.2125 1.5103 n/a 1.9367 1.8770 n/a 2.7298 2.6247 n/a n/a n/a 3.0013 7.2195 0.4972 1.6180 n/a n/a 5.2164 7.6613 0.6068 1.4220 n/a n/a

In all the cases the GPMCC method was the least accurate of the three methods in terms of generalisation accuracy (but not significantly).
Cubist, on the other hand, was the most accurate of the three methods. However, the number of rules generated by the GPMCC method was significantly smaller than that of the other methods, with the exception of the Servo and Machine datasets. Also, the total complexity of the GPMCC method, in terms of the average number of rules, the average number of rule conditions and the average number of rule terms, was significantly less than that of Cubist.

The GPMCC method and Cubist were also compared on the remaining datasets not shown in table 4.30. Once again, 10 10-fold cross validation evaluations were performed in order to obtain the generalisation mean absolute error. Additionally, the average number of generated rules, rule conditions and rule terms were obtained for each database. The GPMCC method was once again initialised using table 4.29.

Dataset             Method  GMAE        σGMAE      rules    σrules  conds   σconds  terms   σterms
Elevators           Cubist  0.0019      0.0001     18.0400  2.3047  2.9669  0.3206  2.9493  0.3229
                    GPMCC   0.0018      0.0000     3.3200   0.9522  1.7929  0.5161  8.8217  0.8498
House-16H           Cubist  16355.2840  375.0254   35.7100  4.7637  4.4092  0.4649  4.1570  0.4291
                    GPMCC   24269.8000  3889.8900  5.1300   1.2032  2.6361  0.5451  7.2623  1.5700
Federal Reserve ED  Cubist  0.0999      0.0137     17.0300  4.0613  2.5765  0.3249  4.8203  0.4984
                    GPMCC   0.1514      0.0366     2.8700   0.6765  1.5553  0.4031  7.6408  1.1640
Function Example    Cubist  0.9165      1.4499     41.4600  5.3756  2.8032  0.2843  1.1384  0.1221
                    GPMCC   1.3713      1.8050     4.1700   0.5695  2.0916  0.2158  2.1400  0.1654
Lena Image          Cubist  5.0160      0.2102     32.8000  4.5969  4.3619  0.4775  3.9577  0.3999
                    GPMCC   6.4107      0.2466     6.5500   1.6229  3.0743  0.5332  7.4953  1.0319
Mono Sample         Cubist  13.3550     0.1749     35.2700  4.5480  3.6664  0.3942  4.4224  0.4533
                    GPMCC   13.7162     0.1762     4.5300   1.1845  2.4217  0.5613  4.7366  0.4293
Stereo Sample       Cubist  11.2470     0.1507     11.5500  2.8298  2.9741  0.4580  4.9500  0.5000
                    GPMCC   11.2498     0.1923     2.6100   0.7092  1.3883  0.4392  7.8308  0.6162
Timeseries          Cubist  0.7664      0.1853     30.2400  3.8379  3.5024  0.3778  3.1646  0.3315
                    GPMCC   0.6442      0.3718     3.6700   0.8996  2.0062  0.4271  5.0386  1.6865

Table 4.31 shows the results for Cubist and the GPMCC method for the remaining databases not used in table 4.30. In all cases the GPMCC method outperformed Cubist in terms of the average number of rules generated and the total complexity. In fact, the average number of rules generated by the GPMCC method and the total complexity of the GPMCC method were significantly less than those of Cubist. Additionally, the GPMCC method even managed to outperform Cubist in terms of the generalisation mean absolute error on some of the datasets, i.e. Elevators and Time-series. However, there is no statistically significant difference in accuracy.

This section discusses the quality of the rules inferred by the GPMCC method for each of the datasets in section 4.4.1. Each model tree represents the best outcome of 10 10-fold cross validation evaluations in terms of the mean squared error on the generalisation set. The GPMCC method was initialised using table 4.29. Values between parentheses show how many patterns are covered by the antecedent of a rule. Values between angle brackets indicate the mean squared error of the rule on the patterns covered by the antecedent of that rule. For both types of bracket, the first value represents the outcome for the training set and the second value represents the outcome for the validation set.
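Because every rule reports both its coverage and its mean squared error on the covered patterns, the error of a whole model tree can be read as the coverage-weighted mean of its per-rule errors. The Python sketch below checks this reading against the training-set annotations of the first (Abalone) model tree listed in this section. The function name and the (count, mse) tuple layout are assumptions made for illustration and are not part of the GPMCC implementation.

```python
from typing import Iterable, Tuple


def tree_mse(rules: Iterable[Tuple[int, float]]) -> float:
    """Coverage-weighted mean of per-rule MSE values.

    Each item is (patterns_covered, rule_mse), mirroring the
    "(n) <mse>" annotations attached to each rule.
    """
    rules = list(rules)
    total_patterns = sum(n for n, _ in rules)
    if total_patterns == 0:
        raise ValueError("the rules cover no patterns")
    return sum(n * mse for n, mse in rules) / total_patterns


# Training-set coverage and MSE of the three Abalone rules.
abalone_training = [(1, 48.201), (1027, 6.01941), (2313, 3.83083)]

# The result agrees with the reported TMSE of 4.51687 up to rounding
# of the per-rule values.
print(tree_mse(abalone_training))
```

The same calculation applied to the validation-set annotations should reproduce the reported VMSE in the same way, since the rules partition the patterns between them.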
if (Sex == "F") {
  if (Viscera > 0.545792) {
    Rings = 13.6301*pow(Shell,1)
            -16.1805*pow(Viscera,1)
            -26.6776*pow(Shucked,1)
            +13.0827*pow(Whole,1)
            +8.99643; //(1, 0) <48.201, 0>
  } else {
    Rings = -44.7887*pow(Shell,1)*pow(Length,1)
            +42.2753*pow(Shell,1)
            +132.724*pow(Viscera,3)*pow(Diameter,2)
            -21.1746*pow(Viscera,1)
            +22.7078*pow(Shucked,2)
            -45.8945*pow(Shucked,1)
            -6.61542*pow(Whole,2)
            +28.4355*pow(Whole,1)
            +0.782952*pow(Height,1)
            +4.46808; //(1027, 130) <6.01941, 5.99693>
  }
} else {
  Rings = 13.1915*pow(Shell,1)
          +17.8778*pow(Viscera,1)*pow(Whole,1)*pow(Diameter,1)
          -40.99*pow(Viscera,1)*pow(Length,1)
          +59.2038*pow(Shucked,2)
          -35.4237*pow(Shucked,1)*pow(Whole,1)
          -42.489*pow(Shucked,1)
          +0.10544*pow(Whole,4)
          +27.2256*pow(Whole,1)
          +4.30859*pow(Diameter,1)
          +3.96395; //(2313, 288) <3.83083, 3.1415>
}
TMSE: 4.51687 VMSE: 4.02955 GMSE: 4.904

The largest order used by the models of the model tree is 5. Also, the first rule covers a single training pattern, which represents an outlier. This outlier evidently skewed the coefficients of the GPMCC method so drastically that the GPMCC method had no choice but to isolate it. This indicates that this training pattern should be removed from the dataset.

if (displacement > 97.2829) {
  mpg = -0.061885*pow(model_year,1)*pow(cylinders,1)
        +0.933105*pow(model_year,1)
        +0.00151765*pow(weight,1)*pow(cylinders,1)
        -0.0147998*pow(weight,1)
        -0.037649*pow(horsepower,1); //(251, 30) <7.37801, 7.72467>
} else {
  mpg = -1.18039*pow(origin,1)
        +0.0479332*pow(model_year,2)
        -0.0172612*pow(model_year,1)*pow(cylinders,2)
        -5.99653*pow(model_year,1)
        -0.00545806*pow(weight,1)
        -0.0244613*pow(horsepower,1)
        -0.0546243*pow(displacement,1)
        +14.5401*pow(cylinders,1)
        +192.894; //(61, 10) <17.2217, 3.97088>
}
TMSE: 9.30258 VMSE: 6.78622

Only two non-linear rules were generated. The largest order utilised by the models of the model tree is 3.
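A model tree of this form is a piecewise polynomial: the conditions select a leaf, and the polynomial stored in that leaf produces the prediction. As an illustration, the two-rule Auto-Mpg tree above can be transcribed into Python as follows. The function name and argument order are assumptions made for this sketch; the threshold and coefficients are copied from the listing.

```python
def predict_mpg(displacement: float, cylinders: float, horsepower: float,
                weight: float, model_year: float, origin: float) -> float:
    """Evaluate the two-rule Auto-Mpg model tree.

    The split on displacement selects one of two polynomial leaves.
    """
    if displacement > 97.2829:
        return (-0.061885 * model_year * cylinders
                + 0.933105 * model_year
                + 0.00151765 * weight * cylinders
                - 0.0147998 * weight
                - 0.037649 * horsepower)
    return (-1.18039 * origin
            + 0.0479332 * model_year ** 2
            - 0.0172612 * model_year * cylinders ** 2
            - 5.99653 * model_year
            - 0.00545806 * weight
            - 0.0244613 * horsepower
            - 0.0546243 * displacement
            + 14.5401 * cylinders
            + 192.894)
```

The arguments must use the same units and encoding as the attributes of the training data; the sketch performs no scaling or validation of its own.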
    if (SaTime1 < -0.000562935) {
      if (diffRollRate > -0.011995) {
        Goal = 0.0848297*pow(Sa,1)*pow(diffClb,1) -56.4038*pow(Sa,1)
               -9511.27*pow(SaTime4,2)
               -0.00331258*pow(SaTime3,1)*pow(climbRate,1)
               -1.43111*pow(SaTime1,1) +0.747046*pow(diffRollRate,1)
               +0.00236211*pow(absRoll,1) +0.00547074*pow(p,1)
               +0.0106479; //(2925, 372) <7.95712e-06, 5.63356e-06>
      } else {
        Goal = -36.6557*pow(Sa,1) -4101.46*pow(SaTime4,2)
               -0.00368597*pow(SaTime3,1)*pow(climbRate,1)
               +0.0014319*pow(SaTime2,1)*pow(Sgz,1)
               +0.463288*pow(diffRollRate,1) +0.00154505*pow(absRoll,1)
               +0.00781973*pow(q,1) +0.00305086*pow(p,1)
               +0.0123348; //(1189, 151) <3.92677e-06, 3.27794e-06>
      }
    } else {
      Goal = -25.0523*pow(Sa,1) +0.116456*pow(SaTime4,1)*pow(diffClb,1)
             -0.00669548*pow(SaTime3,1)*pow(climbRate,1)
             +0.433775*pow(diffRollRate,1) +0.00126904*pow(absRoll,1)
             +0.00372367*pow(p,1)
             +0.016304; //(2886, 353) <4.34178e-06, 4.54399e-06>
    }
    TMSE: 5.78198e-06  VMSE: 4.78845e-06  GMSE: 4.41977e-06

A large number of terms were utilised by the models of the model tree. However, the maximum polynomial order of the models in the model tree is 4.

    if (Y3TCMR > 15.8756) {
      M1CDR = 0.04954*pow(TWEIMC,1)
              +0.00308282*pow(M3TBRAA,2); //(3, 1) <0.246856, 0.641734>
    } else {
      if (M3TBRSM > 11.1668) {
        M1CDR = -0.132345*pow(TLLACB,1) +0.0941262*pow(TCD,1)
                +0.56461*pow(M1MS,1) -0.162584*pow(BCACB,1)
                +0.206319*pow(Y3TCMR,1) -0.0400925*pow(M3TBRAA,1)
                +0.446434*pow(Y30CMR,1); //(82, 8) <0.11477, 0.0222504>
      } else {
        M1CDR = 0.450881*pow(M1MS,1) -0.0102812*pow(DDCB,1)
                +0.00126507*pow(CCMS,1) +0.0538425*pow(BCACB,1)
                -0.00205841*pow(Y5TCMR,2) -0.356311*pow(Y5TCMR,1)
                +0.00877064*pow(Y3TCMR,2) +0.00308282*pow(M3TBRAA,2)
                +0.69845*pow(Y30CMR,1) -0.11687; //(754, 96) <0.0422099, 0.0382938>
      }
    }
    TMSE: 0.0500334  VMSE: 0.0428185  GMSE: 0.0250136

A large number of terms were utilised by the models of the model tree. The maximum polynomial order of the models in the tree is 2.
    if (type == "A") {
      y = 2.01245*pow(x,1) +0.512857; //(266, 22) <0.0912585, 0.0873773>
    } else {
      if (type == "e") {
        if (x > 0.997578) {
          y = 1.48816*pow(x,2) -6.83703*pow(x,1)
              +2.09168; //(66, 9) <0.0959942, 0.0494538>
        } else {
          y = 1.01387*pow(x,3) +400.504; //(204, 31) <0.100298, 0.0883689>
        }
      } else {
        y = -1.9986*pow(x,1) +0.512857; //(264, 38) <0.0853964, 0.0836496>
      }
    }
    TMSE: 0.0920197  VMSE: 0.0828551  GMSE: 0.0811045

Interestingly, the model tree is almost identical to the generating function of section 4.4.1.

    if (P11p4 < 0.00635415) {
      price = -249056*pow(H10p1,1)*pow(P16p2,1) -150165*pow(P27p4,3)
              +286593*pow(P16p2,1); //(123, 12) <3.18889e+09, 4.06234e+09>
    } else {
      if (H40p4 < 0.0122007) {
        if (H13p1 < 0.847563) {
          price = 287953*pow(H13p1,2) -292744*pow(H13p1,1)
                  -1987.53*pow(H8p2,1) +45552.3*pow(H2p2,1)
                  +318809*pow(P27p4,1)*pow(P16p2,1)
                  +130331*pow(P14p9,6)*pow(P6p2,2)*pow(P1,1)
                  -1789.72*pow(P11p4,4) +12.4096*pow(P1,1)
                  +70607.7; //(3177, 397) <1.23449e+09, 6.54793e+08>
        } else {
          price = 49642.6*pow(P15p1,1); //(57, 6) <1.19254e+09, 5.26761e+08>
        }
      } else {
        if (H10p1 > 0.965729) {
          if (H13p1 < 0.707063) {
            price = 1.42302e+06*pow(H18pA,1)*pow(H10p1,8)
                    -1.50279e+06*pow(H18pA,1) -1.42104e+06*pow(H13p1,4)
                    +1.78124e+06*pow(H13p1,2)*pow(H10p1,3)
                    -960514*pow(H13p1,1) -2.91708e+06*pow(H10p1,1)
                    +491007*pow(P27p4,1)*pow(P16p2,1) -97214.2*pow(P14p9,1)
                    +3.09173e+06; //(12264, 1542) <1.58384e+09, 1.18774e+09>
          } else {
            price = 49642.6*pow(P15p1,1); //(61, 2) <1.7607e+09, 1.05144e+09>
          }
        } else {
          if (P18p2 < 0.0693344) {
            price = -49761.3*pow(H40p4,1) +741151*pow(H18pA,2)
                    -352497*pow(H18pA,1) +27277.1*pow(H10p1,2)
                    -86146*pow(H2p2,1) +2.06379e+06*pow(P27p4,1)*pow(P5p1,1)
                    -5.10314e+06*pow(P18p2,1)*pow(P5p1,1)
                    +179486*pow(P16p2,2)
                    -8558.44; //(2543, 320) <2.70133e+09, 2.22773e+09>
          } else {
            price = 49642.6*pow(P15p1,1); //(1, 0) <1.82529e+08, 0>
          }
        }
      }
    }
    TMSE: 1.68899e+09  VMSE: 1.2542e+09  GMSE: 1.29921e+09

A large number of non-linear rules were obtained from the dataset.
The maximum utilised polynomial order for the rules was 9.

    if (DIS < 1.81274) {
      MEDV = 0.0311044*pow(LSTAT,2)
             -0.0852524*pow(LSTAT,1)*pow(PTRATIO,1)*pow(NOX,1)
             -0.596869*pow(LSTAT,1) +0.431583*pow(RAD,1)*pow(CHAS,1)
             -1.35841*pow(DIS,1) +0.119659*pow(RM,3) -11.7921*pow(RM,1)
             -0.108755*pow(CRIM,1) +83.3143; //(63, 8) <33.3665, 3.50684>
    } else {
      MEDV = -0.767679*pow(LSTAT,1)*pow(NOX,1) -0.729239*pow(PTRATIO,1)
             -0.0137382*pow(TAX,1) -0.255243*pow(RAD,1)*pow(RM,1)
             +1.71962*pow(RAD,1) -0.59144*pow(DIS,1) +2.81649*pow(RM,2)
             -30.0403*pow(RM,1) +1.5585*pow(CHAS,1)
             +124.502; //(341, 43) <10.7152, 7.98617>
    }
    TMSE: 14.2475  VMSE: 7.28353  GMSE: 6.29143

Only two non-linear rules were obtained. The maximum order of the models of the model tree is 3.

    if (blend[7] < 203.273) {
      intensity = -0.0123416*pow(across,1) +0.00189951*pow(blend[7],2)
                  -0.0477566*pow(blend[6],1) +0.326583*pow(blend[5],1)
                  +0.694096*pow(blend[4],1) -0.00194599*pow(blend[3],2)
                  +0.22087*pow(blend[3],1) -0.132875*pow(blend[2],1)
                  -0.0649604*pow(blend[1],1)
                  +1.96046; //(12328, 1532) <127.918, 120.595>
    } else {
      if (blend[3] > 118.928) {
        if (blend[0] > 49.8796) {
          if (across < 55.2729) {
            intensity = 0.507778*pow(blend[7],1) -0.0444378*pow(blend[6],1)
                        +0.354999*pow(blend[5],1) +0.659356*pow(blend[4],1)
                        -0.303265*pow(blend[3],1)
                        -0.176485*pow(blend[2],1); //(38, 6) <351.494, 62.1143>
          } else {
            intensity = 0.0105132*pow(blend[7],2)
                        -0.0205424*pow(blend[7],1)*pow(blend[4],1)
                        -0.00172814*pow(blend[5],2)
                        +0.00143276*pow(blend[5],1)*pow(blend[1],1)
                        +0.317081*pow(blend[5],1) +5.42638*pow(blend[4],1)
                        -0.543182*pow(blend[1],1) -0.167443*pow(blend[0],1)
                        -391.867; //(734, 100) <150.048, 128.225>
          }
        } else {
          intensity = 1.94436*pow(blend[2],1); //(2, 0) <1.63576, 0>
        }
      } else {
        intensity = 1.47793*pow(blend[5],1) -0.62556*pow(blend[2],1)
                    +0.263644*pow(blend[1],1); // <0.00493363, 1451.8>
      }
    }
    TMSE: 129.747  VMSE: 121.659  GMSE: 106.126

Essentially, the rules form linear blends of the context pixels to obtain the predicted pixel value.
One of the rules represents an outlier. Also, the maximum polynomial order of the models is 2.

    if (CACH < 128.391) {
      if (MMIN < 25112.4) {
        PRP = 0.0347948*pow(CHMAX,1)*pow(CACH,1) -0.14693*pow(CHMIN,2)
              +0.00117564*pow(CHMIN,1)*pow(MMIN,1) +0.00558463*pow(MMAX,1)
              -0.0010641*pow(MMIN,1); //(160, 21) <1212.63, 568.998>
      } else {
        PRP = 31.0917*pow(CHMIN,1)
              +0.00430516*pow(MMIN,1); //(1, 0) <127.71, 0>
      }
    } else {
      PRP = -2.33197*pow(CHMAX,1) +30.9474*pow(CHMIN,1)
            +0.00440527*pow(MMIN,1); //(6, 0) <736.413, 0>
    }
    TMSE: 1189.02  VMSE: 568.998  GMSE: 333.024

A small number of linear rules were generated. The largest order utilised by the models of the model tree is 2. Once again one of the rules represents an outlier.

    if (t < 10566.2) {
      if (buf[0] > 130.813) {
        y = 0.869102*pow(buf[2],1) -0.398713*pow(buf[1],1)
            +0.436347*pow(buf[0],1) +11.8149; //(3932, 497) <307.129, 311.468>
      } else {
        y = 0.8834*pow(buf[2],1) -0.454296*pow(buf[1],1)
            +0.475617*pow(buf[0],1) +12.0682; //(4510, 540) <314.362, 334.597>
      }
    } else {
      if (t < 20022.7) {
        if (buf[1] < 145.598) {
          y = -0.00243541*pow(buf[2],1)*pow(buf[1],1) +1.17106*pow(buf[2],1)
              +0.00179399*pow(buf[1],2) -1.04915*pow(buf[1],1)
              +0.721939*pow(buf[0],1) +28.1399; //(5460, 692) <332.485, 356.925>
        } else {
          y = 0.00263594*pow(buf[2],2) -0.447044*pow(buf[1],1)
              +0.566219*pow(buf[0],1) +61.1108; //(2109, 235) <286.787, 252.748>
        }
      } else {
        y = 1.08554*pow(buf[2],1) -0.646192*pow(buf[1],1)
            +0.366292*pow(buf[0],1) +24.683; //(9495, 1225) <263.164, 249.476>
      }
    }
    TMSE: 295.787  VMSE: 297.108  GMSE: 294.327

A small number of linear rules were generated. The largest order utilised by the models of the model tree is 2.
    if (motor == "D") {
      class = -0.343752*pow(pgain,1) +2.13126; //(17, 2) <0.267481, 0.999964>
    } else {
      if (motor == "E") {
        if (screw == "A") {
          class = -0.0971804*pow(vgain,1)*pow(pgain,1) +0.59491*pow(vgain,1)
                  -0.489579*pow(pgain,3) +7.57494*pow(pgain,2)
                  -38.6539*pow(pgain,1) +65.5058; //(6, 1) <1.43629, 0.00659144>
        } else {
          class = -0.343752*pow(pgain,1)
                  +2.13126; //(19, 2) <0.126267, 0.00564462>
        }
      } else {
        class = -0.122795*pow(vgain,1)*pow(pgain,1) +0.730519*pow(vgain,1)
                -0.447423*pow(pgain,3) +7.05858*pow(pgain,2)
                -36.4128*pow(pgain,1) +61.5672; //(91, 12) <0.39332, 0.016951>
      }
    }
    TMSE: 0.386136  VMSE: 0.13066  GMSE: 0.0315292

A small number of rules were generated; however, some of the rules are non-linear. The maximum order of the models of the model tree is 3.

    if (t > 23500.3) {
      y = 0.510853*pow(left[3],1) +0.754001*pow(right[2],1)
          -0.0890549*pow(left[2],1) -0.677194*pow(right[1],1)
          +0.333603*pow(left[1],1) +0.393169*pow(right[0],1)
          -0.180878*pow(left[0],1) -5.84898; //(6356, 797) <198.139, 195.359>
    } else {
      if (t > 7772.91) {
        if (left[1] < 167.104) {
          y = 0.595663*pow(left[3],1) +0.584884*pow(right[2],1)
              -0.14832*pow(left[2],1) -0.646768*pow(right[1],1)
              +0.375642*pow(left[1],1) +0.477733*pow(right[0],1)
              -0.185438*pow(left[0],1)
              -6.5451; //(11317, 1434) <219.241, 215.915>
        } else {
          y = 0.616402*pow(left[3],1) +0.603696*pow(right[2],1)
              -0.18247*pow(left[2],1) -0.623408*pow(right[1],1)
              +0.39671*pow(left[1],1) +0.452291*pow(right[0],1)
              -0.199426*pow(left[0],1)
              -8.13588; //(1241, 133) <211.091, 152.021>
        }
      } else {
        y = 0.716837*pow(left[3],1) +0.639319*pow(right[2],1)
            -0.394613*pow(left[2],1) -0.560577*pow(right[1],1)
            +0.459409*pow(left[1],1) +0.430266*pow(right[0],1)
            -0.211955*pow(left[0],1) -9.64711; //(6230, 779) <173.144, 164.914>
      }
    }
    TMSE: 202.083  VMSE: 195.358  GMSE: 189.556

Essentially, the rules form linear blends of the left and right channels to obtain the predicted sample.
    if (t1 < 0.0263429) {
    if (t0 > 0.00343757) {
    y = -2.63082*pow(t0,1) +0.972257*pow(t1,1)*pow(t2,1)
        -0.0932611*pow(t2,1) +0.326075; //(101, 12) <0.123809, 0.0717015>
    else {
    y = 1.00471*pow(t0,1)*pow(t1,1)*pow(t2,1)
        +0.669822; //(97, 8) <0.102582, 0.037943>
    } else {
    y = -0.19592*pow(t1,1) +0.979469*pow(t2,1)
        +0.342466; //(193, 25) <0.0924138, 0.0926662>
    } else {
    y = 0.305796*pow(t1,1) -1.39869*pow(t2,2)
        +1.4918; //(409, 55) <0.0823577, 0.0943271>
    }
    TMSE: 0.0924692  VMSE: 0.086686  GMSE: 0.0857686

Interestingly, the model tree is almost identical to the generating function of section 4.4.1.

Among the best solutions listed above, the maximum utilised polynomial order was 9 (for the house-16H dataset). This suggests that allowing a maximum polynomial order larger than 9 will not improve generalisation performance. For a large proportion of the above solutions, the utilised polynomial order was no greater than 3. This indicates that cubic surfaces sufficiently describe most databases, including time-series.

For some of the above results, outliers in the dataset were detected and isolated by rules. This indicates that there is still redundancy to be removed from the model tree solutions. The removal of these outliers will also improve the generalisation accuracy of the models. These outliers should be removed by some heuristic, e.g. rules that cover fewer patterns than some threshold should be removed, and the patterns covered by those rules should be discarded from the training set. Thus, further improvements in accuracy and complexity are possible.

4.5 CONCLUSION

This chapter discussed a genetic program for the mining of continuous-valued classes (GPMCC). The performance of the GPMCC was evaluated against other algorithms such as Cubist and NeuroLinear for a wide variety of problems.
Although the generalisation ability of the GPMCC method was slightly worse than that of the other methods, the complexity and number of generated rules were significantly smaller than those of the other methods. The GPMCC method was also fairly robust, in that the parameter choices did not significantly affect any outcomes of the GPMCC method.

The success of the GPMCC method can be attributed to the specialised mutation and crossover operators, and to clustering. Another important aspect of the GPMCC method is the development of a fragment pool, which served as a belief space for the genetic program. The fragments of the fragment pool resulted in structurally optimal models for the terminal nodes of the GPMCC method. The fitness function was also crucial, because it penalised chromosomes with a high level of complexity.

Although the genetic program presented in this chapter seems to be fairly effective in terms of both rule accuracy and complexity, the algorithm was not particularly fast. The speed of the algorithm is seriously affected by the recursive procedures used to perform fitness evaluation, crossover and mutation on the chromosomes (model trees). This problem can be solved in two ways: implement a model tree as an array, or change the model tree representation to a production system. If the model trees of the genetic program are represented as an array, clever indexing of the array will remove the need for any recursive functions. However, the array would have to represent a full binary tree, which could waste system memory unnecessarily if the model trees are sparse. If the model tree representation is changed to a production system, the mutation and crossover operators will have to be re-investigated.

Envisioned future developments to the GPMCC method include the revision of the attribute tests. These attribute tests could be revised to implement non-linear separation boundaries between continuous classes.
The GASOPE method could then be used to efficiently approximate these non-linear separation boundaries. The fragment pool discussed in this chapter could be used to implement a function set for the separation boundaries, in a manner similar to that of the terminal set.

The next chapter presents the conclusion to this thesis.

Chapter 5

CONCLUSION

This chapter briefly summarises the findings and contributions of this thesis, followed by a discussion of directions for future research.

5.1 SUMMARY

Data mining is becoming increasingly important as a means of automating and enhancing traditional knowledge discovery processes. In order to completely satisfy the four objectives of data mining, i.e. accuracy, comprehensibility, crispness and novelty, new tools need to be developed. However, the development of new tools should also satisfy a number of qualitative requirements, e.g. scalability, efficiency and reliability. The mining of continuous-valued classes provides its own unique challenges, evident in the relatively small number of data mining algorithms that exist for the mining of continuous classes.

This thesis developed a new genetic programming approach for the mining of continuous classes (GPMCC). Essentially, the GPMCC method evolved model trees in order to describe a data set, which was made up of patterns with continuous targets. The models for the model trees were obtained from a fast, efficient genetic algorithm that evolved structurally optimal polynomial expressions (GASOPE). Both the GASOPE and the GPMCC method utilised a fast, rough clustering algorithm in order to reduce the search space.

The GASOPE method evolves a population of individuals, which represent polynomial expressions. Specifically, the GASOPE method optimises the structure of the polynomial expressions, i.e. the GASOPE method determines the best term architecture, and the discrete least squares method is employed to obtain the coefficients of the terms in an expression.
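The division of labour described above, where the term structure is fixed first and discrete least squares then supplies the coefficients, can be illustrated with a short Python sketch. It solves the normal equations by Gaussian elimination with partial pivoting, which is a textbook route chosen here for self-containment; it is an assumption, not necessarily the solver used by the GASOPE implementation, and all names and data below are hypothetical.

```python
from typing import Dict, List

Pattern = Dict[str, float]
Structure = List[Dict[str, int]]  # one {attribute: exponent} map per term


def design_row(structure: Structure, pattern: Pattern) -> List[float]:
    """Evaluate every term (without its coefficient) for one pattern."""
    row = []
    for exponents in structure:
        value = 1.0
        for name, exponent in exponents.items():
            value *= pattern[name] ** exponent
        row.append(value)
    return row


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: terms are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_coefficients(structure: Structure, patterns: List[Pattern],
                     target: str) -> List[float]:
    """Least squares coefficients for a fixed term structure (normal equations)."""
    rows = [design_row(structure, p) for p in patterns]
    k = len(structure)
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * p[target] for r, p in zip(rows, patterns)) for i in range(k)]
    return solve(ata, aty)


# Hypothetical data sampled from y = 2*x^2 - 3*x + 1.
data = [{"x": float(x), "y": 2.0 * x * x - 3.0 * x + 1.0} for x in range(-2, 3)]
structure = [{"x": 2}, {"x": 1}, {}]
print(fit_coefficients(structure, data, "y"))
```

In an evolutionary setting only `structure` would be varied by the search; each candidate structure is then scored after its coefficients have been fitted in this way. For ill-conditioned term sets a QR or SVD based solver would be a more robust choice than the normal equations.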
A modified k-means clustering algorithm is utilised by the GASOPE method to continually draw stratified random samples of the training patterns during the learning process. This stratified random sampling strategy drastically reduces the computation time required by the method.

The GPMCC method evolves a population of individuals, which represent model trees. The models are obtained from a fragment pool, which implements a belief space of terminal symbols. The fragment pool implements mutation and crossover operators to adjust this belief space periodically. An iterative learning strategy is utilised by the GPMCC method in order to reduce the number of training patterns presented to the GPMCC learning process. In addition, the iterative learning strategy results in increased generalisation accuracy.

Experimentally, the GASOPE method was compared to a standard back-propagation neural network, which implemented gradient descent. The GASOPE method performed significantly better than the neural network in terms of both generalisation accuracy and training time. In fact, the GASOPE method trained orders of magnitude faster than the neural network. The GASOPE method was also shown to produce the best approximating polynomial structure for a given data set.

The GPMCC method was compared to both NeuroLinear and Cubist. Although the GPMCC method was not significantly less accurate than either NeuroLinear or Cubist, the GPMCC method was shown to generate a substantially smaller number of rules. The rules generated by the GPMCC method were also less complex than those of the other methods. Overall, the GPMCC method proved to be a very capable tool for the mining of continuous-valued classes.

5.2 FUTURE RESEARCH

Throughout this thesis a number of new directions for future research presented themselves.
These ideas are briefly summarised below: Both the GASOPE and the GPMCC methods of chapters 3 and 4 introduce a number of new parameters, some of which were determined experimentally to be fairly robust. These parameters can be directly optimised using the evolutionary strategies approach of section 2.2.3. This As was mentioned in section 3.5, polynomial structures are poor predictors of periodic data, particularly when applied to extrapolation tasks. This problem can be solved in one of two ways: Build an expression that utilises a periodic function such as cosine or sine, or use only linear predictors at the ends of the approximation interval. The speed of the GPMCC method is seriously affected by the recursive procedures used to perform fitness evaluation, crossover and mutation on the chromosomes (model trees). This problem can be solved in two ways: implement a model tree as an array or change the model tree representation to a production system. The attribute tests of the GPMCC method could be revised to implement non-linear separation boundaries between continuous classes. The GAS OPE method could then be used to efficiently approximate these non-linear separation boundaries. The fragment pool discussed in section 4.3.3 could be used to implement a function set for the separation boundaries, in a manner similar to that of the terminal set. Bibliography [1] H.A. Abass, RA. Saker, and C.S. Newton, editors. Data Mining: A Heuristic Approach. Idea Publishing Group, 2002. [2] G. Adomavicius and A. Tuzhilin. Using data mining methods to build customer profiles. Computer, 34(2):74-82,2001. [3] E. Alba, J.E Aldana, and J.M. Troya. Genetic algorithms as heuristics for optimizing ANN design. In RE Albrecht, C.R Reeves, and N.C. Steele, editors, Proceedings of the International Conference on Artificial Neural Nets and Genetic Algorithms, pages 683-690. Springer-Verlag, 1993. [4] K. Alsabti, S. Ranka, and V. Singh. 
An efficient parallel algorithm for high dimensional similarity join. In IP PS: 11th International Parallel Processing Symposium. IEEE Computer Society Press, 1998. [6] P.J. Angeline. Evolving predictors for chaotic time series. In S. Rogers, D. Fogel, J. Bezdek, and B. Bosacchi, editors, Proceedings of SPIE (Volume 3390): Application and Science of Computational Intelligence, pages 170-180, Bellingham, Washington, 1998. [7] T. Back, D.B. Fogel, and T. Michalewicz, editors. Evolutionary Computation 1: Advanced Algorithms and Operators. lOP Press, 2000. [8] T. Back, D.B. Fogel, and T. Michalewicz, editors. Evolutionary Computation 1: Basic Algorithms and Operators. lOP Press, 2000. [9] G. Ball and D. Hall. A clustering technique for summarizing multivariate data. Behavioral Science, 12:153-155, 1967. [10] E. Basson and A.P. Engelbrecht. Approximation of a function and its derivatives in feedforward neural networks. In IEEE International Joint Conference on Neural Networks. paper 2152, Washington D.C., 1999. [11] M.L. Beretta, D.G. Antoni, and A.G.B. Tettamanzi. Evolutionary synthesis of a fuzzy image compression algorithm. In Proceedings of the Fourth European Congress on Intelligent Techniques and Soft Computing, volume 1, pages 466-470, Aachen, Germany, September 1996. Verlag Mainz, Aachen (Germany). [12] J. Biles. Genjam: A genetic algorithm for generating jazz solos. In The Computer Music Association, Proceedings of the International Computer Music Council, 1994. [14] C. Blake, E. Keogh, and c.J. Merz. UCI repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, 1998. http://www.ics.uci.edu/rvmlearnMLRepository. [15] J. Bridle. Alphanets: A recurrent neural network architecture with a hidden markov model interpretation. Speech Communication, 9(1):83-92, 1990. [16] R.L. Burden and J.D. Faires. Numerical Analysis, 6th Edition. Brooks/Cole Publishing Company, 1997. [17] N. Cesa-Bianchi, P. 
Long, and M. Warmuth. Worst-case quadratic loss bounds for a generalization of the widrow-hoff rule. In Proceedings of the 6th Annual Workshop on Computer Learning Theory, pages 429-438, New York, 1993. ACM Press. [18] Y. Chen, Z. Nakao, and X. Fang. A parallel genetic algorithm based on the island model for image restoration. In Proceedings of the 1996 IEEE Signal Processing Society Workshop, pages 109-118, Kyoto, September 1996. IEEE Press. [19] T. Cibas, F. Fogelman Soulie, P. Gallinari, and S. Raudys. Variable selection with neural networks. Neurocomputing, 12:223-248, 1996. [20] P. Clark and R Boswell. Rule induction with CN2: Some recent improvements. In Proceedings of the Fifth European Working Session on Learning, pages 151-163, Berlin, 1991. Springer. [21] P. Clark and T Niblitt. The CN2 induction algorithm. Machine Learning, 3:261-283, 1989. [22] I. Cloete and J. Ludik. Increased complexity training. In Proceedings of the International Workshop on Artificial Neural Networks, pages 267-271, Berlin, 1993. SpringerVerlag. [23] I. Cloete and J. Ludik. Delta training strategies. In IEEE World Congress on Computational Intelligence, Proceedings of the International Joint Conference on Neural Networks, volume 1, pages 295-298, Orlando, June 1994. [24] I. Cloete and J. Ludik. Incremental training strategies. In International Conference on Artificial Neural Networks, volume 2, pages 743-746, Sorrento, Italy, May 1994. [25] D.A. Cohn. Neural network exploration using optimal experiment design. In AI Memo No 1491, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1994. [26] W. Daelemans, P. Berek, and S. Gillis. Linguistics as data mining: Dutch diminu- tives. In T Andemach, M. Moll, and A. Nijholt, editors, CLIN V, Papers from the Fifth Computational Linguistics in the Netherlands Meeting, pages 59-72,1995. [28] G.J. Deboeck and TK. Kohonen, editors. Visual Explorations in Finance: With SelfOrganizing Maps. Springer Verlag, 1998. [29] J.E. 
Dennis (Jr) and RE. Schnabel. Numerical methods for unconstrained optimization and nonlinear equations. Prentice Halls, Englewood Cliffs, New Jersey, 1983. [30] J. Eggermont, A.E. Eiben, and J.I. van Hemert. Adapting the fitness function in GP for data mining. In R Poli, P. Nordin, W.B. Langdon, and Te. Fogarty, editors, Genetic Programming, Proceedings of EuroGP'99, volume 1598, pages 193-202, Goteborg, Sweden, 26-27 1999. Springer-Verlag. [31] AP. Engelbrecht. A new pruning heuristic based on variance analysis of sensitivity information. IEEE Transactions on Neural Networks, 12(6):1386-1399,2001. [32] AP. Engelbrecht and R. Brits. to active learning. Supervised training using an unsupervised approach In Neural Processing Letters, pages 247-269. Kluwer Academic Publishers, Netherlands, 2002. [33] AP. Engelbrecht and I. Cloete. Feature extraction from feedforward neural networks using sensitivity analysis. In International Conference on Systems, Signals, Control and Computers, volume 2, pages 221-225, Durban, South Africa, 1998. [34] AP. Engelbrecht and I. Cloete. Selective learning using sensitivity analysis. In IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks, pages 1150-1155, Anchorage, Alaska, 1998. [35] AP. Engelbrecht and 1. Cloete. Incremental learning using sensitivity analysis. In IEEE International Joint Conference on Neural Networks, Washington D.C., 1999. paper 380. [36] S.S. Epp. Discrete Mathematics with Applications, Second Edition. Brooks / Cole Publishing Company, 1995. [37] K. Fanghanel, R. Hein, K. Kollmann, and H.C. Zeidler. Optimizing wavelet transform coding using a neural network. In Proceedings of the IEEE International Conference on Information, Communications and Signal Processing, volume 3, pages 1341-1343, Singapore, September 1997. [38] L. Fletcher, V. Katkovnik, F.E. Steffens, and AP. Engelbrecht. Optimizing the number of hidden nodes of a feedforward artificial neural network. 
In IEEE World Congress on Computational Intelligence, In proceedings of the International Joint Conference on Neural Networks, pages 1608-1612, Anchorage, Alaska, 1998. [39] D. B. Fogel. Evolutionary programming: an introduction and some current directions. Statistics and Computing 4, pages 113-129, 1994. [40] J.B. Fraleigh and RA Beauregard. Linear Algebra, 3rd Edition. Addison-Wesley Publishing Company, 1995. [41] AA Freitas. A genetic programming framework for two data mining tasks: Classifi- cation and generalized rule induction. In J.R Koza, K. Deb, M. Dorigo, D.B. Fogel, M. Garzon, H. Iba, and RL. Riolo, editors, Genetic Programming 1997: Proceedings of the Second Annual Conference, pages 96-101, Stanford University, 13-16 1997. Morgan Kaufmann. [42] B. Fritzke. Incremental learning oflocallinear mappings. In Proceedings of the Interna- tional Conference on Artificial Neural Netwo rks, pages 217-222, Paris, October 1995. [43] K. Fukumizu. Active learning in multilayer perceptrons. In M.C. Mozer and M.E. Hasselmo, editors, Advances in Neural Information Processing, volume 8, pages 295301, 1996. [44] S. Geman, E. Bienenstock, and R Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58,1992. [45] Y.L. Geom. Genetic recursive regression for modelling and forecasting real-world chaotic time series. Advances in Genetic Programming, 3:401-423, 1999. [46] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural network architectures. Neural Computation, 7:219-269, 1995. [47] N. Gockel, G. Pudelko, R Drechsler, and B. Becker. A hybrid genetic algorithm for the channel routing problem. In Proceedings of the International Symposium on Circuits and Systems, volume 2, pages 675-678, 1996. [48] D.E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Publishing Company, 1989. [49] G.E. Goldberg and K. Deb. A comparitive analysis of selection schemes used in genetic algorithms. 
In G.J.E Rawlins, editor, Foundations of Genetic Algorithms, pages 69-93. Morgan-Kaufman, 1991. [50] R. Haggarty. Fundamentals of Mathematical Analysis, 2nd Edition. Addison-Wesley Publishing Company, 1993. [52] B. Hassibi and D.G. Stork. Second order derivatives for network pruning. In C. Lee Giles, S.J. Hanson, and J.D. Cowan, editors, Advances in Neural Information Processing Systems, volume 5, pages 164-171, 1993. [53] R. Hinterding, H. Gielewski, and T.e. Peachey. The nature of mutation in genetic algorithms. In Larry Eshelman, editor, Proceedings of the Sixth International Conference on Genetic Algorithms, pages 65-72, San Francisco, 1995. Morgan Kaufmann. [54] Y. Hirose, K. Yamashita, and S. Hijiya. Back-propagation algorithm which varies the number of hidden units. Neural Networks, 8:1277-1299,1996. [55] J.H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975. [56] L. Holmstrom and P. Koistinen. Using additive noise in back-propagation training. IEEE Transactions on Neural Networks, 3(1):24-38, 1992. [57] K. Hornik. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989. [58] K. Hornik, M. Stinchcombe, and H. White. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3:551-560, 1990. [59] S. Horowitz and T. Pavlidis. Picture segmentation by a tree traversal algorithm. Journal of the Associationfor Computing Machinery, 23(2):368-388, April 1976. [60] A. Iserles. Generalized leapfrog methods. IMA Journal of Numerical Analysis, 6:381392,1986. [61] A. Ismail and A.P. Engelbrecht. Training product units in feedforward neural networks using particle swarm optimization. In V.B. Bajic and D. Sha, editors, Development and Practice of Articial Intelligence Techniques, Proceedings of the International Conference on Articial Intelligence, pages 36-40, Durban, South Africa, 1999. [62] M.A. Kaboudan. forecasting. 
Genetic evolution of regression models for business and economic In Proceedings of the Congress on Evolutionary Computation, volume 2, pages 1260-1268. IEEE Press, 1999.
[63] R. Kamimura and S. Nakanishi. Weight decay as a process of redundancy reduction. World Congress on Neural Networks, 3:486-489, 1994.
[64] S. Kaski. Data Exploration using Self-Organizing Maps. PhD thesis, Laboratory of Computer and Information Science, Department of Computer Science and Engineering, Helsinki University of Technology, 1997.
[65] T. Kohonen. Self-Organization and Associative Memory, 3rd Edition. Springer-Verlag, Berlin, Germany, 1989.
[67] J.R. Koza. Genetic evolution and co-evolution of game strategies. In The International Conference on Game Theory and Its Applications, Stony Brook, New York, July 1992.
[68] J.R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
[69] Y. Le Cun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel. Handwritten digit recognition with a back-propagation network. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2, pages 396-404. Morgan Kaufmann, San Mateo, 1990.
[70] Y. Le Cun, J. Denker, S. Solla, R.E. Howard, and L.D. Jackel. Optimal brain damage. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2, pages 598-605, San Mateo, 1990. Morgan Kaufmann.
[71] D.J.C. MacKay. Bayesian Methods for Adaptive Models. PhD thesis, California Institute of Technology, 1992.
[72] J. MacQueen. Some methods for classification and analysis of multivariate observations. In L.M. Le Cam and J. Neyman, editors, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1: Statistics, pages 281-297, Berkeley and Los Angeles, 1967. University of California Press.
[73] R.E. Marmelstein and G. Lamont. Pattern classification using a hybrid genetic program-decision tree approach. In J.R. Koza, editor, Genetic Programming 98: Proceedings of the Third International Conference, pages 223-231. Morgan Kaufmann, 1998.
[74] J.L. McClelland and D.E. Rumelhart. Training hidden units: The generalized delta rule. In Explorations in Parallel Distributed Processing, volume 3, chapter 5, pages 121-159. MIT Press, 1988.
[75] M.F. Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6:525-533, 1993.
[76] N. Nikolaev and H. Iba. Genetic programming using Chebishev polynomials. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 89-96, San Francisco, 2001. Morgan Kaufmann Publishers.
[77] N.J. Nilsson. Artificial Intelligence: A New Synthesis. Morgan Kaufmann Publishers, Inc., San Francisco, 1998.
[78] N. Ohnishi, A. Okamoto, and N. Sugie. Selective presentation of learning samples for efficient learning in multi-layer perceptron. In Proceedings of the IEEE International Joint Conference on Neural Networks, volume 1, pages 688-691, Washington D.C., January 1990.
[79] Y.H. Pao and Y. Takefuji. Functional-link net computing: Theory, system architecture, and functionalities. Computer, 25(5):76-79, May 1992.
[80] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical recipes: The art of scientific computing. Cambridge University Press, 1986.
[81] J.R. Quinlan. Learning efficient classification procedures and their application to chess endgames. In R.S. Michalski, J.G. Carbonell, and T.M. Mitchell, editors, Machine Learning: An Artificial Intelligence Approach, volume 1, pages 463-482. Tioga Press, Palo Alto, 1983.
[82] J.R. Quinlan. Learning with continuous classes. In Adams and Sterling, editors, Proceedings of Artificial Intelligence '92, pages 343-348, Singapore, 1992. World Scientific.
[83] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, 1993.
[84] J.R. Quinlan. Rulequest research data mining tools. RuleQuest Research Pty Ltd, Australia, 2002. http://www.rulequest.com/see5-info.html.
[85] I. Rechenberg. Evolution strategy. In J.M. Zurada, R. Marks II, and C. Robinson, editors, Computational Intelligence - Imitating Life, pages 147-159. IEEE Press, 1994.
[86] R.G. Reynolds. An introduction to cultural algorithms. In Proceedings of the Third Annual Conference on Evolutionary Computing, pages 131-139, 1994.
[87] C. Sage. An overview of radiofrequency/microwave radiation studies relevant to wireless communications and data. In Proceedings of the International Conference on Cell Tower Siting, pages 90-105, Salzburg, Austria, June 2000.
[89] F. Schweitzer, H. Rose, W. Ebeling, and O. Weiss. Optimization of road networks using evolutionary strategies. Evolutionary Computation, 5(4):419-438, 1997.
[90] R.S. Sellar, M.A. Stelmack, S.M. Batill, and J.E. Renaud. Response surface approximations for discipline coordination in multidisciplinary design optimization. In American Institute of Aeronautics and Astronautics, number 96 in 1383, 1996.
[91] R. Setiono. Generating linear regression rules from neural networks using local least squares approximation. In J. Mira and A. Prieto, editors, Connectionist Model of Neurons, Learning Processes, and Artificial Intelligence, Proceedings of the 6th International Work-Conference on Artificial and Natural Neural Networks, volume I, pages 277-284, Granada, Spain, June 2001. Springer.
[92] R. Setiono and L.C.K. Hui. Use of quasi-Newton method in a feedforward neural network construction algorithm. IEEE Transactions on Neural Networks, 6(1):273-277, 1995.
[93] R. Setiono and W.K. Leow. Pruned neural networks for regression. In R. Mizoguchi and J. Slaney, editors, Proceedings of the 6th Pacific Rim Conference on Artificial Intelligence, PRICAI 2000, Lecture Notes in AI 1886, pages 500-509, Melbourne, Australia, 2000. Springer.
[94] R. Setiono, W.K. Leow, and J.M. Zurada. Extraction of rules from artificial neural networks for nonlinear regression.
IEEE Transactions on Neural Networks, 13(3):564-577, 2002.
[95] E. Spertus. Smokey: Automatic recognition of hostile messages. In Proceedings of Innovative Applications of Artificial Intelligence (IAAI), pages 1058-1065, 1997.
[96] A.G.W. Steyn, C.F. Smit, S.H.V. du Toit, and C. Strasheim. Modern Statistics in Practice. J.L. van Schaik, 1996.
[97] C.M. Vest. Biotech research presents a major opportunity for Massachusetts economy. Boston Sunday Globe, 11 August 2002.
[98] A.S. Weigend, D.E. Rumelhart, and B.A. Huberman. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems, volume 3, pages 875-882, 1991.
[99] P. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
[100] L.F.A. Wessels and E. Barnard. Avoiding false local minima by proper initialization of connections. IEEE Transactions on Neural Networks, 3(6):899-905, 1992.
[101] P.M. Williams. Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7:117-143, 1995.
[102] S.W. Wilson. Function approximation with a classifier system. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 974-981, San Francisco, 2001. Morgan Kaufmann Publishers.
[103] D.H. Wolpert and W.G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67-82, April 1997.
[104] J. Yao, Y. Li, and C. Tan. Forecasting the exchange rates of CHF vs USD using neural networks. In Journal of Computational Intelligence in Finance, volume 5, pages 7-13, 1997.
[105] Z. Zheng. Constructing New Attributes for Decision Tree Learning. PhD thesis, Basser Department of Computer Science, The University of Sydney, Australia, 1996.
[106] J.M. Zurada. Introduction to Artificial Neural Systems. PWS Publishing Company, 1992.

Appendix A SYMBOLS

Symbol         Meaning
a              Input attribute
b              Target output
b              Vector of target outputs
b·             Predicted output
c              Number of classes
d              Model complexity
connections    Artificial neuron inputs
f_AN           Artificial neuron activation function
f'_AN          Artificial neuron activation function derivative
g              Evolutionary computing generation counter
h              Artificial neural network hidden unit iterator
i              Artificial neural network input unit iterator or pattern iterator
j              Term iterator
k              Number of clusters, classes, coefficients or complexity (clear from context)
l              Pattern iterator
m              Total number of input attributes
n              Generic maximum, mostly used to mean the maximum order of a polynomial
net            Artificial neuron weighted sum variable
o              Test outcomes
p              Maximum number of terms
r              Real-valued coefficient
r              Vector of real-valued coefficients
s              Sample size
t              Artificial neural network output unit iterator
v              Possible attribute value
v              Artificial neural network output weight vector
w              Artificial neural network weight
w              Cluster centroid vector or artificial neural network hidden weight vector
x              Input value
y              Target output
y.             Predicted output
A              Attribute or attribute matrix (clear from context)
C_δ            Cluster
C_ψ            Class
C_subscript    Antecedent term
E              Error term
E_MA           Mean absolute error
E_MS           Mean squared error
E_Q            Quantisation error
E_SS           Sum squared error
F_ω            Fragment in a fragment pool
G              Evolutionary computing population
G_GA           Genetic algorithm population
G_FP           Fragment pool
G_GP           Genetic program population

Symbol         Meaning
H              Total number of hidden units in an artificial neural network
I              Total number of input units in an artificial neural network
I_α            Parent individual to be involved in crossover
I_β            Parent individual to be involved in crossover
I_γ            Child individual resulting from crossover
I_ω            Individual in a GASOPE population or fragment pool
I_χ            Individual in a GPMCC population
M              Maximum number of generations
L              3-piece piecewise linear approximation
N              Maximum number of individuals
N_χ            An arbitrary node in I_χ
O              Evolutionary computing temporary population
P              Training set or penalty term (clear from context)
Q              A set of patterns belonging to some case
R              A set of patterns covered by a rule
S              Sample set
T              Total number of output units in an artificial neural network
T_ξ            One of the sets of term-coefficient mappings in an individual I_ω
U_subscript    User defined parameter
X              Test
α              Artificial neural network momentum term
α_subscript    Intercept
β_subscript    Gradient
δ              Cluster iterator
ε              Artificial neural network epochs iterator
ε_subscript    Penalty terms
η              Artificial neural network learning rate
θ              Artificial neuron threshold
λ              Control parameter for the sigmoid function
λ_subscript    Natural valued order
ξ              Term-coefficient iterator
σ              Artificial neural network error signal
τ              Input attribute iterator or bias (clear from context)
PI             Pattern
φ              Outcome iterator
χ              Population iterator for a genetic program
ψ              Class iterator
ω              Individual iterator
