Neural networks for time series analysis

Submitted in fulfilment of part of the requirements for the degree Magister in Mathematical Statistics, Department of Statistics, Faculty of Science, University of Pretoria

Acknowledgements

2. Dr Johan Holm, my co-promoter, for his advice and encouragement, especially in the field of neural networks.

Summary

Promoter: Dr H Boraine
Co-promoter: Dr JEW Holm

The analysis of a time series is a problem well known to statisticians. Neural networks form the basis of an entirely non-linear approach to the analysis of time series. They have been widely used in pattern recognition, classification and prediction. Recently, reviews from a statistical perspective were done by Cheng and Titterington (1994) and Ripley (1993).

One of the most important properties of a neural network is its ability to learn. In neural network methodology, the data set is divided into three different sets, namely a training set, a cross-validation set, and a test set. The training set is used for training the network with the various available learning (optimisation) algorithms. Different algorithms will perform best on different problems. The advantages and limitations of the different algorithms in respect of various training problems are discussed.

In this dissertation the method of neural networks and that of ARIMA models are discussed. The procedures of identification, estimation and evaluation of both models are investigated. Many of the standard techniques in statistics can be compared with neural network methodology, especially in applications with large data sets. To illustrate the adaptability of neural networks, the problem of forecasting is considered. A few alternative theoretical approaches to obtain forecasts for non-linear time series models are discussed. It is shown that bootstrap methods can be used to calculate predictions, standard errors and prediction limits for the forecasts. Neural network methodology is applied to predict an electricity consumption time series.

Opsomming

Promoter (Studieleier): Dr H Boraine
Co-promoter (Medestudieleier): Dr J.E.W. Holm

Submitted in fulfilment of part of the requirements for the degree Magister in Mathematical Statistics. The analysis of time series is a problem well known to statisticians. Neural networks form the basis of an entirely non-linear approach to the analysis of a time series. They are used in a wide spectrum of problem areas, for example pattern recognition as well as classification and forecasting. Only recently have neural networks been considered from a statistical perspective, by researchers such as Cheng and Titterington (1994) and Ripley (1993). One of the most important properties of neural networks is the ability to learn. Neural network methodology entails dividing the data set into three parts, namely a training set, a cross-validation set and a test set. The training set is used to train the network by means of various learning algorithms. Certain algorithms perform better than others, depending on the type of problem. The advantages and disadvantages of the learning algorithms are discussed. In this dissertation the methodology of neural networks and that of ARMA models is discussed. The procedures of identification, estimation and evaluation of both models are discussed. Many of the standard techniques in statistics are compared with neural network methodology, specifically in applications with large data sets. To illustrate the adaptability of neural networks, the problem of forecasting is considered.
A few alternative theoretical approaches to obtaining forecasts for non-linear time series are discussed. It is shown that bootstrap methods can be used to calculate standard errors and intervals for forecast values. Neural network methodology is applied to obtain forecasts for an electricity consumption time series.

Symbol Commentary

x : a set of input variables, for example Y_{t-1}, Y_{t-2}, temperature, days, ...
E(w) : error function, E(w) = (1/2) Σ_{j=1}^{N} {y_j − f(x_j; w)}²
H : Hessian matrix of the error function, (H)_{ij} = ∂²E(w)/∂w_i ∂w_j
b : gradient of the error function, b = ∂E(w)/∂w evaluated at w_0
Error function used for maximum likelihood estimation

Contents

Chapter 2: Neural Networks
  2.1 Introduction
  2.2 Historical Development
  2.3 Artificial Neural Networks
  2.4 Relevance of Neural Networks in Research
  2.5 The Model of a Neural Network
  2.6 Architecture of a Neural Network
    2.6.1 Single-Layer Feedforward Networks
    2.6.2 Multi-Layer Feedforward Networks
    2.6.3 Recurrent Networks
  Appendix: Terminology
Chapter 3: Learning in Neural Networks
  3.1 Introduction
  3.2 The Error Function
  3.3 Taylor Series Approximations
  3.4 Training
    3.4.1 The Method of Gradient Descent
    3.4.2 Superlinear Methods
    3.4.3 The Newton Methods
  3.5 Cross-Validation as Part of the Learning Process
  3.6 Conclusions
Chapter 4: ARMA Processes and Neural Network Models
  4.1 Introduction
  4.2 The ARMA(p,q)-Process
  4.3 Identification
  4.4 Estimation
    4.4.1 Maximum Likelihood Estimation (MLE)
    4.4.2 Estimation when using Neural Networks
  4.5 Evaluation
Chapter 5: Forecasting a Time Series Using Neural Networks
  5.1 Introduction
  5.2 The Forecasting Model
  5.3 Non-linear Forecasting
    5.3.1 The Naive Method
    5.3.2 Closed Form
    5.3.3 Monte Carlo
    5.3.4 Bootstrap
    5.3.5 Direct Method
  5.4 The Bootstrap Technique
    5.4.1 Prediction Limits for Autoregressive Models
    5.4.2 Multi-step Forecasts for Neural Network Models
    5.4.3 Prediction Limits for Y_{n+h}
Chapter 6: Forecasting Electricity Time Series by using Neural Networks
  6.1 Introduction
  6.2 Model Identification
  6.3 Estimation
  6.4 Evaluation of the Model
  6.5 Forecasting by using the Neural Network Model
    6.5.1 The Naive Method
    6.5.2 The Direct Method
    6.5.3 The Bootstrap Method
  Appendix: Fortran Computer Programme
Chapter 7: Conclusions and Recommendations for Further Research
References

Detecting trends and patterns in a time series traditionally involves statistical methods such as clustering and regression analysis. However, these methods are linear and may fail to forecast a non-linear data set. Neural networks form the basis of a different approach to the non-linear analysis of time series. This study investigates the use of a linear ARIMA time series model in comparison with a non-linear neural network model for application in forecasting.

Chapter 2 deals with the history and development of neural networks. This is done in the framework of a variety of problems, such as classification, pattern recognition and forecasting. The model of a neural network is illustrated along with different types of applied neural networks. Of all the possible networks, multi-layered feedforward neural networks are used in this dissertation. A very important property of a neural network is its ability to learn, which is the phase during which parameter estimation takes place. The data set is divided into three different sets, namely a training set, a cross-validation set, and a test set. Neural networks are data driven in the sense that they use data to train and build statistical models of a recorded process.
The training set is used for training the network with the various available learning (optimisation) algorithms; these algorithms are treated in Chapter 3. Optimisation is the process in which an error function, E, is minimised. The cross-validation set is used to prevent the network from over-training. Over-training occurs when the network is able to represent the training data very well but without the ability to generalise. In well-behaved optimisation problems, the value of the error function decreases steadily as the parameters or weights are adjusted. The algorithm will terminate when a specified criterion is met, such as a required degree of accuracy. A stopping criterion that is often used in neural network methodology to prevent over-training is the value of the error function calculated for the cross-validation set (a data set not used for the estimation of the parameters of the model). If this value shows an upward trend while the corresponding value for the training set decreases, over-training is indicated. The test set does not form part of the learning process and is not considered in the estimation of the model.

Further in Chapter 3, different optimisation algorithms are discussed. It is emphasised that it is not possible to highlight one specific algorithm, because all algorithms have advantages and limitations.

Chapter 4 reviews the ARMA(p,q)-process. The different stages of modelling, namely identification, estimation by means of maximum likelihood, and evaluation, are discussed. This is illustrated by examples where both an AR(2) model and a neural network model are fitted to two time series: a computer-generated AR(2) time series and an electricity consumption time series. In Chapter 6 the model used to describe the electricity consumption is extended in order to describe complex seasonal patterns.

Time series models are often used for forecasting. Since neural network models are non-linear, minimum mean squared error forecasts are not of a simple form. In Chapter 5 a procedure is proposed to calculate one- and multi-step forecasts, as well as prediction limits for the forecasts, based on bootstrap methodology. In Chapter 6 more complex models are considered to fit the electricity consumption. The aim of Chapter 6 is to demonstrate the use of neural networks as a statistical tool for forecasting. Forecasting is done for one to twelve and twenty-four hours ahead. Three of the forecasting procedures discussed in Chapter 5 are also implemented. Chapter 7 concludes this work and suggests three extensions to this research, namely the inclusion of bootstrap methods in new software tools, comparative studies, and applications.

An introduction to neural networks will be given in this chapter. In Section 2.2 the history of Artificial Neural Networks (ANN), their development, the different fields of application, as well as the connections that have been made between Neural Networks and Statistics, are discussed. To conclude the section, the connection between a biological neuron and the neural network is illustrated. The capabilities and properties of neural networks are given in Sections 2.3 and 2.4. The construction of a neural network as well as network architectures are discussed in Section 2.5 and Section 2.6.

Interest in the term Artificial Neural Networks first arose when McCulloch and Pitts (1943) introduced simplified neurons which represented biological neurons and which could perform computational tasks.
In 1969 Minsky and Papert (1969) published their book Perceptrons, which focussed on the deficiencies of perceptron models. Unfortunately this caused many researchers to leave the field. Interest in neural networks re-emerged in the early eighties after the publication of several important theoretical results. This renewed interest is clearly visible in the number of societies and journals associated with neural networks. The INNS (International Neural Network Society) is widely known, and has counterparts ranging from Europe (ENNS) to Japan (JNNS). The field of Electrical and Electronic Engineering is served by the IEEE journal series, and the field of Agriculture by the journal Neural Network Application Agriculture (NNAA). In the field of Economics and Finance there are the Journal of Economics and Complexity and the Journal of Computational Intelligence in Finance, among others. Other well-known journals include Neural Networks and Neural Computation. Several conferences are held regularly; for example, the World Congress on Neural Networks (WCNN'95) took place in Washington in 1995. A large variety of software packages is available on the market today.

Neural networks are popular because of the many application fields in which they can be used. Many articles have been written in recent years in several interesting fields, for example on classification (Ripley, 1994) and pattern recognition, where applications include the automatic reading of handwriting and speech recognition (Park et al., 1991). Another field of application for neural networks is prediction. In the world of finance, Tam and Kiang (1992) showed that the neural network is a promising method for predicting bankruptcy. In the medical world neural networks are used, for instance, to predict mortality following cardiac surgery (Orr, 1995).

Several articles have been published in which statistical principles are used with neural networks. Cheng and Titterington (1994) pointed out some links between statistics and neural networks and encouraged cross-disciplinary research to improve results. De Jongh and De Wet (1991) introduced neural networks as a powerful tool that is close to statistical methods in solving various problems like pattern recognition, classification and forecasting. The neural network and statistical literature contain many of the same concepts, but usually with different terminology (Sarle, 1996). A list of statistical and neural network terminology is given in the Appendix to this chapter. In the following paragraphs a complete discussion will be given of what a neural network is and how it compares with a biological neural network.

Artificial Neural Networks (ANN) are computer algorithms or computer programs developed to model the activity of the brain (Stern, 1996:205). The term "neural network" originates from the attempt by scientists to imitate the ability of the brain to recognise patterns and to classify; this is the reason why they are called "Artificial Neural Networks". Scientists prefer to refer to ANN simply as NN (neural networks), and in this dissertation they will also be referred to only as neural networks. The brain is a highly complex, non-linear and massively parallel computer. It can perform computations like pattern recognition, perception and motor control many times faster than any digital computer. This is possible because of the existence of neurons. Neurons are simple elements in the brain connected to each other to form a network of neurons.
In Figure 2.1 the biological neuron is illustrated. A neuron is activated by other activated neurons and, in turn, activates further neurons. Synapses are units that mediate the interaction between the neurons; a synapse can either excite or inhibit a receptive neuron. If a neuron receives an excitatory signal from a neighbouring neuron, the neuron's soma potential rises above some threshold value and the neuron "fires". This in turn sends a signal to the next neuron, and so on. New synaptic connections between neurons and the modification of existing synapses constantly take place as the brain learns. The axons are the transmission lines, and the dendrites are the receptive zones. The neural network resembles the human brain in two ways, namely that the network has to learn to acquire knowledge, and that the knowledge is then stored in synaptic weights, which represent the strengths of the connections between neurons (Haykin, 1994:2).

For classification problems discriminant analysis is traditionally used, and linear regression analysis is commonly used for prediction. Why would one use neural networks when there are so many other methods available? In this section only a few reasons mentioned by Haykin (1994) are given. Firstly, a neural network is built from interconnections between neurons which define a non-linear relationship between inputs and outputs, whereas the above-mentioned techniques are linear. Secondly, neural network methodology is non-parametric, because the network learns from examples by constructing an input-output mapping for the specific problem. Although there are usually many parameters in a neural network model, these parameters are essentially artefacts of the fitting process. The emphasis is therefore not on interpreting their values but on finding the optimal set for the classification or prediction problem.

One of the training methods is supervised learning. This involves changing the synaptic weights by presenting a training set of examples to the network. Every time the weights are changed, an error signal is computed. This error signal is defined as the difference between the actual response of the network and the desired response. The moment the error signal reaches a minimum value, the desired weights for the specific problem have been obtained. Neural networks are therefore adaptable. A more detailed discussion of supervised learning as a training method is given in Chapter 3.

In classification problems, neural networks are not only designed to give information on which specific pattern to select, but also on the confidence in the decision made. Evidential response is thus an important property of neural networks in classification problems. Neural networks are fault tolerant in the sense that, if a neuron or its connecting links are damaged, the performance of the neural network degrades gradually rather than suddenly. Another important issue is that the neural network approach is computationally intensive, but with the constant improvement in computer technology it is becoming cheaper and faster to do the calculations. Lastly, neural networks are perhaps more user-friendly than other methods, because the user only has to differentiate between the independent (input) and dependent (output) variables before selecting a model. There are other broad classes of neural networks, such as Hopfield associative networks, linear networks, probabilistic neural networks and radial basis function neural networks.
A feedforward neural network, as used in this dissertation, is a name used for a wide variety of mathematical models that define the connection between a number of explanatory variables and one or more dependent variables. In Section 2.5 it is shown that the architecture of the model depicts the mathematical layout of the neural network. As mentioned in Section 2.2, the neurons are the elements that are connected to each other to form the network.

In Figure 2.2 the non-linear model of a neuron is illustrated and its different components are given. Let x_1, x_2, ..., x_p be a set of independent or explanatory variables and let y_k be the non-linear function of these variables produced by neuron k.

Figure 2.2: The non-linear model of a neuron, showing the input signals, the synaptic weights, the linear combiner, the activation function and the output signal y_k (Haykin, 1994:8)

The x_j's are the input signals. Each is connected to a weight or strength of its own: a signal x_j at the input of synapse j connected to neuron k is multiplied by the synaptic weight w_kj. The first subscript refers to the neuron in question and the second subscript refers to the input end of the synapse to which the weight refers. If the weight is positive, the associated synapse is excitatory; if negative, the synapse is inhibitory. The threshold, θ_k, is a bias or constant parameter added as an extra input whose value is in most cases chosen as one. The non-linear regression model can be described as

    y_k = φ( Σ_{j=1}^{p} w_kj x_j − θ_k ) + ε_k,

where y_k is the output, w_kj, j = 1, 2, ..., p, are the parameters or weights associated with each input value x_j, j = 1, 2, ..., p, and ε_k is the residual term. The linear combiner, u_k = Σ_{j=1}^{p} w_kj x_j, sums the input signals x_j, weighted by the respective synapses of the neuron, and the activation potential of the neuron is

    v_k = Σ_{j=1}^{p} w_kj x_j − θ_k = u_k − θ_k.

The activation function, φ, transforms the amplitude of the output of a neuron. Normally the amplitude will lie in the interval (0.0; 1.0) or (−1.0; 1.0), depending on the type of activation function. Three types of activation function, namely the threshold function, the piecewise linear function and the sigmoidal function, are discussed.

The threshold function is

    φ(v) = 1 if v ≥ 0, and 0 if v < 0.

The piecewise linear function is

    φ(v) = 1 if v ≥ 0.5, v if −0.5 < v < 0.5, and 0 if v ≤ −0.5.

The sigmoidal function is the most popular activation function. The fact that it is a differentiable function is especially important in the training of neural networks; the reason for this will be discussed later. An example of a sigmoidal function is the logistic function

    φ(v) = 1 / (1 + exp(−a v)).

Note that these activation functions range from 0 to 1. In Figure 2.5 the sigmoidal function is illustrated for different values of a.

The threshold has the effect of lowering the net input to the activation function. If the opposite effect is needed, a bias term can be implemented; the bias and the threshold are therefore numerically the same but with different signs. The output of the neuron can be written as

    y_k = φ(v_k),

where y_k is the output signal, φ(·) is the activation function and θ_k is the threshold. The use of the threshold θ_k has the effect of applying an affine transformation to the output u_k of the linear combiner; see Figure 2.6.
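The activation functions and the neuron model above are straightforward to compute. The following short Python fragment is an illustrative sketch only (the calculations in the dissertation itself were done with SAS, Fortran and Basic ModelGen); it implements the threshold, piecewise linear and logistic functions and the output of a single neuron y_k = φ(Σ_j w_kj x_j − θ_k), with the input values, weights and threshold chosen arbitrarily for the example.

    import numpy as np

    def threshold(v):
        # phi(v) = 1 if v >= 0, else 0
        return np.where(v >= 0, 1.0, 0.0)

    def piecewise_linear(v):
        # phi(v) = 1 if v >= 0.5, v if -0.5 < v < 0.5, 0 if v <= -0.5
        return np.where(v >= 0.5, 1.0, np.where(v <= -0.5, 0.0, v))

    def logistic(v, a=1.0):
        # phi(v) = 1 / (1 + exp(-a v)); differentiable, hence convenient for training
        return 1.0 / (1.0 + np.exp(-a * v))

    def neuron_output(x, w, theta, activation=logistic):
        # v_k = sum_j w_kj x_j - theta_k ;  y_k = phi(v_k)
        v = np.dot(w, x) - theta
        return activation(v)

    # a neuron with three (arbitrary) inputs and weights
    x = np.array([0.2, -1.0, 0.7])
    w = np.array([0.5, 0.1, -0.3])
    print(neuron_output(x, w, theta=0.1))

Changing the slope parameter a in the logistic function reproduces the family of curves shown in Figure 2.5.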
Figure 2.6: Affine transformation produced by the presence of a threshold: the total internal activity level v_k plotted against the linear combiner's output u_k for θ_k < 0, θ_k = 0 and θ_k > 0 (Haykin, 1994:9)

Whether the threshold θ_k is positive or not, the relationship between the activation potential v_k of neuron k and the linear combiner output u_k remains the same straight line, merely shifted by θ_k.

A neural network is an interconnection of neurons. Neurons can be connected in many different ways to define different neural network architectures. In this section the architecture of the neural network is illustrated along with the different classes of neural network architectures. Neural networks are mostly illustrated as block diagrams in which the various elements of the model are described. There are four different classes of network architectures, namely single-layer feedforward networks, multi-layer feedforward networks, recurrent networks and lattice structures. Only the first three will be discussed.

When a network is organised in the form of layers, it is referred to as a layered network. A linear regression model is equivalent to a feedforward neural network in its simplest form, with an input and an output layer and a linear activation function. The simplest feedforward neural network thus has only an input and an output layer. In Figure 2.7 a feedforward network with a single layer of neurons is illustrated.

Figure 2.7: Feedforward network with a single layer of neurons (Schalkhoff, 1992:250)

A multi-layer feedforward network consists of an input layer, one or more hidden layers and an output layer. The computation nodes of the hidden layer are called the hidden neurons or units. This type of network defines a more complicated model. The relationship between the inputs and outputs can be non-linear functions of non-linear functions, so that the degree of non-linearity increases with each additional layer. In Figure 2.8 a fully connected feedforward network with one hidden layer is illustrated. In a fully connected feedforward network, every neuron in the input layer is connected to every neuron in the following hidden layer, and every neuron in the last hidden layer is connected to every neuron in the output layer (Schalkhoff, 1992:236-258).

Figure 2.8: Fully connected feedforward network with one hidden layer and an output layer (De Jongh and De Wet, 1994:5)

There are many fields of application for the multi-layered feedforward network, such as classification, pattern recognition and forecasting. The following example illustrates a time series prediction problem. Let y_t, y_{t-1}, y_{t-2}, ... be a stationary time series and let y_{t+1} be the value to be predicted. Then a set of d such values, y_{t-d+1}, ..., y_t, can be selected from the time series to be the inputs to a feedforward network, and the next value, y_{t+1}, is used as the target for the output. In Figure 2.9 a feedforward neural network with one hidden layer is illustrated (Bishop, 1996:303).

Figure 2.9: Feedforward neural network with one hidden layer used for time series prediction, with inputs y_{t-d+1}, ..., y_{t-1}, y_t and output y_{t+1}

A recurrent network differs from a feedforward network in the sense that it has at least one feedback loop. If feedback exists in a neural network, it can be shown that its output depends on all previous input values. Let y_1, y_2, ..., y_N be a time series. Suppose that a recurrent network architecture is used to define a model for the time series, and that the observation at time t can be written as a function of y_{t-1} and the previous output value ŷ_{t-1}, which is characteristic of a recurrent neural network model.
By recursive substitution it can be shown that y_t can be written as a function of y_{t-1}, y_{t-2}, ...:

    y_t = f(y_{t-1}, ŷ_{t-1}, w) + ε_t = f(y_{t-1}, f(y_{t-2}, ŷ_{t-2}, w), w) + ε_t = ...

If y_t is generated by a moving average process (see Section 4.2), it can be expressed as a linear function of an infinite number of previous time series observations y_{t-1}, y_{t-2}, ... (see 4.2-7). A recurrent network will therefore be suitable in the case of any process with moving average terms. Basic ModelGen™ 1.0 of Crusader Systems, the software package that was used in this study, offers both feedforward and Elman networks, which are recurrent. In Figure 2.10 the Elman network is illustrated.

Figure 2.10: An Elman recurrent network (Basic ModelGen™ 1.0, 1997:53-54)

In an Elman network one or more of the hidden units from the previous time step are used as an input at the next time step. This is useful when the output depends on the history of the inputs, which is the case in this study. O'Brien (1997) uses a feedforward neural network and an Elman recurrent neural network to extract knowledge of a physical system.

A neural network can thus be compared with a non-linear model which defines a connection between certain input (explanatory or independent) variables and output (dependent) variables. Neural networks are used for different applications such as classification and prediction. In order to predict or classify, the parameters of the non-linear model, known in the neural network literature as weights, first have to be estimated. The estimation of the weights is referred to as "training" the network. Chapter 3 investigates the various methods of training a network.

Below is a list of neural network terms and the corresponding statistical terms (Sarle, 1996:1-5).

Neural network term | Statistical term
Architecture | Model
Training, Learning, Adaptation | Estimation, Model fitting, Optimisation
Classification | Discriminant analysis
Mapping, Function approximation | Regression
Supervised learning | Regression, Discriminant analysis
Unsupervised learning, Self-organization | Principal components, Cluster analysis, Data reduction
Training set | Sample, Construction sample
Test set, Validation set | Hold-out sample
Input | Independent variables, Predictors, Regressors, Explanatory variables, Carriers
Output | Predicted values
Forward propagation | Prediction
Training values, Target values | Dependent variables, Responses, Observed values
Training pair | Observation containing both inputs and target values
Errors | Residuals
Noise | Error term
Generalisation | Interpolation, Extrapolation, Prediction
Prediction | Forecasting
Squashing function | Bounded function with infinite domain
Error bars | Confidence intervals
Weights, Synaptic weights | (Regression) coefficients, Parameter estimates
Bias | Intercept
Bias | The difference between the expected value of a statistic and the corresponding true value
Backpropagation | Computation of derivatives for a multilayer perceptron and various algorithms based thereon
Least mean squares (LMS) | Ordinary least squares (OLS)

One of the most important properties of a neural network is its ability to learn. In neural network methodology, the data set is divided into three different sets, namely a training set, a cross-validation set, and a test set. The training set is used for training the network with the various available learning (optimisation) algorithms. These are discussed in this chapter. Section 3.2 defines the error function associated with neural network learning.
In Section 3.3 and Section 3.4 methods based on first and second order Taylor expansions are reviewed, including the gradient descent method, Newton methods, conjugate gradient and quasi-Newton methods. The powerful superlinear methods, namely the Levenberg-Marquardt algorithm and Snyman's leap-frog method, which involve combinations of first and second order Taylor expansions, are also discussed. The procedure of cross-validation, which forms part of the learning process, is described in Section 3.5. Different algorithms will perform best on different problems. It is therefore not possible to single out one specific algorithm. The advantages and limitations of the different algorithms in respect of different training problems are discussed.

Optimisation is the process in which an error function E is minimised. There are many possible error functions, of which the sum of squares error function is the most widely used. The error is a function of the weights in the network as well as of the training data. For a multi-layered perceptron (MLP) (discussed in Chapter 2), the derivatives of an error function with respect to the weights can be obtained efficiently by using backpropagation. This gradient information is of central importance in the use of algorithms for network training.

There are four main types of stationary point at which the local gradient of an error function can vanish. Figure 3.1 illustrates this point by showing the value of an error function E against a single weight w. Point A is called a local minimum because it is not the global minimum value in the error space. Point B is a local maximum, while point C is called a saddle point. A further feature is a region where the error function is almost flat; some algorithms tend to get stuck on such a flat surface for long periods. This behaviour is also found near local minima and can lead to premature termination of the optimisation algorithm. The desired error minimum is point D, the global minimum, because the value of the error function is smallest at this point.

To illustrate basic terminology, the simple case where the error function depends on only two weights, w_1 and w_2, is considered. The problem is to find values for w_1 and w_2 where the error function reaches a global minimum. Figure 3.2 is a geometrical representation of an error function E(w) as a surface lying above the weight space. Point A represents a local minimum while point B represents the global minimum of the error function. If C is any point on the error surface, then the local gradient at C is ∇E, the gradient of the error function with respect to the weights. The gradient in Figure 3.2 will thus be the two-dimensional vector of partial derivatives with respect to the weights:

    ∇E = (∂E/∂w_1, ∂E/∂w_2)'.

Neural networks usually have many weight parameters, which may lead to extended training times. As a result, effective algorithms are necessary to find a suitable local minimum of the error function in the shortest time possible. The parameters or weights are not identifiable, because more than one set of parameters can give the same error function value. For instance, a two-layered neural network with M hidden neurons exhibits a symmetry factor of M!2^M (Bishop, 1996:256) equivalent points which generate the same network mapping and therefore give the same value for the error function. It is clear that the error function therefore does not have a unique global minimum. As a result, the specific values of the weights are not important.
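The weight-space symmetry described above is easy to verify numerically. The Python sketch below is an added illustration (not part of the original study): it evaluates a small one-hidden-layer network with logistic hidden units, permutes the hidden neurons together with their weights, and confirms that the output, and hence the value of any error function computed from it, is unchanged. Together with sign-flip symmetries of the hidden units this accounts for the M!2^M equivalent weight vectors mentioned above.

    import numpy as np

    def mlp_output(x, W1, b1, w2, b2):
        # one-hidden-layer network: z = logistic(W1 x + b1), y = w2'z + b2
        z = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))
        return w2 @ z + b2

    rng = np.random.default_rng(0)
    M, p = 3, 4                      # M hidden neurons, p inputs
    W1 = rng.normal(size=(M, p))     # input-to-hidden weights
    b1 = rng.normal(size=M)          # hidden thresholds
    w2 = rng.normal(size=M)          # hidden-to-output weights
    b2 = rng.normal()                # output threshold
    x = rng.normal(size=p)

    perm = np.array([2, 0, 1])       # relabel the hidden neurons
    y_original = mlp_output(x, W1, b1, w2, b2)
    y_permuted = mlp_output(x, W1[perm], b1[perm], w2[perm], b2)
    print(np.isclose(y_original, y_permuted))   # True: different weights, same mapping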
When the error surface is locally convex, the error function can be written as a Taylor series expansion in the vicinity of a local minimum. Optimisation algorithms can be classified according to their relation to the terms in the Taylor expansion. Four categories can be distinguished. Firstly, there are methods based on zero order Taylor expansions, such as the simplex and random walk methods. These methods are well treated in different texts and will not be of interest in this work (Fletcher, 1987). A second category includes methods based on first order Taylor expansions, such as gradient descent. A third category, which can involve combinations of first and second order Taylor expansions, consists of the superlinear methods such as the Levenberg-Marquardt, leap-frog, quasi-Newton and conjugate gradient methods. Finally, the fourth category includes methods based purely on second order Taylor expansions, such as the Newton methods.

Because of the non-linearity of the error function, it is in general impossible to find its global minimum analytically. To overcome this problem, algorithms that search through the weight space are used; formulated in mathematical terms, the weights are updated iteratively as

    w(n+1) = w(n) + Δw(n),          (3.2-2)

where w is the weight vector (with a starting value chosen at random), n the iteration step and Δw(n) the adjustment of the weight vector in a specific update direction. Different algorithms involve different choices for Δw(n). Algorithms such as conjugate gradient and quasi-Newton are formulated not to increase the error function as the weights change. Non-linear optimisation algorithms cannot guarantee a global minimum, and the disadvantage of these methods is therefore that the search can get stuck in a local minimum of the error function, which, in turn, leads to premature termination.

The process of learning or estimation involves optimising the error function by finding the vector of weights which minimises it. By obtaining a local quadratic approximation through a Taylor series expansion, the error function can be minimised in a convex region around a local minimum.

Let E be a function of more than one argument, E(w) = E(w_1, w_2, ..., w_n) (Hamilton, 1994:735-738), and let E: R^n → R have continuous second order derivatives. A first order Taylor series expansion of E(w) around any point w_1 is given by

    E(w) = E(w_1) + (w − w_1)' ∂E/∂w |_{w_1} + R_1(w),          (3.3-1)

where (∂E/∂w)' = ∇E' is a (1×n) vector (the transpose of the gradient) and R_1(·) is a remainder term. If E(·) has continuous second order derivatives, a second order Taylor series expansion of E(w) around w_1 is given by

    E(w) = E(w_1) + (w − w_1)' b + (1/2)(w − w_1)' H (w − w_1) + R_2(w),

where

    b = ∂E(w)/∂w |_{w_1}

is the gradient evaluated at w_1 and H is the Hessian matrix with elements

    (H)_{ij} = ∂²E(w)/∂w_i ∂w_j |_{w_1}.

Consider the second order Taylor series approximation of the multi-variable error function E(w): R^n → R around some point w_1 in the weight space, which is to be minimised. When E(w) is twice differentiable, the second order Taylor series approximation of E(w) around w_1 is

    E(w) ≈ E(w_1) + (w − w_1)' b + (1/2)(w − w_1)' H (w − w_1).          (3.3-3)

The first derivative of the approximated error function (3.3-3) with respect to w is given by

    ∂E(w)/∂w ≈ b + H(w − w_1).

The gradient is set equal to zero at a stationary point, say w = w*, and solved for w*; then

    w* = w_1 − H⁻¹ b.          (3.3-8)

This forms the basis of many of the learning algorithms because, for points w that are close to w*, the above expressions give acceptable approximations to the error function and its gradient.

The method of gradient descent, also known as steepest descent, makes use of the first order Taylor expansion of the error function E(w) as given by (3.3-1). As mentioned in Section 3.2, different algorithms involve different choices for Δw(n), the weight vector update.
An initial guess for the weight vector, denoted by w_0, is required. The weight vector is updated in such a way that with each iteration step n the movement is in the direction of the negative gradient of the error function, ∂E(w)/∂w, evaluated at w_n. This is a discretisation of a first-order differential equation in w, so that the trajectory in weight space leads to a stationary point w*. Thus, writing the gradient at step n as g_n = (∂E(w_n)/∂w_n)', the weight vector update is

    Δw_n = −η g_n,

where η is the learning rate, chosen such that

    0 < η < 2/λ_max,

with λ_max the largest eigenvalue of H(w) (Haykin, 1994:49). This leads to guaranteed convergence, although in practice η is determined empirically or by a rule of thumb. The algorithm that implements the gradient descent learning strategy for feedforward neural networks is known as backpropagation. It can be described in two steps. The first step is a forward propagation of the input pattern from the input to the output layer of the network. The second step is a backward propagation of the error vector from the output layer to the input layer of the network (Berthold et al., 1999:228-229).

Superlinear methods involve combinations of first and second order Taylor expansions; examples are the Levenberg-Marquardt, leap-frog and quasi-Newton methods. Consider (3.3-8): the Levenberg-Marquardt method replaces H⁻¹ with (H + ηI)⁻¹, where η is a step size parameter and I the identity matrix. When η is large, the ηI term dominates and the method resembles gradient descent; when η becomes small, the method becomes second order. It is important to mention that if the residuals, that is the differences between the model output and the desired output, are not small, it may be better to use general quasi-Newton methods or hybrids thereof (see later in this section). The Levenberg-Marquardt algorithm (Bishop, 1996:290-292) minimises the error function while at the same time trying to keep the step size small, to ensure that the linear approximation remains valid.

The leap-frog optimisation method of Snyman (1982:449-462) is a reliable and robust algorithm which will generally compute acceptable minima without being overly sensitive to difficult conditions. The method is based on the motion of a particle of unit mass in an N-dimensional conservative force field in which the total energy of the particle is conserved. In the case of neural networks the force field is defined over the weight space. The energy of the particle consists of two components, namely potential energy and kinetic energy, the latter arising from the fact that a force acts on the unit mass particle. The force accelerates the particle to a point where it has maximum kinetic energy and minimum potential energy. When this happens the particle stops, because the force is zero. The force acting on the particle is the negative of the gradient of the potential energy (Snyman, 1983). The leap-frog method is based on Euler relations for dynamic systems. The updating equation (cf. 3.2-2) for this method is

    Δw(n) = v(n) Δt(n),          (3.4.2-1)

where Δw(n) is the step in the weight space, Δt(n) the time step, v(n) the velocity and a(n) the acceleration of the particle at step n. Equation (3.4.2-1) relates the trajectory of the particle in the weight space to the particle velocity v and acceleration a.

Quasi-Newton methods use the local gradient of the error function, but instead of calculating the Hessian directly and then evaluating its inverse, an approximation of the inverse Hessian is built up over a number of steps.
The minimum of the quadratic error function is typically reached after W steps (where W is the number of weights and biases in the network), as is the case with the method of conjugate gradients. A sequence of matrices is generated to represent increasingly accurate approximations to the inverse Hessian H⁻¹. This is accomplished by the use of only first order derivatives of the error function (obtained by backpropagation). By starting from a positive definite matrix such as the identity matrix, the problem of non-positive definite Hessian matrices is eliminated. The update procedures are such that the approximation to the inverse Hessian H⁻¹ is guaranteed to remain positive definite, although the condition number of H⁻¹ may become inconveniently large. Two commonly used update procedures are the Broyden-Fletcher-Goldfarb-Shanno (BFGS) and the Davidon-Fletcher-Powell (DFP) procedures (Bishop, 1996:288), with BFGS the method of choice.

For the method of gradient descent, the direction vector is the negative of the gradient vector. Small steps are taken to reach the minimum of the error function, which can be time consuming. The method of conjugate gradients guarantees that a quadratic error function is minimised within W search steps, without calculating the Hessian matrix. This property is an improvement on the gradient descent method, because gradient descent may need many steps to minimise even a simple quadratic error function. For conjugate gradients the choice for Δw(n) (the weight update in (3.2-2)) is

    Δw(n) = η p(n),

where p(n) denotes the search direction vector at iteration n of the algorithm and η is the learning rate parameter. The initial direction vector, p(0), is set equal to the negative gradient vector g_0 = ∂E(w)/∂w |_{w=w_0}:

    p(0) = −g_0.

Each successive direction vector is then calculated as a linear combination of the current gradient vector and the previous direction vector:

    p(n+1) = −g_{n+1} + β(n) p(n),          (3.4.2-5)

where β(n) is a time varying parameter that is determined in terms of the gradient vectors g_n and g_{n+1}. The Fletcher-Reeves formula or the Polak-Ribiere formula can be used to determine β(n) (Bishop, 1996:280-281). This method was applied to small Boolean learning problems, and it was shown that backpropagation learning based on conjugate gradients required fewer iterations than the standard backpropagation algorithm, but is more complex to compute (Kramer and Sangiovanni-Vincentelli, 1989; Johansson et al., 1992). There are known problems with conjugate gradients when the dimensionality of the network becomes large. This is due to cumulative mis-adjustment in the search direction update formula given in (3.4.2-5).

Newton methods may take only a few steps to converge, but each step is relatively slow because the Newton methods make explicit use of the full Hessian matrix, H, which has to be calculated. This can lead to considerable computational expense. However, these methods are elegant and applicable and are treated in this text. Consider the second order Taylor expansion given in (3.3-3) and assume that E(w) is twice differentiable. Then from (3.3-8) the weight vector w* corresponding to the minimum of the error function satisfies

    w* = w − H⁻¹ b.

The vector −H⁻¹b is known as the Newton direction or step (Bishop, 1996:287). Since the quadratic approximation used to determine (3.3-3) is not exact, the weight update equation has to be applied iteratively, with H re-evaluated at each new search point. The Newton method has significant disadvantages.
For non-linear networks, the Hessian matrix takes of the order of NW² steps to compute, where W is the number of weights in the network and N the number of patterns in the training set. It also has to be inverted, requiring of the order of W³ steps. The Hessian has to be positive definite to ensure descent in the Newton direction.

Cross-validation is the process whereby a trained network's performance is measured on unseen data records. The data is divided into three different subsets, namely the training set, the cross-validation set and the test set. The cross-validation set is used during the training process in order to stop training when the network starts to overtrain. With each iteration, the root mean square error (RMSE) of both the training and the cross-validation sets is calculated and compared. The RMSE of the training set will typically decrease and, after a number of iterations, stabilise. The RMSE as a measure of performance is determined from the sum of squares error function

    E(w) = (1/2) Σ_{j=1}^{N} {y_j − f(x_j; w)}²,          (3.5-1)

where y_j is the target output, f(x_j; w) is a function of the input vector x_j and the weight vector w, and N is the number of training patterns. The MSE (mean squared error) is then

    MSE = E(w)/N,          (3.5-2)

and the RMSE is its square root.

Overtraining occurs when the network model fits the training set well but fails to generalise to other samples from the same population. The problem is more acute in situations where the number of observations is small in comparison with the number of weights. To overcome the problem of overtraining, the neural network model fitted to the training set is also (in a sense) fitted to the cross-validation set. A measure of fit such as the root mean squared error (RMSE) is calculated for both data sets. If the RMSE of the cross-validation set increases after a number of iterations while the corresponding value for the training set decreases, this is a sign of overtraining and of poor generalisation. The remaining data is used to test the network's performance after training. This will be discussed in Chapter 4 by means of an example, where the test set is used to test the performance of the model after training.

In this chapter different training or learning algorithms were given. It is emphasised that there is no ideal learning algorithm and that discretion must be used in selecting the correct method for a specific problem. In the next chapter time series analysis with neural network models and Box-Jenkins autoregressive moving average (ARMA) models is discussed.

The aim of this chapter is to illustrate the identification, estimation and evaluation of neural network models and autoregressive moving average (ARMA) models for a stationary time series. Neural network models have been used in time series prediction (LeBaron and Weigend, 1996). Autoregressive moving average (ARMA) models are well known for purposes of time series analysis (Ansuj et al., 1996). These models define linear relationships between a time series observation at time t, the dependent variable, and a set of time series observations that occurred prior to time t. Any linear model can be expressed as a simple feedforward neural network model with only linear activation functions; such a network is referred to as a regressor. The versatility of a neural network lies in the fact that it can be used to model non-linear relationships between input and output variables. This can result in very complicated models with a large number of parameters, where the parameters have no physical meaning.
In model building there is always a trade-off between models with a large number of parameters, which fit very well but have poor generalisation abilities, and models with fewer parameters, which do not fit as well but produce better forecasts because of better generalisation abilities. In Section 4.2 to Section 4.5 the ARMA(p,q)-process is defined, and maximum likelihood estimation and the evaluation of the results are discussed. It is shown that these procedures can also be applied as part of the neural network methodology.

Suppose that Y_t, t = ..., −1, 0, 1, ..., is an equally spaced, weakly stationary (covariance stationary) time series. A well-known class of linear models for the analysis of time series in the time domain is the autoregressive moving average (ARMA) class of the form

    Y_t = μ + φ_1 Y_{t-1} + ... + φ_p Y_{t-p} + ε_t − θ_1 ε_{t-1} − ... − θ_q ε_{t-q}          (4.2-1)

(Hamilton, 1994:59-61), where {ε_t} is a sequence of uncorrelated variables, also referred to as a white noise process, satisfying the conditions

    E(ε_t) = 0 and E(ε_t ε_τ) = σ²_ε if t = τ, and 0 otherwise,          (4.2-2)

and φ_1, ..., φ_p, θ_1, ..., θ_q are unknown constants or parameters. The model (4.2-1) is an ARMA(p,q) or Box-Jenkins model. The process is stationary if the roots of the equation φ(B) = 0 all lie outside the unit circle. The ARMA(p,q) model can be expressed in autoregressive (AR) form as

    Π(B) Y_t = μ + ε_t          (4.2-6)

with Π(B) = φ(B)/θ(B). Let Π(B) = 1 − π_1 B − π_2 B² − .... Equation (4.2-6) can then be written as

    Y_t = μ + π_1 Y_{t-1} + π_2 Y_{t-2} + ... + ε_t.          (4.2-7)

Any ARMA model can therefore be written as an AR or an MA model (Cryer, 1986:73-74).

Box and Jenkins (Box, Jenkins and Reinsel, 1994) developed a general framework for time series modelling. They suggested a model-building strategy in which model identification, estimation and diagnostic checking are done iteratively to select the best possible model for a given series. The ARMA model is based on linear relationships between successive observations as measured by the autocorrelation function. Plots of the sample autocorrelation function (ACF) and the sample partial autocorrelation function (PACF) are compared with the corresponding population functions to identify the ARMA model by selecting the values of p and q, the autoregressive and moving average orders of the model (Cryer, 1986:111). For such a process cov(Y_{t-k}, ε_t) = 0 if k > 0 and non-zero if k ≤ 0, and the autocorrelation at lag k is ρ_k = γ_k/γ_0. The partial autocorrelation at lag k is the correlation coefficient in the bivariate distribution of Y_t and Y_{t-k} conditional on Y_{t-1}, Y_{t-2}, ..., Y_{t-k+1} (Cryer, 1986:106). The ACF of an AR(p) process shows an exponential decay or a damped sine wave, while its PACF is zero for lags greater than p. A pure moving average model of order q has non-zero autocorrelations only at the first q lags, and its partial autocorrelations are not zero after q lags.

Two time series are used to illustrate the concepts of identification, estimation and evaluation of a model. Time Series 1 consists of 150 terms generated from an AR(2) model with φ_1 = 1.3, φ_2 = −0.5, zero mean, and ε_t ~ N(0, σ²_ε). Time Series 2 consists of 336 data points of electricity consumption, measured hourly, for two weeks. As mentioned in Section 4.3, the sample autocorrelation function and the sample partial autocorrelation function can be used to identify a possible model to describe a time series. A time plot can be used to identify a trend, seasonal patterns, outliers, discontinuities, etc. A time plot of a hundred observations of Series 1 is given in Figure 4.1.

Figure 4.1: Time plot of the first hundred observations of Series 1 (y_t).

Figure 4.2 is the sample autocorrelation function.
It has the appearance of a damped sine wave, which is typical of an AR model, while the sample partial autocorrelation function illustrated in Figure 4.3 gives a clear indication of an AR(2) model. The first two partial autocorrelations differ significantly from zero, while the rest do not differ from zero at the 5% significance level.

Figure 4.2: Sample autocorrelation function for an AR(2) with φ_1 = 1.3 and φ_2 = −0.5 (SAS System for Windows v6.12)

Figure 4.3: Sample partial autocorrelation function for an AR(2) with φ_1 = 1.3 and φ_2 = −0.5 (SAS System for Windows v6.12)

Figure 4.4: Time plot of the hourly electricity consumption series (Series 2).

There are explicit daily, 12-hourly and weekly seasonal patterns in the electricity consumption, due to consumer behaviour. The autocorrelation function depicted in Figure 4.5 shows evidence of a sine wave, which may suggest an AR(2) component. The sample autocorrelations from lags 20 to 24 are quite high due to the 24-hour seasonal pattern.

Figure 4.5: Sample autocorrelation function of the electricity consumption series.

The partial autocorrelation function's first two lags are significant and the remaining lags are significantly reduced, which indicates an AR(2) process. This is illustrated in Figure 4.6.
Figure 4.6: The partial autocorrelation function for the sample of electricity consumption.

A spectral analysis of the data can be used to identify the most prominent cycles in the data. A periodogram of the data is given in Figure 4.7.

Figure 4.7: Periodogram of the electricity consumption series.

The graph shows peaks at the 12-hour, 24-hour (the most prominent) and 48-hour periods. A large percentage of the total variation in the data can therefore be ascribed to the identified cycles, and this information can be used to define input variables for a neural network model. Through the method of spectral analysis the series of observations y_t can be described as a weighted sum of periodic functions of the form cos(ωt), where ω denotes a particular frequency (Hamilton, 1994:152). For Series 2, with dominant periods of 24 hours and 12 hours as seen in Figure 4.7, the frequencies can be calculated as follows:

    period = 24 = 2π/ω_1, therefore ω_1 = 0.2618;
    period = 12 = 2π/ω_2, therefore ω_2 = 0.5236.

Different numbers of nodes in the hidden layer were tried, and it was decided to use no more than two nodes in the hidden layer. The added complexity introduced by adding nodes to the hidden layer did not result in a significant improvement of the model fit.

Let Y_1, Y_2, ..., Y_N denote a stationary time series, and suppose that the observation at time t can be described by the model

    Y_t = f(Y_{t-1}, ..., Y_{t-p}; w) + ε_t,

where w is the parameter vector or weight vector and the error terms ..., ε_{t-1}, ε_t, ε_{t+1}, ... are assumed to be white noise. If f is a linear function, the above model is an AR(p) model.

The parameters of an ARMA model can be estimated by the method of maximum likelihood. There are various other methods, for example the conditional least squares method (CLS) and the unconditional least squares method (ULS). In the case of an AR model, if the error terms are assumed to be independent normal, it can be shown that the maximum likelihood method is equivalent to the least squares method. For purposes of this dissertation only the method of maximum likelihood will be discussed.

Let γ denote the vector of parameters of an AR(p) model and let (Y_1, Y_2, ..., Y_N) be a stationary time series. The value of γ that maximises the joint probability or likelihood function is the maximum likelihood estimate. The likelihood function for an ARMA(p,q)-process conditions on both Y's and ε's. If the initial values Y_0 = (Y_0, Y_{-1}, ..., Y_{-p+1})' and ε_0 = (ε_0, ε_{-1}, ..., ε_{-q+1})' are given, the sequence {ε_1, ε_2, ..., ε_N} can be calculated from {Y_1, Y_2, ..., Y_N} by iterating on

    ε_t = Y_t − μ − φ_1 Y_{t-1} − ... − φ_p Y_{t-p} + θ_1 ε_{t-1} + ... + θ_q ε_{t-q},

with the option of setting the initial Y's and ε's equal to their expected values (Hamilton, 1994:132). The error function used for maximum likelihood estimation can then be written as the sum of squares of these residuals,

    E = (1/2) Σ_{t=1}^{N} ε_t².

In neural network terminology, estimation of the parameters of the model takes place during the training period, during which the network learns, with generalisation, through certain learning algorithms. The weights connected to each input value (the parameters) are constantly updated until the error function E(w), which is often the sum of squared errors (see Chapter 2), reaches its minimum. An AR(2) model was fitted to Series 1 by the method of maximum likelihood, using the SAS System for Windows v6.12. The estimates, approximate standard errors and corresponding t-ratios are given in Table 4.1.
Table 4.1: Parameter estimates for the AR(2) model fitted to Series 1 (SAS System for Windows v6.12)

Parameter | Estimate | Approx. Std Error | T-Ratio | Lag
MU | -0.672 | 0.89122 | -0.75 | 0
AR1,1 | 1.26298 | 0.09195 | 13.74 | 1
AR1,2 | -0.42242 | 0.0937 | -4.51 | 2

For large sample sizes the t-ratio may be used to test the significance of the estimates with respect to the null hypothesis that the corresponding parameter is zero. The t-value for the mean is small, so the null hypothesis is not rejected. This is expected, since the time series was generated with a zero mean. The parameter is redundant and should be excluded from the model. The t-values for the other two estimates are large, indicating that φ_1 and φ_2 should not be excluded from the model. In Figure 4.8 the time plots of the actual and the fitted series are given.

Figure 4.8: Time plots of the actual series (Target) and the fitted AR(2) series (Model).

A neural network is fitted to the electricity consumption time series. The program package used for this purpose is Basic ModelGen™ 1.0 from Crusader Systems. Estimation of a neural network model takes place during training by means of the training algorithm. For the training set 70% of the data is used, 20% is used for the cross-validation set, and the rest is used for testing the model. The model (4.3-6) has six input variables, as mentioned in Example 4.1, as well as an intercept term included in both the input layer and the hidden layer. The neural network model is non-linear with two neurons in the hidden layer. The sigmoidal function (2.5-7) is used as activation function. The sum of squared error function (3.5-1), which is a function of the weights in the network, is minimised. This is done efficiently by using backpropagation.

As mentioned in Chapter 3, the cross-validation set is also used during training. The root mean square error (RMSE) of both the training set and the cross-validation set decreases with each iteration. The RMSE of the training set will usually decrease more than the RMSE of the cross-validation set, because the network is trained on the data of the training set. If too many parameters are included in the model relative to the number of data points in the training set, over-training can occur. An increase in the RMSE of the cross-validation set gives an indication that the network is starting to over-train, and training then stops immediately. In Figure 4.9 a plot of the RMSE of the training set and the cross-validation set is illustrated.

Figure 4.9: RMSE of the training set and the cross-validation set during training.

The output of the model is a function of the weights, but the weights themselves have no interpretive value. When the vector of weights which minimises the error function has been determined through a learning algorithm, training stops. The output of the model containing the six input variables (and one hidden layer with two neurons), as indicated in Section 4.3, fitted to Series 2, is given in Figure 4.10.

Figure 4.10: Electricity consumption: target series and model output plotted against time.

Although the data set is quite small, the model fits the data reasonably well, considering Figure 4.10. A rule of thumb that is often used and accepted by practitioners is that the number of observations should be at least a factor of ten times the number of weights (N = 10W).
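The fitting procedure just described can be outlined in a few lines of code. The Python fragment below is only an illustrative sketch of the idea, not the implementation used in the study (which relied on Basic ModelGen and the Fortran programme in the appendix to Chapter 6): it splits the data 70/20/10 into training, cross-validation and test sets, trains a network with one hidden layer of two logistic neurons by gradient descent with backpropagation, and stops as soon as the cross-validation RMSE no longer improves. The input matrix X and target vector y are placeholders standing in for the six input variables of model (4.3-6) and the hourly consumption.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def forward(X, W1, b1, w2, b2):
        Z = sigmoid(X @ W1.T + b1)        # hidden activations, shape (n, M)
        return Z, Z @ w2 + b2             # network output, shape (n,)

    def rmse(y, yhat):
        return np.sqrt(np.mean((y - yhat) ** 2))

    def fit_mlp(X, y, M=2, lr=0.05, max_iter=5000):
        n, p = X.shape
        n_tr, n_cv = int(0.7 * n), int(0.2 * n)          # 70/20/10 split
        Xtr, ytr = X[:n_tr], y[:n_tr]
        Xcv, ycv = X[n_tr:n_tr + n_cv], y[n_tr:n_tr + n_cv]
        rng = np.random.default_rng(1)
        W1 = rng.normal(scale=0.1, size=(M, p)); b1 = np.zeros(M)
        w2 = rng.normal(scale=0.1, size=M);      b2 = 0.0
        best, best_cv = None, np.inf
        for it in range(max_iter):
            Z, yhat = forward(Xtr, W1, b1, w2, b2)
            err = yhat - ytr                              # derivative of 0.5*sum(err^2)
            # backpropagation of the sum-of-squares error (3.5-1)
            grad_w2 = Z.T @ err
            grad_b2 = err.sum()
            delta = np.outer(err, w2) * Z * (1.0 - Z)
            grad_W1 = delta.T @ Xtr
            grad_b1 = delta.sum(axis=0)
            W1 -= lr * grad_W1 / n_tr; b1 -= lr * grad_b1 / n_tr
            w2 -= lr * grad_w2 / n_tr; b2 -= lr * grad_b2 / n_tr
            cv = rmse(ycv, forward(Xcv, W1, b1, w2, b2)[1])
            if cv < best_cv:
                best_cv, best = cv, (W1.copy(), b1.copy(), w2.copy(), b2)
            elif it > 50:                                 # cross-validation RMSE rising: stop
                break
        return best, best_cv

    # placeholder data: 336 hourly records with six (hypothetical) inputs
    rng = np.random.default_rng(0)
    X = rng.normal(size=(336, 6))
    y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=336)
    weights, cv_rmse = fit_mlp(X, y)
    print(round(cv_rmse, 4))

With six inputs, two hidden neurons and an intercept in both layers, the network has (6+1)x2 + (2+1) = 17 weights, so the rule of thumb N = 10W asks for roughly 170 observations; the 235 training records (70% of 336) are just above this.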
In this section the estimation of linear and neural network models is discussed. The evaluation of the estimated model is discussed in the next section.

After a model has been identified and the parameters estimated, the model must be evaluated. Evaluation procedures are the same for both the ARIMA and neural network models. If the model fits the data well, then the residuals should almost have the properties of uncorrelated, identically distributed random variables with zero mean and fixed standard deviation (Cryer, 1986:147). The residuals of a fitted model are useful indicators of any inadequacies in the specification of the model or violations of underlying assumptions. Examination of various plots of the residuals is an indispensable step in the evaluation process of any model (Box and Jenkins, 1994:289). If a plot of the residuals exhibits a trend over time, it is an indication of a trend in the data that is not adequately modelled. A histogram of standardised residuals should correspond with a symmetrical normal curve if the model fits the data well. Any outliers imply significant differences between the fitted value Ŷ_t and the corresponding observed value Y_t, and should be investigated since the model fits the data poorly at those points.

The sample autocorrelation function of the residuals, r_k, can be observed to check for the independence of the residuals in the model. Usually the sample autocorrelations are approximately uncorrelated and normally distributed with mean zero and variance 1/n. A χ² (Chi-squared) test is used to test whether the residuals are correlated (Cryer, 1986:153).

Different programme packages evaluate neural network models differently. A comprehensive residual analysis is offered by Basic ModelGen™ 1.0 from Crusader Systems. The mean square error (MSE) in (3.5-2) and the root mean square error (RMSE) are determined for every model, and the model with the smallest of these values should be selected. As mentioned, the residuals play an important role in the evaluation process. A fast Fourier transform of the model residuals is calculated. If the spectrum of the residuals is constant, white noise is implied and it can be concluded that the model fits the data relatively well. Other methods of evaluation are a sensitivity analysis, performed to establish the most influential input variable of the model, and output analysis, where the relationship between an output and a specific input is shown. The t-test statistic is used to determine whether a model's output differs from the desired output. Together with this test the mean and the standard deviation of the model's output are calculated. The quality of the model can be observed by viewing a graphical representation of the model's output versus the desired output. Another useful measure of evaluation is the correlation (R) between the target output and the actual model output. In the following example the evaluation of the models fitted to the AR(2) and the electricity demand time series is illustrated.

The residuals for the estimated AR(2) model in Example 4.2 were analysed. A time plot and other residual plots revealed no deviations from the model and assumptions. A χ² (Chi-squared) test for independence of the residuals was performed and the null hypothesis was not rejected. AR(1), ARMA(2,1) and AR(3) models were also fitted, but the AR(2) model fit was superior as measured by Akaike's information criterion (Cryer, 1986:122) and the significance of the estimated parameters.
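The residual checks described above can be sketched as follows. This is an illustration only: resid is assumed to hold the one-step residuals of a fitted model, and the chi-squared portmanteau test is computed here in its Ljung-Box form, which is an assumption since the text only refers to a chi-squared test.

    import numpy as np
    from statsmodels.stats.diagnostic import acorr_ljungbox

    def residual_checks(resid, lags=20):
        # Sample autocorrelations r_k of the residuals; under independence
        # they are approximately N(0, 1/n), so +/- 2/sqrt(n) are rough limits.
        resid = np.asarray(resid, dtype=float)
        n = len(resid)
        r = resid - resid.mean()
        denom = np.sum(r * r)
        acf = np.array([np.sum(r[k:] * r[:n - k]) for k in range(1, lags + 1)]) / denom
        flagged = np.where(np.abs(acf) > 2.0 / np.sqrt(n))[0] + 1
        print("lags outside +/- 2/sqrt(n):", flagged)
        # Chi-squared (portmanteau) test for correlation in the residuals
        print(acorr_ljungbox(resid, lags=[lags]))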
This is an expected result, since the time series is generated by an AR(2)-process.

The evaluation procedure points out some inconsistencies in respect of the neural network model fitted to the electricity data. This can be expected, because the data is seasonal and the model selected to fit the series perhaps does not include all the variables necessary to fit the data. This problem is examined in Chapter 6. The observed spectrum of the residuals of the neural network model has some peaks, which indicate that there is some seasonal information present in the residuals. The sensitivity analysis, illustrated in Figure 4.11, shows that the variable Y_{t-1} has the greatest influence on the model output, which is quite reasonable because it represents the electricity consumption during the previous hour.

Figure 4.11: Sensitivity analysis of the neural network model.

The t-test is used to determine how significantly a model's output differs from the desired or target output. The time plot of the model output and the desired output for the test set are given in Figure 4.12.

Figure 4.12: Model output versus target output for the test set.

The correlation between the target output and the model output is 0.897. This value differs significantly from 0. Table 4.2 provides this information together with the mean and standard deviation of the model and the desired outputs for the test set, as well as a 95% confidence interval for R.

Table 4.2: The correlation coefficient for the test set as determined by the neural network model

                Mean       Standard Deviation   R         R²        95% Confidence Limits for R
Backprop net    19861.1    2065.443             0.89722   0.80501   [0.82 - 0.940]
Desired         19461.3    2144.093

Although the evaluation of the model for the electricity consumption may indicate some inadequacies, one can always improve the model by trial and error. In Chapter 6 electricity load forecasting will be done on one year's data.

In this chapter the process of identification, estimation and evaluation was discussed by means of examples. When the process of evaluation has been completed, and the model gives a satisfactory description of the data, the model can be used for forecasting or classification. In Chapter 5 forecasts and prediction intervals are derived and calculated for neural network models.

In this chapter a neural network model is used to forecast a time series. It is shown that bootstrap methods can be used to calculate standard errors and prediction limits for the forecasts. In Section 5.2 one- and two-step minimum mean square error (MSE) forecasts are given for both a linear autoregressive model and a non-linear neural network model. In the case of the non-linear model, a different approach has to be considered. A few alternative theoretical approaches to obtain forecasts for non-linear time series models are discussed in Section 5.3, and the bootstrap technique is selected for the purpose of obtaining prediction intervals. Section 5.4 shows how bootstrap methodology can be used to construct prediction intervals. One- and multi-step predictions, together with their standard errors and 95% prediction intervals, are calculated for the simulated AR(2) series.

Consider the ARMA model (4.2-1) for a stationary time series {Y_t} with t = 1, 2, ..., N. The linear AR(1) model is given by

Y_t = φ_1 Y_{t-1} + ε_t,

where {ε_t} is a white noise process that satisfies the conditions (4.2-2). Suppose that the white noise variables are identically distributed.
The minimum mean square error (MSE) linear predictor for one step ahead is the conditional expectation of Y_{N+1}, given Y_1, Y_2, ..., Y_N:

Ŷ_N(1) = E(Y_{N+1} | Y_1, Y_2, ..., Y_N) = E(φ_1 Y_N + ε_{N+1} | Y_1, Y_2, ..., Y_N) = φ_1 Y_N.

By using (5.2-1) for the second-step prediction, Y_{N+2}, the MSE linear predictor can be written in terms of the observed data Y_N:

Ŷ_N(2) = E(Y_{N+2} | Y_1, Y_2, ..., Y_N) = φ_1 E(Y_{N+1} | Y_1, Y_2, ..., Y_N) + 0 = φ_1 Ŷ_N(1) = φ_1² Y_N.

For the non-linear model the two-step MSE predictor becomes

Ŷ_N(2) = E[f(Y_{N+1}; a) + ε_{N+2} | Y_1, Y_2, ..., Y_N]
       = E{f[f(Y_N; a) + ε_{N+1}; a] | Y_1, Y_2, ..., Y_N}
       = E{f[Ŷ_N(1) + ε_{N+1}; a] | Y_1, Y_2, ..., Y_N}.

Alternative approaches have been proposed to calculate these forecasts; such methods are discussed in the following paragraphs. Five techniques for forecasting have been discussed by Brown and Mariano (1989) and summarised by Lin and Granger (1994). Some of the methods are naive in requiring assumptions regarding the distribution of the white noise terms. The methods vary in respect of ease of implementation and computational intensity.

The naive technique is widely used and easy to implement, but it is not satisfactory because the forecast is biased. If one considers the non-linear model for time series forecasting given in (5.2-5), the naive technique states that the two-step forecast of Y_{N+2} is obtained by substituting Ŷ_N(1) for Y_{N+1} in the model equation, that is Ŷ_N(2) = f(Ŷ_N(1); â). The estimator â is obtained by minimising the sum of squared error terms or by maximising the likelihood function of the observations. Usually the expected value of the function is not equal to the function of the expected value, and therefore the forecast will be biased. Even if the functional form f(·) is known, the bias does not go to zero as the sample size N becomes large.

The closed form technique gives the two-step forecast as

Ŷ_N(2) = ∫ f(Ŷ_N(1) + ε; â) dF(ε) = ∫ f[f(Y_N; â) + ε; â] dF(ε),

where F(·) is the distribution function of ε and f(·) a non-linear function of ε. It is necessary to know the distribution of ε, which is in practice generally unknown. Brown and Mariano (1989) assume a N(0,1) distribution, but this has to be verified. Numerical integration can be used to approximate (5.3.2-1). Apart from the fact that this forecast can be difficult to calculate, it may be incorrect because of a wrong assumption on the distribution of ε. The dimensionality of the error distribution will increase with the lead time of the forecast, and will consequently result in an increase in computer time.

The Monte Carlo technique has, for large values of N, the same disadvantage as the closed form technique (in that the distribution of the error terms is needed), but it is easier to implement. The two-step forecast is defined as

Ŷ_N(2) = (1/m) Σ_{j=1}^{m} f(Ŷ_N(1) + e_j; â),

where the e_j's are independent and identically distributed and drawn from the assumed error distribution. This is a particular form of numerical integration based on simulations. The mathematical expectation (5.2-7) is approximated by the arithmetic average from a sample of possible realisations of Y_{N+2} given Y_1, Y_2, ..., Y_N.

The bootstrap technique is based on the estimated residuals. The two-step forecast is given by the average of f(Ŷ_N(1) + ε̂_t; â) over the ε̂_t, where the ε̂_t are the realised one-step forecast errors arising up to time N. No assumption on the distribution of the error terms is therefore necessary. The forecast improves as time advances and is easily formed (Lin and Granger, 1994:2). This technique is used for purposes of this dissertation and is discussed further in Section 5.4.

The direct method involves the direct modelling of the relationship between Y_{N+2} and Y_N, by using the same model as before but estimating a new set of coefficients directly.
The previous techniques involved the use of the same function, with the same set of coefficients, f(f(·)), for the two-step forecasts, which decreases their quality. The function f(·) is rarely known in practice and has to be approximated from a specific search. A separate function g(·), relating Y_{N+2} directly to Y_N, could be considered. For a non-parametric procedure such as neural networks this would be a sensible technique.

Lin and Granger (1994) investigated all these techniques in a simulation study for the two-step case. The results lead to a mild recommendation of the bootstrap predictor because of its small bias and not more than 5% inefficiency in mean squared error, when compared with the parametric model using the correct specifications. Further study on broader classes of time series models is recommended.

The bootstrap technique can be applied to obtain interval forecasts for an autoregressive time series. Masarotto (1990) finds the bootstrap technique useful for three reasons, namely: it is distribution-free, it takes into account that the parameters and order of the model are unknown, and improved computer technology makes the difficult calculations involved with the bootstrap technique easier. Boraine (2000) showed that the bootstrap results for linear models can be extended to non-linear time series models. A discussion is given in this section. In Section 5.4.1 the prediction limits for linear autoregressive models are given. These are the standard results given in any time series textbook. The multi-step forecasts for the neural network models are introduced in Section 5.4.2 and the prediction limits for the multi-step forecasts are given in Section 5.4.3.

For a linear autoregressive model the h-step forecast is calculated recursively as

Ŷ_N(h) = φ_1 Ŷ_N(h-1) + φ_2 Ŷ_N(h-2) + ... + φ_p Ŷ_N(h-p),

where Ŷ_N(j) = Y_{N+j} if j ≤ 0. The forecasts are calculated recursively and converge to the mean of the time series for large values of h. The distribution of Y_{N+h} | Y_1, Y_2, ..., Y_N is normal with mean Ŷ_N(h) and variance

σ_ε² (1 + Σ_{j=1}^{h-1} ψ_j²),

where the ψ_j's are the coefficients of the moving average representation of the process. The approximate 1-α probability limits for Y_{N+h} | Y_1, Y_2, ..., Y_N are given by

Ŷ_N(h) ± z_{1-α/2} σ_ε (1 + Σ_{j=1}^{h-1} ψ_j²)^{1/2},

where z_{1-α/2} is the 100(1-α/2) percentile of the standard normal distribution (Box, Jenkins and Reinsel, 1994).

Let Y_1, Y_2, ..., Y_N be a stationary time series described by a non-linear neural network model

Y_t = f(Y_{t-1}, Y_{t-2}, ..., Y_{t-p}; w) + ε_t,    (5.4.2-1)

where f(·) is a non-linear function defined by the neural network architecture, p is the number of input variables in the network, and w the weight vector. The ε_t's are uncorrelated, identically distributed random variables with mean zero and variance σ_ε². The observation at time N+1 can be written as

Y_{N+1} = f(Y_N, Y_{N-1}, ..., Y_{N+1-p}; w) + ε_{N+1}.

For the estimation of the minimum MSE forecast, bootstrap methodology can be used, and the bootstrap forecast for Y_{N+2} is therefore proposed as

Ŷ_N(2) = (1/m) Σ_{j=1}^{m} f(Y*_{N+1}^{(j)}, Y_N, ..., Y_{N+2-p}; ŵ),

where Y*_{N+1}^{(j)} = f(Y_N, ..., Y_{N+1-p}; ŵ) + ε̂*_{N+1}^{(j)}, and ε̂*_{N+1}^{(j)} is an observation drawn with replacement from ε̂_{p+1}, ..., ε̂_N. In general, the h-step bootstrap forecast is

Ŷ_N(h) = (1/m) Σ_{j=1}^{m} f(Y*_{N+h-1}^{(j)}, Y*_{N+h-2}^{(j)}, ..., Y*_{N+h-p}^{(j)}; ŵ),

with

Y*_{N+h}^{(j)} = f(Y*_{N+h-1}^{(j)}, Y*_{N+h-2}^{(j)}, ..., Y*_{N+h-p}^{(j)}; ŵ) + ε̂*_{N+h}^{(j)}

and Y*_{N+h-p}^{(j)} = Y_{N+h-p} if h-p ≤ 0.

The bootstrap procedure can be summarised in a few steps. First fit the model (5.4.2-1) to the time series Y_1, Y_2, ..., Y_N. Then calculate estimates of the residual terms using (5.4.2-8). Thirdly, calculate Y*_{N+1}, ..., Y*_{N+h} conditional on Y_1, Y_2, ..., Y_N, where Y*_{N+h-p} = Y_{N+h-p} if h-p ≤ 0 and ε̂*_{N+h} is an observation drawn randomly with replacement from ε̂_{p+1}, ..., ε̂_N. Repeat this m times, where usually m = 100. Finally, calculate the one- to h-step forecasts for the time series generated in the previous step, using (5.4.2-3), (5.4.2-6) and (5.4.2-9).
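The multi-step bootstrap forecast summarised above can be sketched as follows. This is only an illustration of the procedure: f_hat is assumed to be the fitted one-step prediction function of the network (taking the p most recent values, latest first) and resid the estimated in-sample residuals.

    import numpy as np

    def bootstrap_forecast(f_hat, y, resid, p, h, m=100, rng=None):
        # Average of f over m simulated continuations of the series, where
        # each continuation is built by resampling the estimated residuals.
        rng = rng or np.random.default_rng()
        resid = np.asarray(resid, dtype=float)
        fitted = np.zeros((m, h))
        for j in range(m):
            hist = list(np.asarray(y, dtype=float)[-p:])
            for step in range(h):
                lags = hist[-p:][::-1]                   # most recent value first
                f_val = f_hat(lags)
                fitted[j, step] = f_val
                hist.append(f_val + rng.choice(resid))   # simulated Y* value
        return fitted.mean(axis=0)                       # Y_N(1), ..., Y_N(h)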
In the next section it is shown that bootstrap methodology can also be used for the construction of prediction intervals for Y_{N+h}. The ideas used in this section are based on the method proposed by Masarotto (1990) for linear time series models.

The prediction error is

e_N(h) = Y_{N+h} - Ŷ_N(h).    (5.4.3-1)

By using bootstrap methodology the distribution of the standardised prediction error r_N(h) = e_N(h)/σ̂, where σ̂ is the standard error of Ŷ_N(h), can be approximated using a Monte Carlo algorithm. The limits r_L and r_U are the B(α/2)-th and B(1-α/2)-th order statistics of r_N(h)*_1, ..., r_N(h)*_B, where B is the number of bootstrap replications. Note that r_N(h) is a function of σ̂, so that a σ̂* is required for each e*_N(h). For a good approximation of the distribution of r_N(h), at least 1000 bootstrap replications are required.

First calculate the bootstrap time series Y*_1, ..., Y*_{N+h}. To do this, a neural network model of the form (5.4.2-1) is fitted to the observed time series Y_1, Y_2, ..., Y_N. If ŵ is the estimated weight vector, the estimated residuals as in (5.4.2-8) are

ε̂_t = Y_t - f(Y_{t-1}, ..., Y_{t-p}; ŵ),

and the bootstrap series is generated by

Y*_t = f(Y*_{t-1}, Y*_{t-2}, ..., Y*_{t-p}; ŵ) + ε̂*_t,    t = p+1, p+2, ..., N,

where ε̂*_t is drawn randomly with replacement from the estimated residuals. It is assumed that the initial bootstrap time series values Y*_1 = Y*_2 = ... = Y*_p = Ȳ.

σ̂*, the standard error of Ŷ*_N(h), is calculated by using a bootstrap procedure within the bootstrap procedure, which estimates the distribution of the standardised residual. The neural network model of the form (5.4.2-1) is fitted to the bootstrap time series Y*_1, ..., Y*_N. The residuals of this model are determined by using (5.4.2-8). By using these model residuals, a second-level bootstrap series is generated in the same way as above. The forecasts Ŷ*_N(h) are calculated by taking into account the first N observations Y*_1, Y*_2, ..., Y*_N. The prediction error defined in (5.4.3-1) is e*_N(h) = Y*_{N+h} - Ŷ*_N(h). This is repeated m times. The standard deviation of e*_N(h)(1), ..., e*_N(h)(m) is used as an estimate of σ̂*. By repeating the generation of the bootstrap time series Y*_1, ..., Y*_{N+h}, as well as the calculation of Ŷ*_N(h), σ̂* and r_N(h)*, B times, where B must be at least 1000, the interval (5.4.3-4) can be constructed.

A practical application of forecasting a time series from a linear AR(2) model is given in the following example. A Fortran program (see Appendix: Chapter 6) was developed to train the network and to implement the bootstrap for forecasting and the calculation of the prediction limits. The program uses a Gauss-Newton algorithm to estimate the parameters of the neural network. As an example, a generated AR(2) time series (see Examples 4.1 to 4.3) consisting of 200 values is used to illustrate the calculation of the forecasts and the corresponding 95% prediction limits. The generated process is linear, and therefore it cannot be expected that the neural network model would produce better prediction results than a linear model. A feedforward neural network with two input nodes for two past values of the time series and one hidden layer with a sigmoidal activation function was trained. Bootstrap methodology was used to calculate the prediction limits. Figure 5.1 and Figure 5.2 illustrate the prediction results. The results for the neural network, where bootstrap methods were used, and the linear model correspond more or less.

Figure 5.1: Autoregressive time series: predictions and 95% prediction limits (neural network).

Figure 5.2: Autoregressive time series: predictions and 95% prediction limits (linear regression model).
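The prediction limits shown in these figures follow from order statistics of the standardised bootstrap prediction errors. A simplified sketch is given below; the exact indexing of the order statistics and the final form of the interval are assumptions, since equation (5.4.3-4) itself is not reproduced in the text.

    import numpy as np

    def bootstrap_prediction_limits(forecast, se, r_star, alpha=0.05):
        # r_star: the B standardised bootstrap prediction errors r_N(h)*
        # se:     the bootstrap standard error of the h-step forecast
        r_star = np.sort(np.asarray(r_star, dtype=float))
        B = len(r_star)
        r_L = r_star[int(np.floor(B * alpha / 2.0))]
        r_U = r_star[int(np.ceil(B * (1.0 - alpha / 2.0))) - 1]
        return forecast + r_L * se, forecast + r_U * se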
It is shown that bootstrap methodology can be used to calculate one- and multi-step predictions of neural networks, with their standard errors and prediction limits. A minimum of 1000 time series were resampled from the observed time series, and for each resampled time series a network was trained to do one-step predictions. These highly computer-intensive calculations were no problem, considering the ever-increasing speed of computers. It can be seen in the two examples given that neural networks as well as linear regression models were used to model the time series. Predictions and prediction limits were calculated, and the results compare reasonably well. Further research is required to establish whether the bootstrap results can be improved. For example, an increase in the number of bootstrap replications may lead to an improvement in the results.

The objective of this chapter is to forecast a time series by using neural networks. Some scientists devote careers to the building of models for electricity load forecasting. The purpose of this chapter is not to improve on the work done on model estimation, but only to introduce the neural network as a forecasting tool. Several studies have been done on electricity load forecasting. For example, the modelling of one-hour-ahead hourly electricity demand prediction has been done by Connor (1996), while Hwang and Ding (1997) focused on the construction of prediction intervals for electricity load forecasting. Lee et al (1992) used neural networks for short term load forecasting by dividing the data into different classes of daily and weekly load variations, while Peng et al (1993) proposed a new strategy in selecting training cases for a neural network model. The data used here is ESKOM data, taken hourly, measuring the average electricity consumption for the Republic of South Africa.

Section 6.2 deals with the analysis of the data. The different strategies and techniques used to prepare the data for estimation of the models are proposed in Section 6.3. The model is investigated in Section 6.4 by means of evaluation. Forecasting one to twelve hours ahead by using the selected model is done in Section 6.5. Unfortunately the programme package Basic ModelGen does not include the bootstrap method as a forecasting technique. Two methods, namely the naive technique discussed in paragraph 5.3.1 and the direct method discussed in paragraph 5.3.5, are used and compared in Section 6.5 by the use of the programme package Basic ModelGen. The bootstrap method is then used in Section 6.5.3 on the above-mentioned electricity data.

As mentioned earlier, ESKOM data, taken hourly, measuring electricity consumption was analysed. A one-year period was considered, starting from the 25th of October 1997 to the 25th of October 1998, a total of N = 8760 observations. The data therefore contains seasonal components, as seen in Figure 6.1, where a four-week period is illustrated.

Figure 6.1: Electricity consumption over a four-week period.

Seasonal components cause difficulties in the estimation of the model parameters. The sample autocorrelation function (ACF) and sample partial autocorrelation function (PACF) point out a definite AR(2) component.
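The ACF and PACF inspection mentioned here can be reproduced with standard routines. The following is only an illustrative sketch (the choice of statsmodels is an assumption, and y stands for the hourly consumption series): a PACF that cuts off after lag 2 while the ACF decays slowly points to an AR(2) component.

    import numpy as np
    from statsmodels.tsa.stattools import acf, pacf

    def identify_ar_component(y, nlags=48):
        # Sample ACF and PACF of the consumption series, up to nlags lags.
        y = np.asarray(y, dtype=float)
        return acf(y, nlags=nlags, fft=True), pacf(y, nlags=nlags)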
Therefore the consumption in periods t-1 and t-2 will be considered as two of the inputs in the neural network model. From the electricity data it is clear that there is more than one seasonal component. Several seasonal components can influence the behaviour of a time series. A spectral analysis can be performed to attribute the total variation in the electricity consumption to cycles of different frequencies. The results of the spectral analysis indicated the 24-hour and the 12-hour periods respectively as the more important seasonal components.

Several other factors were considered, for example the effect of the day of the week and of public holidays, and the average temperature for each day. Galpin (1997) investigated the inclusion of rainfall, temperature and humidity as input variables in a regression model. Although she found that these climate data decrease the forecast error, it was found that public holidays have an overriding impact on the forecasting process. These variables were all considered as explanatory variables. Networks with different combinations of input variables and a different number of hidden nodes were trained. The model that was finally selected to describe the electricity data has as explanatory variables periodic components that were determined by a spectral analysis (see Example 4.1), and the electricity consumption of the two previous hours. If Y_t denotes the electricity consumption at time t, the proposed model is

Y_t = f(Y_{t-1}, Y_{t-2}, cos[ω_1(t-1)], sin[ω_1(t-1)], cos[ω_2(t-1)], sin[ω_2(t-1)]; w) + ε_t,    (6.2-1)

with two nodes in the hidden layer. The ε_t's are assumed to be generated by a white noise process, and ω_1 = 0.2618 and ω_2 = 0.5236.

The ideal network is one that is not redundant; in other words, the specific network with the fewest parameters and which represents the data the best is preferred (Hwang and Ding, 1997:748-757). The data is divided into a training set (the first 70% of the data), a cross-validation set (10% of the data, excluding the training set), and the rest is used as a test set (the last 20%). By calculating the MSE for each model, the one with the smallest MSE value was selected. This was done for both the feedforward neural network, which uses backpropagation, and the Elman recurrent neural network. The feedforward neural network is a function of a fixed number of previous values of a time series, while the Elman recurrent neural network implicitly depends on all previous observations of a time series. The feedforward models considered in this dissertation performed better, as can be seen from the MSE values in Figure 6.2.

Figure 6.2: MSE values of the feedforward (FF) and Elman models considered.

The labels on the horizontal axis denote the number of input nodes and the number of hidden nodes of the different models considered. For example, 3.2 indicates a neural network with three inputs and two hidden nodes. The number of parameters, W, in the fully connected model is calculated as

W = (n_i + 1) n_h + (n_h + 1),

where n_i is the number of input nodes and n_h is the number of nodes in the hidden layer. (The term of one that is added provides for a bias or intercept in both the input and the hidden layer.) Therefore the neural network model (6.3-1) with six input nodes and two hidden nodes has (6+1)×2 + (2+1) = 17 parameters or weights. This is acceptable because the total number of patterns in the training set should ideally not be less than 32W (Bishop, 1996:410), but 10W is also an accepted norm in some cases.
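The input variables of model (6.2-1) can be assembled directly from the consumption series. The following is a sketch only; in the dissertation the preparation of the inputs was done within the Basic ModelGen package.

    import numpy as np

    def build_inputs(y, omega1=0.2618, omega2=0.5236):
        # Design matrix for model (6.2-1): Y_{t-1}, Y_{t-2} and the harmonic
        # terms at the 24-hour and 12-hour frequencies; the target is Y_t.
        y = np.asarray(y, dtype=float)
        X, target = [], []
        for t in range(2, len(y)):
            X.append([y[t - 1], y[t - 2],
                      np.cos(omega1 * (t - 1)), np.sin(omega1 * (t - 1)),
                      np.cos(omega2 * (t - 1)), np.sin(omega2 * (t - 1))])
            target.append(y[t])
        return np.array(X), np.array(target)

With the six input nodes and two hidden nodes this corresponds to the W = (6+1)×2 + (2+1) = 17 weights counted above.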
The software package Basic ModelGen™ 1.0 from Crusader Systems was used. In the training phase the input signals are passed through an activation function (see Chapter 3). The activation function used here is a symmetric sigmoidal function (see 2.5-7). The backpropagation method was used to calculate the gradient of the error function. The training stopped after 250 iterations. After the network has been trained, the model has to be evaluated. Procedures for evaluation differ from programme package to programme package. A few of the procedures available in Basic ModelGen™ 1.0 from Crusader Systems are discussed in Section 6.4.

Figure 6.3 is a representation of the root mean squared error (RMSE), and therefore also the mean square error (MSE), of the training and cross-validation sets during training. This gives the user an indication of how well the training progresses with each iteration or epoch and, should there be some fault, training can be stopped immediately.

Figure 6.3: RMSE of the training set and the cross-validation set during training (epochs 1 to 150).

The RMSE for both the training and cross-validation sets falls sharply at the beginning of the iterative process. The RMSE of the training set is lower and decreases steadily, while the RMSE of the cross-validation set shows a slight upward trend before it stabilises after about 130 iterations. This is an indication that the model should fit reasonably well to an independent data set from the same population. The investigation of the model residuals by means of the Fourier transform gives a spectrum that is not constant. This means the residuals may still contain some information of the underlying time series. The sensitivity analysis points out that the model is the most sensitive to the input variables Y_{t-1} and Y_{t-2}. To inspect the sensitivity of the model output to the other input variables, Figure 6.4 can be observed.

A graph of the output of the model and the desired outputs also gives an indication of the quality of the fit of the model. Figure 6.5 shows these results for a randomly selected section of the time series. The graph appears satisfactory because the model output is relatively similar to the target or desired output values. The last peak is overestimated by the model. The performance of the model on weekends should be investigated further.

Figure 6.5: Model output versus the target output for a section of the time series.

The following table provides the correlation coefficient, R, between the outputs predicted by the model and the actual desired output on the test set. The test set is used for model evaluation; this is data that was not used during the training phase.

Model          Mean     Standard Deviation   R        R²       95% Confidence Limits for R
Feed Forward   0.096    0.341                0.9821   0.9645   [0.980 - 0.984]
Desired        0.035    0.341

The R of 0.9821 is quite high and lies within the 95% confidence interval. This indicates that the model fits the data well. The narrow interval is due to the large number of observations in the test set. The evaluation of the model has been done by means of various methods of measurement. If this model is satisfactory, the forecasting process can take place. In Section 6.5 the forecasting of twelve hours ahead is done. The model (6.2-1) was estimated and evaluated in the previous sections.
This model is now used to forecast. Basic ModelGen™ 1.0 from Crusader Systems is unfortunately not a specialised time series analysis package; it only provides one-step predictions. For a given set of input variables, the predicted model value (6.2-1) can be calculated. It was decided to use the naive method, the direct method and the bootstrap method (see Sections 5.3 and 5.4) to forecast twelve electricity consumption values. A Fortran programme (see Appendix: Chapter 6) was used to produce forecasts and prediction intervals using bootstrap methodology. In the next sections these three methods are implemented.

For the naive method the forecasts are calculated recursively, for example

Ŷ_N(2) = f(Ŷ_N(1), Y_N, cos[0.2618(N+1)], sin[0.2618(N+1)], cos[0.5236(N+1)], sin[0.5236(N+1)]; ŵ),
Ŷ_N(3) = f(Ŷ_N(2), Ŷ_N(1), cos[0.2618(N+2)], sin[0.2618(N+2)], cos[0.5236(N+2)], sin[0.5236(N+2)]; ŵ).

The following graph shows how this method performed. The forecast output values are compared with the target outputs for twelve hours.

Figure 6.6: Naive method forecasts compared with the target outputs for twelve hours.

By inspection of the graph one can see that after two forecasts the forecast values predicted by the model are less than the target values. This can be a problem, because it can cause an underproduction of electricity.

The next method implemented is the direct method. For the direct method a different model has to be estimated for each lead time. To forecast two steps ahead, the neural network is trained to estimate Y_{t+2} directly from Y_t and Y_{t-1}:

Y_{t+2} = g(Y_t, Y_{t-1}, cos[0.2618(t+1)], sin[0.2618(t+1)], cos[0.5236(t+1)], sin[0.5236(t+1)]; w) + ε_{t+2}.

In general, for lead time l,

Y_{t+l} = h(Y_t, Y_{t-1}, cos[0.2618(t+l-1)], sin[0.2618(t+l-1)], cos[0.5236(t+l-1)], sin[0.5236(t+l-1)]; w) + ε_{t+l}.

Figure 6.7 illustrates the performance of the direct method, with the forecast output values once again compared with the target outputs for twelve hours.

Figure 6.7: Direct method forecasts compared with the target outputs for twelve hours.

Here the model forecast values differ somewhat less from the target values. To compare the naive method with the direct method, Figure 6.8 is shown.

Figure 6.8: Target values with the naive and direct method forecasts.

Figure 6.8 illustrates that the direct method performs better. The predicted values of the model for this method are, for most of the time periods, the nearest to the target. In this chapter a neural network was fitted to hourly electricity consumption. Two methods for determining the forecasts were discussed. The direct method performed better for most of the forecast values.
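The direct method used above amounts to training one model per lead time, with the target shifted forward. The following sketch illustrates the idea under assumptions: make_model() may return any regressor with fit and predict methods (for example a small feedforward network), and for simplicity the harmonic inputs are kept at their one-step positions, whereas the dissertation shifts them with the lead time.

    import numpy as np

    def direct_forecasts(y, make_model, omega1=0.2618, omega2=0.5236, max_lead=12):
        # One separately trained model per lead time l = 1, ..., max_lead.
        y = np.asarray(y, dtype=float)
        n = len(y)
        # inputs at time t: Y_t, Y_{t-1} and the harmonic terms at time t
        base = np.array([[y[t], y[t - 1],
                          np.cos(omega1 * t), np.sin(omega1 * t),
                          np.cos(omega2 * t), np.sin(omega2 * t)]
                         for t in range(1, n)])
        x_new = base[-1:]                    # input row at the last observation
        forecasts = []
        for lead in range(1, max_lead + 1):
            X_l = base[: n - 1 - lead]       # rows t = 1, ..., n-1-lead
            y_l = y[1 + lead:]               # targets Y_{t+lead}
            model = make_model()
            model.fit(X_l, y_l)
            forecasts.append(model.predict(x_new)[0])
        return np.array(forecasts)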
Chatfield (1996) investigated alternative approaches to forecasting. The investigation of neural networks as forecasting models showed that models with fewer parameters generally give better out-of-sample predictions than neural network models with more parameters, even though these smaller models fit the training data worse than the less parsimonious models. Chatfield (1996) also found that wider prediction intervals may reflect model uncertainty better. Neural network software programs may be improved by including methods for determining prediction intervals.

The same neural network model used in Chapter 5 is used to predict the electricity consumption for 1 to 24 hours ahead. A time plot of the data is given in Figure 6.9.

Figure 6.9: Hourly electricity load demand time series.

Bootstrap methodology is used to calculate the predictions together with their standard errors and a 95% prediction interval. The results are illustrated in Figure 6.10.

Figure 6.10: Predicted hourly electricity load (bootstrap approach applied to the neural network).

A linear regression model was also used to predict the electricity load. The same input variables were used, and it was assumed that the prediction errors are normally distributed when determining the prediction limits. In Figure 6.11 the prediction limits as well as the predictions are illustrated.

Figure 6.11: Predicted hourly electricity load (linear regression model).

It can be seen that the forecasts obtained by the neural network (Figure 6.10) and the linear model (Figure 6.11) compare well. In the case of the linear model the predictions and the prediction intervals are smoother than those obtained by the bootstrap method for the neural network model. This is because the bootstrap method is data driven and does not rely on any assumptions on the probability distribution of the data.

PREDICTNN is an artificial feedforward neural network program that does one-step and multi-step time series prediction, as well as the calculation of standard errors and prediction intervals based on bootstrap methodology. (Reference).

Minimum hardware specifications: The program runs under DOS Version 4. Memory: 32 MB. CPU: 486. Disk space: 1 MB.

Disclaimer: The authors do not take responsibility in any terms for any damages that may result from using PREDICTNN.

Create a new directory, say TSPREDICT, and copy PREDICTNN.EXE to the TSPREDICT directory. All input and output files will be written to or read from the TSPREDICT directory. Go to a DOS window, change to the TSPREDICT directory (cd \path to TSPREDICT) and type PREDICTNN.

PREDICTNN requires parameter input (from the screen or an ASCII file) and the training data set (from an ASCII file). The user can either specify the parameters by typing them in as prompted by the program, or the parameters can be read from an input file:

Enter 0 if input from screen or 1 if input from file:

If the user chooses to specify the parameters from the screen, the following values must be entered as prompted by the program:

Enter the number of patterns in the training set:
Enter the number of bootstrap samples to be drawn:
Enter the number of input variables:
Enter the number of hidden neurons:
Enter the number of lags of the dependent variable:
Enter the number of forecasts to be calculated:
Enter the 100*(1-α)% confidence interval:
Enter the maximum number of iterations:
Enter the number of nn models to fit:
Enter the input file name:

It is assumed that the training set runs from the first pattern in the data file up to the number of patterns in the training set as specified. The number of bootstrap samples to be drawn refers to the number of samples used to calculate the prediction interval. A minimum of 1000 bootstrap samples is recommended in the literature. Note that an ANN is trained on every bootstrap sample. If you only want predictions and standard errors, you may specify a value of 3 for this parameter.
The program will draw 100 bootstrap samples from the original time series to calculate predictions and standard errors. The number of input variables is the number of independent/explanatory variables in the model, including the number of lagged values of the dependent/output variable. The dependent variable is a time series. The number of lags of the dependent variable is the autoregressive order of the model. Number of hidden neurons: the number of neurons in the hidden layer. Number of forecasts to be calculated: the size of the forecasting window. 100*(1-α)% confidence interval: a value of 0.05 will, for instance, specify a 95% confidence coefficient. Maximum number of iterations: the iterative optimization algorithm will stop after the maximum number of iterations has been reached, even if convergence to the global minimum has not taken place. More complex problems require more iterations; in relatively small problems (20 parameters), 60 iterations should suffice. The number of nn models to fit refers to the number of ANNs to be trained from random starting weights. The network with the lowest mean square error is then selected. Values between 3 and 10 should suffice. The input file name is the name of the data file (not longer than 20 characters). The values entered are automatically written to the file PARAMB.DAT.

If the parameters are read from a file, the program will read the parameter values from the ASCII file PARAMB.DAT, one value per line, in the following order:

NTRAIN: Number of patterns in the training set.
NBOOT: Number of bootstrap samples to be drawn.
NINP: Number of input variables.
NHID: Number of hidden neurons.
NLAGS: Number of lags of the dependent variable.
LEAD: Number of forecasts to be calculated.
PALPHA: 100*(1-α)% confidence interval.
MAXITE: Maximum number of iterations.
NFIT: Number of neural network models to fit.
FNAME: Input file name.

The data file is an ASCII file. The data are real numbers separated by one or more blanks. The name of the data file is specified by the user (see input parameters). The first pattern in the training set is given in the first row of the data file, starting with the output variable, the independent variables (if any), the first lag of the output variable, the second lag of the output variable, and so on. More than one line of data may be used for each pattern in the training set. The computer will go to a new line after a pattern has been read.

FORECAST.OUT. Format of output: forecast lead (e.g. 1 for one step), forecast, standard error of the forecast, lower and upper prediction limits, length of the prediction interval.

BOOT.OUT. Format of output: bootstrap sample number, value of the root mean squared error for the trained network (on the training set).

OPTIM.OUT. Work file used by the optimization algorithm. This file may have useful information on optimization problems. Iteration results are given for the last network trained.

Neural network architecture: Preprocessing: input and output variables are scaled to (-1, 1). Initialization of weights: a set of weights is generated using a random number generator. In order to prevent local minimum problems, a number (as specified by the user) of ANNs are trained from random starting points, and the best ANN is selected. One hidden layer with a sigmoidal activation function. Error function: root mean squared error. One output.
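As an illustration of the parameter file described above, the following sketch writes a PARAMB.DAT file with the ten values in the documented order. The numerical values shown are hypothetical placeholders, not a recommended configuration.

    # Hypothetical example values, one per line, in the documented order:
    # NTRAIN, NBOOT, NINP, NHID, NLAGS, LEAD, PALPHA, MAXITE, NFIT, FNAME
    params = [200, 1000, 6, 2, 2, 12, 0.05, 60, 5, "series.txt"]

    with open("PARAMB.DAT", "w") as fh:
        for value in params:
            fh.write(f"{value}\n")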
Number of input variables (NINP) and number of nodes in the hidden layer (NHID): NHID*(NINP+2) < 100.
Number of patterns in the training set (NTRAIN): NTRAIN < 1000 and (NINP+1)*NTRAIN < 6000.
Number of forecast lead times (LEAD): LEAD < 100.
Number of bootstrap samples (NBOOT): NBOOT < 2000 and LEAD*NBOOT < 10000.

The data file ELECS.TXT contains 716 patterns of an electricity load demand time series. Each pattern consists of the output variable and 8 input variables, including two lagged values of the output variable. Suppose we want to train a network with two nodes in the hidden layer on the first 400 patterns and produce 5 forecasts with 95% prediction limits, using 1000 bootstrap samples. The parameter values are then: 400, 1000, 8, 2, 2, 5, 0.05, 40, 3, elecs.txt. To run the program, type PREDICTNN. The program will estimate the required running time. The program can be interrupted from the keyboard if you want to terminate it. The output files are FORECAST.OUT and BOOT.OUT.

The temptation exists to use neural networks as "black boxes", and scientists often succumb. This is very unfortunate, because a neural network is a powerful statistical device, used for many different problems. Statisticians should provide the knowledge to implement statistical methods together with neural networks to bring out their best performance. In this dissertation, neural networks were used to analyse a time series, and especially to address the problem of forecasting. Statistical methods used for time series analysis are usually linear, but may not succeed in forecasting a non-linear data set. This study specifically investigated the use of an ARMA model in comparison with a neural network model for the problem of forecasting. Two time series, namely a generated AR(2) series and an electricity consumption time series, were used to illustrate the process of time series analysis. Box and Jenkins proposed three phases for the analysis of a time series: identification, estimation and evaluation. These phases were given for ARMA models and also applied to neural network models in Chapter 4. To illustrate the use of neural networks for the problem of forecasting, electricity consumption data was used as a time series. A neural network model was estimated to predict twelve hours ahead. The model for the electricity consumption can be improved. Different forecasting methods suitable for non-linear models were discussed. A program was developed to determine prediction intervals for the examples of the linear model and the neural network model by using the bootstrap method. The neural network program package that was used in this study can only calculate one-step predictions and does not permit the implementation of the bootstrap method. Calculation of multi-step forecasts and prediction intervals should be implemented in neural network software. Neural networks, as a versatile and adjustable tool, are highly recommended. This study focussed on the identification, estimation and evaluation of neural network models for time series, and on forecasting. Many of the standard techniques in statistics can be compared with neural network methodology, especially in applications with large data sets. Further research includes the use of bootstrap methods to calculate predictions, standard errors and prediction intervals for the forecasts. Bootstrap techniques can be implemented in neural network computer software packages. Research in the application and comparison of bootstrap techniques with neural networks is recommended.
ANSUJ, A.P., CAMARGO, M.E., RADHARAMANAN, R. and PETRY, D.G. (1996). Sales forecasting using time series models and neural networks. Computers and Industrial Engineering, 31(1,2), 421-424.

BISHOP, C.M. (1996). Neural networks for pattern recognition. Oxford University Press, New York.

BERTHOLD, M. and HAND, D.J. (1999). Intelligent data analysis: An introduction. Springer-Verlag.

BROWN, B. and MARIANO, R. (1989). Residual based stochastic predictors and estimation in non-linear models. Econometrica, 52, 321-343.

BORAINE, H. (2000). Prediction intervals for forecasts of neural network models for time series. Proceedings of the Second ICSC Symposium on Neural Computation, Berlin, Germany (23-26 May, 2000).

BOX, G.E.P., JENKINS, G.M. and REINSEL, G.C. (1994). Time Series Analysis: Forecasting and Control. Third Edition, Prentice Hall, Englewood Cliffs, New Jersey.

CHATFIELD, C. (1996). Model Uncertainty and Forecast Accuracy. Journal of Forecasting, 15, 495-508.

CHENG, B. and TITTERINGTON, D.M. (1994). Neural networks: A view from a statistical perspective. Statistical Science, 9(1), 2-54.

CONNOR, J.T. (1996). A robust neural network filter for electricity demand prediction. Journal of Forecasting, 15(6), 458(22).

CRUSADER SYSTEMS (Pty) Ltd. (1998). Basic ModelGen™ 1.0 user's guide, 1-142.

DE JONGH, P.J. and DE WET, T. (1994). An introduction to neural networks, 1-19.

FLETCHER, R. (1987). Practical Methods of Optimization (Second ed.). New York: John Wiley.

GALPIN, J.S. (1997). Investigation of improvement of load forecasting by using rainfall, temperature and humidity data. ESKOM Technology research report, 2-93.

GEMAN, S., BIENENSTOCK, E. and DOURSAT, R. (1992). Neural Networks and the Bias/Variance Dilemma. Neural Computation, 4, 1-58.

HAYKIN, S. (1994). Neural networks - A comprehensive foundation. Macmillan College Publishing Company, Inc., printed in USA.

HWANG, J.T.G. and DING, A.A. (1997). Prediction intervals for neural networks. Journal of the American Statistical Association, 92(438), Theory and Methods, 748-757.

HAMILTON, J.D. (1994). Time series analysis. Princeton University Press, Princeton, New Jersey.

JOHANSSON, E.M., DOWLA, F.U. and GOODMAN, D.M. (1992). Backpropagation learning for multi-layer feedforward neural networks using conjugate gradient method. International Journal of Neural Systems, 2(4), 188-197.

KRAMER, A.H. and SANGIOVANNI-VINCENTELLI, A. (1989). Efficient parallel learning algorithms for neural networks. In Touretzky (Ed), Advances in Neural Information Processing Systems, 1, 40-48. San Mateo, CA: Morgan Kaufmann.

LE BARON, B. and WEIGEND, A.S. (1996). Evaluating neural network predictors by bootstrapping, http://www.cs.colorado.edu/homes/andreas/public_html/Home.html (January 1998).

LEE, K.Y., CHA, Y.T. and PARK, J.H. (1992). Short term load forecasting using an artificial neural network. IEEE Transactions on Power Systems, 7, 124-132.

LIN, J.L. and GRANGER, C.W.J. (1994). Forecasting from Non-linear Models in Practice. Journal of Forecasting, 13, 1-9.

MASAROTTO, G. (1990). Bootstrap prediction intervals for autoregressions. International Journal of Forecasting, 6, 229-239.

McCULLOCH, W.S. and PITTS, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.

McLEOD, A.I. (1978). On the distribution of residual autocorrelation in Box-Jenkins models. Journal of the Royal Statistical Society, A40, 296-302.

MINSKY, M. and PAPERT, S. (1969). Perceptrons. Cambridge, MA: MIT Press.

ORR, K.D. (1995). Use of an artificial neural network to predict ICU length of stay in cardiac surgical patients.
Society of Critical Care Medicine, San Francisco, CA, January 1995.

PARK, D.C., EL-SHARKAWI, M.A., MARKS II, R.J., ATLAS, L.E. and DAMBORG, M.J. (1991). Electric load forecasting using neural networks. IEEE Transactions on Power Systems, 6(2), 442-449.

PENG, T.M., HUBELE, N.F. and KARADY, G.G. (1993). An adaptive neural networks approach to one week ahead load forecasting. IEEE Transactions on Power Systems, 8(3), 1195-1203.

RIPLEY, B.D. (1993). Statistical aspects of neural networks. Invited lectures for SemStat, Sandbjerg, Denmark, 25-30 April 1992. To appear in the proceedings to be published by Chapman & Hall (January 1993).

RIPLEY, B.D. (1994). Neural networks and related methods of classification. Journal of the Royal Statistical Society, 56(3), 409-456.

SARLE, W.S. (1996). Neural Network and Statistical Jargon. [email protected], (April 29, 1996).

SCHALKOFF, R. (1992). Pattern Recognition: Statistical, Structural and Neural Approaches. John Wiley and Sons, Inc., Canada.

SNYMAN, J.A. (1982). A New Dynamic Method for Unconstrained Minimization. Applied Mathematical Modelling, 6, 449-462.

SNYMAN, J.A. (1983). An updated version of the original Leap-frog Dynamic Method for Unconstrained Minimisation (LFOP1(b)). Applied Mathematical Modelling, 7, 216-218.

STERN, H.S. (1996). Neural networks in applied statistics. American Statistical Association and the American Society for Quality Control: Technometrics, 38(3), 205-213.

TAM, K.Y. and KIANG, M.Y. (1992). Managerial applications of neural networks in the case of bank failure predictions. Journal of Management Science, 38(7), 926-947.