Neural networks for time series analysis
Submitted in partial fulfilment of the requirements for the degree Magister in Mathematical Statistics
Department of Statistics
Faculty of Science
University of Pretoria
Acknowledgements
2. Dr Johan Holm, my co-promoter, for his advice and encouragement, especially in the field of neural networks.
Summary
Promoter: Dr H Boraine
Co-promoter: Dr JEW Holm
The analysis of a time series is a problem well known to statisticians. Neural networks
form the basis of an entirely non-linear approach to the analysis of time series. They have
been widely used in pattern recognition, classification and prediction. Recently,
reviews from a statistical perspective were done by Cheng and Titterington (1994)
and Ripley (1993).
One of the most important properties of a neural network is its ability to learn. In
neural network methodology, the data set is divided in three different sets, namely a
training set, a cross-validation set, and a test set. The training set is used for training
the network with the various available learning (optimisation) algorithms. Different
algorithms will perform best on different problems. The advantages and limitations of
different algorithms in respect of all training problems are discussed.
In this dissertation the method of neural networks and that of ARIMA models are
discussed. The procedures of identification, estimation and evaluation of both models
are investigated. Many of the standard techniques in statistics can be compared with
neural network methodology, especially in applications with large data sets.
To illustrate the adaptability of neural networks the problem of forecasting is
considered. A few alternative theoretical approaches to obtain forecasts for non-linear
time series models are discussed. It is shown that bootstrap methods can be used to
calculate predictions, standard errors and prediction limits for the forecasts.
Neural network methodology is applied to predict an electricity consumption time
series.
Supervisor: Dr H Boraine
Co-supervisor: Dr J.E.W Holm
Submitted in fulfilment of part of the requirements for the degree Magister in Mathematical Statistics.
The analysis of time series is a well-known problem for statisticians. Neural networks form the basis of an entirely non-linear approach to the analysis of a time series. They are used in a wide spectrum of problem areas, for example pattern recognition as well as classification and forecasting. Only recently have neural networks been considered from a statistical point of view, by researchers such as Cheng and Titterington (1994) and Ripley (1993).
One of the most important properties of neural networks is the ability to learn. Neural network methodology entails dividing the data set into three parts, namely a training set, a cross-validation set and a test set. The training set is used to train the network by means of various learning algorithms. Certain algorithms perform better than others, depending on the type of problem. The advantages and disadvantages of learning algorithms are discussed.
In this dissertation the methodology of neural networks and that of ARMA models is discussed. The procedures of identification, estimation and evaluation of both models are discussed. Many of the standard techniques in statistics are compared with neural network methodology, specifically in the application to large data sets.
To illustrate the adaptability of neural networks, the problem of forecasting is considered. A few alternative theoretical approaches to obtaining forecasts for non-linear time series are discussed. It is shown that bootstrap methods can be used to calculate forecasts, standard errors and intervals for forecast values.
Symbol : Commentary
x : a set of input variables, for example y_{t-1}, y_{t-2}, temperature, days, ...
E(w) = \frac{1}{2} \sum_{j=1}^{N} \{y_j - f(x_j; w)\}^2 : the error function
(H)_{ij} = \frac{\partial^2 E(w)}{\partial w_i \partial w_j} : the Hessian matrix of the error function
b = \frac{\partial E(w)}{\partial w}\Big|_{w_0} : the gradient of the error function evaluated at w_0
Error function used for maximum likelihood estimation
Contents

Chapter 2: Neural Networks
2.1 Introduction
2.2 Historical Development
2.3 Artificial Neural Networks
2.4 Relevance of Neural Networks in Research
2.5 The Model of a Neural Network
2.6 Architecture of a Neural Network
2.6.1 Single-Layer Feedforward Networks
2.6.2 Multi-Layer Feedforward Networks
2.6.3 Recurrent Networks
Appendix: Terminology

Chapter 3: Learning in Neural Networks
3.1 Introduction
3.2 The Error Function
3.3 Taylor Series Approximations
3.4 Training
3.4.1 The Method of Gradient Descent
3.4.2 Superlinear Methods
3.4.3 The Newton Methods
3.5 Cross-Validation as part of the Learning Process
3.6 Conclusions

Chapter 4: ARMA Processes and Neural Network Models
4.1 Introduction
4.2 The ARMA(p,q)-Process
4.3 Identification
4.4 Estimation
4.4.1 Maximum Likelihood Estimation (MLE)
4.4.2 Estimation when using Neural Networks
4.5 Evaluation

Chapter 5: Forecasting a Time Series Using Neural Networks
5.1 Introduction
5.2 The Forecasting Model
5.3 Non-linear Forecasting
5.3.1 The Naive Method
5.3.2 Closed Form
5.3.3 Monte Carlo
5.3.4 Bootstrap
5.3.5 Direct Method
5.4 The Bootstrap Technique
5.4.1 Prediction Limits for Autoregressive Models
5.4.2 Multi-step Forecasts for Neural Network Models
5.4.3 Prediction Limits for Y_{n+h}

Chapter 6: Forecasting Electricity Time Series by using Neural Networks
6.1 Introduction
6.2 Model Identification
6.3 Estimation
6.4 Evaluation of the Model
6.5 Forecasting by using the Neural Network Model
6.5.1 The Naive Method
6.5.2 The Direct Method
6.5.3 The Bootstrap Method
Appendix: Fortran Computer Programme

Chapter 7: Conclusions and Recommendations for Further Research

References
Detecting trends and patterns in a time series traditionally involves statistical methods
such as clustering and regression analysis. However, these methods are linear and
may fail to forecast a non-linear data set. Neural networks form the basis of a different
approach to the non-linear analysis of time series. This study investigates the use of a
linear time series ARIMA model in comparison with a non-linear neural network
model for application in forecasting.
Chapter 2 deals with the history and development of neural networks. This is done in
the framework of a variety of problems, such as classification, pattern recognition and
forecasting. The model of a neural network is illustrated along with different types of
applied neural networks. Of all the possible networks, multi-layered feedforward
neural networks are used in this dissertation.
A very important property of a neural network is its ability to learn, which is the
phase during which parameter estimation takes place. The data set is divided into
three different sets, namely a training set, cross-validation set, and a test set. Neural
networks are data driven in the sense that they use data to train and build statistical
models of a recorded process. The training set is used for training the network with
various available learning (optimisation) algorithms. These various available learning
algorithms are treated in Chapter 3. Optimisation is the process in which an error
function, E, is minimised. The cross-validation set is used to prevent the network from
over-training. Over-training occurs when the network is able to represent the data
very well but without the ability to generalise. In well-behaved optimisation
problems, the value of the error function decreases with respect to the
parameters or weights. The algorithm will terminate when a specified criterion is met,
such as a required degree of accuracy. A stopping criterion that is often used in neural
network methodology, to prevent over-training, is the value of the error function
calculated for the cross-validation set (a data set not used for the estimation of the
parameters of the model). If this value shows an upward trend, while the
corresponding value for the training set decreases, over-training is indicated. The test
set does not form part of the learning process and is not considered in the estimation
of the model.
Further on in Chapter 3, different optimisation algorithms are discussed. It is emphasised
that it is not possible to highlight one specific algorithm because all algorithms have
advantages and limitations.
Chapter 4 reviews the ARMA(p,q)-process. The different stages of modelling, namely
identification, estimation by means of maximum likelihood and evaluation, are
discussed. This is illustrated by examples where both an AR(2) model and a neural
network model are fitted to two time series: a computer-generated AR(2) time series and
an electricity consumption time series. In Chapter 6, the model used to describe the
electricity consumption is extended in order to describe complex seasonal patterns.
Time series models are often used for forecasting. Since neural network models are non-linear, minimum mean squared error forecasts are not of a simple form. In Chapter 5, a
procedure is proposed to calculate one- and multi-step forecasts as well as prediction
limits for the forecasts based on bootstrap methodology.
In Chapter 6, more complex models are considered to fit the electricity consumption. The
aim of Chapter 6 is to demonstrate the use of neural networks as a statistical tool for
forecasting. Forecasting is done for one to twelve and twenty-four hours ahead. Three
of the forecasting procedures discussed in Chapter 5 are also implemented.
Chapter 7 concludes this work and suggests three extensions to this research, namely the
inclusion of bootstrap methods in new software tools, comparative studies and
applications.
An introduction to neural networks will be given in this chapter. In Section 2.2 the
history of Artificial Neural Networks (ANN), the development thereof, the different
fields of application, as well as the links that have been established between neural
networks and statistics are discussed. To conclude the section, the connection
between a biological neuron and the neural network is illustrated. The capabilities and
properties of neural networks are given in Sections 2.3 and 2.4. The construction of a
neural network as well as network architectures will be discussed in Section 2.5 and
Section 2.6.
Interest in Artificial Neural Networks was first aroused when McCulloch
and Pitts (1943) introduced simplified neurons which represented biological
neurons and which could perform computational tasks. Minsky and Papert
(1969) published their book Perceptrons, which focussed on the deficiencies of
perceptron models. Unfortunately this caused many researchers to leave the field.
Interest in neural networks in the early eighties re-emerged after the publication of
several important theoretical results. This renewed interest is clearly visible in the
number of societies and journals associated with neural networks. The INNS
(International Neural Network Society) is widely known, and ranges from Europe
(ENNS) to Japan (JNNS). The field of Electrical and Electronic Engineering is served
by the journal series IEEE, and the field of Agriculture by the journal Neural Network
Application Agriculture (NNAA). In the field of Economics and Finance there are the
Journal of Economics and Complexity and the Journal of Computational Intelligence
in Finance etc. Other well known journals include Neural Networks and Neural
Computation.
Several conferences are held regularly; for example, the World
Congress of Neural Networks (WCNN'95) took place in Washington in 1995. A large
variety of software packages is available on the market today. Neural networks are
popular because of the many application fields in which they can be used. Many
articles have been written in recent years in several interesting fields, for example on
classification (Ripley, 1994) and pattern recognition, where applications include the
automatic reading of handwriting and speech recognition (Park et al, 1991). Another
field of application for neural networks is prediction. In the world of finance, Tam
and Kiang (1992) showed that the neural network is a promising method of predicting
bankruptcy. In the medical world neural networks are used for instance to predict
mortality following cardiac surgery (Orr, 1995).
Several articles have been published in which statistical principles are used with
neural networks. Cheng and Titterington (1994) pointed out some links between
statistics and neural networks and encouraged cross-disciplinary research to improve
results. De Jongh and De Wet (1991) introduced neural networks as a powerful tool
that is close to statistical methods in solving various problems like pattern
recognition, classification and forecasting. The neural network and statistical
literature contain many of the same concepts but usually with different terminology
(Sarle, 1996). A list of statistical and neural network terminology is given in the
Appendix of this chapter.
In the following paragraphs a complete discussion will be given of what a neural
network is and how it compares with a biological neural network.
Artificial Neural Networks (ANN) are computer algorithms or computer programs
developed to model the activity of the brain (Stern, 1996:205). The term "neural
network" originates from the attempt by scientists to imitate the ability of the brain to
recognise patterns and to classify. This is the reason why they are called "Artificial Neural
Networks". Scientists often refer to ANN simply as NN (neural networks). In this
dissertation they will also be referred to only as neural networks.
The brain is a highly complex, non-linear and massively parallel computer. It can
perform computations like pattern recognition, perception and motor control many
times faster than any digital computer. This is possible because of the existence of
neurons. Neurons are simple elements in the brain connected to each other to form a
network of neurons. In Figure 2.1 the biological neuron is illustrated.
A neuron is activated by other activated neurons and, in turn, activates other
neurons. Synapses are units that mediate the interaction between the neurons. A synapse can
either excite or inhibit a receptive neuron. If a neuron receives an excitatory signal from a
neighbouring neuron, the neuron's soma potential rises above some
threshold value and the neuron "fires". This in turn sends a signal to the next neuron,
and so on. New synaptic connections between neurons and the modification of
existing synapses constantly take place as the brain learns.
The axons are the transmission lines, and the dendrites are the receptive zones. The
neural network resembles the human brain in two ways, namely that the network has
to learn to acquire knowledge and then the knowledge is stored in synaptic weights,
which represent the strength between connected neurons (Haykin, 1994:2).
For classification problems, discriminant analysis is used and linear regression
analysis is commonly used for prediction. Why would one use neural networks when
there are so many other methods available? In this section only a few reasons
mentioned by Haykin (1994) are given.
Firstly, a neural network develops out of interconnections between neurons which
define a non-linear relationship between the neurons, whereas the procedures for the
above-mentioned techniques are linear.
Secondly, neural network methodology is a non-parametric approach, because the network
learns from examples by constructing an input-output mapping for the specific problem.
Although there are usually many parameters in a neural network model, these
parameters are essentially artefacts of the fitting process. The emphasis is therefore
not on interpreting their values but on finding the optimal set for the classification or
prediction problem. One of the training methods is supervised learning. This involves
the changing of the synaptic weights by presenting a training set of examples to the
network. Every time the weights are changed, an error signal is computed. This error
signal is defined as the difference between the actual response of the network and the
desired response. The moment the error signal reaches a minimum value, the desired
weights for the specific problem have been obtained. Neural networks are therefore
adaptable. A more detailed discussion on supervised learning as a training method is
given in Chapter 3.
In classification problems, neural networks are not only designed to give information
on which specific pattern to select, but also on the confidence in the decision made.
Evidential response is thus an important property of neural networks in classification
problems.
Neural networks are fault tolerant in the sense that, for example if a neuron or its
connecting links should be damaged, the neural network will gradually degrade in
performance, rather than suddenly.
Another important issue regarding neural networks is that the broad approach is a
very computer-intensive method, but with the constant improvement in computer
technology it becomes cheaper and faster to do calculations.
Lastly, neural networks are perhaps more user-friendly than other methods because
the user only has to differentiate between the independent (input) and dependent
(output) variables before selecting a model.
There are other broad classes of neural networks, such as Hopfield associative networks,
linear networks, probabilistic neural networks and radial basis function neural
networks. A feedforward neural network, as used in this dissertation, is a name used for
a wide variety of mathematical models used to define the connection between a number
of explanatory variables and one or more dependent variables. In Section 2.5 it is shown
that the architecture of the model depicts the mathematical layout of the neural network.
As mentioned in Section 2.2, the neurons are the elements that are connected to each
other to form the network. In Figure 2.2 the non-linear model of a neuron is
illustrated. The different components of a non-linear model are given.
Let x_1, x_2, \ldots, x_p be a set of independent or explanatory variables and let y_k be the
non-linear function of these variables illustrated in Figure 2.2.

Figure 2.2: The non-linear model of a neuron (Haykin, 1994:8)
The x_j's are the input signals. Each of them is connected to a weight or strength of its
own. A signal x_j at the input of synapse j connected to neuron k is multiplied by the
synaptic weight w_{kj}. The first subscript refers to the neuron in question and the second
subscript refers to the input end of the synapse to which the weight refers. If the
weight is positive, the associated synapse is excitatory. If negative, the synapse is
inhibitory. The threshold, \theta_k, is a bias or constant parameter added as an
extra input, with its value in most cases chosen as one.
The non-linear regression model can be described as

y_k = \varphi\left( \sum_{j=1}^{p} w_{kj} x_j - \theta_k \right) + \varepsilon_k,

where y_k is the output, w_{kj}, j = 1, 2, \ldots, p are the parameters or weights associated with
each input value x_j, j = 1, 2, \ldots, p, and \varepsilon_k is the residual term.
The linear combiner, denoted by \Sigma, sums the input signals x_j, weighted by the
respective synaptic weights of the neuron.
The activation function, \varphi, transforms the amplitude of the output of a neuron.
Normally the amplitude will lie in the interval (0.0;1.0) or (-1.0;1.0), depending on the
type of activation function. Three types of activation functions, namely the threshold
function, the piecewise linear function, and the sigmoidal function are discussed.
v_k = \sum_{j=1}^{p} w_{kj} x_j - \theta_k

The threshold function is

\varphi(v) = \begin{cases} 1 & \text{if } v \geq 0 \\ 0 & \text{if } v < 0 \end{cases}
Figure 2.3: The threshold function
The piecewise linear function is

\varphi(v) = \begin{cases} 1 & \text{if } v \geq 0.5 \\ v & \text{if } -0.5 < v < 0.5 \\ 0 & \text{if } v \leq -0.5 \end{cases}
Figure 2.4: The piecewise linear function
The sigmoidal function is the most popular activation function. The fact that it is a
differentiable function is especially important in the training of neural networks. The
reason for this will be discussed later. An example of a sigmoidal function is the
logistic function

\varphi(v) = \frac{1}{1 + \exp(-av)}.
Note that the activation functions range from 0 to 1. In Figure 2.5 the sigmoidal
function is illustrated for different values of a.
Figure 2.5: The sigmoidal (logistic) function \varphi(v) plotted against v for different values of a
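As a small illustration of the neuron model above, the following Python sketch (not part of the original dissertation; the input signals, weights and threshold are arbitrary assumed values) computes v_k and applies each of the three activation functions to it.

import numpy as np

def threshold(v):
    # Threshold activation: 1 if v >= 0, else 0
    return np.where(v >= 0, 1.0, 0.0)

def piecewise_linear(v):
    # phi(v) = 1 if v >= 0.5, v if -0.5 < v < 0.5, 0 if v <= -0.5
    return np.where(v >= 0.5, 1.0, np.where(v <= -0.5, 0.0, v))

def sigmoid(v, a=1.0):
    # Logistic (sigmoidal) activation with slope parameter a
    return 1.0 / (1.0 + np.exp(-a * v))

# Assumed example values: three input signals, synaptic weights and a threshold
x = np.array([0.8, -0.2, 0.5])        # input signals x_j
w = np.array([0.4, 0.1, -0.3])        # synaptic weights w_kj
theta = 0.05                          # threshold theta_k

v = np.dot(w, x) - theta              # v_k = sum_j w_kj x_j - theta_k
for phi in (threshold, piecewise_linear, sigmoid):
    print(phi.__name__, phi(v))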
The threshold has the effect of lowering the net input to the activation function. If the
opposite effect is needed, a bias term can be implemented. The bias and threshold
are therefore numerically the same but with different signs.
The output of the neuron can therefore be written as

y_k = \varphi(u_k - \theta_k),

where y_k is the output signal, \varphi(\cdot) is the activation function and \theta_k is the
threshold. The use of the threshold \theta_k has the effect of applying an affine
transformation to the output u_k of the linear combiner, giving the activation potential v_k = u_k - \theta_k. See Figure 2.6.
Figure 2.6: Affine transformation produced by the presence of a threshold: the total internal activity level v_k against the linear combiner's output u_k, for \theta_k < 0, \theta_k = 0 and \theta_k > 0 (Haykin, 1994:9)
Whether the threshold \theta_k is positive, zero or negative, the relationship between the activation
potential v_k of neuron k and the linear combiner output u_k remains a straight line of unit slope;
the threshold merely shifts it.
A neural network is an interconnection of neurons. Neurons can be connected in many
different ways to define different neural network architectures. In Section 2.6 the
architecture of the neural network is illustrated along with the different classes of
neural network architectures.
Neural networks are mostly illustrated as block diagrams in which the various
elements of the model are described. There are four different classes of network
architectures, namely single-layer feedforward networks, multi-layer feedforward
networks, recurrent networks and lattice structures. Only the first three will be
discussed.
When a network is organised in the form of layers, it is referred to as a layered network.
A linear regression model is equivalent to a feedforward neural network in its
simplest form, with an input and an output layer and a linear activation function. In
Figure 2.7 a feedforward network with a single layer of neurons is illustrated. The
simplest feedforward neural network has an input and an output layer.
Figure 2.7: Feedforward network with a single layer of neurons (Schalkhoff, 1992:250)
This network consists of an input layer, one or more hidden layers and an output
layer. The computation nodes of the hidden layer are called the hidden neurons or
units. This type of network defines a more complicated model. The relationship
between the inputs and outputs can be non-linear functions of non-linear functions, so
that the degree of non-linearity increases with each additional layer. In Figure 2.8 a
fully connected feedforward network with one hidden layer is illustrated. In a fully
connected feedforward network, every input neuron in the input layer is connected to
every neuron in the following hidden layer, and every neuron in the last hidden
layer is connected to every neuron in the output layer (Schalkhoff, 1992:236-258).
Figure 2.8: Fully connected feedforward network with one hidden layer and output layer (De Jongh and De Wet, 1994:5)
There are many fields of application for the multi-layered feedforward network, such as
classification, pattern recognition and forecasting. The following example illustrates a
time series prediction problem.
Let y_t, y_{t-1}, y_{t-2}, \ldots be a stationary time series and let y_{t+1} be the value to be predicted.
Then a set of d such values y_{t-d+1}, \ldots, y_t can be selected from the time series to be the
inputs to a feedforward network, and the next value y_{t+1} is used as the target for the
output. In Figure 2.9 a feedforward neural network with one hidden layer is illustrated
(Bishop, 1996:303).
Figure 2.9: Feedforward neural network with one hidden layer, with inputs y_{t-d+1}, \ldots, y_{t-1}, y_t and output y_{t+1} (Bishop, 1996:303)
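The construction of such input patterns can be sketched in a few lines of Python (the series y, the window length d and the example values are assumptions for illustration only):

import numpy as np

def make_lagged_patterns(y, d):
    # Build input-target pairs: inputs are (y[t-d+1], ..., y[t]), target is y[t+1]
    X = np.array([y[t - d + 1 : t + 1] for t in range(d - 1, len(y) - 1)])
    targets = y[d:]
    return X, targets

y = np.sin(0.26 * np.arange(200)) + 0.1 * np.random.randn(200)  # assumed example series
X, targets = make_lagged_patterns(y, d=6)
print(X.shape, targets.shape)   # (194, 6) and (194,)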
A recurrent network differs from a feedforward network in the sense that it has at
least one feedback loop. If feedback exists in a neural network, it can be shown that
its output depends on all previous input values. Let y_1, y_2, \ldots, y_N be a time series.
Suppose that a recurrent network architecture is used to define a model for the time
series and that the observation at time t can be written as a function of the previous
observation y_{t-1} and the previous output value \hat{Y}_{t-1}, which is characteristic of a
recurrent neural network model. By recursive substitution it can be shown that y_t can be
written as a function of y_{t-1}, y_{t-2}, \ldots:

y_t = f(y_{t-1}, \hat{Y}_{t-1}; w) + \varepsilon_t
    = f(y_{t-1}, f(y_{t-2}, \hat{Y}_{t-2}; w); w) + \varepsilon_t
    = \ldots
If y_t is generated by a moving average process (see par 4.2), it can be expressed as a
linear function of an infinite number of previous time series observations y_{t-1}, y_{t-2}, \ldots
(see 4.2.7). A recurrent network will therefore be suitable in the case of any process
with moving average terms.
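The recursive dependence of a recurrent network's output on all earlier observations can be made concrete with a small Python sketch; the function f below is an arbitrary stand-in for a trained network, not the model used in this study.

import numpy as np

def f(y_prev, out_prev, w):
    # Assumed simple non-linear map standing in for the network
    return np.tanh(w[0] * y_prev + w[1] * out_prev)

y = [0.5, -0.2, 0.7, 0.1, -0.4]   # observed series (illustrative values)
w = (0.8, 0.5)                    # weights (illustrative)

out = 0.0                         # initial output value
for t in range(1, len(y)):
    out = f(y[t - 1], out, w)     # output at time t depends, recursively, on all earlier y's
    print(t, round(out, 4))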
Basic ModelGen™ 1.0 from Crusader Systems, the software package that was used in
this study, offers both feedforward and Elman networks, which are recurrent. In
Figure 2.10 the Elman network is illustrated.

Figure 2.10: An Elman recurrent network (Basic ModelGen™ 1.0, 1997:53-54)
In an Elman network one or more of the hidden units from the previous time step are
used as an input at the next time step. This is useful when the output depends on the
history of the inputs, which is the case in this study. O'Brien (1997) uses a
feedforward neural network and an Elman recurrent neural network to extract
knowledge of a physical system.
A neural network can thus be compared with a non-linear model which defines a
connection between certain input (explanatory or independent) variables and output
(dependent) variables. Neural networks are used for different applications, such as
classification and prediction. In order to predict or classify, the parameters, in the case of
the non-linear model known in neural network literature as weights, first have to be
estimated. The estimation of the weights is referred to as "training" the network.
Chapter 3 investigates the various methods of training a network.
Below is a list of neural network terms and the corresponding statistical terms (Sarle, 1996:1-5).

Neural network term : Statistical term
Architecture : Model
Training, Learning, Adaptation : Estimation, Model fitting, Optimisation
Classification : Discriminant analysis
Mapping, Function approximation : Regression
Supervised learning : Regression, Discriminant analysis
Unsupervised learning, Self-organization : Principal components, Cluster analysis, Data reduction
Training set : Sample, Construction sample
Test set, Validation set : Hold-out sample
Input : Independent variables, Predictors, Regressors, Explanatory variables, Carriers
Output : Predicted values
Forward propagation : Prediction
Training values, Target values : Dependent variables, Responses, Observed values
Training pair : Observation containing both inputs and target values
Errors : Residuals
Noise : Error term
Generalisation : Interpolation, Extrapolation, Prediction
Prediction : Forecasting
Squashing function : Bounded function with infinite domain
Error bars : Confidence intervals
Weights, Synaptic weights : (Regression) coefficients, Parameter estimates
Bias : Intercept
The difference between the expected value of a statistic and the corresponding true value : Bias
Backpropagation : Computation of derivatives for a multilayer perceptron and various algorithms based thereon
Least mean squares (LMS) : Ordinary least squares (OLS)
One of the most important properties of a neural network is its ability to learn. In
neural network methodology, the data set is divided into three different sets, namely a
training set, a cross-validation set, and a test set. The training set is used for training
the network with the various available learning (optimisation) algorithms. These are
discussed in this chapter. Section 3.2 defines the error function associated with neural
network learning. In Section 3.3 and Section 3.4 methods based on first and second
order Taylor expansions are reviewed, including the gradient descent method, Newton
methods, conjugate gradient and quasi-Newton methods. The powerful superlinear
methods, namely the Levenberg-Marquardt algorithm and Snyman's leap-frog method,
which involve combinations of first and second order Taylor expansions, are also
discussed. The procedure of cross-validation, which forms part of the learning
process, is described in Section 3.5. Different algorithms will perform best on
different problems. It is therefore not possible to highlight one specific algorithm. The
advantages and limitations of different algorithms in respect of all training problems
are discussed.
Optimisation is the process in which an error function E is minimised. There are many
optimal error functions, of which the sum of squares error function is the most widely
used. The error is a function of the weights in the network as well as the training data.
For a multi-layered perceptron (MLP) (discussed in Chapter 2), the derivatives of an
error function with respect to the weights can be obtained efficiently by using
backpropagation. The gradient information is of central importance in the use of
algorithms for network training. There are four main stationary points at which the
local gradient of an error function can possibly vanish. Figure 3.1 illustrates this point
by showing the value of an error function E against a single weight w. Point A is
called a local minimum because it is not the global minimum value in the error space.
Point B is a local maximum, while point C is called a "saddle point". A further point marks a
region where the error function can be flat, and some algorithms tend to get stuck on
this flat surface for long periods. This behaviour is also found with local minima and
can lead to premature termination of the optimisation algorithm. The desired error
minimum is point D, the global minimum, because the value of the error function is
minimal at this point.
To illustrate basic terminology, the simple case where the error function depends on
only two weights, w_1 and w_2, is considered. The problem is to find values for w_1 and
w_2 where the error function reaches a global minimum. Figure 3.2 is a geometrical
representation of an error function E(w) as a surface lying above the weight space.
Point A represents a local minimum while point B represents the global minimum of
the error function. If C is any point on the error surface, then the local gradient of the
error function is \nabla E, the gradient of the error function with respect to the
weights. The gradient in Figure 3.2 will thus be the two-dimensional vector of partial
derivatives with respect to the weights:

\nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2} \right)^T.
Neural networks usually have many weight parameters, which may lead to extended
training times. As a result, effective algorithms are necessary to find a suitable local
minimum of the error function in the shortest time possible. The parameters or
weights are not identifiable, because more than one set of parameters can give the
same error function value. For instance, a two-layered neural network with M hidden
neurons exhibits a symmetry factor of M!2^M (Bishop, 1996:256) equivalent points
which generate the same network mapping and therefore give the same value for the
error function. It is clear that the error function therefore does not have a unique
global minimum. As a result, the specific values of the weights are not important.
When the error surface is locally convex, the error function can be written as a Taylor
series expansion in the vicinity of a local minimum. Optimisation algorithms can be
classified according to their relation with the terms in the Taylor expansion. Four
categories can be distinguished. Firstly, methods based on zero order Taylor
expansions are the simplex and random walk methods. These methods are well treated in
different texts, and will not be of interest in this work (Fletcher, 1987). A second category
includes methods based on first order Taylor expansions, such as gradient descent. A
third category, which can involve combinations of first and second order Taylor
expansions, consists of superlinear methods such as the Levenberg-Marquardt, leap-frog,
quasi-Newton and conjugate gradient methods. Finally, the fourth category
includes methods based purely on second order Taylor expansions, such as the Newton
methods.
Because of the non-linearity of the error function, it is impossible to find global
solutions for the error analytically. To overcome this problem, algorithms that search through the
weight space are used, formulated in mathematical terms as follows:

w(n+1) = w(n) + \Delta w(n),    (3.2-2)

where w is a weight vector (chosen at random initially) updated in a specific direction, n the
iteration step and \Delta w(n) the adjustment of the weight vector. Different algorithms
involve different choices for \Delta w(n). Algorithms such as conjugate gradient and quasi-Newton
are formulated not to increase the error function as the weights change. Non-linear
optimisation algorithms cannot guarantee a global minimum, and therefore the
disadvantage of these methods is that the error function can get stuck in a local
minimum, which, in turn, leads to premature termination.
The process of learning or estimation involves the process of optimising the error
function by finding the vector of weights which minimises the error function. By
obtaining a local quadratic approximation by using a Taylor series expansion, the error
function can be minimised in a convex region around a local minimum.
Let E be a function of more than one argument, E(w) = f(w_1, w_2, \ldots, w_n) (Hamilton,
1994:735-738). Furthermore, let E: R^n \to R^1 have continuous second order derivatives.
A first order Taylor series expansion of E(w) around any point w_1 is given by

E(w) = E(w_1) + \frac{\partial E(w)}{\partial w}\Big|_{w_1}^T (w - w_1) + R_1(w),    (3.3-1)

where \frac{\partial E}{\partial w}^T = \nabla E^T is a (1 \times n) vector (the transpose of the gradient) and R_1(\cdot) is a remainder term.
If E(\cdot) has continuous second order derivatives, a second order Taylor series
expansion of E(w) around w_1 is given by

E(w) = E(w_1) + \nabla E(w_1)^T (w - w_1) + \frac{1}{2} (w - w_1)^T H(w_1) (w - w_1) + R_2(w).

Consider a second order Taylor series approximation of the multi-variable error
function E(w): R^n \to R^1 around some point w_1 in the weight space, which is to be
minimised. When E(w) is twice differentiable, the second order Taylor series
approximation of E(w) around w_1 is

\hat{E}(w) = E(w_1) + b^T (w - w_1) + \frac{1}{2} (w - w_1)^T H (w - w_1),    (3.3-3)

where

b = \frac{\partial E(w)}{\partial w}\Big|_{w_1} \quad \text{and} \quad (H)_{ij} = \frac{\partial^2 E(w)}{\partial w_i \partial w_j}\Big|_{w_1}.

The first derivative of the approximated error function (3.3-3) with respect to w is
given by

\frac{\partial \hat{E}(w)}{\partial w} = b + H(w - w_1).

The gradient is set equal to zero at a stationary point, say w = w^*, and solved for w^*:

w^* = w_1 - H^{-1} b.    (3.3-8)

This forms the basis of many of the learning algorithms because, for points w that are
close to w^*, the expressions above will give acceptable approximations to the error
function and its gradient.
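As a minimal numerical illustration of (3.3-3) and (3.3-8), the Python sketch below uses an assumed quadratic error function of two weights, evaluates b and H at a point w_1 and computes the stationary point w* = w_1 - H^{-1} b.

import numpy as np

# Assumed quadratic error function E(w) = 0.5 * w' A w - c' w, so that
# the gradient is A w - c and the Hessian is A (illustrative values only).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([1.0, 2.0])

def grad(w):
    return A @ w - c               # b = dE/dw evaluated at w

H = A                              # Hessian of a quadratic is constant
w1 = np.array([5.0, -3.0])         # arbitrary starting point in weight space
b = grad(w1)
w_star = w1 - np.linalg.solve(H, b)   # w* = w1 - H^{-1} b, as in (3.3-8)
print(w_star, grad(w_star))        # gradient at w* is (numerically) zero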
The method of gradient descent, also known as steepest descent, makes use of the first
order Taylor expansion of the error function E(w) as given by (3.3-1). As mentioned
in Section 3.2, different algorithms involve different choices for \Delta w(n), the weight
vector update. An initial guess for the weight vector, denoted by w_0, is required. The
weight vector is updated in such a way that with each iteration step n the direction of
the movement is towards the negative gradient of the error function, \partial E(w)/\partial w,
evaluated at w_n. This is a discretisation of a first-order differential equation in w, so
that the trajectory in weight space leads to a stationary point w^*. Thus, writing the gradient
of the error function as

\left( \frac{\partial E(w_n)}{\partial w_n} \right)^T = g_n,

the weight vector update is

\Delta w_n = -\eta \left( \frac{\partial E(w_n)}{\partial w_n} \right)^T = -\eta g_n,

where \eta is the learning rate parameter, chosen such that

0 < \eta < \frac{2}{\lambda_{max}},

where \lambda_{max} is the largest eigenvalue of H(w) (Haykin, 1994:49). This leads to
guaranteed convergence, although in practice \eta is determined empirically or by rule of
thumb.
The algorithm that implements the gradient descent learning strategy for feedforward
neural networks is known as backpropagation. It can be described in two steps. The
first step is a forward propagation of the input pattern from the input to the output
layer of the network. The second step is a backward propagation of the error vector from
the output layer to the input layer of the network (Berthold et al, 1999:228-229).
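A minimal sketch of the gradient descent update, using the same kind of assumed quadratic error function and a learning rate chosen inside (0, 2/\lambda_{max}), is given below; it is an illustration only and not the backpropagation implementation used later in the study.

import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])      # Hessian of an assumed quadratic error
c = np.array([1.0, 2.0])
grad = lambda w: A @ w - c                   # gradient of E(w) = 0.5 w'Aw - c'w

lam_max = np.linalg.eigvalsh(A).max()        # largest eigenvalue of the Hessian
eta = 1.0 / lam_max                          # learning rate inside (0, 2/lambda_max)

w = np.zeros(2)                              # initial guess w_0
for n in range(200):
    w = w - eta * grad(w)                    # delta w(n) = -eta * gradient
print(w, np.linalg.norm(grad(w)))            # converges towards the minimum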
Superlinear methods involve either first or second order Taylor expansions; examples are the
Levenberg-Marquardt, leap-frog, and quasi-Newton methods. Consider (3.3-8),
where the Levenberg-Marquardt methods replace H^{-1} with (H + \eta I)^{-1}, where \eta is a step
size parameter and I the unity matrix. When \eta is large, the term \eta I dominates and the
method resembles gradient descent; when \eta becomes small the method
becomes second order.
It is important to mention that if the residuals, which are the differences between the model
output and the desired output, are not small, it may be better to use general quasi-Newton
methods or hybrids thereof (see later in this section). The Levenberg-Marquardt
algorithm (Bishop, 1996:290-292) minimises the error function while at the same time
it tries to keep the step size small to ensure that the linear approximation remains
valid.
The leap-frog optimisation method of Snyman (1982:449-462) is a reliable and robust
algorithm which will generally compute acceptable minima without being overly
sensitive with respect to difficult environmental conditions. The method is based on
the motion of a particle of unit mass in an N-dimensional conservative force field
where the total energy of the particle is conserved. In the case of neural networks the
force field is the weight space. The energy of the particle consists of two components,
namely potential energy and kinetic energy, arising from the fact that a force acts on a unit
mass particle. The force accelerates the particle to a point where it has maximum
kinetic energy and minimum potential energy. When this happens the particle stops
because the force is zero. The force acting on the particle is the negative of the
gradient of the potential energy (Snyman, 1983). The leap-frog method is based on
Euler relations for dynamic systems. The updating equation (cf. 3.2-2) for this method is

w(n+1) = w(n) + \Delta w(n),    (3.4.2-1)

where \Delta w(n) = v(n)\Delta t(n) is the step size in the weight space, \Delta t(n) the time step, and
g(n) the acceleration of the particle at step n. Equation (3.4.2-1) relates the trajectory of the
particle in the weight space to the particle velocity v and acceleration g.
Quasi-Newton methods use the local gradient of the error function, but instead of
calculating the Hessian directly and then evaluating its inverse, an approximation A of
the inverse Hessian is built up over a number of steps. The minimum of the quadratic
error function is typically reached after W steps (where W is the number of weights
and biases in a network), as is the case with the method of conjugate gradients. A
sequence of matrices is generated to represent increasingly accurate approximations to
the inverse Hessian H^{-1}. This is accomplished by the use of only first order derivatives
of the error function (by using backpropagation). By starting from a positive definite
matrix such as the unity matrix, the problem of non-positive definite Hessian matrices
is eliminated. The update procedures are such that the approximation to the inverse
Hessian H^{-1} is guaranteed to remain positive definite, although the condition number
of H^{-1} may become inconveniently large. Two commonly used update procedures are
the Broyden-Fletcher-Goldfarb-Shanno (BFGS) and the Davidon-Fletcher-Powell
(DFP) procedures (Bishop, 1996:288), with BFGS the method of choice.
For the method of gradient descent, the direction vector is the negative of the gradient
vector. Small steps are taken to reach the minimum of the error function, which can
be a time consuming effort. The method of conjugate gradients guarantees the error
function is minimised within W search steps, without calculating the Hessian matrix.
This property is an improvement on the gradient descent method because the gradient
descent needs many steps to minimise a simple quadratic error function. For
conjugate gradient the choice for \Delta w(n) (the weight update in (3.2-2)) is

\Delta w(n) = \eta p(n),

where p(n) denotes the search direction vector at iteration n of the algorithm and \eta is
the learning rate parameter. The initial direction vector, p(0), is set equal to the
negative gradient vector at the initial point w_0,

p(0) = -g_0, \qquad g_0 = \frac{\partial E(w)}{\partial w}\Big|_{w = w_0}.

Each successive direction vector is then calculated as a linear combination of the
current gradient vector and the previous direction vector:

p(n+1) = -g_{n+1} + \beta(n) p(n),    (3.4.2-5)

where \beta(n) is a time-varying parameter that is determined in terms of the gradient
vectors g_n and g_{n+1}. The Fletcher-Reeves formula and the Polak-Ribiere formula can
be used to determine \beta(n) (Bishop, 1996:280-281). This method was applied to small
Boolean learning problems and it was shown that backpropagation learning based on
conjugate gradient required fewer iterations than the standard backpropagation
algorithm, but is more complex to compute (Kramer and Sangiovanni-Vincentelli,
1989 and Johansson et al, 1992). There are known problems with conjugate gradient
when the dimensionality of the network becomes large. This is due to cumulative
mis-adjustment in the search direction update formula given in (3.4.2-5).
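Conjugate gradient and quasi-Newton (BFGS) searches are available in standard optimisation libraries; the sketch below uses scipy.optimize.minimize on an assumed two-weight error function, purely to illustrate how the two methods are invoked.

import numpy as np
from scipy.optimize import minimize

def error(w):
    # Assumed illustrative error function of two weights (Rosenbrock-like surface)
    return (1 - w[0])**2 + 100 * (w[1] - w[0]**2)**2

w0 = np.array([-1.2, 1.0])                   # initial weight vector
for method in ("CG", "BFGS"):                # conjugate gradient and quasi-Newton
    res = minimize(error, w0, method=method)
    print(method, res.x, res.nit)            # estimated minimum and iteration count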
Newton methods may take only a few steps to converge, but are relatively slow
because the Newton methods make explicit use of the full Hessian matrix, H, which
has to be calculated. This can lead to computational expense. However, these
methods are elegant and applicable and must be treated in this text.
Consider the second order Taylor expansion given in (3.3-3) and assume that E(w) is
twice differentiable. Then from (3.3-8), the weight vector w^* corresponding to the
minimum of the error function satisfies

w^* = w - H^{-1} b.

The vector -H^{-1}b is known as the Newton direction or step (Bishop, 1996:287).
Since the quadratic approximation used to determine (3.3-3) is not exact, the weight
update equation has to be applied iteratively and H re-evaluated at each new search
point. The Newton method has significant disadvantages. For non-linear networks, the
Hessian matrix takes NW^2 steps to compute, where W is the number of weights in the
network and N the number of patterns in the training set. It also has to be inverted,
requiring \frac{1}{2}NW^3 steps. The Hessian has to be positive definite to ensure descent in
the Newton direction.
Cross validation is the process whereby a trained network's performance is measured
on unseen data records. The data is divided into three different subsets, namely the
training set, cross validation set and the test set. The cross validation set is used
during the training process in order to stop training when the network starts to
overtrain. With each iteration, the root mean square error (RMSE) of both the training
and the cross validation sets are calculated and compared. The RMSE of the training
set will typically decrease, and after a number of iterations, stabilise. The RMSE as a
measure of performance can be determined by calculating the sum of squares of the
error function

E(w) = \frac{1}{2} \sum_{j=1}^{N} \{y_j - f(x_j; w)\}^2,    (3.5-1)

where y_j is the target output, f(x_j; w) a function of the input vector x_j and the weight
vector w, and N is the number of training patterns. The MSE (mean squared error) is
then

MSE = \frac{E(w)}{N}.
Overtraining occurs when the network model fits the training set well but fails to
generalise on samples of the same population. The problem is more acute in
situations where the number of observations is small in comparison with the number
of weights. To overcome the problem of overtraining, the neural network model fitted
to the training set is also (in a sense) fitted to the cross validation set. A measure of fit
such as the root of the mean squared error (RMSE) is calculated for both data sets. If
the RMSE of the cross validation set increases after a number of iterations while the
corresponding value decreases for the training set this is a sign of overtraining and
unacceptable generalisation results.
The remaining data is used to test the network's performance after training. This will
be discussed in Chapter 4 by means of an example, when the test set is used to test the
performance of the model after training.
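The stopping rule described above can be sketched in Python as follows; the RMSE sequences are artificial placeholders standing in for the values that would be computed from the training and cross-validation sets at each iteration.

import numpy as np

def rmse(y, y_hat):
    # Root mean squared error between targets and network outputs
    return np.sqrt(np.mean((y - y_hat) ** 2))

train_rmse_history, cv_rmse_history = [], []
best_cv, patience, bad_steps = np.inf, 5, 0

for iteration in range(1000):
    # ... one training update of the weights would happen here ...
    train_rmse = 1.0 / (1 + iteration) + 0.01                                  # assumed: keeps decreasing
    cv_rmse = 0.1 + 0.5 / (1 + iteration) + 0.0005 * max(0, iteration - 300)   # assumed: rises after a while
    train_rmse_history.append(train_rmse)
    cv_rmse_history.append(cv_rmse)
    if cv_rmse < best_cv:
        best_cv, bad_steps = cv_rmse, 0
    else:
        bad_steps += 1                       # cross-validation RMSE no longer improving
    if bad_steps >= patience:
        print("over-training detected, stopping at iteration", iteration)
        break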
In this chapter different training or learning algorithms are given. It is emphasised that
there is no ideal learning algorithm and that discretion must be used in selecting the
correct method for a specific problem. In the next chapter time series analysis with
neural network models and Box-Jenkins autoregressive moving average (ARMA)
models are discussed.
The aim of this chapter is to illustrate the identification, estimation and evaluation of
neural network models and autoregressive moving average (ARMA) models for a
stationary time series. Neural network models have been used in time series prediction
(LeBaron and Weigend, 1996). Autoregressive moving average (ARMA) models are
well known for purposes of time series analysis (Ansuj et al, 1996). These models
define linear relationships between a time series observation at time t, the dependent
variable, and a set of time series observations that occurred prior to time t. Any linear
model can be expressed as a simple feed forward neural network model with only linear
activation functions. This is referred to as a regressor. The versatility of a neural
network lies in the fact that it is used to model non-linear relationships between input
and output variables. This can result in very complicated models with a large number of
parameters, where the parameters have no physical meaning. In model building there is
always a trade-off between models with a large number of parameters that fit very well
but have poor generalisation abilities, and models with fewer parameters that do not fit
all that well but produce better forecasts because of better generalisation abilities. In
Section 4.2 to Section 4.5 the ARMA(p,q)-process is defined, and maximum likelihood
estimation and the evaluation of the results are discussed. It is shown that these
procedures can also be applied as part of the neural network methodology.
Suppose that Y_t, t = \ldots, -1, 0, 1, \ldots is an equally spaced, weakly stationary or covariance
stationary, time series. A well-known class of linear models for the analysis of time
series in the time domain belongs to the autoregressive moving average (ARMA) class of
the form (4.2-1) (Hamilton, 1994:59-61),

Y_t = \mu + \phi_1 Y_{t-1} + \cdots + \phi_p Y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \cdots - \theta_q \varepsilon_{t-q},    (4.2-1)

where \{\varepsilon_t\} is a sequence of uncorrelated variables, also referred to as a white noise
process, with conditions

E(\varepsilon_t) = 0 \quad \text{and} \quad E(\varepsilon_t \varepsilon_\tau) = \begin{cases} \sigma_\varepsilon^2 & \text{for } t = \tau \\ 0 & \text{otherwise,} \end{cases}

and \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q are unknown constants or parameters. The model (4.2-1) is an
ARMA(p,q) model or Box-Jenkins model.
The process is stationary if the roots of the equation \phi(B) = 0 all lie outside the unit
circle. The ARMA(p,q) model can be expressed as an autoregressive (AR) model,

\Pi(B) Y_t = \mu + \varepsilon_t, \qquad \text{with } \Pi(B) = \frac{\phi(B)}{\theta(B)}.    (4.2-6)

Let \Pi(B) = 1 - \pi_1 B - \pi_2 B^2 - \cdots. The equation (4.2-6) can then be written as

Y_t = \pi_1 Y_{t-1} + \pi_2 Y_{t-2} + \cdots + \mu + \varepsilon_t.

Any ARMA model can therefore be written as an AR or MA model (Cryer, 1986:73-74).
Box and Jenkins (Box, Jenkins and Reinsel, 1994) developed a general framework for
time series modelling. They suggested a model-building strategy where model
identification, estimation and diagnostic checking are done iteratively to select the best
possible model for a given series.
The ARMA model is based on linear relationships between successive observations as
measured by the autocorrelation function. Plots of the sample autocorrelation function
(ACF) and the sample partial autocorrelation function (PACF) are compared to the
corresponding population functions to identify the ARMA model by selecting the values
of p and q, the autoregressive and moving average orders of the model (Cryer, 1986:111).
For the identification of the model, note that

\text{cov}(Y_{t-k}, \varepsilon_t) = 0 \text{ for } k > 0 \quad \text{and} \quad \text{cov}(Y_{t-k}, \varepsilon_t) \neq 0 \text{ for } k \leq 0,

and that the autocorrelation at lag 1 is \rho_1 = \gamma_1 / \gamma_0.
The partial autocorrelation \phi_{kk} is the correlation coefficient in the bivariate distribution of Y_t and Y_{t-k} conditional on
Y_{t-1}, Y_{t-2}, \ldots, Y_{t-k+1} (Cryer, 1986:106).
The ACF of an AR(p) process shows an exponential decay or a damped sine wave,
while its PACF is zero for lags greater than p.
A pure moving average model of order q has nonzero autocorrelations only on the
first q lags, while its partial autocorrelations are not zero after q lags.
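For reference, the sample ACF and PACF used in this identification step can also be computed with open-source software; the sketch below uses statsmodels, and the data file name is an assumed placeholder for the series being analysed.

import numpy as np
from statsmodels.tsa.stattools import acf, pacf

y = np.loadtxt("series1.txt")          # assumed file holding the time series
sample_acf = acf(y, nlags=24)          # sample autocorrelation function
sample_pacf = pacf(y, nlags=24)        # sample partial autocorrelation function
for k in range(1, 25):
    print(k, round(sample_acf[k], 3), round(sample_pacf[k], 3))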
Two time series are used to illustrate the concepts of identification, estimation and
evaluation of a model. Time Series 1 consists of 150 terms generated from an AR(2)
model with \phi_1 = 1.3, \phi_2 = -0.5, and zero mean, and \varepsilon_t \sim N(0, \sigma_\varepsilon^2). Time Series 2
consists of 336 data points of electricity consumption, measured hourly, for two weeks.
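A series such as Time Series 1 can be generated with a few lines of Python; the innovation variance, burn-in length and random seed below are assumed illustrative choices.

import numpy as np

np.random.seed(1)
phi1, phi2, n = 1.3, -0.5, 150
eps = np.random.normal(0.0, 1.0, n + 50)    # white noise; sigma assumed equal to 1
y = np.zeros(n + 50)
for t in range(2, n + 50):
    y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]   # zero-mean AR(2) recursion
y = y[50:]                                  # discard burn-in so the series is approximately stationary
print(y[:5])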
As mentioned in Section 4.3, the sample autocorrelation function and sample partial
autocorrelation function can be used to identify a possible model that can be used to
describe a time series.
A time plot can be used to identify a trend, seasonal patterns, outliers, discontinuities,
etc. A time plot of a hundred observations of Series 1 is given in Figure 4.1.
Figure 4.1: Time plot of the first hundred observations of Series 1
Figure 4.2 is the sample autocorrelation function. It has the appearance of a damped
sine wave which is typical of an AR model while the sample partial autocorrelation
illustrated in Figure 4.3 gives a clear indication of an AR(2) model. The first two
partial autocorrelations differ significantly from zero while the rest do not differ from
zero at the 5% significance level.
Lag   Covariance   Correlation   Std Error
 0     7.670817     1.00000       0
 1     6.013996     0.78401       0.100000
 2     3.712388     0.48396       0.149310
 3     2.166189     0.28239       0.164249
 4     1.322617     0.17242       0.169035
 5     0.765067     0.09974       0.170784
 6     0.505147     0.06585       0.171368
 7     0.830336     0.10825       0.171619
 8     1.042072     0.13585       0.172300
 9     0.744163     0.09701       0.173368
10     0.323893     0.04222       0.173910
11    -0.124980    -0.01829       0.174012
12    -0.368546    -0.04805       0.174028
13    -0.454332    -0.05923       0.174160
14    -0.431367    -0.05823       0.174362
15    -0.391593    -0.05105       0.174543
16    -0.338712    -0.04418       0.174692
17    -0.359287    -0.04884       0.174804
18    -0.789711    -0.10034       0.174929
19    -1.156028    -0.15070       0.175504
20    -1.868117    -0.21748       0.176793
21    -2.225459    -0.29012       0.179448
22    -2.784961    -0.36045       0.184079
23    -2.763432    -0.36025       0.191007
24    -1.983370    -0.25856       0.197685
"." marks two standard errors

Figure 4.2: Sample autocorrelation function for an AR(2) with \phi_1 = 1.3 and \phi_2 = -0.5
(SAS System for Windows v6.12)
Lag   Partial Correlation
 1     0.88347
 2    -0.42102
 3     0.00224
 4     0.00211
 5    -0.00117
 6     0.02936
 7     0.02602
 8     0.08493
 9     0.06300
10     0.11536
11    -0.06218
12    -0.00949
13    -0.08332
14     0.19943
15    -0.13970
16    -0.04186
17    -0.00209
18    -0.14519
19    -0.07380
20     0.02051
21     0.04710
22    -0.06309
23    -0.15938
24    -0.00652

Figure 4.3: Sample partial autocorrelation function for an AR(2) with \phi_1 = 1.3 and
\phi_2 = -0.5 (SAS System for Windows v6.12)
Figure 4.4: Time plot of the hourly electricity consumption series (Series 2)
There are explicit daily, 12-hourly and weekly seasonal patterns in the electricity
consumption, due to consumer behaviour. The autocorrelation function depicted in
Figure 4.5 shows evidence of a sine wave, which may suggest an AR(2) component.
The sample autocorrelations from lags 20 to 24 are quite high due to the 24-hour
seasonal pattern.
Lag   Covariance   Correlation
 0     5842606      1.00000
 1     5379454      0.92073
 2     4337118      0.74233
 3     3086162      0.52822
 4     1901803      0.32551
 5      917772      0.15708
 6      194390      0.03327
 7     -238615     -0.04084
 8     -361234     -0.06183
 9     -197386     -0.03378
10      146418      0.02506
11      456639      0.07816
12      511863      0.08761
13      243946      0.04175
14     -219729     -0.03761
15     -631504     -0.10809
16     -783852     -0.13416
17     -635189     -0.10872
18     -208851     -0.03575
19      465597      0.07969
20     1324200      0.22665
21     2276192      0.38959
22     3199092      0.54755
23     3893269      0.66636
24     4086444      0.69942

Figure 4.5: Sample autocorrelation function of the electricity consumption series
The partial autocorrelation function's first two lags are significant and the remaining
lags are significantly reduced, which indicates an AR(2) process. This is illustrated in
Figure 4.6.
Lag   Partial Correlation
 1     0.89902
 2    -0.40365
 3    -0.08511
 4    -0.12721
 5    -0.22907
 6    -0.17068
 7     0.15599
 8     0.13347
 9     0.04207
10     0.02022
11    -0.04473
12    -0.12335
13    -0.09437
14     0.08936
15    -0.19785
16    -0.09166
17    -0.00793
18    -0.13995
19     0.00804
20    -0.10087
21    -0.02190
22    -0.06457
23    -0.00557
24     0.03435

Figure 4.6: The partial autocorrelation function for the sample of electricity
consumption.
A spectral analysis of the data can be used to identify the most prominent cycles in the
data. A periodogram of the data is given in Figure 4.7.
Figure 4.7: Periodogram of the electricity consumption series
The graph shows peaks at 12-hour, 24-hour (the largest) and 48-hour periods. A large
percentage of the total variation in the data can therefore be ascribed to the identified
cycles, and the information can be used to define input variables for a neural network
model.
Through the method of spectral analysis, the series of observations y_t can be
described as a weighted sum of periodic functions of the form \cos(\omega t), where \omega
denotes a particular frequency (Hamilton, 1994:152). For Series 2, with dominant
periods of 24 hours and 12 hours as seen in Figure 4.7, the frequencies can be
calculated as follows:

Period = 24 = \frac{2\pi}{\omega_1}, \text{ therefore } \omega_1 = 0.2618;

Period = 12 = \frac{2\pi}{\omega_2}, \text{ therefore } \omega_2 = 0.5236.
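The periodogram and the corresponding dominant frequencies can be computed as sketched below (the data file name is an assumed placeholder for Series 2; this mirrors the calculation above rather than the exact procedure of the software used in the study).

import numpy as np

y = np.loadtxt("electricity.txt")              # assumed file with the 336 hourly observations
y = y - y.mean()
n = len(y)
spectrum = np.abs(np.fft.rfft(y)) ** 2 / n     # periodogram ordinates
freqs = 2 * np.pi * np.fft.rfftfreq(n, d=1.0)  # angular frequencies (radians per hour)

top = np.argsort(spectrum[1:])[::-1][:3] + 1   # three largest peaks, ignoring frequency zero
for k in top:
    print("omega = %.4f, period = %.1f hours" % (freqs[k], 2 * np.pi / freqs[k]))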
Different numbers of nodes in the hidden layer were tried and it was decided to use no
more than two nodes in the hidden layer. The added complexity introduced by adding
nodes in the hidden layer did not result in a significant improvement of the model fit.
Let Y_1, Y_2, \ldots, Y_N denote a stationary time series, and suppose that the observation at
time t can be described by the model

Y_t = f(Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}; w) + \varepsilon_t,

where w is the parameter vector or weight vector and the error terms \ldots, \varepsilon_{t-1}, \varepsilon_t, \varepsilon_{t+1}, \ldots
are assumed to be white noise. If f is a linear function, the above model is an AR(p)
model.
The parameters of an ARMA model can be estimated by the method of maximum
likelihood. There are various other methods, for example the conditional least squares
method (CLS) and the unconditional least squares method (ULS). In the case of an AR
model, if the error terms are assumed to be independent normal, it can be shown that
the maximum likelihood method is equivalent to the least squares method. For
purposes of this dissertation only the method of maximum likelihood will be
discussed.
Let \beta denote the vector of parameters for an AR(p) model and let (Y_1, Y_2, \ldots, Y_N) be a stationary
time series. The value of \beta that maximises the joint probability or likelihood function
is the maximum likelihood estimate. The likelihood function for an AR(p)-process
conditions on both Y's and \varepsilon's.
If the initial values of Y_0 = (Y_0, Y_{-1}, \ldots, Y_{-p+1})' and \varepsilon_0 = (\varepsilon_0, \varepsilon_{-1}, \ldots, \varepsilon_{-q+1})' are given,
the sequence \{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N\} can be calculated from \{Y_1, Y_2, \ldots, Y_N\} by iterating on
the model equation (4.2-1), solved for \varepsilon_t,
with the option of setting the initial Y's and \varepsilon's equal to their expected values
(Hamilton, 1994:132). The error function used for maximum likelihood estimation
can be written as

E(\beta) = \sum_{t=1}^{N} \varepsilon_t^2.
In neural network terminology, estimation of the parameters of the model takes place
during the training period, during which the network learns with generalisation
through certain learning algorithms. Weights connected to each input value
(parameters) are constantly updated until the error function, E(w), which is often the
sum of squared errors (see Chapter 2), reaches its global minimum.
An AR(2) model was fitted, by using the SAS System for Windows v6.12, to Series 1
by the method of maximum likelihood. The estimates, approximate standard errors
and corresponding t-ratios are given in Table 4.1.
Parameter   Estimate    Approx. Std Error   T-Ratio   Lag
MU          -0.672      0.89122             -0.75      0
AR1,1        1.26298    0.09195             13.74      1
AR1,2       -0.42242    0.0937              -4.51      2
For large sample sizes the t-ratio may be used to test the significance of the estimates
with respect to the null hypothesis that the corresponding parameter is zero. The
t-value for the mean is small, not rejecting the null hypothesis. This is expected, since
the time series was generated with a zero mean. The parameter is redundant and
should be excluded from the model. The t-values for the two other estimates are large,
indicating that \phi_1 and \phi_2 should not be excluded from the model. In Figure 4.8 the
time plots of the actual and the fitted series are given.
Figure 4.8: Time plots of the actual series (Target) and the fitted AR(2) model (Model)
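An equivalent maximum likelihood fit of the AR(2) model can be obtained with open-source software; the sketch below uses statsmodels and an assumed data file standing in for Series 1, and its output layout differs from the SAS table above.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.loadtxt("series1.txt")                # assumed file holding Time Series 1
model = ARIMA(y, order=(2, 0, 0))            # AR(2) with an estimated mean term
fit = model.fit()                            # maximum likelihood estimation
print(fit.params)                            # constant, phi_1, phi_2 and sigma^2
print(fit.bse)                               # approximate standard errors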
A neural network is fitted to the electricity consumption time series. The program
package used for this purpose is Basic ModelGen™ 1.0 from Crusader Systems.
Estimation of a neural network model takes place during training by means of the
training algorithm. For the training set 70% of the data is used, 20% for the cross-validation set and the rest is used for testing the model. The model (4.3-6) has six
input variables, as mentioned in Example 4.1, as well as an intercept term included in
both the input layer and the hidden layer. The neural network model is non-linear with
two neurons in the hidden layer. The sigmoidal function (2.5-7) is used as activation
function. The sum of squared error function (3.5-1), which is a function of the weights
in the network, is minimised. This is done efficiently by using backpropagation. As
mentioned in Chapter 3, the cross-validation set is also used during training. The root
mean square error (RMSE) of both the training set and the cross-validation set
decrease with each iteration. The RMSE of the training set will usually decrease more
than the RMSE of the cross-validation set because the network is trained on the data
of the training set. If too many parameters are included in the model relative to the
number of data points in the training set, over-training can occur. An increase in the
RMSE of the cross-validation set gives an indication that the network is starting to
over-train and training stops immediately. In Figure 4.9 a plot of the RMSE of the
training set and the cross-validation set is illustrated.
[Figure 4.9: RMSE of the training set and the cross-validation set during training.]
The output of the model is a function of the weights, but the weights themselves have no direct interpretive value. When the vector of weights at which the error function is minimised (i.e. where its gradient is zero) has been determined through a learning algorithm, training stops. The output of the model containing the six input variables (and one hidden layer with two neurons), as indicated in Section 4.3 and fitted to Series 2, is given in Figure 4.10.
[Figure 4.10: Neural network model output fitted to Series 2 (electricity consumption), plotted against the observed series over time.]
Although the data set is quite small, the model fits the data reasonably well, judging from Figure 4.10. A rule of thumb that is often used and accepted by practitioners is that the number of observations should be at least ten times the number of weights (N = 10W).
In this section the estimation of linear and neural network models was discussed. The evaluation of the estimated model is discussed in the next section.
After a model has been identified and the parameters estimated, the model must be evaluated. Evaluation procedures are the same for both the ARIMA and neural network models. If the model fits the data well, the residuals should approximately have the properties of uncorrelated, identically distributed random variables with zero mean and fixed standard deviation (Cryer, 1986:147). If the estimated value of Y_t is denoted by Ŷ_t, the residual at time t is ê_t = Y_t − Ŷ_t.
The residuals of a fitted model are useful indicators of any inadequacies in the specification of the model or of violations of the underlying assumptions. Examination of various plots of the residuals is an indispensable step in the evaluation process of any model (Box and Jenkins, 1994:289). If a plot of the residuals exhibits a trend over time, it is an indication of a trend in the data that is not adequately modelled.
A histogram of standardised residuals should correspond with a symmetrical normal curve if the model fits the data well. Any outliers imply significant differences between the fitted values Ŷ_t and the corresponding observed values Y_t that should be investigated, since the model fits the data poorly at those points.
The sample autocorrelation function of the residuals, r_k, can be inspected to check the independence of the residuals in the model. Under a correctly specified model the sample autocorrelations are approximately uncorrelated and normally distributed with mean zero and variance 1/n. A χ² (chi-squared) test is used to test whether the residuals are correlated (Cryer, 1986:153).
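A minimal sketch of the residual checks just described: the sample autocorrelations r_k of the residuals and a portmanteau (chi-squared type) statistic. The exact form of the test used by Cryer (1986) may differ in detail; the names below are illustrative.

    import numpy as np

    def residual_acf(e, max_lag):
        """Sample autocorrelations r_1, ..., r_{max_lag} of the residuals."""
        e = np.asarray(e, dtype=float) - np.mean(e)
        denom = e @ e
        return np.array([e[k:] @ e[:-k] / denom for k in range(1, max_lag + 1)])

    def portmanteau(e, max_lag):
        """Ljung-Box style statistic; compare with a chi-squared distribution."""
        n = len(e)
        r = residual_acf(e, max_lag)
        q = n * (n + 2) * np.sum(r**2 / (n - np.arange(1, max_lag + 1)))
        return q, r

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        e = rng.normal(size=200)          # white-noise residuals for illustration
        q, r = portmanteau(e, max_lag=12)
        print("Q =", round(q, 2), " first autocorrelations:", np.round(r[:3], 3))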
Different programme packages evaluate neural network models differently. A comprehensive residual analysis is offered by Basic ModelGen™ 1.0 from Crusader Systems. The mean square error (MSE) in (3.5-2) and the root mean square error (RMSE) are determined for every model, and the model with the smallest of these values should be selected. As mentioned, the residuals play an important role in the evaluation process. A fast Fourier transform of the model residuals is calculated. If the spectrum of the residuals is constant, white noise is implied and it can be concluded that the model fits the data relatively well.
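A minimal sketch of this spectrum check: the periodogram of the residuals computed with a fast Fourier transform. For white noise the periodogram should be roughly flat, without dominant peaks. This is an illustration, not the ModelGen implementation.

    import numpy as np

    def periodogram(e):
        """Periodogram of a residual series via the FFT."""
        e = np.asarray(e, dtype=float) - np.mean(e)
        n = len(e)
        spec = np.abs(np.fft.rfft(e))**2 / n
        freq = np.fft.rfftfreq(n)          # cycles per observation
        return freq, spec

    if __name__ == "__main__":
        rng = np.random.default_rng(2)
        freq, spec = periodogram(rng.normal(size=512))
        print("largest peak at frequency", freq[np.argmax(spec[1:]) + 1])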
Other methods of evaluation are a sensitivity analysis, performed to establish the most influential input variable of the model, and an output analysis, where the relationship between an output and a specific input is shown. The t-test statistic is used to determine whether a model's output differs from the desired output. Together with this test, the mean and the standard deviation of the model's output are calculated. The quality of the model can be judged by viewing a graphical representation of the model's output versus the desired output. Another useful measure of evaluation is the correlation (R) between the target output and the actual model output. In the following example the evaluation of the models fitted to the AR(2) and the electricity demand time series is illustrated.
The residuals for the estimated AR(2) model in Example 4.2 were analysed. A time plot and other residual plots revealed no deviations from the model assumptions. A χ² (chi-squared) test for independence of the residuals was performed and the null hypothesis was not rejected. AR(1), ARMA(2,1) and AR(3) models were also fitted, but the AR(2) model fit was superior as measured by Akaike's information criterion (Cryer, 1986:122) and the significance of the estimated parameters. This is an expected result, since the time series was generated by an AR(2)-process.
The evaluation procedure points out some inconsistencies in respect of the neural network model fitted to the electricity data. This can be expected, because the data is seasonal and the model selected to fit the series perhaps does not include all the variables necessary to describe the data. This problem is examined in Chapter 6.
The observed spectrum of the residuals of the neural network model has some peaks, which indicates that some seasonal information is still present in the residuals.
The sensitivity analysis, illustrated in Figure 4.11, shows that the variable Y_{t-1} has the greatest influence on the model output, which is quite reasonable because it represents the electricity consumption during the previous hour.
[Figure 4.11: Sensitivity analysis of the neural network model for the electricity consumption series.]
The t-test is used to determine how significantly a model's output differs from the desired or target output. The time plot of the model output and the desired output for the test set is given in Figure 4.12 (model output versus target output).
The correlation between the target output and the model output is 0.897. This value differs significantly from 0. Table 4.2 provides this information together with the mean and standard deviation of the model and the desired outputs for the test set, as well as a 95% confidence interval for R.
Table 4.2: The correlation coefficient for the test set as determined by the neural network model

                 Mean       Standard Deviation   R         R²        95% Confidence Limits for R
  Backprop net   19861.1    2065.443             0.89722   0.80501   [0.82 - 0.940]
  Desired        19461.3    2144.093
Although the evaluation of the model for the electricity consumption may indicate some inadequacies, the model can always be improved by trial and error. In Chapter 6 electricity load forecasting will be done on one year's data.
In this chapter the process of identification, estimation and evaluation was discussed
by means of examples. When the process of evaluation has been completed, and the
model gives a satisfactory description of the data, the model can be used for
forecasting or classification. In Chapter 5 forecasts and prediction intervals are
derived and calculated for neural network models.
In this chapter a neural network model is used to forecast a time series. It is shown
that bootstrap methods can be used to calculate standard errors and prediction limits
for the forecasts.
In Section 5.2 one- and two-step minimum mean square error (MSE) forecasts are
given for both a linear autoregressive model and a non-linear neural network model.
In the case of the non-linear model, a different approach has to be considered. A few
alternative theoretical approaches to obtain forecasts for non-linear time series models
are discussed in Section 5.3 and the bootstrap technique is selected for the purpose of
obtaining prediction intervals. Section 5.4 shows how bootstrap methodology can be
used to construct prediction intervals.
One- and multi-step predictions, together with their standard errors and 95%
prediction intervals are calculated for the simulated AR(2) series.
Consider the ARMA model (4.2-1) for a stationary time series {Y_t} with t = 1, 2, ..., N. The linear AR(1) model is given by
$$Y_t = \phi_1 Y_{t-1} + \varepsilon_t,$$
where {ε_t} is a white noise process that satisfies the conditions (4.2-2). Suppose further that the white noise variables are identically distributed.
The minimum mean square error (MSE) linear predictor for one step ahead is the conditional expectation of Y_{N+1} given Y_1, Y_2, ..., Y_N:
$$\hat{Y}_N(1) = E(Y_{N+1}\mid Y_1,\dots,Y_N) = E(\phi_1 Y_N + \varepsilon_{N+1}\mid Y_1,\dots,Y_N) = \phi_1 Y_N .$$
By using (5.2-1) for the second step, the MSE linear predictor of Y_{N+2} can be written in terms of the observed data:
$$\hat{Y}_N(2) = E(Y_{N+2}\mid Y_1,\dots,Y_N) = \phi_1 E(Y_{N+1}\mid Y_1,\dots,Y_N) + 0 = \phi_1 \hat{Y}_N(1) = \phi_1^2 Y_N .$$
For the non-linear model the two-step predictor becomes
$$\hat{Y}_N(2) = E\bigl[f(Y_{N+1},a) + \varepsilon_{N+2}\mid Y_1,\dots,Y_N\bigr] = E\bigl\{f\bigl[f(Y_N,a) + \varepsilon_{N+1},\,a\bigr]\mid Y_1,\dots,Y_N\bigr\} = E\bigl\{f\bigl[\hat{Y}_N(1) + \varepsilon_{N+1},\,a\bigr]\mid Y_1,\dots,Y_N\bigr\}. \qquad (5.2\text{-}7)$$
Alternative approaches have been proposed to calculate the forecasts; such methods
are discussed in the following paragraphs.
Five techniques for forecasting have been discussed by Brown and Mariano (1989) and summarised by Lin and Granger (1994). Some of the methods require assumptions regarding the distribution of the white noise terms. The methods vary in ease of implementation and in computational effort.
This technique is widely used and easy to implement, but because the forecast is biased it is not satisfactory. If one considers the non-linear model for time series forecasting given in (5.2-5), the naive technique obtains the two-step forecast of Y_{N+2} by substituting the one-step forecast Ŷ_N(1) for Y_{N+1} in the model equation:
$$\hat{Y}_N(2) = f\bigl(\hat{Y}_N(1);\,\hat{a}\bigr).$$
The estimator â is obtained by minimising the sum of squared error terms or by maximising the likelihood function of the observations.
Usually the expected value of a function is not equal to the function of the expected value, and therefore the forecast will be biased. Even if the functional form f(·) is known, the bias will not go to zero as the sample size N becomes large.
The closed form technique computes the two-step forecast as
$$\hat{Y}_N(2) = \int f\bigl(\hat{Y}_N(1) + \varepsilon;\,\hat{a}\bigr)\,dF(\varepsilon) = \int f\bigl[f(Y_N;\hat{a}) + \varepsilon;\,\hat{a}\bigr]\,dF(\varepsilon), \qquad (5.3.2\text{-}1)$$
where F(·) is the distribution function of ε, which is in practice generally unknown, and f(·) is a non-linear function of ε. It is therefore necessary to know the distribution of ε.
Brown and Mariano (1989) assume a N(0,1) distribution, but this has to be verified. Numerical integration can be used to approximate (5.3.2-1). Apart from the fact that this forecast can be difficult to calculate, it may be incorrect because of a wrong assumption about the distribution of ε. The dimensionality of the error distribution increases with the lead time of the forecast, which consequently results in an increase in computing time.
For large values of N this technique has the same disadvantage as the closed form technique (the distribution of the error terms is needed), but it is easier to implement. The two-step forecast is defined as
$$\hat{Y}_N(2) = \frac{1}{m}\sum_{j=1}^{m} f\bigl(\hat{Y}_N(1) + e_j;\,\hat{a}\bigr),$$
where the e_j's are independent and identically distributed and drawn from the assumed distribution of the error terms. This is a particular form of numerical integration based on simulation. The mathematical expectation (5.2-7) is approximated by the arithmetic average of a sample of possible realisations of Y_{N+2} given Y_1, Y_2, ..., Y_N.
The bootstrap technique is based on the estimated residuals. The two-step forecast is given by
$$\hat{Y}_N(2) = \frac{1}{N-p}\sum_{t=p+1}^{N} f\bigl(\hat{Y}_N(1) + \hat{e}_t;\,\hat{a}\bigr),$$
where the ê_t are the realised one-step forecast errors arising up to time N. No assumption on the distribution of the error terms is therefore necessary. The forecast improves as time advances and is easily formed (Lin and Granger, 1994:2). This technique is used for the purposes of this dissertation and is discussed further in Section 5.4; a small sketch of the predictor is given below.
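A minimal sketch of the bootstrap two-step predictor just described, assuming a generic fitted one-step model f(y_lag1, y_lag2). The array e_hat stands for the realised one-step forecast errors; all names are illustrative.

    import numpy as np

    def bootstrap_two_step(f, y, e_hat):
        """Average f over the one-step forecast perturbed by each residual."""
        y_n, y_nm1 = y[-1], y[-2]
        y_hat_1 = f(y_n, y_nm1)                         # one-step forecast
        return np.mean([f(y_hat_1 + e, y_n) for e in e_hat])

    if __name__ == "__main__":
        # toy non-linear model and residuals, purely for illustration
        f = lambda y1, y2: np.tanh(1.2 * y1 - 0.4 * y2)
        rng = np.random.default_rng(3)
        y = rng.normal(size=50)
        e_hat = rng.normal(scale=0.1, size=48)
        print("two-step bootstrap forecast:", round(bootstrap_two_step(f, y, e_hat), 4))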
This method involves the direct modelling of the relationship between Y_{N+2} and Y_N, using the same model form as before but estimating a new set of coefficients directly. The previous techniques involved the use of the same function with the same set of coefficients, f(f(·)), for the two-step forecast, which decreases its quality. The function f(·) is rarely known in practice and has to be approximated by a specification search, so that a separately estimated two-step function could be considered. For a non-parametric procedure such as neural networks this would be a sensible technique.
Lin and Granger (1994) investigated all these techniques in a simulation study for the two-step case. The results lead to a mild recommendation of the bootstrap predictor because of its small bias and an inefficiency of not more than 5% in mean squared error when compared with the parametric model using the correct specification. Further study on broader classes of time series models is recommended.
The bootstrap technique can be applied to obtain interval forecasts for an autoregressive time series. Masarotto (1990) finds the bootstrap technique useful for three reasons: it is distribution-free, it takes into account that the parameters and the order of the model are unknown, and improved computer technology makes the demanding calculations involved in the bootstrap technique feasible. Boraine (2000) showed that the bootstrap results for linear models can be extended to non-linear time series models. A discussion is given in this section. In Section 5.4.1 the prediction limits for linear autoregressive models are given; these are the standard results found in any time series textbook. The multi-step forecasts for the neural network models are introduced in Section 5.4.2 and the prediction limits for the multi-step forecasts are given in Section 5.4.3.
For a linear AR(p) model the h-step forecasts are calculated recursively from
$$\hat{Y}_N(h) = \phi_1 \hat{Y}_N(h-1) + \phi_2 \hat{Y}_N(h-2) + \dots + \phi_p \hat{Y}_N(h-p),$$
where Ŷ_N(j) = Y_{N+j} if j ≤ 0. The forecasts converge to the mean of the time series for large values of h.
Under normality, Y_{N+h} given Y_1, Y_2, ..., Y_N is normal with mean Ŷ_N(h) and variance
$$\sigma_\varepsilon^2\Bigl(1 + \sum_{j=1}^{h-1} \psi_j^2\Bigr),$$
where the ψ_j's are the weights in the moving average representation of the process. The approximate 1 − α probability limits for Y_{N+h} given Y_1, Y_2, ..., Y_N are
$$\hat{Y}_N(h) \pm z_{1-\alpha/2}\,\sigma_\varepsilon\Bigl(1 + \sum_{j=1}^{h-1}\psi_j^2\Bigr)^{1/2},$$
where z_{1−α/2} is the 100(1 − α/2)th percentile of the standard normal distribution (Box, Jenkins and Reinsel, 1994).
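A minimal sketch of these standard linear AR prediction limits: the ψ-weights are obtained recursively from the AR coefficients and the 1 − α limits follow from the normal approximation. The routine is illustrative and uses SciPy only for the normal percentile.

    import numpy as np
    from scipy.stats import norm

    def ar_psi_weights(phi, h):
        """psi_0, ..., psi_{h-1} for an AR(p) model (psi_0 = 1)."""
        p = len(phi)
        psi = [1.0]
        for j in range(1, h):
            psi.append(sum(phi[i] * psi[j - 1 - i] for i in range(min(p, j))))
        return np.array(psi)

    def prediction_limits(forecast, phi, sigma2, h, alpha=0.05):
        """Approximate 1 - alpha limits for Y_{N+h} under the linear AR model."""
        psi = ar_psi_weights(phi, h)
        var_h = sigma2 * np.sum(psi ** 2)            # sigma^2 (1 + sum_{j<h} psi_j^2)
        z = norm.ppf(1 - alpha / 2)
        return forecast - z * np.sqrt(var_h), forecast + z * np.sqrt(var_h)

    if __name__ == "__main__":
        # AR(2) with roughly the coefficients of Table 4.1, three steps ahead
        print(prediction_limits(forecast=0.3, phi=[1.26, -0.42], sigma2=1.0, h=3))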
Let Y_1, Y_2, ..., Y_N be a stationary time series described by the non-linear neural network model
$$Y_t = f(Y_{t-1}, Y_{t-2}, \dots, Y_{t-p};\,\mathbf{w}) + \varepsilon_t, \qquad (5.4.2\text{-}1)$$
where f(·) is a non-linear function defined by the neural network architecture, p is the number of input variables in the network, and w is the weight vector. The ε_t's are uncorrelated, identically distributed random variables with mean zero and variance σ_ε². The observation at time N+1 can be written as
$$Y_{N+1} = f(Y_N, Y_{N-1}, \dots, Y_{N+1-p};\,\mathbf{w}) + \varepsilon_{N+1}.$$
For the estimation of the minimum MSE forecast the bootstrap methodology can be used, and the bootstrap forecast for lead time 2 is therefore proposed as
$$\hat{Y}_N(2) = \frac{1}{m}\sum_{j=1}^{m} f\bigl(Y_{N+1}^{*(j)}, Y_N, \dots, Y_{N+2-p};\,\hat{\mathbf{w}}\bigr),$$
where
$$Y_{N+1}^{*(j)} = f(Y_N, Y_{N-1}, \dots, Y_{N+1-p};\,\hat{\mathbf{w}}) + \varepsilon_{N+1}^{*(j)}$$
is an observation and ε_{N+1}^{*(j)} is drawn with replacement from ε̂_{p+1}, ..., ε̂_N. In general, the h-step bootstrap forecast is
$$\hat{Y}_N(h) = \frac{1}{m}\sum_{j=1}^{m} f\bigl(Y_{N+h-1}^{*(j)}, Y_{N+h-2}^{*(j)}, \dots, Y_{N+h-p}^{*(j)};\,\hat{\mathbf{w}}\bigr),$$
with
$$Y_{N+h}^{*(j)} = f\bigl(Y_{N+h-1}^{*(j)}, Y_{N+h-2}^{*(j)}, \dots, Y_{N+h-p}^{*(j)};\,\hat{\mathbf{w}}\bigr) + \varepsilon_{N+h}^{*(j)}$$
and Y_{N+h-p}^{*(j)} = Y_{N+h-p} if h − p ≤ 0.
The bootstrap procedure can be summarised in a few steps. First fit the model (5.4.2-1) to the time series Y_1, Y_2, ..., Y_N. Then calculate estimates of the residual terms using (5.4.2-8). Thirdly, calculate Y*_{N+1}, ..., Y*_{N+h} conditional on Y_1, Y_2, ..., Y_N, where Y*_{N+h-p} = Y_{N+h-p} if h − p ≤ 0 and ε*_{N+h} is an observation drawn randomly with replacement from ε̂_{p+1}, ..., ε̂_N. Repeat this m times, where usually m = 100. Finally, calculate the one- to h-step forecasts for the time series generated in the previous step using (5.4.2-3), (5.4.2-6) and (5.4.2-9); a sketch of this procedure in code is given below.
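A minimal sketch of the h-step bootstrap forecast summarised above, for a generic fitted model f taking the p most recent values. The array e_hat plays the role of the estimated residuals in (5.4.2-8); the names and the toy model in the example are illustrative only.

    import numpy as np

    def bootstrap_forecasts(f, y, e_hat, p, h, m=100, seed=0):
        """One- to h-step bootstrap forecasts: averages over m simulated paths."""
        rng = np.random.default_rng(seed)
        paths = np.empty((m, h))
        for j in range(m):
            hist = list(y[-p:])                      # the last p observed values
            for step in range(h):
                f_val = f(hist[-p:])                 # conditional mean given the path
                paths[j, step] = f_val
                eps = rng.choice(e_hat)              # resample an estimated residual
                hist.append(f_val + eps)             # propagate a noisy bootstrap value
        return paths.mean(axis=0)

    if __name__ == "__main__":
        f = lambda lags: 1.26 * lags[-1] - 0.42 * lags[-2]   # toy AR(2) stand-in
        rng = np.random.default_rng(4)
        y = rng.normal(size=200)
        e_hat = rng.normal(scale=0.5, size=198)
        print(np.round(bootstrap_forecasts(f, y, e_hat, p=2, h=5), 3))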
In the next section it is shown that bootstrap methodology can also be used for the construction of prediction intervals for Y_{N+h}.
The ideas used in this section are based on the method proposed by Masarotto (1990) for linear time series models. The prediction error is
$$e_N(h) = Y_{N+h} - \hat{Y}_N(h), \qquad (5.4.3\text{-}1)$$
and the standardised prediction error is r_N(h) = e_N(h)/σ̂, where σ̂ is an estimate of the standard error of the forecast. By using bootstrap methodology the distribution of r_N(h) can be approximated with a Monte Carlo algorithm. The interval is constructed from r_L and r_U, the B(γ/2)-th and B(1 − γ/2)-th order statistics of r_N(h)*_1, ..., r_N(h)*_B, where B is the number of bootstrap replications. Bootstrap replicates of e_N(h) are therefore required. Note that r_N(h) is a function of σ̂, so a σ̂* is required for each r_N(h)*. For a good approximation of the distribution of r_N(h), at least 1000 bootstrap replications are required.
First calculate the bootstrap time series Y*_1, ..., Y*_{N+h}. To do this a neural network model of the form (5.4.2-1) is fitted to the observed time series Y_1, Y_2, ..., Y_N. If ŵ is the estimated weight vector, the estimated residuals as in (5.4.2-8) are
$$\hat{\varepsilon}_t = Y_t - f(Y_{t-1}, \dots, Y_{t-p};\,\hat{\mathbf{w}}),$$
and the bootstrap series is generated from
$$Y_t^{*} = f(Y_{t-1}^{*}, Y_{t-2}^{*}, \dots, Y_{t-p}^{*};\,\hat{\mathbf{w}}) + \varepsilon_t^{*}, \qquad t = p+1, p+2, \dots, N,$$
where ε*_t is drawn randomly with replacement from ε̂_{p+1}, ..., ε̂_N. It is assumed that the first p values of the bootstrap time series are fixed at given starting values.
σ̂*, the bootstrap estimate of the standard error of the forecast, is calculated by using a bootstrap procedure within the bootstrap procedure, which estimates the distribution of the standardised prediction error. The neural network model of the form (5.4.2-1) is fitted to the bootstrap time series Y*_1, ..., Y*_N. The residuals of this model are determined using (5.4.2-8), and from these residuals a further bootstrap series is generated. The forecasts Ŷ*_N(h) are calculated conditional on the first N observations Y*_1, ..., Y*_N. The prediction error defined in (5.4.3-1) is
$$e_N^{*}(h) = Y_{N+h}^{*} - \hat{Y}_N^{*}(h).$$
This is repeated m times, and the standard deviation of e*_N(h)(1), ..., e*_N(h)(m) is used as an estimate of σ̂*. After repeating the generation of the bootstrap time series Y*_1, ..., Y*_{N+h}, together with the calculation of Ŷ*_N(h), σ̂* and r_N(h)*, B times, where B must be at least 1000, the interval (5.4.3-4) can be constructed. A sketch of this construction is given below.
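A minimal sketch of the interval construction described above (after Masarotto, 1990). For simplicity, the forecast standard error is approximated here by the spread of simulated h-step paths rather than by the full bootstrap-within-bootstrap; fit is a hypothetical routine returning a one-step prediction function and its residuals.

    import numpy as np

    def bootstrap_interval(fit, y, p, h, B=1000, m=25, alpha=0.05, seed=0):
        """Bootstrap prediction interval for Y_{N+h} (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        f, e_hat = fit(y)
        y_hat_h, sigma_hat = _forecast_and_se(f, y, e_hat, p, h, m, rng)
        r_star = []
        for _ in range(B):
            # simulate a bootstrap series of length N + h from the fitted model
            y_star = _simulate_series(f, y, e_hat, p, len(y) + h, rng)
            f_b, e_b = fit(y_star[:len(y)])          # refit on the bootstrap series
            fc, se = _forecast_and_se(f_b, y_star[:len(y)], e_b, p, h, m, rng)
            r_star.append((y_star[len(y) + h - 1] - fc) / se)   # standardised error
        r_lo, r_up = np.quantile(r_star, [alpha / 2, 1 - alpha / 2])
        return y_hat_h + r_lo * sigma_hat, y_hat_h + r_up * sigma_hat

    def _simulate_series(f, y, e_hat, p, length, rng):
        hist = list(y[:p])                           # fixed starting values
        while len(hist) < length:
            hist.append(f(hist[-p:]) + rng.choice(e_hat))
        return np.array(hist)

    def _forecast_and_se(f, y, e_hat, p, h, m, rng):
        """h-step forecast and its standard error from m simulated paths."""
        endpoints = []
        for _ in range(m):
            hist = list(y[-p:])
            for _ in range(h):
                hist.append(f(hist[-p:]) + rng.choice(e_hat))
            endpoints.append(hist[-1])
        return float(np.mean(endpoints)), float(np.std(endpoints))

    if __name__ == "__main__":
        def fit_ar2(series):                         # stand-in for the network fit
            X = np.column_stack([series[1:-1], series[:-2]])
            b, *_ = np.linalg.lstsq(X, series[2:], rcond=None)
            return (lambda lags: b[0] * lags[-1] + b[1] * lags[-2]), series[2:] - X @ b
        y = np.random.default_rng(5).normal(size=120)
        print(bootstrap_interval(fit_ar2, y, p=2, h=3, B=200, m=20))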
A practical application of forecasting a time series from a linear AR(2) model is given
in the following example.
A Fortran program (see Appendix: Chapter 6) was developed to train the network and
to implement the bootstrap for forecasting and the calculation of the prediction limits.
The program uses a Gauss-Newton algorithm to estimate the parameters of the neural
network.
As an example, a generated AR(2) time series (see Examples 4.1 to 4.3) consisting of 200 values is used to illustrate the calculation of the forecasts and the corresponding 95% prediction limits. The generated process is linear, and therefore the neural network model cannot be expected to produce better prediction results than a linear model. A feedforward neural network with two input nodes for two past values of the time series and one hidden layer with a sigmoidal activation function was trained. Bootstrap methodology was used to calculate the prediction limits. Figure 5.1 and Figure 5.2 illustrate the prediction results. The results for the neural network, where bootstrap methods were used, and for the linear model correspond reasonably well.
[Figure 5.1: Autoregressive time series: predictions and 95% prediction limits (neural network).]
[Figure 5.2: Autoregressive time series: predictions and 95% prediction limits (linear regression model).]
It was shown that bootstrap methodology can be used to calculate one- and multi-step predictions of neural networks together with their standard errors and prediction limits. A minimum of 1000 time series were resampled from the observed time series, and for each resampled time series a network was trained to do one-step predictions. These highly computer-intensive calculations posed no problem, considering the ever-increasing speed of computers.
It can be seen from the two examples that both neural networks and linear regression models were used to model the time series. Predictions and prediction limits were calculated, and the results compare reasonably well.
Further research is required to establish whether the bootstrap results can be improved. For example, an increase in the number of bootstrap replications may lead to an improvement in the results.
The objective of this chapter is to forecast a time series by using neural networks.
Some scientists devote careers to the building of models for electricity load
forecasting. The purpose of this chapter is not to improve on the work done on model
estimation but only to introduce the neural network as a forecasting tool.
Several studies have been done on electricity load forecasting. For example, the modelling of one-hour-ahead hourly electricity demand prediction has been done by Connor (1996), while Hwang and Ding (1997) focused on the construction of prediction intervals for electricity load forecasting. Lee et al. (1992) used neural networks for short term load forecasting by dividing the data into different classes of daily and weekly load variations, while Peng et al. (1993) proposed a new strategy for selecting training cases for a neural network model.
The data used here is ESKOM data, taken hourly, measuring the average electricity consumption for the Republic of South Africa. Section 6.2 deals with the analysis of the data. The different strategies and techniques used to prepare the data for estimation of the models are proposed in Section 6.3. The model is investigated in Section 6.4 by means of evaluation. Forecasting one to twelve hours ahead using the selected model is done in Section 6.5. Unfortunately the programme package Basic ModelGen does not include the bootstrap method as a forecasting technique. Two methods, namely the naive technique discussed in paragraph 5.3.1 and the direct method discussed in paragraph 5.3.5, are used and compared in Section 6.5 with this programme package. The bootstrap method is then applied in Section 6.5.3 to the above-mentioned electricity data.
As mentioned earlier, ESKOM data, taken hourly, measuring electricity consumption was analysed. A one-year period was considered, starting from the 25th of October 1997 to the 25th of October 1998, a total of N = 8760 observations. The data therefore contains seasonal components, as seen in Figure 6.1, where a four-week period is illustrated.
[Figure 6.1: Hourly electricity consumption over a four-week period.]
Seasonal components cause difficulties in the estimation of the model parameters. The sample autocorrelation function (ACF) and sample partial autocorrelation function (PACF) point out a definite AR(2) component. Therefore the consumption in periods t−1 and t−2 will be considered as two of the inputs in the neural network model.
From the electricity data it is clear that there is more than one seasonal component. Several seasonal components can influence the behaviour of a time series. A spectral analysis can be performed to attribute the total variation in the electricity consumption to cycles of different frequencies. The results of the spectral analysis indicated the 24-hour and the 12-hour periods as the most important seasonal components.
Several other factors were considered, for example the effect of the day of the week and of public holidays, and the average temperature for each day. Galpin (1997) investigated the inclusion of rainfall, temperature and humidity as input variables in a regression model. Although she found that these climate data decrease the forecast error, it was found that public holidays have an overriding impact on the forecasting process.
These variables were all considered as explanatory variables. Networks with different combinations of input variables and different numbers of hidden nodes were trained. The model that was finally selected to describe the electricity data has as explanatory variables periodic components that were determined by a spectral analysis (see Example 4.1), and the electricity consumption of the two previous hours. If Y_t denotes the electricity consumption at time t, the proposed model is
$$Y_t = f\bigl(Y_{t-1}, Y_{t-2}, \cos[\omega_1(t-1)], \sin[\omega_1(t-1)], \cos[\omega_2(t-1)], \sin[\omega_2(t-1)];\,\mathbf{w}\bigr) + \varepsilon_t, \qquad (6.2\text{-}1)$$
with two nodes in the hidden layer. The ε_t's are assumed to be generated by a white noise process, and ω_1 = 0.2618 (the 24-hour cycle, 2π/24) and ω_2 = 0.5236 (the 12-hour cycle, 2π/12). A sketch of the construction of these inputs is given below.
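A minimal sketch of constructing the six inputs of model (6.2-1): the two lagged consumption values and the sine and cosine terms for the 24-hour and 12-hour cycles. The function and variable names are illustrative, not those of the ModelGen package.

    import numpy as np

    def build_inputs(y, omega1=0.2618, omega2=0.5236):
        """Inputs and targets for model (6.2-1), for t = 3, ..., N (1-based t)."""
        y = np.asarray(y, dtype=float)
        t = np.arange(3, len(y) + 1)                 # time index of the target Y_t
        X = np.column_stack([
            y[t - 2],                                # Y_{t-1}
            y[t - 3],                                # Y_{t-2}
            np.cos(omega1 * (t - 1)), np.sin(omega1 * (t - 1)),   # 24-hour cycle
            np.cos(omega2 * (t - 1)), np.sin(omega2 * (t - 1)),   # 12-hour cycle
        ])
        return X, y[t - 1]                           # targets Y_t

    if __name__ == "__main__":
        X, target = build_inputs(np.random.default_rng(6).normal(size=48))
        print(X.shape, target.shape)                 # (46, 6) (46,)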
The ideal network is one that is not redundant, in other words, the specific network
with the fewest parameters and which represents the data the best, is preferred (Hwang
and Ding, 1997:748-757).
The data is divided into a training set (the first 70% of the data), a cross-validation set (the next 10%) and a test set (the last 20%). By calculating the MSE for each model, the one with the smallest MSE value was selected. This was done for both the feedforward neural network, which uses backpropagation, and the Elman recurrent neural network. The feedforward neural network is a function of a fixed number of previous values of a time series, while the Elman recurrent neural network implicitly depends on all previous observations of a time series. The feedforward models considered in this dissertation performed better, as can be seen from the MSE values in Figure 6.2.
[Figure 6.2: MSE values of the feedforward (FF) and Elman recurrent networks for the different model configurations.]
The labels on the horizontal axis denote the number of input nodes and the number of hidden nodes of the different models considered; for example, 3.2 indicates a neural network with three inputs and two hidden nodes. The number of parameters, W, in the fully connected model is calculated as
$$W = n_h(n_i + 1) + n_h + 1,$$
where n_i is the number of input nodes and n_h is the number of nodes in the hidden layer. (The term one that is added provides for a bias or intercept in both the input and the hidden layer.) Therefore the neural network model (6.3-1) with six input nodes and two hidden nodes has W = 2(6 + 1) + 2 + 1 = 17 parameters or weights. This is acceptable because the total number of patterns in the training set should ideally not be less than 32W (Bishop, 1996:410), although 10W is also an accepted norm in some cases.
The software package Basic ModelGen™ 1.0 from Crusader Systems was used. In the training phase the input signals are passed through an activation function (see Chapter 3); the activation function used here is a symmetric sigmoidal function (see 2.5-7). The backpropagation method was used to calculate the gradient of the error function. Training stopped after 250 iterations. After the network has been trained, the model has to be evaluated. Procedures for evaluation differ from programme package to programme package. A few of the procedures available in Basic ModelGen™ 1.0 from Crusader Systems are discussed in Section 6.4.
Figure 6.3 is a representation of the root mean square error (RMSE), and therefore also of the mean square error (MSE), of the training and cross-validation sets during training. This gives the user an indication of how well the training progresses with each iteration or epoch and, should a problem arise, training can be stopped immediately.
[Figure 6.3: RMSE of the training and cross-validation sets against training epochs.]
The RMSE of both the training and cross-validation sets falls sharply at the beginning of the iterative process. The RMSE of the training set is lower and decreases steadily, while the RMSE of the cross-validation set shows a slight upward trend before it stabilises after about 130 iterations. This is an indication that the model should fit reasonably well to an independent data set from the same population.
The investigation of the model residuals by means of the Fourier transform gives a spectrum that is not flat. This means the residuals may still contain some information about the underlying time series.
The sensitivity analysis points out that the model is most sensitive to the input variables Y_{t-1} and Y_{t-2}. The sensitivity of the model output to the other input variables can be seen in Figure 6.4.
A graph of the output of the model and the desired outputs also gives an indication of the quality of the fit of the model. Figure 6.5 shows these results for a randomly selected section of the time series. The fit appears satisfactory, because the model output is relatively close to the target or desired output values. The last peak is over-estimated by the model. The performance of the model on weekends should be investigated further.
[Figure 6.5: Model output versus target output for a randomly selected section of the time series.]
The following table provides the correlation coefficient, R, between the outputs
predicted by the model and the actual desired output on the test set. The test set is
used for model evaluation. This is data that was not used during the training phase.
  MODELS      Mean    Standard Deviation   R        R²       95% Confidence Limits for R
  Feed Forw   0.096   0.341                0.9821   0.9645   [0.980 - 0.984]
  Desired     0.035   0.341
The R of 0.9821 is high, which indicates that the model fits the data well. The narrow confidence interval is due to the large number of observations in the test set.
The evaluation of the model was done by means of various measures. Since the model appears satisfactory, the forecasting process can take place. In Section 6.5 forecasts of up to twelve hours ahead are made.
The model (6.2-1) was estimated and evaluated in the previous sections and is now used for forecasting. Basic ModelGen™ 1.0 from Crusader Systems is unfortunately not a specialised time series analysis package: it only provides one-step predictions. For a given set of input values, the predicted model value (6.2-1) can be calculated. It was decided to use the naive method, the direct method and the bootstrap method (see Sections 5.3 and 5.4) to forecast twelve electricity consumption values. A Fortran programme (see Appendix: Chapter 6) was used to produce forecasts and prediction intervals using bootstrap methodology. In the next section these three methods are implemented.
For the naive method the forecasts are obtained recursively, for example
$$\hat{Y}_N(2) = f\bigl(\hat{Y}_N(1), Y_N, \cos[0.2618(N+1)], \sin[0.2618(N+1)], \cos[0.5236(N+1)], \sin[0.5236(N+1)];\,\hat{\mathbf{w}}\bigr)$$
and
$$\hat{Y}_N(3) = f\bigl(\hat{Y}_N(2), \hat{Y}_N(1), \cos[0.2618(N+2)], \sin[0.2618(N+2)], \cos[0.5236(N+2)], \sin[0.5236(N+2)];\,\hat{\mathbf{w}}\bigr).$$
A sketch of this recursive scheme is given below.
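A minimal sketch of the naive (recursive) method written out above: each forecast is fed back as an input for the next step. The model f is a hypothetical stand-in for the fitted network with the same six inputs as (6.2-1).

    import numpy as np

    def naive_forecasts(f, y, n_obs, h, omega1=0.2618, omega2=0.5236):
        """Recursive forecasts: each prediction becomes an input for the next step."""
        hist = list(y[-2:])                          # Y_{N-1}, Y_N
        out = []
        for step in range(1, h + 1):
            t = n_obs + step                         # the time point being forecast
            x = [hist[-1], hist[-2],
                 np.cos(omega1 * (t - 1)), np.sin(omega1 * (t - 1)),
                 np.cos(omega2 * (t - 1)), np.sin(omega2 * (t - 1))]
            y_hat = f(x)
            out.append(y_hat)
            hist.append(y_hat)                       # feed the forecast back in
        return out

    if __name__ == "__main__":
        f = lambda x: 0.8 * x[0] + 0.1 * x[2]        # toy stand-in for the trained network
        y = np.random.default_rng(7).normal(size=100)
        print(np.round(naive_forecasts(f, y, n_obs=100, h=12), 3))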
The following graph shows how this method performed. The forecast output values are
compared with the target outputs for twelve hours.
[Figure 6.6: Naive method: forecast electricity consumption compared with the target values for twelve hours ahead.]
By inspection of the graph one can see that after two forecasts the forecast values
predicted by the model are less than the target values. This can be a problem because
it can cause an underproduction of electricity.
The next method implemented is the direct method.
For the direct method a different model has to be estimated for each lead time. To forecast two steps ahead, a neural network is trained to estimate Y_{t+2} directly from Y_t and Y_{t-1}:
$$Y_{t+2} = g\bigl(Y_t, Y_{t-1}, \cos[0.2618(t+1)], \sin[0.2618(t+1)], \cos[0.5236(t+1)], \sin[0.5236(t+1)];\,\mathbf{w}\bigr) + \varepsilon_{t+2}.$$
In general, for lead time l,
$$Y_{t+l} = h\bigl(Y_t, Y_{t-1}, \cos[0.2618(t+l-1)], \sin[0.2618(t+l-1)], \cos[0.5236(t+l-1)], \sin[0.5236(t+l-1)];\,\mathbf{w}\bigr) + \varepsilon_{t+l}.$$
A sketch of this per-lead estimation is given below.
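A minimal sketch of the direct method: a separate model is estimated for each lead time, mapping (Y_t, Y_{t−1}) and the periodic terms straight to Y_{t+l}. A plain linear least-squares fit stands in here for the neural network; the names are illustrative.

    import numpy as np

    def fit_direct(y, lead, omega1=0.2618, omega2=0.5236):
        """Fit a separate linear model for the given lead time."""
        y = np.asarray(y, dtype=float)
        t = np.arange(2, len(y) - lead + 1)          # 1-based index of Y_t
        X = np.column_stack([
            y[t - 1], y[t - 2],                      # Y_t, Y_{t-1}
            np.cos(omega1 * (t + lead - 1)), np.sin(omega1 * (t + lead - 1)),
            np.cos(omega2 * (t + lead - 1)), np.sin(omega2 * (t + lead - 1)),
            np.ones(len(t)),                         # intercept
        ])
        target = y[t + lead - 1]                     # Y_{t+lead}
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        return coef

    if __name__ == "__main__":
        y = np.random.default_rng(8).normal(size=200)
        coefs = {lead: fit_direct(y, lead) for lead in range(1, 13)}   # 12 models
        print(len(coefs), "direct models fitted")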
Figure 6.7 illustrates the performance of the direct method with once again the forecast
output values compared with the target outputs for twelve hours.
[Figure 6.7: Direct method: forecast electricity consumption compared with the target values for twelve hours ahead.]
Here the forecast values of the model differ somewhat less from the target values. To compare the naive method with the direct method, Figure 6.8 is shown.
[Figure 6.8: Comparison of the naive and direct method forecasts with the target values.]
Figure 6.8 illustrates that the direct method performs better. The predicted values of
the model for this method are for most of the time periods the nearest to the target.
In this chapter a neural network was fitted to hourly electricity consumption. Two
methods for determining the forecasts were discussed. The direct method performed
better for most of the forecast values.
Chatfield (1996) investigated alternative approaches to forecasting. His investigation of neural networks as forecasting models showed that models with fewer parameters generally give better out-of-sample predictions than neural network models with more parameters, even though these smaller models fit the training data worse than less parsimonious models. Chatfield (1996) also found that wider prediction intervals may reflect model uncertainty better. Neural network software programs may be improved by including methods for determining prediction intervals.
The same neural network model used in Chapter 5 is used to predict the electricity
consumption for 1 to 24 hours ahead. A time plot of the data is given in Figure 6.9.
[Figure 6.9: Hourly electricity load demand time series.]
Bootstrap methodology is used to calculate the predictions together with their standard errors and 95% prediction intervals. The results are illustrated in Figure 6.10.
[Figure 6.10: Predicted hourly electricity load with 95% prediction limits (bootstrap approach applied to the neural network).]
A linear regression model was also used to predict the electricity load. The same input variables were used, and it was assumed that the prediction errors are normally distributed when determining the prediction limits. In Figure 6.11 the predictions as well as the prediction limits are illustrated.
[Figure 6.11: Predicted hourly electricity load with 95% prediction limits (linear regression model).]
It can be seen that the forecasts obtained by the neural network (Figure 6.10) and the linear model (Figure 6.11) compare well. In the case of the linear model the predictions and the prediction intervals are smoother than those obtained by the bootstrap method for the neural network model. This is because the bootstrap method is data driven and does not rely on any assumptions about the probability distribution of the data.
PREDICTNN is an artificial feedforward neural network program that does one-step and multi-step time series prediction, as well as the calculation of standard errors and prediction intervals based on bootstrap methodology. (Reference).
Minimum hardware specifications:
The program runs under DOS Version 4.
Memory: 32 MB
CPU: 486
Disk space: 1 MB
Disclaimer: The authors do not take responsibility in any terms for any damages that
may result from using PREDICTNN.
Create a new directory, say TSPREDICT.
Copy PREDICTNN.EXE to the TSPREDICT directory.
All input and output files will be written to or read from TSPREDICT directory.
Go to DOS Window.
cd\path to TSPREDICT
Type PREDICTNN.
PREDICTNN requires parameter input (from the screen or an ASCII file) and the training data set (from an ASCII file). The user can either specify the parameters by typing them in as prompted by the program, or the parameters can be read from an input file:

Enter 0 if input from screen or 1 if input from file:

If the user chooses to specify the parameters from the screen, the following values must be entered as prompted by the program:
Enter the number of patterns in the training set:
Enter the number of bootstrap samples to be drawn:
Enter the number of input variables:
Enter the number of hidden neurons:
Enter the number of lags of the dependent variable:
Enter the number of forecasts to be calculated:
Enter the 100*(1-alpha)% confidence interval:
Enter the maximum number of iterations:
Enter the number of nn models to fit:
Enter the input file name:
It is assumed that the training set consists of the first patterns in the data file, up to the number of patterns in the training set as specified.
The number of bootstrap samples to be drawn refers to the number of samples used to calculate the prediction interval. A minimum of 1000 bootstrap samples is recommended in the literature. Note that an ANN is trained on every bootstrap sample. If you only want predictions and standard errors, you may specify a value of 3 for this parameter; the program will then draw 100 bootstrap samples from the original time series to calculate predictions and standard errors.
The number of input variables is the number of independent/explanatory variables in the model, including the number of lagged values of the dependent/output variable. The dependent variable is a time series.
The number of lags of the dependent variable is the autoregressive order of the model.
Number of hidden neurons - the number of neurons in the hidden layer.
Number of forecasts to be calculated - the size of the forecasting window.
100*(1-alpha)% confidence interval - a value of 0.05 will, for instance, specify a 95% confidence coefficient.
Maximum number of iterations - the iterative optimisation algorithm will stop after the maximum number of iterations has been reached, even if convergence to the global minimum has not taken place. More complex problems require more iterations; in relatively small problems (about 20 parameters), 60 iterations should suffice.
The number of nn models to fit refers to the number of ANNs to be trained from random starting weights. The network with the lowest mean square error is then selected. Values between 3 and 10 should suffice.
The input file name is the name of the data file (not longer than 20 characters).
The values entered are automatically written to the file PARAMB.DAT.
If the parameters are read from a file, the program will read the parameter values from the ASCII file PARAMB.DAT, one value per line, in the following order:
NTRAIN: Number of patterns in the training set.
NBOOT: Number of bootstrap samples to be drawn.
NINP: Number of input variables.
NHID: Number of hidden neurons.
NLAGS: Number of lags of the dependent variable.
LEAD: Number of forecasts to be calculated.
PALPHA: 100*(1-alpha)% confidence interval.
MAXITE: Maximum number of iterations.
NFIT: Number of neural network models to fit.
FNAME: Input file name.
The data file is an ASCII file. The data are real numbers separated by one or more blanks. The name of the data file is specified by the user (see input parameters).
The first pattern in the training set is given in the first row of the data file, starting with the output variable, followed by the independent variables (if any), the first lag of the output variable, the second lag of the output variable, and so on. More than one line of data may be used for each pattern in the training set; the program will go to a new line after a pattern has been read.
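A minimal sketch of preparing the two input files in the layout described above: a PARAMB.DAT file with one parameter value per line, and a data file with one pattern per row. The particular parameter values match the example given later in this appendix; the helper names are illustrative.

    # Sketch only: writes PARAMB.DAT and a pattern file in the layout described
    # above.  The row contents shown are illustrative.
    def write_paramb(path="PARAMB.DAT"):
        params = [400,          # NTRAIN: patterns in the training set
                  1000,         # NBOOT : bootstrap samples
                  8,            # NINP  : input variables
                  2,            # NHID  : hidden neurons
                  2,            # NLAGS : lags of the dependent variable
                  5,            # LEAD  : forecasts to calculate
                  0.05,         # PALPHA: 100*(1-alpha)% interval
                  40,           # MAXITE: maximum iterations
                  3,            # NFIT  : networks to train
                  "elecs.txt"]  # FNAME : data file
        with open(path, "w") as fh:
            fh.write("\n".join(str(p) for p in params) + "\n")

    def write_pattern_file(rows, path="elecs.txt"):
        # each row: [output, input_1, ..., input_k, lag1_output, lag2_output]
        with open(path, "w") as fh:
            for row in rows:
                fh.write(" ".join(f"{v:.6f}" for v in row) + "\n")

    if __name__ == "__main__":
        write_paramb()
        write_pattern_file([[0.1, 0.2, 0.3, 0.05, 0.07]])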
FORECAST.OUT
Format of output:
Forecast lead (e.g. 1 for one step), forecast, standard error of forecast, lower and upper
prediction limits, length of prediction interval.
BOOT.OUT
Format of output:
Bootstrap sample number, value of root mean squared error for trained network (on
training set).
OPTIM.OUT
Work file used by the optimization algorithm. This file may contain useful information on optimization problems. Iteration results are given for the last network trained.
Neural network architecture:
Pre-processing: Input and output variables are scaled to (-1, 1).
Initialisation of weights: A set of weights is generated using a random number generator. In order to prevent local minimum problems, a number (as specified by the user) of ANNs are trained from random starting points, and the best ANN is selected.
One hidden layer with a sigmoidal activation function.
Error function: Root mean squared error.
One output.
Number of input variables (NINP) and number of nodes in the hidden layer (NHID): NHID*(NINP+2) < 100
Number of patterns in the training set (NTRAIN): NTRAIN < 1000
(NINP+1)*NTRAIN < 6000
Number of forecast lead times (LEAD): LEAD < 100
Number of bootstrap samples (NBOOT): NBOOT < 2000
LEAD*NBOOT < 10000
The data file ELECS.TXT contains 716 patterns of an electricity load demand time
series. Each pattern consists of the output variable and 8 input variables including two
lagged values of the output variable.
Suppose we want to train a network with two nodes in the hidden layer on the first 400
patterns and produce 5 forecasts with 95% prediction limits, using 1000 bootstrap
samples.
400
1000
8
2
2
5
0.05
40
3
elecs.txt
To run the program, type PREDICTNN. The program will estimate the required running time. If you want to terminate the program, interrupt it from the keyboard.
The output files are FORECAST.OUT and BOOT.OUT.
The temptation exists to use neural networks as "black boxes", and scientists often succumb to it. This is unfortunate, because a neural network is a powerful statistical device that can be used for many different problems. Statisticians should provide the knowledge to implement statistical methods together with neural networks in order to bring out their best performance.
In this dissertation, neural networks were used to analyse a time series, especially with respect to the problem of forecasting. Statistical methods used for time series analysis are usually linear and may not succeed in forecasting a non-linear data set. This study specifically investigated the use of the ARMA model in comparison with the neural network model for the problem of forecasting.
Two time series, namely a generated AR(2) and an electricity consumption time
series, were used to illustrate the process of time series analysis. Box and Jenkins
proposed three phases for the analysis of a time series: identification, estimation and
evaluation. These phases were given for ARMA models and also applied to neural
network models in Chapter 4.
To illustrate the use of neural networks for the problem of forecasting, electricity consumption data was used as a time series. A neural network model was estimated to predict twelve hours ahead. The model for electricity consumption can still be improved. Different forecasting methods suitable for non-linear models were discussed. A program was developed to determine prediction intervals for the examples of the linear model and the neural network model by using the bootstrap method. The neural network program package that was used in this study can only calculate one-step predictions and does not permit the implementation of the bootstrap method. Calculation of multi-step forecasts and prediction intervals should be implemented in neural network software.
Neural networks, as versatile and adaptable tools, are highly recommended. This study focussed on the identification, estimation and evaluation of neural network models for time series, and on forecasting. Many of the standard techniques in statistics can be compared with neural network methodology, especially in applications with large data sets.
Further research includes the use of bootstrap methods to calculate predictions, standard errors and prediction intervals for the forecasts. Bootstrap techniques can be implemented in neural network computer software packages. Research into the application and comparison of bootstrap techniques with neural networks is recommended.
ANSUJ, A.P., CAMARGO, M.E., RADHARAMANAN, R. and PETRY, D.G. (1996). Sales forecasting using time series models and neural networks. Computers and Industrial Engineering, 31(1,2), 421-424.
BERTHOLD, M. and HAND, D.J. (1999). Intelligent data analysis: An introduction. Springer-Verlag.
BISHOP, C.M. (1996). Neural networks for pattern recognition. Oxford University Press, New York.
BORAINE, H. (2000). Prediction intervals for forecasts of neural network models for time series. Proceedings of the Second ICSC Symposium on Neural Computation, Berlin, Germany (23-26 May 2000).
BOX, G.E.P., JENKINS, G.M. and REINSEL, G.C. (1994). Time Series Analysis: Forecasting and Control. Third Edition, Prentice Hall, Englewood Cliffs, New Jersey.
BROWN, B. and MARIANO, R. (1989). Residual based stochastic predictors and estimation in non-linear models. Econometrica, 52, 321-343.
CHATFIELD, C. (1996). Model Uncertainty and Forecast Accuracy. Journal of Forecasting, 15, 495-508.
CHENG, B. and TITTERINGTON, D.M. (1994). Neural networks: A view from a statistical perspective. Statistical Science, 9(1), 2-54.
CONNOR, J.T. (1996). A robust neural network filter for electricity demand prediction. Journal of Forecasting, 15(6), 458(22).
CRUSADER SYSTEMS (Pty) Ltd. (1998). Basic ModelGen™ 1.0 user's guide, 1-142.
DE JONGH, P.J. and DE WET, T. (1994). An introduction to neural networks, 1-19.
FLETCHER, R. (1987). Practical Methods of Optimization (Second ed.). New York: John Wiley.
GALPIN, J.S. (1997). Investigation of improvement of load forecasting by using rainfall, temperature and humidity data. ESKOM Technology research report, 2-93.
GEMAN, S., BIENENSTOCK, E. and DOURSAT, R. (1992). Neural Networks and the Bias/Variance Dilemma. Neural Computation, 4, 1-58.
HAMILTON, J.D. (1994). Time series analysis. Princeton University Press, Princeton, New Jersey.
HAYKIN, S. (1994). Neural networks - A comprehensive foundation. Macmillan College Publishing Company, Inc., USA.
HWANG, J.T.G. and DING, A.A. (1997). Prediction intervals for neural networks. Journal of the American Statistical Association, 92(438), Theory and Methods, 748-757.
JOHANNSON, E.M., DOWLA, F.U. and GOODMAN, D.M. (1992). Backpropagation learning for multi-layer feedforward neural networks using the conjugate gradient method. International Journal of Neural Systems, 2(4), 188-197.
KRAMER, A.H. and SANGIOVANNI-VINCENTELLI, A. (1989). Efficient parallel learning algorithms for neural networks. In Touretzky (Ed.), Advances in Neural Information Processing Systems, 1, 40-48. San Mateo, CA: Morgan Kaufmann.
LE BARON, B. and WEIGEND, A.S. (1996). Evaluating neural network predictors by bootstrapping. http://www.cs.colorado.edu/homes/andreas/public_html/Home.html (January 1998).
LEE, K.Y., CHA, Y.T. and PARK, J.H. (1992). Short term load forecasting using an artificial neural network. IEEE Transactions on Power Systems, 7, 124-132.
LIN, J.L. and GRANGER, C.W.J. (1994). Forecasting from Non-linear Models in Practice. Journal of Forecasting, 13, 1-9.
MASAROTTO, G. (1990). Bootstrap prediction intervals for autoregressions. International Journal of Forecasting, 6, 229-239.
MCCULLOCH, W.S. and PITTS, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.
MCLEOD, A.I. (1978). On the distribution of residual autocorrelation in Box-Jenkins models. Journal of the Royal Statistical Society, A40, 296-302.
MINSKY, M. and PAPERT, S. (1969). Perceptrons. Cambridge, MA: MIT Press.
ORR, K.D. (1995). Use of an artificial neural network to predict ICU length of stay in cardiac surgical patients. Society of Critical Care Medicine, San Francisco, CA, January 1995.
PARK, D.C., EL-SHARKAWI, M.A., MARKS II, R.J., ATLAS, L.E. and DAMBORG, M.J. (1991). Electric load forecasting using neural networks. IEEE Transactions on Power Systems, 6(2), 442-449.
PENG, T.M., HUBELE, N.F. and KARADY, G.G. (1993). An adaptive neural network approach to one week ahead load forecasting. IEEE Transactions on Power Systems, 8(3), 1195-1203.
RIPLEY, B.D. (1993). Statistical aspects of neural networks. Invited lectures for SemStat, Sandbjerg, Denmark, 25-30 April 1992. To appear in the proceedings, Chapman & Hall (January 1993).
RIPLEY, B.D. (1994). Neural networks and related methods of classification. Journal of the Royal Statistical Society, 56(3), 409-456.
SARLE, W.S. (1996). Neural Network and Statistical Jargon. [email protected] (April 29, 1996).
SCHALKOFF, R. (1992). Pattern Recognition: Statistical, Structural and Neural Approaches. John Wiley and Sons, Inc., Canada.
SNYMAN, J.A. (1982). A New Dynamic Method for Unconstrained Minimization. Applied Mathematical Modelling, 6, 449-462.
SNYMAN, J.A. (1983). An updated version of the original leap-frog dynamic method for unconstrained minimisation (LFOP1(b)). Applied Mathematical Modelling, 7, 216-218.
STERN, H.S. (1996). Neural networks in applied statistics. Technometrics, 38(3), 205-213.
TAM, K.Y. and KIANG, M.Y. (1992). Managerial applications of neural networks in the case of bank failure predictions. Journal of Management Science, 38(7), 926-947.