APPLICATION OF WILCOXON NORM FOR INCREASED OUTLIER INSENSITIVITY IN FUNCTION APPROXIMATION PROBLEMS
A THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
BACHELOR OF TECHNOLOGY
IN
ELECTRONICS & INSTRUMENTATION ENGINEERING
By
NISHANTA SOURAV DAS
Roll No. – 10407029
Under the guidance of
Prof. Ganapati Panda
Department of Electronics & Communication Engineering
National Institute of Technology, Rourkela
2008
NATIONAL INSTITUTE OF TECHNOLOGY
ROURKELA
CERTIFICATE
This is to certify that the thesis entitled, “Application of Wilcoxon Norm for
increased outlier insensitivity in function approximation problems” submitted by Sri
Nishanta Sourav Das in partial fulfillment of the requirements for the award of Bachelor of
Technology Degree in Electronics & Instrumentation Engineering at the National Institute of
Technology, Rourkela (Deemed University) is an authentic work carried out by him under my
supervision and guidance.
To the best of my knowledge, the matter embodied in the thesis has not been
submitted to any other University / Institute for the award of any Degree or Diploma.
Prof. G. Panda
Professor and Head,
Department of Electronics & Communication Engineering,
Date:
National Institute of Technology,
Rourkela-769008
ACKNOWLEDGEMENT
I take this opportunity as a privilege to thank all individuals without whose support and guidance I could not have completed my project in the stipulated period of time.
First and foremost I would like to express my deepest gratitude to my Project Supervisor Prof. G.
Panda, Head of the Department, Department of Electronics and Communication Engineering,
NIT Rourkela for his invaluable support, guidance, motivation and encouragement throughout
the period this work was carried out. His readiness for consultation at all times, his educative
comments and inputs, his concern and assistance even with practical things have been extremely
helpful.
I am grateful to Ms. Babita Majhi, Mr. Jagannath Nanda and Mr. Ajit Kumar Sahoo for their valued suggestions and inputs during the course of the project work.
I would also like to thank all professors and lecturers, and members of the
department of Electronics and Communication Engineering for their generous help in various
ways for the completion of this thesis. I also extend my thanks to my fellow students for their
friendly co-operation.
NISHANTA SOURAV DAS
Roll. No. 10407029
Department of E.C.E.
NIT Rourkela
CONTENTS

Abstract
List of Figures
List of Tables
Abbreviations Used

CHAPTER 1. INTRODUCTION
1.1 Introduction
1.2 Motivation
1.3 A Brief Sketch of Contents

CHAPTER 2. ADAPTIVE MODELING AND SYSTEM IDENTIFICATION
2.1 Introduction
2.2 Adaptive Filter
2.3 Filter Structures
2.4 Application of Adaptive Filters
2.4.1 Direct Modeling
2.4.2 Inverse Modeling
2.5 Gradient Based Adaptive Algorithm
2.5.1 General Form of Adaptive FIR Algorithm
2.5.2 The Mean-Squared Error Cost Function
2.5.3 The Wiener Solution
2.5.4 The Method of Steepest Descent
2.6 Least Mean Square (LMS) Algorithm
2.7 System Identification

CHAPTER 3. ARTIFICIAL NEURAL NETWORKS
3.1 Introduction
3.2 Single Neuron Structure
3.2.1 Activation Functions and Bias
3.2.2 Learning Process
3.3 Multilayer Perceptron
3.3.1 Back Propagation Algorithm

CHAPTER 4. RADIAL BASIS FUNCTION NETWORKS
4.1 Introduction
4.2 RBFNN Structure
4.2.1 Various Radial Basis Functions
4.3 Learning Strategies of GRBFNNs
4.3.1 Fixed Centers Selected at Random
4.3.2 Self-organized Selection of Centers
4.3.3 Stochastic Gradient Approach (Supervised Learning)

CHAPTER 5. WILCOXON LEARNING MACHINES
5.1 Introduction
5.2 Wilcoxon Norm
5.3 Wilcoxon Neural Network (WNN)
5.3.1 Structure of WNN
5.3.2 Learning Algorithm of WNN
5.4 Wilcoxon Generalised Radial Basis Function Network (WGRBFN)

CHAPTER 6. SIMULATIONS AND CONCLUSION
6.1 Simulations
6.2 Conclusion
6.3 References
ABSTRACT
In system theory, characterization and identification are fundamental problems. When the plant behavior is completely unknown, it may be characterized using a certain model and then its identification may be carried out with artificial neural networks (ANN), such as the multilayer perceptron (MLP) or the functional link artificial neural network (FLANN), or with radial basis functions (RBF), using learning rules such as the back propagation (BP) algorithm. These offer flexibility, adaptability and versatility, allowing a variety of approaches to meet a specific goal depending upon the circumstances and the requirements of the design specifications. The first aim of the present thesis is to provide a framework for the systematic design of adaptation laws for nonlinear system identification and channel equalization. While constructing an artificial neural network or a radial basis function neural network, the designer is often faced with the problem of choosing a network of the right size for the task. Using a smaller neural network decreases the cost of computation and increases generalization ability; however, a network which is too small may never solve the problem, while a larger network might be able to. Since transmission bandwidth is one of the most precious resources in digital communication, communication channels are usually modeled as band-limited linear finite impulse response (FIR) filters with low-pass frequency response.
The second aim of the thesis is to propose a method of dealing with the inevitable presence of outliers in system identification and function approximation problems. In statistics, an outlier is an observation that is numerically distant from the rest of the data. Statistics derived from data sets that include outliers may be misleading. As is well known in statistics, the linear regressors obtained by using the rank-based Wilcoxon approach to linear regression problems are usually robust against (or insensitive to) outliers. This is the prime motivation behind the introduction of the Wilcoxon approach to the area of machine learning in this thesis. Specifically, we investigate two new learning machines, namely the Wilcoxon neural network (WNN) and the Wilcoxon generalized radial basis function network (WGRBFN). These provide alternative learning machines when faced with general nonlinear learning problems.
This thesis presents a comprehensive comparative study covering the implementation of the Artificial Neural Network (ANN) and the Generalized Radial Basis Function Neural Network (GRBFNN) and their Wilcoxon versions, namely the Wilcoxon Neural Network (WNN) and the Wilcoxon Generalized Radial Basis Function Neural Network (WGRBFNN), for nonlinear system identification and channel equalization. All the structures mentioned above, and their conventional gradient-descent training methods, were extensively studied.
Simulation results show that the proposed Wilcoxon learning machines, as applied to artificial neural networks and generalized radial basis function networks, have good robustness against outliers.
LIST OF FIGURES

Fig.2.1      Types of adaptation
Fig.2.2      General adaptive filtering
Fig.2.3      Structure of an FIR filter
Fig.2.4      Function approximation
Fig.2.5      System identification
Fig.2.6      Inverse modeling
Fig.2.7      Block diagram of system identification
Fig.3.1      A single neuron structure
Fig.3.2      Structure of multilayer perceptron
Fig.3.3      Neural network using BP algorithm
Fig.4.1      Structure of RBFNN
Fig.4.2      The Gaussian function
Fig.5.1      Wilcoxon neural network structure
Fig.6.1-10   Simulation examples: performance of ANN & WNN
Fig.6.11-20  Simulation examples: performance of GRBFN & WGRBFN
LIST OF TABLES

Table 3.1    Common activation functions
ABBREVIATIONS USED

ANN       Artificial Neural Network
RBF       Radial Basis Function
GRBFNN    Generalized Radial Basis Function Neural Network
WNN       Wilcoxon Neural Network
WGRBFNN   Wilcoxon Generalized Radial Basis Function Neural Network
BP        Back Propagation
FIR       Finite Impulse Response
IIR       Infinite Impulse Response
ISI       Inter Symbol Interference
LMS       Least Mean Square
MLANN     Multilayer Artificial Neural Network
MLP       Multilayer Perceptron
MSE       Mean Square Error
Chapter 1
INTRODUCTION

1. INTRODUCTION
1.1. INTRODUCTION.
System identification is one of the most important areas in engineering because of
its applicability to a wide range of problems. Mathematical system theory, which has in the past
few decades evolved into a powerful scientific discipline of wide applicability, deals with
analysis and synthesis of systems. The best-developed theory is for systems defined by linear operators, using well-established techniques based on linear algebra, complex variable theory and the theory of ordinary linear differential equations. Design techniques for dynamical systems are closely related to their stability properties. Necessary and sufficient conditions for the stability of linear time-invariant systems have been derived over the past century, and well-known design methods have been established for such systems. In contrast to this, the stability of nonlinear systems can
be established for the most part only on a system-by-system basis.
In the past few decades major advances have been made in adaptive identification
and control for identifying and controlling linear time-invariant plants with unknown parameters.
The choice of the identifier and the controller structures is based on well-established results in linear systems theory. Stable adaptive laws for the adjustment of parameters in these structures, which assure the global stability of the relevant overall systems, are also based on properties of linear systems as well as stability results that are well known for such systems [1.1].
Machine learning, namely learning from examples, has been an active research area for several
decades. Popular and powerful learning machines proposed in the past include artificial neural
networks (ANNs) [1]–[4], generalized radial basis function networks (GRBFNs) [5]–[7], fuzzy
neural networks (FNNs) [8], [9], and support vector machines (SVMs). They are different in
their origins, network configurations, and objective functions. They have also been successfully
applied in many branches of science and engineering. In statistical terms, the aforementioned
learning machines are nonparametric in the sense that they do not make any assumptions of the
functional form, e.g., linearity, of the discriminant or predictive functions. Among these, we
would be particularly interested in ANN and GRBFNN.
Robust smoothing is a central idea in statistics that aims to simultaneously
estimate and model the underlying structure. Outliers are observations that are separated in some
fashion from the rest of the data. Hence, outliers are data points that are not typical of the rest of the data. Depending on their location, outliers may have moderate to severe effects on the regression model. A regressor or a learning machine is said to be robust if it is insensitive to outliers in the data.
1.2. MOTIVATION
Adaptive filtering has proven to be useful in many contexts such as linear
prediction, channel equalization, noise cancellation, and system identification. The adaptive filter
attempts to iteratively determine an optimal model for the unknown system, or “plant”, based on
some function of the error between the output of the adaptive filter and the output of the plant.
The optimal model or solution is attained when this function of the error is minimized. The
adequacy of the resulting model depends on the structure of the adaptive filter, the algorithm
used to update the adaptive filter parameters, and the characteristics of the input signal.
When the parameters of a physical system are not available or time dependent it is
difficult to obtain the mathematical model of the system. In such situations, the system
parameters should be obtained using a system identification procedure. The purpose of system
identification is to construct a mathematical model of a physical system from input-output
mapping. Studies on linear system identification have been carried out for more than three
decades [1.3]. However, identification of nonlinear systems is a promising research area.
Nonlinear characteristics such as saturation, dead-zone, etc. are inherent in many real systems. In
order to analyze and control such systems, identification of nonlinear system is necessary.
Hence, adaptive nonlinear system identification has become more challenging and received
much attention in recent years [1.4]. The conventional LMS algorithm [1.5] fails in case of
nonlinear channels and plants. Several approaches based on the Artificial Neural Network (ANN) and the Generalized Radial Basis Function Neural Network (GRBFNN) are discussed in this thesis for the estimation of nonlinear systems.
In statistics, an outlier is an observation that is numerically distant from the rest of
the data. Depending on their location, outliers may have moderate to severe effects on the
regression model. Statistics derived from data sets that include outliers may be misleading. For
example, if one is calculating the average temperature of 10 objects in a room, and most are
between 20 and 25 degrees Celsius, but an oven is at 350 °C, the median of the data may be 23 °C but the mean temperature will be 55 °C. In this case, the median better reflects the temperature of a randomly sampled object than the mean. Outliers may be indicative of data points that belong to
a different population than the rest of the sample set. As is well known in statistics, the resulting
linear regressors by using the rank-based Wilcoxon approach to linear regression problems are
usually robust against (or insensitive to) outliers. It is then natural to generalize the Wilcoxon
approach for linear regression problems to nonparametric Wilcoxon learning machines for
nonlinear regression problems. This is the prime motivation behind the introduction of the
Wilcoxon approach to the area of machine learning in this paper. Specifically, we investigate two
new learning machines, namely Wilcoxon neural network (WNN) and Wilcoxon generalized
radial basis function network (WGRBFN).These provide alternative learning machines when
faced with general nonlinear learning problems.
1.3. A BRIEF SKETCH OF CONTENTS
• In Chapter 2, the adaptive modeling and system identification problems are defined for linear and nonlinear plants. The conventional LMS algorithm and other gradient-based algorithms for FIR systems are derived. Nonlinearity problems are discussed briefly and various methods are proposed for their solution.
• In Chapter 3, the theory, structure and algorithms of various artificial neural networks are discussed. We focus on the Multilayer Perceptron (MLP).
• Chapter 4 gives an introduction to the Radial Basis Function Neural Network (RBFNN). Different strategies to train the GRBFNN are thoroughly discussed. Simulations are carried out for the stochastic gradient approach for identification of nonlinear and noisy plants.
• Chapter 5 introduces the Wilcoxon learning approach as applied to various learning machines. The Wilcoxon norm is defined and its application in developing the Wilcoxon Neural Network (WNN) and the Wilcoxon Generalized Radial Basis Function Network (WGRBFNN) is shown. The gradient descent methods for these new networks are derived.
• Chapter 6 summarizes the work done in this thesis and points to possible directions for future work, more precisely, the application of the Wilcoxon norm to various learning machines for increasing robustness against outliers in various function approximation problems. Simulations are shown with the new update equations as derived, and the results are compared with those of conventional ANN and GRBFNN.
Chapter 2
ADAPTIVE MODELING AND SYSTEM IDENTIFICATION

2. ADAPTIVE MODELING AND SYSTEM IDENTIFICATION
2.1. INTRODUCTION
Modeling and system identification is a very broad subject, of great importance in the fields of control systems, communications, and signal processing. Modeling is also important outside the traditional engineering disciplines, for example in social, economic, or biological
systems. An adaptive filter can be used in modeling, that is, imitating the behavior of physical systems which may be regarded as unknown "black boxes" having one or more inputs and one or more outputs. The essential and principal property of an adaptive system is its time-varying, self-adjusting performance. System identification [2.1, 2.2] is the experimental approach to process modeling. System identification includes the following steps:
• Experiment design: Its purpose is to obtain good experimental data, and it includes the choice of the measured variables and of the character of the input signals.
• Selection of model structure: A suitable model structure is chosen using prior knowledge and trial and error.
• Choice of the criterion to fit: A suitable cost function is chosen, which reflects how well the model fits the experimental data.
• Parameter estimation: An optimization problem is solved to obtain the numerical values of the model parameters.
• Model validation: The model is tested in order to reveal any inadequacies.
Adaptive systems have the following characteristics:
1) They can automatically adapt (self-optimize) in the face of changing (non-stationary) environments and changing system requirements.
2) They can be trained to perform specific filtering and decision-making tasks.
3) They can extrapolate a model of behavior to deal with new situations after being trained on a finite and often small number of training signals and patterns.
4) They can repair themselves to a limited extent.
5) They can be described as nonlinear systems with time-varying parameters.
The adaptation is of two types:

(i) Open-loop adaptation

The open-loop adaptive process is shown in Fig.2.1(a). It involves making measurements of input or environment characteristics, applying this information to a formula or to a computational algorithm, and using the results to set the adjustments of the adaptive system. The adaptation of the process parameters does not depend upon the output signal.
Fig.2.1. Types of adaptation: (a) open-loop adaptation and (b) closed-loop adaptation
(ii) Closed-loop adaptation

Closed-loop adaptation, as shown in Fig.2.1(b), on the other hand, involves automatic experimentation with these adjustments and knowledge of their outcome in order to optimize a measured system performance. The latter process may be called adaptation by "performance feedback". The adaptation of the process parameters depends upon the input as well as the output signal.
2.2. ADAPTIVE FILTER
An adaptive filter [2.3, 2.4] is a computational device that attempts to model the
relationship between two signals in real time in an iterative manner. Adaptive filters are often
realized either as a set of program instructions running on an arithmetical processing device such
as a microprocessor or digital signal processing (DSP) chip, or as a set of logic operations
implemented in a field-programmable gate array (FPGA). However, ignoring any errors introduced by numerical precision effects in these implementations, the fundamental operation of
an adaptive filter can be characterized independently of the specific physical realization that it
takes. For this reason, we shall focus on the mathematical forms of adaptive filters as opposed to
their specific realizations in software or hardware. An adaptive filter is defined by four aspects:
1. The signals being processed by the filter.
2. The structure that defines how the output signal of the filter is computed from its input signal.
3. The parameters within this structure that can be iteratively changed to alter the filter's input-output relationship.
4. The adaptive algorithm that describes how the parameters are adjusted from one time instant
to the next.
By choosing a particular adaptive filter structure, one specifies the number and type of
parameters that can be adjusted. The adaptive algorithm used to update the parameter values of
the system can take on an infinite number of forms and is often derived as a form of optimization
procedure that minimizes an error.
Fig.2.2. General Adaptive Filtering
Fig.2.2. shows a block diagram in which a sample from a digital input signal x(n) is fed into a
device, called an adaptive filter, that computes a corresponding output signal sample y(n) at time n. For the moment, the structure of the adaptive filter is not important, except for the fact that it
contains adjustable parameters whose values affect how y(n) is computed. The output signal is
compared to a second signal d(n), called the desired response signal, by subtracting the two
samples at time n. This difference signal, given by

e(n) = d(n) - y(n)        (2.1)
is known as the error signal. The error signal is fed into a procedure which alters or adapts the
parameters of the filter from time n to time (n + 1) in a well-defined manner. As the time index n
is incremented, it is hoped that the output of the adaptive filter becomes a better and better match
to the desired response signal through this adaptation process, such that the magnitude of e(n)
decreases over time. In the adaptive filtering task, adaptation refers to the method by which the
parameters of the system are changed from time index n to time index (n +1). The number and
types of parameters within this system depend on the computational structure chosen for the
system. We now discuss different filter structures that have been proven useful for adaptive
filtering tasks.
2.3. FILTER STRUCTURES
In general, any system with a finite number of parameters that affect how y(n) is computed from x(n) could be used for the adaptive filter in Fig.2.2. Define the parameter or coefficient vector

W(n) = [ w0(n) w1(n) ... wL-1(n) ]^T        (2.2)

where { wi(n) }, 0 ≤ i ≤ L-1, are the L parameters of the system at time n.
The filter model typically takes the form of a finite-impulse-response (FIR) or infinite-impulse-response (IIR) filter. Fig.2.3 shows the structure of a direct-form FIR filter, also known as a tapped-delay-line or transversal filter, where z^-1 denotes the unit delay element and each wi(n) is a multiplicative gain within the system. In this case, the parameters in W(n) correspond to the impulse response values of the filter at time n. We can write the output signal y(n) as

y(n) = W^T(n) X(n)        (2.3)

where X(n) = [ x(n) x(n-1) ... x(n-L+1) ]^T denotes the input signal vector and (.)^T denotes vector transpose. Note that this system requires L multipliers and L-1 delays to implement, and these computations are easily performed by a processor or circuit so long as L is not too large and the sampling period for the signals is not too short. It also requires a total of 2L memory locations to store the L input signal samples and the L coefficient values, respectively.
Fig.2.3. FIR filter structure
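As a quick illustration of (2.3), the short NumPy sketch below forms the tapped-delay-line vector X(n) from a stored input sequence and takes its inner product with the weight vector. The function name and the assumption that the whole input record is available as an array are choices of this example only.

import numpy as np

def fir_output(w, x, n):
    # y(n) = W^T(n) X(n), with X(n) = [x(n), x(n-1), ..., x(n-L+1)]
    # assumes n >= L-1 so that the full delay line is populated
    L = len(w)
    X = x[n - L + 1 : n + 1][::-1]   # tapped-delay-line input vector X(n)
    return np.dot(w, X)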
2.4. APPLICATION OF ADAPTIVE FILTERS.
Perhaps the most important driving forces behind the developments in adaptive filters
throughout their history have been the wide range of applications in which such systems can be used.
We now discuss the forms of these applications in terms of more-general problem classes that
describe the assumed relationship between d(n) and x(n). Our discussion illustrates the key issues in
selecting an adaptive filter for a particular task.
2.4.1. Direct Modeling (Function Approximation & System Identification)
In function approximation problems, we are given a set of input-output patterns and we try to estimate the underlying function that relates the input to the output. This is done by passing the same set of input points to the function and to an adaptive filter kept parallel to the function; Fig.2.4 gives an illustration. The outputs or responses of both the function and the filter are found and their difference is noted. This difference is the error. The error is minimized by an adaptive algorithm that updates the weights of the adaptive filter. System identification is a special case of function approximation. Here the underlying function is provided by a plant or system and our aim is to determine the impulse response of this system.
In direct modeling, the adaptive model is kept parallel with the unknown plant. Modeling a
single-input, single-output system is illustrated in Fig.2.5..Both the unknown system and adaptive
filter are driven by the same input. The adaptive filter adjusts itself in such a way that its output is matched with that of the unknown system. Upon convergence, the structure and parameter values of the adaptive system may or may not resemble those of the unknown system, but the input-output response relationship will match. In this sense, the adaptive system becomes a model of the unknown plant.
Fig.2.4. Function approximation
Fig.2.5. System Identification
Let d(n) and y(n) represent the output of the unknown system and adaptive model with
x(n) as its input. Here, the task of the adaptive filter is to accurately represent the signal d(n) at
its output. If y(n) = d(n), then the adaptive filter has accurately modeled or identified the portion
of the unknown system that is driven by x(n).
Since the model typically chosen for the adaptive filter is a linear filter, the practical goal of the
adaptive filter is to determine the best linear model that describes the input-output relationship of
the unknown system. Such a procedure makes the most sense when the unknown system is also a
linear model of the same structure as the adaptive filter, as it is possible that y(n) = d(n) for some
set of adaptive filter parameters. For ease of discussion, let the unknown system and the adaptive
filter both be FIR filters, such that
d(n) = W_OPT^T(n) X(n)        (2.4)

where W_OPT(n) is an optimum set of filter coefficients for the unknown system at time n. In this problem formulation, the ideal adaptation procedure would adjust W(n) such that W(n) = W_OPT(n) as n → ∞. In practice, the adaptive filter can only adjust W(n) such that y(n) closely approximates d(n)
over time.
The system identification task is at the heart of numerous adaptive filtering applications. We list several of these applications here:
• Plant Identification
• Echo Cancellation for Long-Distance Transmission
• Acoustic Echo Cancellation
• Adaptive Noise Canceling
2.4.2. Inverse Modeling
We now consider the general problem of inverse modeling, as shown in Fig.2.6. In this diagram, a source signal s(n) is fed into a plant that produces the input signal x(n) for the adaptive filter. The output of the adaptive filter is subtracted from a desired response signal that is a delayed version of the source signal, such that

d(n) = s(n - Δ)        (2.5)

where Δ is a positive integer value. The goal of the adaptive filter is to adjust its characteristics such that the output signal is an accurate representation of the delayed source signal.
Fig.2.6 Inverse Modeling
2.5. GRADIENT BASED ADAPTIVE ALGORITHM
An adaptive algorithm is a procedure for adjusting the parameters of an adaptive filter to
minimize a cost function chosen for the task at hand. In this section, we describe the general
form of many adaptive FIR filtering algorithms and present a simple derivation of the LMS
adaptive algorithm. In our discussion, we only consider an adaptive FIR filter structure, such that
the output signal y(n) is given by (2.3). Such systems are currently more popular than adaptive
IIR filters because
(1) The input-output stability of the FIR filter structure is guaranteed for any set
of fixed coefficients, and
(2) The algorithms for adjusting the coefficients of FIR filters are simpler in general than those
for adjusting the coefficients of IIR filters.
2.5.1. General Form of Adaptive FIR Algorithm
The general form of an adaptive FIR filtering algorithm is
W(n+1) = W(n) + μ(n) G( e(n), X(n), φ(n) )        (2.6)

where G(.) is a particular vector-valued nonlinear function, μ(n) is a step size parameter, e(n) and
X(n) are the error signal and input signal vector, respectively, and φ(n) is a vector of states that store
pertinent information about the characteristics of the input and error signals and/or the coefficients at
previous time instants. In the simplest algorithms, φ(n) is not used, and the only information needed
to adjust the coefficients at time n is the error signal, the input signal vector, and the step size.
The step size is so called because it determines the magnitude of the change or "step" that is taken by
the algorithm in iteratively determining a useful coefficient vector. Much research effort has been
spent characterizing the role that μ(n) plays in the performance of adaptive filters in terms of the
statistical or frequency characteristics of the input and desired response signals. Often, success or
failure of an adaptive filtering application depends on how the value of μ(n) is chosen or calculated to
obtain the best performance from the adaptive filter.
2.5.2. The Mean-Squared Error Cost Function
The form of G(.) in (2.6) depends on the cost function chosen for the given adaptive filtering task. We now consider one particular cost function that yields a popular adaptive algorithm. Define the mean-squared error (MSE) cost function as

ξ_MSE(n) = (1/2) ∫_{-∞}^{∞} e^2(n) p_n( e(n) ) de(n) = (1/2) E{ e^2(n) }        (2.7)

where p_n( e(n) ) represents the probability density function of the error at time n and E{.} is shorthand for the expectation integral on the right-hand side of (2.7). The MSE cost function is useful for adaptive FIR filters because

• ξ_MSE(n) has a well-defined minimum with respect to the parameters in W(n);
• the coefficient values obtained at this minimum are the ones that minimize the power in the error signal e(n), indicating that y(n) has approached d(n); and
• ξ_MSE(n) is a smooth function of each of the parameters in W(n), such that it is differentiable with respect to each of the parameters in W(n).

The third point is important in that it enables us to
2.5.3. The Wiener Solution.
For the FIR filter structure, the coefficient values in W(n) that minimize ξ_MSE(n) are well-defined if the statistics of the input and desired response signals are known. The formulation of this problem for continuous-time signals and the resulting solution was first derived by Wiener [2.3]. Hence, this optimum coefficient vector W_MSE(n) is often called the Wiener solution to the adaptive filtering problem. The extension of Wiener's analysis to the discrete-time case is attributed to Levinson. To determine W_MSE(n) we note that the function ξ_MSE(n) in (2.7) is quadratic in the parameters { w_i(n) }, and the function is also differentiable. Thus, we can use a result from optimization theory that states that the derivatives of a smooth cost function with respect to each of the parameters are zero at a minimizing point on the cost function error surface. Thus, W_MSE(n) can be found from the solution to the system of equations

∂ξ_MSE(n) / ∂w_i(n) = 0,   0 ≤ i ≤ L-1        (2.8)

Taking derivatives of ξ_MSE(n) in (2.7), we obtain

∂ξ_MSE(n) / ∂w_i(n) = E{ e(n) ∂e(n)/∂w_i(n) }        (2.9)
                    = -E{ e(n) ∂y(n)/∂w_i(n) }        (2.10)
                    = -E{ e(n) x(n-i) }        (2.11)
                    = -( E{ d(n) x(n-i) } - Σ_{j=0}^{L-1} w_j(n) E{ x(n-i) x(n-j) } )        (2.12)

where we have used the definitions of e(n) and of y(n) for the FIR filter structure in (2.1) and (2.3), respectively, to expand the last result in (2.12). By defining the autocorrelation matrix R_XX(n) and the cross-correlation vector P_dx(n) as

R_XX(n) = E{ X(n) X^T(n) }   and   P_dx(n) = E{ d(n) X(n) }        (2.13)

respectively, we can combine (2.8) and (2.13) to obtain the system of equations in vector form as

R_XX(n) W_MSE(n) - P_dx(n) = 0        (2.14)

where 0 is the zero vector. Thus, so long as the matrix R_XX(n) is invertible, the optimum Wiener solution vector for this problem is

W_MSE(n) = R_XX^{-1}(n) P_dx(n)        (2.15)
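To make (2.13)-(2.15) concrete, the sketch below estimates R_XX and P_dx by sample averages over a finite data record and then solves the normal equations. Treating the signals as stationary and already available as arrays is an assumption of this illustration, not a statement from the thesis.

import numpy as np

def wiener_solution(x, d, L):
    # Estimate W_MSE = R_XX^{-1} P_dx for an L-tap FIR model from data records x and d.
    N = len(x)
    R = np.zeros((L, L))
    P = np.zeros(L)
    count = 0
    for n in range(L - 1, N):
        X = x[n - L + 1 : n + 1][::-1]   # tapped-delay-line vector X(n)
        R += np.outer(X, X)              # accumulate sample estimate of E{ X X^T }
        P += d[n] * X                    # accumulate sample estimate of E{ d(n) X(n) }
        count += 1
    R /= count
    P /= count
    return np.linalg.solve(R, P)         # W_MSE = R^{-1} P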
2.5.4. The Method of Steepest Descent

The method of steepest descent is a celebrated optimization procedure for minimizing the value of a cost function ξ(n) with respect to a set of adjustable parameters W(n). This procedure adjusts each parameter of the system according to

w_i(n+1) = w_i(n) - μ(n) ∂ξ(n)/∂w_i(n)        (2.16)

In other words, the i-th parameter of the system is altered according to the derivative of the cost function with respect to the i-th parameter. Collecting these equations in vector form, we have

W(n+1) = W(n) - μ(n) ∂ξ(n)/∂W(n)        (2.17)

where ∂ξ(n)/∂W(n) is a vector of the derivatives ∂ξ(n)/∂w_i(n). Substituting these results into (2.17) yields the update equation for W(n) as

W(n+1) = W(n) + μ(n) ( P_dx(n) - R_XX(n) W(n) )        (2.18)

However, this steepest descent procedure depends on the statistical quantities E{ d(n) x(n-i) } and E{ x(n-i) x(n-j) } contained in P_dx(n) and R_XX(n), respectively. In practice, we only have measurements of both d(n) and x(n) to be used within the adaptation procedure. While suitable estimates of the statistical quantities needed for (2.18) could be determined from the signals x(n) and d(n), we instead develop an approximate version of the method of steepest descent that depends on the signal values themselves. This procedure is known as the LMS (least mean square) algorithm.
2.6. LMS ALGORITHM
The cost function ξ(n) chosen for the steepest descent algorithm of (2.16) determines the
coefficient solution obtained by the adaptive filter. If the MSE cost function in (2.7) is chosen,
the resulting algorithm depends on the statistics of x(n) and d(n) because of the expectation
operation that defines this cost function. Since we typically only have measurements of d(n) and of x(n) available to us, we substitute an alternative cost function that depends only on these measurements. We can propose the simplified cost function ξ_LMS(n) given by

ξ_LMS(n) = (1/2) e^2(n)        (2.19)

This cost function can be thought of as an instantaneous estimate of the MSE cost function, as ξ_MSE(n) = E{ ξ_LMS(n) }. Although it might not appear to be useful, the resulting algorithm obtained when ξ_LMS(n) is used for ξ(n) in (2.16) is extremely useful for practical applications. Taking derivatives of ξ_LMS(n) with respect to the elements of W(n) and substituting the result into (2.16), we obtain the LMS adaptive algorithm given by

W(n+1) = W(n) + μ(n) e(n) X(n)        (2.20)
Equation (2.20) requires only multiplications and additions to implement. In fact, the number and
type of operations needed for the LMS algorithm is nearly the same as that of the FIR filter
structure with fixed coefficient values, which is one of the reasons for the algorithm's popularity.
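The sketch below simply runs the recursion (2.20) over a data record for a system identification task. The filter length L, the step size mu and the function name are arbitrary illustrative choices.

import numpy as np

def lms_identify(x, d, L=8, mu=0.01):
    # Adapt an L-tap FIR filter so that y(n) = W^T X(n) tracks the desired signal d(n).
    w = np.zeros(L)
    e_hist = np.zeros(len(x))
    for n in range(L - 1, len(x)):
        X = x[n - L + 1 : n + 1][::-1]   # input signal vector X(n)
        y = np.dot(w, X)                 # filter output, cf. (2.3)
        e = d[n] - y                     # error signal, cf. (2.1)
        w = w + mu * e * X               # LMS update, cf. (2.20)
        e_hist[n] = e
    return w, e_hist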
The behavior of the LMS algorithm has been widely studied, and numerous results concerning
its adaptation characteristics under different situations have been developed. For now, we
indicate its useful behavior by noting that the solution obtained by the LMS algorithm near its
convergent point is related to the Wiener solution. In fact, analysis of the LMS algorithm under
certain statistical assumptions about the input and desired response signals shows that

lim_{n→∞} E{ W(n) } = W_MSE        (2.21)

when the Wiener solution W_MSE(n) is a fixed vector.
algorithm is quite similar to that of the steepest descent algorithm in (2.18) that depends
explicitly on the statistics of the input and desired response signals. In effect, the iterative nature
of the LMS coefficient updates is a form of time-averaging that smoothes the errors in the
instantaneous gradient calculations to obtain a more reasonable estimate of the true gradient. The problem is that gradient descent is a local optimization technique, which is limited because it is unable to converge to the global optimum on a multimodal error surface if the algorithm is not
initialized in the basin of attraction of the global optimum.
Several modifications exist for gradient-based algorithms in an attempt to enable them to overcome
local optima. One approach is to simply add a momentum term [2.3] to the gradient computation
of the gradient descent algorithm to enable it to be more likely to escape from a local minimum.
This approach is only likely to be successful when the error surface is relatively smooth with
minor local minima, or some information can be inferred about the topology of the surface such
that the additional gradient parameters can be assigned accordingly. Other approaches attempt to
transform the error surface to eliminate or diminish the presence of local minima [2.16], which
would ideally result in a unimodal error surface. The problem with these approaches is that the
resulting minimum transformed error used to update the adaptive filter can be biased from the
true minimum output error and the algorithm may not be able to converge to the desired
minimum error condition. These algorithms also tend to be complex, slow to converge, and may
not be guaranteed to emerge from a local minimum.
Another approach attempts to locate the global optimum by running several LMS
algorithms in parallel, initialized with different initial coefficients. The notion is that a larger,
concurrent sampling of the error surface will increase the likelihood that one process will be
initialized in the global optimum valley. This technique does have potential, but it is inefficient
and may still suffer the fate of a standard gradient technique in that it will be unable to locate the
global optimum. By using a similar congregational scheme, but one in which information is
collectively exchanged between estimates and intelligent randomization is introduced, structured
stochastic algorithms are able to hill-climb out of local minima. This enables the algorithms to
achieve better, more consistent results using a smaller total number of estimates.
2.7. SYSTEM IDENTIFICATION
System identification is concerned with the determination of a system on the basis of input-output data samples. The identification task is to determine a suitable estimate of the finite-
dimensional parameters which completely characterize the plant. The selection of the estimate is
based on comparison between the actual output sample and a predicted value on the basis of
input data up to that instant. An adaptive automaton is a system whose structure is alterable or
adjustable in such a way that its behavior or performance improves through contact with its environment. Depending upon the input-output relation, the identification of systems falls into two groups:
A. Static System Identification
In this type of identification the output at any instant depends upon the input at that instant.
These systems are described by algebraic equations. The system is essentially a memoryless
one and mathematically it is represented as y(n) = f [x(n)] where y(n) is the output at the nth
instant corresponding to the input x(n).
B. Dynamic System Identification
In this type of identification the output at any instant depends upon the input at that instant as
well as the past inputs and outputs. Dynamic systems are described by difference or differential equations. These systems have memory to store past values and are mathematically represented as y(n) = f[ x(n), x(n-1), x(n-2), ..., y(n-1), y(n-2), ... ] where y(n) is the output at
the nth instant corresponding to the input x(n).
Fig.2.7. Block Diagram of System Identification
A system identification structure is shown in Fig.2.7. The model is placed parallel to the nonlinear plant and the same input is given to the plant as well as the model. The impulse response of the linear segment of the plant is represented by h(n), which is followed by the nonlinearity (NL) associated with it. White Gaussian noise q(n), added to the nonlinear output, accounts for
measurement noise. The desired output d(n) is compared with the estimated output y(n) of the
identifier to generate the error e(n) which is used by some adaptive algorithm for updating the
weights of the model. The training of the filter weights is continued until the error becomes
minimum and does not decrease further. At this stage the correlation between input signal and
error signal is minimum. Then the training is stopped and the weights are stored for testing. For
testing purpose new samples are passed through both the plant and the model and their responses
are compared.
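As a hypothetical data-generation sketch for the identification set-up of Fig.2.7, the snippet below forms the desired signal d(n) from an assumed linear FIR segment h(n), a nonlinearity and additive white Gaussian noise q(n). The plant coefficients, the tanh nonlinearity and the noise level are placeholder choices of this example, not values taken from the thesis.

import numpy as np

rng = np.random.default_rng(0)
h = np.array([0.26, 0.93, 0.26])                 # assumed linear segment h(n) of the plant
x = rng.uniform(-0.5, 0.5, 2000)                 # training input applied to plant and model
lin = np.convolve(x, h, mode="full")[:len(x)]    # output of the linear segment
d = np.tanh(lin) + 0.01 * rng.standard_normal(len(x))   # NL block plus measurement noise q(n)
# x and d can now be fed to any adaptive model (LMS filter, MLP, RBFN, ...); after
# training, fresh samples are passed through both plant and model and their responses compared.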
Chapter 3
ARTIFICIAL NEURAL NETWORKS

3. ARTIFICIAL NEURAL NETWORKS
3.1. INTRODUCTION
Because of nonlinear signal processing and learning capability, Artificial Neural
Networks (ANN’s) have become a powerful tool for many complex applications including
functional approximation, nonlinear system identification and control, pattern recognition and
classification, and optimization. The ANN’s are capable of generating complex mapping
between the input and the output space and thus, arbitrarily complex nonlinear decision
boundaries can be formed by these networks. An artificial neuron basically consists of a
computing element that performs the weighted sum of the input signal and the connecting
weight. The sum is added with the bias or threshold and the resultant signal is then passed
through a non-linear element of the tanh(.) type. Each neuron is associated with three parameters that can be adjusted by learning; these are the connecting weights, the bias and the slope of the non-linear function. From the structural point of view, a neural network (NN) may be single-layer or multi-layer. In a multi-layer structure, there are one or more artificial neurons in each layer, and in a practical case there may be a number of layers. Each neuron of one layer is connected to each and every neuron of the next layer.
A neural network is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experimental knowledge and making it available for use. It resembles the brain in two respects:
1. Knowledge is acquired by the network from its environment through a learning process.
2. Interneuron connection strengths, known as synaptic weights, are used to store the
acquired knowledge.
Artificial Neural Networks (ANNs) have emerged as a powerful learning technique to perform complex tasks in highly nonlinear dynamic environments. Some of the prime advantages of using ANN models are their ability to learn based on optimization of an appropriate error function and their excellent performance in approximating nonlinear functions. At present, most of the work on system identification using neural networks is based on multilayer feed-forward neural networks with back propagation learning or more efficient variations of this algorithm. On the other hand, the functional link ANN (FLANN), originally proposed by Pao, is a single-layer structure with functionally mapped inputs. The performance of FLANN for system identification of nonlinear systems has been reported [3.5] in the literature. Patra and Kot have used Chebyshev expansions for nonlinear system identification and have shown that the identification performance is better than that offered by the multilayer ANN (MLANN) model. Wang and Chen have presented a fully automated recurrent neural network (FARNN) that is capable of self-structuring its network in a minimal representation with satisfactory performance for unknown dynamic system identification and control.
3.2. SINGLE NEURON STRUCTURE
In 1958, Rosenblatt demonstrated some practical applications using the perceptron [3.8].
The perceptron is a single-level connection of McCulloch-Pitts neurons, sometimes called a single-layer feed-forward network. The network is capable of linearly separating the input vectors into classes of patterns by a hyperplane. A linear associative memory is an example of a single-layer
neural network. In such an application, the network associates an output pattern (vector) with an
input pattern (vector), and information is stored in the network by virtue of modifications made
to the synaptic weights of the network.
The structure of a single neuron is presented in Fig.3.1. An artificial neuron involves the computation of the weighted sum of inputs and a threshold [3.9, 3.10]. The resultant signal is then passed through a non-linear activation function. The output of the neuron may be represented as

y(n) = φ( Σ_{j=1}^{N} w_j(n) x_j(n) + b(n) )        (3.1)

where b(n) is the threshold (bias) of the neuron, w_j(n) is the weight associated with the j-th input x_j(n), and N is the number of inputs to the neuron.
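A minimal sketch of (3.1) with a tanh activation is given below; the particular weights, bias and input used in the example call are placeholders only.

import numpy as np

def neuron_output(x, w, b):
    # y = tanh( sum_j w_j x_j + b )
    return np.tanh(np.dot(w, x) + b)

# example: y = neuron_output(np.array([0.2, -0.4]), np.array([0.5, 1.0]), 0.1)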
3.2.1. Activation Functions and Bias
The perceptron internal sum of the inputs is passed through an activation function, which
can be any monotonic function. Linear functions can be used but these will not contribute to a
non-linear transformation within a layered structure, which defeats the purpose of using a neural
filter implementation. A function that limits the amplitude range and limits the output strength of
each perceptron of a layered network to a defined range in a non-linear manner will contribute to
a nonlinear transformation. There are many forms of activation functions, which are selected
according to the specific problem. All neural network architectures employ an activation function [3.1, 3.8], which defines the output of a neuron in terms of the activity level at its input (ranging from -1 to 1 or from 0 to 1). Table 3.1 summarizes the basic types of activation
functions. The most practical activation functions are the sigmoid and the hyperbolic tangent
functions. This is because they are differentiable.
The bias gives the network an extra variable, and networks with bias are more powerful than those without bias. A neuron without a bias always gives a net input of zero to the activation function when the network inputs are zero. This may not be desirable and can be avoided by the use of a bias.
3.2.2. Learning Processes
The property that is of primary significance for a neural network is the ability of the network to learn from its environment, and to improve its performance through learning. The improvement in performance takes place over time in accordance with some prescribed measure. A neural network learns about its environment through an interactive process of adjustments applied to its synaptic weights and bias levels. Ideally, the network becomes more knowledgeable about its environment after each iteration of the learning process. Hence we define
learning as:
“It is a process by which the free parameters of a neural network are adapted through a process
of stimulation by the environment in which the network is embedded.”
The processes used are classified into two categories as described in [3.1]:
(A) Supervised Learning (Learning With a Teacher)
(B) Unsupervised Learning (Learning Without a Teacher)
(A) Supervised Learning:
We may think of the teacher as having knowledge of the environment, with that knowledge being represented by a set of input-output examples. The environment is, however, unknown to the neural network of interest. Suppose now that the teacher and the neural network are both exposed to a training vector; by virtue of its built-in knowledge, the teacher is able to provide the neural network with a desired response for that training vector. Hence the desired response
represents the optimum action to be performed by the neural network. The network parameters
such as the weights and the thresholds are chosen arbitrarily and are updated during the training
procedure to minimize the difference between the desired and the estimated signal. This updation
is carried out iteratively in a step-by-step procedure with the aim of eventually making the neural
network emulate the teacher. In this way knowledge of the environment available to the teacher
is transferred to the neural network. When this condition is reached, we may then dispense with
the teacher and let the neural network deal with the environment completely by itself. This is the
form of supervised learning.
The update equations for the weights are derived as in the LMS algorithm:

w_j(n+1) = w_j(n) + Δw_j(n)        (3.2)

where Δw_j(n) is the change in w_j in the nth iteration.
(B) Unsupervised Learning:
In unsupervised learning or self-supervised learning there is no teacher to oversee the learning process; rather, provision is made for a task-independent measure of the quality of the representation that the network is required to learn, and the free parameters of the network are optimized with respect to that measure. Once the network has become tuned to the statistical regularities of the input data, it develops the ability to form internal representations for encoding features of the input and thereby to create new classes automatically. In this learning
the weights and biases are updated in response to network input only. There are no desired
outputs available. Most of these algorithms perform some kind of clustering operation. They
learn to categorize the input patterns into some classes.
3.3. MULTILAYER PERCEPTRON
In the multilayer perceptron (MLP), the input signal propagates through the network in a
forward direction, on a layer-by-layer basis. This network has been applied successfully to solve
some difficult problems by training in a supervised manner with a highly popular algorithm
known as the error back-propagation algorithm [3.1,3.9]. The scheme of MLP using four layers
is shown in Fig.3.2. Here x_i (i = 1, 2, ..., N) represent the inputs to the network, f_j and g_k represent the outputs of the two hidden layers, and y_l represents the output of the final layer of the neural network. The connecting weights between the input and the first hidden layer, between the first and the second hidden layer, and between the second hidden layer and the output layer are represented by w_ij, w_jk and w_kl respectively.

Fig.3.2 Structure of multilayer perceptron

If P1 is the number of neurons in the first hidden layer, each element of the output vector of the first hidden layer may be calculated as

f_j = φ_j( Σ_{i=1}^{N} w_ij x_i + α_j ),   j = 1, 2, 3, ..., P1        (3.3)

where α_j is the threshold of the neurons of the first hidden layer, N is the number of inputs and φ(.) is the nonlinear activation function of the first hidden layer, chosen from Table 3.1. The time index n has been dropped to make the equations simpler. Let P2 be the number of neurons in the second hidden layer. The output of this layer is represented as

g_k = φ_k( Σ_{j=1}^{P1} w_jk f_j + β_k ),   k = 1, 2, 3, ..., P2        (3.4)

where β_k is the threshold of the neurons of the second hidden layer. The output of the final output layer can be calculated as

y_l = φ_l( Σ_{k=1}^{P2} w_kl g_k + γ_l ),   l = 1, 2, 3, ..., P3        (3.5)

where γ_l is the threshold of the neurons of the final layer and P3 is the number of neurons in the output layer. The output of the MLP may therefore be expressed as

y_l = φ_l( Σ_{k=1}^{P2} w_kl φ_k( Σ_{j=1}^{P1} w_jk φ_j( Σ_{i=1}^{N} w_ij x_i + α_j ) + β_k ) + γ_l )        (3.6)
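The forward computation (3.3)-(3.6) for a small 2-3-2-1 network can be sketched as follows; the random initial weights, zero thresholds and tanh activations are assumptions of this example.

import numpy as np

rng = np.random.default_rng(1)
W1, a1 = rng.standard_normal((3, 2)) * 0.1, np.zeros(3)   # input -> first hidden layer
W2, a2 = rng.standard_normal((2, 3)) * 0.1, np.zeros(2)   # first -> second hidden layer
W3, a3 = rng.standard_normal((1, 2)) * 0.1, np.zeros(1)   # second hidden -> output layer

def mlp_forward(x):
    f = np.tanh(W1 @ x + a1)      # first hidden layer output, cf. (3.3)
    g = np.tanh(W2 @ f + a2)      # second hidden layer output, cf. (3.4)
    y = np.tanh(W3 @ g + a3)      # network output, cf. (3.5)
    return f, g, y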
3.3.1. Backpropagation Algorithm
An MLP network with 2-3-2-1 neurons (2, 3, 2 and 1 denote the number of neurons in the input
layer, the first hidden layer, the second hidden layer and the output layer respectively) with the
back-propagation (BP) learning algorithm, is depicted in Fig.3.3. The parameters of the neural
network can be updated in both sequential and batch mode of operation. In BP algorithm,
initially the weights and the thresholds are initialized as very small random values. The
intermediate and the final outputs of the MLP are calculated by using (3.3), (3.4) and (3.5) respectively.
The final output y_l(n) at the output of neuron l is compared with the desired output d_l(n), and the resulting error signal e_l(n) is obtained as

e_l(n) = d_l(n) - y_l(n)        (3.7)

The instantaneous value of the total error energy is obtained by summing the error contributions over all neurons in the output layer, that is

ξ(n) = (1/2) Σ_{l=1}^{P3} e_l^2(n)        (3.8)

where P3 is the number of neurons in the output layer.
This error signal is used to update the weights and thresholds of the hidden layers as well as the
output layer. The reflected error components at each of the hidden layers are computed using the errors of the last layer and the connecting weights between the hidden and the last layer, and the error obtained at this stage is used to update the weights between the input and the hidden layer. The thresholds are also updated in a similar manner as the corresponding connecting weights.
The weights and the thresholds are updated in an iterative method until the error signal becomes
minimum. For measuring the degree of matching, the Mean Square Error (MSE) is taken as a
performance measure. The updated weights are

w_kl(n+1) = w_kl(n) + Δw_kl(n)        (3.9)
w_jk(n+1) = w_jk(n) + Δw_jk(n)        (3.10)
w_ij(n+1) = w_ij(n) + Δw_ij(n)        (3.11)

where Δw_kl(n), Δw_jk(n) and Δw_ij(n) are the changes in the weights of the second hidden layer-to-output layer, first hidden layer-to-second hidden layer and input layer-to-first hidden layer connections respectively. For the output layer,

Δw_kl(n) = -2μ ∂ξ(n)/∂w_kl(n) = -2μ e_l(n) ∂e_l(n)/∂w_kl(n) = 2μ e_l(n) φ_l'(.) g_k(n)        (3.12)

where μ is the convergence coefficient (0 ≤ μ ≤ 1). Similarly, Δw_jk(n) and Δw_ij(n) can be computed by propagating the error back through the connecting weights. The thresholds of each layer can be updated in a similar manner, i.e.

γ_l(n+1) = γ_l(n) + Δγ_l(n)        (3.13)
β_k(n+1) = β_k(n) + Δβ_k(n)        (3.14)
α_j(n+1) = α_j(n) + Δα_j(n)        (3.15)

where Δγ_l(n), Δβ_k(n) and Δα_j(n) are the changes in the thresholds of the output, second hidden and first hidden layers respectively. The change in threshold for the output layer is represented as

Δγ_l(n) = -2μ ∂ξ(n)/∂γ_l(n) = 2μ e_l(n) φ_l'(.)        (3.16)
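Putting (3.7)-(3.16) together, one training iteration of the 2-3-2-1 network sketched earlier might look as follows. The learning rate mu and the use of the tanh derivative (1 - output^2) are assumptions of this illustration rather than prescriptions of the thesis.

import numpy as np

def bp_step(x, d, W1, a1, W2, a2, W3, a3, mu=0.1):
    f = np.tanh(W1 @ x + a1)                  # forward pass, cf. (3.3)
    g = np.tanh(W2 @ f + a2)                  # (3.4)
    y = np.tanh(W3 @ g + a3)                  # (3.5)
    e = d - y                                 # error signal, (3.7)
    # local gradients obtained by reflecting the error back through the layers
    d3 = e * (1 - y ** 2)
    d2 = (W3.T @ d3) * (1 - g ** 2)
    d1 = (W2.T @ d2) * (1 - f ** 2)
    # weight and threshold updates, cf. (3.9)-(3.16)
    W3 += mu * np.outer(d3, g); a3 += mu * d3
    W2 += mu * np.outer(d2, f); a2 += mu * d2
    W1 += mu * np.outer(d1, x); a1 += mu * d1
    return 0.5 * np.sum(e ** 2)               # instantaneous error energy, (3.8)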
Chapter 4
RADIAL BASIS FUNCTION NETWORKS

4. RADIAL BASIS FUNCTION NETWORKS
4.1. INTRODUCTION
Radial Basis Function Networks (RBFN) are multilayer feed-forward neural networks
consisting of one input layer, one hidden layer and one output layer with linear weights as shown
in Fig.4.1. The function of the hidden layer is to perform a non-linear transformation of the input space. The hidden layer typically employs an activation function which is a non-linear function of the distance between an input point and the corresponding center decided by the hidden space, or rather, of the Euclidean norm of the difference between the input point and the center. These activation functions, which are real-valued with values depending upon the radial distance of a point from the origin or a center, are called radial basis functions, and the networks using them are hence called Radial Basis Function Networks (RBFNs). The hidden space is typically of higher dimensionality than the input space, in line with Cover's theorem (1965), which states that a complicated pattern classification problem that is non-linearly separable is more likely to be linearly separable if it is cast into a high-dimensional space rather than a low-dimensional one. The output layer, which contains linear weights, performs a linear regression to predict the desired targets. The structure is inspired by biological receptive fields to perform function mappings. The weights of the output layer are adapted via supervised learning.
4.2. RBFNN STRUCTURE
Fig.4.1 Structure of RBFNN
As shown in the figure, the input vector x of dimension M is presented at the input layer. The hidden layer contains the radial basis functions that perform the nonlinear mapping. There are K nodes, so its dimensionality is K, with K > M. Each node has a center vector t_k. The output layer contains the linear weights W = [ w0 w1 ... wK ]^T that perform the linear regression. The input-output mapping is given by the following equation:

y = w_0 + Σ_{k=1}^{K} w_k φ( ||x - t_k|| )        (4.1)
4.2.1. VARIOUS RADIAL BASIS FUNCTIONS
A radial basis function (RBF) is a real-valued function whose value depends only on the distance from the origin, so that φ(x) = φ( ||x|| ), or alternatively on the distance from some other point t, called a center, so that φ(x, t) = φ( ||x - t|| ).

Common RBF types are:

1. Multiquadric
φ(r) = ( r^2 + c^2 )^(1/2)   for c > 0 and r = ||x - t||        (4.2)

2. Inverse multiquadric
φ(r) = 1 / ( r^2 + c^2 )^(1/2)   for c > 0 and r = ||x - t||        (4.3)

3. Gaussian
φ(r) = exp( -r^2 / (2σ^2) )   for σ > 0 and r = ||x - t||        (4.4)

Fig.4.2 The Gaussian function
In our context,

φ( ||x - t_k|| ) = exp( -||x - t_k||^2 / (2σ^2) )        (4.5)

where x is the input vector, t_k is the center vector, ||x - t_k|| is the Euclidean distance between x and t_k, and σ is the width of the Gaussian function. The Gaussian function is the most popular amongst the above.
As can be seen from the form of the radial basis functions, the multiquadric function increases
monotonically with r, while the inverse multiquadric and Gaussian functions decrease with
increasing r. Further, in the case of a Gaussian kernel the rate at which the output decreases can
be controlled by varying the width \sigma. This means the output of the RBFN will decrease, and
tend to zero, when the input point is far from the centre if we use Gaussian or inverse multiquadric
functions, and will increase if we use the multiquadric. So, theoretically speaking, an RBFN with
the multiquadric function is suited to extrapolation, whereas an RBFN with the inverse multiquadric
or Gaussian function is suited to interpolation. It is noteworthy that the most commonly used radial
basis function is the Gaussian function, so we could safely say that the RBFN is good for
interpolation. It should be noted that we do not consider the Regularized Radial Basis Function
Network, which takes the same number of centers as the number of input points in the training set;
this is computationally very expensive even for moderately large training sets. We rather discuss
in detail the Generalized Radial Basis Function Network (GRBFN), which has fewer centers than
input points. The number and location of these centers are chosen strategically so that function
approximation and system identification problems can be solved with more precision and less
computational complexity. Henceforth, by RBFNN we refer specifically to the GRBFNN.
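As a minimal sketch of the mapping in (4.1) with the Gaussian kernel (4.5), and not a prescribed
implementation, the forward pass of such a network can be written as follows; the names
rbfn_output, centers, widths and weights are illustrative.

import numpy as np

def rbfn_output(x, centers, widths, weights):
    # Forward pass of a Gaussian RBFN, eq. (4.1) with kernel (4.5).
    # x: input vector (M,); centers: (K, M); widths: sigma_k (K,);
    # weights: [w0, w1, ..., wK] of shape (K+1,), w0 being the bias.
    r = np.linalg.norm(x - centers, axis=1)        # Euclidean distances ||x - c_k||
    phi = np.exp(-r**2 / (2.0 * widths**2))        # hidden-layer outputs
    return weights[0] + phi @ weights[1:]          # bias plus linear combination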
4.3. LEARNING STRATEGIES APPLIED TO GRBFNNs
Like a multilayer perceptron, RBFN has universal approximation ability. The advantages of
RBFN are linearity in parameters and the availability of fast and efficient training methods.
RBFN learns to approximate the desired input-output map represented by the training data
{(x_i, d_i)}, where x_i is the input vector and d_i is the desired response (target), i = 1, 2, ..., N.
A number of learning methods exist to approximate the desired input-output maps; by a learning
method we mean an efficient selection of the centers together with a method to update the linear
weights.
4.3.1. Fixed Centers Selected at Random
In this learning method, the RBFs of the hidden units are fixed; the centers are not updated. The
locations of the centers may be chosen randomly from the training data set. Different values of the
centers and widths can be used for each radial basis function, which requires experimentation with
the training data. Only the output layer weights need to be learned, and their values are obtained
easily by the pseudo-inverse method. This method is apparently very simple, but to produce results
with a satisfactory level of performance it requires a large training set and rigorous
experimentation on the training data.
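A minimal sketch of this method, assuming Gaussian kernels with a common width and centers drawn at
random from the training points; the function and variable names are illustrative, and only the
linear output weights (including a bias w0) are computed, via the pseudo-inverse of the hidden-layer
design matrix.

import numpy as np

def train_fixed_centers(X, d, K, sigma, rng=np.random.default_rng(0)):
    # Fixed-centers learning: pick K centers at random from the training
    # inputs X (N, M) and solve the output weights for targets d (N,)
    # by the pseudo-inverse (least-squares) solution.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    Phi = np.hstack([np.ones((len(X), 1)), np.exp(-dists**2 / (2 * sigma**2))])
    weights = np.linalg.pinv(Phi) @ d              # [w0, w1, ..., wK]
    return centers, weights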
4.3.2 Self-organized Selection of Centers
Self-organized selection of centers employs a hybrid learning approach which combines a
self-organized learning algorithm based on the K-means clustering algorithm and a supervised
learning algorithm based on the stochastic gradient. The former is used to determine the centers of
the Gaussian functions, while the latter is employed to adjust the output weights. The number of
centers depends on the number of clusters in the data, or it may be left to the user's discretion as
an arbitrary choice.
K-means clustering algorithm is used to cluster data into k number of clusters. Specifically, this
algorithm places centers of radial basis function in the input space area where the data are
significant. The K-means clustering algorithm proceeds as follows:

1. Initialization: randomly select the initial center values c_k(0); the only requirement is that
   c_k(0) be different for each k = 1, 2, ..., K. It is suggested that the Euclidean norm of each
   initial center be kept sufficiently small.

2. Sampling: draw a sample vector u from the input space with a certain probability. The vector u
   represents the input applied to the RBFN.

3. Similarity matching: find the index k(u) of the winning center at the nth iteration, i.e. the
   center with minimum Euclidean distance from the sample:

   k(u) = \arg\min_k ||u(n) - c_k(n)||, \quad k = 1, 2, \ldots, K                               (4.6)

4. Updating: adjust the center positions according to

   c_k(n+1) = c_k(n) + \eta \, [u(n) - c_k(n)] \quad \text{if } k = k(u), \qquad
   c_k(n+1) = c_k(n) \quad \text{otherwise}                                                     (4.7)

The spread or width of the Gaussian functions is determined by taking \sigma = d_{max}/\sqrt{2K},
where d_{max} is the maximum distance between the centers and K is the number of hidden nodes of the
RBFN. The weights are then updated by supervised learning using LMS.
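The center-selection loop above, together with the width rule sigma = d_max / sqrt(2K), might be
sketched as follows; the learning rate, iteration count and function name are illustrative
assumptions, and the subsequent LMS update of the output weights is not shown here.

import numpy as np

def kmeans_centers(X, K, eta=0.1, n_iter=1000, rng=np.random.default_rng(0)):
    # Online K-means selection of RBF centers, following steps 1-4 above.
    centers = 0.01 * rng.standard_normal((K, X.shape[1]))      # step 1: small random centers
    for _ in range(n_iter):
        u = X[rng.integers(len(X))]                            # step 2: sample an input vector
        k = np.argmin(np.linalg.norm(u - centers, axis=1))     # step 3: winner, eq. (4.6)
        centers[k] += eta * (u - centers[k])                   # step 4: update, eq. (4.7)
    d_max = max(np.linalg.norm(ci - cj) for ci in centers for cj in centers)
    sigma = d_max / np.sqrt(2 * K)                             # common Gaussian width
    return centers, sigma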
4.3.3. Stochastic Gradient Approach (Supervised Learning)
In this method, the RBF network design takes on its most generalized form. As we know, the RBFN has
three sets of parameters: the centers t_j, the spreads \sigma_j and the output layer weights w_j.
Here all of these parameters, i.e. the centers and spreads of the radial-basis functions and all the
weights of the network, undergo a supervised learning process. A natural candidate is
error-correction learning, using a stochastic gradient descent on the error criterion. The basic
concept of this method is similar to the LMS algorithm.
Algorithm:

We take the cost function

\xi(n) = \tfrac{1}{2} e^2(n), \qquad n = 1, 2, \ldots, N                                        (4.8)

where e(n) is the error signal

e(n) = d(n) - y(n) = d(n) - \sum_{j=1}^{K} w_j(n) \exp\!\left(-\dfrac{||x(n) - t_j(n)||^2}{2\sigma_j^2(n)}\right)   (4.9)

To minimize \xi(n), we use the stochastic gradient descent method. Writing
\varphi_j(n) = \exp(-||x(n) - t_j(n)||^2 / 2\sigma_j^2(n)), the results of the stochastic gradient
approach can be summarized by the following parameter update equations:

w_j(n+1) = w_j(n) + \mu_w \, e(n) \, \varphi_j(n)                                               (4.10)

t_j(n+1) = t_j(n) + \mu_t \, e(n) \, w_j(n) \, \varphi_j(n) \, \dfrac{x(n) - t_j(n)}{\sigma_j^2(n)}    (4.11)

\sigma_j(n+1) = \sigma_j(n) + \mu_\sigma \, e(n) \, w_j(n) \, \varphi_j(n) \, \dfrac{||x(n) - t_j(n)||^2}{\sigma_j^3(n)}   (4.12)

where \mu_w, \mu_t, \mu_\sigma > 0 are the learning rates for the weights, centers and spreads,
respectively.
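A single update step of (4.10)-(4.12), as reconstructed above, might be coded as follows for a
scalar output; the learning-rate values and the variable names are illustrative assumptions.

import numpy as np

def sgd_step(x, d, w, t, sigma, mu_w=0.05, mu_t=0.01, mu_s=0.01):
    # One stochastic-gradient step for a Gaussian RBFN with scalar output.
    # x: input (M,); d: scalar target; w: weights (K,); t: centers (K, M); sigma: widths (K,)
    diff = x - t                                               # (K, M)
    phi = np.exp(-np.sum(diff**2, axis=1) / (2 * sigma**2))    # hidden outputs phi_j(n)
    e = d - w @ phi                                            # error signal, eq. (4.9)
    w_new = w + mu_w * e * phi                                 # eq. (4.10)
    t_new = t + mu_t * e * (w * phi / sigma**2)[:, None] * diff              # eq. (4.11)
    s_new = sigma + mu_s * e * w * phi * np.sum(diff**2, axis=1) / sigma**3  # eq. (4.12)
    return w_new, t_new, s_new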
Chapter 5
WILCOXON LEARNING MACHINES
5. WILCOXON LEARNING MACHINES
5.1 INTRODUCTION
Machine learning, namely learning from examples, has been an active research area for
several decades. Popular and powerful learning machines proposed in the past include artificial
neural networks , generalized radial basis function networks (GRBFNs) ,fuzzy neural networks
(FNNs) and support vector machines (SVMs). They are different in their origins, network
configurations, and objective functions. They have also been successfully applied in many
branches of science and engineering. In statistical terms, the aforementioned learning machines
are nonparametric in the sense that they do not make any assumptions of the functional form,
e.g., linearity, of the discriminant or predictive functions. Two of the above machines, namely
ANNs and GRBFNs, have already been discussed in detail.
Robust smoothing is a central idea in statistics that aims to simultaneously estimate and model
the underlying structure. In statistics, an outlier is an observation that is numerically distant from the
rest of the data. Hence, outliers are data points that are not typical of the rest of the data. Statistics
derived from data sets that include outliers may be misleading. Depending on their location, outliers
may have moderate to severe effects on the regression model. A regressor or a learning machine
is said to be robust if it is not sensitive to outliers in the data.
As is well known in statistics, linear regressors obtained by the rank-based Wilcoxon approach to
linear regression problems are usually robust against (or insensitive to) outliers. It is then
natural to generalize the Wilcoxon approach from linear regression problems to nonparametric
Wilcoxon learning machines for nonlinear regression problems. The prime motivation behind this
thesis is to apply the Wilcoxon approach to the machines studied before (ANN and GRBFN) and to see
how these machines perform in the presence of outliers. We will try to demonstrate that these
Wilcoxon learning machines are robust against outliers.
5.2 WILCOXON NORM
Before investigating the Wilcoxon learning machines, an introduction to the Wilcoxon Norm is
required. To define the Wilcoxon norm of a vector, we need a score function. A score function is a
non-decreasing function \varphi : [0,1] \to \mathbb{R} such that

\int_0^1 \varphi^2(u)\, du < \infty

Usually the score function is standardized such that

\int_0^1 \varphi(u)\, du = 0 \quad \text{and} \quad \int_0^1 \varphi^2(u)\, du = 1

The score associated with the score function \varphi is defined by

a(i) = \varphi\!\left(\dfrac{i}{l+1}\right), \qquad i = 1, \ldots, l

where l is a fixed positive integer. It can be shown that the following function is a pseudonorm
(seminorm) on \mathbb{R}^l:

||v||_W = \sum_{i=1}^{l} a\big(R(v_i)\big)\, v_i = \sum_{i=1}^{l} a(i)\, v_{(i)}                (5.1)

where R(v_i) is the rank of v_i among v_1, \ldots, v_l and v_{(1)} \le v_{(2)} \le \ldots \le v_{(l)}
are the ordered values of v_1, \ldots, v_l. Throughout this thesis, the Wilcoxon score function
\varphi(u) = \sqrt{12}\,(u - 0.5) is used. ||v||_W is called the Wilcoxon norm of the vector v.
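As an illustrative sketch, the norm (5.1) with the score function phi(u) = sqrt(12)(u - 0.5) can be
computed in a few lines; the example at the end checks numerically the seminorm property used later,
namely that adding a constant to every component leaves the value unchanged.

import numpy as np

def wilcoxon_norm(v):
    # Wilcoxon pseudonorm of a vector v, eq. (5.1), with the score
    # function phi(u) = sqrt(12) * (u - 0.5).
    v = np.asarray(v, dtype=float)
    l = len(v)
    ranks = np.argsort(np.argsort(v)) + 1                     # R(v_i), 1-based ranks
    scores = np.sqrt(12.0) * (ranks / (l + 1.0) - 0.5)        # a(R(v_i)) = phi(R(v_i)/(l+1))
    return float(np.sum(scores * v))

v = np.array([0.3, -1.2, 2.5, 0.0])
print(wilcoxon_norm(v), wilcoxon_norm(v + 10.0))              # equal up to floating-point rounding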
5.3 WILCOXON NEURAL NETWORK (WNN)
5.3.1 Neural Network Structure
We consider a three-layered neural network with one input, one hidden and one output layer. This
neural network is for the analysis of a general input-output mapping from n dimensions to p
dimensions, i.e. an input vector of n dimensions is to be mapped to an output of p dimensions. Hence
we consider the following network of n+1 input nodes, m+1 hidden nodes and p output nodes.
Fig.5.1 Wilcoxon Neural Network Structure
Let the input vector be x = [x_1 \; x_2 \; \ldots \; x_n]^T, augmented with x_0 = 1 to provide a bias
input to the hidden layer. Let w^{(1)}_{ji} denote the connection weight from the ith input node to
the input of the jth hidden node. Then the input v_j and output z_j of the jth hidden node are given
by, respectively,

v_j = \sum_{i=0}^{n} w^{(1)}_{ji} x_i, \qquad z_j = \psi(v_j), \qquad j = 1, \ldots, m          (5.2)

where \psi(\cdot) is the activation function of the hidden node. Some commonly used activation
functions are sigmoidal functions, i.e. monotonically increasing S-shaped functions, such as:

- the unipolar logistic function
- the bipolar sigmoidal function
- the hyperbolic tangent function
Let w^{(2)}_{kj} denote the connection weight from the output of the jth hidden node to the input of
the kth output node. Then the input u_k and output y_k of the kth output node are given by,
respectively,

u_k = \sum_{j=0}^{m} w^{(2)}_{kj} z_j, \qquad y_k = \sigma(u_k), \qquad k = 1, \ldots, p        (5.3)

where \sigma(\cdot) is the activation function of the output node and z_0 = 1. For classification
problems, the output activation functions can be chosen as sigmoidal functions, while for regression
problems the output activation functions can be chosen as linear functions with unit slope.

The final output \hat{y}_k of the network is given by

\hat{y}_k = y_k + b_k

where b_k is the bias.

We define the weight matrices

W^{(1)} = [w^{(1)}_{ji}] \in \mathbb{R}^{m \times (n+1)}                                        (5.4a)

W^{(2)} = [w^{(2)}_{kj}] \in \mathbb{R}^{p \times (m+1)}                                        (5.4b)

From (5.2)-(5.4) we have

\hat{y}_k(x) = \sigma\!\left(w^{(2)}_{k0} + \sum_{j=1}^{m} w^{(2)}_{kj} \, \psi\!\left(\sum_{i=0}^{n} w^{(1)}_{ji} x_i\right)\right) + b_k, \qquad k = 1, \ldots, p    (5.5)
We are given a training set \{(x_q, d_q)\}, q = 1, \ldots, Q, where x_q \in \mathbb{R}^n is the qth
input vector and d_q \in \mathbb{R}^p is the corresponding desired output. In the following, we will
use the subscript q to denote the qth example.

In a WNN, the approach is to choose network weights that minimize the Wilcoxon norm of the total
residuals. The residual at the kth output node for the qth example is

e_{kq} = d_{kq} - \hat{y}_k(x_q)                                                                (5.6)

The Wilcoxon norm of the residuals at the kth output node is given by

\Psi_k = \sum_{q=1}^{Q} a\big(R(e_{kq})\big)\, e_{kq} = \sum_{q=1}^{Q} a(q)\, e_{k(q)}          (5.7a)

where R(e_{kq}) is the rank of e_{kq} among e_{k1}, \ldots, e_{kQ} and
e_{k(1)} \le \ldots \le e_{k(Q)} are the ordered values of e_{k1}, \ldots, e_{kQ}. The total
objective to be minimized is

\Psi = \sum_{k=1}^{p} \Psi_k                                                                    (5.7b)
The NN used here is the same as that used in a standard ANN, except for the bias terms at the
outputs. The main reason is that the Wilcoxon norm is not a usual norm but a pseudonorm (seminorm).
In particular, ||v||_W = 0 does not imply v = 0; adding a constant offset to every component of v
leaves the norm unchanged. This means that, without the bias terms, the resulting predictive
function with a small Wilcoxon norm of total residuals may deviate from the true function by
constant offsets.
5.3.2. Learning Algorithm of WNN
Now we introduce an incremental gradient-descent algorithm in which the \Psi_k's are minimized in
sequence. From the definition of \Psi_k in (5.7) together with (5.5), we have

\Psi_k = \sum_{q=1}^{Q} a\big(R(e_{kq})\big)\left[d_{kq} - \sigma(u_{kq}) - b_k\right]          (5.8)

The updating rules for the weights connecting the hidden layer to the output layer and those
connecting the input layer to the hidden layer are

w^{(2)}_{kj} \leftarrow w^{(2)}_{kj} - \eta \, \dfrac{\partial \Psi_k}{\partial w^{(2)}_{kj}}, \qquad
w^{(1)}_{ji} \leftarrow w^{(1)}_{ji} - \eta \, \dfrac{\partial \Psi_k}{\partial w^{(1)}_{ji}}

where \eta > 0 is the learning rate. From (5.8) we have

\dfrac{\partial \Psi_k}{\partial w^{(2)}_{kj}} = - \sum_{q=1}^{Q} a\big(R(e_{kq})\big)\, \sigma'(u_{kq})\, z_{jq}

where \sigma'(\cdot) denotes the total derivative of \sigma(\cdot) with respect to its argument and
z_{jq} is the jth component of the qth hidden output vector. Hence, the updating rule becomes

w^{(2)}_{kj} \leftarrow w^{(2)}_{kj} + \eta \sum_{q=1}^{Q} a\big(R(e_{kq})\big)\, \sigma'(u_{kq})\, z_{jq}    (5.9)

Again,

\dfrac{\partial \Psi_k}{\partial w^{(1)}_{ji}} = - \sum_{q=1}^{Q} a\big(R(e_{kq})\big)\, \sigma'(u_{kq})\, w^{(2)}_{kj}\, \psi'(v_{jq})\, x_{iq}

Hence the updating rule becomes

w^{(1)}_{ji} \leftarrow w^{(1)}_{ji} + \eta \sum_{q=1}^{Q} a\big(R(e_{kq})\big)\, \sigma'(u_{kq})\, w^{(2)}_{kj}\, \psi'(v_{jq})\, x_{iq}    (5.10)

where \psi'(\cdot) denotes the total derivative of \psi(\cdot) with respect to its argument and
x_{iq} is the ith component of the qth input vector.

The bias term b_k is given by the median of the residuals at the kth output node, i.e.

b_k = \operatorname{median}_{q = 1, \ldots, Q}\big(d_{kq} - y_k(x_q)\big)                       (5.11)
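A sketch of one pass of the output-layer rule (5.9) together with the median bias (5.11), assuming a
single output node with a linear output activation (so sigma'(.) = 1); the array names and the
batch-style update are illustrative, and the hidden-layer outputs Z are taken as already computed
from (5.2).

import numpy as np

def wilcoxon_scores(e):
    # Scores a(R(e_q)) = sqrt(12) * (R(e_q)/(Q+1) - 0.5) for the residual vector e.
    Q = len(e)
    ranks = np.argsort(np.argsort(e)) + 1
    return np.sqrt(12.0) * (ranks / (Q + 1.0) - 0.5)

def wnn_output_step(Z, d, w, b, eta=0.01):
    # One pass of rule (5.9) for a single linear output node.
    # Z: hidden outputs for all Q examples (Q, m); d: targets (Q,); w: weights (m,); b: bias.
    e = d - (Z @ w + b)                  # residuals e_q
    a = wilcoxon_scores(e)               # rank-based scores
    w = w + eta * (Z.T @ a)              # eq. (5.9) with sigma'(u) = 1
    b = np.median(d - Z @ w)             # bias as the median of residuals, eq. (5.11)
    return w, b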
5.4 WILCOXON GENERALISED RADIAL BASIS FUNCTION NETWORK (WGRBFN)
The Wilcoxon approach to the GRBFN is similar to the approach used for the ANN. In fact, the
three-layer network considered in Fig. 5.1 can be conceptualized as a GRBFN if we replace the
activation function of the hidden layer by the Gaussian function used in the RBF and take the output
layer activation function as a linear function with unit slope.
Continuing our treatment using Fig. 5.1, we define the centers t_j = [t_{j1} \; \ldots \; t_{jn}]^T
and the variances \sigma^2_{ji}, j = 1, \ldots, m, i = 1, \ldots, n. The predictive function
f_k : \mathbb{R}^n \to \mathbb{R} is a non-linear map given by

f_k(x) = \sum_{j=1}^{m} w_{kj} \exp\!\left(-\sum_{i=1}^{n} \dfrac{(x_i - t_{ji})^2}{2\sigma^2_{ji}}\right) + b_k    (5.12)

Here, w_{kj} is the connection weight between the jth hidden node and the kth output node, t_j is
the center of the jth basis function, \sigma^2_{ji} is the ith variance of the jth basis function,
and b_k is the bias term. This system can also be represented as a feed-forward network with one
input layer of n nodes, one hidden layer of m nodes, and one output layer of p nodes. We also have
bias terms at the output nodes.

Defining, for j = 1, \ldots, m and q = 1, \ldots, Q,

z_{jq} = \exp\!\left(-\sum_{i=1}^{n} \dfrac{(x_{iq} - t_{ji})^2}{2\sigma^2_{ji}}\right)

then from (5.12) we have

\hat{y}_{kq} = \sum_{j=1}^{m} w_{kj}\, z_{jq} + b_k
Suppose we are given the same training set as in Section 5.3. The Wilcoxon norm \Psi_k of the
residuals at the kth output node is the same as defined in Section 5.3. The incremental
gradient-descent algorithm requires that the \Psi_k's be minimized in sequence. By similar
derivations, the weight updating rules are given by

w_{kj} \leftarrow w_{kj} + \eta \sum_{q=1}^{Q} a\big(R(e_{kq})\big)\, z_{jq}

t_{ji} \leftarrow t_{ji} + \eta \sum_{q=1}^{Q} a\big(R(e_{kq})\big)\, w_{kj}\, z_{jq}\, \dfrac{x_{iq} - t_{ji}}{\sigma^2_{ji}}

\sigma_{ji} \leftarrow \sigma_{ji} + \eta \sum_{q=1}^{Q} a\big(R(e_{kq})\big)\, w_{kj}\, z_{jq}\, \dfrac{(x_{iq} - t_{ji})^2}{\sigma^3_{ji}}    (5.13)

where \eta > 0 is the learning rate and the bias term b_k is given by the median of the residuals at
the kth output node, i.e.

b_k = \operatorname{median}_{q = 1, \ldots, Q}\Big(d_{kq} - \sum_{j=1}^{m} w_{kj}\, z_{jq}\Big)
Chapter 6
SIMULATIONS & CONCLUSION
6.1 SIMULATIONS
In this section, we compare the performance of various learning machines on several illustrative
nonlinear regression problems. Emphasis is put particularly on the robustness of the various
learning machines against outliers. The updating rules for the WNN are (5.9) and (5.10), and for the
WGRBFN they are (5.13). It should be pointed out that different parameter settings for the learning
machines might produce different results. The parameters of each learning machine used in the
following simulations may not be the optimal parameters for a given learning problem. This is the
model selection problem, which always exists for a general learning problem. For a "fair"
comparison, similar machines use the same set of parameters in the simulations. Thus, for the ANN
and WNN, we use the same number of hidden nodes and the same activation functions for the hidden
nodes and the output node. Similarly, for the GRBFN and WGRBFN, we use the same kernel function for
both machines.
In each simulation of Examples 1 and 2, the uncorrupted training data set consists of 50 randomly
chosen x-points (training patterns) with the corresponding y-values (target values) evaluated from
the underlying true function. The corrupted training data set is composed of the same x-points as
the corresponding uncorrupted one, but with randomly chosen y-values corrupted by adding random
values drawn from a uniform distribution on [-1, 1]. It is of interest to see what happens as the
noise level and the number of outliers are progressively increased. To this end, 10%, 20%, 30%, and
40% of randomly chosen y-values of the training data points are corrupted.
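The construction of these training sets might be sketched as follows; the function name, the random
seed and the use of the Example 1 target given below are illustrative assumptions.

import numpy as np

def make_training_set(true_fn, x_range, frac_corrupt=0.0, N=50,
                      rng=np.random.default_rng(1)):
    # N random x-points with targets from true_fn; a fraction of the
    # y-values is corrupted by adding uniform noise drawn from [-1, 1].
    x = rng.uniform(x_range[0], x_range[1], size=N)
    y = true_fn(x)
    n_out = int(round(frac_corrupt * N))
    idx = rng.choice(N, size=n_out, replace=False)   # randomly chosen points to corrupt
    y[idx] += rng.uniform(-1.0, 1.0, size=n_out)     # these become the outliers
    return x, y

# e.g. the target of Example 1 below, with 30% of the y-values corrupted
sinc = lambda x: np.sinc(x / np.pi)                  # sin(x)/x with value 1 at x = 0
x_train, y_train = make_training_set(sinc, (-10.0, 10.0), frac_corrupt=0.30)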
PERFORMANCE COMPARISON OF ANN & WNN
Example 1:

The true function is given by

f(x) = \dfrac{\sin x}{x} \; \text{ for } x \neq 0, \qquad f(0) = 1, \qquad x \in [-10, 10]
In this example, we compare the performance of the ANN and the WNN. For both networks, the number of
hidden nodes is 30, the activation functions of the hidden nodes are bipolar sigmoidal functions,
and the activation function of the output node is a linear function with unit slope. Since the input
and output are both one-dimensional, referring to Fig. 5.1 we have n = 1, m = 30, p = 1. The results
are plotted in the figures that follow.
All the figures have the input values x on the x-axis and the corresponding estimates on the y-axis.
Fig.6.1 Performance of ANN & WNN - uncorrupted data
Fig.6.2 Performance of ANN & WNN - 10% corrupted data
Fig.6.3 Performance of ANN & WNN - 20% corrupted data
Fig.6.4 Performance of ANN & WNN - 30% corrupted data
Fig.6.5 Performance of ANN & WNN - 40% corrupted data

Example 2:
The true function is given by

f(x) = 1.1\,(1 - x + 2x^2)\, e^{-x^2/2}, \qquad x \in [-5, 5]
Fig.6.6 Performance of ANN & WNN - uncorrupted data
Fig.6.7 Performance of ANN & WNN - 10% corrupted data
Fig.6.8 Performance of ANN & WNN - 20% corrupted data
Fig.6.9 Performance of ANN & WNN - 30% corrupted data
Fig.6.10 Performance of ANN & WNN - 40% corrupted data

PERFORMANCE COMPARISON OF GRBFN & WGRBFN
Example 2:
The true function is given by the Hermite function

f(x) = 1.1\,(1 - x + 2x^2)\, e^{-x^2/2}, \qquad x \in [-5, 5]
In this example, we compare the performance of the GRBFN and WGRBFN. For both networks, the number
of hidden nodes is 20, which is somewhat arbitrary. The range of the training targets is
[0.0002, 2.7157]. The simulation results for the GRBFN and WGRBFN are shown in the following
figures.
Fig.6.11 Performance of GRBFN & WGRBFN - uncorrupted data
Fig.6.12 Performance of GRBFN & WGRBFN - 10% corrupted data
Fig.6.13 Performance of GRBFN & WGRBFN - 20% corrupted data
Fig.6.14 Performance of GRBFN & WGRBFN - 30% corrupted data
Fig.6.15 Performance of GRBFN & WGRBFN - 40% corrupted data

6.2 CONCLUSION
The simulation results for the ANN and WNN are shown in Figs. 6.1-6.10. Two examples are considered.
The range of the training targets is [0.2171, 0.99879] in the first example and [0.0002, 2.7157] in
the second. For the uncorrupted data shown in Figs. 6.1 and 6.6, the WNN performs better than the
ANN and does not overfit the training data. For the corrupted data shown in Figs. 6.2-6.5 and
Figs. 6.7-6.10, with progressively increased corruption, the WNN estimates are affected to a much
lesser extent by the corrupted outliers and outperform the ANN estimates.
For the simulations of the GRBFN and WGRBFN, we take only one example function to be approximated,
the Hermite function. For the uncorrupted data shown in Fig. 6.11, the GRBFN and WGRBFN estimates
are almost indistinguishable from the true function and do not overfit the training data. For the
corrupted data shown in Figs. 6.12-6.15, with progressively increased corruption, the WGRBFN
estimates are robust to outliers: they are affected to a much lesser extent by the corrupted
outliers and outperform the GRBFN estimates.
This thesis demonstrates the Wilcoxon approach to nonlinear learning problems for ANNs and GRBFNs.
These networks provide alternative learning machines for general nonlinear learning problems.
Simple weight updating rules based on gradient descent were derived. Numerical examples were
provided to compare the robustness against outliers of the standard learning machines and the
Wilcoxon learning machines. The simulation results showed that the Wilcoxon learning machines have
good robustness against outliers.
The computational performance of the Wilcoxon learning machines is not discussed in this study. The
reason is that it is still very time-consuming to obtain numerical solutions of the Wilcoxon
learning problems, so it makes little sense at this point to present computational performance data
for the Wilcoxon learning machines. The search for more efficient learning rules for Wilcoxon
learning machines is a prospect for future work. We are in the process of developing a novel
learning machine based on the FLANN using the Wilcoxon approach; the simulations for this machine
are not yet ready. We also attempted to develop an LMS-type algorithm using the Wilcoxon approach
(which we might call WLMS) for linear regression problems. It has not been included in this thesis
for brevity, and because the algorithm is still very computationally expensive.
It is true that illustrative examples do not provide a rigorous proof of the robustness of the
Wilcoxon learning machines. The results reported in this thesis provide only a starting point, a
preliminary study of Wilcoxon learning machines. A similar approach can be applied to other learning
machines. As a final thought, it seems only a matter of time before we see Wilcoxon norms, and
possibly other novel methodologies, being applied for outlier rejection and robustness. In the
literature, much has been written on increasing the robustness of various learning machines against
outliers; Wilcoxon learning machines could well be the answer to this very old problem.
6.3 REFERENCES
1. Hsieh, Lin, and Jeng, "Preliminary study on Wilcoxon learning machines," IEEE Transactions on
   Neural Networks, vol. 19, no. 2, Feb. 2008.
2. K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using
   neural networks," IEEE Transactions on Neural Networks, vol. 1, no. 1, March 1990.
3. B. Riyanto, L. Anggono, and K. Uchida, "Filtered-X radial basis function neural networks for
   active noise control," Proc. ITB Eng. Science, vol. 36 B, no. 1, 2004, pp. 21-42.
4. B. Chen, J. Hu, H. Li, and Z. Sun, "A joint stochastic gradient algorithm and its application to
   system identification with RBF networks," Proceedings of the 6th World Congress on Intelligent
   Control and Automation, June 21-23, 2006, Dalian, China.
5. S. Haykin, "Radial basis function networks," Chapter 20, pp. 855-874, Adaptive Filter Theory,
   PHI Publications.
6. S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd edition, Pearson Education.
7. S. Kumar, Neural Networks: A Classroom Approach, Tata McGraw Hill.
8. H. J. Cochofel, D. Wooten, and J. Principe, "A neural network environment for adaptive inverse
   control."
9. M. T. Hagan, H. B. Demuth, and O. De Jesús, "An introduction to the use of neural networks in
   control systems."