Linköping Studies in Science and Technology. Dissertations
No. 531
Learning Multidimensional Signal
Processing
Magnus Borga
Department of Electrical Engineering
Linköping University, S-581 83 Linköping, Sweden
Linköping 1998
Learning Multidimensional Signal Processing
© 1998 Magnus Borga
Department of Electrical Engineering
Linköping University
S-581 83 Linköping
Sweden
ISBN 91-7219-202-X
ISSN 0345-7524
Abstract
The subject of this dissertation is to show how learning can be used for multidimensional signal processing, in particular computer vision. Learning is a wide
concept, but it can generally be defined as a system’s change of behaviour in order
to improve its performance in some sense.
Learning systems can be divided into three classes: supervised learning, reinforcement learning and unsupervised learning. Supervised learning requires a
set of training data with correct answers and can be seen as a kind of function
approximation. A reinforcement learning system does not require a set of answers. It learns by maximizing a scalar feedback signal indicating the system’s
performance. Unsupervised learning can be seen as a way of finding a good representation of the input signals according to a given criterion.
In learning and signal processing, the choice of signal representation is a central issue. For high-dimensional signals, dimensionality reduction is often necessary. It is then important not to discard useful information. For this reason,
learning methods based on maximizing mutual information are particularly interesting.
A properly chosen data representation allows local linear models to be used in
learning systems. Such models have the advantage of having a small number of
parameters and can for this reason be estimated by using relatively few samples.
An interesting method that can be used to estimate local linear models is canonical correlation analysis (CCA). CCA is strongly related to mutual information.
The relation between CCA and three other linear methods is discussed. These
methods are principal component analysis (PCA), partial least squares (PLS) and
multivariate linear regression (MLR). An iterative method for CCA, PCA, PLS
and MLR, in particular low-rank versions of these methods, is presented.
A novel method for learning filters for multidimensional signal processing
using CCA is presented. By showing the system pairs of signals, the filters can be
adapted to detect certain features and to be invariant to others. A new method for
local orientation estimation has been developed using this principle. This method
is significantly less sensitive to noise than previously used methods.
Finally, a novel stereo algorithm is presented. This algorithm uses CCA and
phase analysis to detect the disparity in stereo images. The algorithm adapts filters
in each local neighbourhood of the image in a way which maximizes the correlation between the filtered images. The adapted filters are then analysed to find
the disparity. This is done by a simple phase analysis of the scalar product of
the filters. The algorithm can even handle cases where the images have different scales. The algorithm can also handle depth discontinuities and give multiple
depth estimates for semi-transparent images.
To Maria
Acknowledgements
This thesis is the result of many years' work, and it would never have been possible for me to accomplish it without the help, support and encouragement from a lot of people.
First of all, I would like to thank my supervisor, associate professor Hans Knutsson.
His enthusiastic engagement in my research and his never-ending stream of ideas have been absolutely essential for the results presented here. I am very grateful that
he has spent so much time with me discussing different problems ranging from
philosophical issues down to minute technical details.
I would also like to thank professor Gösta Granlund for giving me the opportunity
to work in his research group and for managing a laboratory it is a pleasure to
work in.
Many thanks to present and past members of the Computer Vision Laboratory for
being good friends as well as helpful colleagues.
In particular, I would like to thank Dr. Tomas Landelius, with whom I have been working very closely in most of the research presented here as well as in the (not
yet finished) systematic search for the optimum malt whisky. His comments on
large parts of the early versions of the manuscript have been very valuable.
I would also like to thank Morgan Ulvklo and Dr. Mats Andersson for constructive comments on parts of the manuscript. Dr. Mats Andersson's help with a lot of
technical details ranging from the design of quadrature filters to welding is also
very appreciated.
Finally, I would like to thank my wife Maria for her love, support and patience.
Maria should also have great credit for proof-reading my manuscript and helping
me with the English. All remaining errors, introduced by final changes, are to be blamed on me.
The research presented in this thesis was sponsored by NUTEK (Swedish National Board for Industrial and Technical Development) and TFR (Swedish Research Council for Engineering Sciences), which is gratefully acknowledged.
Contents

1 Introduction
    1.1 Contributions
    1.2 Outline
    1.3 Notation

I Learning

2 Learning systems
    2.1 Learning
    2.2 Machine learning
    2.3 Supervised learning
        2.3.1 Gradient search
        2.3.2 Adaptability
    2.4 Reinforcement learning
        2.4.1 Searching for higher rewards
        2.4.2 Generating the reinforcement signal
        2.4.3 Learning in an evolutionary perspective
    2.5 Unsupervised learning
        2.5.1 Hebbian learning
        2.5.2 Competitive learning
        2.5.3 Mutual information based learning
    2.6 Comparisons between the three learning methods
    2.7 Two important problems
        2.7.1 Perceptual aliasing
        2.7.2 Credit assignment

3 Information representation
    3.1 The channel representation
    3.2 Neural networks
    3.3 Linear models
        3.3.1 The prediction matrix memory
    3.4 Local linear models
    3.5 Adaptive model distribution
    3.6 Experiments
        3.6.1 Q-learning with the prediction matrix memory
        3.6.2 TD-learning with local linear models
        3.6.3 Discussion

4 Low-dimensional linear models
    4.1 The generalized eigenproblem
    4.2 Principal component analysis
    4.3 Partial least squares
    4.4 Canonical correlation analysis
        4.4.1 Relation to mutual information and ICA
        4.4.2 Relation to SNR
    4.5 Multivariate linear regression
    4.6 Comparisons between PCA, PLS, CCA and MLR
    4.7 Gradient search on the Rayleigh quotient
        4.7.1 PCA
        4.7.2 PLS
        4.7.3 CCA
        4.7.4 MLR
    4.8 Experiments
        4.8.1 Comparisons to optimal solutions
        4.8.2 Performance in high-dimensional signal spaces

II Applications in computer vision

5 Computer vision
    5.1 Feature hierarchies
    5.2 Phase and quadrature filters
    5.3 Orientation
    5.4 Frequency
    5.5 Disparity

6 Learning feature descriptors
    6.1 Experiments
        6.1.1 Learning quadrature filters
        6.1.2 Combining products of filter outputs
    6.2 Discussion

7 Disparity estimation using CCA
    7.1 The canonical correlation analysis part
    7.2 The phase analysis part
        7.2.1 The signal model
        7.2.2 Multiple disparities
        7.2.3 Images with different scales
    7.3 Experiments
        7.3.1 Discontinuities
        7.3.2 Scaling
        7.3.3 Semi-transparent images
        7.3.4 An artificial scene
        7.3.5 Real images
    7.4 Discussion

8 Epilogue
    8.1 Summary and discussion
    8.2 Future research

A Definitions
    A.1 The vec function
    A.2 The mtx function
    A.3 Correlation for complex variables

B Proofs
    B.1 Proofs for chapter 2
        B.1.1 The differential entropy of a multidimensional Gaussian variable
    B.2 Proofs for chapter 3
        B.2.1 The constant norm of the channel set
        B.2.2 The constant norm of the channel derivatives
        B.2.3 Derivation of the update rule for the prediction matrix memory
        B.2.4 One frequency spans a 2-D plane
    B.3 Proofs for chapter 4
        B.3.1 Orthogonality in the metrics A and B
        B.3.2 Linear independence
        B.3.3 The range of r
        B.3.4 The second derivative of r
        B.3.5 Positive eigenvalues of the Hessian
        B.3.6 The partial derivatives of the covariance
        B.3.7 The partial derivatives of the correlation
        B.3.8 Invariance with respect to linear transformations
        B.3.9 Relationship between mutual information and canonical correlation
        B.3.10 The partial derivatives of the MLR-quotient
        B.3.11 The successive eigenvalues
    B.4 Proofs for chapter 7
        B.4.1 Real-valued canonical correlations
        B.4.2 Hermitian matrices
Chapter 1
Introduction
This thesis deals with two research areas: learning and multidimensional signal
processing. A typical example of a multidimensional signal is an image. An image is usually described in terms of pixel (an abbreviation for "picture element") values. A monochrome TV image has a resolution of approximately 700 × 500 pixels, which means that it is a 350,000-dimensional signal. In computer vision, we try to instruct a computer how to extract the relevant information from this huge signal in order to solve a certain task.
This is not an easy problem! The information is extracted by estimating certain
local features in the image. What is “relevant information” depends, of course,
on the task. To describe what features to estimate and how to estimate them is
possible only for highly specific tasks, which, for a human, seem to be trivial in
most cases. For more general tasks, we can only define these feature detectors on
a very low level, such as line and edge detectors. It is commonly accepted that
it is difficult to design higher-level feature detectors. In fact, the difficulty arises
already when trying to define what features are important to estimate.
Nature has solved this problem by making the visual system adaptive. In
other words, we learn how to see. We know that many of the low-level feature
detectors used in computer vision are similar to those found in the mammalian
visual system (Pollen and Ronner, 1983). Since we generally do not know how to
handle multidimensional signals on a high level and since our solutions on a low
level are similar to those of nature, it seems rational also on a higher level to use
nature’s solution: learning.
Learning in artificial systems is often associated with artificial neural networks. Note, however, that the term “neural network” refers to a specific type
of architecture. In this work we are more interested in the learning capabilities
than the hardware implementation. What we mean by “learning systems” is discussed in the next chapter.
The learning process can be seen as a way of finding adaptive models to represent relevant parts of the signal. We believe that local low-dimensional linear
models are sufficient and efficient for representation in many systems. The reason
for this is that most real-world signals are (at least piecewise) continuous due to
the dynamics of the world that generates them. Therefore, it can be justified to look
at some criteria for choosing low-dimensional linear models.
In the field of signal processing there seems to be a growing interest in methods related to independent component analysis. In the learning and neural network
community, methods based on maximizing mutual information are receiving more attention. These two methods are related to each other and they are also related to a
statistical method called canonical correlation analysis, which can be seen as a linear special case of maximum mutual information. Canonical correlation analysis
is also related to principal component analysis, partial least squares and multivariate linear regression. These four analysis methods can be seen as different choices
of linear models based on different optimization criteria.
Canonical correlation turns out to be a useful tool in several computer vision
problems as a new way of constructing and combining filters. Some examples of
this are presented in this thesis. We believe that this approach provides a basis
for new efficient methods in multidimensional signal processing in general and in
computer vision in particular.
1.1 Contributions
The main contributions in this thesis are presented in chapters 3, 4, 6 and 7. Chapters 2 and 5 should be seen as introductions to learning systems and computer
vision respectively. The most important individual contributions are:
- A unified framework for principal component analysis (PCA), partial least squares (PLS), canonical correlation analysis (CCA) and multivariate linear regression (MLR) (chapter 4).

- An iterative gradient search algorithm that successively finds the eigenvalues and the corresponding eigenvectors of the generalized eigenproblem. The algorithm can be used for the special cases PCA, PLS, CCA and MLR (chapter 4).

- A method for using canonical correlation for learning feature detectors in high-dimensional signals (chapter 6). By this method, the system can also learn how to combine estimates in a way that is less sensitive to noise than the previously used vector averaging method.

- A stereo algorithm based on canonical correlation and phase analysis that can find correlation between differently scaled images. The algorithm can handle depth discontinuities and estimate multiple depths in semi-transparent images (chapter 7).
The TD-algorithm presented in section 3.6.2 was presented at ICANN’93 in
Amsterdam (Borga, 1993). Most of the contents in chapter 4 have been submitted
for publication in Information Sciences (Borga et al., 1997b, revised for second
review). The canonical correlation algorithm in section 4.7.3 and most of the
contents in chapter 6 were presented at SCIA’97 in Lappeenranta, Finland (Borga
et al., 1997a). Finally, the stereo algorithm in chapter 7 has been submitted to
ICIPS’98 (Borga and Knutsson, 1998).
Large parts of chapter 2, except the section on unsupervised learning (2.5), most of chapter 3 and some of the theory of canonical correlation in chapter 4 were presented in "Reinforcement Learning Using Local Adaptive Models" (Borga, 1995, licentiate thesis).
1.2 Outline
The thesis is divided into two parts. Part I deals with learning theory. Part II
describes how the theory discussed in part I can be applied in computer vision.
In chapter 2, learning systems are discussed. Chapter 2 can be seen as an
introduction and overview of this subject. Three important principles for learning are described: reinforcement learning, unsupervised learning and supervised
learning.
In chapter 3, issues concerning information representation are treated. Linear
models and, in particular, local linear models are discussed and two examples are
presented that use linear models for reinforcement learning.
Four low-dimensional linear models are discussed in chapter 4. They are low-rank versions of principal component analysis, partial least squares, canonical
correlation and multivariate linear regression. All these four methods are related
to the generalized eigenproblem and the solutions can be found by maximizing a
Rayleigh quotient. An iterative algorithm for solving the generalized eigenproblem in general and these four methods in particular is presented.
Chapter 5 is a short introduction to computer vision. It treats the concepts in
computer vision relevant for the remaining chapters.
Chapter 6 shows how canonical correlation can be used for learning models that represent local features in images. Experiments show how this method can
be used for finding filter combinations that decrease the noise-sensitivity compared to vector averaging while maintaining spatial resolution.
In chapter 7, a novel stereo algorithm based on the method from chapter 6 is
presented. Canonical correlation analysis is used to adapt filters in a local image
neighbourhood. The adapted filters are then analysed with respect to phase to get
the disparity estimate. The algorithm can handle differently scaled image pairs
and depth discontinuities. It can also estimate multiple depths in semi-transparent
images.
Chapter 8 is a summary of the thesis and also contains some thoughts on future
research.
Finally, there are two appendices. Appendix A contains definitions, and most of the proofs have been placed in appendix B. In this way, the text is hopefully
easier to follow for the reader who does not want to get too deep into mathematical
details. This also makes it possible to give the proofs space enough to be followed
without too much effort and to include proofs that initiated readers may consider
unnecessary without disrupting the text.
1.3 Notation
Lowercase letters in italics (x) are used for scalars, lowercase letters in boldface
(x) are used for vectors and uppercase letters in boldface (X) are used for matrices.
The transpose of a real-valued vector or a matrix is denoted x^T. The conjugate transpose is denoted x*. The norm ‖v‖ of a vector v is defined by

‖v‖ = √(v* v)

and a "hat" (v̂) indicates a vector with unit length, i.e.

v̂ = v / ‖v‖.

E[·] means the expectation value of a stochastic variable.
Part I
Learning
Chapter 2
Learning systems
The concept of a learning system is central in this dissertation, and in this chapter three different principles of learning are described. Some standard techniques are presented and some important issues related to machine learning are discussed. But
first, what is learning?
2.1 Learning
According to Oxford Advanced Learner’s Dictionary (Hornby, 1989), learning is
to
“gain knowledge or skill by study, experience or being taught.”
Knowledge may be considered as a set of rules determining how to act. Hence,
knowledge can be said to define a behaviour which, according to the same dictionary, is a “way of acting or functioning.” Narendra and Thathachar (1974), two
learning automata theorists, make the following definition of learning:
“Learning is defined as any relatively permanent change in behaviour
resulting from past experience, and a learning system is characterized by its ability to improve its behaviour with time, in some sense
towards an ultimate goal.”
Learning has been a field of study since the end of the nineteenth century.
Thorndike (1898) presented a theory in which an association between a stimulus
and a response is established and this association is strengthened or weakened
depending on the outcome of the response. This type of learning is called operant conditioning. The theory of classical conditioning (Pavlov, 1955) is concerned with the case when a natural reflex to a certain stimulus becomes a response to a second stimulus that has preceded the original stimulus several times.
In the 1930s, Skinner developed Thorndike’s ideas but claimed, as opposed to
Thorndike, that learning was more ”trial and success” than ”trial and error” (Skinner, 1938). These ideas belong to the psychological position called behaviourism.
Since the 1950s, rationalism has gained more interest. In this view, intentions
and abstract reasoning play an important role in learning. In this thesis, however,
there is a more behaviouristic view. The aim is not to model biological systems
or mental processes. The goal is rather to make a machine that produces the desired results. As will be seen, the learning principle called reinforcement learning
discussed in section 2.4 has much in common with Thorndike’s and Skinner’s operant conditioning. Learning theories have been thoroughly described for example
by Bower and Hilgard (1981).
There are reasons to believe that ”learning by doing” is the only way of learning to produce responses or, as stated by Brooks (1986):
“These two processes of learning and doing are inevitably intertwined;
we learn as we do and we do as well as we have learned.”
An example of ”learning by doing” is illustrated in an experiment (Held and
Bossom, 1961; Mikaelian and Held, 1964) where people wearing goggles that
rotated or displaced their fields of view were either walking around for an hour
or wheeled around the same path in a wheel-chair for the same amount of time.
The adaptation to the distortion was then tested. The subjects that had been walking had adapted while the other subjects had not. A similar situation occurs for
instance when you are going somewhere by car. If you have driven to a certain
destination before, instead of being a passenger, you probably will find your way
easier the next time.
2.2 Machine learning
We are used to seeing humans and animals learn, but how does a machine learn?
The answer depends on how knowledge or behaviour is represented in the machine.
Let us consider knowledge to be a rule for how to generate responses to certain stimuli. One way of representing knowledge is to have a table with all stimuli and corresponding responses. Learning would then take place if the system,
through experience, filled in or changed the responses in the table. Another way
of representing knowledge is by using a parameterized model, where the output is
obtained as a given function of the input x and a parameter vector w:
y = f(x, w).    (2.1)
Learning would then be to change the model parameters in order to improve the
performance. This is the learning method used for example in neural networks.
Another way of representing knowledge is to consider the input space and output space together. Examples of this approach are an algorithm by Munro (1987)
and the Q-learning algorithm (Watkins, 1989). Another example is the prediction matrix memory described in section 3.3.1. The combined space of input and
output can be called the decision space, since this is the space in which the combinations of input and output (i.e. stimuli and responses) that constitute decisions
exist. The decision space could be treated as a table in which suitable decisions
are marked. Learning would then be to make or change these markings. Or the
knowledge could be represented in the decision space as distributions describing
suitable combinations of stimuli and responses (Landelius, 1993, 1997):
p(y, x, w),    (2.2)
where, again, y is the response, x is the input signal and w contains the parameters
of a given distribution function. Learning would then be to change the parameters
of these distributions through experience in order to improve some measure of
performance. Responses can then be generated from the conditional probability
function
p(y | x, w).    (2.3)
The issue of representing knowledge is further discussed in chapter 3.
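As a concrete illustration of the two parametric alternatives above, the following sketch contrasts a deterministic parameterized model as in equation 2.1 with a response drawn from a conditional distribution as in equation 2.3. It is only a minimal sketch under assumed forms: the linear map and the Gaussian conditional distribution, as well as all names and numbers, are illustrative choices and not taken from the thesis.

```python
import numpy as np

def f(x, w):
    # Deterministic parameterized model as in equation 2.1; here simply a linear map.
    return w @ x

def sample_response(x, w, sigma=0.1, rng=np.random.default_rng(0)):
    # Response drawn from a conditional distribution p(y | x, w) as in equation 2.3,
    # here assumed to be Gaussian around the deterministic output.
    mean = f(x, w)
    return rng.normal(mean, sigma)

w = np.array([[0.5, -0.2],
              [0.1, 0.8]])      # model parameters
x = np.array([1.0, 2.0])        # input signal (stimulus)
print(f(x, w))                  # deterministic response
print(sample_response(x, w))    # stochastic response
```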
Obviously a machine can learn through experience by changing some parameters in a model or data in a table. But what is the experience and what measure of
performance is the system trying to improve? In other words, what is the system
learning? The answers to these questions depend on what kind of learning we are
talking about. Machine learning can be divided into three classes that differ in the
external feedback to the system during learning:
- Supervised learning
- Reinforcement learning
- Unsupervised learning
The three different principles are illustrated in figure 2.1.
In the following three sections, these three principles of learning are discussed
in more detail. In section 2.6, the relations between the three methods are discussed and it is shown that the differences are not as great as they may seem at
first.
2.3 Supervised learning
Figure 2.1: The three different principles of learning: Supervised learning (a), Reinforcement learning (b) and Unsupervised learning (c).

In supervised learning there is a teacher who shows the system the desired responses for a representative set of stimuli (see figure 2.1). Here, the experience
is pairs of stimuli and desired responses and improving performance means minimizing some error measure, for example the mean squared distance between the
system’s output and the desired output.
Supervised learning can be described as function approximation. The teacher
delivers samples of the function and the algorithm tries, by adjusting the parameters w in equation 2.1 or equation 2.2, to minimize some cost function
E = E[ε],    (2.4)

where E[ε] stands for the expectation of costs ε over the distribution of data. The
instantaneous cost ε depends on the difference between the output of the algorithm and the samples of the function. In this sense, regression techniques can
be seen as supervised learning. In general, the cost function also includes a regularization term. The regularization term prevents the system from what is called
over-fitting. This is important for the generalization capabilities of the system,
i.e. the performance of the system for new data not used for training. In effect,
the regularization term can be compared to the polynomial degree in polynomial
regression.
2.3.1 Gradient search
Most supervised learning algorithms are based on gradient search on the cost
function. Gradient search means that the parameters wi are changed a small step
in the opposite direction of the gradient of the cost function E for each iteration
of the process, i.e.
w_i(t + 1) = w_i(t) − α ∂E/∂w_i,    (2.5)
where the update factor α is used to control the step length. In general, the negative gradient does of course not point exactly towards the minimum of the cost
function. Hence, a gradient search will in general not find the shortest way to the
optimum.
There are several methods to improve the search by using the second-order
partial derivatives (Battiti, 1992). Two well-known methods are Newton’s method
(see for example Luenberger, 1969) and the conjugate-gradient method (Fletcher
and Reeves, 1964). Newton’s method is optimal for quadratic cost functions in the
sense that it, given the Hessian (i.e. the matrix of second order partial derivatives),
can find the optimum in one step. The problem is the need for calculation and
storage of the Hessian and its inverse. The calculation of the inverse requires the
Hessian to be non-singular which is not always the case. Furthermore, the size of
the Hessian grows quadratically with the number of parameters. The conjugate-gradient method is also a second-order technique but avoids explicit calculation of
the second-order partial derivatives. For an n-dimensional quadratic cost function
it reaches the optimum in n steps, but here each step includes a line search which
increases the computational complexity in each step. A line search can of course
also be performed in first-order gradient search. Such a method is called steepest
descent. In steepest descent, however, the profit from the line search is not so
big. The reason for this is that two successive steps in steepest descent are always
perpendicular and, hence, the parameter vector will in general move in a zigzag
path.
In practice, the true gradient of the cost function is, in most cases, not known
since the expected cost E is unknown. In these cases, an instantaneous sample
ε(t ) of the cost function can be used and the parameters are changed according to
w_i(t + 1) = w_i(t) − α ∂ε(t)/∂w_i(t).    (2.6)
This method is called stochastic gradient search since the gradient estimate varies
with the (stochastic) data and the estimate improves on average with an increasing
number of samples (see for example Haykin, 1994).
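A minimal sketch of stochastic gradient search in the sense of equation 2.6, assuming a linear model and a squared instantaneous cost on single samples; the data, dimensions and step length are made up for illustration and are not from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])   # the function to be approximated (assumed linear)
w = np.zeros(2)                  # model parameters
alpha = 0.05                     # update factor (step length)

for t in range(2000):
    x = rng.normal(size=2)                 # one stimulus sample
    d = w_true @ x + 0.01 * rng.normal()   # desired response from the teacher
    err = w @ x - d                        # instantaneous error
    grad = 2.0 * err * x                   # gradient of eps = err**2 w.r.t. w
    w -= alpha * grad                      # equation 2.6: a small step against the gradient

print(w)   # close to w_true on average
```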
2.3.2 Adaptability
The use of instantaneous estimates of the cost function is not necessarily a disadvantage. On the contrary, it allows for system adaptability. Instantaneous estimates permit the system to handle non-stationary processes, i.e. the cost function
is changing over time.
The choice of the update factor α is crucial for the performance of stochastic
gradient search. If the factor is too large, the algorithm will start oscillating and
never converge and if the factor is too small, the convergence time will be far
too long. In the literature, the factor is often a decaying function of time. The
intuitive reason for this is that the more samples the algorithm has used, the closer
the parameter vector should be to the optimum and the smaller the steps should
be. But, in most cases, the real reason for using a time-decaying update factor is
probably that it makes it easier to prove convergence.
In practice, however, choosing α as a function of time only is not a very good
idea. One reason is that the optimal rate of decay depends on the problem, i.e.
the shape of the cost function, and is therefore impossible to determine beforehand. Another important reason is adaptability. A system with an update factor
that decays as a function of time only cannot adapt to new situations. Once the
parameters have converged, the system is fixed. In general, a better solution is to
use an adaptive update factor that enables the parameters to change in large steps
when consistently moving towards the optimum and to decrease the steps when
the parameter vector is oscillating around the optimum. One example of such
methods is the Delta-Bar-Delta rule (Jacobs, 1988). This algorithm has a separate
adaptive update factor αi for each parameter.
Another fundamental reason for adaptive update factors, not often mentioned
in the literature, is that the step length in equation 2.6 is proportional to the norm
of the gradient. It is, however, only the direction of the gradient that is relevant,
not the norm. Consider, for example, finding the maximum of a Gaussian by
moving proportional to its gradient. Except for a region around the optimum, the
step length gets smaller the further we get from the optimum. A method that deals
with this problem is the RPROP algorithm (Riedmiller and Braun, 1993) which
adapts the actual step lengths of the parameters and not just the factors αi .
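The following sketch illustrates the idea of adapting individual step lengths from the signs of successive gradients, in the spirit of the RPROP algorithm mentioned above. It is not the published algorithm; the function name, the growth and shrink factors and the step bounds are assumptions made for illustration.

```python
import numpy as np

def sign_adaptive_step(w, grad, prev_grad, step,
                       up=1.2, down=0.5, step_min=1e-6, step_max=1.0):
    # Grow the individual step length where the gradient keeps its sign,
    # shrink it where the sign flips, and move by the sign of the gradient only,
    # so that the gradient norm does not influence the step length.
    same_sign = grad * prev_grad > 0
    step = np.where(same_sign,
                    np.minimum(step * up, step_max),
                    np.maximum(step * down, step_min))
    return w - np.sign(grad) * step, step
```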
2.4 Reinforcement learning
In reinforcement learning there is a teacher too, but this teacher does not give
the desired responses. Only a scalar reward or punishment (reinforcement signal)
according to the quality of the system’s overall performance is fed back to the
system, as illustrated in figure 2.1 on page 10. In this case, each experience is a
triplet of stimulus, response and corresponding reinforcement. The performance
to improve is simply the received reinforcement. What is meant by received reinforcement depends on whether or not the system acts in a closed loop, i.e. the
input to the system or the system state is dependent on previous output. If there
is a closed loop, an accumulated reward over time is probably more important
than each instant reward. If there is no closed loop, there is no conflict between
maximizing instantaneous reward and accumulated rewards.
The feedback to a reinforcement learning system is evaluative rather than instructive, as in supervised learning. The reinforcement signal is in most cases
easier to obtain than a set of correct responses. Consider, for example, the situation when a child learns to bicycle. It is not possible for the parents to explain
to the child how it should behave, but it is quite easy to observe the trials and
conclude how well the child manages. There is also a clear (though negative) reinforcement signal when the child fails. The simple feedback is perhaps the main
reason for the great interest in reinforcement learning in the fields of autonomous
systems and robotics. The teacher does not have to know how the system should
solve a task but only be able to decide whether (and perhaps how well) it solves it.
Hence, a reinforcement learning system requires feedback to be able to learn, but
it is a very simple form of feedback compared to what is required for a supervised
learning system. In some cases, the teacher’s task may even become so simple
that it can be built into the system. For example, consider a system that is only to
learn to avoid heat. Here, the teacher may consist only of a set of heat sensors.
In such a case, the reinforcement learning system is more like an unsupervised
learning system than a supervised one. For this reason, reinforcement learning is
often referred to as a class of learning systems that lies in between supervised and
unsupervised learning systems.
A reinforcement, or reinforcing stimulus, is defined as a stimulus that strengthens the behaviour that produced it. As an example, consider the procedure of
training an animal. In general, there is no point in trying to explain to the animal
how it should behave. The only way is simply to reward the animal when it does
the right thing. If an animal is given a piece of food each time it presses a button
when a light is flashed, it will (in most cases) learn to press the button when the
light signal appears. We say that the animal’s behaviour has been reinforced. We
use the food as a reward to train the animal. One could, in this case, say that it is
the food itself that reinforces the behaviour. In general, there is some mechanism
in the animal that generates an internal reinforcement signal when the animal gets
food (at least if it is hungry) and when it experiences other things that are good for
it i.e. that increase the probability of the reproduction of its genes. A biochemical
process involving dopamine is believed to play a central role in the distribution of
the reward signal (Bloom and Lazerson, 1985; Schultz et al., 1997). In the 1950s,
experiments were made (Olds and Milner, 1954) where the internal reward system
was artificially stimulated instead of giving an external reward. In this case, the
animal was even able to learn self destructive behaviour.
In the example above, the reward (piece of food) was used merely to trigger
the reinforcement signal. In the following discussion of artificial systems, however, the two terms have the same meaning. In other words, we will use only one
kind of reward, namely the reinforcement signal itself, which we in the case of an
artificial system can allow us to have direct access to without any ethical considerations. In case of a large system, one would of course want the system to be able
to solve different routine tasks besides the main task (or tasks). For instance, suppose we want the system to learn to charge its batteries. Such a behaviour should
then be reinforced in some way. Whether we put a box into the system that reinforces the battery-charging behaviour or we let the charging device or a teacher
deliver the reinforcement signal is a technical question rather than a philosophical
one. If, however, the box is built into the system, we can reinforce behaviour by
charging the system’s batteries.
Reinforcement learning is strongly associated with learning among animals
(including humans) and some people find it hard to see how a machine could learn
by a “trial-and-error” method. To show that machines can indeed learn in this way,
a simple example was created by Donald Michie in the 1960s. A pile of matchboxes that learns to play noughts and crosses illustrates that even a very simple
machine can learn by trial and error. The machine is called MENACE (Match-box
Educable Noughts And Crosses Engine) and consists of 288 match-boxes, one for
each possible state of the game. Each box is filled with a random set of coloured
beans. The colours represent different moves. Each move is determined by the
colour of a randomly selected bean from the box representing the current state
of the game. If the system wins the game, new beans with the same colours as
those selected during the game are added to the respective boxes. If the system
loses, the beans that were selected are removed. In this way, after each game, the
possibility of making good moves increases and the risk of making bad moves
decreases. Ultimately, each box will only contain beans representing moves that
have led to success.
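A toy sketch of the MENACE-style update described above: each state (match-box) holds counts of beans per move, a move is drawn with probability proportional to the counts, and the counts of the moves actually played are increased after a win and decreased after a loss. The data structure, function names and the initial bean count are illustrative assumptions, not a reconstruction of Michie's machine.

```python
import random
from collections import defaultdict

# state -> move -> number of beans of that colour in the state's match-box
boxes = defaultdict(lambda: defaultdict(lambda: 3))

def choose_move(state, legal_moves):
    # Draw a bean at random: the probability of a move is proportional
    # to the number of beans representing it.
    weights = [boxes[state][m] for m in legal_moves]
    return random.choices(legal_moves, weights=weights)[0]

def reinforce(game_trace, won):
    # After a win, add beans of the played colours to the respective boxes;
    # after a loss, remove them (but never empty a box completely).
    for state, move in game_trace:
        boxes[state][move] = max(1, boxes[state][move] + (1 if won else -1))
```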
There are some notable advantages with reinforcement learning compared to
supervised learning, besides the obvious fact that reinforcement learning can be
used in some situations where supervised learning is impossible (e.g. the child
learning to bicycle and the animal learning examples above). The ability to learn
by receiving rewards makes it possible for a reinforcement learning system to
become more skilful than its teacher. It can even improve its behaviour by training
itself, as in the backgammon program by Tesauro (1990).
2.4.1 Searching for higher rewards
In reinforcement learning, the feedback to the system contains no gradient information, i.e. the system does not know in what direction to search for a better
solution. For this reason, most reinforcement learning systems are designed to
have a stochastic behaviour. A stochastic behaviour can be obtained by adding
noise to the output of a deterministic input-output function or by generating the
output from a probability distribution. In both cases, the output can be seen as
consisting of two parts: one deterministic and one stochastic. It is easy to see
that both these parts are necessary in order for the system to be able to improve
its behaviour. The deterministic part is the optimum response given the current
knowledge. Without the deterministic part, the system would make no sensible
decisions at all. However, if the deterministic part was the only one, the system
would easily get trapped in a non-optimal behaviour. As soon as the received
rewards are consistent with current knowledge, the system will be satisfied and
never change its behaviour. Such a system will only maximize the reward predicted by the internal model but not the external reward actually received. The
stochastic part of the response provides the system with information from points
in the decision space that would never be sampled otherwise. So, the deterministic
part of the output is necessary for generating good responses with respect to the
current knowledge and the stochastic part is necessary for gaining more knowledge. The stochastic behaviour can also help the system avoid getting trapped in
local maxima.
The conflict between the need for exploration and the need for precision is typical of reinforcement learning. The conflict is usually referred to as the exploration-exploitation dilemma. This dilemma does not normally occur in supervised
learning.
At the beginning when the system has poor knowledge of the problem to be
solved, the deterministic part of the response is very unreliable and the stochastic part should preferably dominate in order to avoid a misleading bias in the
search for correct responses. Later on, however, when the system has gained more
knowledge, the deterministic part should have more influence so that the system
makes at least reasonable guesses. Eventually, when the system has gained a lot
of experience, the stochastic part should be very small in order not to disturb the
generation of correct responses. A constant relation between the influence of the
deterministic and stochastic parts is a compromise which will give a poor search
behaviour (i.e. slow convergence) at the beginning and bad precision after convergence. Therefore, many reinforcement learning systems have noise levels that
decay with time. There is, however, a problem with such an approach too. The
decay rate of the noise level must be chosen to fit the problem. A difficult problem
takes longer time to solve and if the noise level is decreased too fast, the system
may never reach an optimal solution. Conversely, if the noise level decreases too
slowly, the convergence will be slower than necessary. Another problem arises
in a dynamic environment where the task may change after some time. If the
noise level at that time is too low, the system will not be able to adapt to the new
situation. For these reasons, an adaptive noise level is to prefer.
The basic idea of an adaptive noise level is that when the system has a poor
knowledge of the problem, the noise level should be high and when the system has
reached a good solution, the noise level should be low. This requires an internal
quality measure that indicates the average performance of the system. It could of
course be accomplished by accumulating the rewards delivered to the system, for
instance by an iterative method, i.e.
p(t + 1) = αp(t) + (1 − α)r(t),    (2.7)
where p is the performance measure, r is the reward and α is the update factor,
0 < α < 1. Equation 2.7 gives an exponentially decaying average of the rewards
given to the system, where the most recent rewards will be the most significant
ones.
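A sketch of how the running performance measure in equation 2.7 could be used to control the noise level of a stochastic response. The class name and, in particular, the mapping from average performance to noise level are assumptions for illustration; the thesis only states the general principle.

```python
class NoiseController:
    """Exploration noise controlled by a running average of rewards (equation 2.7)."""

    def __init__(self, alpha=0.9, noise_max=1.0):
        self.alpha = alpha          # update factor, 0 < alpha < 1
        self.noise_max = noise_max
        self.p = 0.0                # exponentially decaying average reward

    def update(self, reward):
        # p(t + 1) = alpha * p(t) + (1 - alpha) * r(t)
        self.p = self.alpha * self.p + (1 - self.alpha) * reward
        # Assumed mapping: the better the average performance (rewards near 1),
        # the smaller the exploration noise.
        return self.noise_max * max(0.0, 1.0 - self.p)
```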
A solution, involving a variance that depends on the predicted reinforcement,
has been suggested by Gullapalli (1990). The advantage with such an approach
is that the system might expect different rewards in different situations for the
simple reason that the system may have learned some situations better than others.
The system should then have a very deterministic behaviour in situations where it
predicts high rewards and a more exploratory behaviour in situations where it is
more uncertain. Such a system will have a noise level that depends on the local
skill rather than the average performance.
Another way of controlling the noise level, or rather the standard deviation
σ of a stochastic output unit, is found in the REINFORCE algorithm (Williams,
1988). Let µ be the mean of the output distribution and y the actual output. When
the output y gives a higher reward than the recent average, the variance will decrease if |y − µ| < σ and increase if |y − µ| > σ. When the reward is less than
average, the opposite changes are made. This leads to a more narrow search behaviour if good solutions are found close to the current solution or bad solutions
are found outside the standard deviation and a wider search behaviour if good
solutions are found far away or bad solutions are found close to the mean.
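A hedged sketch of a Gaussian stochastic output unit with REINFORCE-style updates of its mean and standard deviation, which reproduces the qualitative behaviour described above: when an output closer than σ to the mean is rewarded above the recent average, the deviation shrinks, and when a more distant output is rewarded, it grows. The eligibility expressions are the standard derivatives of the Gaussian log density, but the class name, learning rates and baseline update are assumptions and not Williams' published formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianUnit:
    """Stochastic output unit with REINFORCE-style updates of mean and deviation."""

    def __init__(self, mu=0.0, sigma=1.0, lr=0.01):
        self.mu, self.sigma, self.lr = mu, sigma, lr
        self.baseline = 0.0   # running average reward used as comparison level

    def act(self):
        return rng.normal(self.mu, self.sigma)

    def learn(self, y, r):
        d = r - self.baseline                  # reward relative to the recent average
        self.baseline += 0.1 * d
        # Eligibilities of a Gaussian unit (derivatives of the log density):
        e_mu = (y - self.mu) / self.sigma**2
        e_sigma = ((y - self.mu) ** 2 - self.sigma**2) / self.sigma**3
        self.mu += self.lr * d * e_mu
        self.sigma = max(1e-3, self.sigma + self.lr * d * e_sigma)
```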
Another strategy for a reinforcement learning system to improve its behaviour
is to differentiate a model of the reward with respect to the system parameters
in order to estimate the gradient of the reward in the system’s parameter space.
The model can be known a priori and built into the system, or it can be learned
and refined during the training of the system. To know the gradient of the reward
means to know in which direction in the parameter space to search for a better
performance. One way to use this strategy is described by Munro (1987) where
the model is a secondary network that is trained to predict the reward. This can
be done with back-propagation, using the difference between the reward and the
prediction as an error measure. Then back-propagation can be used to modify
the weights in the primary network, but here with the aim of maximizing the
prediction done by secondary network. A similar approach was used to train a
pole-balancing system (Barto et al., 1983). Other examples of similar strategies
are described by Williams (1988).
Adaptive critics
When the learning system operates in a dynamic environment, the system may
have to carry out a sequence of actions to get a reward. In other words, the feedback to such a system may be infrequent and delayed and the system faces what is
known as the temporal credit assignment problem (see section 2.7.2 on page 35).
Assume that the environment or process to be controlled is a Markov process. A
Markov process consists of a set S of states si where the conditional probability of
a state transition only depends on a finite number of previous states. The definition of the states can be reformulated so that the state transition probabilities only
depend on the current state, i.e.
P(s_{k+1} | s_k, s_{k−1}, ..., s_1) = P(s′_{k+1} | s′_k),    (2.8)
which is a first order Markov process. Derin and Kelly (1989) present a systematic
classification of different types of Markov models.
Suppose one or several of the states in a Markov process are associated with
a reward. Now, the goal for the learning system can be defined as maximizing
the total accumulated reward for all future time steps. One way to accomplish
this task for a discrete Markov process is, like in the MENACE example above,
to store all states and actions until the final state is reached and to update the state
transition probabilities afterwards. This method is referred to as batch learning.
An obvious disadvantage with batch learning is the need for storage which will
become infeasible for large dimensionalities of the input and output vectors as
well as for long sequences.
A problem occurring when only the final outcome is considered is illustrated
in figure 2.2. Consider a game where a certain position has resulted in a loss in
90% of the cases and a win in 10% of the cases. This position is classified as a
bad position. Now, suppose that a player reaches a novel state (i.e. a state that has
not been visited before) that inevitably leads to the bad state and finally happens
to lead to a win. If the player waits until the end of the game and only looks at the
result, he would label the novel state as a good state since it led to a win. This is,
however, not true. The novel state is a bad state since it probably leads to a loss.
Adaptive critics (Barto, 1992) is a class of methods designed to handle the
problem illustrated in figure 2.2.

Figure 2.2: An example to illustrate the advantage of adaptive critics. A state that is likely to lead to a loss is classified as a bad state. A novel state that leads to the bad state but then happens to lead to a win is classified as a good state if only the final outcome is considered. In adaptive critics, the novel state is recognized as a bad state since it most likely leads to a loss.

Let us, for simplicity, assume that the input vector x_k uniquely defines the state s_k. (This assumption is of course not always true. When it does not hold, the system faces the perceptual aliasing problem, which is discussed in section 2.7.1 on page 33.) Suppose that for each state x_k there is a value V_g(x_k) that is an estimate of the expected future result (e.g. a weighted sum of the accumulated reinforcement) when following a policy g, i.e. generating the output as y = g(x). In adaptive critics, the value V_g(x_k) depends on the value V_g(x_{k+1}) and not only on the final result:

V_g(x_k) = r(x_k, g(x_k)) + γ V_g(x_{k+1}),    (2.9)
where r(x_k, g(x_k)) is the reward for being in the state x_k and generating the response y_k = g(x_k). This means that

V_g(x_k) = ∑_{i=k}^{N} γ^{i−k} r(x_i, g(x_i)),    (2.10)

i.e. the value of a state is a weighted sum of all future rewards. The weight γ ∈ [0, 1] can be used to make rewards that are close in time more valuable than
rewards further away. Equation 2.9 makes it possible for adaptive critics to improve their predictions during a process without always having to wait for the final
result.
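To make the recursion in equation 2.9 concrete, the sketch below evaluates the discounted sum in equation 2.10 for a finite sequence of rewards by backward recursion. The reward sequence and discount factor are made up for illustration.

```python
def discounted_values(rewards, gamma=0.9):
    # Evaluate V(x_k) = r_k + gamma * V(x_{k+1}) backwards from the final step,
    # which equals the weighted sum of future rewards in equation 2.10.
    values = [0.0] * len(rewards)
    running = 0.0
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        values[k] = running
    return values

print(discounted_values([0.0, 0.0, 1.0]))   # -> [0.81, 0.9, 1.0] (up to rounding)
```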
Suppose that the environment can be described by the function f so that
x_{k+1} = f(x_k, y_k). Now equation 2.9 can be written as

V_g(x_k) = r(x_k, g(x_k)) + γ V_g(f(x_k, g(x_k))).    (2.11)

The optimal response y* is the response given by the optimal policy g*:

y* = g*(x) = arg max_y {r(x, y) + V*(f(x, y))},    (2.12)

where V* is the value of the optimal policy (Bellman, 1957).
In the methods of temporal differences (TD) described by Sutton (1988), the
value function V is estimated using the difference between the values of two consecutive states as an internal reward signal. Another well known method for adaptive critics is Q-learning (Watkins, 1989). In Q-learning, the system is trying to
estimate the Q-function
Q_g(x, y) = r(x, y) + V_g(f(x, y))    (2.13)
rather than the value function V itself. Using the Q-function, the optimal response
is
y* = g*(x) = arg max_y {Q*(x, y)}.    (2.14)
This means that a model of the environment f is not required in Q-learning in
order to find the optimal response.
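A minimal tabular sketch of the Q-learning idea in equations 2.13 and 2.14: the Q-table is improved from experienced transitions only, so no model f of the environment is needed. The update rule below is the standard temporal-difference form of Q-learning rather than a reconstruction of any algorithm in the thesis, and the learning rate and discount factor are illustrative assumptions.

```python
from collections import defaultdict

Q = defaultdict(float)   # (state, action) -> estimated Q-value

def best_action(state, actions):
    # Equation 2.14: the optimal response maximizes Q over the possible actions.
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Move Q(s, a) towards r + gamma * max_a' Q(s_next, a').  Only the observed
    # transition (s, a, r, s_next) is used; no model f of the environment is needed.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```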
In control theory, an optimization algorithm called dynamic programming is a
well-known method for maximizing the expected total accumulated reward. The
relationship between TD-methods and dynamic programming has been discussed
for example by Barto (1992), Werbos (1990) and Whitehead et al. (1990). It
should be noted, however, that maximizing the expected accumulated reward is
not always the best criterion, as discussed by Heger (1994). He notes that this
criterion of choice of action
- is based upon long-run consideration where the decision process is repeated a sufficiently large number of times. It is not necessarily a valid criterion in the short run or one-shot case, especially when the possible consequences or their probabilities have extreme values.

- assumes the subjective values of possible outcomes to be proportional to their objective values, which is not necessarily the case, especially when the values involved are large.
As an illustrative example, many people occasionally play on lotteries in spite
of the fact that the expected outcome is negative. Another example is that most
people do not invest all their money in stocks although such a strategy would give
a larger expected payoff than putting some of it in the bank.
The first well-known use of adaptive critics was in a checkers playing program
(Samuel, 1959). In that system, the value of a state (board position) was updated
according to the values of future states likely to appear. The prediction of future
states requires a model of the environment (game). This is, however, not the
case in TD-methods like the adaptive heuristic critic algorithm (Sutton, 1984)
where the feedback comes from actual future states and, hence, prediction is not
necessary.
Sutton (1988) has proved a convergence theorem for one TD-method that states that the prediction for each state asymptotically converges to the maximum-likelihood prediction of the final outcome for states generated in a Markov process. (In this TD-method, called TD(0), the value V_k only depends on the following value V_{k+1} and not on later predictions. Other TD-methods can take into account later predictions with a function that decreases exponentially with time.) Other proofs concerning adaptive critics in finite state systems have been presented, for example by Watkins (1989), Jaakkola et al. (1994) and Baird (1995). Proofs for continuous state spaces have been presented by Werbos (1990), Bradtke (1993) and Landelius (1997).
Other methods for handling delayed rewards are for example heuristic dynamic programming (Werbos, 1990) and back-propagation of utility (Werbos,
1992).
Recent physiological findings indicate that the output of dopaminergic neurons signals errors in the predicted reward function, i.e. the internal reward used
in TD-learning (Schultz et al., 1997).
2.4.2 Generating the reinforcement signal
Werbos (1990) defines a reinforcement learning system as
“any system that through interaction with its environment improves
its performance by receiving feedback in the form of a scalar reward
(or penalty) that is commensurate with the appropriateness of the response.”
The goal for a reinforcement learning system is simply to maximize the reward,
for example the accumulated value of the reinforcement signal r. Hence, r can be
said to define the problem to be solved and therefore the choice of reward function
is very important. The reward, or reinforcement, must be capable of evaluating the
overall performance of the system and be informative enough to allow learning.
In some cases, how to choose the reinforcement signal is obvious. For example, in the pole balancing problem (Barto et al., 1983), the reinforcement signal
is chosen as a negative value upon failure and as zero otherwise. Many times,
however, how to measure the performance is not evident and the choice of reinforcement signal will affect the learning capabilities of the system.
The reinforcement signal should contain as much information as possible
about the problem. The learning performance of a system can be improved considerably if a pedagogical reinforcement is used. One should not sit and wait for
the system to attain a perfect performance, but use the reward to guide the system to a better performance. This is obvious in the case of training animals and
humans, but it also applies to the case of training artificial systems with reinforcement learning. Consider, for instance, an example where a system is to learn a
simple function y = f (x). If a binary reward is used, i.e.
(
r=
1
0
i f jỹ , yj < ε
;
else
(2.15)
where ỹ is the output of the system and y is the correct response, the system will
receive no information at all3 as long as the responses are outside the interval
defined by ε. If, on the other hand, the reward is chosen inversely proportional to
the error, i.e.
r = \frac{1}{|\tilde{y} - y|}    (2.16)
a relative improvement will yield the same relative increase in reward for all output. In practice, of course, the reward function in equation 2.16 could cause numerical problems, but it serves as an illustrative example of a well-shaped reward
function. In general, a smooth and continuous function is preferable. Also, the
derivative should not be too small, at least not in regions where the system should
not get stuck, i.e. in regions of bad performance. It should be noted, however, that
sometimes there is no obvious way of defining a continuous reward function. In
the case of pole balancing (Barto et al., 1983), for example, the pole either falls or it does not.
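The two reward functions can be written down directly. The following sketch assumes a scalar output; the tolerance eps and the guard delta are arbitrary illustrative choices, the latter addressing the numerical problem mentioned above.

```python
def binary_reward(y_est, y, eps=0.05):
    # Equation 2.15: gives no gradation as long as the output is outside the eps-interval.
    return 1.0 if abs(y_est - y) < eps else 0.0

def inverse_error_reward(y_est, y, delta=1e-6):
    # Equation 2.16: grows smoothly as the error shrinks; delta guards against
    # division by zero, one of the numerical problems mentioned in the text.
    return 1.0 / (abs(y_est - y) + delta)
```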
A perhaps more interesting example where a pedagogical reward is used can
be found in a paper by Gullapalli (1990), which presents a “reinforcement learning
system for learning real-valued functions”. This system was supplied with two
input variables and one output variable. In one case, the system was trained on
an XOR-task. Each input was 0.1 or 0.9 and the output was any real number between 0 and 1. The optimal output values were 0.1 and 0.9 according to the
logical XOR-rule. At first, the reinforcement signal was calculated as
r = 1 - |\varepsilon|,    (2.17)
where ε is the difference between the output and the optimal output. The system sometimes converged to wrong results, and in several training runs it did not
converge at all. A new reinforcement signal was calculated as
r' = \frac{r + r_{\text{task}}}{2}.    (2.18)
The term rtask was set to 0.5 if the latest output for similar input was less than the
latest output for dissimilar input and to -0.5 otherwise. With the reinforcement
signal in equation 2.18, the system began by trying to satisfy a weaker definition
of the XOR-task, according to which the output should be higher for dissimilar
inputs than for similar inputs. The learning performance of the system improved
in several ways with the new reinforcement signal.
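A sketch of such a combined reward is given below. The argument names are illustrative and not Gullapalli's; only the structure of equations 2.17 and 2.18 is reproduced.

```python
def shaped_reward(r, out_similar, out_dissimilar):
    """Combined reward in the spirit of equations 2.17-2.18 (a sketch).

    r              : the original reward 1 - |eps|
    out_similar    : latest output for similar inputs (0.1/0.1 or 0.9/0.9)
    out_dissimilar : latest output for dissimilar inputs (0.1/0.9 or 0.9/0.1)
    """
    r_task = 0.5 if out_similar < out_dissimilar else -0.5
    return (r + r_task) / 2.0
```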
Another reward strategy is to reward only improvements in behaviour, for
example by calculating the reinforcement as
r = p - \bar{r},    (2.19)
where p is a performance measure and r̄ is the mean reward acquired by the system. Equation 2.19 gives a system that is never satisfied since the reward vanishes
in any solution with a stable reward. If the system has an adaptive search behaviour as described in the previous section, it will keep on searching for better
and better solutions. The advantage with such a reward is that the system will
not get stuck in a local optimum. The disadvantage is, of course, that it will not
stay in the global optimum either, if such an optimum exists. It will, however, always return to the global optimum and this behaviour can be useful in a dynamic
environment where a new optimum may appear after some time.
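A minimal sketch of this reward strategy, assuming that the mean reward r̄ is estimated with an exponentially weighted running mean (the text does not specify how r̄ is obtained):

```python
class BaselineReward:
    """Reward relative to the mean reward, as in equation 2.19 (a sketch)."""
    def __init__(self, beta=0.01):
        self.r_bar = 0.0   # running estimate of the mean reward
        self.beta = beta   # step length of the running mean (an assumption)

    def __call__(self, p):
        r = p - self.r_bar                       # equation 2.19
        self.r_bar += self.beta * (p - self.r_bar)
        return r
```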
Even if the reward in the previous equation is a bit odd, it points out the fact
that there might be negative reward or punishment. The pole balancing system
(Barto et al., 1983) is an example of the use of negative reinforcement and in this
case it is obvious that it is easier to deliver punishment upon failure than reward
upon success since the reward would be delivered after an unpredictably long sequence of actions; it would take an infinite amount of time to verify a success!
In general, however, it is probably better to use positive reinforcement to guide a
system towards a solution for the simple reason that there is usually more information in the statement “this was a good solution” than in the opposite statement
“this was not a good solution”. On the other hand, if the purpose is to make
the system avoid a particular solution (i.e. “Do anything but this!”), punishment
would probably be more efficient.
2.4.3 Learning in an evolutionary perspective
In this section, a special case of reinforcement learning called genetic algorithms
is described. The purpose is not to give a detailed description of genetic algorithms, but to illustrate the fact that they are indeed reinforcement learning algorithms. From this fact and the obvious similarity between biological evolution and
genetic algorithms (as indicated in the name), some interesting conclusions can be
drawn concerning the question of learning at different time scales.
A genetic algorithm is a stochastic search method for solving optimization
problems. The theory was founded by Holland (1975) and it is inspired by the
theory of natural evolution. In natural evolution, the problem to be optimized is
how to survive in a complex and dynamic environment. The knowledge of this
problem is encoded as genes in the individuals’ chromosomes. The individuals
that are best adapted in a population have the highest probability of reproduction. In reproduction, the genes of the new individuals (children) are a mixture or
crossover of the parents’ genes. In reproduction there is also a random change in
the chromosomes. The random change is called mutation.
A genetic algorithm works with coded structures of the parameter space in
a similar way. It uses a population of coded structures (individuals) and evaluates the performance of each individual. Each individual is reproduced with a
probability that depends on that individual’s performance. The genes of the new
individuals are a mixture of the genes of two parents (crossover), and there is a
random change in the coded structure (mutation).
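The following sketch shows a minimal binary genetic algorithm with fitness-proportional reproduction, one-point crossover and bit-flip mutation. It is a generic illustration, not a particular algorithm from the literature cited here, and it assumes a non-negative fitness function.

```python
import numpy as np

def genetic_algorithm(fitness, n_genes, pop_size=20, generations=100,
                      mutation_rate=0.05, seed=0):
    """Minimal binary genetic algorithm (a generic sketch).
    fitness is assumed to return a non-negative value for a 0/1 gene vector."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_genes))
    for _ in range(generations):
        f = np.array([fitness(ind) for ind in pop], dtype=float)
        prob = f / f.sum()                              # reproduction probability
        parents = pop[rng.choice(pop_size, size=(pop_size, 2), p=prob)]
        cuts = rng.integers(1, n_genes, size=pop_size)  # one-point crossover
        children = np.array([np.concatenate((a[:c], b[c:]))
                             for (a, b), c in zip(parents, cuts)])
        flip = rng.random(children.shape) < mutation_rate
        pop = np.where(flip, 1 - children, children)    # mutation
    return pop[np.argmax([fitness(ind) for ind in pop])]
```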
Thus, genetic algorithms learn by the method of trial and error, just like other
reinforcement learning algorithms. We might therefore argue that the same basic
principles hold both for developing a system (or an individual) and for adapting
the system to its environment. This is important since it makes the question of
what should be built into the machine from the beginning and what should be
learned by the machine more of a practical engineering question than one of principle. This conclusion does not make the question less important though; in
practice, it is perhaps one of the most important issues.
Another interesting relation between evolution and learning on the individual level is discussed by Hinton and Nowlan (1987). They show that learning
organisms evolve faster than non-learning equivalents. This is maybe not very
surprising if evolution and learning are considered as merely different levels of a
hierarchical learning system. Then the convergence of the slow high-level learning process (corresponding to evolution) depends on the adaptability of the faster
low-level learning process (corresponding to individual learning). This indicates
that hierarchical systems adapt faster than non-hierarchical systems of the same
complexity.
More information about genetic algorithms can be found for example in the
books by Davis (1987) and Goldberg (1989).
2.5 Unsupervised learning
In unsupervised learning there is no external feedback at all (see figure 2.1 on
page 10). The system’s experience mentioned on page 9 consists of a set of signals
and the measure of performance is often some statistical or information theoretical
property of the signal. Unsupervised learning is perhaps not learning in the word’s
everyday sense, since the goal is not to learn to produce responses in the form of
useful actions. Rather, it is to learn a certain representation which is thought to
be useful in further processing. The importance of a good representation of the
signals is discussed in chapter 3.
Unsupervised learning systems are often called self-organizing systems (Haykin,
1994; Hertz et al., 1991). Hertz et al. (1991) describe two principles for unsupervised learning: Hebbian learning and competitive learning. Also Haykin (1994)
uses these two principles but adds a third one that is based on mutual information, which is an important concept in this thesis. Next, these three principles of
unsupervised learning are described.
2.5.1 Hebbian learning
Hebbian learning originates from the pioneering work of neuropsychologist Hebb
(1949). The basic idea is that when one neuron repeatedly causes a second neuron
to fire, the connection between them is strengthened. Hebb’s idea has later been
extended to include the formulation that if the two neurons have uncorrelated activities, the connection between them is weakened. In learning and neural network
theory, Hebbian learning is usually formulated more mathematically. Consider a
linear unit where the output is calculated as
y = \sum_{i=1}^{N} w_i x_i.    (2.20)

The simplest Hebbian learning rule for such a linear unit is

w_i(t+1) = w_i(t) + \alpha x_i(t)\,y(t).    (2.21)
Consider the expected change ∆w of the parameter vector w using y = xT w:
E[\Delta w] = \alpha E[x x^T]\,w = \alpha C_{xx} w.    (2.22)
Since Cxx is positive semi-definite, any component of w parallel to an eigenvector of Cxx corresponding to a non-zero eigenvalue will grow exponentially and
a component in the direction of an eigenvector corresponding to the largest eigenvalue (in the following called a maximal eigenvector) will grow fastest. Therefore
we see that w will approach a maximal eigenvector of Cxx . If x has zero mean,
Cxx is the covariance matrix of x and, hence, a linear unit with Hebbian learning
will find the direction of maximum variance in the input data, i.e. the first principal component of the input signal distribution (Oja, 1982). Principal component
analysis (PCA) is discussed in section 4.2 on page 64.
A problem with equation 2.21 is that it does not converge. A solution to this
problem is Oja’s rule (Oja, 1982):
w_i(t+1) = w_i(t) + \alpha y(t)\,\big(x_i(t) - y(t)\,w_i(t)\big).    (2.23)
This extension of Hebb’s rule makes the norm of w approach 1 and the direction
will still approach that of a maximal eigenvector, i.e. the first principal component
of the input signal distribution. Again, if x has zero mean, Oja’s rule finds the
one-dimensional representation y of x that has the maximum variance under the
constraint that \|w\| = 1.
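A minimal sketch of Oja's rule applied to a zero-mean data matrix; the learning rate and number of epochs are arbitrary illustrative choices.

```python
import numpy as np

def oja_first_pc(X, alpha=0.01, epochs=50, seed=0):
    """Estimate the first principal component with Oja's rule (eq. 2.23).

    X is assumed to be an (n_samples, N) array with zero mean."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = x @ w                      # linear unit, eq. 2.20
            w += alpha * y * (x - y * w)   # Oja's rule, eq. 2.23
    return w
```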
In order to find more than one principal component, Oja (1989) proposed a
modified learning rule for N units:
w_{ij}(t+1) = w_{ij}(t) + \alpha y_i(t)\left( x_j(t) - \sum_{k=1}^{N} y_k(t)\,w_{kj}(t) \right),    (2.24)
where wi j is the weight j in unit i. A similar modification for N units was proposed
by Sanger (1989), which is identical to equation 2.24 except for the summation
that ends at i instead of N. The difference is that Sanger’s rule finds the N first
principal components (sorted in order) whereas Oja’s rule finds N vectors spanning the same subspace as the N first principal components.
A note on correlation and covariance matrices
In neural network literature, the matrix Cxx in equation 2.22 is often called a
correlation matrix. This can be a bit confusing, since Cxx does not contain the
correlations between the variables in a statistical sense, but rather the expected
values of the products between them. The correlation between xi and x j is defined
as
\rho_{ij} = \frac{E[(x_i - \bar{x}_i)(x_j - \bar{x}_j)]}{\sqrt{E[(x_i - \bar{x}_i)^2]\;E[(x_j - \bar{x}_j)^2]}}    (2.25)
(see for example Anderson, 1984), i.e. the covariance between x_i and x_j normalized by the geometric mean of the variances of x_i and x_j (x̄ = E[x]). Hence, the correlation is bounded, −1 ≤ ρ_{ij} ≤ 1, and the diagonal terms of a correlation matrix, i.e. a matrix of correlations, are one. The diagonal terms of C_{xx} in equation 2.22 are the second order origin moments, E[x_i^2], of x_i. The diagonal terms in a covariance matrix are the variances or the second order central moments, E[(x_i − x̄_i)^2], of x_i.
The maximum likelihood estimator of ρ is obtained by replacing the expectation operator in equation 2.25 by a sum over the samples (Anderson, 1984). This
estimator is sometimes called the Pearson correlation coefficient after Pearson
(1896).
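As an illustration of the difference, the following sketch computes the sample estimate of equation 2.25 by replacing the expectations with sample means:

```python
import numpy as np

def pearson_correlation(x, y):
    """Sample estimate of equation 2.25: expectations replaced by sample means.
    Note the difference from E[x_i x_j]: the means are subtracted first."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).mean() / np.sqrt((xc ** 2).mean() * (yc ** 2).mean())
```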
2.5.2 Competitive learning
In competitive learning there are several computational units competing to give
the output. For a neural network, this means that among several units in the output layer only one will fire while the rest will be silent. Hence, they are often
called winner-take-all units. Which unit fires depends on the input signal. The
units specialize to react to certain stimuli and therefore they are sometimes called
grandmother cells. This term was coined to illustrate the lack of biological plausibility for such highly specialized neurons. (There is probably not a single neuron
in your brain waiting just to detect your grandmother.) Nevertheless, the most
well-known implementation of competitive learning, the self-organizing feature
map (SOFM) (Kohonen, 1982), is highly motivated by the topologically organized feature representations in the brain. For instance, in the visual cortex, line
detectors are organized on a two-dimensional surface so that adjacent detectors
for the orientation of a line are sensitive to similar directions (Hubel and Wiesel,
1962).
In the simplest case, competitive learning can be described as follows: Each
unit gets the same input x and the winner is unit i if \|w_i - x\| < \|w_j - x\| for all j ≠ i.
A simple learning rule is to update the parameter vector of the winner according
to
w_i(t+1) = w_i(t) + \alpha\,(x(t) - w_i(t)),    (2.26)
i.e. to move the winning parameter vector towards the present input. The rest of
the parameter vectors are left unchanged. If the output of the winning unit is one,
equation 2.26 can be written as
w_i(t+1) = w_i(t) + \alpha y_i\,(x(t) - w_i(t))    (2.27)

for all units (since y_i = 0 for all losers). Equation 2.27 is a modification of the Hebb rule in equation 2.21 and is identical to Oja's rule (equation 2.23) if y_i ∈ \{0, 1\} (Hertz et al., 1991).
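A minimal sketch of the winner-take-all update in equation 2.26, with the prototype vectors stored as rows of a matrix:

```python
import numpy as np

def competitive_step(W, x, alpha=0.1):
    """One winner-take-all update (equation 2.26): the prototype closest to the
    input is moved towards it; the rest are left unchanged.
    W is an (n_units, N) array of prototype vectors."""
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    W[winner] += alpha * (x - W[winner])
    return winner
```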
Vector quantization
A rather simple, but important, application of competitive learning is vector quantization (Gray, 1984). The purpose of vector quantization is to quantize a distribution of vectors x into N classes so that all vectors that fall into one class can be
represented by a single prototype vector wi . The goal is to minimize the distortion
between the input vectors x and the prototype vectors. The distortion measure is
usually defined using a Euclidean metric:
D = \int_{\mathbb{R}^N} p(x)\,\|x - w\|^2\,dx,    (2.28)
where p(x) is the probability density function of x.
Kohonen (1989) has proposed a modification to the competitive learning rule
in equation 2.26 for use in classification tasks:
w_i(t+1) = w_i(t) \begin{cases} +\alpha\,(x(t) - w_i(t)) & \text{if correct classification} \\ -\alpha\,(x(t) - w_i(t)) & \text{if incorrect classification.} \end{cases}    (2.29)
The need for feedback from a teacher means that this is a supervised learning rule.
It works as the standard competitive learning rule in equation 2.26 if the winning
prototype vector represents the desired class but moves in the opposite direction
if it does not. The learning rule is called learning vector quantization (LVQ) and
can be used for classification. (Note that several prototype vectors can belong to
the same class.)
Feature maps
The self-organizing feature map (SOFM) (Kohonen, 1982) is an unsupervised
competitive learning rule but without winner-take-all units. It is similar to the
vector quantization methods just described but has local connections between the
prototype vectors. The standard update rule for a SOFM is
w_i(t+1) = w_i(t) + \alpha\,h(i, j)\,(x(t) - w_i(t)),    (2.30)
where h(i, j) is a neighbourhood function which depends on the distance between unit i and the winning unit j. A common choice of h(i, j)
is a Gaussian. Note that the distance is not between the parameter vectors but
between the units in a network. Hence, a topological ordering of the units is implied. Note also that all units, and not only the winner, are updated (although
some of them with very small steps). The topologically ordered units and the
neighbourhood function cause nearby units to have more similar prototype vectors than units far apart. Hence, if these parameter vectors are seen as feature
detectors (i.e. filters), similar features will be represented by nearby units.
Equation 2.30 causes the parameter vectors to be more densely distributed in
areas where the input probability is high and more sparsely distributed where the
input probability is low. Such a behaviour is desired if the goal is to keep the
distortion (equation 2.28) low. The density of parameter vectors is, however, not
strictly proportional to the input signal probability (Ritter, 1991), which would
minimize the distortion.
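A sketch of one SOFM update step with a Gaussian neighbourhood function; the neighbourhood width sigma is an illustrative choice and, like the learning rate, is usually decreased during training.

```python
import numpy as np

def sofm_step(W, positions, x, alpha=0.1, sigma=1.0):
    """One SOFM update (equation 2.30) with a Gaussian neighbourhood h(i, j).

    W         : (n_units, N) prototype vectors
    positions : (n_units, d) coordinates of the units in the map grid"""
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    grid_dist = np.linalg.norm(positions - positions[winner], axis=1)
    h = np.exp(-grid_dist ** 2 / (2 * sigma ** 2))
    W += alpha * h[:, None] * (x - W)            # all units are updated
    return winner
```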
Higher level competitive learning
Competitive learning can also be used on a higher level in a more complex learning system. The function of the whole system is not necessarily based on unsu-
pervised learning. It can be trained using supervised or reinforcement learning.
But the system can be divided into subsystems that specialize on different parts of
the decision space. The subsystem that handles a certain part of the decision space
best will gain control over that part. An example is the adaptive mixtures of local
experts by Jacobs et al. (1991). They use a system with several local experts and
a gating network that selects among the output of the local experts. The whole
system uses supervised learning but the gating network causes the local experts to
compete and therefore to try to take responsibility for different parts of the input
space.
2.5.3 Mutual information based learning
The third principle of unsupervised learning is based on the concept of mutual
information. Mutual information is gaining increased attention in the signal processing community as well as among learning theorists and neural network researchers. The theory, however, dates back to 1948 when Shannon presented his
classic foundations of information theory (Shannon, 1948).
A piece of information theory
Consider a discrete random variable x:
x \in \{x_i\}, \quad i \in \{1, 2, \ldots, N\}.    (2.31)
(There is, in practice, no limitation in x being discrete since all measurements have
finite precision.) Let P(xk ) be the probability of x = xk for a randomly chosen x.
The information content in the vector (or symbol) xk is defined as
I(x_k) = \log\frac{1}{P(x_k)} = -\log P(x_k).    (2.32)
If base 2 is used for the logarithm, the information is measured in bits. The
definition of information has some appealing properties. First, the information is
0 if P(xk ) = 1; if the receiver of a message knows that the message will be xk ,
he does not get any information when he receives the message. Secondly, the
information is always positive. It is not possible to lose information by receiving a message. Finally, the information is additive, i.e. the information in two
independent symbols is the sum of the information in each symbol:
I(x_i, x_j) = -\log\big(P(x_i, x_j)\big) = -\log\big(P(x_i)P(x_j)\big) = -\log P(x_i) - \log P(x_j) = I(x_i) + I(x_j)    (2.33)

if x_i and x_j are statistically independent.
The information measure considers each instance of the stochastic variable
x but it does not say anything about the stochastic variable itself. This can be
accomplished by calculating the average information of the stochastic variable:
H(x) = \sum_{i=1}^{N} P(x_i)\,I(x_i) = -\sum_{i=1}^{N} P(x_i)\log\big(P(x_i)\big).    (2.34)
H (x) is called the entropy of x and is a measure of uncertainty about x.
Now, we introduce a second discrete random variable y, which, for example,
can be an output signal from a system with x as input. The conditional entropy
(Shannon, 1948) of x given y is
H(x|y) = H(x, y) - H(y).    (2.35)
The conditional entropy is a measure of the average information in x given that y
is known. In other words, it is the remaining uncertainty of x after observing y.
The average mutual information I(x, y) between x and y (Shannon (1948) originally used the term rate of transmission; the term mutual information was introduced later) is defined as the average information about x gained when observing y:
I(x, y) = H(x) - H(x|y).    (2.36)
The mutual information can be interpreted as the difference between the uncertainty of x and the remaining uncertainty of x after observing y. In other words, it
is the reduction in uncertainty of x gained by observing y. Inserting equation 2.35
into equation 2.36 gives
I(x, y) = H(x) + H(y) - H(x, y) = I(y, x)    (2.37)
which shows that the mutual information is symmetric.
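For discrete distributions, the entropy and mutual information defined above can be computed directly from a joint probability table, as in the following sketch:

```python
import numpy as np

def entropy(p):
    """Entropy (equation 2.34) in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_xy):
    """Mutual information (equation 2.37), I(x, y) = H(x) + H(y) - H(x, y),
    computed from a joint probability table p_xy."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy)
```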
Now let x be a continuous random variable. Then the differential entropy h(x)
is defined as (Shannon, 1948)
h(x) = -\int_{\mathbb{R}^N} p(x)\,\log p(x)\,dx,    (2.38)
where p(x) is the probability density function of x. The integral is over all dimensions in x. The average information in a continuous variable would of course
be infinite since there are an infinite number of possible outcomes. This can be
seen by calculating the discrete entropy definition (eq. 2.34) in the limit where x approaches a continuous variable:
H(x) = -\lim_{\delta x \to 0} \sum_{i=-\infty}^{\infty} p(x_i)\,\delta x\,\log\big(p(x_i)\,\delta x\big) = h(x) - \lim_{\delta x \to 0} \log \delta x,    (2.39)
where the last term approaches infinity when δx approaches zero (Haykin, 1994).
But since mutual information considers the difference in entropy, the infinite term
will vanish and continuous variables can be used to simplify the calculations. The
mutual information between the continuous random variables x and y is then
I(x, y) = h(x) + h(y) - h(x, y) = \int_{\mathbb{R}^N}\!\int_{\mathbb{R}^M} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}\,dx\,dy,    (2.40)
where N and M are the dimensionalities of x and y respectively.
Consider the special case of Gaussian distributed variables. The differential
entropy of an N-dimensional Gaussian variable z is
h(z) = \frac{1}{2}\log\big((2\pi e)^N |C|\big)    (2.41)
where C is the covariance matrix of z (see proof B.1.1 on page 153). This means
that the mutual information between two N-dimensional Gaussian variables is
I(x, y) = \frac{1}{2}\log\frac{|C_{xx}|\,|C_{yy}|}{|C|},    (2.42)

where

C = \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix}.
C_{xx} and C_{yy} are the within-set covariance matrices and C_{xy} = C_{yx}^T is the between-sets covariance matrix. For more details on information theory, see for example
Gray (1990).
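A sketch of equation 2.42, computing the mutual information directly from the covariance blocks (natural logarithm, so the result is in nats):

```python
import numpy as np

def gaussian_mutual_information(Cxx, Cyy, Cxy):
    """Mutual information (equation 2.42) between two jointly Gaussian
    variables, given the within-set covariances Cxx, Cyy and the
    between-sets covariance Cxy."""
    C = np.block([[Cxx, Cxy], [Cxy.T, Cyy]])
    return 0.5 * np.log(np.linalg.det(Cxx) * np.linalg.det(Cyy)
                        / np.linalg.det(C))
```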
Mutual information based learning
Linsker (1988) showed that Hebbian learning gives maximum mutual information between the input and the output in a simple case with a linear unit with noise
added to the output. In a more advanced model with several units, he showed that
there is a tradeoff between keeping the output signals uncorrelated and suppressing the noise. Uncorrelated output signals give more information (higher entropy)
on the output, but redundancy can help to suppress the noise. The principle of maximizing the information transferred from the input to the output is called the infomax principle by Linsker (1988).
Linsker has proposed a method, based on maximum mutual information, for
generating a topologically ordered feature map (Linsker, 1989). The map is similar to the SOFM mentioned in section 2.5.2 (page 27) but in contrast to the SOFM,
Linsker’s learning rule causes the distribution of input units to be proportional to
the input signal probability density.
Figure 2.3: The difference between infomax (a) and Imax (b): in (a) the mutual information I(x, y) between the input x and the output y is maximized, while in (b) the mutual information I(y_1, y_2) between the outputs of two units is maximized.
Bell and Sejnowski (1995) have used mutual information maximization to
perform blind separation of mixed unknown signals and blind deconvolution of a
signal convolved with an unknown filter. Actually, they maximize the entropy in
the output signal y rather than explicitly maximizing the mutual information between x and y. The results are, however, the same if there is independent noise in
the output but no known noise in the input5 . To see that, consider a system where
y = f (x) + η where η is an independent noise signal. The mutual information
between x and y is then
I(x, y) = h(y) - h(y|x) = h(y) - h(\eta),    (2.43)
where h(η) is independent of the parameters of f .
Becker and Hinton (1992) have used mutual information maximization in another way than Linsker and Bell and Sejnowski. Instead of maximizing the mutual
information between the input and the output they maximize the mutual information between the output of different units, see figure 2.3. They call this principle
Imax and have used it to estimate disparity in random-dot stereograms (Becker
and Hinton, 1992) and to detect depth discontinuities in stereo images (Becker
and Hinton, 1993). A good overview of Imax is given by Becker (1996).
Among other mutual information based methods of unsupervised learning are
Barlow’s minimum entropy coding that aims at minimizing the statistical dependence between the output signals (Barlow, 1989; Barlow et al., 1989; Földiák,
1990) and the Gmax algorithm (Pearlmutter and Hinton, 1986) that tries to detect
statistically dependent features in the input signal.
5 “No known noise” means that the input cannot be divided into a signal part x and a noise part η. The noise is an indistinguishable part of the input signal x.
The relation between mutual information and correlation
There is a clear relation between mutual information and correlation for Gaussian
distributed variables. Consider two one-dimensional random variables x and y.
Equations 2.42 and 2.25 then give

I(x, y) = \frac{1}{2}\log\left(\frac{\sigma_x^2 \sigma_y^2}{\sigma_x^2 \sigma_y^2 - \sigma_{xy}^2}\right) = \frac{1}{2}\log\left(\frac{1}{1 - \rho_{xy}^2}\right),    (2.44)

where \sigma_x^2 and \sigma_y^2 are the variances of x and y respectively, \sigma_{xy} is the covariance between x and y and \rho_{xy} is the correlation between x and y. The extension of this
relation to multidimensional variables is discussed in chapter 4.
This relationship means that for a single linear unit with Gaussian distributed
variables, the mutual information between the input and the output, i.e. the amount
of transferred information, is maximized if the correlation between the input and
the output is maximized.
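A quick numerical check, with arbitrary illustrative numbers, that the two forms in equation 2.44 agree:

```python
import numpy as np

# Illustrative numbers: verify that the covariance form and the
# correlation form of equation 2.44 give the same value.
var_x, var_y, cov_xy = 2.0, 1.5, 0.9
rho = cov_xy / np.sqrt(var_x * var_y)
I_from_cov = 0.5 * np.log(var_x * var_y / (var_x * var_y - cov_xy ** 2))
I_from_rho = 0.5 * np.log(1.0 / (1.0 - rho ** 2))
assert np.isclose(I_from_cov, I_from_rho)
```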
2.6 Comparisons between the three learning methods
The difference between supervised learning, reinforcement learning and unsupervised learning may seem very fundamental at first. But sometimes the distinction
between them is not so clear and the classification of a learning method can depend upon the view of the observer.
As we have seen in section 2.4, reinforcement learning can be implemented
as a supervised learning of the reward function. The output is then chosen as the
one giving the maximum value of the approximation of the reward function given
the present input.
Another way of implementing reinforcement learning is to use the output of
the system as the desired output in a supervised learning algorithm and weight the
update step with the reward (Williams, 1988).
Furthermore, supervised learning can emerge as a special case of reinforcement learning where the system is forced to give the desired output while receiving maximum reward. Also, a task for a supervised learning system can always be
reformulated to fit a reinforcement learning system simply by mapping the error
vectors to scalars, for example as a function of the norm of the error vectors.
Unsupervised learning, too, can sometimes be formulated as a supervised learning task. Consider, for example, the PCA algorithms (section 2.5.1) that find the
maximal eigenvectors of the distribution of x. For a single parameter vector w the
problem can be formulated as minimizing the difference between the signal x and
the output y = xT ŵ ŵ, i.e.
\frac{1}{2}\,E\left[\|x - x^T\hat{w}\,\hat{w}\|^2\right] = \frac{1}{2}\,E\left[x^T x - \hat{w}^T x x^T \hat{w}\right] = \frac{1}{2}\left(\mathrm{tr}(C) - \hat{w}^T C\,\hat{w}\right) = \frac{1}{2}\left(\sum_i \lambda_i - \hat{w}^T C\,\hat{w}\right),    (2.45)
where C is the covariance matrix of x (assuming x̄ = 0) and λi are the eigenvalues
of C. Obviously, the best choice of w is the maximal eigenvector of C. The output
is a reconstruction of x and the desired output is the same as the input. Other examples are the methods described in chapter 4 and by van der Burg (1988).
Finally, there is a similarity between all three learning principles in that they
all generally try to optimize a scalar measure of performance, for example mean
square error, accumulated reward, variance, or mutual information.
A good example illustrating how similar these three methods can be is the
prediction matrix memory in section 3.3.1.
2.7 Two important problems
There are some important fundamental problems in learning systems. One, called perceptual aliasing, concerns consistency in the internal representation of external states. Another, the credit assignment problem, concerns how the feedback is distributed in
the system during learning. These two problems are discussed in this section. A
third important problem is how to represent the information in a learning system,
which is discussed in chapter 3.
2.7.1 Perceptual aliasing
Consider a learning system that perceives the external world through a sensory
subsystem and represents the set of external states SE by an internal state representation set SI . This set can, however, rarely be identical to the real external
world state set SE . To assume a representation that completely describes the external world in terms of objects, their features and relationships, is unrealistic even
for relatively simple problem settings. Furthermore, the internal state is inevitably
limited by the sensor system, which leads to the fact that there is a many-to-many
mapping between the internal and external states. That is, a state s_e ∈ S_E in the external world can map into several internal states and, what is worse, an internal state s_i ∈ S_I could represent multiple external world states. This phenomenon has
been termed perceptual aliasing (Whitehead and Ballard, 1990a).
Figure 2.4 illustrates two cases of perceptual aliasing. One case is when two
external states s1e and s2e map into the same internal state s1i . An example is when
Figure 2.4: Two cases of perceptual aliasing. Two external states s1e and
s2e are mapped into the same internal state s1i and one external state s3e is
mapped into two internal states s2i and s3i .
two different objects appear as identical to the system. This is illustrated in view
1 in figure 2.5. The other case is when one external state s3e is represented by
two internal states s2i and s3i . This happens, for instance, in a system consisting of
several local adaptive models if two or more models happen to represent the same
solution to the same part of the problem.
Perceptual aliasing may cause the system to confound different external states
that have the same internal state representation. This type of problem can cause
a response generating system to make the wrong decisions. For example, let the
internal state si represent the external states sae and sbe and let the system generate
an action a. The expected reward for the decision (si ; a) to generate the action a
given the state si can now be estimated by averaging the rewards for that decision
accumulated over time. If sae and sbe occur approximately equally often and the
actual accumulated reward for (sae ; a) is greater than the accumulated reward for
(sbe ; a), the expected reward will be underestimated for (sae ; a) and overestimated
for (sbe ; a), leading to a non-optimal decision policy.
There are cases when the phenomenon is a feature, however. This happens
if all decisions made by the system are consistent. The reward for the decision
(si ; a) then equals the reward for all corresponding actual decisions (ske ; a), where
k is an index for this set of decisions. If the mapping between the external and
internal worlds is such that all decisions are consistent, it is possible to collapse
a large actual state space into a small one where situations that are invariant to
the task at hand are mapped onto one single situation in the representation space.
For a system operating in a large decision space, such a strategy is in fact necessary in order to reduce the number of different states. The goal is then to find a
representation of the decision space such that consistent decisions can be found.
The simplest example of such a deliberate perceptual aliasing is quantization. If
Figure 2.5: Avoiding perceptual aliasing by observing the environment
from another direction.
the quantization is properly designed, the decisions will be consistent within each
quantized state.
Whitehead and Ballard (1990b) have presented a solution to the problem of
perceptual aliasing for a restricted class of learning situations. The basic idea is to
detect inconsistent decisions by monitoring the estimated reward error, since the
error will oscillate for inconsistent decisions as discussed above. When an inconsistent decision is detected, the system is guided (e.g. by changing its direction
of view) to another internal state uniquely representing the desired external state.
In this way, more actions will produce consistent decisions (see figure 2.5). The
guidance mechanisms are not learned by the system. This is noted by Whitehead
who admits that a dilemma is left unresolved:
“In order for the system to learn to solve a task, it must accurately
represent the world with respect to the task. However, in order for
the system to learn an accurate representation, it must know how to
solve the task.”
The issue of information representation is further discussed in chapter 3.
2.7.2 Credit assignment
In all complex control systems, there probably exists some uncertainty about how to distribute credit (or blame) for the control actions taken. This uncertainty is called the credit assignment problem (Minsky, 1961, 1963). Consider, for example, a political system. Is it the trade policy or the financial policy that deserves credit
for the increasing export? We may call this a structural credit assignment problem. Is it the current government or the previous one that deserves credit or blame
for the economic situation? This is a temporal credit assignment problem. Is it
the management or the staff that should be given credit for the financial result in
a company? This is what we may call a hierarchical credit assignment problem.
These three types of credit assignment problems are also encountered in the type
of control systems considered here, i.e. learning systems.
The structural credit assignment problem occurs, for instance, in a neural network when deciding which weights to alter in order to achieve an improved performance. In supervised learning, the structural credit assignment problem can
be handled by using back-propagation (Rumelhart et al., 1986) for instance. The
problem becomes more complicated in reinforcement learning where only a scalar
feedback is available. In section 3.4, a description is given of how the structural
credit assignment problem can be handled by the use of local adaptive models.
The temporal credit assignment problem occurs when a system acts in a dynamic environment and a sequence of actions is performed. The problem is to decide which of the actions taken deserves credit for the result. Obviously, it is not
certain that it is the final action taken that deserves all the credit or blame. (For example, consider the situation when the losing team in a football game scores a goal
during the last seconds of the game. It would not be clever to blame the person
who scored that goal for the loss of the game.) The problem becomes especially
complicated in reinforcement learning if the reward occurs infrequently. The temporal credit assignment problem is thoroughly investigated by Sutton (1984).
Finally, the hierarchical credit assignment problem can occur in a system consisting of several levels. Consider, for example, the Adaptive mixtures of local
experts (Jacobs et al., 1991). That system consists of two levels. On the lower
level, there are several subsystems that specialize on different parts of the input
space. On the top level, there is a supervisor that selects the proper subsystem for
a certain input. If the system makes a bad decision, it can be difficult to decide if
it was the top level that selected the wrong subsystem or if the top level made a
correct choice but the subsystem that generated the response made a mistake. This
problem can of course be regarded as a type of structural credit assignment problem, but to emphasize the difference we call it a hierarchical credit assignment
problem. Once the hierarchical credit assignment problem is solved and it is clear
on what level the mistake was made, the structural credit assignment problem can
be dealt with to alter the behaviour on that level.
Chapter 3
Information representation
A central issue in the design of learning systems is the representation of information in the system. The algorithms treated in this work can be seen as signal
processing systems, in contrast to AI or expert systems that have symbolic representations (by “symbolic”, a more abstract representation is meant than just a digitization of the signal; a digital signal processing system is still a signal processing system). We may refer to the representation used in the signal processing systems as a continuous representation while the symbolic approach can be said to
use a string representation. Examples of the latter are the Lion Algorithm (Whitehead and Ballard, 1990a), the Reinforcement Learning Classifier Systems (Smith
and Goldberg, 1990) and the MENACE example in section 2.4. The genetic algorithms that were described in section 2.4.3 are perhaps the most obvious examples
of string representation in biological reinforcement learning systems.
The main difference between the two approaches is that a continuous representation has an implicit metric, i.e. there is a continuum of states and there exist
meaningful interpolations between different states. One can say that two states are
more or less similar. Interpolations are important in a learning system since they
make it possible for the system to make decisions in situations never experienced
before. This is often referred to as generalization. In a string representation there
is no implicit metric, i.e. there is no unambiguous way to tell which of two strings
is more similar to a third string than the other. There are, however, also advantages with string representations. Today’s computers, for example, are designed to
work with string representations and have difficulties in handling continuous information in an efficient way. A string representation also makes it easy to include
a priori knowledge in terms of explicit rules.
An approach that can be seen as a mix of symbolic representation and continuous representation is fuzzy logic (Zadeh, 1968, 1988). The symbolic expressions
in fuzzy logic include imprecise statements like “many”, “close to”, “usually”,
etc. This means that statements need not be true or false; they can be somewhere
in between. This introduces a kind of metric and interpolation is possible (Zadeh,
1988). Lee and Berenji (1989) describe a rule-based fuzzy controller using reinforcement learning that solves the pole balancing problem.
Ballard (1990) suggests that it is unreasonable to suppose that peripheral motor and sensory activity are correlated in a meaningful way. Instead, it is likely
that abstract sensory and motor representations are built and related to each other.
Also, combined sensory and motor information must be represented and used in
the generation of new motor activity. This implies a learning hierarchy and that
learning occurs on different temporal scales (Granlund, 1978, 1988; Granlund and
Knutsson, 1982, 1983, 1990). Hierarchical learning system designs have been
proposed by several other researchers (e.g. Jordan and Jacobs, 1994).
Both approaches (signal and symbolic) described on the preceding page are
probably important, but on different levels in hierarchical learning systems. On a
low level, the continuous representation is probably to prefer since signal processing techniques have the potential of being faster than symbolic reasoning as they
are easier to implement with analogue techniques. On a low level, interpolations
are meaningful and desirable. In a simple control task for instance, consider two
similar stimuli s1 and s2 (similar meaning here that they are relatively close to each other in the given metric compared to the variance of the stimulus distribution) which have the optimal responses r1 and r2 respectively.
For a novel stimulus s3 located between s1 and s2 , the response r3 could, with
large probability, be assumed to be in between r1 and r2 .
On a higher level, on the other hand, a more symbolic representation may be
needed to facilitate abstract reasoning and planning. Here, the processing speed is
not as crucial and interpolation may not even be desirable. Consider, for instance,
the task of passing a tree. On a low level, the motor actions are continuous and
meaningful to interpolate and they must be generated relatively fast. The higher
level decision on which side of the tree to pass is, however, symbolic. Obviously,
it is not successful to interpolate the two possible alternatives of “walking to the
right” and “walking to the left”. Also, there is more time to make this decision
than to generate the motor actions needed for walking.
The choice of representation can be crucial for the ability to learn. Geman
et al. (1992) argue that
“the fundamental challenges in neural modelling are about representation rather than learning per se.”
Furthermore, Hertz et al. (1991) present a simple but illustrative example to emphasize the importance of the representation of the input to the system. Two tasks
are considered: the first one is to decide whether or not the input is an odd number;
the second is to decide if the input has an odd number of prime factors. If the input
has a binary representation, the first task is extremely simple: the system just has
to look at the least significant bit. The second task, however, is very difficult. If
the base is changed to 3, for instance, the first task will be much harder. And if
the input is represented by its prime factors, the second task will be easier. Hertz
et al. (1991) also prove an obvious (and, as they say, silly) theorem:
“learning will always succeed, given the right preprocessor.”
In the discussion above, representation of two kinds of information is actually
treated: the information entering the system as input signals (signal representation) and the information in the system about how to behave, i.e. knowledge
learned by the system (model representation). The representations of these two
kinds of information are, however, closely related to each other. As we will see, a
careful choice of input signal representation can allow for a very simple representation of knowledge.
In the following section, a special type of signal representation called the
channel representation is presented. It is a representation that is biologically inspired and which has several computational advantages. The later sections will
deal more with model representations. Probably the most well-known class of model representations among learning systems, neural networks, is presented in section 3.2. They can be seen as global non-linear models. Section 3.3 shows how the channel representation makes it possible to use a simple linear model. Section 3.4 argues that low-dimensional linear models are sufficient if they
are local enough and the adaptive distribution of such models is briefly discussed
in section 3.5. The chapter ends with simple examples of reinforcement learning
systems solving the same problem but with different representations.
3.1 The channel representation
As has been discussed above, the internal representation of information may play
a decisive role for the performances of learning systems. The representation that
is intuitively most obvious in a certain situation, for example a scalar t for temperature or a three dimensional vector p = (x y z)T for a position in space, is, however, in some cases not a very good way to represent information. For example,
consider an orientation in R^2 which can be represented by an angle ϕ ∈ [−π, π] relative to a fixed orientation, for example the x-axis. While this may appear as a very natural representation of orientation, it is in fact not a very good one since it has a discontinuity at π, which means that an orientation average cannot be
consistently defined (Knutsson, 1989).
Another, perhaps more natural, way of representing information is the channel
representation (Nordberg et al., 1994; Granlund, 1997). In this representation, a
set of channels is used where each channel is sensitive to some specific feature
value in the signal, for example a certain temperature ti or a certain position pi . In
the example above, the orientation in R2 could be represented by a set of channels
evenly spread out on the unit circle, as proposed by Granlund (1978). If three
channels of the shape
c_k = \cos^2\left(\tfrac{3}{4}(\varphi - p_k)\right),    (3.1)

where p_1 = \frac{2\pi}{3}, p_2 = 0 and p_3 = -\frac{2\pi}{3}, are used (Knutsson, 1982), the orientation
can be represented continuously by the channel vector c = (c1 c2 c3 )T which
has a constant norm for all orientations. The reason to call this a more natural
representation than for instance the angle ϕ, is that the channel representation is
frequently used in biological systems, where each nerve cell responds strongly to
a specific feature value. One example of this is the orientation sensitive cells in the
primary visual cortex (Hubel and Wiesel, 1959; Hubel, 1988). This representation
is called value encoding by Ballard (1987) who contrasts it with variable encoding
where the activity is monotonically increasing with some parameter.
Theoretically, the channels can be designed so that there is one channel for
each feature value that can occur. A function of these feature values would then be
implemented simply as a look-up table. In practice, however, the range of feature
values is often continuous (or at least quantized finely enough to be considered
continuous). Each channel can be seen as a response of a filter that is tuned to
some specific feature value. The coding is then designed so that the channel has
its maximum value (for example one) when the feature and the filter are exactly
tuned to each other, and decreases to zero in a smooth way as the feature and the
filter become less similar. This is similar to the magnitude representation proposed
by Granlund (1989).
The channel representation increases the number of dimensions in the representation. It should, however, be noted that an increase in the dimensionality does
not have to lead to increased complexity of the learning problem. A great advantage of the channel representation is that it allows for simple processing structures.
To see this, consider any continuous function y = f (x). If x is represented by a sufficiently large number of channels ck of a suitable form, the output y can simply be
calculated as a weighted sum of the input channels, y = w^T c, however complicated
the function f may be. This implies that by using a channel representation, linear
operations can be used to a great extent; this fact is used further in this chapter.
It is not obvious how to choose the shape of the channels. Consider, for example, the coding of a variable x into channels. According to the description above,
each channel is positive and has its maximum for one specific value of x and it
decreases smoothly to zero away from this maximum. In addition, to enable representation of all values of x in an interval, there must be overlapping channels on
Figure 3.1: A set of cos² channels. Only three channels are activated simultaneously. The sum of the squared channel outputs c_{k−1}, c_k and c_{k+1} is drawn with a dotted line.
this interval. It is also convenient if the norm of the channel vector is constant so
that the feature value is only represented by the orientation of the channel vector.
This enables the use of the scalar product for calculating the similarity between
values. It also makes it possible to use the norm of the channel vector to represent
some other entity related to the measurement, for instance the energy or the certainty of the measurement. One channel form that fulfils the requirements above
is:
c_k = \begin{cases} \cos^2\!\left(\frac{\pi}{3}(x - k)\right) & |x - k| < \frac{3}{2} \\ 0 & \text{otherwise} \end{cases}    (3.2)
(see figure 3.1). This set of channels has a constant norm (see proof B.2.1 on page
154). It also has a constant square sum of its first derivatives (see proof B.2.2 on
page 155) (Knutsson, 1982, 1985). This means that a change ∆x in x always gives
the same change ∆c in c for any x. Of course, not only scalars can be coded into
vectors with constant norm. Any vector v in a vector space of (N − 1) dimensions
can be transformed into the orientation of a unit-length vector in an N-dimensional
space. This was used, for example, by Denoeux and Lengellé (1993) in order to
keep the norm of the input vectors constant and equal to one while preserving all
the information. By using this new input representation, a scalar product could be
used for calculating the similarity between the input vectors and a set of prototype
vectors.
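A sketch of the channel coding in equation 3.2 and of the claim above that a non-linear function can be realized as a linear function of the channel values. The channel positions, the target function and the number of samples are arbitrary illustrative choices.

```python
import numpy as np

def channel_encode(x, centres):
    """Encode a scalar x into cos^2 channels of the form in equation 3.2.
    The channel centres are assumed to be spaced one unit apart."""
    d = x - centres
    return np.where(np.abs(d) < 1.5, np.cos(np.pi * d / 3.0) ** 2, 0.0)

centres = np.arange(-2, 13)                      # channels covering [0, 10]
xs = np.linspace(0.0, 10.0, 200)
C = np.array([channel_encode(x, centres) for x in xs])
print(np.ptp(np.linalg.norm(C, axis=1)))         # spread of the norms, (close to) zero

y = np.sin(xs) + 0.1 * xs ** 2                   # an arbitrary non-linear target
w, *_ = np.linalg.lstsq(C, y, rcond=None)        # linear readout y ~ w^T c
print(np.max(np.abs(C @ w - y)))                 # readout error (small for smooth f)
```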
The channel vectors described above only exist in a small limited number of
dimensions at a time; the channels in all other dimensions are zero. The number
of simultaneously active channels is called local dimensionality. In the example
in figure 3.1, the local dimensionality is three. This means that the vector moves
along a curve as in figure 3.2 (left) as x changes. If we look at channels far apart,
Figure 3.2: Left: The curve along which a channel vector can move in
a subspace spanned by three neighbouring channels. The broken part of
the curve illustrates the proceeding of the vector into other dimensions.
Right: The possible channel vectors viewed in a subspace spanned by
three distant non-overlapping channels.
only one of these channels is active at a time (figure 3.2, right); the activity is
local. We call this type of channel vector a pure channel vector. The pure channel
vector can be seen as an extreme of the sparse distributed coding (Field, 1994).
This is a coding that represents data with a minimum number of active units in
contrast to compact coding that represents data with a minimum number of units.
In general, the input to a system cannot be a pure channel vector. Consider,
for example, a system that uses visual input, i.e. images. It is obvious that the
dimensionality of the space of pure channel vectors that can represent all images
would be far too large to be of practical interest. The input should rather consist
of many sets of channels where each set measures a local property in the image,
for example local orientation. Each set can be a pure channel vector, but the total
input vector, consisting of several concatenated pure channel vectors, will not
only have local activity. We call this type of vector, which consists of many sets
of channels, a mixed channel vector.
The use of mixed channel vectors is not only motivated by limited processing
capacity. Consider, for example, the representation of a two-dimensional variable
x = (x1 x2 )T . We may represent this variable with a pure channel vector by distributing on the X -plane overlapping channels that are sensitive to different xi , as
in figure 3.3 (left). Another way is to represent x with a mixed channel vector by
using two sets of channels as in figure 3.3 (right). Here, each set is only sensitive
Figure 3.3: Left: Representation of a two-dimensional variable with one
set of channels that constitute a pure channel vector. Right: Representation of the same variable with two sets of channels that together form a
mixed channel vector.
to one of the two parameters x1 and x2 and it does not depend on the other parameter at all; the channel vector c1 on the x1 -axis is said to be invariant with respect to
x2 . Invariance can be seen as a deliberate perceptual aliasing as discussed in section 2.7.1. If x1 and x2 represent different properties of x, for instance colour and
size, the invariance can be a very useful feature. It makes it possible to observe
one property independently of the others by looking at a subset of the channels.
Note, however, that this does not mean that all multidimensional variables should
be represented by mixed channel vectors. If, for example, (x1 x2 )T in figure 3.3
represents the two-dimensional position of a physical object, it does not seem useful to see the x1 and x2 positions as two different properties. In this case, the pure
channel vector (left) might be a proper representation.
The use of mixed channel vectors offers another advantage compared to using
the original variables, namely the simultaneous representation of properties which
belong to different objects. Consider a one-dimensional variable x representing a
position of an object along a line and compare this with a channel vector c representing the same thing. Now, if two objects occur at different positions, a mixed
channel vector allows for the positions of both objects to be represented. This is
obviously not possible when using the single variable x. Note that the mixed channel vector discussed here differs from the one described previously which consists
of two or more concatenated pure channel vectors. In that case, the mixed channel
vector represents several features and one instance of each feature. In the case of
representing two or more positions, the mixed channel vector represents several
Figure 3.4: The basic neuron. The output y is a non-linear function f of a
weighted sum of the inputs x.
instances of the same feature, i.e. multiple events. Both representations are, however, mixed channel vectors in the sense that they can have simultaneous activity
on channels far apart as opposed to pure channel vectors.
3.2 Neural networks
Neural networks are perhaps the most popular and well-known implementations
of artificial learning systems. The concept is so popular that it is often used synonymously with machine learning, which can sometimes be a bit misleading. There
is no unanimous definition of neural networks, but they are usually characterized by a large number of massively connected relatively simple processing units.
Learning capabilities are often understood even if they are not explicit. One could
of course imagine a hard-wired neural network incapable of learning. Neural networks can be seen as global parameterized non-linear models.
The processing units in a neural network are often called neurons (hence, the
name neural network) since they were originally designed as models of the nerve
cells (neurons) in the brain. In figure 3.4, an artificial neuron is illustrated. This
basic model of an artificial neuron was proposed by McCulloch and Pitts (1943)
where the non-linear function f was a Heaviside (unit step) function, i.e.
f(x) = \begin{cases} 0 & x < 0 \\ 1/2 & x = 0 \\ 1 & x > 0 \end{cases}    (3.3)
An example of a neural network is the two-layer perceptron illustrated in figure
3.5 which consists of neurons like the one described above connected in a feedforward manner. The neural network is a parameterized model and the parameters
Figure 3.5: A two-layer perceptron with a two-dimensional input and a
three-dimensional output.
are often called weights. Rosenblatt (1962) presented a supervised learning algorithm for a single layer perceptron. Later, however, Minsky and Papert (1969)
showed that a single layer perceptron failed to solve even some simple problems, for example the Boolean exclusive-or function. While it was known that
a three-layer perceptron can represent any continuous function, Minsky and Papert doubted that a learning method for a multi-layer perceptron would be possible
to find. This finding almost extinguished the interest in neural networks for nearly
two decades until the 1980s when learning methods for multi-layer perceptrons
were developed. The most well-known method is back-propagation presented in
a Ph.D. thesis by Werbos (1974) and later presented by Rumelhart et al. (1986).
The solution to the problem of how to update a multi-layer perceptron was
to replace the Heaviside function (equation 3.3) with a differentiable nonlinear
function, usually a sigmoid function. Examples of common sigmoid functions are
f (x) = tanh(x) and the Fermi function:
f(x) = \frac{1}{1 + e^{-x}}.    (3.4)
The sigmoid function can be seen as a basis function for the internal representation in the network. Another choice of basis function is the radial-basis function
(RBF), for example a Gaussian, that is used in the input layer in RBF networks
(Broomhead and Lowe, 1988; Moody and Darken, 1989). The RBFs can be seen
as a kind of channel representation.
The feed-forward design in figure 3.5 is, of course, not the only possible arrangement of neurons in a neural network. It is also possible to have connections
from the output back to the input, so-called recurrent networks. Two famous examples of recurrent networks are the Hopfield network (Hopfield, 1982) and the
Boltzmann machine (Hinton and Sejnowski, 1983, 1986).
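A minimal, self-contained sketch in Python/NumPy of a two-layer perceptron trained with back-propagation on the exclusive-or problem mentioned above could look as follows; the sigmoid of equation 3.4 is used as the non-linearity, and the network size, learning rate and number of iterations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# The Boolean exclusive-or function, which a single-layer perceptron cannot solve.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Two-layer perceptron: 2 inputs -> 4 hidden units -> 1 output, with bias weights.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
eta = 0.5  # learning rate (illustrative value)

for _ in range(20000):
    h = sigmoid(X @ W1 + b1)          # forward pass, hidden layer
    y = sigmoid(h @ W2 + b2)          # forward pass, output layer
    d2 = (y - T) * y * (1 - y)        # back-propagated error signal, output layer
    d1 = (d2 @ W2.T) * h * (1 - h)    # back-propagated error signal, hidden layer
    W2 -= eta * h.T @ d2; b2 -= eta * d2.sum(0)
    W1 -= eta * X.T @ d1; b1 -= eta * d1.sum(0)

print(y.ravel())   # typically close to the XOR targets 0, 1, 1, 0
```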
3.3 Linear models
While neural networks are non-linear models, it could sometimes be sufficient
to use a linear model, especially if the representation of the input to the system
is chosen carefully. As mentioned above, the channel representation makes it
possible to realize a rather complicated function as a linear function of the input
channels. In fact, the RBF networks can be seen as a hidden layer creating a
channel representation followed by an output layer implementing a linear model.
In this section, a linear model for reinforcement learning called the prediction
matrix memory is presented.
3.3.1 The prediction matrix memory
In this subsection, a system that is to learn to produce an output channel vector q
as a function of an input channel vector v is described. The functions considered
here are continuous functions of a pure channel vector (see page 42) or functions
that are dependent on one property while invariant with respect to the others in a
mixed channel vector; in other words, functions that can be realized by letting the
output channels be linear combinations of the input channels. We call this type of function a first-order function.³ The order can be seen as the number of events
in the input vector that must be considered simultaneously in order to define the
output. In practice, this means that, for instance, a first-order function does not
depend on any relation between different events; a second-order function depends
on the relation between no more than two events and so on.
Consider a first-order system which is supplied with an input channel vector
v and which generates an output channel vector q. Suppose that v and q are pure
channel vectors. If there is a way of defining a scalar r (the reinforcement) for each decision (v, q) (i.e. input-output pair), the function r(v, q) is a second-order function. The tensor space Q ⊗ V that contains the outer products qv^T we call the outer product decision space. In this space, the decision (v, q) is one event. Hence, r can be calculated as a first-order function of the outer product qv^T.
In practice, the system will, of course, handle a finite number of overlapping
channels and r will only be an approximation of the reward. But if the reward
function is continuous, this approximation can be made arbitrarily good by using
a sufficiently large set of channels.
³ This concept of order has similarities to the one defined by Minsky and Papert (1969). In their discussion, the inputs are binary vectors, which of course can be seen as mixed channel vectors with non-overlapping channels.
Figure 3.6: The reward prediction p for a certain stimulus-response pair (v, q) viewed as a projection onto W in Q ⊗ V.
Learning the reward function
If supervised learning is used, the linear function could be learned by training a weight vector w̃_i for each output channel q_i so that q_i = w̃_i^T v. This could be done by minimizing some error function, for instance

$$ E = E\big[\, \|q - \tilde{q}\|^2 \,\big], \qquad (3.5) $$
where q̃ is the correct output channel vector supplied by the teacher. This means, for the whole system, that a matrix W̃ is trained so that a correct output vector is generated as

$$ q = \tilde{W} v = \begin{pmatrix} \vdots \\ \tilde{w}_i^T v \\ \vdots \end{pmatrix}. \qquad (3.6) $$
In reinforcement learning, however, the correct output is unknown; only a
scalar r that is a measure of the performance of the system is known (see section 2.4 on page 12). But the reward is a function of the stimulus and the response,
at least if the environment is not completely stochastic. If the system can learn this
function, the best response for each stimulus can be found. As described above,
the reward function for a first-order system can be approximated by a linear combination of the terms in the outer product qvT . This approximation can be used as
a prediction p of the reward and is calculated as
$$ p = \langle W \,|\, qv^T \rangle, \qquad (3.7) $$
see figure 3.6. The matrix W is therefore called a prediction matrix memory.
The reward function can be learned by modifying W in the same manner as in
supervised learning, but here with the aim to minimize the error function
$$ E = E\big[\, |r - p|^2 \,\big]. \qquad (3.8) $$
Now, let each triple (v, q, r) of stimulus, response, and reward denote an experience. Consider a system that has been subject to a number of experiences. How
should a proper response be chosen by the system? The prediction p in equation
3.7 can be rewritten as
$$ p = q^T W v = \langle q \,|\, Wv \rangle. \qquad (3.9) $$
Due to the channel representation, the actual output is completely determined by
the direction of the output vector. Hence, we can regard the norm of q as fixed
and try to find an optimal direction of q. The q that gives the highest predicted
reward obviously has the same direction as Wv. Now, if p is a good prediction of
the reward r for a certain stimulus v, this choice of q would be the one that gives
the highest reward. An obvious choice of the response q is then
$$ q = Wv, \qquad (3.10) $$
which is the same first-order function as W̃ suggested for supervised learning in equation 3.6. Since q is a function of the input v, the prediction can be calculated directly from the input. Equations 3.9 and 3.10 together give the
prediction as
$$ p = (Wv)^T Wv = \|Wv\|^2. \qquad (3.11) $$
Now we have a very simple processing structure (essentially a matrix multiplication) that can generate proper responses and predictions of the associated
rewards for any first-order function.
This structure is similar to the learning matrix or correlation matrix memory
described by Steinbuch and Piske (1963) and later by Anderson (1972, 1983) and
by Kohonen (1972, 1989). The correlation matrix memory is a kind of linear associative memory that is trained with a generalization of Hebbian learning (Hebb,
1949). An associative memory maps an input vector a to an output vector b, and
the correlation matrix memory stores this mapping as a sum of outer products:
$$ M = \sum b a^T. \qquad (3.12) $$
The stored patterns are then retrieved as
$$ b = Ma, \qquad (3.13) $$
which is equal to equation 3.10. The main difference is that in the method described here, the correlation strength is retrieved and used as a prediction of the
reward. Kohonen (1972) has investigated the selectivity and tolerance with respect
to destroyed connections in the correlation matrix memories.
The training of the matrix W is a very simple algorithm. For a certain experience (v, q, r), the prediction p should, in the optimal case, equal r. This means that the aim is to minimize the error in equation 3.8. The desired weight matrix W′ would yield a prediction

$$ p' = r = \langle W' \,|\, qv^T \rangle. \qquad (3.14) $$
Since this is a linear problem, it could be tempting to solve it analytically. This
could be done recursively using the recursive least squares (RLS) method (Ljung,
1987). The problem is that RLS involves the estimation and inversion of a p × p matrix where p = dim(q) · dim(v). Since the dimensionalities of q and v are high
in general due to the channel representation, RLS is not a very useful tool in this
case. Instead, we use stochastic gradient search (see section 2.3.1 on page 10) to
find W′. From equations 3.7 and 3.8 we get the error
$$ \varepsilon = \big| r - \langle W \,|\, qv^T \rangle \big|^2 \qquad (3.15) $$
and the gradient is

$$ \frac{\partial \varepsilon}{\partial W} = -2(r - p)\, qv^T. \qquad (3.16) $$
To minimize the error, W should be changed a certain amount a in the direction qv^T, i.e.

$$ W' = W + a\, qv^T. \qquad (3.17) $$
Equation 3.14 now gives that

$$ r = p + a \|q\|^2 \|v\|^2 \qquad (3.18) $$

(see proof B.2.3 on page 156), which gives

$$ a = \frac{r - p}{\|q\|^2 \|v\|^2}. \qquad (3.19) $$
To perform stochastic gradient search (equation 2.6 on page 11), we change
the parameter vector a small step in the negative gradient direction for each iteration. The update rule therefore becomes
$$ W(t+1) = W(t) + \Delta W(t), \qquad (3.20) $$
where

$$ \Delta W = \alpha \frac{r - p}{\|q\|^2 \|v\|^2}\, qv^T, \qquad (3.21) $$
where α is the update factor (0 < α ≤ 1) (see section 2.3.2 on page 11). If the
channel representation is chosen so that the norm of the channel vectors is constant
and equal to one, this equation is simplified to
$$ \Delta W = \alpha (r - p)\, qv^T. \qquad (3.22) $$
Here, the difference between this method and the correlation matrix memory becomes clearer. The learning rule in equation 3.12 corresponds to that in
equation 3.22 with α(r − p) = 1. The prediction matrix W in equation 3.22 will
converge when r = p, while the correlation matrix M in equation 3.12 would grow
for each iteration unless a normalization procedure is used.
Here, we can see how reinforcement learning and supervised learning can be
combined, as mentioned in section 2.6. By setting r = p + 1 and α = 1 we get the
update rule for the correlation matrix memory in equation 3.12, and with r = 1
we get a correlation matrix memory with a converging matrix. This means that
if the correct response is known, it can be learned using supervised learning by
forcing the output to the correct response and setting the parameters α = 1 and
r = 1 or r = p + 1. When the correct response is not known, the system is let to
produce the response and the reinforcement learning algorithm described above
can be used.
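As an illustration, a minimal sketch in Python/NumPy of the response, prediction and update in equations 3.10, 3.11 and 3.20–3.21 could look as follows; the dimensions and initial values are arbitrary, and the exploration noise that is normally added to the response is left out.

```python
import numpy as np

def pmm_response(W, v):
    """Equation 3.10: the response channel vector q = Wv."""
    return W @ v

def pmm_prediction(W, v):
    """Equation 3.11: the predicted reward p = ||Wv||^2 for that response."""
    q = W @ v
    return q @ q

def pmm_update(W, v, q, r, p, alpha=0.1):
    """Equations 3.20-3.21: move W towards the experienced reward r."""
    return W + alpha * (r - p) / ((q @ q) * (v @ v)) * np.outer(q, v)

# One experience (v, q, r) with arbitrary values.
rng = np.random.default_rng(0)
W = 0.1 * rng.random((5, 4))                 # prediction matrix memory
v = rng.random(4); v /= np.linalg.norm(v)    # input channel vector
q = pmm_response(W, v)                       # response, equation 3.10
p = q @ q                                    # predicted reward, equation 3.11
r = 1.0                                      # reward received from the environment
W = pmm_update(W, v, q, r, p)                # one learning step
```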
Relation to Q-learning
The description above of the learning algorithm assumed a reinforcement signal as a feedback to the system for each single decision (i.e. stimulus-response
pair). This is, however, not necessary. Instead of learning the instantaneous reward function r(x, y), the system can be trained to learn the Q-function Q(x, y) (equation 2.13 on page 19), which can be written as

$$ Q(x(t), y(t)) = r(x(t), y(t)) + \gamma Q(x(t+1), y(t+1)), \qquad (3.23) $$

where γ is a prediction decay factor (0 < γ ≤ 1) that makes the predicted reinforcement decay as the distance from the actual rewarded state increases. Now
the right-hand side of equation 3.23 can be used instead of r in equation 3.22 as
the desired prediction. This gives
$$ \Delta W = \alpha \big( r(t) + \gamma p(t+1) - p(t) \big)\, qv^T. \qquad (3.24) $$
This means that the system can handle dynamic problems with infrequent
reinforcement signals by maximizing the long-term reward function.
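A minimal sketch of the corresponding update (Python/NumPy), assuming unit-norm channel vectors as in equation 3.22; the reward, the predictions and the learning parameters are placeholder values.

```python
import numpy as np

def pmm_q_update(W, v, q, r, p_now, p_next, alpha=0.1, gamma=0.9):
    """Equation 3.24: use r(t) + gamma*p(t+1) as the target instead of the
    instantaneous reward (unit-norm channel vectors assumed)."""
    return W + alpha * (r + gamma * p_next - p_now) * np.outer(q, v)
```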
In one sense, this system is better suited for the use of TD-methods than the
systems mentioned in section 2.4.1 on page 17, since they have to use separate
subsystems to calculate the predicted reinforcement. With the algorithm suggested here, this prediction is calculated by the same system as the response.
3.4 Local linear models
Global linear models (e.g. the prediction matrix memory) can of course not be
used for all problems. The number of dimensions required for a pure channel
representation would in general be far too high. But a global non-linear model
(e.g. a neural network) is in general not a solution. The number of parameters in
a global non-linear model would be far too high to estimate with low variance using a reasonable number of samples. The rescue in this situation
is that we generally do not need a global model at all.
Consider a system with a visual input consisting only of a binary⁴ image with 8 × 8 pixels (picture elements), which is indeed a limited visual sensor. There are 2⁶⁴ > 10¹⁹ possible different binary 8 × 8 images. If they were displayed with a
frame rate of 50 frames per second, it would take about 10 billion years to view
them all, a period of time that is about the same as the age of the universe!
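The estimate is easy to verify with a quick back-of-the-envelope computation:

```python
images = 2 ** 64                          # distinct binary 8x8 images
seconds = images / 50                     # viewing them at 50 frames per second
years = seconds / (365.25 * 24 * 3600)
print(f"{years:.2e} years")               # roughly 1.2e10, i.e. about ten billion years
```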
It is quite obvious that most of the possible events in a high-dimensional space
will never occur during the lifetime of a system. In fact, only a very small fraction
of the signal space will ever be visited by the signal. Furthermore, the environment that causes the input signals is limited by the dynamics of the outside world, and these dynamics put restrictions on how the input signal can move. This means
that the high-dimensional input signal will move on a low-dimensional subspace
(Landelius, 1997) and we do not have to search for a global model for the whole
signal space (at least if a proper representation is used).
The low dimensionality can intuitively be understood if we consider a signal
consisting of N frequency components. Such a signal can span at most a 2N-dimensional space since each frequency component defines an ellipse and hence
spans at most a two-dimensional plane (Johansson, 1997) (see proof B.2.4 on
page 156). In the case of images, this is expressed in the assumption of local
one-dimensionality (Granlund, 1978; Granlund and Knutsson, 1995):
“The content within a window, measured at a sufficiently small bandwidth, will as a rule have a single dominant component.”
⁴ Binary, in this case, means that each pixel can only have two different values, e.g. black or white.
By this reasoning, it is sufficient to have a model or a set of models that covers the manifold where the signal exists (Granlund and Knutsson, 1990). If the
signal manifold is continuous in space and time (which is reasonable due to the
dynamic of the outside world), the low-dimensional manifold could locally be
approximated with a linear subspace (Bregler and Omohundro, 1994; Landelius,
1997).
Since we are dealing with learning systems, the local models should be adaptive. In this context, low-dimensional linear local models have several advantages.
First of all, the number of parameters in a low-dimensional linear model is low,
which reduces the number of samples needed for estimating the model compared
to a global model. This is necessary since the locality constraint limits the number
of available samples. Moreover, the locality reduces the spatial credit assignment
problem (section 2.7.2, page 35) since the adaptation of one local model will in
general not have any major effects on the other models (Baker and Farell, 1992).
How the local linear models should be chosen, i.e. according to what criteria
the models’ adaptation should be optimized, depends of course on the task. A
method for estimating local linear models for four different criteria is presented
in chapter 4.
3.5 Adaptive model distribution
In the previous section it was argued that the signal distribution in a learning system
with high-dimensional input should be modelled with local adaptive models. This
raises the question of how to distribute these local models. The simplest way is,
of course, to divide the signal space into a number of regions (e.g. N-dimensional
boxes) and put an adaptive model in each region. Such an approach is, however, not very efficient since, as has been discussed above, most of the space
will be empty and, hence, most models will never be used. Moreover, with such
an approach, parts of the signal that could be modelled using one single model
would make use of several models due to the pre-defined subdivision. This would
cause each of these models to be estimated using a smaller number of samples
than would be the case if a single model was used and hence this would cause
an unnecessary uncertainty in the parameter estimation. Finally, the pre-defined
subdivision cannot be guaranteed to be fine enough in areas where the signal has
a complicated behaviour.
An obvious solution to this problem is to make the model distribution adaptive. First of all, such an approach would only put models where the signal really
exists. Furthermore, an adaptive model distribution makes it possible to distribute
models sparsely where the signal has a smooth behaviour and more densely where
it has not.
An example of adaptive distribution of local linear models is given by Ritter et al. (1989, 1992), who use a SOFM (Kohonen, 1982) (see section 2.5.2, page 27) to distribute local linear models (Jacobian matrices) in a robot positioning task. Other methods are discussed by Landelius (1997), who suggests linear or quadratic models and Gaussian applicability functions organized in a tree structure (see also Landelius et al., 1996). The applicability functions define the regions where the local models are valid. In the system by Ritter et al., the applicability functions are defined by a winner-take-all rule for the units in the SOFM (page 27).
Just as in the case of estimating the model parameters, the adaptive model
distribution is task dependent. If, for example, the goal of the system is to achieve
maximum reward, the models should be positioned where they are as useful as
possible for getting that reward and if the goal is maximum information transmission, the models should be positioned according to this goal. Hence, no general
rule can be given for how to adaptively distribute local models. One can only state
that the goal must be to optimize the same criteria as the local models are trying to
optimize together. This implies that the choice of models and the distribution of
them are dependent on each other. The simpler a model is, i.e. the fewer parameters it has, the smaller the region will be where it is valid and, hence, the larger the number of models required. This does not mean, however, that a small number of more global, complex models is as good as a large number of simpler and more local models, even if the total number of parameters is the same. As mentioned
above (section 3.4), the locality in the latter approach reduces the spatial credit
assignment problem and, hence, facilitates efficient learning.
3.6 Experiments
This chapter ends with two simple examples of reinforcement learning with different representations. The first one uses the channel representation described in
section 3.1 and the prediction matrix memory from section 3.3.1 for learning the
Q-function. The second example is a TD-method that uses local adaptive linear models both to represent the input-output function and to approximate the V-function. This algorithm was presented at ICANN'93 in Amsterdam (Borga,
1993).
The experiment is made up of a system that plays “badminton” with itself.
For simplicity, the problem is one-dimensional. The position of the shuttlecock is
represented by a variable x. The system can change the value of x by adding the
output value y to x. A small noise is also added to punish playing on the margin.
The reinforcement signal to the system is zero except upon failure, when r = −1. Failure is the case when x does not change sign (i.e. the shuttlecock does not pass the net), or when |x| > 0.5 (i.e. the shuttlecock ends up outside the court).
3.6.1 Q-learning with the prediction matrix memory
The position x is represented by 25 cos²-channels in the interval −0.6 < x < 0.6 and the output y is represented by 45 cos²-channels in the interval −1.1 < y < 1.1. The channels have the shape defined in equation 3.2, illustrated in figure 3.1 on page 41. An offset value of one was added to the reinforcement signal, i.e. r = 1 except upon failure when r = 0, since the prediction matrix memory must contain positive values.
The prediction matrix memory was trained to learn the Q-function as defined in equation 3.23 with the discount factor γ = 0.9. The matrix was updated according to the update rule in equations 3.20 and 3.24. α was set to a constant value of 0.05.
The output channel vector q was generated according to equation 3.10. This
vector was then decoded into a scalar. As mentioned in section 2.4.1, stochastic
search methods are often used in reinforcement learning. Here, this is accomplished by adding Gaussian noise to the output. The variance σ was calculated
as

$$ \sigma = \max\{0,\; 0.1\,(10 - p)\}, \qquad (3.25) $$

which gives a high noise level when the system predicts a low Q-value and a low noise level if the prediction is high. The value 10 is determined by the maximum value of the Q-function for γ = 0.9 since ∑_{i=0}^{∞} γ^i = 10. The max operation is to ensure that the variance does not become negative if the stochastic estimation occasionally gives predictions higher than 10.
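A small sketch of how the representation and the noisy response generation could be set up is given below (Python/NumPy). The exact channel shape of equation 3.2 is defined earlier in the thesis; here a generic cos² kernel and an arbitrary channel overlap are assumed, and the prediction matrix is a placeholder.

```python
import numpy as np

def cos2_channels(x, centers, width):
    """Encode the scalar x with overlapping cos^2 channels (an assumed kernel in
    the spirit of eq. 3.2): cos^2(pi*(x - c)/(2*width)) for |x - c| < width."""
    d = np.abs(x - centers)
    resp = np.cos(np.pi * d / (2.0 * width)) ** 2
    resp[d >= width] = 0.0
    return resp

pos_centers = np.linspace(-0.6, 0.6, 25)          # 25 channels for the position
out_centers = np.linspace(-1.1, 1.1, 45)          # 45 channels for the output
width = 2.0 * (pos_centers[1] - pos_centers[0])   # overlap chosen arbitrarily here

v = cos2_channels(0.1, pos_centers, width)        # input channel vector for x = 0.1
W = np.full((45, 25), 0.01)                       # prediction matrix memory (placeholder)
q = W @ v                                         # response channel vector, eq. 3.10
p = q @ q                                         # predicted Q-value, eq. 3.11
sigma = max(0.0, 0.1 * (10.0 - p))                # exploration noise level, eq. 3.25
# q is then decoded back to a scalar output and Gaussian noise with this
# variance is added before the action is applied.
```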
A typical run is illustrated to the left in figure 3.7. The graph shows the accumulated reward in a sliding window of 100 iterations. Note that the original
reinforcement signal (i.e. −1 for failure) was used. To the right, the contents of the memory after convergence are illustrated. We see that the highest Q-value is predicted for the positions ±0.2 and the corresponding outputs ∓0.4 approximately, which is a reasonable solution.
3.6.2 TD-learning with local linear models
In this experiment, both the predictions p of future accumulated reward and the actions y are linear functions of the input variable x. There is one pair of reinforcement association vectors v_i and one pair of action association vectors w_i, i = {1, 2}. For each model i, the predicted reinforcement is calculated as

$$ p_i = v_{i1} x + v_{i2} \qquad (3.26) $$

and the output is calculated as

$$ y_i = N(\mu_{iy}, \sigma_y), \qquad (3.27) $$
where

$$ \mu_{iy} = w_{i1} x + w_{i2}. \qquad (3.28) $$

Figure 3.7: Left: A typical run of the prediction matrix memory. The graph shows the accumulated reward in a sliding window of 100 iterations. Right: The prediction matrix memory after convergence. Black is zero and white is the maximum value.
The system chooses the model c such that

$$ p_c = \max_i \{ m_i \}, \qquad (3.29) $$

where

$$ m_i = N(p_i, \sigma_p), \qquad (3.30) $$

and generates the corresponding action y_c.
The internal reinforcement signal at time t + 1 is calculated as
$$ \hat{r}[t+1] = r[t+1] + \gamma p_{\max}[t, t+1] - p_c[t, t]. \qquad (3.31) $$
This is in principle the same TD-method as the one used by Sutton (1984), except that here there are two predictions at each time, one for each model. p_max[t, t+1] is the maximum predicted reinforcement calculated using the reinforcement association vector from time t and the input from time t+1. If the system fails, i.e. r = −1, then p_max[t, t+1] is set to zero. p_c[t, t] is the prediction of the selected model.
Learning is accomplished by changing the weights in the reinforcement association vectors and the action association vectors. Only the vectors associated
with the chosen model are altered.
The association vectors are updated according to the following rule:

$$ w_c[t+1] = w_c[t] + \alpha \hat{r} (y_c - \mu_{cy})\, \mathbf{x} \qquad (3.32) $$

and

$$ v_c[t+1] = v_c[t] + \beta \hat{r}\, \mathbf{x}, \qquad (3.33) $$

where c denotes the model choice, α and β are positive learning rate constants and

$$ \mathbf{x} = \begin{pmatrix} x \\ 1 \end{pmatrix}. $$
In this experiment, noise is added to the output on two levels. First in the
selection of model and then in the generation of output signal. The noise levels
are controlled by σ p and σy respectively, as shown in equations 3.27 and 3.30.
The variance parameters are calculated as

$$ \sigma_p = \max\{0,\; -0.1 \max_i \{p_i\}\} \qquad (3.34) $$

and

$$ \sigma_y = \max\{0,\; -0.1\, p_c\}. \qquad (3.35) $$
The first “max” in the two equations is to make sure that the variances do not become negative. The negative signs are there because of the (relevant) predictions
being negative. In this way, the higher the prediction of reinforcement is, the more
precision there will be in the output.
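A compact Python/NumPy sketch of how equations 3.26–3.35 fit together for the badminton task could look as follows; the learning rates, the restart rule after a failure and the magnitude of the added position noise are arbitrary choices, and the sketch is not tuned to reproduce figure 3.8.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, gamma = 0.2, 0.1, 0.9         # learning rates and discount (illustrative)
v = np.zeros((2, 2))                       # reinforcement association vectors, eq. 3.26
w = np.zeros((2, 2))                       # action association vectors, eq. 3.28

def predictions(x):
    return v[:, 0] * x + v[:, 1]           # p_i = v_i1 * x + v_i2, eq. 3.26

x = 0.3
for _ in range(5000):
    p = predictions(x)
    sigma_p = max(0.0, -0.1 * p.max())           # eq. 3.34
    c = int(np.argmax(rng.normal(p, sigma_p)))   # model selection, eqs. 3.29-3.30
    mu = w[c, 0] * x + w[c, 1]                   # eq. 3.28
    sigma_y = max(0.0, -0.1 * p[c])              # eq. 3.35
    y = rng.normal(mu, sigma_y)                  # output of the chosen model, eq. 3.27

    x_next = x + y + rng.normal(0.0, 0.01)       # "badminton" dynamics plus small noise
    fail = (np.sign(x_next) == np.sign(x)) or (abs(x_next) > 0.5)
    r = -1.0 if fail else 0.0
    p_max = 0.0 if fail else predictions(x_next).max()
    r_hat = r + gamma * p_max - p[c]             # internal reinforcement, eq. 3.31

    xv = np.array([x, 1.0])                      # augmented input (x, 1)
    w[c] += alpha * r_hat * (y - mu) * xv        # eq. 3.32
    v[c] += beta * r_hat * xv                    # eq. 3.33
    x = rng.uniform(-0.4, 0.4) if fail else x_next   # restart after failure (assumption)
```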
The learning behaviour is illustrated to the left in figure 3.8. To the right, the
total input-output function is plotted. For each input value, the model with the
highest predicted reward has been used. The discrete step close to zero marks the
point in the input space where the system switches between the two models. The
optimal position for this point is of course zero.
One problem that can occur with this algorithm, and other similar algorithms,
is when both models prefer the same part of the input space. This means that
the two reinforcement prediction functions predict the same reinforcement for the
same inputs and, as a result, both models generate the same actions. This problem
can of course be solved if the teacher who generates the external reinforcement
signal knows approximately where the breakpoint should be and which model
should act on which side. The teacher could then punish the system for selecting the wrong model by giving negative reinforcement. In general, however, the
teacher does not know how to divide the problem. In that case, the teacher must
try to use a pedagogical reward as discussed in section 2.4.2 on page 20. The
teacher could for instance give less reward if the models try to cover the same part
of the input space and a higher reward when the models tend to cover different
parts of the space.
Figure 3.8: Left: A typical run of the TD-learning system with two local
models. The graph shows the accumulated reward in a sliding window of
100 iterations. Right: The total input-output function after convergence.
3.6.3 Discussion
If we compare the contents of the prediction matrix memory to the right in figure
3.7 and the combined function of the linear models in the TD-system plotted to
the right in figure 3.8, we see that the two systems implement approximately the
same function.
If we compare the learning behaviour (plotted to the left in figures 3.7 and
3.8), the prediction matrix memory appears to learn faster than the TD-method. It
should however be noted that each iteration of the prediction matrix memory has
a computational complexity of order O(Q · V), where Q and V are the number of channels used for representing the input and output signals respectively. In this
experiment, we used Q = 25 and V = 45. A larger number of channels enhances
the performance when the system has converged but increases the required number of iterations until convergence as well as the computational complexity for
each iteration. The computational complexity of the second method is of order O(N · (X + 1) · Y) per iteration, where N is the number of local models (in this
case 2), X is the dimensionality of the input signal (in this case 1) and Y is the
dimensionality of the output signal (in this case 1).
The algorithms have not been optimized with respect to convergence time.
The convergence speed depends on the setting of the learning rate constants α
and β and the modulation of the variance parameters σ p and σy . These parameters have only been tuned to constant values that work reasonably well. Better
results can be expected if the learning rates are made adaptive, as discussed in
section 2.3.2 on page 11.
Chapter 4
Low-dimensional linear models
As we have seen in the previous chapter (in section 3.4), local low-dimensional linear models are a good way of representing high-dimensional data in a learning
system. The linear models can be seen as basis vectors spanning a (local) subspace of the signal space. The signal can then be (approximately) described in
this new basis in terms of projections onto the new basis vectors. For signals with
high dimensionality, an iterative algorithm for finding this basis must not have a memory requirement or a computational cost significantly exceeding O(d) per iteration, where d is the dimensionality of the signal. Techniques involving matrix multiplications (having memory requirements of order O(d²) and computational costs of order O(d³)) quickly become infeasible when the signal space dimensionality increases.
The purpose of local models is dimensionality reduction which means throwing away information that is not needed. Hence, the criterion for an appropriate
local model is dependent on the application. One criterion is to preserve as much
variance as possible given a certain dimensionality of the model. This is done by
projecting the data on the subspace of maximum data variation, i.e. the subspace
spanned by the largest principal components. This is known as principal component analysis (PCA). There are a number of applications in signal processing where
principal components play an important role, for example image coding.
In applications where relations between two sets of data (e.g. process input
and output) are considered, PCA or other self-organizing algorithms for representing the two sets of data separately are not very useful since such methods
cannot separate useful information from noise. Consider, for example, two high-dimensional signals that are described by their most significant principal components. There is no reason to believe that these descriptions of the signals are
related in any way. In other words, the signal in the direction of maximum variance in one space may be totally independent of the signal in the direction of
maximum variance in another space, even if there is a strong relation between the
signals. The reason for this is that there is no way of finding the relation between
two sets of data just by looking at one of the sets. Instead, the two signal spaces
must be considered together. One method for doing this is finding the subspaces
in the input and the output spaces for which the data covariation is maximized.
These subspaces turn out to be the ones accompanying the largest singular values of the between-sets covariance matrix (Landelius et al., 1995). A singular
value decomposition (SVD) of the between-sets covariance matrix corresponds to
partial least squares (PLS) (Wold et al., 1984; Höskuldsson, 1988).
In general, however, the input to a system comes from a set of different sensors
and it is evident that the range (or variance) of the signal values from a given
sensor is unrelated to the importance of the received information. The same line
of reasoning holds for the output which may consist of signals to a set of different
effectuators. In these cases, the covariances between signals are not relevant.
There may, for example, be one pair of directions in the two spaces that has a high
covariance due to high signal magnitude but has a high noise level, while another
pair of directions has an almost perfect correlation but a small signal magnitude
and therefore low covariance. Here, correlation between input and output signals
is a more appropriate target for analysis since this measure of signal relations is
invariant to the signal magnitudes. This approach leads to a canonical correlation
analysis (CCA) (Hotelling, 1936) of the two sets of signals.
Finally, when the goal is to predict a signal as well as possible in a least square
error sense, the basis must be chosen so that this error measure is minimized.
This corresponds to a low-rank approximation of multivariate linear regression
(MLR). This is also known as reduced rank regression (Izenman, 1975) or as
redundancy analysis (van den Wollenberg, 1977).
In general, these four different criteria for selecting basis vectors lead to four
different solutions. But, as we will see, the problems are related to each other
and can be formulated in very similar ways. An important problem which is
directly related to the situations discussed above is the generalized eigenproblem or two-matrix eigenproblem (Bock, 1975; Golub and Loan, 1989; Stewart,
1976). In the next section, the generalized eigenproblem is described in some detail and its relation to an energy function called the Rayleigh quotient is shown. It
is shown that the four important methods discussed above (principal component
analysis (PCA), partial least squares (PLS), canonical correlation analysis (CCA)
and multivariate linear regression (MLR)) emerge as solutions to special cases of
the generalized eigenproblem.
In section 4.7, an iterative O(d) algorithm that solves the generalized eigenproblem by a gradient search on the Rayleigh quotient is presented. The solutions
are found in a successive order beginning with the largest eigenvalue and the corresponding eigenvector. It is shown how to apply this algorithm in order to obtain
the required solutions in the special cases of PCA, PLS, CCA and MLR.
Throughout this chapter, the variables are assumed to be real valued and have
zero mean so that the covariance matrices can be defined as Cxx = E [xxT ]. The
zero mean does not impose any limitations on the methods discussed since the
mean values can easily be estimated and stored by each local model.
The essence of this chapter has been submitted for publication (Borga et al.,
1997b).
4.1 The generalized eigenproblem
In many scientific and engineering problems, some version of the generalized eigenproblem needs to be solved along the way:

$$ A\hat{e} = \lambda B\hat{e} \quad \text{or} \quad B^{-1}A\hat{e} = \lambda\hat{e}. \qquad (4.1) $$
(In the right-hand equation, B is supposed to be non-singular.) In mechanics,
the eigenvalues often correspond to modes of vibration. Here, however, the case
where the matrices A and B consist of components which are expectation values
from stochastic processes is considered. Furthermore, both matrices are symmetric and, in addition, B is positive definite.
The generalized eigenproblem is closely related to the problem of finding the
extremum points (i.e. the points of zero derivatives) of a ratio of quadratic forms:
$$ r = \frac{w^T A w}{w^T B w}, \qquad (4.2) $$
where both A and B are symmetric and B is positive definite. This ratio is known
as the Rayleigh quotient and its critical points correspond to the eigensystem of
the generalized eigenproblem. To see this, consider the gradient of r:
$$ \frac{\partial r}{\partial w} = \frac{2}{w^T B w} (Aw - rBw) = \alpha (A\hat{w} - rB\hat{w}), \qquad (4.3) $$
where α = α(w) is a positive scalar. Setting the gradient to 0 gives
$$ A\hat{w} = rB\hat{w} \quad \text{or} \quad B^{-1}A\hat{w} = r\hat{w}, \qquad (4.4) $$
which is recognized as the generalized eigenproblem (equation 4.1). The solutions r_i and ŵ_i are the eigenvalues and eigenvectors respectively of the matrix B⁻¹A. This means that the extremum points of the Rayleigh quotient r(w) are
solutions to the corresponding generalized eigenproblem. The eigenvalues are
the extremum values of the quotient and the eigenvectors are the corresponding
parameter vectors w of the quotient. A special case of the Rayleigh quotient is
Fisher’s linear discriminant function (Fisher, 1936) used in classification. In this
case, A is the between-class scatter matrix and B is the within-class scatter matrix
(see for example Duda and Hart, 1973).
Figure 4.1: Left: The Rayleigh quotient r(w) between two matrices A and B. The curve is plotted as rŵ. The eigenvectors of B⁻¹A are marked as reference. The corresponding eigenvalues are marked as the radii of the two circles. Note that the quotient is invariant to the norm of w. Right: The gradient of r. The arrows indicate the direction of the gradient and the radii of the blobs correspond to the magnitude of the gradient.
As an illustration, the Rayleigh quotient is plotted to the left in figure 4.1 for
two matrices A and B:

$$ A = \begin{pmatrix} 1 & 0 \\ 0 & 0.25 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}. \qquad (4.5) $$
The quotient is plotted as the radius in different directions ŵ. Note that the quotient is invariant to the norm of w. The two eigenvalues are shown as circles with
their radii corresponding to the eigenvalues. The figure shows that the eigenvectors e1 and e2 of the generalized eigenproblem coincide with the maximum and
minimum values of the Rayleigh quotient. To the right in the same figure, the
gradient of the Rayleigh quotient is illustrated as a function of the direction of w.
Note that the gradient is orthogonal to w (see equation 4.3). This means that a
small change of w in the direction of the gradient can be seen as a rotation of w.
The arrows indicate the direction of this orientation and the radii of the blobs correspond to the magnitude of the gradient. The figure shows that the directions of
zero gradient coincide with the eigenvectors and that the gradient points towards
the eigenvector corresponding to the largest eigenvalue.
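As a quick numerical check (Python/NumPy), the generalized eigenvectors of the example matrices in equation 4.5 can be computed via B⁻¹A, and the Rayleigh quotient evaluated at each eigenvector reproduces the corresponding eigenvalue:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 0.25]])
B = np.array([[2.0, 1.0],
              [1.0, 1.0]])

# Generalized eigenproblem (4.1), solved via B^{-1}A (B is positive definite).
evals, evecs = np.linalg.eig(np.linalg.solve(B, A))
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

def rayleigh(w):
    """The Rayleigh quotient r(w) of equation 4.2."""
    return (w @ A @ w) / (w @ B @ w)

for r, e in zip(evals, evecs.T):
    print(r, rayleigh(e))   # the quotient attains each eigenvalue at its eigenvector
```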
If the eigenvalues r_i are distinct¹ (i.e. r_i ≠ r_j for i ≠ j), the different eigenvectors are orthogonal in the metrics A and B, i.e.

$$ \hat{w}_i^T B \hat{w}_j = \begin{cases} 0 & \text{for } i \neq j \\ \beta_i > 0 & \text{for } i = j \end{cases} \quad \text{and} \quad \hat{w}_i^T A \hat{w}_j = \begin{cases} 0 & \text{for } i \neq j \\ r_i \beta_i & \text{for } i = j \end{cases} \qquad (4.6) $$
(see proof B.3.1 on page 157). This means that the w_i s are linearly independent (see proof B.3.2 on page 158). Since an n-dimensional space gives n eigenvectors which are linearly independent, {w_1, ..., w_n} constitutes a basis and any w can be expressed as a linear combination of the eigenvectors. Now, it can be proved (see proof B.3.3 on page 158) that the function r is bounded by the largest and the smallest eigenvalue, i.e.

$$ r_n \leq r \leq r_1, \qquad (4.7) $$
which means that there exists a global maximum and that this maximum is r1 .
To investigate if there are any other local maxima, we look at the second
derivative, or the Hessian H, of r for the solutions to the eigenproblem,
$$ H_i = \left. \frac{\partial^2 r}{\partial w^2} \right|_{w = \hat{w}_i} = \frac{2}{\hat{w}_i^T B \hat{w}_i} (A - r_i B) \qquad (4.8) $$
(see proof B.3.4 on page 159). The Hessian H_i has positive eigenvalues for i > 1, i.e. there exist vectors w such that

$$ w^T H_i w > 0 \quad \forall\, i > 1 \qquad (4.9) $$
(see proof B.3.5 on page 159). This means that for all solutions to the eigenproblem except for the largest root, there exists a direction in which r increases. In
other words, all extremum points of the function r are saddle points except for the
global minimum and maximum points. Since the two-dimensional example in figure 4.1 only has two eigenvalues, they correspond to the maximum and minimum
values of r.
In the following sections it is shown that finding the directions of maximum
variance, maximum covariance, maximum correlation and minimum square error
can be seen as special cases of the generalized eigenproblem.
¹ The eigenvalues will be distinct in all practical applications since all real signals contain noise.
4.2 Principal component analysis
Consider a set of random vectors x (signals) with a covariance matrix defined by
$$ C_{xx} = E[xx^T]. \qquad (4.10) $$
Suppose the goal is to find the direction of maximum variation in the signal distribution. The direction of maximum variation means the direction ŵ such that the
linear combination x = xT ŵ possesses maximum variance. Hence, finding this
direction is equivalent to finding the maximum of
$$ \rho = E[xx] = E[\hat{w}^T x x^T \hat{w}] = \hat{w}^T E[xx^T] \hat{w} = \frac{w^T C_{xx} w}{w^T w}. \qquad (4.11) $$
This is a special case of the Rayleigh quotient in equation 4.2 on page 61 with
$$ A = C_{xx} \quad \text{and} \quad B = I. \qquad (4.12) $$
Since the covariance matrix is symmetric, it is possible to decompose it into
its eigenvalues and orthogonal eigenvectors as
$$ C_{xx} = E[xx^T] = \sum_i \lambda_i \hat{e}_i \hat{e}_i^T, \qquad (4.13) $$
where λi and êi are the eigenvalues and the orthogonal eigenvectors respectively.
Hence, the problem of maximizing the variance, ρ, can be seen as the problem of
finding the largest eigenvalue, λ1 , and its corresponding eigenvector since
$$ \lambda_1 = \hat{e}_1^T C_{xx} \hat{e}_1 = \max \frac{w^T C_{xx} w}{w^T w} = \max \rho. \qquad (4.14) $$
It is also worth noting that it is possible to find the direction and magnitude of
maximum data variation for the inverse of the covariance matrix. In this case, we
simply identify the matrices in eq. 4.2 on page 61 as A = I and B = Cxx .
The eigenvectors ei are also known as the principal components of the distribution of x. Principal component analysis (PCA) is an old tool in multivariate
data analysis. It was used already in 1901 (Pearson, 1901). The projection of data
onto the principal components is sometimes called the Hotelling transform after Hotelling (1933) or the Karhunen-Loève transform (KLT) after Karhunen (1947) and Loève (1963). This transformation is an orthogonal transformation that diagonalizes the covariance matrix.
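A minimal numerical illustration (Python/NumPy) of PCA as the special case A = Cxx, B = I; the example distribution is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.multivariate_normal([0, 0, 0], np.diag([3.0, 1.0, 0.1]), size=2000)

Cxx = np.cov(x, rowvar=False)              # sample estimate of Cxx = E[xx^T]
evals, evecs = np.linalg.eigh(Cxx)         # eigendecomposition as in eq. 4.13
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

w1 = evecs[:, 0]                           # first principal component
print(evals[0], (w1 @ Cxx @ w1) / (w1 @ w1))   # largest eigenvalue = maximal variance, eq. 4.14
```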
PCA gives a data dependent set of basis vectors that is optimal in a statistical
mean square error sense. This was shown in equation 2.45 on page 33 for one
basis vector and the result can easily be generalized to a set of basis vectors by
the following reasoning: Given one basis vector, the best we can do is to choose
the maximal eigenvector of the covariance matrix. This basis vector describes the
signal completely in that direction. Hence, there is nothing more in that direction
to describe and the next basis vector should be chosen orthogonal to the first. Now
the same problem is faced again, but in a smaller space where the first principal
component of the signal is removed. So the best choice of the second basis vector
is a unit vector in the direction of the first principal component in this subspace
and that direction corresponds to the second eigenvector² of the covariance matrix.
This process can be repeated for all basis vectors.
The KLT can be used for image coding (Torres and Kunt, 1996) since it is the
optimal transform coding in a mean square error sense. This is, however, not very
common. One reason for this is that the KLT is computationally more expensive
than the discrete cosine transform (DCT). Another reason is the need for transmission of the data dependent basis vectors. Besides that, in general the mean square
error is not a very good error measure for images since two images with a large
mean square distance can look very similar to a human observer. Another use for
PCA in multivariate statistical analysis is to find linear combinations of variables
where the variance is high. Here, it should be noted that PCA is dependent on
the units used for measuring. If the unit of one variable is changed, for example
from metres to feet, the orientations of the principal components may change. For
further details on PCA, see for example the overview by Jolliffe (1986).
When dealing with learning systems, it could be tempting to use PCA to find
local linear models to reduce the dimensionality of a high-dimensional input (and
output) space. The problem with this approach is that the best representation of
the input signal is in general not the least mean square error representation of that
signal. There may be components in the input signal that have high variances
that are totally irrelevant when it comes to generating responses and there may be
components with small variances that are very important. In other words, PCA
is not a good tool when analysing the relations between two sets of variables.
The need for simultaneous analysis of the input and output signals in learning
systems was indicated in the quotation from Brooks (1986) on page 8 and also in
the wheel-chair experiment (Held and Bossom, 1961; Mikaelian and Held, 1964)
mentioned on the same page.
² The somewhat informal notation “second eigenvector” refers to the eigenvector corresponding to the second largest eigenvalue.
4.3 Partial least squares
Now, consider two sets of random vectors x and y with the between-sets covariance matrix defined by
$$ C_{xy} = E[xy^T]. \qquad (4.15) $$
Suppose, this time, that the goal is to find the two directions of maximal data
covariation, by which is meant the directions ŵx and ŵy such that the linear combinations x = xT ŵx and y = yT ŵy give maximum covariance. This means that the
following function should be maximized:
$$ \rho = E[xy] = E[\hat{w}_x^T x y^T \hat{w}_y] = \hat{w}_x^T E[xy^T] \hat{w}_y = \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T w_x \; w_y^T w_y}}. \qquad (4.16) $$
Note that, for each ρ, a corresponding value −ρ is obtained by rotating w_x or w_y 180°. For this reason, the maximum magnitude of ρ is obtained by finding the
largest positive value.
This function cannot be written as a Rayleigh quotient. However, the critical
points of this function coincide with the critical points of a Rayleigh quotient with
proper choices of A and B. To see this, we calculate the derivatives of this function
with respect to the vectors wx and wy (see proof B.3.6 on page 160):
$$ \begin{cases} \dfrac{\partial \rho}{\partial w_x} = \dfrac{1}{\|w_x\|} (C_{xy}\hat{w}_y - \rho\hat{w}_x) \\[2mm] \dfrac{\partial \rho}{\partial w_y} = \dfrac{1}{\|w_y\|} (C_{yx}\hat{w}_x - \rho\hat{w}_y) \end{cases} \qquad (4.17) $$
Setting these expressions to zero and solving for wx and wy results in
$$ \begin{cases} C_{xy} C_{yx} \hat{w}_x = \rho^2 \hat{w}_x \\ C_{yx} C_{xy} \hat{w}_y = \rho^2 \hat{w}_y. \end{cases} \qquad (4.18) $$
This is exactly the same result as that given by the extremum points of r in equation 4.2 on page 61 if the matrices A and B and the vector w are chosen according
to:
$$ A = \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}, \quad B = \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix} \quad \text{and} \quad w = \begin{pmatrix} \mu_x \hat{w}_x \\ \mu_y \hat{w}_y \end{pmatrix}. \qquad (4.19) $$
This is easily verified by insertion of the expressions above into equation 4.4,
which results in
$$ \begin{cases} C_{xy}\hat{w}_y = r \frac{\mu_x}{\mu_y} \hat{w}_x \\ C_{yx}\hat{w}_x = r \frac{\mu_y}{\mu_x} \hat{w}_y. \end{cases} \qquad (4.20) $$
Solving for w_x and w_y gives equation 4.18 with r² = ρ². Hence, the problem of
finding the direction and magnitude of the largest data covariation can be seen as
maximizing a special case of the Rayleigh quotient (equation 4.2 on page 61) with
the appropriate choice of matrices.
The between-sets covariance matrix can be expanded by means of singular value decomposition (SVD), where the two sets of vectors {ê_xi} and {ê_yi} are mutually orthogonal:

$$ C_{xy} = \sum_i \lambda_i \hat{e}_{xi} \hat{e}_{yi}^T, \qquad (4.21) $$
where the positive numbers, λi , are referred to as the singular values. Since the
basis vectors are orthogonal, the problem of maximizing the quotient in equation
4.16 is equivalent to finding the largest singular value:
$$ \lambda_1 = \hat{e}_{x1}^T C_{xy} \hat{e}_{y1} = \max \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T w_x \; w_y^T w_y}} = \max \rho. \qquad (4.22) $$
The SVD of a between-sets covariance matrix is directly related to the method
of partial least squares (PLS). PLS was developed in econometrics in the 1960s by
Herman Wold. It is most commonly used for regression in the field of chemometrics (Wold et al., 1984). For an overview, see for example Geladi and Kowalski
(1986) and Höskuldsson (1988). In PLS regression, the principal vectors corresponding to the largest principal values are used as a new, lower dimensional,
basis for the signal. A regression of y onto x is then performed in this new basis.
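To make the connection to the SVD concrete, a small Python/NumPy sketch follows, where the directions of maximal covariance are read off from the SVD of the between-sets covariance matrix; the synthetic data are arbitrary and assumed zero mean, as in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
s = rng.normal(size=(n, 1))                       # a common latent signal
x = np.hstack([s, rng.normal(size=(n, 2))])       # first x-component carries s
y = np.hstack([rng.normal(size=(n, 2)), 2 * s])   # last y-component carries 2s

Cxy = (x.T @ y) / n                    # between-sets covariance matrix, eq. 4.15
U, svals, Vt = np.linalg.svd(Cxy)      # singular value decomposition, eq. 4.21
wx, wy = U[:, 0], Vt[0, :]             # directions of maximal covariance
print(svals[0], wx @ Cxy @ wy)         # the largest singular value, eq. 4.22
```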
As in the case of PCA, the scaling of the variables affects the solutions of
the PLS. The reason for this is the maximum covariance criterion; the covariance between two variables is proportional to the variances of the variables. Therefore, a scaling of the x variables to unit variance is sometimes suggested (Wold et al., 1984). Such a solution can of course also amplify the noise, which can cause problems in the parameter estimation³.
4.4 Canonical correlation analysis
Again, consider two random variables x and y with zero mean and stemming from
a multi-normal distribution with the total covariance matrix
$$ C = \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix} = E\left[ \begin{pmatrix} x \\ y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}^T \right]. \qquad (4.23) $$

³ An example of such a problem has been reported from the paper industry (personal communication). In that case, the normalized data had to be filtered to reduce the amplified noise! The filtering will likely introduce new artifacts.
Now, suppose that the goal is to find the directions of maximum data correlation.
Consider the linear combinations x = xT ŵx and y = yT ŵy of the two variables
respectively. This means that the function to be maximized is
$$ \rho = \frac{E[xy]}{\sqrt{E[x^2]E[y^2]}} = \frac{E[\hat{w}_x^T x y^T \hat{w}_y]}{\sqrt{E[\hat{w}_x^T x x^T \hat{w}_x]\, E[\hat{w}_y^T y y^T \hat{w}_y]}} = \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x \; w_y^T C_{yy} w_y}}. \qquad (4.24) $$

Also in this case, since ρ changes sign if w_x or w_y is rotated 180°, it is sufficient to find the positive values.
Like equation 4.16, this function cannot be written as a Rayleigh quotient. But
also in this case, it can be shown that the critical points of this function coincide
with the critical points of a Rayleigh quotient with proper choices of A and B. The
partial derivatives of ρ with respect to wx and wy are (see proof B.3.7 on page 160)
$$ \begin{cases} \dfrac{\partial \rho}{\partial w_x} = \dfrac{a}{\|w_x\|} \left( C_{xy}\hat{w}_y - \dfrac{\hat{w}_x^T C_{xy} \hat{w}_y}{\hat{w}_x^T C_{xx} \hat{w}_x}\, C_{xx}\hat{w}_x \right) \\[3mm] \dfrac{\partial \rho}{\partial w_y} = \dfrac{a}{\|w_y\|} \left( C_{yx}\hat{w}_x - \dfrac{\hat{w}_y^T C_{yx} \hat{w}_x}{\hat{w}_y^T C_{yy} \hat{w}_y}\, C_{yy}\hat{w}_y \right) \end{cases} \qquad (4.25) $$
where a is a positive scalar. Setting the derivatives to zero gives the equation system

$$ \begin{cases} C_{xy}\hat{w}_y = \rho \lambda_x C_{xx}\hat{w}_x \\ C_{yx}\hat{w}_x = \rho \lambda_y C_{yy}\hat{w}_y, \end{cases} \qquad (4.26) $$

where

$$ \lambda_x = \lambda_y^{-1} = \sqrt{ \frac{\hat{w}_y^T C_{yy} \hat{w}_y}{\hat{w}_x^T C_{xx} \hat{w}_x} }. \qquad (4.27) $$
λx is the ratio between the standard deviation of y and the standard deviation of
x and vice versa. The λs can be interpreted as scaling factors between the linear
combinations. Rewriting equation system 4.26 gives
$$ \begin{cases} C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx} \hat{w}_x = \rho^2 \hat{w}_x \\ C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy} \hat{w}_y = \rho^2 \hat{w}_y. \end{cases} \qquad (4.28) $$
Hence, ŵx and ŵy are found as the eigenvectors of the matrices Cxx⁻¹Cxy Cyy⁻¹Cyx and Cyy⁻¹Cyx Cxx⁻¹Cxy respectively. The corresponding eigenvalues ρ² are the squared canonical correlations. The eigenvectors corresponding to the largest eigenvalue ρ₁² are the vectors ŵx1 and ŵy1 that maximize the correlation between the canonical variates x1 = x^T ŵx1 and y1 = y^T ŵy1.
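The eigenproblem in equation 4.28 can be checked numerically; the sketch below (Python/NumPy) uses arbitrary zero-mean data with one correlated pair of directions and compares the largest eigenvalue with the correlation computed directly from equation 4.24.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000
s = rng.normal(size=n)                                           # shared component
x = np.column_stack([s + 0.5 * rng.normal(size=n), rng.normal(size=n)])
y = np.column_stack([rng.normal(size=n), 3 * s + 0.5 * rng.normal(size=n)])

Cxx, Cyy, Cxy = (x.T @ x) / n, (y.T @ y) / n, (x.T @ y) / n

M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)      # Cxx^-1 Cxy Cyy^-1 Cyx
rho2, Wx = np.linalg.eig(M)                                      # eq. 4.28
i = int(np.argmax(rho2.real))
wx = Wx[:, i].real
wy = np.linalg.solve(Cyy, Cxy.T) @ wx                            # eq. 4.26, up to scale
rho = (wx @ Cxy @ wy) / np.sqrt((wx @ Cxx @ wx) * (wy @ Cyy @ wy))   # eq. 4.24
print(np.sqrt(rho2.real[i]), rho)      # the largest canonical correlation, computed two ways
```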
Now, if

$$ A = \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}, \quad B = \begin{pmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{pmatrix} \quad \text{and} \quad w = \begin{pmatrix} w_x \\ w_y \end{pmatrix} = \begin{pmatrix} \mu_x \hat{w}_x \\ \mu_y \hat{w}_y \end{pmatrix}, \qquad (4.29) $$

equation 4.4 can be written as

$$ \begin{cases} C_{xy}\hat{w}_y = r \frac{\mu_x}{\mu_y} C_{xx}\hat{w}_x \\ C_{yx}\hat{w}_x = r \frac{\mu_y}{\mu_x} C_{yy}\hat{w}_y, \end{cases} \qquad (4.30) $$

which is recognized as equation 4.26 for ρλx = r μx/μy and ρλy = r μy/μx. Solving for wx and wy in equation 4.30 gives equation 4.28 with r² = ρ². This shows that the equations for the canonical correlations are obtained as the result of maximizing the Rayleigh quotient (equation 4.2 on page 61).
Canonical correlation analysis was developed by Hotelling (1936). Some of
the results presented here can also be found in (Borga, 1995; Knutsson et al.,
1995; Borga et al., 1997a). Although being a standard tool in statistical analysis
(see for example Anderson, 1984), where canonical correlation has been used for
example in economics, medical studies, meteorology and even in classification of
malt whisky (Lapointe and Legendre, 1994) and wine (Montanarella et al., 1995),
it is surprisingly unknown in the fields of learning and signal processing. Some
exceptions are Becker (1996), Kay (1992), Fieguth et al. (1995), Das and Sen
(1994) and Li et al. (1997).
An important property of canonical correlations is that they are invariant with
respect to affine transformations of x and y. An affine transformation is given by
a translation of the origin followed by a linear transformation. The translation
of the origin of x or y has no effect on ρ since it leaves the covariance matrix C
unaffected. Invariance with respect to scalings of x and y follows directly from
equation 4.24. For invariance with respect to other linear transformations see
proof B.3.8 on page 161. Hence, in contrast to PLS, there is no need for normalization of the variables in CCA.
Another important property is that the canonical variates are uncorrelated for different solutions, i.e.

$$ \begin{cases} E[x_i x_j] = E[w_{xi}^T x x^T w_{xj}] = w_{xi}^T C_{xx} w_{xj} = 0 \\ E[y_i y_j] = E[w_{yi}^T y y^T w_{yj}] = w_{yi}^T C_{yy} w_{yj} = 0 \\ E[x_i y_j] = E[w_{xi}^T x y^T w_{yj}] = w_{xi}^T C_{xy} w_{yj} = 0 \end{cases} \quad \text{for } i \neq j, \qquad (4.31) $$

according to equation 4.6.
4.4.1 Relation to mutual information and ICA
As mentioned in section 2.5.3, there is a relation between correlation and mutual
information (equation 2.44). Since information is additive for statistically independent variables (equation 2.33) and the canonical variates are uncorrelated, the
mutual information between x and y is the sum of mutual information between the
variates x_i and y_i if there are no higher-order statistical dependencies than correlation
(second-order statistics). For Gaussian variables this means
$$ I(x; y) = \frac{1}{2} \log \frac{1}{\prod_i (1 - \rho_i^2)} = \frac{1}{2} \sum_i \log \frac{1}{1 - \rho_i^2}, \qquad (4.32) $$
using equation 2.44 on page 32. This is also more formally shown in proof B.3.9 on page 162. Kay (1992)⁴ has shown that this relation plus a constant holds for all elliptically symmetrical distributions of the form

$$ c\, f\big( (z - \bar{z})^T C^{-1} (z - \bar{z}) \big). \qquad (4.33) $$
Minimizing mutual information between signal components is known as independent component analysis (ICA) (see for example Comon, 1994). If there are no higher-order statistical dependencies than correlation (e.g. if the variables are jointly Gaussian⁵), the canonical variates x_i, x_j, i ≠ j, are independent components since they are uncorrelated.
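For Gaussian variables, equation 4.32 turns a set of canonical correlations directly into a mutual information estimate. A small sketch in Python/NumPy (the correlation values are arbitrary, and the result is in nats, assuming a natural logarithm):

```python
import numpy as np

def gaussian_mutual_information(canonical_correlations):
    """Equation 4.32: mutual information between two Gaussian variables,
    computed from their canonical correlations rho_i (result in nats)."""
    rho = np.asarray(canonical_correlations, dtype=float)
    return 0.5 * np.sum(np.log(1.0 / (1.0 - rho ** 2)))

print(gaussian_mutual_information([0.9, 0.3, 0.0]))   # about 0.88 nats
```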
4.4.2 Relation to SNR
The correlation is strongly related to the signal-to-noise ratio (SNR), which is a more
commonly used measure in signal processing. This relation is used later in this
thesis.
Consider a signal x and two noise signals η₁ and η₂, all having zero mean⁶ and all being uncorrelated with each other. Let S = E[x²] and N_i = E[η_i²] be
the energy of the signal and the noise signals respectively. Then the correlation
⁴ There is a difference by a factor 0.5 between equation 4.32 and Kay's equation, due to a typographical error.
⁵ The definition of ICA requires that at most one of the source components is Gaussian (Comon, 1994).
⁶ The assumption of zero mean is for convenience. A non-zero mean does not affect the SNR or the correlation.
between a(x + η₁) and b(x + η₂) is

$$ \rho = \frac{E[a(x + \eta_1)\, b(x + \eta_2)]}{\sqrt{E[a^2 (x + \eta_1)^2]\, E[b^2 (x + \eta_2)^2]}} = \frac{E[x^2]}{\sqrt{\big(E[x^2] + E[\eta_1^2]\big)\big(E[x^2] + E[\eta_2^2]\big)}} = \frac{S}{\sqrt{(S + N_1)(S + N_2)}}. \qquad (4.34) $$
Note that the amplification factors a and b do not affect the correlation or the SNR.
Equal noise energies

In the special case where the noise energies are equal, i.e. N₁ = N₂ = N, equation 4.34 can be written as

$$ \rho = \frac{S}{S + N}. \qquad (4.35) $$
This means that the SNR can be written as

$$ \frac{S}{N} = \frac{\rho}{1 - \rho}. \qquad (4.36) $$
Here, it should be noted that the noise affects the signal twice, so this relation
between SNR and correlation is perhaps not so intuitive. This relation is illustrated
in figure 4.2 (top).
Correlation between a signal and the corrupted signal

Another special case is when N₁ = 0 and N₂ = N. Then, the correlation between a signal and a noise-corrupted version of that signal is

$$ \rho = \frac{S}{\sqrt{S(S + N)}}. \qquad (4.37) $$
In this case, the relation between SNR and correlation is

$$ \frac{S}{N} = \frac{\rho^2}{1 - \rho^2}. \qquad (4.38) $$
This relation between correlation and SNR is illustrated in figure 4.2 (bottom).
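Both special cases are easy to verify with a short Monte Carlo experiment (Python/NumPy); the signal and noise energies are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
S, N = 2.0, 0.5                                   # signal and noise energies
x = rng.normal(0.0, np.sqrt(S), n)
n1 = rng.normal(0.0, np.sqrt(N), n)
n2 = rng.normal(0.0, np.sqrt(N), n)

rho = np.corrcoef(3.0 * (x + n1), 0.7 * (x + n2))[0, 1]   # amplifications do not matter
print(rho, S / (S + N))                           # both close to 0.8, equation 4.35
print(rho / (1 - rho), S / N)                     # both close to 4, equation 4.36

rho1 = np.corrcoef(x, x + n2)[0, 1]               # signal vs. noise-corrupted signal
print(rho1 ** 2 / (1 - rho1 ** 2), S / N)         # both close to 4, equation 4.38
```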
Figure 4.2: Top: The relation between correlation and SNR for two signals each corrupted by uncorrelated noise. Both noise signals have the
same energy. Bottom: The relation between correlation and SNR. The
correlation is measured between a signal and a noise-corrupted version of
that signal.
4.5 Multivariate linear regression
Again, consider two random variables x and y with zero mean and stemming from
a multi-normal distribution with covariance as in equation 4.23. In this case, the
goal is to minimize the square error
$$ \begin{aligned} \varepsilon^2 &= E\big[\, \|y - \beta x^T \hat{w}_x \hat{w}_y\|^2 \,\big] \\ &= E\big[ y^T y - 2\beta \hat{w}_x^T x y^T \hat{w}_y + \beta^2 \hat{w}_x^T x x^T \hat{w}_x \big] \\ &= E[y^T y] - 2\beta \hat{w}_x^T C_{xy} \hat{w}_y + \beta^2 \hat{w}_x^T C_{xx} \hat{w}_x, \end{aligned} \qquad (4.39) $$
i.e. a rank-one approximation of the MLR of y onto x based on minimum square
error. The problem is to find not only the regression coefficient β, but also the
optimal basis ŵx and ŵy . To get an expression for β, we calculate the derivative
$$ \frac{\partial \varepsilon^2}{\partial \beta} = 2\big( \beta \hat{w}_x^T C_{xx} \hat{w}_x - \hat{w}_x^T C_{xy} \hat{w}_y \big). \qquad (4.40) $$
Setting the derivative equal to zero gives
$$ \beta = \frac{\hat{w}_x^T C_{xy} \hat{w}_y}{\hat{w}_x^T C_{xx} \hat{w}_x}. \qquad (4.41) $$
By inserting this expression into equation 4.39 we get
$$ \varepsilon^2 = E[y^T y] - \frac{(\hat{w}_x^T C_{xy} \hat{w}_y)^2}{\hat{w}_x^T C_{xx} \hat{w}_x}. \qquad (4.42) $$
Since ε2 cannot be negative and the left term is independent of the parameters,
we can minimize ε2 by maximizing the quotient to the right in equation 4.42, i.e.
maximizing the quotient
$$ \rho = \frac{\hat{w}_x^T C_{xy} \hat{w}_y}{\sqrt{\hat{w}_x^T C_{xx} \hat{w}_x}} = \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x \; w_y^T w_y}}. \qquad (4.43) $$
Note that if wx and wy minimize ε2 , the negation of one or both of these vectors
will give the same minimum. Hence, it is sufficient to maximize the positive root.
Like in the two previous cases, this function cannot be written as a Rayleigh
quotient, but its critical points coincide with the critical points of a Rayleigh quotient with proper choices of A and B. The partial derivatives of ρ with respect to
wx and wy are (see proof B.3.10 on page 163)
$$ \begin{cases} \dfrac{\partial \rho}{\partial w_x} = \dfrac{a}{\|w_x\|} \big( C_{xy}\hat{w}_y - \beta C_{xx}\hat{w}_x \big) \\[3mm] \dfrac{\partial \rho}{\partial w_y} = \dfrac{a}{\|w_y\|} \left( C_{yx}\hat{w}_x - \dfrac{\rho^2}{\beta}\, \hat{w}_y \right) \end{cases} \qquad (4.44) $$
Setting the derivatives to zero gives the equation system

$$ \begin{cases} C_{xy}\hat{w}_y = \beta C_{xx}\hat{w}_x \\ C_{yx}\hat{w}_x = \dfrac{\rho^2}{\beta}\, \hat{w}_y, \end{cases} \qquad (4.45) $$
which gives

$$ \begin{cases} C_{xx}^{-1} C_{xy} C_{yx} \hat{w}_x = \rho^2 \hat{w}_x \\ C_{yx} C_{xx}^{-1} C_{xy} \hat{w}_y = \rho^2 \hat{w}_y. \end{cases} \qquad (4.46) $$

Now, if we let

$$ A = \begin{pmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{pmatrix}, \quad B = \begin{pmatrix} C_{xx} & 0 \\ 0 & I \end{pmatrix} \quad \text{and} \quad w = \begin{pmatrix} w_x \\ w_y \end{pmatrix} = \begin{pmatrix} \mu_x \hat{w}_x \\ \mu_y \hat{w}_y \end{pmatrix}, \qquad (4.47) $$
equation 4.4 can be written as

$$ \begin{cases} C_{xy}\hat{w}_y = r \frac{\mu_x}{\mu_y} C_{xx}\hat{w}_x \\ C_{yx}\hat{w}_x = r \frac{\mu_y}{\mu_x} \hat{w}_y, \end{cases} \qquad (4.48) $$

which is recognized as equation 4.45 for β = r μx/μy and ρ²/β = r μy/μx. Solving equation 4.48 for wx and wy gives equation 4.46 with r² = ρ². This shows that the minimum square error in equation 4.39 is found as a result of maximizing the Rayleigh quotient in equation 4.2 on page 61 for the proper choice of matrices A and B and regression coefficient β.
So far, the first pair of eigenvectors wx1 and wy1 , i.e. a rank-one solution, has
been discussed. Intuitively, a rank N regression can be expected to be optimized
(in a mean square error sense) if the N first pairs of eigenvectors are used, i.e.
"
ε
2
=E
N
ky , ∑
i=1
k
βi ŵTxi xŵyi 2
#
(4.49)
is minimized if wxi and wyi are the solutions to equation 4.46 corresponding to the
N largest eigenvalues. To see that this really is the case, note that the eigenvectors w_yi in Y are orthogonal since Cyx Cxx⁻¹Cxy in equation 4.46 is symmetric. The orthogonality of the w_y s is explained by the Cartesian separability of the square
error; when the error in one direction is minimized, no more can be done in that
direction to reduce the error. This means that the minimization of ε2 in equation
4.49 can be seen as N separate problems that can be solved consecutively beginning with the first solution that minimizes equation 4.39. When the first solution
is found, the next solution can be searched for in the subspace orthogonal to wy1 .
           A                       B
  PCA:     Cxx                     I
  PLS:     [0, Cxy; Cyx, 0]        [I, 0; 0, I]
  CCA:     [0, Cxy; Cyx, 0]        [Cxx, 0; 0, Cyy]
  MLR:     [0, Cxy; Cyx, 0]        [Cxx, 0; 0, I]

Table 4.1: The matrices A and B for PCA, PLS, CCA and MLR.
Now since {wyi} is orthogonal, the next solution is the second pair of eigenvectors and so on. This orthogonality also means that the solutions are not unique; any set of vectors spanning the same subspace in Y can be used to minimize ε² in equation 4.49 but, of course, with other wxi s and βs.
If all solutions to the eigenproblem in equation 4.46 and the corresponding βi s are used, a solution for multivariate linear regression (MLR), also known as the Wiener filter, is obtained. The mean square sum of the eigenvalues, i.e.

(1/dim(Y)) ∑_i ρi² = tr(Cyx Cxx^-1 Cxy) / dim(Y),
is known as the redundancy index (Stewart and Love, 1968).
It should be noted that the regression coefficient β defined in equation 4.41 is
valid for any choice of ŵx and ŵy . In particular, if we use the directions of maximum variance, β is the regression coefficient for principal components regression
(PCR). For the directions of maximum covariance, β is the regression coefficient
for PLS regression.
4.6 Comparisons between PCA, PLS, CCA and MLR
The similarities and differences between the four methods can be seen by comparing the matrices A and B in the generalized eigenproblem (equation 4.1 on
page 61). The matrices are listed in table 4.1.
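To make the comparison concrete, the following is a small hypothetical sketch (not from the thesis; the function name and the toy data are assumptions made for illustration) that builds the matrices of table 4.1 from sample covariance estimates and hands them to a standard generalized eigenproblem solver.

```python
import numpy as np
from scipy.linalg import eig

def build_AB(Cxx, Cyy, Cxy, method):
    """Return A and B of table 4.1 for 'PCA', 'PLS', 'CCA' or 'MLR'."""
    dx, dy = Cxx.shape[0], Cyy.shape[0]
    if method == "PCA":
        return Cxx, np.eye(dx)
    A = np.block([[np.zeros((dx, dx)), Cxy],
                  [Cxy.T, np.zeros((dy, dy))]])
    Bx = Cxx if method in ("CCA", "MLR") else np.eye(dx)
    By = Cyy if method == "CCA" else np.eye(dy)
    B = np.block([[Bx, np.zeros((dx, dy))],
                  [np.zeros((dy, dx)), By]])
    return A, B

# Toy data (assumed for illustration): x in R^5, y in R^3, linearly related plus noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
Y = X @ rng.standard_normal((5, 3)) + 0.5 * rng.standard_normal((1000, 3))
Cxx, Cyy = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)
Cxy = (X - X.mean(0)).T @ (Y - Y.mean(0)) / (len(X) - 1)

A, B = build_AB(Cxx, Cyy, Cxy, "CCA")
r, W = eig(A, B)   # generalized eigenvalues r and eigenvectors (wx; wy)
```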
MLR differs from the other three problems in that it is formulated as a mean square error problem, while the other three methods are formulated as maximisation problems. Reduced rank multivariate linear regression can, for example,
be used to increase the stability of the predictors when there are more parameters
than observations, when the relation is known to be of low rank or, maybe most
importantly, when a full rank solution is unobtainable due to computational costs.
The regression coefficients can of course also be used for regression in the first
three cases. In the case of PCA, the idea is to separately reduce the dimensionality of the X and Y spaces and to do a regression of the first principal components
of Y on the first principal components of X . This method is known as principal
components regression. The obvious disadvantage here is that there is no reason
to believe that the principal components of X are related to the principal components of Y . To avoid this problem, PLS regression is sometimes used. Clearly,
this choice of basis is better than PCA for regression purposes since directions of
high covariance are selected, which means that a linear relation is easier to find.
However, neither of these solutions results in minimum least squares error. This
is only obtained using the directions corresponding to the MLR problem.
It is not only the MLR that can be formulated as a mean square error problem. van der Burg (1988) formulated CCA as a mean square error minimization
problem:
minimize ε² = E[ ∑_{i=1}^N (x^T ŵxi − y^T ŵyi)² ],        (4.50)
where N is the rank of the solution. In this way, CCA can be seen as a supervised
learning method as discussed in section 2.6.
PCA differs from the other three methods in that it only concerns one set of
variables while the other three concern relations between two sets of variables.
The difference between PLS, CCA and MLR can be seen by comparing the matrices in the corresponding eigenproblems (see table 4.1). In CCA, the between-sets
covariance matrices are normalized with respect to the within-set covariances in
both the x and the y spaces. In MLR, the normalization is done only with respect
to the x space covariance while the y space, where the square error is defined, is
left unchanged. In PLS, no normalization is done. Hence, these three cases can
be seen as the same problem, covariance maximization, where the variables have
been subjected to different, data dependent, scaling.
The main difference between CCA and the other three methods is that CCA
is closely related to mutual information as described in section 4.4.1 and, hence,
can easily be motivated in information theoretical terms. Because of this relation,
it is a bit surprising that canonical correlation seems to be rather unknown in
the signal processing, learning and neural networks societies. As an example,
a search for “neural network(s)” together with “canonical correlation(s)” in the
SciSearch Database of the Institute for Scientific Information, Philadelphia, gave
3 hits. A corresponding search for “partial least square(s)” gave 103 hits, for “linear regression” 212 hits and for “principal component(s)” 287 hits⁷. The same test with “signal processing” instead of “neural networks” gave 2, 5, 18 and 31 hits respectively. This result does not, of course, mean that all articles that matched “principal component(s)” presented learning methods based on PCA. But it may indicate the difference in interest in, or awareness of, the different methods within these fields of research.

Figure 4.3: Examples of eigenvectors using CCA, MLR, PLS and PCA on the same sets of data.
To see how these four different special cases of the generalized eigenproblem
may differ, the solutions for the same data are plotted in figure 4.3. The data are
two-dimensional in X and Y and randomly distributed with zero mean. The top
row shows the eigenvectors in X for the CCA, MLR, PLS and PCA respectively.
The bottom row shows the solutions in Y . Note that all solutions except the two
solutions for CCA and the X -solution for MLR are orthogonal. Figure 4.4 shows
the correlation, mean square error, covariance and variance of the data projected
onto the first eigenvectors for each method. The figure shows that: the correlation
is maximized for the CCA solution; the mean square error is minimized for the
MLR solution; the covariance is maximized for the PLS solution; the variance is
maximized for the PCA solution.
⁷The search was made on November 4, 1997, through the Norwegian BIBSYS library system (http://www.bibsys.no). The “free text” field was used, which performs a search in the title, abstract and keywords.
Figure 4.4: The correlation, mean square error, covariance and variance
when using the first pair of vectors for each method. The correlation is
maximized for the CCA solution. The mean square error is minimized
for the MLR solution. The covariance is maximized for the PLS solution.
The variance is maximized for the PCA solution. (See section 4.6)
4.7 Gradient search on the Rayleigh quotient
In this section it is shown that the solutions to the generalized eigenproblem can
be found and, hence, PCA, PLS, CCA or MLR can be performed by a gradient
search on the Rayleigh quotient.
Finding the largest eigenvalue
In the previous section it was shown that the only stable critical point of the Rayleigh
quotient is the global maximum (equation 4.9 on page 63). This means that it
should be possible to find the largest eigenvalue of the generalized eigenproblem
and its corresponding eigenvector by performing a gradient search on the Rayleigh
quotient (equation 4.2 on page 61). This can be done by using an iterative algorithm:
w(t + 1) = w(t ) + ∆w(t );
(4.51)
where the update vector ∆w, on average, lies in the direction of the gradient:
E[∆w] = β ∂r/∂w = α(Aŵ − rBŵ),        (4.52)
where α and β are positive numbers. α is the gain controlling how far, in the
direction of the gradient, the vector estimate is updated at each iteration. This
gain could be constant as well as data or time dependent, as discussed in section
2.3.2.
In all four cases treated here, A has at least one positive eigenvalue, i.e. there exists an r > 0. An update rule such that

E[∆w] = α(Aŵ − Bw)        (4.53)

can then be used to find the positive eigenvalues. Here, the length of the vector represents the corresponding eigenvalue, i.e. ‖w‖ = r. To see this, consider a choice of w that gives r < 0. Then w^T ∆w < 0 since w^T Aw < 0 and w^T Bw ≥ 0. This means that ‖w‖ will decrease until r becomes positive.
The function Aŵ − Bw is illustrated in figure 4.5 together with the Rayleigh quotient plotted to the left in figure 4.1 on page 62.
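The behaviour of the rule in equation 4.53 can be sketched as follows. This is a hypothetical, deterministic illustration with known matrices A and B (the thesis uses sample-based, stochastic updates); at the fixed point ŵ is the first eigenvector and ‖w‖ estimates the largest eigenvalue.

```python
import numpy as np

def largest_gen_eigenpair(A, B, alpha=0.01, n_iter=20000, seed=0):
    """Iterate w <- w + alpha*(A w_hat - B w), cf. equations 4.51 and 4.53."""
    rng = np.random.default_rng(seed)
    w = 0.01 * rng.standard_normal(A.shape[0])
    for _ in range(n_iter):
        w_hat = w / np.linalg.norm(w)
        w = w + alpha * (A @ w_hat - B @ w)
    return w / np.linalg.norm(w), np.linalg.norm(w)   # direction and eigenvalue estimate
```

With A and B chosen according to table 4.1, this single routine performs rank-one PCA, PLS, CCA or MLR.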
Finding successive eigenvalues
Since the learning rule defined in equation 4.52 maximizes the Rayleigh quotient
in equation 4.2 on page 61, it will find the largest eigenvalue λ1 and a corresponding eigenvector ŵ1 = ê1 of equation 4.1 on page 61. The question that naturally
arises is if, and how, the algorithm can be modified to find the successive eigenvalues and vectors, i.e. the successive solutions to the eigenvalue equation 4.1.
Let G denote the n × n matrix B^-1 A. Then the n equations for the n eigenvalues solving the eigenproblem in equation 4.1 on page 61 can be written as

GE = ED   which gives   G = E D E^-1 = ∑_i λi êi fi^T,        (4.54)
where the eigenvalues and vectors constitute the matrices D and E respectively:
D = diag(λ1, ..., λn),   E = [ê1 ... ên],   E^-1 = [f1^T; ...; fn^T].        (4.55)
Figure 4.5: The function Aŵ − Bw, for the same matrices A and B as in figure 4.1, plotted for different w. The Rayleigh quotient is plotted as reference.
The vectors fi, appearing in the rows of the inverse of the matrix containing the eigenvectors, are the dual vectors of the eigenvectors êi, which means that

fi^T êj = δij.        (4.56)

{fi} are also called the left eigenvectors of G, and {êi} and {fi} are said to be biorthogonal. Remember (from equation 4.6 on page 63) that the eigenvectors êi are both A and B orthogonal, i.e.

êi^T A êj = 0   and   êi^T B êj = 0   for i ≠ j.        (4.57)
Hence, the dual vectors fi possessing the property in equation 4.56 can be found
by choosing them according to:
fi = B êi / (êi^T B êi).        (4.58)
Now, if ê1 is the eigenvector corresponding to the largest eigenvalue of G, the new
matrix
H = G − λ1 ê1 f1^T        (4.59)
has the same eigenvectors and eigenvalues as G except for the eigenvalue corresponding to ê1 , which now becomes 0 (see proof B.3.11 on page 164). This means
that the eigenvector corresponding to the largest eigenvalue of H is the same as
the one corresponding to the second largest eigenvalue of G.
Since the algorithm starts by finding the vector ŵ1 = ê1 , it is only necessary to
estimate the dual vector f1 in order to subtract the correct outer product from G and
remove its largest eigenvalue. In our case, this is a little bit tricky since G is not
generated directly. Instead, its two components A and B must be modified in order to produce the desired subtraction. Hence, we want two modified components, A′ and B′, with the following property:

B′^-1 A′ = B^-1 A − λ1 ê1 f1^T.        (4.60)
A simple solution is obtained if only one of the matrices is modified and the other
matrix is kept fixed:
B′ = B   and   A′ = A − λ1 B ê1 f1^T.        (4.61)
This modification can be accomplished by estimating a vector u1 = λ1 B ê1 = B w1 iteratively as

u1(t + 1) = u1(t) + ∆u1(t),        (4.62)

where

E[∆u1] = α(r B ŵ1 − u1).        (4.63)

Once this estimate has converged, u1 = λ1 B ê1 can be used to express the outer product in equation 4.61:

λ1 B ê1 f1^T = λ1 B ê1 ê1^T B^T / (ê1^T B ê1) = u1 u1^T / (ê1^T u1).        (4.64)
Now A′ can be estimated and, hence, a modified version of the learning algorithm in equation 4.52 which finds the second eigenvalue and the corresponding eigenvector to the generalized eigenproblem is obtained:
E[∆w] = α(A′ŵ − rBŵ) = α( (A − u1 u1^T / (ŵ1^T u1)) ŵ − rBŵ ).        (4.65)
The vector w1 is the solution first produced by the algorithm, i.e. the largest eigenvalue and the corresponding eigenvector.
This scheme can of course be repeated in order to find the third eigenvalue by
subtracting the second solution in the same way and so on. Note that this method
does not put any demands on the range of B in contrast to exact solutions involving
matrix inversion.
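A hypothetical sketch of this deflation step (equations 4.61 and 4.64), again in deterministic form with known matrices: once the first eigenpair is found, A is modified with the outer product u1 u1^T / (ŵ1^T u1) and the same gradient search is repeated.

```python
import numpy as np

def deflate(A, B, w1):
    """Cancel the eigenvalue found by w1 (length = eigenvalue) from A."""
    w1_hat = w1 / np.linalg.norm(w1)
    u1 = B @ w1                        # u1 = B w1 = lambda_1 B e1, cf. equation 4.63
    return A - np.outer(u1, u1) / (w1_hat @ u1)

# Usage sketch, with largest_gen_eigenpair from the earlier sketch:
# w1_hat, r1 = largest_gen_eigenpair(A, B)
# w2_hat, r2 = largest_gen_eigenpair(deflate(A, B, r1 * w1_hat), B)
```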
The following four sub-sections show how this iterative algorithm can be applied to the four important problems described in the previous section.
4.7.1 PCA
Finding the largest principal component
The direction of maximum data variation can be found by a stochastic gradient
search according to equation 4.53 with A and B defined according to equation
4.12:
A = Cxx   and   B = I.        (4.12)
This leads to an unsupervised Hebbian learning algorithm that finds both the direction of maximum data variation and the variance of the data in that direction:
E[∆w] = γ ∂ρ/∂w = α(Cxx ŵ − w) = α E[x x^T ŵ − w].        (4.66)
The update rule for this algorithm is given by
∆w = α(x x^T ŵ − w),        (4.67)
where the length of the vector represents the estimated variance, i.e. ‖w‖ = ρ.
(Note that ρ in this case is always positive.)
Note that this algorithm finds both the direction of maximum data variation as
well as how much the data vary along that direction. Often algorithms for PCA
only find the direction of maximum data variation. If one is also interested in the
variation along this direction, another algorithm needs to be employed. This is the
case for the well-known PCA algorithm presented by Oja and Karhunen (1985).
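A hypothetical sketch of the stochastic rule in equation 4.67 (the data are assumptions made for illustration): each sample x updates w directly, and at convergence ŵ points along the first principal direction while ‖w‖ estimates the variance in that direction.

```python
import numpy as np

rng = np.random.default_rng(1)
C = np.diag([4.0, 1.0, 0.25])             # true covariance, assumed for the demo
samples = rng.multivariate_normal(np.zeros(3), C, size=50000)

alpha = 0.002
w = 0.01 * rng.standard_normal(3)
for x in samples:
    w_hat = w / np.linalg.norm(w)
    w += alpha * (x * (x @ w_hat) - w)    # delta_w = alpha (x x^T w_hat - w)

print(w / np.linalg.norm(w))              # close to +-[1, 0, 0]
print(np.linalg.norm(w))                  # close to 4, the largest variance
```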
Finding successive principal components
In order to find successive principal components, recall that A = Cxx and B = I.
Hence the matrix G = B^-1 A = Cxx is symmetric and has orthogonal eigenvectors.
This means that the dual vectors and the eigenvectors become indistinguishable
and that no other vector than w itself needs to be estimated. The outer product in
equation 4.61 then becomes:
λ1 B ê1 f1^T = λ1 I ê1 ê1^T = w1 ŵ1^T.        (4.68)
This means that the modified learning rule for finding the second eigenvalue can
be written as
E[∆w] = α(A′ŵ − Bw) = α( (Cxx − w1 ŵ1^T) ŵ − w ).        (4.69)
A stochastic approximation of this rule is achieved if the vector w is updated by

∆w = α( (x x^T − w1 ŵ1^T) ŵ − w )        (4.70)
at each time step.
As mentioned in section 4.2, it is possible to perform a PCA on the inverse of
the covariance matrix by choosing A = I and B = Cxx . The learning rule associated with this behaviour then becomes:
∆w = α(ŵ − x x^T w).        (4.71)
4.7.2 PLS
Finding the largest singular value
If the aim is to find the directions of maximum data covariance, the matrices A
and B are defined according to equation 4.19:
A = [0, Cxy; Cyx, 0],   B = [I, 0; 0, I]   and   w = (μx ŵx; μy ŵy).        (4.19)
Since w on average should be updated in the direction of the gradient, the update
rule in equation 4.53 gives:
E[∆w] = γ ∂r/∂w = α( [0, Cxy; Cyx, 0] ŵ − r [I, 0; 0, I] ŵ ).        (4.72)
This behaviour is accomplished if at each time step, the vector w is updated with
∆w = α( [0, x y^T; y x^T, 0] ŵ − w ),        (4.73)
where the length of the vector at convergence represents the covariance, i.e. ‖w‖ = r = ρ. This can be done since it is sufficient to search for positive values of ρ.
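A hypothetical sketch of the stochastic PLS rule in equation 4.73; the toy data are assumptions made for illustration, with one pair of correlated directions between x and y.

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha = 100000, 0.001
X = rng.standard_normal((n, 3))
Y = np.column_stack([X[:, 0] + 0.5 * rng.standard_normal(n),
                     rng.standard_normal(n)])

w = 0.01 * rng.standard_normal(5)            # concatenated vector (wx; wy)
for x, y in zip(X, Y):
    w_hat = w / np.linalg.norm(w)
    wx_hat, wy_hat = w_hat[:3], w_hat[3:]
    Aw_hat = np.concatenate([x * (y @ wy_hat), y * (x @ wx_hat)])   # A w_hat of eq. 4.73
    w += alpha * (Aw_hat - w)                                       # since B = I here

# ||w|| approaches the largest singular value of Cxy (about 1 for this data) and
# the two blocks of w point along the directions of maximum covariance.
```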
Finding successive singular values
Also in this case, the special structure of the A and B matrices simplifies the procedure of finding the subsequent directions with maximum data covariance. The
compound matrix G = B^-1 A = A is symmetric and has orthogonal eigenvectors,
which are identical to their dual vectors. The outer product for modification of the
matrix A in equation 4.61 is identical to the one presented in the previous section:
λ1 B ê1 f1^T = λ1 [I, 0; 0, I] ê1 ê1^T = w1 ŵ1^T.        (4.74)
A modified learning rule for finding the second eigenvalue can thus be written as
E[∆w] = α(A′ŵ − Bw) = α( ( [0, Cxy; Cyx, 0] − w1 ŵ1^T ) ŵ − w ).        (4.75)
A stochastic approximation of this rule is achieved if the vector w is updated at
each time step by
∆w = α( ( [0, x y^T; y x^T, 0] − w1 ŵ1^T ) ŵ − w ).        (4.76)
4.7.3 CCA
Finding the largest canonical correlation
Again, the algorithm in equation 4.53 for solving the generalized eigenproblem
can be used for the stochastic gradient search. With the matrices A and B and the
vector w as in equation 4.29:
A = [0, Cxy; Cyx, 0],   B = [Cxx, 0; 0, Cyy]   and   w = (wx; wy) = (μx ŵx; μy ŵy),        (4.29)

the update direction is

E[∆w] = γ ∂r/∂w = α( [0, Cxy; Cyx, 0] ŵ − r [Cxx, 0; 0, Cyy] ŵ ).        (4.77)
This behaviour is accomplished if at each time step the vector w is updated with
∆w = α( [0, x y^T; y x^T, 0] ŵ − [x x^T, 0; 0, y y^T] w ).        (4.78)
Since ‖w‖ = r = ρ when the algorithm converges, the length of the vector represents the correlation between the variates.
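A hypothetical sketch of the stochastic CCA rule in equation 4.78, in the same style as the PLS sketch above; the toy data are assumptions made for illustration, with one common latent signal giving a first canonical correlation of 0.5.

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 200000, 0.0005
s = rng.standard_normal(n)
X = np.column_stack([s + rng.standard_normal(n), rng.standard_normal(n)])
Y = np.column_stack([s + rng.standard_normal(n), rng.standard_normal(n)])

dx = X.shape[1]
w = 0.01 * rng.standard_normal(dx + Y.shape[1])      # concatenated (wx; wy)
for x, y in zip(X, Y):
    w_hat = w / np.linalg.norm(w)
    wx_hat, wy_hat = w_hat[:dx], w_hat[dx:]
    Aw_hat = np.concatenate([x * (y @ wy_hat), y * (x @ wx_hat)])   # A w_hat term
    Bw = np.concatenate([x * (x @ w[:dx]), y * (y @ w[dx:])])       # B w term
    w += alpha * (Aw_hat - Bw)

print(np.linalg.norm(w))   # approaches the first canonical correlation (roughly 0.5 here)
```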
Finding successive canonical correlations
In the two previous cases it was easy to cancel out an eigenvalue because the
matrix G was symmetric. This is not the case for canonical correlation. In this
case
G = B^-1 A = [Cxx^-1, 0; 0, Cyy^-1] [0, Cxy; Cyx, 0] = [0, Cxx^-1 Cxy; Cyy^-1 Cyx, 0].        (4.79)
Because of this, it is necessary to estimate the dual vector f1 corresponding to the
eigenvector ê1 , or rather the vector u1 = λ1 Bê1 as described in equation 4.63:
E[∆u1] = α(Bw1 − u1) = α( [Cxx, 0; 0, Cyy] w1 − u1 ).        (4.80)
A stochastic approximation of this rule is given by

∆u1 = α( [x x^T, 0; 0, y y^T] w1 − u1 ).        (4.81)
With this estimate, the outer product in equation 4.61 can be used to modify the
matrix A:
A′ = A − λ1 B ê1 f1^T = A − u1 u1^T / (ŵ1^T u1).        (4.82)
A modified version of the learning algorithm in equation 4.78 which finds
the second largest canonical correlation and its corresponding directions can be
written in the following form:
E[∆w] = α(A′ŵ − Bw) = α( ( [0, Cxy; Cyx, 0] − u1 u1^T / (ŵ1^T u1) ) ŵ − [Cxx, 0; 0, Cyy] w ).        (4.83)
Again to get a stochastic approximation of this rule, the update at each time step
is performed according to:
∆w = α( ( [0, x y^T; y x^T, 0] − u1 u1^T / (ŵ1^T u1) ) ŵ − [x x^T, 0; 0, y y^T] w ).        (4.84)
Note that this algorithm simultaneously finds both the directions of canonical
correlations and the canonical correlations ρi in contrast to the algorithm proposed
by Kay (1992), which only finds the directions.
4.7.4 MLR
Finding the directions for minimum square error
Also here, the algorithm in equation 4.53 can be used for a stochastic gradient
search. With the A, B and w according to equation 4.47:
A = [0, Cxy; Cyx, 0],   B = [Cxx, 0; 0, I]   and   w = (wx; wy) = (μx ŵx; μy ŵy),        (4.47)

the update direction is

E[∆w] = γ ∂r/∂w = α( [0, Cxy; Cyx, 0] ŵ − r [Cxx, 0; 0, I] ŵ ).        (4.85)
This behaviour is accomplished if the vector w at each time step is updated with
∆w = α( [0, x y^T; y x^T, 0] ŵ − [x x^T, 0; 0, I] w ).        (4.86)
Since ‖w‖ = r = ρ when the algorithm converges, the regression coefficient is obtained as β = ‖w‖ μx/μy.
Finding successive directions for minimum square error
Also in this case, the dual vectors must be used to cancel out the detected eigenvalues. The non-symmetric matrix G is
G = B^-1 A = [Cxx^-1, 0; 0, I] [0, Cxy; Cyx, 0] = [0, Cxx^-1 Cxy; Cyx, 0].        (4.87)
Again, the vector u1 = λ1 Bê1 is estimated as described in equation 4.63:
E[∆u1] = α(Bw1 − u1) = α( [Cxx, 0; 0, I] w1 − u1 ).        (4.88)

A stochastic approximation for this rule is given by

∆u1 = α( [x x^T, 0; 0, I] w1 − u1 ).        (4.89)
With this estimate, the outer product in equation 4.61 can be used to modify the
matrix A:
A′ = A − λ1 B ê1 f1^T = A − u1 u1^T / (ŵ1^T u1).        (4.90)
A modified version of the learning algorithm in equation 4.86 which finds the
successive directions of minimum square error and their corresponding regression
coefficient can be written in the following form:
E[∆w] = α(A′ŵ − Bw) = α( ( [0, Cxy; Cyx, 0] − u1 u1^T / (ŵ1^T u1) ) ŵ − [Cxx, 0; 0, I] w ).        (4.91)
Again, to get a stochastic approximation of this rule, the update at each time step
is performed according to:
∆w = α( ( [0, x y^T; y x^T, 0] − u1 u1^T / (ŵ1^T u1) ) ŵ − [x x^T, 0; 0, I] w ).        (4.92)
As mentioned earlier, the wy s are orthogonal in this case. This means that
this method can be used for successively building up a low-rank approximation of
MLR by adding a sufficient number of solutions, i.e.
ỹ = ∑_{i=1}^N βi (x^T ŵxi) ŵyi,        (4.93)

where ỹ is the estimated y and N is the rank.
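A hypothetical sketch of equation 4.93: given N estimated direction pairs and their regression coefficients (equation 4.41), the low-rank prediction of y is a sum of rank-one terms.

```python
import numpy as np

def low_rank_mlr_predict(x, wx_hats, wy_hats, betas):
    """Rank-N prediction y_tilde = sum_i beta_i (x^T wx_i) wy_i (equation 4.93)."""
    y_tilde = np.zeros_like(wy_hats[0], dtype=float)
    for wx_hat, wy_hat, beta in zip(wx_hats, wy_hats, betas):
        y_tilde += beta * (x @ wx_hat) * wy_hat
    return y_tilde
```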
4.8 Experiments
The memory requirement as well as the computational cost per iteration of the
presented algorithm is of order O (Nd ), where N is the number of estimated models, i.e. the rank of the solution, and d is the dimensionality of the signal. This
enables experiments in signal spaces having dimensionalities which would be impossible to handle using traditional techniques involving matrix multiplications
(having memory requirements of order O (d 2 ) and computational costs of order
O (d 3 )).
This section presents some experiments using the algorithm for analysis of
stochastic processes. First, the algorithm is employed to perform PCA, PLS,
CCA, and MLR. Here, the dimensionality of the signal space is kept reasonably
low in order to make a comparison with the performance of an optimal (in the
sense of maximum likelihood (ML)) deterministic solution which is calculated
for each iteration, based on the data accumulated so far.
In the final experiment, the algorithm is applied to a process in a high-dimensional (1,000 dimensions) signal space. In this case, the update factor is made
data dependent and the output from the algorithm is post-filtered in order to meet
requirements of quick convergence together with algorithm robustness.
The errors in magnitude and angle were calculated relative to the correct answer wc. The same error measures were used for the output from the algorithm as well as for the ML estimate:

εm(w) = ‖wc‖ − ‖w‖        (4.94)
εa(w) = arccos(ŵ^T ŵc).        (4.95)
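A small hypothetical helper computing the two error measures of equations 4.94 and 4.95 (the clipping is an added numerical safeguard, not part of the definition):

```python
import numpy as np

def magnitude_error(w, w_correct):
    return np.linalg.norm(w_correct) - np.linalg.norm(w)          # equation 4.94

def angular_error(w, w_correct):
    w_hat = w / np.linalg.norm(w)
    wc_hat = w_correct / np.linalg.norm(w_correct)
    return np.arccos(np.clip(w_hat @ wc_hat, -1.0, 1.0))          # equation 4.95
```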
4.8.1 Comparisons to optimal solutions
The test data for these four experiments were generated from a 30-dimensional
Gaussian distribution such that the eigenvalues of the generalized eigenproblem
decreased exponentially from 0.9:
λi = 0.9 · (2/3)^(i−1).
The two largest eigenvalues (0.9 and 0.6) and the corresponding eigenvectors
were simultaneously searched for. In the PLS, CCA and MLR experiments, the
dimensionalities of the signal vectors belonging to the x and y parts of the signal
were 20 and 10 respectively.
The average angular and magnitude errors were calculated based on 10 different runs. This computation was made for each iteration, both for the algorithm
Figure 4.6: Results for the PCA case. (Mean angular error and mean norm error for w1 and w2 versus number of iterations.)
Figure 4.7: Results for the PLS case. (Mean angular error and mean norm error for w1 and w2 versus number of iterations.)
Figure 4.8: Results for the CCA case. (Mean angular error and mean norm error for w1 and w2 versus number of iterations.)
Figure 4.9: Results for the MLR case. (Mean angular error and mean norm error for w1 and w2 versus number of iterations.)
and for the ML solution. The results are plotted in figures 4.6, 4.7, 4.8 and 4.9 for
PCA, PLS, CCA and MLR respectively.
The errors of the algorithm are drawn with solid lines and the errors of the
ML solution are drawn with dotted lines. The vertical bars show the standard
deviations. Note that the angular error is always positive and, hence, does not
have a symmetrical distribution. However, for simplicity, the standard deviation
indicators have been placed symmetrically around the mean. The first 30 iterations
were omitted to avoid singular matrices when calculating matrix inverses for the
ML solutions.
No attempt was made to find an optimal set of parameters for the algorithm.
Instead, the experiments and comparisons were carried out only to display the behaviour of the algorithm and to show that it is robust and converges to the correct
solutions. Initially, the estimate was assigned a small random vector. A constant
gain factor of α = 0.001 was used throughout all four experiments.
4.8.2 Performance in high-dimensional signal spaces
The purpose of the methods discussed in this chapter is dimensionality reduction
in high-dimensional signal spaces. We have previously shown that the proposed
algorithm has the computational capacity to handle such signals. This experiment
illustrates that the algorithm behaves well also in practice for high-dimensional
signals. The dimensionality of x is 800 and the dimensionality of y is 200, so the
total dimensionality of the signal space is 1,000. The object in this experiment is
CCA.
In the previous experiment, the algorithm was used in its basic form with constant update rates set by hand. In this experiment, however, a more sophisticated
version of the algorithm is used where the update rate is adaptive and the vectors
are averaged over time. The details of this extension of the algorithm are numerous and beyond the scope of this thesis. Only a brief explanation of the basic
structure of the extended algorithm is given here. The algorithm can be described
in terms of four blocks as illustrated in figure 4.10.
The first block, ∆w, calculates the delta-vectors according to
∆wx = (y^T ŵy − x^T wx) x
∆wy = (x^T ŵx − y^T wy) y        (4.96)
The difference between this update rule and the update rule in 4.78 on page 84 is that here the two delta-vectors ∆wx and ∆wy are calculated separately. But the update rule can still be identified as the gradient of ρ in equation 4.25 on page 68 for wx = (ŵx^T Cxy ŵy / ŵx^T Cxx ŵx) ŵx and wy = (ŵy^T Cyx ŵx / ŵy^T Cyy ŵy) ŵy.
Figure 4.10: The extended CCA-algorithm. See the text for explanations.
The delta vectors are then accumulated in the DCC-SUM block in a way that
compensates for the influence of the DC-component of the sample data:
wx = ∑ α1x ∆wx − (∑ α1x x) · (∑ α1x (y^T ŵy − x^T wx)) / (∑ α1x)
wy = ∑ α1y ∆wy − (∑ α1y y) · (∑ α1y (x^T ŵx − y^T wy)) / (∑ α1y)        (4.97)
Note that the sums can be accumulated on-line. Also note that the update factor
α1 can be different for wx and wy .
Finally, the weighted averages wxa and wya of wx and wy respectively are
calculated:
wxa(t + 1) = wxa(t) + α2x (wx(t) − wxa(t))
wya(t + 1) = wya(t) + α2y (wy(t) − wya(t))        (4.98)
Adaptability is necessary for a system without a pre-specified (time dependent) update rate α. Here, the adaptive update rate is dependent on the consistency
of the change of the vector. The consistency is calculated in the CONS block as
c = ‖∆w̃x‖,        (4.99)
where ∆w̃x is an estimate of the normalized average delta vector:

∆w̃x(t + 1) = ∆w̃x(t) + α2x ( ∆wx/‖∆wx‖ − ∆w̃x(t) ).        (4.100)
A similar calculation of c is made for wy.
The functions f1 and f2 map the consistency c in a suitable way. f2 increases the sensitivity to changes in c (α2 ∝ c²) and f1 decreases the sensitivity (α1 ∝ c^(1/2)).
When there is a consistent change in w, c is large and the averaging window
is short which makes wa follow w quickly. When the changes in w are less consistent, the window gets longer and wa is the average of an increasing number
of instances of w. This means, for example, that if w is moving symmetrically
around the correct solution with a constant variance, the error of wa will still tend
towards zero (see figure 4.11).
The experiment was carried out using a randomly chosen distribution of an 800-dimensional x variable and a 200-dimensional y variable. Two x and two y
dimensions were correlated. The other 798 dimensions of x and 198 dimensions
of y were uncorrelated. The variances in the 1000 dimensions were of the same
order of magnitude.
The upper plot in figure 4.11 shows the estimated first canonical correlation
as a function of number of iterations (solid line) and the true correlation in the
current directions found by the algorithm (dotted line). Note that each iteration
gives one sample.
The lower plot in figure 4.11 shows the effect of the adaptive averaging. The
two upper noisy curves show the logarithms of angular errors of the ‘raw’ estimates wx and wy and the two lower curves show the angular errors for wxa
(dashed) and wya (solid). The angular errors of the smoothed estimates are much
more stable and decrease more rapidly than the ‘raw’ estimates. The errors after
2·10⁵ samples are below one degree. (It should be noted that this is extreme precision as, with a resolution of 1 degree, a low estimate of the number of different orientations in a 1000-dimensional space is 10²⁰⁰⁰.) The angular errors were calculated as the angle between the vectors and the exact solutions ê (known from the x–y sample distribution), i.e.

Err[ŵa] = arccos(ŵa^T ê).
Figure 4.11: Top: The estimated first canonical correlation as a function
of number of iterations (solid line) and the true correlation in the current
directions found by the algorithm (dotted line). The dimensionality of one
set of variables is 800 and of the second set 200. Bottom: The logarithm
of the angular error as a function of number of iterations.
Part II
Applications in computer vision
Chapter 5
Computer vision
In this part of the dissertation it is shown how local linear adaptive models based on
canonical correlation can be used in computer vision. This chapter serves as an
introduction by giving a brief overview of the parts of the theory and terminology
of computer vision relevant to the remaining chapters. For an extensive treatment
of this subject, see (Granlund and Knutsson, 1995).
5.1 Feature hierarchies
An image in a computer is usually represented by an array of picture elements
(pixels), each one containing a gray level value or a colour vector. The images
referred to in this thesis are gray scale images. The pixel values can be seen as
image features on the lowest level. On a higher level, there are for example the
orientation and phase of one-dimensional events such as lines and edges. On
the next level, the curvature describes the change of orientation. On still higher
levels, there are features like shape, relations between objects, disparity et cetera.
It is, of course, not obvious how to sort complex features into different levels.
But, in general, it can be assumed that a function that estimates the values of
a certain feature uses features of a lower level as input. High-level features are
often estimated on a larger spatial scale than low-level features.
Low-level features (e.g. orientation) are usually estimated by using fairly simple combinations of linear filter outputs. These filter outputs are generated by
convolving (see for example Bracewell, 1986) the image with a set of filter kernels. The filter coefficients can be described as a vector and so can each region of
the image. Hence, for each position in the image, the filter output can be seen as
a scalar product between a (reversed) filter vector and a signal vector.
Figure 5.1: The phase representation of a line/edge event.
5.2 Phase and quadrature filters
Consider a half period of a cosine wave. It can illustrate the cross section of a
white¹ line if it is centred around 0 and a dark line if it is centred around π. If it is centred around π/2 or 3π/2 it can illustrate edges of opposite slopes. This leads
to the concept of phase. To represent the kind of line/edge event in question, a
phase angle θ can be used as illustrated in figure 5.1. If the line and edge filters
are designed so that they are orthogonal, their outputs, ql and qe respectively, can
be combined geometrically so that the magnitude
|q| = √(ql² + qe²)        (5.1)
indicates the presence of a line or an edge of a certain orientation and the argument
θ = arctan(qe/ql)        (5.2)
represents the kind of event in question, i.e. the phase.
A filter that fits this representation can be obtained as a complex filter consisting of a real-valued line filter and an imaginary edge filter:
q = ql + i qe.        (5.3)

¹White is here represented by the highest value and black is represented by the lowest value.
The magnitude is then the magnitude of the complex filter output q and the phase
is the complex argument of q. If the magnitude is invariant with respect to the
phase when applied on a pure sine wave function, the filter is said to be a quadrature filter. A quadrature filter has zero DC component and is zero in one half-plane
in the frequency domain. An example of a quadrature filter is shown in figure 7.5
on page 130. It should be noted that the phase can only be defined after defining
a direction in which to measure the phase.
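A minimal hypothetical sketch of the line/edge phase representation (equations 5.1 to 5.3), assuming the line and edge filter outputs ql and qe are already available:

```python
import numpy as np

def line_edge_phase(ql, qe):
    """Combine line and edge filter outputs into magnitude and phase (equations 5.1-5.3)."""
    q = ql + 1j * qe                 # complex filter output, equation 5.3
    return np.abs(q), np.angle(q)    # event strength and event type (phase)
```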
5.3 Orientation
According to the assumption of local one-dimensionality (see page 52), it can be
assumed that a small region of an image generally contains at most one dominant orientation. This orientation can be detected by using a set of at least three
quadrature filters evenly spread out over all orientations (Knutsson, 1982). Here,
the channel representation discussed in section 3.1 can be recognized. The orientation can be represented by a pure channel vector. If four filter orientations are
used, the pure channel vector is
q = ( |q1|, |q2|, |q3|, |q4| )^T.        (5.4)
By choosing a cos² shape with proper width of the filter functions as described
in section 3.1, the channel vector has a constant norm for all orientations. If four
filter orientations are used, each channel looks like
|qk| = d cos²(ϕk − φ),   ϕk = (k − 1) π/4,        (5.5)
where ϕk is the filter orientation, φ is the line or edge orientation and d is
an orientation invariant component. By using this set of channels, a more compact
orientation vector can be composed:
z = ( |q1| − |q3|, |q2| − |q4| )^T.        (5.6)
Inserting equation 5.5 into equation 5.6 gives
z = a ( cos(2φ), sin(2φ) )^T,        (5.7)
where a is an orientation invariant component. This orientation representation
is called double angle representation (Granlund, 1978). The advantage with this
Figure 5.2: The double angle representation.
representation can be seen when considering the rotation of a line. A line is identical if it is rotated 180°. Since z rotates 360° as a line rotates 180°, this gives
an unambiguous and continuous representation of the orientation as illustrated in
figure 5.2. The norm and the orientation of the orientation vector z represent different independent features. While the argument of z represents the orientation of
the signal, the norm depends on the energy of the signal in the passband of the
filters.
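A hypothetical sketch of equations 5.4 to 5.7: four quadrature filter magnitudes are combined into the double angle orientation vector z (the example magnitudes are assumptions made for illustration).

```python
import numpy as np

def double_angle_vector(q_mags):
    """Four quadrature filter magnitudes (orientations 0, pi/4, pi/2, 3pi/4)
    combined into z ~ a*(cos 2phi, sin 2phi), equations 5.6-5.7."""
    q1, q2, q3, q4 = q_mags
    return np.array([q1 - q3, q2 - q4])

z = double_angle_vector([1.0, 0.5, 0.0, 0.5])
phi = 0.5 * np.arctan2(z[1], z[0])      # recover the orientation (here 0)
```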
The double angle representation enables vector averaging of the orientation
estimates. Vector averaging is usually performed to get a more robust orientation estimate in a larger region of the image. Vector averaging is a geometrical
summation of the vectors followed by a normalization:
v̄ = (1/n) ∑_{i=1}^n vi.        (5.8)
The sum of inconsistently oriented vectors is shorter than the sum of vectors with
similar directions. This means that the norm of the average vector can be interpreted as a kind of variance, or certainty, measure. This is an important difference
between vector averaging and an ordinary geometric scalar average. If the vector
average is normalized using the average norm, i.e.
v̄ = ∑_{i=1}^n vi / ∑_{i=1}^n ‖vi‖,        (5.9)
the certainty measure lies between 0 and 1 where 1 means that all vectors have the
same orientation.
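A small hypothetical sketch of the normalized vector average in equation 5.9; the norm of the result serves as the certainty measure between 0 and 1 (the example vectors are assumptions made for illustration).

```python
import numpy as np

def vector_average(vectors):
    """Normalized vector average (equation 5.9); the norm of the result is a certainty."""
    vectors = np.asarray(vectors, dtype=float)
    return vectors.sum(axis=0) / np.linalg.norm(vectors, axis=1).sum()

certainty = np.linalg.norm(vector_average([[1, 0], [0.9, 0.1], [1, -0.1]]))  # close to 1
```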
5.4 Frequency
Since the norm of z depends on the frequency content of the signal, it can be used
for estimating local (spatial) frequency. While frequency is only strictly defined
for stationary signals, which do not hold for most physical signals, the concept
of instantaneous frequency (Carson and Fry, 1937; van der Pol, 1946) is usually
defined as the rate of change of the phase of the analytical signal (see for example
Bracewell, 1986; Granlund and Knutsson, 1995).
The instantaneous frequency can be estimated using the ratio between the output of two lognormal quadrature filters (Knutsson, 1982). The radial function of
a lognormal filter is defined in the frequency domain by
Ri(f) = e^(−CB ln²(f/fi)),        (5.10)
where f = ‖u‖ is the norm of the frequency vector, fi is the centre frequency and CB = 4/(B² ln 2) where B is the 6 dB relative bandwidth. Function 5.10 is a Gaussian on a logarithmic scale. The instantaneous frequency can now be estimated
as
ωi = |qi+1| / |qi|,        (5.11)
where qi = ‖qi‖ is the (orientation invariant) norm of the quadrature filter vector
of centre frequency fi and the difference between fi and fi+1 is one octave (i.e.
a factor two). An example of this is illustrated in figure 5.3 where the frequency
function of two such lognormal filters are plotted (solid curves) together with the
quotient in equation 5.11 (dashed line). See Granlund and Knutsson (1995) for
further details.
To estimate local frequencies in a wider range than that covered by the passbands of two filters, a weighted sum of instantaneous frequencies can be used:
f̃ = ( ∑_{i=0}^{N−1} |qi| )^-1 ∑_{i=0}^{N−1} √(fi fi+1) |qi+1|,        (5.12)
where fi+1 = 2 fi .
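A hypothetical sketch of the weighted frequency estimate. One common reading of equation 5.12 is assumed here: the sums run over the adjacent filter pairs, so with N filter magnitudes the denominator uses the lower filter of each pair.

```python
import numpy as np

def local_frequency(q_mags, f_centres):
    """Weighted local frequency estimate (equation 5.12) from the magnitudes of
    lognormal quadrature filters one octave apart (f_centres[i+1] = 2*f_centres[i])."""
    q = np.asarray(q_mags, dtype=float)
    f = np.asarray(f_centres, dtype=float)
    weights = np.sqrt(f[:-1] * f[1:])                 # sqrt(f_i * f_{i+1}) per pair
    return (weights * q[1:]).sum() / q[:-1].sum()
```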
Also the frequency can be represented by a vector as illustrated in figure 5.4.
This enables vector averaging of the frequency estimates too.
5.5 Disparity
An important feature of binocular vision systems is disparity, which is a measure
of the shift between two corresponding neighbourhoods in a pair of stereo images.
Figure 5.3: The local frequency (dashed line) estimated as a quotient between the magnitude of two lognormal quadrature filter outputs. The centre frequencies of the filters differ one octave.
Figure 5.4: The vector representation of frequency.
The disparity is related to the angle the eyes (cameras) must be rotated relative
to each other in order to focus on the same point in the 3-dimensional outside
world. The corresponding process is known as vergence.
The problem of estimating disparity between pairs of stereo images is not a
new one (Barnard and Fischler, 1982). Early approaches often used matching of
some feature in the two images (Marr, 1982). The simplest way to calculate the
disparity is to correlate a region in one image with all horizontally shifted regions
on the same vertical position and then to find the shift that gave maximum correlation. This is, however, a computationally very expensive method. Since vergence
implies a vision system acting in real time, other methods must be employed.
Later approaches have been more focused on using the phase information
given by for example Gabor or quadrature filters (Sanger, 1988; Wilson and Knutsson,
1989; Jepson and Fleet, 1990; Westelius, 1995). An advantage of phase-based
methods is that phase is a continuous variable that allows for sub-pixel accuracy.
In phase-based methods, the disparity can be estimated as a ratio between the
phase difference between corresponding vertical line/edge filter outputs from the
two images and the instantaneous frequency:
∆x = ∆φ / φ′,        (5.13)

where φ′ = ω is the instantaneous frequency.
Phase-based stereo methods require the filters to be large enough to cover the
same structure in the two images, i.e. the shift must be small compared to the
wavelength of the filter. Otherwise, the phase difference ∆φ will not be related
to the shift. On the other hand, if the shift is too small compared to the wavelength of the filter, the resolution becomes poor which leads to a bad disparity
estimate. The disparity algorithm proposed by Wilson and Knutsson (1989) handles this problem by working on a scale pyramid of different image resolutions.
It starts by estimating the disparity on a coarse scale, which corresponds to using
low-frequency filters, and adjusting the cameras to minimize this disparity. This
process is then iterated on consecutively finer scales.
A problem that is not solved by that approach is when the observed surface is
tilted in depth so that the depth varies along the horizontal axis. In this situation,
the surface will be viewed at different scales by the two cameras as illustrated in
figure 5.5. This means that phase information on one scale in the left image must
be compared with phase information on another scale in the right image. In most
stereo algorithms, this problem cannot be handled in a simple way.
Another problem that most stereo algorithms are faced with occurs at vertical
depth discontinuities (but see Becker and Hinton (1993)). Around the discontinuity there is a region where the algorithm either will not be able to make an estimate
Figure 5.5: Scaling effect when viewing a tilted plane.
at all, or the estimate will be some average between the two correct disparities, indicating a slope rather than a step.
Chapter 6
Learning feature descriptors
In this chapter it is shown how canonical correlation analysis can be used to find
models that represent local features in images. Such models can be seen as filters
that describe a particular feature in an image. The filters can also be forced to be
invariant with respect to certain other features. The features to be described are
learned by giving the algorithm examples that are presented in pairs. The pairs
are arranged in such a way that the property of a certain feature, for example the
orientation of a line, is equal for each pair while other properties, for example
phase, are presented in an unordered way. This method was presented at SCIA’97
(Borga et al., 1997a).
The idea behind this approach is to use CCA to analyse two signals where
the common signal components are due to the feature that is to be represented,
as illustrated in figure 6.1. The signal vectors fed into the CCA are image data
mapped through some function f. If f is the identity operator (or any other full-rank linear function), the CCA finds the linear combinations of pixel data that
have the highest correlation. In this case, the canonical correlation vectors can be
seen as linear filters. In general, f can be any vector-valued function of the image
data, or even different functions fx and fy , one for each signal space. The choice
of f can be seen as the choice of representation of input data for the canonical
correlation analysis. As discussed in chapter 3, the choice of representation is
very important for the ability to learn.
The canonical correlation vectors wx and wy together with the functions fx and
fy can be seen as filters. The filters that are developed in this way have the property
of maximum correlation between their outputs when applied to two image patches
where the represented feature varies simultaneously. In other words, the filters
maximize the signal to noise ratio between the desired feature and other signals
(see section 4.4.2 on page 70).
A more general approach is to try to maximize mutual information instead
Figure 6.1: A symbolic illustration of the method of using CCA for finding feature detectors in images. The desired feature (here illustrated by
a solid line) is varying equally in both image sequences while other features (here illustrated with dotted curves) vary in an uncorrelated way.
The input to the CCA is a function f of the image.
of canonical correlation. This could be accomplished by changing not only the
linear projection in the CCA, but also the functions fx and fy until the maximum
correlation ρ is found. This approach relies on the relation between canonical correlation and mutual information discussed in section 4.4.1. The maximum mutual
information approach is illustrated in figure 6.2. If fx and fy are parameterized
functions, the parameters can be updated in order to maximize ρ. This is related
to the work of Becker and Hinton (1992) where f was implemented as neural
networks with a single neuron in the output layers. The cost function in their approach was the quotient between the variance of the sum and the variance of the
difference of the network outputs. The approach illustrated in figure 6.2 allows fx and
fy to be implemented as neural networks with several units in their output layers.
In the work presented here, however, the functions are fixed and identical for
x and y. In this chapter, f is the outer product of pixel data or the outer product
of quadrature filter outputs. But projection operators can also be useful, as will
be seen in chapter 7. The motive for choosing non-linear functions here is that
we want to find feature descriptors with useful invariance properties. Of course,
also a linear filter is invariant to several changes of the signal. It is, for example,
Figure 6.2: A general approach for finding maximum mutual information.
easy to design a linear filter that is invariant with respect to the mean intensity
of the image. But higher-order functions can have more interesting invariance
properties, as discussed by Giles and Maxwell (1987) and Nordberg et al. (1994).
To see this, consider the output q of a linear filter f for a signal s in one point:
q = sT f. The invariance of this filter can be defined as
dq = ds^T f = 0.        (6.1)
This means that the changes ds of the signal for which the linear filter is invariant
must be orthogonal to the filter.
Since the invariance properties of linear filters are very limited, it is natural
to try second-order functions, which means that f is an outer product of the pixel
data. For a quadratic function F, the output can be written as q = sT Fs. Here, the
invariance is defined by
dq = 2 ds^T F s = 0.        (6.2)
This expression can, for example, include the invariances in the linear case if
F = ffT . But the quadratic filter can also have invariance properties that depend
on the signal s and not only on the change ds as in the linear case. An example
illustrating the differences between invariances of linear and quadratic functions
is illustrated in figure 6.3. In the linear case, the invariances define lines in the
two-dimensional case (hyper-planes in general). The lines are orthogonal to f. In
the quadratic case, the invariances can define, for example, hyperbolic or parabolic
surfaces or ellipsoids. One example of interesting invariance properties of second-order functions is shift or phase invariance when the filter is applied on a sine wave
pattern. This is the case for the norm of the output from a pair of quadrature filters,
which is a quadratic function of the pixel data.
Figure 6.3: Examples of invariances for linear (left) and quadratic (right)
two-dimensional functions. The lines are iso-curves on which the function is constant. A change of the parameter vector s along the lines will
not change the output.
6.1 Experiments
If f is an outer product and the image pairs contain sine wave patterns with equal
orientations but different phase, the CCA should find a linear combination of the
outer products that is sensitive with respect to the orientation and invariant with
respect to phase. As illustrated in the experiments below, this is also what happens. The outer products weighted by the canonical correlation vectors can be
interpreted as outer products of linear filters. As shown in the experiment, these
linear filters are approximately quadrature filters, which explains the phase invariance of the product. The findings of quadrature filters in the interpretation of the
result of the CCA can serve as a motive for trying products of quadrature filter
outputs as input to CCA on a higher level.
To simplify the description, two functions are used to reshape a matrix into a
vector and the other way around: vec(M) transforms (flattens) an m × n matrix M into a vector with mn components (see definition A.1) and mtx(v, m, n) reshapes the vector v into an m × n matrix (see definition A.2). In particular, for an m × n matrix M,

mtx(vec(M), m, n) = M.        (6.3)
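These two operations correspond directly to ordinary array reshaping; a hypothetical sketch (the row-major ordering is an assumption, the exact convention is given by definition A.1):

```python
import numpy as np

def vec(M):
    """Flatten an m x n matrix into a vector of mn components."""
    return np.asarray(M).reshape(-1)

def mtx(v, m, n):
    """Reshape a vector of mn components back into an m x n matrix."""
    return np.asarray(v).reshape(m, n)

M = np.arange(6).reshape(2, 3)
assert np.array_equal(mtx(vec(M), 2, 3), M)     # equation 6.3
```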
6.1.1 Learning quadrature filters
The first experiment shows that quadrature filters are found by the method discussed above when products of pixel data are presented to the algorithm.
Figure 6.4: Illustration of the generation of input data vectors x and y as
outer products of pixel data. See the text for a detailed explanation.
Let Ix and Iy be a pair of 5 × 5 image patches. Each image consists of a sine wave pattern with a frequency of 2π/5 and additive Gaussian noise. A sequence of
such image pairs is constructed so that, for each pair, the orientation is equal in the
two images while the phase differs in a random way. The images have independent
noise. Each image pair is described by vectors ix = vec(Ix ) and iy = vec(Iy ).
Let x and y be vectors describing the outer products of the image vectors, i.e.
x = vec(ix ix^T) and y = vec(iy iy^T). This gives a sequence of pairs of 625-dimensional
vectors describing the products of pixel data from the images. This scheme is
illustrated in figure 6.4.
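A hypothetical sketch of this data generation: pairs of 5 × 5 sine patches share orientation but have independent random phase and noise, and each patch is turned into the 625-dimensional outer-product vector. The noise amplitude and random seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = np.meshgrid(np.arange(5), np.arange(5))

def patch(orientation, phase, noise_amp=1.0):
    """5x5 sine wave patch with frequency 2*pi/5 plus Gaussian noise."""
    arg = (2 * np.pi / 5) * (u * np.cos(orientation) + v * np.sin(orientation))
    return np.sin(arg + phase) + noise_amp * rng.standard_normal((5, 5))

def make_pair():
    theta = rng.uniform(0, np.pi)                      # shared orientation
    ix = patch(theta, rng.uniform(0, 2 * np.pi)).reshape(-1)
    iy = patch(theta, rng.uniform(0, 2 * np.pi)).reshape(-1)
    return np.outer(ix, ix).reshape(-1), np.outer(iy, iy).reshape(-1)   # 625-dim x, y

X, Y = map(np.array, zip(*(make_pair() for _ in range(6500))))
```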
The sequence consists of 6,500 examples, i.e. 20 examples per degree of freedom. (The outer product matrices are symmetric and, hence, the number of free parameters is (n² + n)/2 where n is the dimensionality of the image vector.) For a signal
to noise ratio (SNR) of 0 dB, there were 6 significant1 canonical correlations and
¹By significant, we mean that they differ from the random correlations caused by the limited
Figure 6.5: Left: The 20 largest canonical correlations for the 10 dB SNR
sequence. Right: Projections of outer product vectors x onto the 8 first
canonical correlation vectors.
for an SNR of 10 dB there were 8 significant canonical correlations. The canonical
correlations are plotted to the left in figure 6.5. The two most significant correlations for the 0 dB case were both 0.7 which corresponds to an SNR2 of 3.7 dB.
For the 10 dB case, the two highest correlations were both 0.989, corresponding
to an SNR of 19.5 dB.
The projections of image signals x for orientations between 0 and π onto the 8
first canonical correlation vectors wx from the 10 dB case are shown to the right in
figure 6.5. The test signals were generated with random phase and without noise.
As seen in the figure, the filters defined by the first two canonical correlation vectors are sensitive to the double angle of the orientation of the signal and invariant
with respect to phase. The two curves are 90° out of phase and, hence, generate
double angle representation (see figure 5.2 on page 102). The following curves
show the projections onto the successive canonical correlation vectors with lower
canonical correlations. The filters defined by these vectors are sensitive to the
fourth, sixth and eighth multiples of the orientation.
set of samples. The random correlations, in the case of 20 samples per degree of freedom, is
approximately 0.4 (given by experiments).
²The relation between correlation and SNR in this case is defined by the correlation between two signals with the same SNR, i.e. corr(s + η1, s + η2). (See section 4.4.2.)
Interpretation of the result
It is not easy to interpret the 625 coefficients in each canonical correlation vector.
But since the data actually were generated as outer products, i.e. 25 × 25 matrices, the interpretation of the resulting canonical correlation vectors can be facilitated by writing them as 25 × 25 matrices, Wx = mtx(wx, 25, 25). This means that the projection of x onto a canonical correlation vector wx can be written as

x^T wx = ix^T Wx ix,        (6.4)
where ix is the pixel data vector. By an eigenvalue decomposition of Wx , this
projection can be written as
x^T wx = ix^T ( ∑_j λj ej ej^T ) ix = ∑_j λj (ix^T ej)²,        (6.5)
i.e. a square sum of the pixel data vector projected onto the eigenvectors of Wx
weighted with the corresponding eigenvalues. This means that the eigenvectors
ej can be seen as linear filters and the curves plotted to the right in figure 6.5 are
weighted square sums of the pixel data vectors projected onto the eigenvectors of
the matrices Wxi .
It turns out that only a few of the eigenvectors give significant contributions
to the projection sum in equation 6.5. This can be seen if the terms in the sum are
averaged over all orientations of the signal:
mj = E[ λj (ix^T ej)² ].        (6.6)
The coefficients m j measure the average energy picked up by the corresponding
eigenvectors and can therefore be seen as significance measures for the different
eigenvectors. In figure 6.6, the significance measures m j for the 25 eigenvectors
are plotted for the two first canonical correlation vectors wx1 and wx2 .
Since the projections of x onto the canonical correlation vectors wx can be described in terms of projections of pixel data ix onto a few 25-dimensional eigenvectors e j , these eigenvectors can be used to interpret the canonical correlation
vectors. Since the image data ix are collected from 5 × 5 neighbourhoods Ix, it is logical to view also the eigenvectors ej as 5 × 5 matrices Ej. These matrices can
be called eigenimages. The process of extracting eigenimages from a canonical
correlation vector is illustrated in figure 6.7. In figure 6.8, the four most significant eigenimages are shown for the first (top) and second (bottom) canonical
correlations respectively.
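A hypothetical sketch of this extraction: a 625-dimensional canonical correlation vector is reshaped to a 25 × 25 matrix, eigendecomposed, and the most important eigenvectors are reshaped to 5 × 5 eigenimages. Ranking by eigenvalue magnitude is used here as a simple stand-in for the data-dependent significance measure mj of equation 6.6.

```python
import numpy as np

def eigenimages(wx, n_keep=4):
    """Extract the n_keep most important 5x5 eigenimages from a 625-dim vector wx."""
    Wx = wx.reshape(25, 25)
    Wx = 0.5 * (Wx + Wx.T)                      # symmetrize (outer-product data is symmetric)
    eigvals, eigvecs = np.linalg.eigh(Wx)
    order = np.argsort(np.abs(eigvals))[::-1]   # rank by magnitude of eigenvalue
    return [(eigvals[j], eigvecs[:, j].reshape(5, 5)) for j in order[:n_keep]]
```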
The eigenimages can be interpreted as quadrature filter pairs, i.e. filter pairs
that have the same spectrum and differ 90° in phase (see section 5.2). For wx1,
Figure 6.6: The significance measures m j for the 25 eigenvectors for the
two first canonical correlation vectors wx1 (top) and wx2 (bottom).
Figure 6.7: Illustration of the extraction of 5 × 5 eigenimages Ej from a 625-dimensional canonical correlation vector wx.
Figure 6.8: The four most significant eigenimages are shown for the first
(top row) and second (bottom row) canonical correlations respectively.
eigenimages E15 and E16 form a quadrature pair in one direction and eigenimages E17 and E18 form a quadrature pair in the perpendicular direction. The same
interpretation can be made for wx2 . To see more clearly that this interpretation
is correct, the eigenimage pairs can be combined in the same way as complex
quadrature filters, i.e. as one real filter and one imaginary filter with a phase difference of 90 , by multiplying one of the filters3 with i (see section 5.2, page 101).
The spectra of the combinations E15 + iE16 and E17 + iE18 for wx1 are shown in
the upper row in figure 6.9. In the lower row, the spectra of the combinations
E12 + iE14 and E13 + iE16 for wx2 are shown. The DC-component is in the centre
of the spectrum. The white circle illustrates the centre frequency of the training
signal. The blobs in the figure show that these eight eigenvectors can be interpreted as four quadrature filter pairs in four different directions.
6.1.2 Combining products of filter outputs
In this experiment, outputs from neighbouring sets of quadrature filters are used
rather than pixel values as input to the algorithm. The experimental result shows
that canonical correlation can find a way of combining filter outputs from a local
neighbourhood to get orientation estimates that are less sensitive to noise than the
³ Usually, the real filter is symmetric and the imaginary filter is anti-symmetric. However, the choice of offset phase does not matter as long as the filters differ 90° in phase.
Figure 6.9: Spectra for the eigenimages interpreted as complex quadrature filter pairs: |F(E15 + iE16)|² and |F(E17 + iE18)|² for w_x1 (upper row); |F(E12 + iE14)|² and |F(E13 + iE16)|² for w_x2 (lower row).
vector averaging method (see section 5.3 on page 102).
Let q_xi and q_yi, i ∈ {1,...,25}, be 4-dimensional complex vectors of filter responses from four quadrature filters at each of 25 different positions in a 5×5 neighbourhood. The quadrature filters used here have kernels of 7×7 pixels, a centre frequency of π/(2√2) and a bandwidth of two octaves. Let X_i = q_xi q_xi^* and Y_i = q_yi q_yi^* be the outer products of the filter responses in each position for each image. Finally, all products are gathered into two 400-dimensional vectors:
$$\mathbf{x} = \begin{pmatrix} \mathrm{vec}(\mathbf{X}_1) \\ \mathrm{vec}(\mathbf{X}_2) \\ \vdots \\ \mathrm{vec}(\mathbf{X}_{25}) \end{pmatrix}
\quad\text{and}\quad
\mathbf{y} = \begin{pmatrix} \mathrm{vec}(\mathbf{Y}_1) \\ \mathrm{vec}(\mathbf{Y}_2) \\ \vdots \\ \mathrm{vec}(\mathbf{Y}_{25}) \end{pmatrix}. \qquad (6.7)$$

This scheme is illustrated in figure 6.10.
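To make the construction of equation 6.7 concrete, the following sketch (not code from the thesis; the array names, shapes and the use of a conjugated outer product are assumptions) builds 400-dimensional input vectors from hypothetical quadrature filter responses:

```python
import numpy as np

def build_cca_input(q):
    """Stack the outer products of the quadrature filter responses from all
    positions of a neighbourhood into one long vector (cf. equation 6.7).
    `q` is assumed to have shape (25, 4): 4 complex filter responses at
    each of the 25 positions of a 5x5 neighbourhood."""
    q = np.asarray(q, dtype=complex)
    blocks = [np.outer(qi, qi.conj()).reshape(-1) for qi in q]  # vec(X_i), 16 values each
    return np.concatenate(blocks)                               # 25 * 16 = 400 components

# Random numbers stand in for real filter outputs in this illustration.
rng = np.random.default_rng(0)
qx = rng.standard_normal((25, 4)) + 1j * rng.standard_normal((25, 4))
qy = rng.standard_normal((25, 4)) + 1j * rng.standard_normal((25, 4))
x, y = build_cca_input(qx), build_cca_input(qy)
print(x.shape, y.shape)   # (400,) (400,)
```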
Figure 6.10: Illustration of the generation of input data vectors x and y as outer products of quadrature filter response vectors from 5×5 neighbourhoods.
Figure 6.11: Angular errors (in degrees) for 1,000 different samples using canonical correlations (top) and vector averaging (bottom).
8,000 pairs of vectors were generated. For each pair of vectors, the local orientation was equal while the phase and noise differed randomly. Gaussian noise was added to the images, giving 0 dB SNR. The data set was analysed using CCA. The two largest canonical correlations were both 0.85. The corresponding vectors detected the double angle of the orientation, invariant with respect to phase.

New test data were generated using a rotating sine-wave pattern with an SNR of 0 dB and projected onto the first two canonical correlation vectors. The angular error⁴ is shown in the upper plot in figure 6.11. The lower plot shows the angular error using vector averaging on the same data. The standard deviation of the angular error was 9.4° with the CCA method and 14.8° using vector averaging. This is an improvement in SNR of 4 dB compared to the result when using vector averaging on the same neighbourhood size.
⁴ The mean angular error is not relevant since it only depends on a reference orientation. The reference orientation can be arbitrarily chosen and, hence, it has been chosen so that the mean angular error is zero.
6.2 Discussion
In this chapter, it has been shown how a system can learn image feature descriptors
by using canonical correlation. A nice property of the method is that the training
is done by giving examples of what the user defines as being “equal”. In the
experiments, sine wave patterns were considered to be “equal” if they had the
same orientation, irrespectively of the phase. This was presented to the system as a
set of examples and the user did not have to figure out how to represent orientation
and phase. In the first experiment, the system developed a phase invariant double
angle orientation representation.
This type of learning is of course more useful for higher-level feature descriptors, where it can be difficult to define proper representations of features. Such
features are for example corners and line crossings.
In the next chapter, another application of this method is presented, namely
disparity estimation, where the horizontal displacement between the images is
equal within the training set.
Chapter 7
Disparity estimation using CCA
An important problem in computer vision that is suitable to handle with CCA is
stereo vision, since data in this case naturally appear in pairs. In this chapter, a
novel stereo vision algorithm that combines CCA and phase analysis is presented.
The algorithm has been presented in a paper submitted to ICIPS’98 (Borga and
Knutsson, 1998).
For a learning system, the stereo problem is difficult to solve; for small disparities, the high-frequency filters will give the highest accuracy, while for large
disparities, the high-frequency filters will be uncorrelated with the disparity and
only the low-frequency filters can be used. So the choice of which filters to use
for the disparity estimate must be based on a disparity estimate! Furthermore,
a general learning system cannot be supposed to know beforehand which inputs
come from a certain scale¹. A solution to this problem is to let the system adapt
filters to fit the disparity in question instead of using fixed filters.
The algorithm described here consists of two parts: CCA and phase analysis.
Both are performed for each disparity estimate. Canonical correlation analysis
is used to create adaptive linear combinations of quadrature filters. These linear
combinations are new quadrature filters that are adapted in frequency response
and spatial position in order to maximize the correlation between the filter outputs
from the two images.
These new filters are then analysed in the phase analysis part of the algorithm.
The coefficients given by the canonical correlation vectors are used as weighting
coefficients in a pre-computed table that allows for an efficient phase-based search
for disparity.
¹ This problem is similar to the problem a system faces when learning to interpret numbers that are represented by one digit on each input. Only the most significant digit will have any correlation with the correct number, but the correlation will be weak due to the coarse quantization. Only after this digit has been identified is it possible to detect the use of the next digit.

In the following two sections, the two parts of the stereo algorithm are described in more detail. In section 7.3 some experiments are presented to illustrate the performance of the proposed method. Finally, the method is discussed in section 7.4.
7.1 The canonical correlation analysis part
The input x and y to the CCA come from the left and right images respectively.
Each input is a vector with outputs from a set of quadrature filters:
$$\mathbf{x} = \begin{pmatrix} q_{x1} \\ \vdots \\ q_{xN} \end{pmatrix}
\quad\text{and}\quad
\mathbf{y} = \begin{pmatrix} q_{y1} \\ \vdots \\ q_{yN} \end{pmatrix}, \qquad (7.1)$$
where qi is the (complex) filter output for the ith quadrature filter in the filter set.
The quadrature filters can be seen as the functions f in figure 6.1 on page 108. In this case, f is a complex vector-valued linear function, i.e. a complex matrix. In the implementation described here, the filter set consists of two identical one-dimensional (horizontal) quadrature filters with a relative displacement of two pixels.
(Other and larger sets of filters can be used including, for example, filters with
different bandwidths, different centre frequencies, different positions, etc.)
The data are sampled from a neighbourhood N around the point of the disparity estimate. The choice of neighbourhood size is a compromise between noise
sensitivity and locality. The covariance matrix C is calculated using the vectors x
and y in N . The fact that quadrature filters have zero mean simplifies this calculation to an outer product sum:
$$\mathbf{C} = \sum_{j\in\mathcal{N}} \begin{pmatrix} \mathbf{x}_j \\ \mathbf{y}_j \end{pmatrix}
\begin{pmatrix} \mathbf{x}_j \\ \mathbf{y}_j \end{pmatrix}^{*} \qquad (7.2)$$
If a rectangular neighbourhood is used, this calculation can be made efficient by a Cartesian separable summation of the outer products as illustrated in figure 7.1. First the outer products are summed in a window moving horizontally along each row. Then this result is summed again by using a window moving vertically along each column. This scheme requires 2·m·n additions and subtractions of outer products, where m×n is the size of the image (except for the borders that are not reached by the centre of the neighbourhood). This can be compared to a straightforward summation over each neighbourhood, which requires N·m·n additions of outer products, where N is the size of the neighbourhood. Hence, for a neighbourhood of 10×10, the separable summation is 50 times faster than a straightforward summation over each neighbourhood.
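A sketch of the separable summation idea is given below. It uses a cumulative sum rather than the literal moving add/subtract window, but it shares the essential property that the cost per neighbourhood does not grow with the neighbourhood size; the array shapes and dimensions are assumptions, not the thesis implementation.

```python
import numpy as np

def sliding_window_sum(a, size, axis):
    """Sum of `size` consecutive elements along `axis`, via a cumulative
    sum, so the cost per output element is independent of `size`."""
    c = np.cumsum(a, axis=axis)
    zero = np.zeros_like(np.take(c, [0], axis=axis))
    c = np.concatenate([zero, c], axis=axis)
    hi = np.take(c, np.arange(size, c.shape[axis]), axis=axis)
    lo = np.take(c, np.arange(0, c.shape[axis] - size), axis=axis)
    return hi - lo

# P[r, c] is assumed to hold the outer product of the stacked vector
# (x_j; y_j) at pixel (r, c), i.e. an array of shape (rows, cols, d, d).
rng = np.random.default_rng(1)
P = rng.standard_normal((40, 40, 4, 4))
C_all = sliding_window_sum(sliding_window_sum(P, 7, axis=0), 13, axis=1)
print(C_all.shape)   # (34, 28, 4, 4): one summed matrix per 7x13 neighbourhood
```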
Figure 7.1: Cartesian separable summation of the outer products.
The first canonical correlation ρ1 and the corresponding (complex) vectors wx
and wy are then calculated. If the set of filters is small, this is done by solving
equation 4.26 on page 68. In the case where only two filters are used, this calculation is very simple. If very large sets of filters are used, an analytical calculation
of the canonical correlation becomes computationally very expensive. In such a
case, the iterative algorithm presented in section 4.7.3 can be used.
The canonical correlation vectors define two new filters:
$$\mathbf{f}_x = \sum_{i=1}^{M} w_{xi}\,\mathbf{f}_i
\quad\text{and}\quad
\mathbf{f}_y = \sum_{i=1}^{M} w_{yi}\,\mathbf{f}_i, \qquad (7.3)$$
where fi are the basis filters, M is the number of filters in the filter set and wxi and
wyi are the components in the first pair of canonical correlation vectors. Due to
the properties of canonical correlation, the new filters, fx and fy , have outputs with
maximum correlation over N , given the set of basis filters fi .
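The following sketch illustrates, under assumptions about data layout and conjugation conventions, how the first canonical correlation pair could be computed from the joint covariance matrix of equation 7.2 and how the adapted filters of equation 7.3 are then formed. It is a small dense solve, not the iterative algorithm of section 4.7.3 and not the thesis implementation.

```python
import numpy as np

def first_canonical_pair(C, n):
    """First canonical correlation and vectors, given the joint covariance
    matrix C of the stacked vector (x; y), where x has n components.
    Small dense solve; adequate when the basis filter set is small."""
    Cxx, Cxy = C[:n, :n], C[:n, n:]
    Cyx, Cyy = C[n:, :n], C[n:, n:]
    # The eigenvalues of Cxx^-1 Cxy Cyy^-1 Cyx are the squared canonical correlations.
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cyx)
    vals, vecs = np.linalg.eig(M)
    k = int(np.argmax(vals.real))
    rho = float(np.sqrt(max(vals[k].real, 0.0)))
    wx = vecs[:, k]
    wy = np.linalg.solve(Cyy, Cyx @ wx)       # corresponding y-side direction
    wx /= np.linalg.norm(wx)
    wy /= np.linalg.norm(wy)
    return rho, wx, wy

def adapted_filter(w, basis):
    """Equation 7.3: combine the basis filter kernels (rows of `basis`)
    with the components of a canonical correlation vector."""
    return np.tensordot(w, np.asarray(basis), axes=1)
```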
7.2 The phase analysis part
The key idea of this part is to search for the disparity that corresponds to a real-valued correlation between the two new filters. This idea is based on the fact that
canonical correlations are real valued (see proof B.4.1 on page 165). In other
words, find the disparity δ such that
$$\mathrm{Im}\left[\mathrm{Corr}\left(q_y(\xi+\delta),\, q_x(\xi)\right)\right] = \mathrm{Im}\left[c(\delta)\right] = 0, \qquad (7.4)$$
where qx and qy are the left and right filter outputs respectively and ξ is the spatial
(horizontal) coordinate. There does not seem to exist a well-established definition
of correlation for complex variables. The definition used here (see definition A.3)
is a generalization of correlation for real-valued variables similar to the definition
of covariance for complex variables.
A calculation of the correlation over N for all δ would be very expensive.
A much more efficient solution is to assume that the signal s can be described
by a covariance matrix Css . Under this assumption, the correlation between the
left filter convolved with the signal s and the right filter convolved with the same
signal shifted a certain amount δ can be measured. But convolving a filter with a
shifted signal is the same as convolving a shifted filter with the non-shifted signal.
Hence, the correlation c(δ) can be calculated as the correlation between the left
filter convolved with s and a shifted version of the right filter convolved with the
same signal s.
Under the assumption that the signal s has the covariance matrix Css , the correlation in equation 7.4 can be written as
$$c(\delta) = \frac{E\left[q_x\,q_y^*(\delta)\right]}{\sqrt{E\left[|q_x|^2\right]\,E\left[|q_y|^2\right]}}
= \frac{E\left[(\mathbf{f}_x^*\mathbf{s})\,(\mathbf{f}_y^*(\delta)\,\mathbf{s})^*\right]}{\sqrt{E\left[|\mathbf{f}_x^*\mathbf{s}|^2\right]\,E\left[|\mathbf{f}_y^*(\delta)\,\mathbf{s}|^2\right]}}
= \frac{E\left[\mathbf{f}_x^*\,\mathbf{s}\,\mathbf{s}^*\,\mathbf{f}_y(\delta)\right]}{\sqrt{E\left[\mathbf{f}_x^*\mathbf{s}\,\mathbf{s}^*\mathbf{f}_x\right]\,E\left[\mathbf{f}_y^*(\delta)\,\mathbf{s}\,\mathbf{s}^*\,\mathbf{f}_y(\delta)\right]}}
= \frac{\mathbf{f}_x^*\,\mathbf{C}_{ss}\,\mathbf{f}_y(\delta)}{\sqrt{\mathbf{f}_x^*\mathbf{C}_{ss}\mathbf{f}_x\;\;\mathbf{f}_y^*(\delta)\,\mathbf{C}_{ss}\,\mathbf{f}_y(\delta)}}, \qquad (7.5)$$
where fy (δ) is a shifted version of fy . Remember that the quadrature filter outputs have zero mean, which is necessary for the first equality. Note the similarity
between the last expression and the expression for canonical correlation in equation 4.24 on page 68.
A lot of the computations needed to calculate c(δ) can be saved since
$$\mathbf{f}_x^*\,\mathbf{C}_{ss}\,\mathbf{f}_y(\delta)
= \left(\sum_{i=1}^{M} w_{xi}\,\mathbf{f}_i\right)^{\!*} \mathbf{C}_{ss} \left(\sum_{j=1}^{M} w_{yj}\,\mathbf{f}_j(\delta)\right)
= \sum_{i=1}^{M}\sum_{j=1}^{M} w_{xi}^*\,w_{yj}\;\mathbf{f}_i^*\,\mathbf{C}_{ss}\,\mathbf{f}_j(\delta)
= \sum_{ij} v_{ij}\,g_{ij}(\delta), \qquad (7.6)$$

where

$$g_{ij}(\delta) = \mathbf{f}_i^*\,\mathbf{C}_{ss}\,\mathbf{f}_j(\delta). \qquad (7.7)$$
The function gi j (δ) does not depend on the result from the CCA and can therefore
be calculated in advance for different disparities δ and stored in a table. The
denominator in equation 7.5 can be treated in the same way but does not depend
on δ:
$$\mathbf{f}_x^*\,\mathbf{C}_{ss}\,\mathbf{f}_x = \sum_{ij} v_{ij}^x\,g_{ij}(0)
\quad\text{and}\quad
\mathbf{f}_y^*\,\mathbf{C}_{ss}\,\mathbf{f}_y = \sum_{ij} v_{ij}^y\,g_{ij}(0), \qquad (7.8)$$

where $v_{ij}^x = w_{xi}^*\,w_{xj}$ and $v_{ij}^y = w_{yi}^*\,w_{yj}$. Note that the filter vectors f must be padded with zeros at both ends to enable the scalar product between a filter and a filter shifted by δ. (The zeros do not, of course, affect the result of equation 7.6.) In the case of two basis filters, the table contains four rows and eight constants.
Hence, for a given disparity a (complex) correlation c(δ) can be computed as
a normalized weighted sum:
$$c(\delta) = \frac{\sum_{ij} v_{ij}\,g_{ij}(\delta)}{\sqrt{\sum_{ij} v_{ij}^x\,g_{ij}(0)\;\sum_{ij} v_{ij}^y\,g_{ij}(0)}}. \qquad (7.9)$$
The aim is to find the δ for which the correlation c(δ) is real valued. This is
done by finding the zero crossings of the phase of the correlation. A very coarse
quantization of δ can be used in the table since the phase is, in general, rather
linear near the zero crossing (as opposed to the imaginary part which in general is
not linear). Hence, first a coarse estimate of the zero crossing is obtained. Then the
derivative of the phase at the zero crossing is measured, using two neighbouring
samples. Finally, the error in the coarse estimate is compensated for by using the
actual phase value and the phase derivative at the estimated position:
$$\delta = \delta_c - \frac{\varphi(\delta_c)}{\partial\varphi/\partial\delta}, \qquad (7.10)$$
where δc is the coarse estimate of the zero crossing and ϕ(δc ) is the complex phase
of c(δc ) (see figure 7.2 on the next page).
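A rough sketch of the phase analysis of equations 7.9 and 7.10 is given below; the table layout, the conjugation convention in v_ij and the zero-crossing bookkeeping are assumptions and not the thesis implementation.

```python
import numpy as np

def complex_correlation_curve(wx, wy, g, deltas):
    """Equation 7.9: c(delta) from a table g[i, j, k] = g_ij(deltas[k])
    and the first pair of canonical correlation vectors."""
    v = np.einsum('i,j->ij', wx.conj(), wy)     # v_ij (assumed conjugation convention)
    vx = np.einsum('i,j->ij', wx.conj(), wx)
    vy = np.einsum('i,j->ij', wy.conj(), wy)
    k0 = int(np.argmin(np.abs(np.asarray(deltas))))   # table sample closest to delta = 0
    num = np.einsum('ij,ijk->k', v, g)
    den = np.sqrt(np.einsum('ij,ij->', vx, g[:, :, k0]).real *
                  np.einsum('ij,ij->', vy, g[:, :, k0]).real)
    return num / den

def refine_zero_crossings(c, deltas):
    """Equation 7.10: coarse phase zero crossings refined with the local
    phase value and phase derivative; returns (disparity, |c|) pairs."""
    phase = np.angle(c)
    out = []
    for k in range(len(deltas) - 1):
        if phase[k] * phase[k + 1] <= 0 and abs(phase[k + 1] - phase[k]) < np.pi:
            slope = (phase[k + 1] - phase[k]) / (deltas[k + 1] - deltas[k])
            if slope != 0:
                out.append((deltas[k] - phase[k] / slope, abs(c[k])))
    return out
```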
7.2.1 The signal model
If the signal model is uncorrelated white noise, Css is the identity matrix and the
calculations of the values in the table reduce to a simple scalar product: g_ij(δ) = f_i^* f_j(δ). There is no computational reason to choose white noise as signal model
if there is a better model, since the table is calculated only once. But it can still
be interesting to compare the correlation for white noise with the correlation for
another signal model in order to get a feeling for the algorithm’s sensitivity with
respect to the signal model. In other words, how does the choice of model affect
the position of the zero phase of c(δ)?
Figure 7.2: The estimation of the coordinate δ₀ of the phase zero crossing using the coarse estimate δc of the zero crossing, the phase value ϕ(δc) and the derivative at the coarse estimate. The black dots illustrate the sampling points of the phase given by the table g_ij(δ).
First of all, it should be noted that the denominator in equation 7.5 is real valued and, hence, does not affect the complex phase of c(δ). So, only the numerator

$$c'(\delta) = \mathbf{f}_x^*\,\mathbf{C}_{ss}\,\mathbf{f}_y(\delta) \qquad (7.11)$$

has to be considered. In general, C_ss is a Toeplitz matrix (i.e. C_ij = C(i−j)) with the columns (and rows) containing shifted versions of the (non-normalized) autocorrelation function c_s of the signal s. This means that f̃_x = f_x^* C_ss can be seen as a convolution of f_x with the autocorrelation function c_s. But

$$c'(\delta) = \tilde{\mathbf{f}}_x\,\mathbf{f}_y(\delta) \qquad (7.12)$$

can be seen as a convolution between f̃_x and f_y⁻, where f_y⁻ is f_y reversed, since δ only causes a shift of f_y. This means that c'(δ) can be written as

$$c'(\delta) = (\mathbf{f}_x * c_s) * \mathbf{f}_y^{-}, \qquad (7.13)$$

where * denotes convolution. Since the order of convolutions does not matter (convolution is commutative and associative), c'(δ) can be written as

$$c'(\delta) = (\mathbf{f}_x * \mathbf{f}_y^{-}) * c_s = \left(\mathbf{f}_x^*\,\mathbf{f}_y(\delta)\right) * c_s, \qquad (7.14)$$

i.e. the convolution between f_x and f_y⁻ can be calculated first. This function can then be convolved with the autocorrelation function to get the correct c'(δ) for the model.
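The following sketch shows one way the table g_ij(δ) could be pre-computed, assuming integer disparity samples and zero-padded kernels (both assumptions). With a white-noise model the entries are plain scalar products, and another signal model can be obtained by convolving each table row along δ with the model's autocorrelation function, in the spirit of equation 7.14.

```python
import numpy as np

def shifted(f, delta, pad):
    """Zero-pad a 1-D kernel and shift it `delta` (integer) samples."""
    out = np.zeros(len(f) + 2 * pad, dtype=complex)
    out[pad + delta: pad + delta + len(f)] = f
    return out

def make_table(basis, deltas, autocorr=None):
    """Pre-compute g_ij(delta).  With a white-noise model (autocorr=None)
    each entry is the scalar product between f_i and a shifted f_j; for
    another signal model the white-noise table is convolved along delta
    with samples of the model's autocorrelation function."""
    pad = int(max(abs(d) for d in deltas))
    ref = [shifted(f, 0, pad) for f in basis]
    g = np.empty((len(basis), len(basis), len(deltas)), dtype=complex)
    for i, fi in enumerate(ref):
        for j, fj in enumerate(basis):
            for k, d in enumerate(deltas):
                g[i, j, k] = np.vdot(fi, shifted(fj, int(d), pad))  # conjugates fi
    if autocorr is not None:
        g = np.apply_along_axis(
            lambda row: np.convolve(row, autocorr, mode='same'), 2, g)
    return g
```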
This means that the difference between the correlation c(δ) calculated for
white noise (i.e. Css is the identity matrix) and the correlation calculated by using
Figure 7.3: The phase of the four rows of the table containing g_ij(δ) without convolution (solid line) and with convolution with the autocorrelation function 1 − |ξ|.
another signal model is given by the convolution of c(δ) with the autocorrelation
function of the signal model (and an amplitude scaling that does not affect the
phase). Hence, if the phase around the zero crossing is anti-symmetric (e.g. linear) in an interval that is large compared to the autocorrelation function of the
signal model, the result will be very similar to that obtained for a white noise
model. Another lax interpretation of the reasoning above is that as long as the
phase does not have bad behaviour around zero, the choice of signal model is not
critical.
As an example, the phases of the four rows in a table gi j (δ) are plotted in
figure 7.3 with and without convolution with the autocorrelation function 1 − |ξ|.
(Note that two of the rows, g11 and g22 , are equal, which means that only three
curves for each case are visible.) This autocorrelation function is usually assumed
for natural images.
7.2.2 Multiple disparities
If more than one zero crossing is detected, the magnitudes of the correlations can
be used to select a solution. Since the CCA searches for maximum correlation,
the zero crossing with maximum correlation c(δ) is most likely to be the best
estimate. If two zero crossings have approximately equal magnitude (and the
canonical correlation ρ is high), both disparity estimates can be considered to be
correct within the neighbourhood, which indicates either a depth discontinuity or
that there really exist two disparities.
Figure 7.4: A simple example of a pair of filters (f_x and f_y) that have two correlation peaks.
The latter is the case for semi-transparent images, i.e. images that are sums
of images with different depths. Such images are typical of many medical applications such as x-ray images. An every-day example of this kind of image is
obtained by looking through a window with reflection. (The effect on the intensity
of a light ray or X-ray when passing two objects is in fact multiplicative, but a logarithmic transfer function is usually applied when generating X-ray images, which makes the images additive.)
Note that both disparity estimates are represented by the same canonical correlation solution. This means that the CCA must generate filters that have correlation peaks for two different disparities. To see how this can be done, consider the
simple filter pair illustrated in figure 7.4. The autocorrelation function (or convolution) between these two filters is identical to the left filter, which consists of
two impulses. The example is much simplified, but illustrates the possibility of
having two filters with two correlation peaks. If the CCA was used directly on
the pixel data instead of on the quadrature filter outputs, such a filter pair could
develop. In the present method, the image data are represented by using other
basis functions (the quadrature filters of the basis filter set) but it is still possible
to construct filters with two correlation peaks.
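Continuing the earlier sketch, a simple (and entirely hypothetical) selection rule could keep the strongest zero crossing and report a second estimate only when its correlation magnitude is comparable:

```python
def select_disparities(estimates, keep_ratio=0.8):
    """Pick the disparity estimate(s) with the strongest correlation
    magnitude; if a second zero crossing is nearly as strong, report
    both (a possible depth discontinuity or semi-transparency).
    `estimates` is a list of (disparity, magnitude) pairs, as produced
    by the earlier sketch; the ratio threshold is an arbitrary choice."""
    if not estimates:
        return []
    ranked = sorted(estimates, key=lambda e: e[1], reverse=True)
    best = ranked[0][1]
    return [e for e in ranked if e[1] >= keep_ratio * best]
```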
7.2.3 Images with different scales
If the images are differently scaled, the CCA will try to create filters scaled correspondingly. In order to improve the disparity estimates in these cases, the table
can be extended with scaled versions of the basis filters:
$$g_{ij}(\sigma, \delta) = \mathbf{f}_i^*\,\mathbf{C}_{ss}\,\mathbf{f}_j(\sigma, \delta), \qquad (7.15)$$
where f j (σ; δ) is a scaled and shifted version of f j . The motive for this is that a
scaled signal convolved with a certain filter gives the same result as the non-scaled
signal convolved with a reciprocally scaled filter. The CCA step is not affected
by this and the phase analysis is performed as described above on each scale.
The correct scale is indicated by having the maximum real-valued correlation.
The resolution in scale can be very coarse. In the experiments presented in the
following section, the filters have been scaled between ± one octave in steps of
a quarter of an octave, which seems to be a quite sufficient resolution.
It should be noted that the disparity estimates measured in pixels will differ
in the two images since one of the images has a scaled filter as reference. But
given the filter scales, the interpretations in terms of depth are of course the same
in both images.
7.3 Experiments
In this section, some experiments are presented to illustrate the performance of the
stereo algorithm. Some results on artificial data are shown. Finally, the algorithm
is applied to two real stereo image pairs, both common test objects for stereo
images.
In all experiments presented here, a basis filter set consisting of two one-dimensional horizontally oriented quadrature filters, both with a centre frequency of π/4 and a bandwidth of two octaves, has been used. The filters have 15 coefficients in the spatial domain and are shifted two pixels relative to each other. The
frequency function is approximately a squared cosine on a log scale:
$$F(u) \approx \cos^2\left(k\,\ln(u/u_0)\right), \qquad (7.16)$$

where k = π/(2 ln 2) and u₀ = π/4. The actual filter functions are illustrated in
figure 7.5.
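As an illustration only, a filter with the frequency function of equation 7.16 can be sampled on a dense frequency grid and inverse transformed; the grid size and the plain truncation to 15 taps are assumptions and differ from how the thesis filters were actually designed.

```python
import numpy as np

def lognormal_cos2_quadrature(taps=15, u0=np.pi / 4, rel_bandwidth=2.0):
    """1-D quadrature filter whose frequency response follows equation 7.16:
    cos^2(k ln(u/u0)) over a two-octave band on the positive frequency axis,
    zero elsewhere.  A spatial kernel is obtained by an inverse DFT."""
    n = 256                                    # dense design grid (an assumption)
    u = 2 * np.pi * np.fft.fftfreq(n)          # angular frequencies
    k = np.pi / (2 * np.log(rel_bandwidth))
    F = np.zeros(n)
    pos = u > 0
    arg = k * np.log(u[pos] / u0)
    F[pos] = np.where(np.abs(arg) < np.pi / 2, np.cos(arg) ** 2, 0.0)
    f = np.fft.fftshift(np.fft.ifft(F))        # complex (analytic-like) kernel
    centre, half = n // 2, taps // 2
    kernel = f[centre - half: centre + half + 1]
    return kernel / np.abs(kernel).max()

f = lognormal_cos2_quadrature()
print(len(f))   # 15 complex taps: real and imaginary parts form a quadrature pair
```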
For the experiments on the artificial data, the neighbourhood for the CCA has
been chosen to fit the problem reasonably well. This means that the neighbourhood is longer in the direction of constant disparity than in the direction where
the disparity changes. In the real images, a square neighbourhood has been used.
How the choice of neighbourhood can be made adaptive is discussed in section
7.4.
7.3.1 Discontinuities
The first experiment illustrates the algorithm's ability to handle depth discontinuities. The test image is made of white noise shifted so that the disparity varies between ±d along the horizontal axis, and d varies as a ramp from −5 pixels to +5 pixels along the vertical axis in order to get discontinuities between ±10 pixels. A neighbourhood N of 13×7 pixels (horizontal × vertical) was used for the CCA. Figure 7.6 shows the estimated disparity for this test image.
Disparity estimates with corresponding canonical correlations less than 0.7 have
been removed. In figure 7.7, two lines of the disparity estimate are shown. To
Figure 7.5: The filter in the basis filter set (spatial kernel and spectrum).

Figure 7.6: Disparity estimate for different depth discontinuities.
Figure 7.7: Top: Line 20 (left) and line 38 (right) from the disparity estimates in figure 7.6 on the facing page. The small dots indicate the disparity estimates with the second strongest correlations. Bottom: The corresponding correlations.
the left, line 20 with a disparity of ±2.5 pixels is shown and to the right, line
38 with a disparity of 1 pixel is shown. The figures at the top show the most
likely (large dots) and second most likely (small dots) disparity estimates along
these lines. The bottom figures show the corresponding canonical correlations at
the zero crossings. Figures 7.6 and 7.7 show that for small discontinuities, the
algorithm interpolates the estimates while for large discontinuities, there are two
overlapping estimates.
An interpolation or fusion for small disparity differences is also performed by
the human visual system. The depth interval for which all points are fused into a
single image is called Panum’s area (see for example Coren and Ward, 1989).
7.3.2 Scaling
The second experiment shows that the algorithm can estimate disparities between
images that are differently scaled. The test image here is white noise warped to
form a ramp along the horizontal axis. The warping is made so that the right
image is scaled to 50% of the original size which means that there is a scale
difference of one octave. For a human, this corresponds to looking at a point
on a surface with its normal rotated 67° from the observer at a distance of 20 centimetres. In this experiment, a neighbourhood N of 3×31 pixels was used. In
figure 7.8 on the next page the results are shown for the basic algorithm without
the scaling parameter (left) and for the extended algorithm that searches for the
optimal scaling (right). The lines at the back of the graphs show the mean value.

Figure 7.8: Disparity estimate for a scale difference of one octave between the images without scale analysis (left) and with scale analysis (right).
The filters created by the CCA are illustrated in figure 7.9. The left-hand plots
show the filters in the spatial domain and the right-hand plots show them in the
frequency domain.
7.3.3 Semi-transparent images
This experiment illustrates the algorithm’s capability of multiple disparity estimates on semi-transparent images. The test images in this experiment were generated as a sum of two images with white uncorrelated noise. The images were
tilted in opposite directions around the horizontal axis. The disparity range was ±5 pixels. Figure 7.10 illustrates the test scene. The stereo pair is shown in figure 7.11 on page 134. Here, the averaging or fusion performed by the human visual system for small disparities can be seen in the middle of the image. A neighbourhood N of 31×3 pixels was used for the CCA. The result is shown
in figure 7.12. In figure 7.13 on page 136, the estimates are projected along the
horizontal axis. The results show that the disparities of both the planes are approximately estimated. In the middle, where the disparity difference is small, the
result is an average between the two disparities in accordance with the results
illustrated in figure 7.7 on the preceding page.
Figure 7.9: The filters created by the CCA (left and right filters and their spectra). Solid lines show the real parts and dashed lines show the imaginary parts.
Figure 7.10: The test image scene for semi-transparent images.
Figure 7.11: The stereo image pair (left and right) for the semi-transparent images.
7.3.4 An artificial scene
This experiment tries to simulate a slightly more realistic case where both the
discontinuity problem and the scale problem are present. The scene can be thought
of as a pole or a tree in front of a wall. Figure 7.14 on page 136 illustrates the
scene from above. The distance from the wall to the centre of the tree was 2, the
radius of the tree was 1, the distance from the wall to the cameras was 5 and the
distance between the cameras was 0.4 length units. A texture of white noise was
applied on the wall and on the tree and a stereo pair of images was generated.
Each image had the size 200×31 pixels. The generated stereo images are shown
in figure 7.15. The disparity was only calculated for one line. Also in this case,
a neighbourhood N of 3×31 pixels was used for the CCA. The algorithm was
run 100 times on different noise images. The result is illustrated in figure 7.16
on page 137. Close to the edges of the tree, the images are differently scaled.
In figure 7.17 on page 138, the average scale difference used by the algorithm is
plotted. The scaling can be done in nine steps between ± one octave and in
the figure, the average scaling (in octaves) is plotted. The plot illustrates how the
algorithm scales the images relative to each other in one way near the left edge of
the tree and in the opposite way at the other edge as expected. There is no scale
difference on the background and in the middle of the tree.
7.3.5 Real images
The two final experiments illustrate how the algorithm works on real stereo image
pairs. In both experiments, a neighbourhood N of 7×7 pixels was used.
Figure 7.12: The result for the semi-transparent images (disparity vs. horizontal and vertical position). The disparity estimates are coloured to simplify the visualization.
The first stereo pair is two aerial photographs of the Pentagon (see figure 7.18 on
page 139, upper row). The result is shown in the bottom row of the same figure. To
the left, the disparity estimates are shown. White means high disparity and black
means low disparity. The lower-right image shows a certainty image calculated
from the canonical correlation in each neighbourhood. The certainty used here is
the logarithm of the SNR according to equation 4.38 on page 71 plus an offset in
order to make it positive.
The second stereo pair is two images from a well-known image sequence, the
Sarnoff tree sequence. This stereo pair is shown in top of figure 7.19 on page 140
and the result and the certainty image are shown at the bottom of the same figure.
The results are also illustrated in colour in figures 7.20 on page 141 and 7.21
on page 142. The images at the top are generated so that the colour represents
disparity and the intensity represents the original (left) image. The images at the
bottom are 3-dimensional surface plots with height and colour representing the
disparity estimates. Note that the walls in the Pentagon result are depth discontinuities and not just steep slopes.
Figure 7.13: A projection along the horizontal axis of the estimates in figure 7.12 on the preceding page (disparity vs. vertical position).

Figure 7.14: The artificial tree scene from above.
Figure 7.15: The stereo pair (left and right) generated from the artificial tree scene.
Figure 7.16: The result for the artificial test scene. The top graph shows the average disparity estimate, with dotted lines showing the standard deviation. The middle graph shows the true disparity. The bottom graph shows the mean disparity error and the standard deviation of the disparity error (dotted line).
Figure 7.17: The average scaling performed by the algorithm (scale vs. horizontal position). 0 means no scaling, +1/4 means that the left image is scaled +1/4 of an octave, i.e. made smaller compared to the right image.
7.4 Discussion
The stereo algorithm described in this chapter is rather different from most image
processing algorithms. Common procedures are first to optimize a set of filters and then to use these filters to analyse the image, or to perform statistical analysis
directly on the pixel data. This algorithm, however, first adapts the filters to a local
region in the image and then analyses the adapted filters.
In all the experiments presented in the preceding section, a filter set of two
filters differing only by a shift was used. A larger filter set with filters in different
scales would be able to handle larger disparity ranges. If such a set of filters was
used, the algorithm would simply select the proper scale to work on, i.e. the scale
that has the highest correlation. In general, a larger filter set offers the shapes of
the adapted filters more freedom. Hence, a larger filter set should make it easier to
handle multiple disparities, depth discontinuities and scale differences if the filter
set is chosen properly. A larger filter set covering a wider range of frequencies
would also reduce the risk of the signal in a region giving very weak filter output
because the filter does not fit the signal. With a larger filter set, the CCA would
only use the filters that have a high SNR for the current signal.
Figure 7.18: Upper row: Stereo pair of Pentagon. Lower left: Resulting
disparity estimates. Lower right: Certainty image of the estimates.
Figure 7.19: Upper row: Stereo pair of the tree scene. Lower left: Resulting disparity estimates. Lower right: Certainty image of the estimates.
The filter set can be seen as the basis functions used for representing the signal. The simplest choice of basis functions is the pixels themselves. The canonical
correlation vector will then define the filters directly in the pixel basis. A disadvantage with such an approach is that the analysis of the filters becomes expensive. The canonical correlation vectors in the experiments presented here were
two-dimensional since there were two basis filters. If the pixel basis is used, the
dimensionality is equal to the size of the filters that the algorithm is to construct.
This means, for example, that if the algorithm should be able to use 1×15 filter kernels, the canonical correlation vectors become 15-dimensional. In other
words, the pixel basis is not a good choice of signal representation in this problem
(see the discussion in chapter 3). Since we know a better representation for this
Figure 7.20: Result for the Pentagon images in colour. The upper image displays the disparity estimate as the colour overlaid on the original
intensity image.
Figure 7.21: Result for the tree images in colour. The upper image displays the disparity estimate as the colour overlaid on the original intensity
image.
problem (i.e. quadrature filters), it would be unwise not to use it.
The choice of neighbourhood for the CCA is of course important for the result.
If there is a priori knowledge of the shape of the regions that have relatively constant depths, the neighbourhood should, of course, be chosen accordingly. This
means that if the disparity is known to be relatively constant along the vertical
axis, for example, the shape of the neighbourhood should be elongated vertically,
as in the experiments on artificial data in the previous section. It is, however,
possible to let the algorithm select a suitable neighbourhood shape automatically.
This may be done in two ways.
One way is to measure the canonical correlation for a few different neighbourhood shapes. These shapes could be, for example, one horizontally elongated, one
vertically elongated and one square. The algorithm should then use the result from
the neighbourhood that gave the highest canonical correlation to estimate the disparity.
Another way to automatically select neighbourhood shape is to begin with
relatively small square-shaped neighbourhoods to get a coarse disparity estimate.
Then the disparity estimates are segmented. A second run of the algorithm can
then use neighbourhood shapes selected according to the shape of the segmented
regions. It should be noted that the neighbourhoods can be arbitrarily shaped and
even non-connected. The only advantage with a rectangular neighbourhood is
that it is computationally efficient when calculating the covariance matrices for
the CCA. But if this is utilized in the first run and the covariance matrices are
stored, they can simply be added when forming the new larger neighbourhoods in
the second run. On the tree image for example, this approach would give vertically
elongated neighbourhoods on the tree and horizontally elongated neighbourhoods
on the ground.
Chapter 8
Epilogue
In this final chapter, the thesis is summed up and discussed. To conclude, some
ideas for future research are presented.
8.1 Summary and discussion
The thesis started with a discussion of learning systems. Three different principles
of learning were described. Supervised learning can be seen as function approximation. The need for a training set that has an associated set of desired output
restricts its use to tasks where such training data can be obtained. Reinforcement learning, on the other hand, is more general than supervised learning and
we believe that it is an important general learning principle in complex systems.
Its relation to learning among animals and to evolution can support this position.
Unsupervised learning is a way of finding a data dependent representation of the
signals that is useful according to some criterion. We do not believe that unsupervised learning is the highest general learning principle, since the performance
measure these methods are trying to maximize is related only to the internal data
representation and has nothing to do with the actual performance of the system in
terms of actions. Unsupervised learning can, however, be an important component which helps the system find a good signal representation. It should again be
pointed out that the difference between the three learning principles is not so clear
as it might seem at first, as discussed in section 2.6.
For unsupervised learning, we believe that methods based on maximizing information are important. If nothing else is known about the optimum choice of
representation, it is probably wise to preserve as much information (rather than
variance, for example) as possible. It is, however, not only the amount of information that is important. The information to be represented must be relevant for
the task. In other words, it must be related to information about possibly success-
ful responses; otherwise it is not useful. This makes methods based on maximum
mutual information good candidates.
The signal representation needs a model for the represented information. A
complex global model is not a realistic choice for large systems that have high-dimensional input and output signals. The number of parameters to estimate
would be far too large and the structural credit assignment problem would be
unsolvable (see section 2.7.2). We believe that local low-dimensional linear models should be used. One reason for this is that only a small fraction of a high-dimensional signal space will ever be visited by the signal. Furthermore, this
signal is (at least) piecewise continuous because of the dynamic of the real world,
which means that the signal can be represented arbitrarily well with local linear
models. How to distribute these models is only briefly mentioned in this thesis
(section 3.5). The interested reader is referred to the PhD thesis by Landelius
(1997) for a detailed investigation of this subject.
The choice of local linear models can be made according to different criteria
depending on the task. If maximum mutual information is the criterion, canonical
correlation analysis is a proper method for finding local linear models. CCA is
related to PCA, PLS and MLR which maximize other criteria, statistical or mean
square error (see chapter 4). An iterative algorithm for these four methods was
presented. The algorithm is more general and actually finds the solutions to the
generalized eigenproblem. An important feature of the proposed algorithm is that
it finds the solutions successively, beginning with the most significant one. This
enables low-rank versions of the solutions of the four methods which is necessary
if the signal dimensionality is high. Another nice feature is that the algorithm
gives the eigenvectors and the corresponding eigenvalues, and not only the normalized eigenvectors as is the case with many other iterative methods.
It was shown that CCA can be used for learning feature descriptors for computer vision. The proposed method allows the user to define what is equal in
two signals by giving the system examples. If other features are varied in an
uncorrelated way, the feature descriptors become invariant to these features. An
experiment showed that the system learned quadrature filters when it was trained
to represent orientation invariant to phase. When quadrature filter outputs are
used as input to the system, it learns to combine them in a way that is less sensitive to noise than vector averaging without losing spatial resolution. For a 5×5
neighbourhood, the angular error of the orientation estimate was reduced by 4 dB
which is quite a substantial improvement. This method will most likely replace
vector averaging in many applications where there is a conflict between the need
for noise reduction and spatial resolution.
Another application of CCA in computer vision is stereo. A novel stereo
algorithm was presented in chapter 7. The algorithm is a bit unusual since first it
adapts filters to an image neighbourhood and then it analyses the resulting filters.
A more common approach in computer vision is first to optimize filters and then
to use these filters to analyse the image. Some interesting features of the proposed
algorithm are that it can handle depth discontinuities, multiple depths in semitransparent images and image pairs that are differently scaled. Although only one
basis filter set with two shifted identical filters has been tested, the results look
very promising both on real and artificial images. We believe that the proposed
method can be useful also in motion estimation, in particular on x-ray images
where there are multiple motions in semi-transparent images.
8.2 Future research
There is a number of ideas left open for future research. One interesting question
is how to combine reinforcement learning and mutual information based unsupervised learning.
A rather ad hoc modification of the canonical correlation algorithm that can
handle very high-dimensional signals was presented. Other methods for handling
adaptive update factors should be investigated. Preliminary investigations indicate
that the RPROP algorithm (Riedmiller and Braun, 1993) can be modified to fit
our algorithm. Since the purpose of a gradient based algorithm is to handle very
high-dimensional signals, it is important that the algorithm is optimized to handle
such cases.
The theory for the gradient search method presented in chapter 4 was developed for real valued signals. In chapters 6 and 7, however, we have seen
that canonical correlation is useful also when analysing complex-valued signals.
Hence, an extension of the theory in chapter 4 to include complex-valued signals
is desirable.
One of the most interesting issues for future research based on this work is
to investigate how canonical correlation can be used in multidimensional signal
processing. The experiments in chapter 6 show that phase invariant orientation
filters can be learned by using this method. The use of this algorithm for detecting
other, higher-level, features should be investigated. Examples of such features are
line crossings, corners and even texture. Consider a pair of local neighbourhoods
with a given spatial relation as illustrated in figure 8.1. The spatial relation is defined by a displacement vector r. If data are collected from such neighbourhood
pairs in a larger region of the image, a CCA would give the linear combination
of one neighbourhood that is the most predictable and at the same time the linear combination of the other neighbourhood that is the best predictor. For each
displacement, this would give a measure of the best linear relation between the
image patches and a description of that relation. This can be performed directly
Figure 8.1: Illustration of how CCA can be used for generating a texture descriptor by analysing the linear relation between two neighbourhoods x and y with a spatial relationship defined by the displacement vector r.
on the pixel data or on a filtered image. Consider, for example, a sine wave pattern
without noise. The canonical correlation for such an image would be one for all
displacements between the neighbourhoods. This is logical, since the pattern is
totally predictable. An ordinary correlation analysis, however, would give zero
correlations where the phase of the patterns differs 90°. A matrix containing the
largest canonical correlations for different neighbourhood displacements defines
the displacement vectors for which the patterns are linearly predictable. Instead
of the matrix, a tensor that contains the canonical correlation vectors can be used.
Such a tensor would be a descriptor of the texture. The use of such descriptors in
texture analysis should be investigated.
The generalization of the canonical correlation method aiming to find maximum mutual information as illustrated in figure 6.2 on page 109 should be investigated. The non-linear functions fx and fy can be implemented as neural networks
with, for example, sigmoid or radial-basis functions. The neural networks are
trained, for example using back-propagation, in order to maximize the canonical
correlation ρ.
The method in chapter 6 is, of course, not limited to image data. Another
interesting application is speech recognition, where it is important to be invariant
with respect to how the words are pronounced.
Another very interesting issue is the extension of the stereo algorithm in chapter 7 in order to estimate both vertical and horizontal shifts, i.e. two-dimensional
translations of the image. If the neighbourhoods are taken from different frames
in a temporal image sequence, the extended algorithm could be used for motion
estimation. The capability of handling multiple estimates in semi-transparent images would make this method interesting in medical applications. The problem of
estimating multiple motions exists for example in x-ray image sequences, where
different parts of the body move in different ways. The capability of handling scaling between the images would make it possible to handle motions more complex
than pure translations, for example 3-dimensional rotations and deformations.
Appendix A
Definitions
In this appendix, some useful non-standard functions are defined. “≜” means “equal by definition”.
A.1
The vec function
Consider an m×n matrix M:

$$\mathbf{M} = [\mathbf{m}_1\;\mathbf{m}_2\;\ldots\;\mathbf{m}_n] \qquad (A.1)$$

where the columns m_i are m-dimensional vectors. Then

$$\mathbf{v} = \mathrm{vec}(\mathbf{M}) \triangleq \begin{pmatrix} \mathbf{m}_1 \\ \mathbf{m}_2 \\ \vdots \\ \mathbf{m}_n \end{pmatrix} \qquad (A.2)$$
A.2
The mtx function
Consider an mn-dimensional vector v. Then

$$\mathbf{M} = \mathrm{mtx}(\mathbf{v}; m, n) \triangleq [\mathbf{m}_1\;\mathbf{m}_2\;\ldots\;\mathbf{m}_n] \qquad (A.3)$$

where the columns m_i are m-dimensional vectors.
A.3 Correlation for complex variables
Consider x, y ∈ ℂ¹ with means x̄ and ȳ respectively. The correlation between x and y is defined as

$$\mathrm{Corr}(x, y) = \frac{E\left[(x-\bar{x})(y-\bar{y})^*\right]}{\sqrt{E\left[|x-\bar{x}|^2\right]\,E\left[|y-\bar{y}|^2\right]}} \qquad (A.4)$$
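As a small illustration of these definitions (the function names follow the appendix; everything else, including the sample-based estimation, is an assumption), they can be written as:

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one long vector (definition A.2)."""
    return np.asarray(M).reshape(-1, order='F')

def mtx(v, m, n):
    """Inverse of vec: reshape an mn-vector into an m x n matrix (definition A.3)."""
    return np.asarray(v).reshape((m, n), order='F')

def complex_corr(x, y):
    """Correlation of two complex variables in the spirit of definition A.4,
    estimated from samples."""
    x = np.asarray(x, dtype=complex) - np.mean(x)
    y = np.asarray(y, dtype=complex) - np.mean(y)
    return np.mean(x * np.conj(y)) / np.sqrt(np.mean(np.abs(x) ** 2) *
                                             np.mean(np.abs(y) ** 2))
```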
Appendix B
Proofs
This appendix contains all the proofs referred to in the text.
B.1 Proofs for chapter 2
B.1.1
The differential entropy of a multidimensional Gaussian variable
$$h(\mathbf{z}) = \frac{1}{2}\log\left((2\pi e)^N|\mathbf{C}|\right), \qquad (2.41)$$

where |C| is the determinant of the covariance matrix of z and N is the dimensionality of z.

Proof: The Gaussian distribution for an N-dimensional variable z is

$$p(\mathbf{z}) = \frac{1}{\sqrt{(2\pi)^N|\mathbf{C}|}}\,e^{-\frac{1}{2}\mathbf{z}^T\mathbf{C}^{-1}\mathbf{z}}. \qquad (B.1)$$

The definition of differential entropy (equation 2.38 on page 29) then gives

$$h(\mathbf{z}) = -\int_{\mathbb{R}^N} p(\mathbf{z})\log p(\mathbf{z})\,d\mathbf{z}
= \int_{\mathbb{R}^N} p(\mathbf{z})\left(\frac{1}{2}\log\left((2\pi)^N|\mathbf{C}|\right) + \frac{1}{2}\mathbf{z}^T\mathbf{C}^{-1}\mathbf{z}\right)d\mathbf{z}
= \frac{1}{2}\log\left((2\pi)^N|\mathbf{C}|\right) + \frac{N}{2}
= \frac{1}{2}\log\left((2\pi e)^N|\mathbf{C}|\right). \qquad (B.2)$$

Here, we have used the fact that

$$\int_{\mathbb{R}^N} p(\mathbf{z})\,\mathbf{z}^T\mathbf{C}^{-1}\mathbf{z}\,d\mathbf{z} = E[\mathbf{z}^T\mathbf{C}^{-1}\mathbf{z}] = E[\mathrm{tr}(\mathbf{z}\mathbf{z}^T\mathbf{C}^{-1})] = N. \qquad (B.3)$$
B.2 Proofs for chapter 3
B.2.1
The constant norm of the channel set
$$\sum_{\forall k} |c_k|^2 = \text{constant}, \quad\text{where}\quad
c_k = \begin{cases} \cos^2\left(\frac{\pi}{3}(x-k)\right) & \text{if } |x-k| < \frac{3}{2} \\ 0 & \text{otherwise} \end{cases} \qquad (3.2)$$

(page 41).

Proof: Consider the interval −π/6 < x ≤ π/6. On this interval, all channels are zero except for k = −1, 0, 1. Hence, it is sufficient to sum over these three channels:

$$\sum |c_k|^2 = \cos^4\left(\tfrac{\pi}{3}(x-1)\right) + \cos^4\left(\tfrac{\pi}{3}x\right) + \cos^4\left(\tfrac{\pi}{3}(x+1)\right)
= \frac{1}{4}\sum_{k=-1}^{1}\left(1 + \cos\left(\tfrac{2\pi}{3}(x-k)\right)\right)^2.$$

The three angles $\tfrac{2\pi}{3}(x-k)$, $k = -1, 0, 1$, differ by $\tfrac{2\pi}{3}$. Expanding the shifted terms with $\cos\tfrac{2\pi}{3} = -\tfrac{1}{2}$ and $\sin\tfrac{2\pi}{3} = \tfrac{\sqrt{3}}{2}$ shows that the cosines sum to zero and the squared cosines sum to $\tfrac{3}{2}$. Hence

$$\sum |c_k|^2 = \frac{1}{4}\left(3 + 2\cdot 0 + \frac{3}{2}\right) = \frac{9}{8}.$$

This case can be generalized for any x that is covered by three channels of this shape that are separated by π/3.
B.2.2
The constant norm of the channel derivatives
$$\sum_{\forall k}\left|\frac{d}{dx}c_k\right|^2 = \text{constant} \quad\text{(page 41)}.$$

Proof: The derivative of a channel k with respect to x is

$$\frac{d}{dx}c_k = -\frac{2\pi}{3}\cos\left(\frac{\pi}{3}(x-k)\right)\sin\left(\frac{\pi}{3}(x-k)\right).$$

Summing the squared derivatives over the three channels that are non-zero at x and using $2\sin\theta\cos\theta = \sin 2\theta$ gives

$$\sum\left|\frac{d}{dx}c_k\right|^2
= \left(\frac{\pi}{3}\right)^2\left[\sin^2\left(\tfrac{2\pi}{3}x\right) + \sin^2\left(\tfrac{2\pi}{3}(x-1)\right) + \sin^2\left(\tfrac{2\pi}{3}(x+1)\right)\right].$$

The three sine arguments differ by 2π/3, so, as in the previous proof, expanding the shifted terms shows that the x-dependent parts cancel and the sum is independent of x.

B.2.3
Derivation of the update rule for the prediction matrix memory
$$r = p + a\,\|\mathbf{q}\|^2\|\mathbf{v}\|^2 \qquad (3.18)$$

Proof: By inserting equation 3.17 on page 49 into equation 3.14 on page 49, we get

$$r = \langle \mathbf{W} + a\,\mathbf{q}\mathbf{v}^T \mid \mathbf{q}\mathbf{v}^T \rangle
= \langle \mathbf{W} \mid \mathbf{q}\mathbf{v}^T \rangle + a\,\langle \mathbf{q}\mathbf{v}^T \mid \mathbf{q}\mathbf{v}^T \rangle
= p + a\,(\mathbf{q}^T\mathbf{q}\;\mathbf{v}^T\mathbf{v})
= p + a\,\|\mathbf{q}\|^2\|\mathbf{v}\|^2.$$

B.2.4
One frequency spans a 2-D plane
One frequency component defines an ellipse and, hence, spans a two-dimensional plane (page 51).

Proof: Consider a signal with frequency ω in an n-dimensional space:

$$\begin{pmatrix} a_1\sin(\omega t + \alpha_1) \\ a_2\sin(\omega t + \alpha_2) \\ \vdots \\ a_n\sin(\omega t + \alpha_n) \end{pmatrix}
= \begin{pmatrix} a_1\cos\alpha_1 \\ a_2\cos\alpha_2 \\ \vdots \\ a_n\cos\alpha_n \end{pmatrix}\sin(\omega t)
+ \begin{pmatrix} a_1\sin\alpha_1 \\ a_2\sin\alpha_2 \\ \vdots \\ a_n\sin\alpha_n \end{pmatrix}\cos(\omega t)
= \mathbf{v}_1\sin(\omega t) + \mathbf{v}_2\cos(\omega t). \qquad (B.4)$$

Remark: It should be noted that the two-dimensionality is caused by the different phases α_i. If all components have the same phase, the signal spans only one dimension.
B.3 Proofs for chapter 4
B.3.1
Orthogonality in the metrics A and B
$$\hat{\mathbf{w}}_i^T\mathbf{B}\hat{\mathbf{w}}_j = \begin{cases} 0 & \text{for } i \neq j \\ \beta_i > 0 & \text{for } i = j \end{cases}
\quad\text{and}\quad
\hat{\mathbf{w}}_i^T\mathbf{A}\hat{\mathbf{w}}_j = \begin{cases} 0 & \text{for } i \neq j \\ r_i\beta_i & \text{for } i = j \end{cases} \qquad (4.6)$$

Proof: For solution i we have

$$\mathbf{A}\hat{\mathbf{w}}_i = r_i\mathbf{B}\hat{\mathbf{w}}_i. \qquad (B.5)$$

The scalar product with another eigenvector gives

$$\hat{\mathbf{w}}_j^T\mathbf{A}\hat{\mathbf{w}}_i = r_i\,\hat{\mathbf{w}}_j^T\mathbf{B}\hat{\mathbf{w}}_i \qquad (B.6)$$

and of course also

$$\hat{\mathbf{w}}_i^T\mathbf{A}\hat{\mathbf{w}}_j = r_j\,\hat{\mathbf{w}}_i^T\mathbf{B}\hat{\mathbf{w}}_j. \qquad (B.7)$$

Since A and B are Hermitian we can change positions of ŵ_i and ŵ_j, which gives

$$r_j\,\hat{\mathbf{w}}_i^T\mathbf{B}\hat{\mathbf{w}}_j = r_i\,\hat{\mathbf{w}}_i^T\mathbf{B}\hat{\mathbf{w}}_j \qquad (B.8)$$

and hence

$$(r_i - r_j)\,\hat{\mathbf{w}}_i^T\mathbf{B}\hat{\mathbf{w}}_j = 0. \qquad (B.9)$$

For this expression to be true when i ≠ j, we have that ŵ_i^T B ŵ_j = 0 if r_i ≠ r_j. For i = j we now have that ŵ_i^T B ŵ_i = β_i > 0 since B is positive definite. In the same way we have

$$\left(\frac{1}{r_i} - \frac{1}{r_j}\right)\hat{\mathbf{w}}_i^T\mathbf{A}\hat{\mathbf{w}}_j = 0, \qquad (B.10)$$

which means that ŵ_i^T A ŵ_j = 0 for i ≠ j. For i = j we know that ŵ_i^T A ŵ_i = r_i ŵ_i^T B ŵ_i = r_i β_i.

B.3.2
Linear independence
{w_i} are linearly independent.

Proof: Suppose {ŵ_i} are not linearly independent. This would mean that we could write an eigenvector ŵ_k as

$$\hat{\mathbf{w}}_k = \sum_{j\neq k}\gamma_j\hat{\mathbf{w}}_j. \qquad (B.11)$$

This means that for j ≠ k,

$$\hat{\mathbf{w}}_j^T\mathbf{B}\hat{\mathbf{w}}_k = \gamma_j\,\hat{\mathbf{w}}_j^T\mathbf{B}\hat{\mathbf{w}}_j \neq 0, \qquad (B.12)$$

which violates equation 4.6 on page 63. Hence, {ŵ_i} are linearly independent.
B.3.3
The range of r
$$r_n \leq r \leq r_1 \qquad (4.7)$$

Proof: If we express a vector w in the basis of the eigenvectors ŵ_i, i.e.

$$\mathbf{w} = \sum_i\gamma_i\hat{\mathbf{w}}_i, \qquad (B.13)$$

we can write

$$r = \frac{\sum_i\gamma_i\hat{\mathbf{w}}_i^T\,\mathbf{A}\,\sum_i\gamma_i\hat{\mathbf{w}}_i}{\sum_i\gamma_i\hat{\mathbf{w}}_i^T\,\mathbf{B}\,\sum_i\gamma_i\hat{\mathbf{w}}_i}
= \frac{\sum_i\gamma_i^2\alpha_i}{\sum_i\gamma_i^2\beta_i}, \qquad (B.14)$$

where α_i = ŵ_i^T A ŵ_i and β_i = ŵ_i^T B ŵ_i, since ŵ_i^T A ŵ_j = ŵ_i^T B ŵ_j = 0 for i ≠ j. Now, since α_i = β_i r_i (see equation 4.6 on page 63), we get

$$r = \frac{\sum_i\gamma_i^2\beta_i r_i}{\sum_i\gamma_i^2\beta_i}. \qquad (B.15)$$

Obviously this function has the maximum value r_1 when γ_1 ≠ 0 and γ_i = 0 ∀ i > 1 if r_1 is the largest eigenvalue. The minimum value, r_n, is obtained when γ_n ≠ 0 and γ_i = 0 ∀ i < n if r_n is the smallest eigenvalue.
B.3.4
The second derivative of r
Hi =
Proof:
tive as
∂2 r
∂w2
=
∂2 r
∂w2
w
=
=ŵi
2
(A
ŵTi Bŵi
, riB)
(4.8)
From the gradient in equation 4.3 on page 61 we get the second deriva-
2
(wT Bw)2
∂r T
A,
w B , rB wT Bw , (Aw , rBw)2wT B
∂w
:
(B.16)
If we insert one of the solutions ŵi , we have
∂r
∂w
w
=ŵi
and hence
∂2 r
∂w2
B.3.5
=
w
2
(Aŵi , rBŵi ) = 0
ŵTi Bŵi
(B.17)
2
(A , ri B) :
ŵTi Bŵi
(B.18)
=ŵi
=
Positive eigenvalues of the Hessian
There exists a w such that
w T Hi w > 0
8i
>
1
(4.9)
160
Proof:
get
Proofs
If we express a vector w as a linear combination of the eigenvectors we
βi T
w Hi w = wT (A , ri B)w
2
T
,1
= w B(B A , ri I)w
∑ γ j ŵTj B(,B,1A , riI) ∑ γ j ŵ j T
= ∑ γ j ŵ j B ∑ r j γ j ŵ j , ∑ ri γ j ŵ j
T
= ∑ γ j ŵ j B ∑(r j , ri )γ j ŵ j
2
= ∑ γ j β j (r j , ri )
=
(B.19)
;
where βi = ŵTi Bŵi > 0. Now, (r j , ri ) > 0 for j < i so if i > 1 there is at least one
choice of w that makes this sum positive.
B.3.6
The partial derivatives of the covariance
( ∂ρ
∂wx
∂ρ
∂wy
Proof:
= kw1 k (Cxy ŵy
x
1
= kw k (Cyx ŵx
y
(4.17)
:
The partial derivative of ρ with respect to wx is
∂ρ
∂wx
=
=
=
Cxy wy kwx kkwy k, wTx Cxy wy kwx k,1 wx kwy k
kwx k2kwy k2
Cxy ŵy
ρwx
kwx k , kwx k2
1
kwx k (Cxy ŵy , ρŵx)
The same calculations can be made for
B.3.7
, ρŵx)
, ρŵy)
∂ρ
∂wy
by exchanging x and y .
The partial derivatives of the correlation
8 ∂ρ
>
< ∂w
>
: ∂w∂ρ
x
a
= kw k
x
y
a
= kw k
y
ŵ C
Cxy ŵy , ŵ C
ŵ C
T
x
T
x
xy ŵy
Cxx ŵx
xx ŵx
Cyx ŵx , ŵyT Cyxyy ŵxy Cyy ŵy
T
y
ŵ
(4.25)
B.3 Proofs for chapter 4
Proof:
∂ρ
∂wx
161
The partial derivative of ρ with respect to wx is
=
T
1=2 C w
(wT
xy y
x Cxx wx wy Cyy wy )
wTx Cxx wx wTy Cyy wy
wTx Cxy wy (wTx Cxx wx wTy Cyy wy ),1=2 Cxx wx wTy Cyy wy
,
wTx Cxx wx wTy Cyy wy
,1=2
T
T
= (wx Cxx wx wy Cyy wy )
wT Cxy wy
Cxy wy , Tx
Cxx wx
wx Cxx wx
T
,1 T
,1=2 Cxy ŵy , ŵx Cxy ŵy Cxx ŵx
T
= kwx k (ŵx Cxx ŵx ŵy Cyy ŵy )
T
=
|
{z
}
0
ŵx Cxx ŵx
ŵT Cxy ŵy
a
Cxy ŵy , xT
Cxx ŵx
kwx k
ŵx Cxx ŵx
The same calculations can be made for
B.3.8
∂ρ
∂wy
a 0:
;
by exchanging x and y .
Invariance with respect to linear transformations
Canonical correlations are invariant with respect to linear transformations.
Proof:
Let
x = Ax x0
and
y = Ay y0 ;
(B.20)
where Ax and Ay are non-singular matrices. If we denote
C0 xx = E [x0 x0
T
];
(B.21)
the covariance matrix for x can be written as
Cxx = E [xxT ] = E [Ax x0 x0 ATx ] = Ax C0 xx ATx :
T
(B.22)
In the same way we have
Cxy = Ax C0 xy ATy
and
Cyy = Ay C0 yy ATy :
(B.23)
Now, the equation system 4.26 on page 68 can be written as
(
ATx C0xy Ay ŵy
ATy C0yx Ax ŵx
0
= ρλx AT
x Cxx Ax ŵx
0
= ρλy AT
y Cyy Ay ŵy
(B.24)
162
Proofs
(
or
C0xy ŵ0y
C0yx ŵ0x
= ρλx C0xx ŵ0x
(B.25)
= ρλy C0yy ŵ0y ;
where ŵ0x = ATx ŵx and ŵ0y = ATy ŵy . Obviously this transformation leaves the roots
ρ unchanged. If we look at the canonical variates,
(0
x
y0
T
,1
= w0 x x0 = wT
x Ax Ax x = x
T
,1
= w0 y y0 = wT
y Ay Ay y = y;
(B.26)
we see that these too are unaffected by the linear transformation.
B.3.9
Relationship between mutual information and canonical correlation
I (x; y) =
1
1
log
2
∏i (1 , ρ2i )
(4.32)
;
where x and y are N-dimensional Gaussian variables and ρi are the canonical
correlations.
Proof:
The differential entropy of a multidimensional Gaussian variable is
h(z) =
,
1
log (2πe)N jCj
2
(B.27)
;
where jCj is the determinant of the covariance matrix of z and N is the dimensionality of z (see proof B.1.1 on page 153). If z = xy , the covariance matrix C can
be written as
,
C=
C
xx
Cyx
Cxy
Cyy
(B.28)
:
By using the relation
jCj = jCxx j jCyy , CyxCxx,1Cxy j
(B.29)
(Kailath, 1980, page 650) and equation 2.42 on page 30, we get
jCxx j jCyy j
1
I (x; y) = log
2
jCj
jCyy , CyxC,xx1Cxy j
1
= , log
2
jCyyj
1
,1
,1
= , log jI , Cyy Cyx Cxx Cxy j
2
,
(B.30)
B.3 Proofs for chapter 4
163
assuming the covariance matrices Cxx and Cyy being non-singular. The eigenval1
,1
ues to C,
yy Cyx Cxx Cxy are the squared canonical correlations (see equation 4.28
on page 68). Hence, an eigenvalue decomposition gives
2ρ2
6 1 ρ2
1
6 2
I (x; y) = , log I , 6
4
2
0
1 1
=
2
3
77 1
75 = , 2 log ∏(1 , ρ2i )
i
ρn
0
..
.
(B.31)
∏i (1 , ρ2i )
log
since the eigenvalue decomposition does not change the identity matrix.
B.3.10
The partial derivatives of the MLR-quotient
8 ∂ρ
< ∂w
: ∂w∂ρ
Proof:
∂ρ
∂wx
, βCxx ŵx)
Cyx ŵx , ρβ ŵy
x
a
= kw k (Cxy ŵy
x
y
a
= kw k
x
(4.44)
2
:
The partial derivative of ρ with respect to wx is
=
T
1=2 C w
(wT
xy y
x Cxx wx wy wy )
wTx Cxx wx wTy wy
,
wTx Cxy wy (wTx Cxx wx wTy wy ),1=2 Cxx wx wTy wy
wTx Cxx wx wTy wy
T
T
,1=2
= (wx Cxx wx wy wy )
=
=
wT Cxy wy
Cxy wy , xT
Cxx wx
wx Cxx wx
kwx k,1(ŵTx Cxx ŵxŵTy ŵy ),1=2
|
{z
0
}
a
kwx k (Cxyŵy , βCxx ŵx) ;
Cxy ŵy ,
a 0:
ŵTx Cxy ŵy
Cxx ŵx
ŵTx Cxx ŵx
164
Proofs
The partial derivative of ρ with respect to wy is
∂ρ
∂wy
=
T
1=2 C w
(wT
yx x
x Cxx wx wy wy )
wTx Cxx wx wTy wy
T
T
,1=2 wT Cxx wx wy
T
, wx Cxy wy(wx CwxxT Cwxwwy wwyT)w
xx
x
kwy k,1(|ŵTx C{zxx ŵ}x ),1=2
=
B.3.11
0
a
ρ2
Cyx ŵx , ŵy
kwx k
β
y
y
wT Cxy wy wTx Cxx wx
Cyx wx , x T
wy
wx Cxx wx wTy wy
T
T
,1=2
= (wx Cxx wx wy wy )
=
x
x
,C
yx ŵx
;
, ŵTx Cxy ŵy ŵy
!
a 0:
The successive eigenvalues
H = G , λ1ê1 fT1
(4.59)
Proof: Consider a vector \(u\) which is expressed as the sum of one vector parallel to the eigenvector \(\hat{e}_1\) and another vector \(u_o\) that is a linear combination of the other eigenvectors and, hence, orthogonal to the dual vector \(f_1\):
\[
u = a \hat{e}_1 + u_o,
\tag{B.32}
\]
where
\[
f_1^T \hat{e}_1 = 1 \quad\text{and}\quad f_1^T u_o = 0.
\tag{B.33}
\]
Multiplying \(H\) with \(u\) gives
\[
H u = \left( G - \lambda_1 \hat{e}_1 f_1^T \right)(a \hat{e}_1 + u_o)
    = a\,(G \hat{e}_1 - \lambda_1 \hat{e}_1) + (G u_o - 0)
    = G u_o,
\]
since \(G \hat{e}_1 = \lambda_1 \hat{e}_1\). This shows that \(G\) and \(H\) have the same eigenvectors and eigenvalues except for the largest eigenvalue and eigenvector of \(G\). Obviously the eigenvector corresponding to the largest eigenvalue of \(H\) is \(\hat{e}_2\).
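The deflation step (4.59) is easy to illustrate numerically for a symmetric \(G\), where the dual vector equals the eigenvector itself. The following is a minimal sketch, assuming NumPy and a random symmetric matrix used only for illustration; it checks that \(H\) keeps all eigenpairs of \(G\) except the largest one.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 5
    M = rng.standard_normal((n, n))
    G = M @ M.T                          # random symmetric matrix; dual vectors = eigenvectors

    lam, E = np.linalg.eigh(G)           # eigenvalues in ascending order
    lam1, e1 = lam[-1], E[:, -1]         # largest eigenvalue and its unit eigenvector
    f1 = e1                              # for symmetric G, f1' e1 = 1 holds with f1 = e1

    H = G - lam1 * np.outer(e1, f1)      # the deflated matrix of equation (4.59)

    lamH, EH = np.linalg.eigh(H)
    # H keeps the eigenvalues of G except lam1, which is replaced by zero ...
    print(np.allclose(np.sort(lamH), np.sort(np.append(lam[:-1], 0.0))))
    # ... and the eigenvector of H's largest eigenvalue is e2 (up to sign).
    print(np.isclose(abs(EH[:, -1] @ E[:, -2]), 1.0))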
B.4 Proofs for chapter 7
B.4.1  Real-valued canonical correlations
The canonical correlations ρi are real valued.
Proof:
The squared canonical correlations are eigenvalues of the matrix \(C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx}\):
\[
C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx}\, w_x = \rho_i^2\, w_x.
\tag{B.34}
\]
\(A = C_{xx}^{-1}\) is Hermitian and positive definite and \(B = C_{xy} C_{yy}^{-1} C_{yx}\) is Hermitian. Then \(AB = C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx}\) has real-valued eigenvalues (see proof B.4.2). Since \(B\) is also positive semidefinite, these eigenvalues are non-negative and, hence, the canonical correlations are real valued.

It should be noted that if \(A = C_{xx}^{-1}\) is only positive semidefinite, \(A\) and \(B\) can be projected into a subspace spanned by the eigenvectors of \(A\) corresponding to the non-zero eigenvalues. This will give two new matrices \(A'\) and \(B'\) with the same non-zero eigenvalues as \(A\) and \(B\) but with \(A'\) positive definite. In this way it can be shown that all non-zero correlations are real valued.
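A quick numerical illustration: even though \(C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx}\) is not symmetric in general, its eigenvalues come out real and lie in \([0, 1)\). The following is a minimal sketch, assuming NumPy and a randomly generated joint covariance used only for illustration.

    import numpy as np

    rng = np.random.default_rng(5)
    d = 4
    M = rng.standard_normal((2 * d, 2 * d))
    C = M @ M.T + 2 * d * np.eye(2 * d)          # a random valid joint covariance (illustration)
    Cxx, Cxy = C[:d, :d], C[:d, d:]
    Cyx, Cyy = C[d:, :d], C[d:, d:]

    # T = Cxx^{-1} Cxy Cyy^{-1} Cyx; its eigenvalues are the squared canonical correlations.
    T = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cyx)
    ev = np.linalg.eigvals(T)

    print(np.allclose(ev.imag, 0.0))                       # the eigenvalues are real
    print(np.all((ev.real > -1e-12) & (ev.real < 1.0)))    # and lie in [0, 1)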
B.4.2  Hermitian matrices

If A is Hermitian and positive definite and B is Hermitian, then AB has real-valued eigenvalues.
Proof:
By writing the singular value decomposition \(A = U D U^*\) we see that also
\[
C = U D^{1/2} U^* = A^{1/2}
\tag{B.35}
\]
is Hermitian and positive definite. Then
\[
C B C = C^* B^* C^* = (C B C)^*
\tag{B.36}
\]
is Hermitian and therefore has real-valued eigenvalues. But \(C B C\) and \(A B\) have the same eigenvalues, since
\[
A B = C^2 B = C (C B C) C^{-1}
\tag{B.37}
\]
is only a change of basis, which does not change the eigenvalues.
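The argument is easy to illustrate numerically: \(AB\) is generally not Hermitian, but its spectrum coincides with that of the Hermitian matrix \(CBC = A^{1/2} B A^{1/2}\) and is therefore real. The following is a minimal sketch, assuming NumPy and random complex matrices used only for illustration.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 5
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    A = M @ M.conj().T + n * np.eye(n)           # Hermitian and positive definite
    K = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    B = (K + K.conj().T) / 2                     # Hermitian (not positive definite in general)

    d, U = np.linalg.eigh(A)                     # A = U D U*
    C = U @ np.diag(np.sqrt(d)) @ U.conj().T     # C = A^{1/2}, Hermitian and positive definite

    AB = A @ B                                   # generally not Hermitian ...
    CBC = C @ B @ C                              # ... but similar to this Hermitian matrix

    ev_AB = np.linalg.eigvals(AB)
    print(np.allclose(ev_AB.imag, 0.0, atol=1e-9))                     # real eigenvalues
    print(np.allclose(np.sort(ev_AB.real), np.linalg.eigvalsh(CBC)))   # same spectrum as CBC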
Bibliography
Anderson, J. A. (1972). A simple neural network generating an interactive memory. Mathematical Biosciences, 14:197–220.
Anderson, J. A. (1983). Cognitive and psychological computation with neural
models. IEEE Transactions on Systems, Man, and Cybernetics, 14:799–815.
Anderson, T. W. (1984). An Introduction to Multivariate Statistical Analysis. John
Wiley & Sons, second edition.
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function
approximation. In Machine Learning: Proceedings of the Twelfth International
Conference, San Francisco, CA. Armand Prieditis and Stuart Russell, eds.
Baker, W. L. and Farell, J. A. (1992). Handbook of intelligent control, chapter
An introduction to connectionist learning control systems, pages 35–63. Van
Nostrand Reinhold, New York.
Ballard, D. H. (1987). Vision, Brain, and Cooperative Computation, chapter Cortical Connections and Parallel Processing: Structure and Function. MIT Press.
M. A. Arbib and A. R. Hanson, Eds.
Ballard, D. H. (1990). Computational Neuroscience, chapter Modular Learning
in Hierarchical Neural Networks. MIT Press. E. L. Schwartz, Ed.
Barlow, H. (1989). Unsupervised learning. Neural Computation, 1:295–311.
Barlow, H. B., Kaushal, T. P., and Mitchson, G. J. (1989). Finding minimum
entropy codes. Neural Computation, 1:412–423.
Barnard, S. T. and Fischler, M. A. (1982). Computational Stereo. ACM Comput.
Surv., 14:553–572.
Barto, A. G. (1992). Handbook of Intelligent Control, chapter Reinforcement
Learning and Adaptive Critic Methods. Van Nostrand Reinhold, New York. D.
A. White and D. A. Sofge, Eds.
Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive
elements that can solve difficult learning control problems. IEEE Trans. on
Systems, Man, and Cybernetics, SMC-13(8):834–846.
Battiti, R. (1992). First and second-order methods for learning: Between steepest
descent and Newton's method. Neural Computation, 4:141–166.
Becker, S. (1996). Mutual information maximization: models of cortical self-organization. Network: Computation in Neural Systems, 7:7–31.
Becker, S. and Hinton, G. E. (1992). Self-organizing neural network that discovers
surfaces in random-dot stereograms. Nature, 355(9):161–163.
Becker, S. and Hinton, G. E. (1993). Learning mixture models of spatial coherence. Neural Computation, 5(2):267–277.
Bell, A. J. and Sejnowski, T. J. (1995). An information-maximization approach
to blind separation and blind deconvolution. Neural Computation, 7:1129–59.
Bellman, R. E. (1957). Dynamic Programming. Princeton University Press,
Princeton, NJ.
Bloom, F. E. and Lazerson, A. (1985). Brain, Mind, and Behavior. W. H. Freeman
and Company.
Bock, R. D. (1975). Multivariate Statistical Methods in Behavioral Research.
McGraw-Hill series in psychology. McGraw-Hill.
Borga, M. (1993). Hierarchical Reinforcement Learning. In Gielen, S. and Kappen, B., editors, ICANN’93, Amsterdam. Springer-Verlag.
Borga, M. (1995). Reinforcement Learning Using Local Adaptive Models. Thesis
No. 507, ISBN 91–7871–590–3.
Borga, M. and Knutsson, H. (1998). An adaptive stereo algorithm based on canonical correlation analysis. Submitted to ICIPS’98.
Borga, M., Knutsson, H., and Landelius, T. (1997a). Learning Canonical Correlations. In Proceedings of the 10th Scandinavian Conference on Image Analysis,
Lappeenranta, Finland. SCIA.
Borga, M., Landelius, T., and Knutsson, H. (1997b). A unified approach to PCA,
PLS, MLR and CCA. Information Sciences. Submitted. Revised for second
review.
Bower, G. H. and Hilgard, E. R. (1981). Theories of Learning. Prentice–Hall,
Englewood Cliffs, N.J. 07632, 5 edition.
Bracewell, R. (1986). The Fourier Transform and its Applications. McGraw-Hill,
2nd edition.
Bradtke, S. J. (1993). Reinforcement learning applied to linear quadratic regulation. In Advances in Neural Information Processing Systems 5, San Mateo,
CA. Morgan Kaufmann.
Bregler, C. and Omohundro, S. M. (1994). Surface learning with applications to
lipreading. In Advances in Neural Information Processing Systems 6, pages
43–50, San Francisco. Morgan Kaufmann.
Brooks, V. B. (1986). The Neural Basis of Motor Control. Oxford University
Press.
Broomhead, D. S. and Lowe, D. (1988). Multivariable functional interpolation
and adaptive networks. Complex Systems, 2:321–355.
Carson, J. and Fry, T. (1937). Variable frequency electric circuit theory with
application to the theory of frequency modulation. Bell System Tech. J., 16:513–
540.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3):287–314.
Coren, S. and Ward, L. M. (1989). Sensation & Perception. Harcourt Brace
Jovanovich, Publishers, San Diego, USA, 3rd edition. ISBN 0–15–579647–X.
Das, S. and Sen, P. K. (1994). Restricted canonical correlations. Linear Algebra
and its Applications, 210:29–47.
Davis, L., editor (1987). Genetic Algorithms and Simulated Annealing. Pitman,
London.
Denoeux, T. and Lengellé, R. (1993). Initializing back propagation networks with
prototypes. Neural Networks, 6(3):351–363.
Derin, H. and Kelly, P. A. (1989). Discrete-index markov-type random processes.
In Proceedings of IEEE, volume 77.
Duda, R. O. and Hart, P. E. (1973). Pattern classification and scene analysis.
Wiley-Interscience, New York.
Fieguth, P. W., Irving, W. W., and Willsky, A. S. (1995). Multiresolution model
development for overlapping trees via canonical correlation analysis. In International Conference on Image Processing, pages 45–48, Washington DC.
IEEE.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation. in
press.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems.
Ann. Eugenics, 7(Part II):179–180. Also in Contributions to Mathematical
Statistics (John Wiley, New York, 1950).
Földiák, F. (1990). Forming sparse representations by local anti-Hebbian learning.
Biological Cybernetics.
Fletcher, R. and Reeves, C. M. (1964). Function minimization by conjugate gradients. Computer Journal, 7:149–154.
Geladi, P. and Kowalski, B. R. (1986). Partial least-squares regression: a tutorial.
Analytica Chimica Acta, 185:1–17.
Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the
bias/variance dilemma. Neural Computation, 4:1–58.
Giles, G. L. and Maxwell, T. (1987). Learning, invariance, and generalization in
high-order neural networks. Applied Optics, 26(23):4972–4978.
Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.
Golub, G. H. and Loan, C. F. V. (1989). Matrix Computations. The Johns Hopkins
University Press, second edition.
Granlund, G. H. (1978). In search of a general picture processing operator. Computer Graphics and Image Processing, 8(2):155–178.
Granlund, G. H. (1988). Integrated analysis-response structures for robotics systems. Report LiTH–ISY–I–0932, Computer Vision Laboratory, Linköping University, Sweden.
Granlund, G. H. (1989). Magnitude representation of features in image analysis.
In The 6th Scandinavian Conference on Image Analysis, pages 212–219, Oulu,
Finland.
Granlund, G. H. (1997). From multidimensional signals to the generation of responses. In Sommer, G. and Koenderink, J. J., editors, Algebraic Frames for
the Perception-Action Cycle, volume 1315 of Lecture Notes in Computer Science, pages 29–53, Kiel, Germany. Springer-Verlag. International Workshop,
AFPAC’97, invited paper.
Granlund, G. H. and Knutsson, H. (1982). Hierarchical processing of structural
information in artificial intelligence. In Proceedings of 1982 IEEE Conference
on Acoustics, Speech and Signal Processing, Paris. IEEE. Invited Paper.
Granlund, G. H. and Knutsson, H. (1983). Contrast of structured and homogenous
representations. In Braddick, O. J. and Sleigh, A. C., editors, Physical and
Biological Processing of Images, pages 282–303. Springer Verlag, Berlin.
Granlund, G. H. and Knutsson, H. (1990). Compact associative representation of
visual information. In Proceedings of The 10th International Conference on
Pattern Recognition. Report LiTH–ISY–I–1091, Linköping University, Sweden, 1990.
Granlund, G. H. and Knutsson, H. (1995). Signal Processing for Computer Vision.
Kluwer Academic Publishers. ISBN 0-7923-9530-1.
Gray, R. M. (1984). Vector quantization. IEEE ASSP Magazine, 1:4–29.
Gray, R. M. (1990). Entropy and Information Theory. Springer-Verlag, New York.
Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning
real-valued functions. Neural Networks, 3:671–692.
Haykin, S. (1994). Neural Networks: A Comprehensive Foundation. Macmillan
College Publishing Company.
Hebb, D. O. (1949). The Organization of Behavior. Wiley, New York.
Heger, M. (1994). Consideration of risk in reinforcement learning. In Cohen, W. W.
and Hirsh, H., editors, Proceedings of the 11th International Conference on
Machine Learning, pages 105–111, Brunswick, NJ.
Held, R. and Bossom, J. (1961). Neonatal deprivation and adult rearrangement.
Complementary techniques for analyzing plastic sensory–motor coordinations.
Journal of Comparative and Physiological Psychology, pages 33–37.
Hertz, J., Krogh, A., and Palmer, R. G. (1991). Introduction to the Theory of
Neural Computation. Addison-Wesley.
Hinton, G. E. and Nowlan, S. J. (1987). How learning can guide evolution. Complex Systems, pages 495–502.
Hinton, G. E. and Sejnowski, T. J. (1983). Optimal perceptual inference. In
Proceedings of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, pages 448–453, Washington DC.
Hinton, G. E. and Sejnowski, T. J. (1986). Learning and relearning in Boltzmann
machines. In Rummelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Explorations in Microstructures of Cognition. MIT Press,
Cambridge, MA.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of
Michigan Press, Ann Arbor.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational capabilities. Proceedings of the National Academy of
Sciences, 79:2554–2558.
Hornby, A. S. (1989). Oxford Advanced Learner’s Dictionary of Current English.
Oxford University Press, Oxford, fourth edition. A. P. Cowie (ed.).
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal
components. Journal of Educational Psychology, 24:417–441, 498–520.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28:321–
377.
Höskuldsson, A. (1988). PLS regression methods. Journal of Chemometrics,
2:211–228.
Hubel, D. H. (1988). Eye, Brain and Vision, volume 22 of Scientific American
Library. W. H. Freeman and Company. ISBN 0–7167–5020–1.
Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields of single neurones in the
cat’s striate cortex. J. Physiol., 148:574–591.
Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction and
functional architecture in the cat’s striate cortex. J. Physiol., 160:106–154.
Izenman, A. J. (1975). Reduced-rank regression for the multivariate linear model.
Journal of Multivariate Analysis, 5:248–264.
Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–
1201.
Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaption. Neural Networks, 1:295–307.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive
mixtures of local experts. Neural Computation, 3:79–87.
Jepson, A. D. and Fleet, D. J. (1990). Scale-space singularities. In Faugeras, O.,
editor, Computer Vision-ECCV90, pages 50–55. Springer-Verlag.
Johansson, B. (1997). Multidimensional signal recognition, invariant to affine
transformation and time-shift, using canonical correlation. Master’s thesis,
Linköpings universitet. LiTH-ISY-EX-1825.
Jolliffe, I. T. (1986). Principal Component Analysis. Springer-Verlag, New York.
Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the
em algorithm. Neural Computation, 6(2):181–214.
Kailath, T. (1980). Linear Systems. Information and System Sciences Series.
Prentice-Hall, Englewood Cliffs, N.J.
Karhunen, K. (1947). Über lineare Methoden in der Wahrscheinlichkeitsrechnung.
Annales Academiae Scientiarum Fennicae, Series A1: Mathematica-Physica,
37:3–79.
Kay, J. (1992). Feature discovery under contextual supervision using mutual information. In International Joint Conference on Neural Networks, volume 4,
pages 79–84. IEEE.
Knutsson, H. (1982). Filtering and Reconstruction in Image Processing. PhD
thesis, Linköping University, Sweden. Diss. No. 88.
Knutsson, H. (1985). Producing a continuous and distance preserving 5-D vector
representation of 3-D orientation. In IEEE Computer Society Workshop on
Computer Architecture for Pattern Analysis and Image Database Management
- CAPAIDM, pages 175–182, Miami Beach, Florida. IEEE. Report LiTH–ISY–
I–0843, Linköping University, Sweden, 1986.
Knutsson, H. (1989). Representing local structure using tensors. In The 6th Scandinavian Conference on Image Analysis, pages 244–251, Oulu, Finland. Report
LiTH–ISY–I–1019, Computer Vision Laboratory, Linköping University, Sweden, 1989.
Knutsson, H., Borga, M., and Landelius, T. (1995). Learning Canonical Correlations. Report LiTH-ISY-R-1761, Computer Vision Laboratory, S–581 83
Linköping, Sweden.
Kohonen, T. (1972). Correlation matrix memories. IEEE Trans.s on Computers,
C-21:353–359.
Kohonen, T. (1982). Self-organized formation of topologically correct feature
maps. Biological Cybernetics, 43:59–69.
Kohonen, T. (1989). Self-organization and Associative Memory. Springer–Verlag,
Berlin, third edition.
Landelius, T. (1993). Behavior Representation by Growing a Learning Tree. Thesis No. 397, ISBN 91–7871–166–5.
Landelius, T. (1997). Reinforcement Learning and Distributed Local Model Synthesis. PhD thesis, Linköping University, Sweden, S–581 83 Linköping, Sweden. Dissertation No 469, ISBN 91–7871–892–9.
Landelius, T., Borga, M., and Knutsson, H. (1996). Reinforcement Learning
Trees. Report LiTH-ISY-R-1828, Computer Vision Laboratory, S–581 83
Linköping, Sweden.
Landelius, T., Knutsson, H., and Borga, M. (1995). On-Line Singular Value
Decomposition of Stochastic Process Covariances. Report LiTH-ISY-R-1762,
Computer Vision Laboratory, S–581 83 Linköping, Sweden.
Lapointe, F. J. and Legendre, P. (1994). A classification of pure malt scotch
whiskies. Applied Statistics, 43(1):237–257.
Lee, C. C. and Berenji, H. R. (1989). An intelligent controller based on approximate reasoning and reinforcement learning. Proceedings of the IEEE Int. Symposium on Intelligent Control, pages 200–205.
Li, P., Sun, J., and Yu, B. (1997). Direction finding using interpolated arrays in
unknown noise fields. Signal Processing, 58:319–325.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3):105–117.
Linsker, R. (1989). How to generate ordered maps by maximizing the mutual
information between input and output signals. Neural Computation, 1:402–
411.
Ljung, L. (1987). System Identification. Prentice-Hall.
Loève, M. (1963). Probability Theory. Van Nostrand, New York.
Luenberger, D. G. (1969). Optimization by Vector Space Methods. Wiley, New
York.
Marr, D. (1982). Vision. W. H. Freeman and Company, New York.
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of ideas immanent in
nervous activity. Bulletin of Mathematical Biophysics, 5:115–133.
Mikaelian, G. and Held, R. (1964). Two types of adaptation to an optically–rotated
visual field. American Journal of Psychology, 77:257–263.
Minsky, M. L. (1961). Steps towards artificial intelligence. In Proceedings of the
Institute of Radio Engineers, volume 49, pages 8–30.
Minsky, M. L. (1963). Computers and Thought, chapter Steps Towards Artificial
Intelligence, pages 406–450. McGraw–Hill. E. A. Feigenbaum and J. Feldman,
Eds.
Minsky, M. L. and Papert, S. (1969). Perceptrons. M.I.T. Press, Cambridge, Mass.
Montanarella, L., Bassani, M. R., and Breas, O. (1995). Chemometric classification of some European wines using pyrolysis mass spectrometry. Rapid Communications in Mass Spectrometry, 9(15):1589–1593.
Moody, J. and Darken, C. J. (1989). Fast learning in networks of locally-tuned
processing units. Neural Computation, 1:281–293.
Munro, P. (1987). A dual back-propagation scheme for scalar reward learning.
In Proceedings of the 9th Annual Conf. of the Cognitive Science Society, pages
165–176, Seattle, WA.
Narendra, K. S. and Thathachar, M. A. L. (1974). Learning automata - a survey.
IEEE Trans. on Systems, Man, and Cybernetics, 4(4):323–334.
Nordberg, K., Granlund, G., and Knutsson, H. (1994). Representation and Learning of Invariance. Report LiTH-ISY-I-1552, Computer Vision Laboratory, S–
581 83 Linköping, Sweden.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J.
Math. Biology, 15:267–273.
Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1:61–68.
Oja, E. and Karhunen, J. (1985). On stochastic approximation of the eigenvectors
and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106:69–84.
Olds, J. and Milner, P. (1954). Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain. J. comp. physiol. psychol.,
47:419–427.
Pavlov, I. P. (1955). Selected Works. Foreign Languages Publishing House,
Moscow.
Pearlmutter, B. A. and Hinton, G. E. (1986). G-maximization: An unsupervised
learning procedure for discovering regularities. In Neural Networks for Computing: American Institute of Physics Conference Proceedings, volume 151,
pages 333–338.
Pearson, K. (1896). Mathematical contributions to the theory of evolution–III.
Regression, heredity and panmixia. Philosophical Transactions of the Royal
Society of London, Series A, 187:253–318.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space.
Philosophical Magazine, 2:559–572.
Pollen, D. A. and Ronner, S. F. (1983). Visual cortical neurons as localized spatial
frequency filters. IEEE Trans. on Syst. Man Cybern., 13(5):907–915.
Riedmiller, M. and Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA.
Ritter, H. (1991). Asymptotic level density for a class of vector quantization
processes. IEEE Transactions on Neural Networks, 2:173–175.
Ritter, H., Martinetz, T., and Schulten, K. (1989). Topology conserving maps for
learning visuomotor-coordination. Neural Networks, 2:159–168.
Ritter, H., Martinetz, T., and Schulten, K. (1992). Neural Computation and SelfOrganizing Maps. Addison-Wesley.
Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory
of Brain Mechanisms. Spartan Books, Washington, D.C.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323:533–536.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM J. Res. Develop., 3(3):210–229.
Sanger, T. D. (1988). Stereo disparity computation using gabor filters. Biological
Cybernetics, 59:405–418.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer feedforward
neural network. Neural Networks, 12:459–473.
Schultz, W., Dayan, P., and Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275:1593–1599.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal. Also in N. J. A. Sloane and A. D. Wyner (ed.) Claude
Elwood Shannon Collected Papers, IEEE Press 1993.
Skinner, B. F. (1938). The Behavior of Organisms: An Experimental Analysis.
Prentice–Hall, Englewood Cliffs, N.J.
Smith, R. E. and Goldberg, D. E. (1990). Reinforcement learning with classifier
systems. Proceedings. AI, Simulation and Planning in High Autonomy Systems,
6:284–192.
Steinbuch, K. and Piske, U. A. W. (1963). Learning matrices and their applications. IEEE Transactions on Electronic Computers, 12:846–862.
Stewart, D. K. and Love, W. A. (1968). A general canonical correlation index.
Psychological Bulletin, 70:160–163.
Stewart, G. W. (1976). A bibliographical tour of the large, sparse generalized
eigenvalue problem. In Bunch, J. R. and Rose, D. J., editors, Sparse Matrix
Computations, pages 113–130.
Sutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning.
PhD thesis, University of Massachusetts, Amherst, MA.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences.
Machine Learning, 3:9–44.
Tesauro, G. (1990). Neurogammon: a neural network backgammon playing program. In IJCNN Proceedings III, pages 33–39.
Thorndike, E. L. (1898). Animal intelligence: An experimental study of the associative processes in animals. Psychological Review, 2(8). Monogr. Suppl.
Torres, L. and Kunt, M., editors (1996). Video Coding: The Second Generation
Approach. Kluwer Academic Publishers.
van den Wollenberg, A. L. (1977). Redundancy analysis: An alternative for
canonical correlation analysis. Psychometrika, 36:207–209.
van der Burg, E. (1988). Nonlinear Canonical Correlation and Some Related
Techniques. DSWO Press.
van der Pol, B. (1946). The fundamental principles of frequency modulation.
Proceedings of the IEEE, 93:153–158.
Watkins, C. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge
University.
Werbos, P. (1992). Handbook of Intelligent Control, chapter Approximate dynamic programming for real-time control and neural modelling. Van Nostrand
Reinhold. D. A. White and D. A. Sofge, Eds.
Werbos, P. J. (1974). Beyond Regression: New Tools for Prediction and Analysis
in the Behavoral Sciences. PhD thesis, Harvard University.
Werbos, P. J. (1990). Consistency of HDP applied to a simple reinforcement
learning problem. Neural Networks, 3:179–189.
Westelius, C.-J. (1995). Focus of Attention and Gaze Control for Robot Vision.
PhD thesis, Linköping University, Sweden, S–581 83 Linköping, Sweden. Dissertation No 379, ISBN 91–7871–530–X.
Whitehead, S. D. and Ballard, D. H. (1990a). Active perception and reinforcement
learning. Proceedings of the 7th Int. Conf. on Machine Learning, pages 179–
188.
Whitehead, S. D. and Ballard, D. H. (1990b). Learning to perceive and act. Technical report, Computer Science Department, University of Rochester.
Whitehead, S. D., Sutton, R. S., and Ballard, D. H. (1990). Advances in reinforcement learning and their implications for intelligent control. Proceedings of the
5th IEEE Int. Symposium on Intelligent Control, 2:1289–1297.
Williams, R. J. (1988). On the use of backpropagation in associative reinforcement learning. In IEEE Int. Conf. on Neural Networks, pages 263–270.
Wilson, R. and Knutsson, H. (1989). A multiresolution stereopsis algorithm based
on the Gabor representation. In 3rd International Conference on Image Processing and Its Applications, pages 19–22, Warwick, Great Britain. IEE. ISBN
0 85296382 3 ISSN 0537–9989.
Wold, S., Ruhe, A., Wold, H., and Dunn, W. J. (1984). The collinearity problem in linear regression. the partial least squares (pls) approach to generalized
inverses. SIAM J. Sci. Stat. Comput., 5(3):735–743.
Zadeh, L. A. (1968). Fuzzy algorithms. Information and Control, 12:94–102.
Zadeh, L. A. (1988). Fuzzy logic. Computer, pages 83–93.
Author index
Anderson, C. W., 16, 20–22
Anderson, J. A., 48
Anderson, T. W., 25
Baird, L. C., 20
Baker, W. L., 52
Ballard, D. H., 19, 33, 35, 37, 38, 40
Barlow, H., 31
Barto, A. G., 16, 17, 19–22
Bassani, M. R., 69
Battiti, R., 11
Becker, S., 31, 69, 105, 108
Bell, A. J., 31
Bellman, R. E., 18
Berenji, H. R., 38
Bernard, S. T., 105
Bienenstock, E., 38
Bloom, F. E., 13
Bock, R. D., 60
Borga, M., 3, 53, 60, 61, 69, 107,
121
Bossom, J., 8, 65
Bower, G. H., 8
Bracewell, R. N., 99, 103
Bradtke, S. J., 20
Braun, H., 12, 147
Breas, O., 69
Bregler, C., 52
Brooks, V. B., 8, 65
Broomhead, D. S., 45
van der Burg, E., 33
Carson, J., 103
Comon, P., 70
Coren, S., 131
Darken, C. J., 45
Das, S., 69
Davis, L., 23
Dayan, P., 13, 20
Denoeux, T., 41
Derin, H., 17
Doursat, R., 38
Duda, R. O., 62
Dunn, W. J., 60, 67
Farell, J. A., 52
Fichsler, M. A., 105
Fieguth, P. W., 69
Field, D. J., 42
Fisher, R. A., 62
Fleet, D. J., 105
Fletcher, R., 11
Fry, T., 103
Földiák, F., 31
Geladi, P., 67
Geman, S., 38
Giles, G. L., 109
Goldberg, D. E., 23, 37
Golub, G. H., 60
Granlund, G. H., 38–40, 51, 52, 99,
101, 103, 109
Gray, R. M., 26, 30
Gullapalli, V., 16, 21
Hart, P. E., 62
Haykin, S., 11, 24, 30
Hebb, D. O., 24, 48
Heger, M., 19
Held, R., 8, 65
Hertz, J., 24, 26, 38, 39
Hilgard, E. R., 8
Hinton, G. E., 23, 28, 31, 36, 45,
105, 108
Holland, J. H., 23
Hopfield, J. J., 45
Hornby, A. S., 7
Hotelling, H., 60, 64, 69
Hubel, D. H., 26, 40
Höskuldsson, A., 60, 67
Irving, W. W., 69
Izenman, A. J., 60
Jaakkola, T., 20
Jacobs, R. A., 12, 28, 36, 38
Jepson, A. D., 105
Johansson, B., 51
Jolliffe, I. T., 65
Jordan, M. I., 20, 28, 36, 38
Kailath, T., 162
Karhunen, J., 82
Karhunen, K., 64
Kaushal, T. P., 31
Kay, J., 69, 70, 85
Kelly, P. A., 17
Knutsson, H., 3, 38–41, 51–53, 60,
61, 69, 99, 101, 103, 105,
107, 109, 121
Kohonen, T., 26, 27, 48, 49, 53
Kowalski, B. R., 67
Krogh, A., 24, 26, 38, 39
Kunt, M., 65
Landelius, T., 3, 9, 20, 51–53, 60, 61,
69, 107, 146
Lapointe, F. J., 69
Lazerson, A., 13
Lee, C. C., 38
Lengellé, R., 41
Legendre, P., 69
Li, P., 69
Linsker, R., 30, 31
Ljung, L., 49
Loève, M., 64
Love, W. A., 75
Lowe, D., 45
Luenberger, D. G., 11
Marr, D., 105
Martinetz, T., 53
Maxwell, T., 109
McCulloch, W. S., 44
Mikaelian, G., 8, 65
Milner, P., 13
Minsky, M. L., 35, 45, 46
Mitchson, G. J., 31
Montague, P. R., 13, 20
Montanarella, L., 69
Moody, J., 45
Munro, P., 9, 16
Narendra, K. S., 7
Nordberg, K., 39, 109
Nowlan, S. J., 23, 28, 36
Oja, E., 24–26, 82
Olds, J., 13
Omohundro, S. M., 52
Palmer, R. G., 24, 26, 38, 39
Papert, S., 45, 46
Pavlov, I. P., 7
Pearlmutter, B. A., 31
Pearson, K., 25, 64
Piske, U. A. W., 48
Pitts, W., 44
van der Pol, B., 103
Pollen, D. A., 1
Reeves, C. M., 11
Riedmiller, M., 12, 147
Ritter, H., 27, 53
Ronner, S. F., 1
Rosenblatt, F., 45
Ruhe, A., 60, 67
Rumelhart, D. E., 36, 45
Samuel, A. L., 19
Sanger, T. D., 25, 105
Schulten, K., 53
Sejnowski, T. J., 31, 45
Sen, P. K., 69
Shannon, C. E., 28, 29
Singh, S. P., 20
Skinner, B. F., 8
Smith, R. E., 37
Steinbuch, K., 48
Stewart, D. K., 60, 75
Sun, J., 69
Sutton, R. S., 16, 19–22, 36, 55
Tesauro, G., 14
Thathachar, M. A. L., 7
Thorndike, E. L., 7
Torres, L., 65
Van Loan, C. F., 60
Ward, L. M., 131
Watkins, C., 9, 19, 20
Werbos, P. J., 19, 20, 45
Westelius, C-J., 105
Whitehead, S. D., 19, 33, 35, 37
Wiesel, T. N., 26, 40
Williams, R. J., 16, 32, 36, 45
Willsky, A. S., 69
Wilson, R., 105
Wold, H., 60, 67
Wold, S., 60, 67
Wolfram, S., 13, 20
van den Wollenberg, A. L., 60
Yu, B., 69
Zadeh, L. A., 37, 38