INTELLIGENT SYSTEMS
Johan Westö
SERIES L: TEACHING MATERIALS, 2/2014

Abstract
This course material is meant as an introduction to neural networks. Based on a
mathematical representation of neurons, the material will try to present 1) how these
neurons can be used to build models, 2) how the models are dependent on the network’s
structure, and 3) how we can make the networks learn from data. During the course,
neural networks will be trained to solve simple example problems as well as large-scale
real problems in the form of image classification. The goal is to give students a basic
understanding of how neural networks can be used to solve both regression and classification
problems.
Summary (Sammanfattning)
This course is intended as an introduction to neural networks. The material shows
how a mathematical representation of nerve cells can be used to represent
models, how these models depend on the network's structure, and how they can be made
to learn from data. During the course, neural networks will be used to solve both
fictitious example problems and real problems in the form of image classification. The goal is for
students to acquire a basic knowledge of how neural networks can be used for both
regression and classification problems.
Publisher: Novia University of Applied Sciences, Wolffskavägen 35 B, 65200 Vasa, Finland
© Johan Westö & Novia University of Applied Sciences
Novia Publications and Productions, series L: Teaching materials 2/2014
ISSN: 1799-4195,
ISBN: 978-952-5839-86-9 (online)
Layout: Johan Westö
Contents
1 Course information
2 Introduction
    2.1 Intelligence
    2.2 Machine learning
        2.2.1 Traditional machine learning versus deep learning
    2.3 Neurons and neural networks
        2.3.1 Artificial neuron
    2.4 Used notation
    2.5 MATLAB
3 Linear regression
    3.1 A quadratic error measure
    3.2 Example data (part 1)
    3.3 Finding optimal weight values
        3.3.1 Running gradient descent
    3.4 Example data (part 2)
    3.5 Final remarks on gradient descent
4 Softmax regression
    4.1 Maximum likelihood
    4.2 Gradient checking
    4.3 2 class example
    4.4 Training and testing
    4.5 MNIST
    4.6 Restrictions for linear models
5 Multilayer Perceptrons
    5.1 Backpropagation
    5.2 The XOR problem
    5.3 Non-linear regression
    5.4 MNIST revisited
    5.5 Deeper architectures
6 What is next?
References
Appendices
    A Matlab code for solving the XOR problem
    B Suggested course structure
        B.1 Guidelines for reporting answers to homework problems
        B.2 Homework 1
        B.3 Presentation
        B.4 Homework 2
        B.5 Homework 3
Notation
Acronyms
Index
1 Course information
This is a course on intelligent systems with a focus on Artificial Neural Networks (ANNs).
ANNs are currently experiencing something like a new golden age due to their recent
successes on problems related to image and speech recognition (Bengio, Courville, & Vincent,
2012). The purpose of this course is to provide the background information
needed about ANNs in order to understand the kind of thinking that has led to their recent
successes. Hence, the course targets students in fields where models need to be learned
from data, such as computer science and electrical engineering. Upon completing the
course, participating students are expected to have obtained knowledge about 1) how ANNs
can solve both regression and classification problems, 2) how gradient-based optimization
methods can be used for training ANNs, 3) how layering allows networks to solve non-linear
problems, and 4) why deep ANNs are thought to be useful.
Most of the material used is taken from Haykin (2009), but today it is also possible
to find excellent free courses online. I would recommend visiting Coursera’s homepage.
There you can find a really good course on “Machine Learning” taught by Andrew Ng
(co-founder of Coursera by the way), and an equally good course on “Artificial Neural
Networks” taught by Geoffrey Hinton (a.k.a. the godfather of neural networks). I would
also recommend an article about deep neural networks and the human brain by Laserson
(2011) as inspiration.
Prerequisites: This material assumes that students are familiar with 1) linear algebra
(matrix operations), 2) calculus (the gradient and partial derivatives), 3) system modelling,
and 4) MATLAB.
2 Introduction
Today more and more appliances connect to the internet all the time, and this increased
connectivity is accompanied by an increased ability to collect and store data. Factories
have more sensors collecting data about their processes, and companies such as Facebook
and Google sit on a wealth of information about their users; but how do we make sense
of all this data? One way is to let computers build models from it, which is what this
course is about. More specifically, this course will give an introduction to how we can
teach machines to learn from data using ANNs. We will start by looking at how linear
regression can be represented by one artificial neuron and how we can use gradient descent
to solve optimization problems. From here, we will move on to see how softmax regression
can complement linear regression in solving problems related to classification. Finally, we
will look into how multilayer perceptron networks emerge out of both linear and softmax
regression models by first making a non-linear projection of the input data.
2.1 Intelligence
What is meant by intelligence? Different people will probably have different opinions, but
what follows is one that is currently receiving a lot of attention. Karl Friston, who is a
famous neuroscientist, proposed that several different brain theories could be gathered
under one concept (Friston, 2010). Within this concept, the brain’s functionality could be
looked upon as minimizing surprise. If the brain is thought to be intelligent, one meaning
of intelligence would then be the ability to make correct predictions. This definition is
by no means a new one, and when asked, “What is intelligence if it is not defined by
behaviour?” Jeff Hawkins replied (Hawkins, 2004, p. 6):
The brain uses vast amounts of memory to create a model of the world. Everything you know and have learned is stored in this model. The brain uses this
memory-based model to make continuous predictions of future events. It is the
ability to make predictions about the future that is the crux of intelligence. I will
describe the brain’s predictive ability in depth; it is the core idea in the book.
Similarly, in a translation of Sun Tzu’s “The Art of War” by Cleary (1988, p. xi), the
following two statements are found.
According to an old story, a lord of ancient China once asked his physician,
a member of a family of healers, which of them was the most skilled in the art.
The physician, whose reputation was such that his name became synonymous with
medical science in China, replied, “My eldest brother sees the spirit of sickness and
removes it before it takes shape, so his name does not get out of the house. My
elder brother cures sickness when it is still extremely minute, so his name does not
get out of the neighbourhood. As for me, I puncture veins, prescribe potions, and
massage skin, so from time to time my name gets out and is heard among the lords.”
Just as the eldest brother in the story was unknown because of his acumen and
the middle brother was hardly known because of his alacrity, Sun Tzu also affirms
that in ancient times those known as skilled warriors won when victory was still
easy, so the victories of skilled warriors were not known for cunning or rewarded for
bravery.
Although these Chinese stories come from texts some 2500 years old, they still agree with
the interpretation of intelligence as the ability to foresee events, even if it is disguised
as skill in this particular case. Going back to Friston’s idea that brain functionality tries
to minimize surprise, we see that predictions should not be restricted to only foreseeing
the future. A definition of intelligence based on predictive capability could also include
the ability to make correct judgements of data. That is, if an object recognized to be a
car actually is a car, then this also corresponds to a situation where surprise is minimized.
So, if a system is able to either make correct judgements of data or predict future events,
we could call it an intelligent system. However, people have discovered during the last
50 years that programming intelligent systems explicitly is very difficult; one often fails
to appreciate the complexity of a task. For example, humans easily recognize Figure 2.1a as a
car, but it is very difficult to tell a computer how to infer the same thing from the RGB
matrices representing an image (see Figure 2.1b), and the task gets even more difficult if
one has to also allow for all different types of cars and viewpoints. One way to solve this
problem would then be to write code for instructing a computer on how to learn the task,
instead of trying to tell it explicitly how to do it. This way of thinking leads us straight to
the next section on machine learning.
2.2 Machine learning
The material in this course will mainly relate to a branch of Artificial Intelligence (AI)
called machine learning, a term defined by Mitchell (1997, p. 2) as:
Definition
“A computer program is said to learn from experience E with respect to some
task T and performance measure P, if its performance at task T, as measured by P,
improves with experience E.”
In practice, this means that machine learning applications include, among other things:
face detection, object recognition, cluster analysis, recommender systems, fault detection,
spam detection, and automatic speech recognition systems. In all these situations, a
machine learning algorithm has learned to perform the task from experience (old data).
[Figure 2.1: panel (a) shows a photograph of a car; panel (b) shows a grid of integer pixel intensities, for example 47, 106, 157, 236.]
Figure 2.1: Image recognition: a) an image of a T-Ford as seen by a human, adapted from
Wikipedia (n.d.-a), and b) a color image as seen by a computer.
Despite their widespread use, machine learning algorithms can normally be classified as
belonging to one of the following three categories:
Supervised learning
Covers situations where each data point is associated with a desired output.
Typical tasks include classification, when the answer is a label (face / not a face),
and regression, when the answer is a real value (function fitting).
Unsupervised learning
Seeks to find structure in unlabelled data; examples include cluster analysis and
dimensionality reduction.
Reinforcement learning
Focuses on online trial-and-error learning, where a computer program tries to
improve its performance on a task by testing different actions and evaluating the
responses observed.
This course will only look at methods belonging to the first category, with a focus
on methods based on ANNs. The reason for this is “deep learning”, which is a collective
name for different types of ANNs with a “deep” hierarchical structure. These networks are
especially good at image and speech recognition, and they are currently receiving a lot of
attention from big companies, such as Microsoft, Google, and Facebook. Both Geoffrey
Hinton and Yann LeCun, who are big names within the field, have recently been hired by
Google and Facebook respectively (Mcmilan, 2013; Metz, 2013).
2.2.1 Traditional machine learning versus deep learning
Several problems within machine learning face high-dimensional data, e.g. the dimensionality
of image data corresponds to the number of pixels in the image. Richard Bellman
pointed out already 50 years ago that the learning complexity increases exponentially as
the dimensionality of the data increases linearly, and he named this problem the “curse of
dimensionality”. Traditionally, machine learning methods have tried to avoid this problem
by first performing dimensionality reduction, or feature extraction as it is also commonly
called (Arel, Rose, & Karnowski, 2010).
In simpler terms, the above means that machine learning has traditionally operated
in two stages. First, the dimensionality of the problem has been decreased by extracting
features, whereupon these features have been fed to a machine learning algorithm selected
for the task at hand. Examples of feature extraction methods are “Bag of Words” for
text data and Fourier transforms for temporal and spatial data. As feature extraction is
performed first, the success of many machine learning algorithms is strongly dependent on
how well the extracted features can represent variation in the original data. Furthermore,
the feature extraction process is normally labour intensive and often performed by humans.
Taken together, the above can be seen as a serious obstacle to reaching a
widespread deployment of intelligent systems (Bengio et al., 2012).
Deep learning methods differ in the way that they try to handle the curse of dimensionality.
Instead of relying on human ingenuity, these methods strive to incorporate
the feature extraction process in the machine learning algorithm by taking inspiration
from the neocortex (Arel et al., 2010). This is the wrinkled outermost layer of the brain,
and it is thought to be responsible for our cognitive abilities. Hence, this is also where
the visual and auditory cortices are located. Investigations of these have revealed that
sensory information is processed through a hierarchical structure where higher-level
information is extracted as one moves up the hierarchy. For vision, this means that edges
are detected at the lowest levels, whereas more complex objects such as cars and faces
would be detected at higher levels (Poggio & Ullman, 2013). In hierarchical systems, depth
can also be used as a measure of the number of levels, and hence, deep learning, as we
shall see, corresponds to ANNs with several levels, or layers as they are normally called.
2.3 Neurons and neural networks
Your brain consists of around 86 billion connected neurons, recently verified
by Herculano-Houzel (2012), and each neuron can connect to up to 10^4 other
neurons (Squire & Kandel, 2009). In more detail, each neuron is a type of
cell capable of sending and receiving impulses. Functionally, the neuron receives
impulses from other neurons on its dendrites, and these can trigger the
neuron to send an electrical impulse of its own down the axon, which in
turn connects to other neurons (see Figure 2.2). The sites that connect
different neurons to each other are called synapses, and in contrast to
the neuron’s electrical impulse, signals are transmitted chemically within the
synapse using substances called neurotransmitters.
Figure 2.2: Biological neuron, adapted from Wikipedia (n.d.-b).
The term neural network is used to describe networks of connected neurons, and one
example of a neural network is therefore the brain. But what is it about these types of
networks that makes it possible to store memories or information? Santiago Ramón y Cajal,
a Nobel laureate in Physiology or Medicine in 1906, proposed that memories are stored through
alterations of the synapses connecting neurons. In other words, he proposed that synapses
could form connections of various strengths, and that these strengths were plastic in the
sense that they could change over time. In 2000, Eric Kandel received the Nobel Prize in
Physiology or Medicine for his work on verifying the above statement and for describing
how this process occurs in real synapses (Squire & Kandel, 2009).
Even if changes in the brain’s synapses constitute the basis for memories, one should
not imagine memories as being stored in any specific location. That is, there is no specific
place where a memory is stored; instead, memories are stored in a distributed fashion
throughout the network. This property is maybe best captured by Lashley (1950), who
spent his career searching for a specific memory trace, or engram as he called it, and
arrived at the following conclusion:
This series of experiments has yielded a good bit of information about what and
where the memory trace is not. It has discovered nothing directly of the real nature
of the engram. I sometimes feel, in reviewing the evidence on the localization of the
memory trace, that the necessary conclusion is that learning just is not possible.
The message one should take home from the above is then not that learning is impossible,
but rather that finding one or more synapses that specifically code for a memory trace
might very well be impossible.
Before moving on to ANNs, there is still one more detail of interest. As the brain
performs so many different tasks, neural networks clearly possess interesting features from
an intelligent systems viewpoint; but what is the chance of us being able to utilize them
to the same degree as the brain is doing? And how do we know that we are not just
going to end up copying lots of different learning mechanisms? That is, is it possible that
there could be one general learning method for neural networks, utilized by the brain, that
also could be implemented for ANNs? The real answer is that we do not know yet, but
findings such as the one by Von Melchner, Pallas, and Sur (2000) provide hope. In this
study, the signals coming from the eyes were rerouted to the auditory cortex in newborn
ferrets. The purpose was to investigate if the auditory cortex could learn to process visual
information and this seems to be the case. Therefore, there is some hope for the existence
of one general learning algorithm for neural networks.
2.3.1 Artificial neuron
Similarly to real neural networks, ANNs are built up from connected neurons; but in
this case from artificial neurons. Different models for artificial neurons exist, but for the
purpose of this course all neurons will be of the model described in Figure 2.3. This model
uses a vector (w) with connection weights to indicate synaptic strengths to inputs in a
vector (x), a summation for determining the induced field (v), and an activation function
(ϕ(.)) to calculate the output (ŷ). These calculations can be described mathematically as:
$$\hat{y} = \varphi\left(\sum_{m=0}^{M} w_m x_m\right) = \varphi\left(\mathbf{w}^\top \mathbf{x}\right) \tag{2.1}$$
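[Figure 2.3: diagram in which the inputs x0 = 1, x1, ..., xM are multiplied by the weights w0, w1, ..., wM and summed to give the induced field v, which is passed through ϕ(.) to produce ŷ.]
Figure 2.3: Artificial neuron.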
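As a small worked illustration of Equation 2.1, the sketch below evaluates one neuron for a single data point. The input and weight values are made up, and the identity activation is an assumption chosen only to keep the arithmetic easy to follow. Note that MATLAB indexes from one, so `w(1)` plays the role of w0.

```matlab
% Minimal sketch of Equation 2.1 for one data point (all values are made up).
x = [1; 0.5; -2];      % input vector; x(1) plays the role of x0 = 1
w = [0.1; 0.4; 0.3];   % weight vector; w(1) plays the role of w0 (the bias)
phi = @(v) v;          % identity activation (an assumption for this sketch)

% Induced field written as an explicit sum over all inputs
v_sum = 0;
for m = 1:numel(w)
    v_sum = v_sum + w(m) * x(m);
end

% The same induced field written as the dot product w' * x
v_dot = w' * x;

% Neuron output
yHat = phi(v_dot);
```

Both ways of computing the induced field give the same number, which is why the vector form is preferred in the rest of the material.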
2.4 Used notation
The notation used is as follows: upper case bold letters represent matrices, lower case bold
letters represent vectors, lower case letters with subscripts represent elements in matrices
or vectors (if the vector only contains one element the subscript is left out), and finally,
superscripts with Roman numerals represent depth in a hierarchy when not obvious from
context. A complete list of all used notations and symbols is found in the Notation list
at the end of this material.
2.5 MATLAB
The Matrix Laboratory (MATLAB) environment is widely used within both academia
and industry, and it will be used throughout this course to illustrate different learning
examples. MATLAB is, as the name implies, well suited for matrix and vector operations,
and these types of operations should be favoured over loops whenever possible. One neat
feature of the artificial neuron used here is therefore that the induced field is determined by the
dot product¹ of the inputs and the weights. This means that the model output (Ŷ) for all
data points (X) can be easily determined as:
    % Calculate the locally induced field.
    % Please note that W is transposed!
    V = W' * X;
    % Calculate YHat (assuming that the activation function
    % phi can handle matrices).
    YHat = phi(V);
and these calculations still work even if there are several neurons attached to the inputs. In
the chapters that follow, we will use this model of a neuron to see 1) how we can represent
models, 2) how the model depends on the network’s structure, and 3) how we can make
the network learn from data.
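To make the claim about several neurons concrete, the following sketch uses hypothetical sizes (M = 2 inputs, N = 4 data points, K = 3 neurons) and made-up numbers. It assumes that each column of `W` holds the weights of one neuron and that each column of `X` holds one data point with the bias input in the first row.

```matlab
% Sketch with made-up sizes: M = 2 inputs, N = 4 data points, K = 3 neurons.
N = 4;
X = [ones(1, N); 1 2 3 4; 0 1 0 1];   % (M+1)-by-N, first row is x0 = 1
W = [0 1 -1; 1 0 2; 2 1 0];           % (M+1)-by-K, one column per neuron
phi = @(V) V;                         % identity activation (assumption)

V = W' * X;        % K-by-N matrix of induced fields
YHat = phi(V);     % K-by-N matrix: row k holds the outputs of neuron k
```

The result has one row per neuron and one column per data point, so no loop over neurons or data points is needed.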
¹ The dot product between two vectors a and b is obtained as the sum of an element-wise multiplication: $\sum_{i=1}^{|a|} a_i b_i$.
3 Linear regression
In linear² regression, the aim is to fit a hyperplane³ to a set of data points. Such models
can be visualized as a single neuron connected to every element in the input vector. We
will begin by looking at a simple example with one-dimensional input and output data, and
we will assume that the process generating the data can be described with the following
model:
$$\hat{y} = w_0 + x_1 w_1 + \varepsilon \tag{3.1}$$
where ε is an error term representing noise and the influence of unknown or unmeasured
terms. This model can be described by a single neuron, of the same form as the one in
Figure 2.3, if we assume that the activation function (ϕ(.)) simply returns the argument
unchanged. In this case, we could then model the process with a neuron looking like the
one in Figure 3.1.
At this stage, however, our input data (x) only has dimensionality one, whereas the
weight vector (w) has two elements and would require the input to have dimensionality two. As
can be seen from Equation 3.1, though, w0 should always be multiplied by one. Hence, w0
represents what is called a bias term, and it is present in all models we will look at. For
this reason, the simplest fix is to always add a row of ones to the input data matrix (X)
(X then gets the dimensions M + 1 by N). This is also the reason why the indexing here
starts from zero when all other indexes start from one.
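[Figure 3.1: diagram in which the inputs x0 = 1 and x1 are multiplied by the weights w0 and w1 and summed to give v, with ϕ(v) = v producing the output ŷ.]
Figure 3.1: Linear regression neuron (x0 = 1 and w0 = bias).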
² A linear function must satisfy f(x + y) = f(x) + f(y) and f(ax) = af(x) for all a (Lay, 2012, p. 65).
³ A hyperplane is a plane with dimensionality D − 1 where D is the dimensionality of space (Lay,
2012, p. 440). In two-dimensional space, the hyperplane then becomes a line.
Important
Always remember to add a row of ones to the top of your observed data matrix
(X) (x0 should always equal 1).
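The reminder above can be sketched in MATLAB as follows. The raw input values and the weights are made up for illustration, and the variable name `Xraw` is hypothetical.

```matlab
% Sketch: prepend a row of ones so that w(1) acts as the bias w0.
Xraw = [0.2 -1.3 4.0 2.5];    % 1-by-N raw inputs (made-up values)
N = size(Xraw, 2);
X = [ones(1, N); Xraw];       % 2-by-N data matrix with x0 = 1 in the first row

w = [1; 0.5];                 % example weights: w0 = 1 (bias), w1 = 0.5
YHat = w' * X;                % 1-by-N model output, equals 1 + 0.5 * Xraw
```

With the extra row in place, the bias needs no special treatment and is updated like any other weight.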
3.1 A quadratic error measure
The problem in linear regression then boils down to choosing values for the weights (w),
so that the model can explain the observed output (y) as well as possible. In order to
know what is good and what is bad, we have to define some kind of error measure that
scores each possible combination of weight values. To this end, let us start off by defining an
error signal (e) as the difference between each desired output (y), indexed with n, and the
corresponding model output (ŷ):
$$e(n) = y(n) - \hat{y}(n) \tag{3.2}$$
Using the error signal, we then define an instantaneous error energy (E ) as:
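$$\mathcal{E}(n) = \frac{1}{2} e^2(n) \tag{3.3}$$
where the factor 1/2 is added for mathematical convenience (simpler derivative). Finally, we
define the average error energy (Eav) to be:
$$\begin{aligned}
\mathcal{E}_{av} &= \frac{1}{N} \sum_{n=1}^{N} \mathcal{E}(n) \\
&= \frac{1}{2N} \sum_{n=1}^{N} e^2(n) \\
&= \frac{1}{2N} \sum_{n=1}^{N} \left[ y(n) - \hat{y}(n) \right]^2 \\
&= \frac{1}{2N} \sum_{n=1}^{N} \left[ y(n) - \sum_{m=0}^{M} w_m x_m(n) \right]^2
\end{aligned} \tag{3.4}$$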
where it has been assumed that a linear regression output neuron is used with the activation
function ϕ(v) = v. As E squares the observed error signal, Eav will obtain positive
contributions from all error signals, regardless of whether they have a positive or negative sign.
Hence, the lower the value of Eav, the better the model can explain the observed data. The
optimal choice of model weights is therefore obtained at the minimum of Eav with
respect to the weights. However, one should keep in mind that these
model weights are only optimal for the currently used quadratic error measure.
3.2 Example data (part 1)
Let us look at an example to illustrate what has been said so far. Figure 3.2a plots 100
data points generated by the process:
$$y = 1 + 0.5x + r \tag{3.5}$$
where r is a random normally distributed variable with mean zero and standard deviation
0.5. To this data, we will try to fit the model in Figure 3.1. We are not yet in a position
to determine what the optimal weight values are, but as there are only two weights in
this particular case, we can plot a surface illustrating how Eav varies for different weight
combinations. This is done in MATLAB by calculating Eav as:
    E = Y - Yhat;          % error signal for every data point
    En = 1/2 * E .* E;     % instantaneous error energy
    En_av = mean(En);      % average error energy
over a grid of different weight combinations and visualizing the result as a surface plot.
Such a surface is shown in Figure 3.2b, and although the curvature along the w0 axis is
small, it is still possible to get an idea of where the minimum is located.
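The grid evaluation described above could be sketched as follows. The data are regenerated here from Equation 3.5 with fresh random numbers, so they are not the exact points used in the figures, and the grid resolution of 61 points per axis is an arbitrary assumption. The plotting call is left commented out so that the sketch also runs without a display.

```matlab
% Made-up data drawn from the process in Equation 3.5
N = 100;
x = -5 + 10 * rand(1, N);               % inputs in the interval [-5, 5]
Y = 1 + 0.5 * x + 0.5 * randn(1, N);    % noisy outputs
X = [ones(1, N); x];                    % add the row of ones for the bias

% Evaluate Eav on a grid of weight combinations in [-3, 3]
w0 = linspace(-3, 3, 61);
w1 = linspace(-3, 3, 61);
Eav = zeros(numel(w1), numel(w0));      % rows follow w1, columns follow w0
for a = 1:numel(w1)
    for b = 1:numel(w0)
        w = [w0(b); w1(a)];
        E = Y - w' * X;
        Eav(a, b) = mean(1/2 * E .* E);
    end
end

% Locate the grid point with the lowest average error energy
[~, idx] = min(Eav(:));
[a, b] = ind2sub(size(Eav), idx);
w0Best = w0(b);
w1Best = w1(a);

% Uncomment to draw the surface:
% surf(w0, w1, Eav); xlabel('w0'); ylabel('w1'); zlabel('Eav');
```

The grid minimum should land close to w0 = 1 and w1 = 0.5, the values used to generate the data, although the exact position depends on the random noise.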
[Figure 3.2: panel (a) is a scatter plot of y against x with the legend entries "Training data" and "Original function"; panel (b) is a surface plot of the average error energy over w0 and w1.]
Figure 3.2: Example data: a) 100 data points, selected randomly in the interval [-5, 5],
generated by the process in Equation 3.5 (black circles) together with a blue line representing
the same process without the stochastic variable r, b) Eav as a function of w0 and w1 in the
interval [-3, 3].
Important
Before proceeding, one should ask if this is the only minimum that can be found.
In this case it is, but that is not always so, and we will therefore return to this
question later.
3.3 Finding optimal weight values
For simple problems, it is often possible to analytically determine the exact weight values
where Eav has its minimum. Linear regression belongs to this group of simpler problems,
but in order to prepare for more difficult challenges ahead, we will look at a gradient-based
iterative method called gradient descent (also known as the method of steepest
descent). Given any set of weights, this method evaluates the slope of the energy
surface (Figure 3.2b) and adjusts the parameters so that a step is taken in the direction of
steepest descent. From calculus, we know that the gradient⁴ of a function gives the direction
of steepest ascent. Hence, with gradient descent we want to update our current
weights so that a step is taken in the negative direction of the gradient. Mathematically,
we define this parameter updating process as:
$$w_m^{\text{new}} = w_m^{\text{old}} - \eta \frac{\partial \mathcal{E}_{av}}{\partial w_m^{\text{old}}} \tag{3.6}$$
⁴ The gradient of a function is a vector in which each element is the partial derivative of the function
with respect to one variable. In our case, the variables are represented by our model’s weights.
where η is a parameter that determines the step size. In order to simplify things further
ahead, we will at this point also introduce the following shorthand notation for the weight
change (∆w) in Equation 3.6:
$$\Delta w_m = \eta \frac{\partial \mathcal{E}_{av}}{\partial w_m} \tag{3.7}$$
The next step is then to obtain an expression for the partial derivatives of Eav. We
begin by noting that:
$$\frac{\partial \mathcal{E}_{av}}{\partial w_m} = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial \mathcal{E}(n)}{\partial w_m(n)} \tag{3.8}$$
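From here, we derive the partial derivatives of E(n) by applying the chain rule⁵ as:
$$\frac{\partial \mathcal{E}(n)}{\partial w_m(n)} = \frac{\partial \mathcal{E}(n)}{\partial e(n)} \frac{\partial e(n)}{\partial \hat{y}(n)} \frac{\partial \hat{y}(n)}{\partial v(n)} \frac{\partial v(n)}{\partial w_m(n)}$$
where
$$\frac{\partial \mathcal{E}(n)}{\partial e(n)} = \frac{\partial\, \frac{1}{2} e^2(n)}{\partial e(n)} = e(n)$$
$$\frac{\partial e(n)}{\partial \hat{y}(n)} = \frac{\partial \left[ y(n) - \hat{y}(n) \right]}{\partial \hat{y}(n)} = -1$$
$$\frac{\partial \hat{y}(n)}{\partial v(n)} = \frac{\partial \varphi(v(n))}{\partial v(n)} = \frac{\partial v(n)}{\partial v(n)} = 1$$
$$\frac{\partial v(n)}{\partial w_m(n)} = \frac{\partial \sum_{m=0}^{M} w_m(n) x_m(n)}{\partial w_m(n)} = x_m(n)$$
therefore
$$\frac{\partial \mathcal{E}(n)}{\partial w_m(n)} = -e(n) x_m(n)$$
and
$$\frac{\partial \mathcal{E}_{av}}{\partial w_m} = -\frac{1}{N} \sum_{n=1}^{N} e(n) x_m(n) \tag{3.9}$$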
Equation 3.9 then tells us in which direction, in weight space, we should move our weights
so that Eav decreases.
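One way to gain confidence in Equation 3.9 is to compare it with a numerical estimate of the gradient. The sketch below does this with central finite differences on a tiny made-up data set; the step size `h` and all data values are assumptions chosen for illustration.

```matlab
% Tiny made-up data set with the bias row already added
X = [1 1 1 1; -2 0 1 3];      % 2-by-N
Y = [0.1 0.9 1.6 2.4];        % 1-by-N
w = [0.3; -0.2];              % weights at which the gradient is evaluated
N = size(X, 2);

% Average error energy as a function of the weights (Equation 3.4)
Eav = @(w) mean(1/2 * (Y - w' * X).^2);

% Analytic gradient according to Equation 3.9
E = Y - w' * X;
gradAnalytic = -(1 / N) * X * E';

% Numerical gradient using central finite differences
h = 1e-6;
gradNumeric = zeros(size(w));
for m = 1:numel(w)
    d = zeros(size(w));
    d(m) = h;
    gradNumeric(m) = (Eav(w + d) - Eav(w - d)) / (2 * h);
end

maxDiff = max(abs(gradAnalytic - gradNumeric));
```

If the derivation is implemented correctly, `maxDiff` should be very small compared with the size of the gradient elements.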
3.3.1 Running gradient descent
Using partial derivatives, we have a method for updating the model’s weights so that Eav
decreases; but this requires that we already have a set of weights that we are trying to
improve. To get started, we therefore need to select a set of initial weight values. Previous
knowledge about the problem can be used here, but if no such knowledge exists, it is
common to simply generate a set of random values. Based on this, the complete algorithm
for gradient descent is summarized in Algorithm 1, where it has been assumed that the
iterative process continues until convergence. That is, until a point where Eav is no longer
decreasing.
⁵ The chain rule says that $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$, assuming z to be a function of a variable y which in turn is a
function of a variable x (Croft, Davison, & Hargreaves, 2001, p. 368).
    w ← random initial weights in [-1, 1]
    i ← 1    {epoch(a) counter}
    {Iteration loop for gradient descent}
    repeat
        Eav ← 0, ∆w ← 0
        {Loop over all training examples (replace with matrix operations in MATLAB)}
        for n = 1 to N do
            v(n) ← w^T x(n)
            ŷ(n) ← v(n)
            e(n) ← y(n) − ŷ(n)
            E(n) ← (1/2) e²(n)
            Eav ← Eav + (1/N) E(n)
            ∆w ← ∆w − (η/N) x(n) e(n)
        end for
        Eav(i) ← Eav
        Plot progress {check that Eav is decreasing}
        w ← w − ∆w
        i ← i + 1
    until convergence {Eav is no longer decreasing}
    (a) When training networks, an epoch refers to one complete run through the training set.
Algorithm 1: Linear regression using gradient descent.
We have seen earlier that the for loop in Algorithm 1 can be replaced with matrix
multiplications when calculating V, Ŷ, E, and Eav. Similarly, ∆w can also be calculated
directly in MATLAB using
    E = Y - Yhat;
    dw = -eta * 1/N * X*E';
This speeds up the calculations drastically when large datasets are used. Finally, using
Equation 3.6 we update the weights in MATLAB as:
    w = w - dw;
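Putting the pieces above together, a complete batch gradient descent run might look like the sketch below. The data are drawn from Equation 3.5, while the epoch cap `maxEpochs` and the stopping tolerance `tol` are assumptions introduced here to give "until convergence" a concrete meaning; they are not prescribed by Algorithm 1.

```matlab
% Made-up data drawn from the process in Equation 3.5
N = 100;
x = -5 + 10 * rand(1, N);
Y = 1 + 0.5 * x + 0.5 * randn(1, N);
X = [ones(1, N); x];               % row of ones for the bias

eta = 0.15;                        % step size
maxEpochs = 200;                   % safety cap on the number of epochs (assumption)
tol = 1e-9;                        % stop when Eav changes less than this (assumption)
w = 2 * rand(2, 1) - 1;            % random initial weights in [-1, 1]
EavHist = zeros(1, maxEpochs);

for i = 1:maxEpochs
    Yhat = w' * X;                 % model output for all data points
    E = Y - Yhat;                  % error signals
    EavHist(i) = mean(1/2 * E .* E);
    dw = -eta * 1/N * X * E';      % weight change (Equations 3.7 and 3.9)
    w = w - dw;                    % weight update (Equation 3.6)
    if i > 1 && abs(EavHist(i - 1) - EavHist(i)) < tol
        break;                     % Eav is no longer decreasing noticeably
    end
end
EavHist = EavHist(1:i);            % keep only the epochs that were run
```

Plotting `EavHist` against the epoch number is a simple way to check that the error energy is decreasing as expected.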
It could be noted here that MATLAB also has built-in optimization routines and that
several of these are gradient-based. It is therefore possible to use these routines instead of
gradient descent, but we will stick with gradient descent throughout this course.
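As a cross-check for gradient descent, the analytic solution mentioned at the start of Section 3.3 can be obtained with MATLAB's backslash operator, which solves linear least-squares problems. This is offered as one standard way of getting the exact minimum of Eav, not necessarily the approach the course has in mind, and the noise-free data are made up so that the expected answer is known.

```matlab
% Noise-free made-up data generated by y = 1 + 0.5 * x
x = [0 1 2 3];
Y = 1 + 0.5 * x;
N = numel(x);
X = [ones(1, N); x];      % row of ones for the bias

% Least-squares solution of X' * w = Y' using the backslash operator
wExact = X' \ Y';
```

For this data set `wExact` recovers the generating weights, w0 = 1 and w1 = 0.5, up to rounding errors, which gives a reference value to compare gradient descent results against.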
3.4 Example data (part 2)
Now that we have a general method for finding the minimum of Eav, we can apply
it to the example studied earlier. Starting from random initial weight values, the result
after 30 iterations with gradient descent is shown in Figure 3.3a. As the energy surface
has the form of a long valley, the direction of steepest descent will not necessarily point
towards the minimum. Too large η values can therefore make the algorithm take big leaps
that end up increasing Eav. In Figure 3.3a, η is on the verge of becoming too large and
this is illustrated by the zigzag path taken. Nevertheless, the algorithm was able to find
good model parameters after only 30 iterations (red line in Figure 3.3b).
[Figure 3.3 panels: (a) gradient descent path in the (w0, w1) plane; (b) y, ŷ against x with legend entries Training data, Original function, Trained network]
Figure 3.3: Gradient descent: a) 30 iterations from random initial parameters with η = 0.15.
b) 100 data points from Equation 3.5 (black circles), a blue line representing the same process
without the stochastic variable r, and a red line representing the model obtained from the
trained neuron.
Important
The gradient descent algorithm might become unstable and diverge if η is chosen
too large.
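The effect of the learning rate can be illustrated with a small experiment. The sketch below runs 50 epochs of batch gradient descent on an assumed noise-free toy problem (y = 1 + 2x) with two assumed learning rates; the values 0.5 and 1.6 were picked for this toy data only and say nothing about suitable values for other problems.

```matlab
% Comparing a moderate and a too large learning rate (sketch).
% Toy data and both eta values are assumptions made for this illustration.
N = 50;
x = linspace(-2, 2, N);
X = [ones(1, N); x];
Y = 1 + 2*x;
etas = [0.5, 1.6];
EavFinal = zeros(size(etas));
for j = 1:numel(etas)
    w = [0; 0];
    for i = 1:50
        E = Y - w'*X;
        dw = -etas(j) * 1/N * X*E';
        w = w - dw;
    end
    E = Y - w'*X;
    EavFinal(j) = 1/(2*N) * sum(E.^2);   % error after training
end
disp(EavFinal)
```

For the smaller value the final error is essentially zero, whereas for the larger value the error grows from epoch to epoch instead of shrinking.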
3.5 Final remarks on gradient descent
One might wonder why we are using gradient descent for finding the minimum when
several more efficient algorithms exist. The reason is that gradient descent is
very simple and intuitive, and it can also be implemented in batch or online mode. Batch
mode corresponds to the description given above, where the contributions from all training
examples are summed up before the weights are updated. That is, the output for each
data point is evaluated using the same weights before the update is performed. In online
mode, the weights are instead updated continuously using the individual partial derivatives
obtained from each data point.
Batch and online mode are therefore both extreme cases. One uses all training examples
to update the weights, whereas the other uses only one. Between these two extremes
we find something called mini batch, and this is one of the main reasons why gradient
descent is still sometimes used. Imagine having a million training examples. If you
implement gradient descent using batch mode, you will have to do a lot of calculations
before any progress can be made, but you will know exactly in which direction to move
the weights. Online mode requires far fewer calculations before any weights are updated,
but here you have only considered one data point, and the direction suggested is not
likely to be the same as the one suggested by the whole batch. However, if you randomly
select a subset consisting of ten to a hundred thousand training examples (a mini batch),
the direction suggested by this mini batch will most likely point in approximately the
direction suggested by the batch, and you get this information at a fraction of the
cost that it would take to evaluate all data points. As Equation 3.9 includes a sum over all
data points, the presented gradient descent algorithm can be used very easily with mini
batches, as this only requires that the sum is restricted to a subset of the data points.
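Restricting the sum to a subset can be sketched as follows. The toy data (y = 1 + 2x without noise), the batch size of 100, η = 0.1, and the number of updates are assumptions made for this illustration; drawing a fresh random subset for every update is one possible way of forming mini batches, not necessarily the scheme used later in the text.

```matlab
% Mini-batch gradient descent for the linear neuron (sketch).
% Toy data, batch size, eta and number of updates are assumed values.
N = 1000;
x = linspace(-2, 2, N);
X = [ones(1, N); x];
Y = 1 + 2*x;
w = [0; 0];
eta = 0.1;
batchSize = 100;
for i = 1:300
    idx = randperm(N);
    idx = idx(1:batchSize);             % random subset of the data points
    Xb = X(:, idx);
    Eb = Y(idx) - w'*Xb;                % errors for the mini batch only
    dw = -eta * 1/batchSize * Xb*Eb';   % sum restricted to the mini batch
    w = w - dw;
end
```

Each update here touches only a tenth of the data, yet w still moves towards [1; 2] because every mini batch points in roughly the same direction as the full batch.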
4 Softmax regression
Unfortunately, softmax regression is a bit of a misnomer. We saw already in chapter 2
that the term regression is used for real-valued outputs, whereas the term classification
is used for labelled outputs. Softmax regression is, however, a classification algorithm
despite its name.
In softmax regression,6 each class (k) is represented by one neuron, and each neuron
represents the probability that a given input vector belongs to its corresponding class. The
vectors y and ŷ must therefore both sum to 1 (probabilities have to sum to 1). For y,
the class label is known and this vector therefore contains a 1 (indicating 100 % confidence)
for the correct class and zeros for the rest. For ŷ, the summation constraint is in turn
satisfied by using the following activation function:
$$\varphi(v_k) = \frac{e^{v_k}}{\sum_{j=1}^{K} e^{v_j}} \tag{4.1}$$
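The probability, given by our model, that an input vector belongs to class k is therefore
given by:
$$\hat{y}_k = p(\text{class} = k \mid \mathbf{x}) = \frac{e^{\mathbf{w}_k^{\top}\mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^{\top}\mathbf{x}}} \tag{4.2}$$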
In MATLAB, we can make use of built-in functions and matrix operations to calculate Ŷ
from V as:
    % Assuming V has dimensions K by N
    Yhat = exp(V) ./ ( ones(size(V,1),1) * sum(exp(V)) );
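One practical caveat, which is an addition to the text, is that exp(V) overflows for large induced fields. A common remedy is to subtract the column maximum from V before exponentiation, which leaves the resulting probabilities unchanged. The sketch below compares both variants on an assumed matrix V whose first column is deliberately large.

```matlab
% Softmax with the column maximum subtracted before exponentiation (sketch).
% V is assumed to be K by N, as in the text; the values are made up.
V = [1000 1; 1001 2; 1002 3];
K = size(V, 1);

% Direct formula from the text: exp(1000) overflows to Inf
YhatNaive = exp(V) ./ ( ones(K,1) * sum(exp(V)) );

% Shifted version: the largest entry in every column becomes 0
Vshift = V - ones(K,1) * max(V, [], 1);
Yhat = exp(Vshift) ./ ( ones(K,1) * sum(exp(Vshift)) );
```

Both columns of V differ only by a constant offset, so the shifted version returns the same probabilities for them, while the direct formula returns NaN for the first column.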
The summation over all induced fields in Equation 4.2 requires that each output is
connected to all the induced fields. Hence, we will represent softmax regression with the
structure in Figure 4.1.
6 Softmax regression is a generalization of the more common logistic regression algorithm to more
than two classes. This generalization is also known as multinomial logistic regression.
CHAPTER 4. SOFTMAX REGRESSION
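[Figure 4.1 diagram: inputs x0 = 1, x1, ..., xM connected through weights wmk to induced fields v1, ..., vK; every softmax output ŷ1, ..., ŷK is connected to all induced fields]
Figure 4.1: Softmax regression model for classification between K classes.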
4.1 Maximum likelihood
Softmax regression outputs probabilities for the different classes, and hence, weight selection
should be based upon how likely the observed combination of inputs and desired outputs
would be for a given set of weights. As an example, imagine that you are trying to
model human height by determining the mean (µ) and the standard deviation (σ), and
that you have been given a sample containing the heights of 1000 randomly selected
persons from the entire human population. Based upon this sample, we can calculate how
likely we are to observe it for different values of µ and σ with a likelihood function (L),
defined as L(µ, σ|sample). With this definition, the most logical choice of model
parameters is found at the maximum of L(µ, σ|sample); the task hence becomes
an optimization problem, which again can be solved using gradient descent.
Similarly, in softmax regression we are interested in finding the weights for our neurons
that would be the most likely ones given the available data, and for our case, the likelihood
that our model generated one data point is given by:
$$\mathcal{L}(\mathbf{w} \mid \mathbf{x}(n)) = \prod_{k=1}^{K} p_k(n)^{y_k(n)} \tag{4.3}$$
Assuming independence between data points, the likelihood function over X then becomes
the product over all observed data points:
$$\mathcal{L}(\mathbf{w} \mid \mathbf{X}) = \prod_{n=1}^{N} \prod_{k=1}^{K} p_k(n)^{y_k(n)} \tag{4.4}$$
Large products are, however, cumbersome to work with. A common trick is therefore to
work with the mean log likelihood function (l) instead, or the mean negative log likelihood
function7 as in this case. We obtain this function by simply taking the logarithm of
Equation 4.4 and multiplying the expression by −1/N, which gives:
$$l = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} y_k(n) \ln(p_k(n)) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} y_k(n) \ln(\hat{y}_k(n)) \tag{4.5}$$
7 The negative log likelihood function is used in order to transform a maximization problem into
a minimization problem. Multiplying a function by negative one flips its surface so that the previous
maximum becomes a minimum.
In order to avoid unnecessary loops, it is a lot faster to calculate l directly in MATLAB
using:
    % Mean negative log likelihood
    l = -1/N * sum(sum( Y.*log(Yhat) ));
Just like Eav for the linear regression problem, l has no suboptimal local minima in softmax
regression; following the gradient is therefore guaranteed to lead towards a global minimum.
Differentiating Equation 4.5 with respect to the model parameters is somewhat tricky. A
complete derivation can be found in Bishop (1995), but we will here just conclude that the
derivation in the end gives:
$$\frac{\partial l}{\partial w_{mk}} = -\frac{1}{N} \sum_{n=1}^{N} e_k(n)\, x_m(n) \tag{4.6}$$
where e_k(n) is defined as
$$e_k(n) = y_k(n) - \hat{y}_k(n) \tag{4.7}$$
Quite interestingly, Equation 4.6 has the same form as what we obtained for linear
regression in the previous chapter. At this point, we have all the information needed to
fit models, such as the one in Figure 4.1, to labelled data. Using gradient descent, the complete
procedure is summarized in Algorithm 2.
W ← random initial weights [-1 1]
i ← 1 {epoch counter}
Iteration loop for gradient descent
repeat
    l ← 0, ∆W ← 0 {reset the accumulators for each epoch}
    Loop over all training examples (replace with matrix operation in MATLAB)
    for n = 1 to N do
        v(n) ← W^T x(n)
        for k = 1 to K do
            ŷ_k(n) ← e^{v_k(n)} / Σ_{j=1}^{K} e^{v_j(n)}
        end for
        e(n) ← y(n) − ŷ(n)
        l(n) ← − Σ_{k=1}^{K} y_k(n) ln(ŷ_k(n))
        l ← l + (1/N) l(n)
        ∆W ← ∆W − (η/N) [x(n) e^T(n)]
    end for
    l(i) ← l
    Plot progress {check that l is decreasing}
    W ← W − ∆W
    i ← i + 1
until convergence {l is no longer decreasing}
Algorithm 2: Softmax regression using gradient descent.
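The procedure in Algorithm 2 can be sketched in vectorized form as shown below. The eight fixed data points, the zero initial weights, η = 0.25, and the 30 epochs are assumptions chosen so that the run is small and reproducible; they are not the data or settings used in the examples of this chapter.

```matlab
% Vectorized softmax regression with batch gradient descent (sketch).
% Two well separated classes with fixed, made-up points.
x1 = [ 2  3  2  3 -2 -3 -2 -3];
x2 = [ 2  2  3  3 -2 -2 -3 -3];
N = numel(x1);
X = [ones(1, N); x1; x2];                  % 3 by N
Y = [1 1 1 1 0 0 0 0; 0 0 0 0 1 1 1 1];    % 2 by N, one column per data point
W = zeros(3, 2);                           % assumed start, one column per class
eta = 0.25;
nEpochs = 30;
l = zeros(1, nEpochs);
for i = 1:nEpochs
    V = W' * X;                                            % induced fields, K by N
    Yhat = exp(V) ./ ( ones(size(V,1),1) * sum(exp(V)) );  % softmax outputs
    l(i) = -1/N * sum(sum( Y.*log(Yhat) ));                % mean negative log likelihood
    E = Y - Yhat;
    dW = -eta * 1/N * X*E';                                % 3 by 2, same shape as W
    W = W - dW;
end
[~, labels] = max(Yhat, [], 1);            % predicted class for each data point
```

For this toy problem l starts at ln 2 (all probabilities equal to 0.5) and decreases from there, and the predicted labels match the assigned classes.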
4.2 Gradient checking
Programming mistakes easily occur during implementation of the presented algorithms.
In the example we looked at in chapter 3, it was possible to confirm that our selected
parameter values moved in the intended direction by looking at how our position changed
on the Eav surface. As the dimensionality of X grows (more weights), this is no
longer possible, and we therefore need another method to verify our calculations. A nice
way of verifying that everything is working as intended is to compare the calculated partial
derivatives to approximate numerical estimates.
The partial derivative for each weight tells us how much Eav or l should change when
that weight is varied. We can easily verify this by adding or subtracting a small number
(κ) from a weight and calculating the induced change in Eav or l afterwards. To get a
less biased estimate, it is a good idea to take the average of the observed change when κ is both
added and subtracted. Mathematically, we can describe the estimated partial derivative as:
$$\text{estimated value for } \frac{\partial l}{\partial w_{mk}} = \frac{l(w^{+}_{mk}) - l(w^{-}_{mk})}{2\kappa} \tag{4.8}$$
where $w^{+}_{mk} = w_{mk} + \kappa$ and $w^{-}_{mk} = w_{mk} - \kappa$.
Equation 4.8 can be repeated for every weight, and the results can be compared to what
is obtained from either Equation 3.9 or Equation 4.6. For small values of κ (10^{-4}), the
estimated and computed partial derivatives should be identical for at least the first three or
four decimals.
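A minimal sketch of such a gradient check for softmax regression is given below. The data, labels, and weights are small made-up matrices, and the helper names softmax and nll are hypothetical; the calculated derivatives include the factor 1/N because l is defined as a mean over the data points.

```matlab
% Gradient check for softmax regression on a tiny, fixed problem (sketch).
X = [1 1 1 1; 0.5 -1.2 2.0 -0.3; 1.5 0.7 -0.4 -2.1];   % 3 by 4, first row ones
Y = [1 0 1 0; 0 1 0 1];                                % 2 by 4 label matrix
W = [0.2 -0.5; -0.1 0.3; 0.4 0.6];                     % 3 by 2 weights
N = size(X, 2);
softmax = @(V) exp(V) ./ ( ones(size(V,1),1) * sum(exp(V)) );
nll = @(W) -1/N * sum(sum( Y.*log(softmax(W'*X)) ));

% Calculated partial derivatives, arranged like W
G = -1/N * X*(Y - softmax(W'*X))';

% Estimated partial derivatives (central differences, one weight at a time)
kappa = 1e-4;
Gest = zeros(size(W));
for m = 1:size(W, 1)
    for k = 1:size(W, 2)
        Wp = W; Wp(m,k) = Wp(m,k) + kappa;
        Wm = W; Wm(m,k) = Wm(m,k) - kappa;
        Gest(m,k) = ( nll(Wp) - nll(Wm) ) / (2*kappa);
    end
end
maxDiff = max(abs(G(:) - Gest(:)));
```

If maxDiff is not small, the analytical gradient and the cost function are inconsistent with each other, which usually points to a sign error, a missing transpose, or a missing 1/N.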
4.3 2 class example
Let's start off with a two class example so we can get a feel for how softmax regression works.
Our problem consists of learning to correctly classify the 100 data points in Figure 4.3a.
These data points are generated by the processes:
$$\text{Class 1: } \begin{cases} x_1 = 2 + r \\ x_2 = 2 + s \end{cases} \qquad \text{Class 2: } \begin{cases} x_1 = -2 + r \\ x_2 = -2 + s \end{cases} \tag{4.9}$$
where both r and s are normally distributed random numbers with mean 0 and standard
deviation 1. After adding a row of ones to X, its dimensions are 3 by 100. Labels are
coded using the desired outputs as:
$$\mathbf{y} = \begin{cases} [1\ 0]^{\top} & \text{if class} = 1 \\ [0\ 1]^{\top} & \text{if class} = 2 \end{cases} \tag{4.10}$$
Hence, the model that we will try to fit looks like the one illustrated in Figure 4.2. Even if
Equation 4.5 provides an error measure, it is still interesting to know how often data points
are classified incorrectly. For this purpose, we introduce a classification error ratio (C) as:
$$C = \frac{1}{N} \sum_{n=1}^{N} \begin{cases} 0 & \text{if the predicted class is equal to the assigned class} \\ 1 & \text{otherwise} \end{cases} \tag{4.11}$$
[Figure 4.2 diagram: inputs x0 = 1, x1, x2 connected through weights wmk to induced fields v1, v2 and softmax outputs ŷ1, ŷ2]
Figure 4.2: Softmax regression model for a two dimensional two class problem.
When calculating C in MATLAB, we can make use of built-in relational operators,
but first we need a vector with class labels for both Y and Ŷ. As our model outputs
probabilities, we would like to assign a data point to the class with the highest probability.
With our matrix definitions, this corresponds to selecting the row in Y or Ŷ with the
maximum value in each column, and taking the row number as the defined or predicted
label for each data point. In the end, we can implement all of this using the following three
lines of code in MATLAB:
    % Labels from Y
    [~, Y_labels] = max(Y, [], 1);
    % Labels from Yhat
    [~, Yhat_labels] = max(Yhat, [], 1);
    % The classification error ratio
    C = mean( Y_labels ~= Yhat_labels );
Before running the procedure in Algorithm 2, we will check that our implementation is
correct by comparing the calculated and estimated gradient. For random initial weights [-1
1], this comparison showed that the calculated and estimated partial derivatives are
identical for the first four decimals (see Table 4.1).
Table 4.1: Comparison between calculated and estimated partial derivatives for random
initial weights [-1 1] and with κ = 10^{-4}
              ∂l/∂w01   ∂l/∂w02   ∂l/∂w11   ∂l/∂w12   ∂l/∂w21   ∂l/∂w22
Calculated    -0.0142    0.0142   -0.3747    0.3747   -0.4990    0.4990
Estimated     -0.0142    0.0142   -0.3747    0.3747   -0.4990    0.4990
Figure 4.3b illustrates training progress for 30 epochs while running the procedure in
Algorithm 2 with η = 0.25. After these 30 epochs l is hardly decreasing any more, and
it was concluded that the algorithm had converged. The decision surface found by the
algorithm is illustrated in Figure 4.3c together with a decision surface found using linear
regression in Figure 4.3d. Linear regression is clearly not suited for classification tasks,
whereas softmax regression is meant to handle labelled data.
4.4 Training and testing
Our hope when training a model is that it will be able to generalize to unseen data.
However, as one starts to fit models with more and more weights, there is always a risk that
the model starts to learn patterns that are only found in the data used for training. This
phenomenon is called overfitting, and it can make the model perform significantly worse
on unseen data. Even if overfitting has not occurred, models normally perform worse on
unseen data because it might contain patterns that were not present in the data used for
training. Model performance on training data is therefore not a good estimate of how well
the model will perform. Available data is therefore normally divided into a training set
and a test set. The training set is used for training the model, whereas the test set is used
for getting an unbiased measure of how well the model performs. A common division is to
randomly select around 80 % of the data for training and leave the rest for testing.
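Such a random division can be sketched with randperm, assuming that data points are stored column-wise as elsewhere in this material. The placeholder data and the variable names Xtrain, Ytrain, Xtest, and Ytest are assumptions made for this illustration.

```matlab
% Random 80/20 split of a data set stored column-wise (sketch).
N = 100;
X = [ones(1, N); rand(2, N)];                  % placeholder inputs, 3 by N
Y = [ones(1, N/2) zeros(1, N/2); ...
     zeros(1, N/2) ones(1, N/2)];              % placeholder labels, 2 by N
idx = randperm(N);                             % random ordering of the data points
nTrain = round(0.8 * N);
trainIdx = idx(1:nTrain);
testIdx  = idx(nTrain+1:end);
Xtrain = X(:, trainIdx);   Ytrain = Y(:, trainIdx);
Xtest  = X(:, testIdx);    Ytest  = Y(:, testIdx);
```

Using the same index vectors for X and Y keeps every input column paired with its label, and no data point ends up in both sets.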
[Figure 4.3 panels: (a) Class 1 and Class 2 data points in the (x1, x2) plane; (b) negative log likelihood and classification error against training epoch; (c) and (d) decision surfaces over the (x1, x2) plane]
Figure 4.3: a) 50 data points from both class 1 and class 2 generated in accordance with
Equation 4.9. b) Training results with random initial weights [-1 1] and η = 0.25. The blue
line represents l and the dashed blue line is the classification error as given by Equation 4.11.
c) Decision surface for the softmax regression network trained in Figure 4.3b. d) Decision
surface for a linear regression model trained on the data in Figure 4.3a.
4.5 MNIST
Different people and groups all over the world develop new algorithms or improve existing ones,
and it is often difficult to know which ones are the best. However, there exist several
“famous” datasets that people try out their algorithms on, and comparisons are hence
possible. One well-known dataset is the MNIST database of handwritten digits (LeCun,
Bottou, Bengio, & Haffner, 1998). This dataset contains 60 000 images (20 by 20 pixels)
that are to be used for training and 10 000 more for testing. As can be noted in Benenson
(2013), the best reported classification error ratio on the test set is 0.21 %. It could here
also be noted that deep learning networks are currently at the top for both this and the
other image classification datasets.
The MNIST datasets can be freely downloaded from the MNIST homepage, and a
function for reading them into MATLAB can be found on MATLAB Central. Figure 4.4b
illustrates 16 images taken from the training set.
Data in image format have to be concatenated into vector format before we can fit a
model using softmax regression. This is accomplished by simply taking each column in the
image matrix and placing them on top of each other to form a vector x with dimensionality
400. As before, we also have to add a row of ones; the dimensions for Xtraining then become
401 by 60 000. A similar process for the test set gives Xtest with dimensions 401 by 10 000.
Labels are coded using the desired outputs as:
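$$\mathbf{y} = \begin{cases} [1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0]^{\top} & \text{if digit} = 1 \\ [0\ 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0]^{\top} & \text{if digit} = 2 \\ \qquad\vdots & \qquad\vdots \\ [0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 1]^{\top} & \text{if digit} = 0 \end{cases} \tag{4.12}$$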
Ytraining and Ytest therefore have the dimensions 10 by 60 000 and 10 by 10 000 respectively.
In the end, this means that the model we will try to fit looks like the one illustrated in
Figure 4.4a. Training results after 100 epochs of gradient descent using random initial
weights [-1 1], η = 2, and mini batches of 20 % are shown in Figure 4.4c. Running gradient
descent for additional epochs could improve the results slightly, but overall it looks like the
algorithm has converged, and the final results are summarized in Table 4.2.
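The column stacking of the images and the label coding in Equation 4.12 can be sketched as follows. The variable names images and digitLabels are hypothetical, and the random pixel values below only stand in for real MNIST data, which would come from whatever reader function is used.

```matlab
% Stacking images into columns and coding labels as in Equation 4.12 (sketch).
% 'images' (20 by 20 by N) and 'digitLabels' (1 by N) are hypothetical names.
N = 5;
images = rand(20, 20, N);               % stand-in for N images of 20 by 20 pixels
digitLabels = [1 2 0 9 5];              % stand-in for the corresponding digits
X = reshape(images, 400, N);            % image columns stacked on top of each other
X = [ones(1, N); X];                    % add the row of ones, giving 401 by N
rows = digitLabels;
rows(rows == 0) = 10;                   % digit 0 is coded in the last row
Y = zeros(10, N);
Y(sub2ind(size(Y), rows, 1:N)) = 1;     % a single 1 per column
```

Because MATLAB stores matrices column by column, reshape performs exactly the stacking described above, so no loop over the images is needed.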
[Figure 4.4 panels: (a) network diagram with inputs x0 = 1, x1, ..., x400 connected through weights wmk to induced fields v1, ..., v10 and softmax outputs ŷ1, ..., ŷ10; (b) sample images; (c) negative log likelihood and classification error against training epoch, for Training and Test]
Figure 4.4: a) Softmax regression model for classifying the MNIST data set. b) 16 images
from the MNIST training set. c) Training progress, on the MNIST dataset, using softmax
regression with random initial weights [-1 1], η = 2, and mini batches of size 20 %. Solid lines
represent l and the dashed line is C.
Table 4.2: Numerical summary of the final training results from Figure 4.4c
ltraining    ltest     Ctraining    Ctest
0.4561       0.4695    0.1051       0.1053

Correctly classified training images      53696
Incorrectly classified training images     6304
Correctly classified test images           8947
Incorrectly classified test images         1053
4.6 Restrictions for linear models
Both linear and softmax regression represent artificial neural networks with a single layer
of neurons. This simplicity has the advantage that both problems are convex,8 but it also
brings restrictions on what is possible to achieve. Linear regression can only fit hyperplanes
to the observed data, and softmax regression can only separate classes using hyperplanes.
Softmax regression can therefore only classify all data points correctly if the data is linearly
separable. In two dimensions, linearly separable means that the classes can be
separated by a straight line. A simple example that is not linearly separable is the XOR
problem.
Important
Softmax regression can only try to separate classes using hyperplanes as boundaries and can therefore only obtain C = 0 on linearly separable problems.
If the models found using linear or softmax regression are unsatisfactory, more complex
models will have to be used. One such example is the multilayer perceptron network
presented in the next chapter.
8 For convex problems every local minimum is also a global minimum, and hence gradient descent
(with a suitably small η) is guaranteed to approach the global minimum of either Eav or l.
5 Multilayer Perceptrons
The previous chapter ended with a discussion about limitations for one layer networks. It
turns out that we can get around these limitations by adding another layer of neurons.
This gives us a Multilayer Perceptron (MLP) network (another misnomer unfortunately)9
and an example is given in Figure 5.1. Previously, we only had an input layer with inputs
and an output layer with neurons; but now we have added a hidden layer with neurons in
between these two layers (one could also add more than one hidden layer). The outputs of the neurons in
the hidden layer now function as inputs to the output layer. It is, however, important
that neurons in the hidden layer have a non-linear activation function. If not, the network
can be collapsed into a single layer and there is no gain in using a hidden layer. At
the same time, the activation functions must be differentiable; otherwise, as we shall see,
we will not be able to train the network using gradient descent.
Important
Neurons located in hidden layers must have differentiable non-linear activation
functions.
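The claim that a linear hidden layer brings no gain can be checked numerically. In the sketch below, the layer sizes and random weights are arbitrary assumptions, and the names Wi, Wii, and Weq are chosen for this illustration, with the first row of each weight matrix holding the bias weights. A two-layer network with an identity activation in the hidden layer is compared against a single layer whose weights are built from the two weight matrices.

```matlab
% Two layers with a linear (identity) hidden activation collapse into one (sketch).
M = 3; Ki = 4; Kii = 2; N = 6;             % arbitrary sizes for the illustration
X   = [ones(1, N); rand(M, N)];            % (M+1) by N, first row ones
Wi  = rand(M+1, Ki) - 0.5;                 % hidden layer weights
Wii = rand(Ki+1, Kii) - 0.5;               % output layer weights, first row is the bias

% Two-layer forward pass with identity activation in the hidden layer
Yi  = Wi' * X;
Vii = Wii' * [ones(1, N); Yi];

% Equivalent single layer acting directly on X
Weq = Wi * Wii(2:end, :);
Weq(1, :) = Weq(1, :) + Wii(1, :);         % absorb the bias of the second layer
Vsingle = Weq' * X;

maxDiff = max(abs(Vii(:) - Vsingle(:)));   % zero up to rounding errors
```

Replacing the identity with tanh in the hidden layer breaks this equivalence, which is why the non-linearity is needed.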
[Figure 5.1 diagram: input layer x0 = 1, x1, ..., xM; hidden layer ŷ^i_0 = 1, ŷ^i_1, ..., ŷ^i_{K^i}, reached through weights w^i_{mk^i}; output layer ŷ^ii_1, ..., ŷ^ii_{K^ii}, reached through weights w^ii_{k^i k^ii}]
Figure 5.1: MLP neural network with one hidden layer (Roman numerals are used to indicate
depth).
9 The perceptron is actually a type of learning algorithm for linear classifiers and unfortunately
does not have anything to do with MLP networks.
CHAPTER 5. MULTILAYER PERCEPTRONS
[Figure 5.2 panels: (a) ϕ(v) against v; (b) ϕ′(v) against v; both for v between −3 and 3]
Figure 5.2: a) The hyperbolic tangent function. b) The derivative of the hyperbolic tangent
function.
What are then suitable activation functions? A common choice that fulfils the requirements is the hyperbolic tangent function. This function and its derivative are shown in
Figure 5.2a and Figure 5.2b, and they can be represented mathematically as:
$$\varphi_{\tanh}(v) = \tanh(v) = -1 + \frac{2}{1 + e^{-2v}} \tag{5.1}$$
$$\varphi'_{\tanh}(v) = \operatorname{sech}^2(v) = \frac{4e^{2v}}{(e^{2v} + 1)^2} \tag{5.2}$$
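Both expressions are easy to check numerically. The sketch below evaluates Equation 5.1 and Equation 5.2 on an assumed grid of v values and compares them with MATLAB's built-in tanh, with the identity sech²(v) = 1 − tanh²(v), and with a central difference estimate of the derivative.

```matlab
% Numerical check of Equation 5.1 and Equation 5.2 (sketch).
v = linspace(-3, 3, 13);                     % assumed grid of induced fields
phiEq  = -1 + 2 ./ (1 + exp(-2*v));          % Equation 5.1
dphiEq = 4*exp(2*v) ./ (exp(2*v) + 1).^2;    % Equation 5.2
kappa = 1e-5;
dphiNum = ( tanh(v + kappa) - tanh(v - kappa) ) / (2*kappa);

errPhi  = max(abs(phiEq - tanh(v)));
errId   = max(abs(dphiEq - (1 - tanh(v).^2)));
errDiff = max(abs(dphiEq - dphiNum));
```

The identity sech²(v) = 1 − tanh²(v) is convenient in practice, since the derivative can then be obtained from hidden layer outputs that have already been computed.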
Regardless of whether we are performing regression or classification, we will here always
use the hyperbolic tangent function as activation function for all neurons in the hidden
layer. The output layer, on the other hand, will consist of either a linear regression neuron
or softmax regression neurons. We can therefore look upon the hidden layer as a non-linear
projection of the inputs before performing linear or softmax regression. Thinking about
the XOR problem, our objective then becomes to select weights for the hidden layer so
that the classes, after the non-linear projection, become linearly separable.
With multiple layers we have to calculate the outputs sequentially. That is, we first
determine the outputs for the neurons in the hidden layer, whereupon we determine the
outputs for the neurons in the output layer. Similarly, just as we always add a row of ones
to X before calculating Ŷi, we also have to add a row of ones to Ŷi before calculating Ŷii.
In MATLAB, this is done by:
    % Calculating the locally induced field in the hidden layer
    % Please note that Wi is transposed!
    Vi = Wi'*X;
    % Calculating outputs from the hidden layer
    YiHat = tanh(Vi);

    % Adding a row of ones to YiHat
    YiHat = [ones(1,size(YiHat,2)); YiHat];

    % Calculating induced fields in the output layer
    % Please note that Wii is transposed!
    Vii = Wii'*YiHat;

    % phi(Vii) is determined by the output layer type
    YiiHat = phi(Vii);
Now that we know how to calculate the output from the network, the next step is to
look at how we should update the weights.
5.1 Backpropagation
As we will use either a linear regression neuron or softmax neurons in the output layer,
the layer (ii) weights can be updated as before using the partial derivatives from either
Equation 3.9 or Equation 4.6. However, the output layer now obtains its inputs from the
hidden layer; we therefore have to change $x_m(n)$ to $\hat{y}^{i}_{k^{i}}(n)$ so that we get:
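$$\frac{\partial E_{av}}{\partial w^{ii}_{k^{i}}} = -\frac{1}{N} \sum_{n=1}^{N} \hat{y}^{i}_{k^{i}}(n)\, e(n) \tag{5.3}$$
$$\frac{\partial l}{\partial w^{ii}_{k^{i}k^{ii}}} = -\frac{1}{N} \sum_{n=1}^{N} \hat{y}^{i}_{k^{i}}(n)\, e_{k^{ii}}(n) \tag{5.4}$$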
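Getting the partial derivatives for the layer (i) weights is a bit trickier, but, assuming
we are using a linear regression neuron in the output layer, we can again expand the partial
derivatives for E(n) using the chain rule as:
$$\frac{\partial E(n)}{\partial w^{i}_{mk^{i}}(n)} = \frac{\partial E(n)}{\partial e(n)} \frac{\partial e(n)}{\partial \hat{y}^{ii}(n)} \frac{\partial \hat{y}^{ii}(n)}{\partial v^{ii}(n)} \frac{\partial v^{ii}(n)}{\partial \hat{y}^{i}_{k^{i}}(n)} \frac{\partial \hat{y}^{i}_{k^{i}}(n)}{\partial v^{i}_{k^{i}}(n)} \frac{\partial v^{i}_{k^{i}}(n)}{\partial w^{i}_{mk^{i}}(n)}$$
where
$$\begin{aligned}
\frac{\partial E(n)}{\partial e(n)} &= e(n) \\
\frac{\partial e(n)}{\partial \hat{y}^{ii}(n)} &= -1 \\
\frac{\partial \hat{y}^{ii}(n)}{\partial v^{ii}(n)} &= 1 \\
\frac{\partial v^{ii}(n)}{\partial \hat{y}^{i}_{k^{i}}(n)} &= w^{ii}_{k^{i}} \\
\frac{\partial \hat{y}^{i}_{k^{i}}(n)}{\partial v^{i}_{k^{i}}(n)} &= \operatorname{sech}^2\!\left[v^{i}_{k^{i}}(n)\right] \\
\frac{\partial v^{i}_{k^{i}}(n)}{\partial w^{i}_{mk^{i}}(n)} &= x_m(n)
\end{aligned}$$
therefore
$$\frac{\partial E(n)}{\partial w^{i}_{mk^{i}}(n)} = -x_m(n)\, \operatorname{sech}^2\!\left[v^{i}_{k^{i}}(n)\right] e(n)\, w^{ii}_{k^{i}}$$
and (using Equation 3.8)
$$\frac{\partial E_{av}}{\partial w^{i}_{mk^{i}}} = -\frac{1}{N} \sum_{n=1}^{N} x_m(n)\, \operatorname{sech}^2\!\left[v^{i}_{k^{i}}(n)\right] e(n)\, w^{ii}_{k^{i}} \tag{5.5}$$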
Differentiating l is more difficult, but we again obtain an almost identical result (you
can verify it using gradient checking). As shown in Equation 5.6, the only difference is a
sum over all the neurons in the output layer.
$$\frac{\partial l}{\partial w^{i}_{mk^{i}}} = -\frac{1}{N} \sum_{n=1}^{N} x_m(n)\, \operatorname{sech}^2\!\left[v^{i}_{k^{i}}(n)\right] \sum_{k^{ii}=1}^{K^{ii}} e_{k^{ii}}(n)\, w^{ii}_{k^{i}k^{ii}} \tag{5.6}$$
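A sketch of where the extra sum comes from is given below. It assumes the standard result for softmax outputs combined with the log likelihood cost, namely that the derivative of the per-example cost l(n) with respect to an induced field in the output layer is $-e_{k^{ii}}(n)$; this intermediate step is not shown in the text. Since a hidden neuron feeds every output neuron, the chain rule has to collect one contribution per output neuron:

```latex
\frac{\partial l(n)}{\partial w^{i}_{mk^{i}}}
  = \sum_{k^{ii}=1}^{K^{ii}}
    \underbrace{\frac{\partial l(n)}{\partial v^{ii}_{k^{ii}}(n)}}_{-e_{k^{ii}}(n)}
    \underbrace{\frac{\partial v^{ii}_{k^{ii}}(n)}{\partial \hat{y}^{i}_{k^{i}}(n)}}_{w^{ii}_{k^{i}k^{ii}}}
    \underbrace{\frac{\partial \hat{y}^{i}_{k^{i}}(n)}{\partial v^{i}_{k^{i}}(n)}}_{\operatorname{sech}^2\left[v^{i}_{k^{i}}(n)\right]}
    \underbrace{\frac{\partial v^{i}_{k^{i}}(n)}{\partial w^{i}_{mk^{i}}}}_{x_m(n)}
  = -x_m(n)\,\operatorname{sech}^2\!\left[v^{i}_{k^{i}}(n)\right]
    \sum_{k^{ii}=1}^{K^{ii}} e_{k^{ii}}(n)\, w^{ii}_{k^{i}k^{ii}}
```

Averaging this expression over the N data points gives Equation 5.6.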
From both Equation 5.5 and Equation 5.6 we see that the observed error, in the output
layer, is propagated backwards along the connecting weights to the hidden layer; it is this
mechanism that has given the backpropagation learning algorithm its name. At the same
time, we should note that the derivative of the activation function for the neurons in the
hidden layer is present in both equations. This is then the reason for the previous statement
that the selected activation function must be differentiable. Finally, when summarizing all
the above, we obtain the MLP learning method in Algorithm 3.
Wi ← random initial weights
Wii ← random initial weights
i ← 1 {epoch counter}
Iteration loop for gradient descent
repeat
    Eav ← 0 or l ← 0, ∆Wi ← 0, ∆Wii ← 0 {reset the accumulators for each epoch}
    Loop over all training examples (replace with matrix operation in MATLAB)
    for n = 1 to N do
        Forward pass
        vi(n) ← Wi^T x(n)
        ŷi(n) ← tanh(vi(n)) {then prepend ŷi_0(n) = 1}
        vii(n) ← Wii^T ŷi(n)
        ŷii(n) ← ϕ(vii(n)) {select activation function based on used output layer}
        Error
        e(n) ← y(n) − ŷii(n)
        if Regression then
            E(n) ← (1/2) e²(n)
            Eav ← Eav + (1/N) E(n)
        else
            l(n) ← − Σ_{kii=1}^{Kii} y_kii(n) ln(ŷii_kii(n))
            l ← l + (1/N) l(n)
        end if
        Backward pass
        ∆Wii ← ∆Wii − (η/N) ŷi(n) e^T(n)
        ∆Wi ← ∆Wi − (η/N) x(n) [sech²(vi(n)) ∘ (Wii e(n))]^T {∘ is the elementwise product; the bias row of Wii is excluded}
    end for
    Eav(i) ← Eav or l(i) ← l {regression or classification}
    Plot progress {check that l or Eav is decreasing}
    Wii ← Wii − ∆Wii
    Wi ← Wi − ∆Wi
    i ← i + 1
until convergence {l or Eav is no longer decreasing}
Algorithm 3: Training MLP networks with gradient descent.
Earlier, we have seen how we can implement the forward pass and the error calculations
in MATLAB without having to do a for loop over the whole training set. As before, we
can also do the backward pass without a for loop using:
    % Error signal
    E = Y - YiiHat;
    % Weight change for layer (ii)
    dWii = -eta * 1/N * YiHat*E';
    % Weight change for layer (i)
    dWi = -eta * 1/N * X*( (Wii(2:end,:)*E) .* sech(Vi).^2 )';

    % Updating weights
    Wi = Wi - dWi;
    Wii = Wii - dWii;
regardless of whether the output layer consists of a linear regression neuron or of softmax
regression neurons. At this stage, we are ready to start solving problems using
MLP networks.
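Before training, the backward pass can be verified with gradient checking. The sketch below does this for a small network with two hidden tanh neurons and two softmax outputs; the inputs, labels, and weights are made-up values, and the helper names softmax, hidden, and nll are hypothetical. The calculated derivatives are the weight changes from the listing above divided by η.

```matlab
% Gradient check of the backward pass for a small MLP with softmax outputs (sketch).
X   = [1 1 1 1; 0 0 1 1; 0 1 0 1];           % 3 by 4, made-up inputs with a row of ones
Y   = [1 0 0 1; 0 1 1 0];                    % 2 by 4 label matrix
Wi  = [0.3 -0.2; 0.5 0.4; -0.6 0.1];         % 3 by 2, layer (i) weights
Wii = [0.1 -0.3; 0.7 0.2; -0.4 0.5];         % 3 by 2, layer (ii) weights, first row is the bias
N = size(X, 2);
softmax = @(V) exp(V) ./ ( ones(size(V,1),1) * sum(exp(V)) );
hidden  = @(Wi) [ones(1, N); tanh(Wi'*X)];
nll     = @(Wi, Wii) -1/N * sum(sum( Y.*log(softmax(Wii'*hidden(Wi))) ));

% Calculated partial derivatives (forward pass followed by backward pass)
Vi = Wi'*X;
YiHat = hidden(Wi);
E = Y - softmax(Wii'*YiHat);
Gii = -1/N * YiHat*E';
Gi  = -1/N * X*( (Wii(2:end,:)*E) .* sech(Vi).^2 )';

% Estimated partial derivatives for layer (i) using central differences
kappa = 1e-4;
GiEst = zeros(size(Wi));
for m = 1:numel(Wi)
    Wp = Wi; Wp(m) = Wp(m) + kappa;
    Wm = Wi; Wm(m) = Wm(m) - kappa;
    GiEst(m) = ( nll(Wp, Wii) - nll(Wm, Wii) ) / (2*kappa);
end
maxDiff = max(abs(Gi(:) - GiEst(:)));
```

The same loop can be repeated over the elements of Wii and compared with Gii to cover the output layer as well.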
5.2 The XOR problem
The XOR problem is the simplest non-linearly separable problem, and it is therefore a nice
starting point for trying to understand the capabilities of MLP networks. In order to solve
the problem, we will try to fit the model in Figure 5.3 to the data in Figure 5.4a.
Running Algorithm 3 for 100 epochs with random initial weights [−1 1] and η = 1.5
resulted in the progress shown in Figure 5.4b. We stated earlier that, in order to solve
the problem, the hidden layer has to perform a non-linear projection that makes the
data linearly separable. Figure 5.4c shows the actual projections performed by the trained
network, and as can be seen, the classes are now linearly separable in the space spanned
by ŷ^i_1 and ŷ^i_2. This makes it possible for the softmax regression neurons in the output layer
to learn a model that can classify all the data points correctly. We can also see from the
dotted line in Figure 5.4b that this is indeed the case, and furthermore, Figure 5.4d shows
how the network generalizes by plotting its decision surface.
The addition of a hidden layer has also brought with it an unwanted consequence:
the surface formed by Eav(W) or l(W) is no longer convex. This means that there is no
guarantee that the gradient will guide us towards a global minimum. In Figure 5.4e and
Figure 5.4f, both the training progress and the decision surface are shown for a case when
the algorithm got stuck in a suboptimal minimum. This is unfortunately a built-in feature
of gradient based algorithms and, therefore, a problem that we have to be aware of.
[Figure 5.3 diagram: input layer x0 = 1, x1, x2; hidden layer ŷ^i_0 = 1, ŷ^i_1, ŷ^i_2, reached through weights w^i_{mk^i}; output layer ŷ^ii_1, ŷ^ii_2, reached through weights w^ii_{k^i k^ii}]
Figure 5.3: MLP network used for solving the XOR problem (Roman numerals are used to
indicate depth).
[Figure 5.4 panels: (a) Class 1 and Class 2 in the (x1, x2) plane; (b) and (e) negative log likelihood and classification error against training epoch; (c) Class 1 and Class 2 in the (ŷ^i_1, ŷ^i_2) plane; (d) and (f) y, ŷ^ii_2 over the (x1, x2) plane]
Figure 5.4: a) The XOR problem. b) Training progress for the XOR problem after 100
epochs using Algorithm 3 with 2 hidden neurons, 2 softmax output neurons, random initial
weights [-1 1], and η = 1.5. c) Learned projections by the hidden layer after training (the two
data points belonging to class 1 actually lie on top of each other). d) Decision surface for the
trained MLP network. e) Training progress for a network stuck in a suboptimal minimum. f)
Decision surface corresponding to a suboptimal minimum.
5.3 Non-linear regression
In the previous case, we saw that the hidden layer could learn to perform a non-linear
mapping that turned a previously non-linearly separable problem into a linearly separable
one. An interesting question is therefore whether the hidden layer could also learn to perform a
mapping that maps non-linear data onto a hyperplane. We know from chapter 3 that
a linear regression output neuron can fit hyperplanes. So, if the hidden layer could map
non-linear input data onto a hyperplane, it would be possible for a linear regression output
neuron to model this hyperplane. In order to test this, we will fit the simple model in
Figure 5.5a to the data points shown in Figure 5.5b. These data points have been generated
by the non-linear process:
$$y = \cos x \frac{\pi}{5} + r \tag{5.7}$$
where r is a normally distributed random variable with mean zero and standard deviation
0.25. Training the network, using Algorithm 3, for 300 epochs with random initial weights
in the interval [−1 1] and η = 0.25 resulted in Figure 5.5c.
[Figure 5.5 panels: (a) network diagram with input layer x0 = 1, x1; hidden layer ŷ^i_0 = 1, ŷ^i_1, ŷ^i_2; output ŷ^ii_1; (b) y against x with legend entries Training data, Original function; (c) average error energy against training epoch]
Figure 5.5: a) MLP network used for non-linear regression (Roman numerals are used to
indicate depth). b) 100 data points generated by Equation 5.7 together with a blue line
representing the same process with its stochastic element excluded. c) Training results for
fitting the model in Figure 5.5a to the data in Figure 5.5b, using Algorithm 3 with η = 0.25
and starting from random initial weights in the interval [−1 1].
When we plot the outputs from the hidden layer (ŷ i ) for each x value together with the
desired outputs (y), we notice that the hidden layer has indeed learned to map the inputs
onto a hyperplane. This is shown in Figure 5.6a together with the hyperplane learned
by the linear regression output neuron. Finally, in Figure 5.6b the learned model for the
whole network is illustrated together with the training data.
[Figure 5.6 panels: (a) y, ŷ^ii over the (ŷ^i_1, ŷ^i_2) plane; (b) y, ŷ^ii against x; legend entries: Training data, Original function, Trained network, Projected data]
Figure 5.6: a) All data points in Figure 5.5b, mapped to the space spanned by the neurons
in the hidden layer. The hidden neurons in the trained network have here learned to map
non-linear input data onto a hyperplane that the output neuron in turn has learned to represent.
b) The model obtained by the MLP network together with the data used for training.
5.4
MNIST revisited
In chapter 4, we obtained a classification error ratio of just above 0.1 on the MNIST test
set using softmax regression. As softmax regression can only separate linearly separable
classes, it makes sense to assume that we should be able to obtain a better classification
error ratio with an MLP network. In order to test this, we will attempt to fit the model
in Figure 5.7 to the MNIST training set. Notice that this model actually contains 1000
hidden neurons.
Starting from random initial weights in the interval [−10⁻⁴, 10⁻⁴], Algorithm 3 produces
the learning progress in Figure 5.8a with η = 0.5 and mini batches of size 0.2. After 300
epochs it still looks like both l and C are decreasing, but at this stage C_test has already
decreased to 0.036, which is far better than what we obtained with softmax regression alone.
A complete summary of the results after 300 epochs is given in Table 5.1, and Figure 5.8b
illustrates 16 images that were wrongly classified in the test set.
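As a rough illustration of the setup described above, the following MATLAB sketch draws initial weights uniformly from the stated interval for a network with the layer sizes shown in Figure 5.7 and pushes a small batch through it. The tanh hidden layer and the softmax output layer are assumptions carried over from the XOR code in Appendix A, and the random batch is only a stand-in for real MNIST images. The last line reproduces the test error ratio from the image counts in Table 5.1.

```matlab
% Layer sizes as read from Figure 5.7 (400 inputs, 1000 hidden, 10 outputs)
M = 400; H = 1000; K = 10;
a = 1e-4;                               % half-width of the initial weight interval

% Uniform random initial weights in [-a, a]; one extra row for the bias input
Wi  = 2*a*rand(M+1, H) - a;
Wii = 2*a*rand(H+1, K) - a;

% Stand-in batch of 5 "images" (random pixels, NOT real MNIST data)
nBatch = 5;
X = [ones(1, nBatch); rand(M, nBatch)];

% Forward pass (assumed tanh hidden layer and softmax output layer)
YiHat  = [ones(1, nBatch); tanh(Wi' * X)];
Vii    = Wii' * YiHat;
YiiHat = exp(Vii) ./ (ones(K, 1) * sum(exp(Vii), 1));

% Classification error ratio from the test image counts in Table 5.1
Ctest = 360 / (9640 + 360);
```

Each column of YiiHat sums to one, so it can be read as class probabilities for one image.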
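[Figure 5.7: network diagram with an input layer (x_0 = 1, x_1, ..., x_400), a hidden layer (ŷ_0^i = 1, ŷ_1^i, ..., ŷ_1000^i) and an output layer (ŷ_1^ii, ..., ŷ_10^ii), connected by the weights w^i_{mk^i} and w^ii_{k^i k^ii}.]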
Figure 5.7: MLP network used for classifying the MNIST digits (Roman numerals are used to
indicate depth).
[Figure 5.8: panel (a) plots the negative log likelihood and the classification error against the training epoch (0 to 300) for the training and test sets; panel (b) shows wrongly classified test images.]
Figure 5.8: a) Fitting the model in Figure 5.7 to the MNIST training set using Algorithm 3
with η = 0.5, mini batches of size 0.2, and randomly generated initial weights in the interval
[−10⁻⁴, 10⁻⁴]. b) 16 out of 360 wrongly classified images from the test set.
Table 5.1: Numerical summary of the final training results from Figure 5.8a

l_training   l_test   C_training   C_test
0.1056       0.1243   0.0291       0.0360

Correctly classified training images     58254
Incorrectly classified training images   1746
Correctly classified test images         9640
Incorrectly classified test images       360

5.5 Deeper architectures
So far in this chapter we have seen that the incorporation of a hidden layer gives us
non-linear models, but with the additional complexity that both l and Eav now have
multiple minima. An interesting question to ask is, therefore, whether there are any
constraints on what MLP networks can model. Hornik, Stinchcombe, and White (1989) have
answered this question by proving that an MLP network, with only one hidden layer, can
approximate any function to any degree of accuracy. This sounds like good news,
but there is a catch. Even if an MLP network can approximate any function, the proof
says nothing about how many hidden neurons are needed, whether it is possible to learn such
a network using gradient-based methods, or, most importantly, whether such a model is an
efficient representation of the function. Many think that deeper networks with more layers
can give rise to more efficient representations, and this idea, with additional inspiration
from the human brain (remember that hierarchical structures are thought to exist in the
cortex), has led people to try to train deeper neural networks. Methods that can successfully
train networks with more than two hidden layers exist today, and these are generally referred
to as “Deep Learning” methods (Bengio, 2009).
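To give a feeling for what "more layers" means in practice, here is a minimal MATLAB sketch of a forward pass through a network with an arbitrary number of layers. The layer sizes, the tanh hidden neurons, and the linear output neuron are all assumptions chosen for illustration; the sketch says nothing about how such a network should be trained.

```matlab
% Hypothetical layer sizes: 2 inputs, three hidden layers, 1 output
sizes = [2 4 4 4 1];
nLayers = numel(sizes) - 1;

% One weight matrix per layer, with an extra row for the bias input
W = cell(1, nLayers);
for d = 1:nLayers
    W{d} = 2*rand(sizes(d)+1, sizes(d+1)) - 1;   % uniform in [-1, 1]
end

% Forward pass for N data points stored as columns
N = 6;
A = rand(sizes(1), N);                  % stand-in input data
for d = 1:nLayers
    V = W{d}' * [ones(1, N); A];        % induced fields in layer d
    if d < nLayers
        A = tanh(V);                    % hidden layers: tanh (assumed)
    else
        A = V;                          % output layer: linear regression neuron
    end
end
Yhat = A;                               % 1 by N matrix of model outputs
```

Adding a layer only means adding one entry to sizes; the loop itself stays the same.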
6 What is next?
We have now seen how both regression and classification models can be visualized as
ANNs, and we have also seen that these models can be trained using gradient descent.
As you continue to learn more and more about machine learning, you will notice that this
is only the beginning. Several different types of ANNs exist as well as completely different
techniques. However, the presented material should have given you the basic knowledge to
continue your path with any of the following subjects.
Support Vector Machines: In chapter 5, we used an MLP network with depth
2 (a hidden and an output layer of neurons) and concluded that such a network is
capable of solving classification problems that are not linearly separable. We
also noticed that this added complexity came with a cost that prevented us from
finding optimal weights. Support Vector Machines (SVMs) can also be looked
upon as a network of depth 2, but with added constraints that allow us to find
an optimal solution. This has made SVMs very popular, and a nice introduction
is given in Hearst, Dumais, Osman, Platt, and Scholkopf (1998).
Convolutional Neural Networks: The ANNs examined here did not take spatial information into account when classifying images. Contrary to this, neurons
in the visual cortex have receptive fields, and this property is copied by Convolutional Neural Networks (CNNs). These are currently state of the art for
image classification and LeCun et al. (1998) provide a good introduction.
Hessian-free optimization: Gradient descent, which we used throughout this
course, is a very crude optimization method. It completely ignores the curvature
of the error surface and simply looks at the gradient for determining new weights.
Clearly, it would be useful if the curvature could also be taken into account.
This way one could estimate how far one should proceed in any direction along
the surface. Such information, however, comes at a cost, and might not always
be computationally feasible to obtain. Recently though, Martens (2010) proposed
an efficient Hessian-free optimization method for deep neural networks that does
take the curvature into account.
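To make the contrast between ignoring and using the curvature concrete, the two update rules can be written side by side. This is the textbook form of gradient descent and of Newton's method, not the specific algorithm of Martens (2010); H denotes the Hessian of Eav with respect to the weights, a symbol that is not used elsewhere in this material.

```latex
\begin{align}
\text{gradient descent:} \quad & \mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla E_{av}(\mathbf{w}) \\
\text{Newton's method:} \quad & \mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}^{-1} \, \nabla E_{av}(\mathbf{w}),
\qquad H_{ij} = \frac{\partial^2 E_{av}}{\partial w_i \, \partial w_j}
\end{align}
```

Hessian-free methods avoid forming H explicitly and instead work with products of H and a vector, which are much cheaper to compute for large networks.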
References
Arel, I., Rose, D. C., & Karnowski, T. P. (2010). Deep machine learning-a new frontier in
artificial intelligence research [research frontier]. Computational Intelligence Magazine,
IEEE, 5 (4), 13–18.
Benenson, R. (2013, December 18). What is the class of this image? [GitHub]. Retrieved
December 21, 2013, from http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine
Learning, 2 (1), 1–127.
Bengio, Y., Courville, A., & Vincent, P. (2012). Representation learning: a review and new
perspectives. arXiv: 1206.5538 [cs.LG]
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford university press.
Cleary, T. (1988). The art of war. Shambhala.
Croft, A., Davison, R., & Hargreaves, M. (2001). Engineering mathematics. Pearson
Education.
Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews
Neuroscience, 11 (2), 127–138.
Hawkins, J. (2004). On intelligence. Macmillan.
Haykin, S. S. (2009). Neural networks and learning machines. Prentice Hall New York.
Hearst, M., Dumais, S., Osman, E., Platt, J., & Scholkopf, B. (1998, July). Support vector
machines. Intelligent Systems and their Applications, IEEE, 13 (4), 18–28.
Herculano-Houzel, S. (2012). The remarkable, yet not extraordinary, human brain as a
scaled-up primate brain and its associated cost. Proceedings of the National Academy
of Sciences, 109 (Supplement 1), 10661–10668.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural networks, 2 (5), 359–366.
Laserson, J. (2011). From neural networks to deep learning: zeroing in on the human brain.
XRDS: Crossroads, The ACM Magazine for Students, 18 (1), 29–34.
Lashley, K. S. (1950). In search of the engram. In Symposia of the society for experimental
biology (Vol. 4, 454-482, p. 30).
Lay, D. C. (2012). Linear algebra and its applications. Pearson.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied
to document recognition. In Proceedings of the IEEE (Vol. 86, 11, pp. 2278–2324).
Martens, J. (2010). Deep learning via hessian-free optimization. In Proceedings of the 27th
international conference on machine learning (ICML).
Mcmilan, R. (2013, February 18). How google retooled android with help from your brain.
Wired. Retrieved December 21, 2013, from http://www.wired.com/wiredenterprise/2013/02/android-neural-network/
Metz, C. (2013, December 12). Facebook’s ‘deep learning’ guru reveals the future of ai.
Wired. Retrieved December 21, 2013, from http://www.wired.com/wiredenterprise/2013/12/facebook-yann-lecun-qa/
Mitchell, T. (1997). Machine learning. McGraw-Hill.
Poggio, T. & Ullman, S. (2013). Vision: are models of object recognition catching up with
the brain? Annals of the New York Academy of Sciences.
Squire, L. R. & Kandel, E. R. (2009). Memory from mind to molecules. Roberts and
Company Publishers.
Von Melchner, L., Pallas, S. L., & Sur, M. (2000). Visual behaviour mediated by retinal
projections directed to the auditory pathway. Nature, 404 (6780), 871–876.
Wikipedia. (n.d.-a). Automobile. Retrieved December 22, 2013, from http://en.wikipedia.org/wiki/Automobile
Wikipedia. (n.d.-b). Neuron. Retrieved December 20, 2013, from http://en.wikipedia.org/wiki/Neuron
APPENDIX A
Matlab code for solving the XOR problem

clear all; close all; clc;

% Parameters
eta = 0.25;
nEpochs = 200;

% Generating data
X = [1 1 1 1; 1 0 1 0; 1 0 0 1];
Y = [1 1 0 0; 0 0 1 1];

% Number of data points
N = size(X,2);

% Generating random initial weights for 5 hidden neurons
Wi = randn(3,5);
Wii = randn(6,2);

% Defining l and C vectors
l = nan(1,nEpochs);
C = nan(1,nEpochs);

% Initializing progress figure
f1 = figure();
[ah,ph1,ph2] = plotyy(0:(nEpochs-1), l, 0:(nEpochs-1), C);
set(ph1, 'LineWidth', 2); set(ph2, 'LineWidth', 2);
set(ah(2), 'YLim', [0 1]);
set(ah(1), 'XLim', [0 nEpochs]); set(ah(2), 'XLim', [0 nEpochs]);

% Gradient descent
for i = 1:nEpochs
    % Calculating the locally induced field in the hidden layer
    % Please note that Wi is transposed!
    Vi = Wi'*X;
    % Calculating outputs from the hidden layer
    YiHat = tanh(Vi);
    % Adding a row of ones to YiHat
    YiHat = [ones(1,size(YiHat,2)); YiHat];
    % Calculating induced fields in the output layer
    % Please note that Wii is transposed!
    Vii = Wii'*YiHat;
    % phi(Vii) is determined by the output layer type
    YiiHat = exp(Vii) ./ ( ones(size(Vii,1),1) * sum(exp(Vii)) );

    % l and C
    l(i) = -1/N * sum(sum( Y.*log(YiiHat) ));
    % Labels from Y
    [~, Y_labels] = max(Y, [], 1);
    % Labels from Yhat
    [~, Yhat_labels] = max(YiiHat, [], 1);
    % The classification error ratio
    C(i) = mean( Y_labels ~= Yhat_labels );

    % Updating progress figure
    set(ph1, 'YData', l); set(ph2, 'YData', C);
    drawnow

    % Error signal
    E = Y - YiiHat;
    % Weight change for layer (ii)
    dWii = -eta * 1/N * YiHat*E';
    % Weight change for layer (i)
    dWi = -eta * 1/N * X*( (Wii(2:end,:)*E) .* sech(Vi).^2 )';

    % Updating weights
    Wi = Wi - dWi;
    Wii = Wii - dWii;
end

% Fixing labels after training is complete (faster)
xlabel('Epochs', 'Interpreter', 'LaTex');
ylabel(ah(1), 'Mean negative log likelihood ($l$)',...
    'Interpreter', 'LaTex');
ylabel(ah(2), 'Classification error ratio ($C$)',...
    'Interpreter','LaTex');
leg = legend('$l$', '$C$'); set(leg, 'Interpreter', 'latex');

% Calculating a decision surface for output neuron 1
xLim = [-1 2];
dx = 0.1;
[X1, X2] = meshgrid( xLim(1):dx:xLim(2), xLim(1):dx:xLim(2) );

% Preallocating the matrix that stores the output of neuron 1
YiiHat1 = nan(size(X1));

for i = 1:size(X1,1)
    for j = 1:size(X1,2)
        % Creating a datapoint
        X_tmp = [1; X1(i,j); X2(i,j)];
        % Calculating the locally induced field in the hidden layer
        % Please note that Wi is transposed!
        Vi = Wi'*X_tmp;
        % Calculating outputs from the hidden layer
        YiHat = tanh(Vi);
        % Adding a row of ones to YiHat
        YiHat = [ones(1,size(YiHat,2)); YiHat];
        % Calculating induced fields in the output layer
        % Please note that Wii is transposed!
        Vii = Wii'*YiHat;
        % phi(Vii) is determined by the output layer type
        YiiHat = exp(Vii) ./ ( ones(size(Vii,1),1) * sum(exp(Vii)) );
        % Storing the obtained output
        YiiHat1(i,j) = YiiHat(1);
    end
end

% Plotting a decision surface and data points
f2 = figure();
hold on
% Class 1
c1 = plot3(X(2,1:2), X(3,1:2), Y(1,1:2), 'ko',...
    'MarkerSize', 10, 'LineWidth', 2);
% Class 2
c2 = plot3(X(2,3:4), X(3,3:4), Y(1,3:4), 'kx',...
    'MarkerSize', 10, 'LineWidth', 2);
% Decision surface
sh = surf(X1, X2, YiiHat1, 'EdgeColor', 'none');
% Transparency and colormap
alpha(0.5)
colormap(spring);
% Legend and labels
leg = legend('Class 1', 'Class 2', 'Decision surface');
set(leg, 'Interpreter', 'latex');
xlabel('$x_1$', 'Interpreter', 'LaTex');
ylabel('$x_2$', 'Interpreter', 'LaTex');
zlabel('$y$, $\hat{y}_1^{ii}$', 'Interpreter', 'LaTex');
APPENDIX B
Suggested course structure
What follows is a suggestion for how the course could be structured.
Lectures and assignments
A big portion of the course is reserved for implementations in MATLAB; lecture time
is therefore divided 50/50 between presentation of the material and programming assistance.
However, the programming time assigned during lectures will not be enough to complete the
homework assignments. Students are therefore also expected to work on their own.
Requirements for passing the course
In order to pass the course every student will have to complete 3 homework assignments
and give a short presentation (no exam). Each homework assignment will be worth 10
points in total, and at least 3.5 points are needed from each one in order to pass the course.
10.5 points therefore gives the grade 1, and from here the scale goes linearly upwards to
the grade 5. Homework problems can be discussed freely with other students, but
everyone has to hand in a report of their own. The report should always present answers to
solved questions together with the code used to solve the problem (if the answer required
programming). If there are cases of clear plagiarism, all students involved will have to
redo their report and the best attainable grade for the course will be lowered to 3. Besides
the homework problems, everyone will also have to give a short presentation (10 min) in a
group with one or two other students. More information about the homework assignments
and the presentation is found in the accompanying appendices.
B.1 Guidelines for reporting answers to homework problems
There is no need to write an essay in order to present your answers, but you should present
your answers in a well structured manner. That is, each question should be answered
separately, and the answer should include comments on what you have done together
with the results you obtained. Any MATLAB code used to reach the results should
either be appended after the answer or at the end of your report (choose the method that
makes everything as clear as possible). The code may not include any of your answers
as comments; all obtained results that you want to present should be presented in your
answer to the question. The person reading your answers should not have to go through
your m-code in detail in order to see what you have done. The m-code is simply appended
so that it is possible to check what kind of errors you have made if your answers
look strange.
Figures make up a big part of your answers and it is therefore important that they
are well presented. You should therefore make sure that each figure fulfils the following
requirements:
• There are labels on all axes.
• Axes limits are chosen in such a way that what you want to show is clearly visible.
• Legends are used where necessary to clear up any ambiguities.
• Each figure or table is accompanied by a descriptive text.
As a rule of thumb, each figure description should include all the information needed for
another person to redo the experiment and validate your results. Finally, always hand in
your answers as one pdf file, and name the file firstname_surname_HomeworkX.pdf
where X is the number of the homework assignment (1, 2, or 3).
B.2 Homework 1
The purpose of this homework is to solve a linear regression problem in order to get familiar
with gradient descent. You are free to discuss the homework tasks with the rest of the class,
but you have to hand in a report with answers (code appended as well) individually. Please
make sure that the report is well structured and follows the guidelines in Appendix B.1.
Points will be subtracted for poorly structured reports and ambiguous answers.
Generate data (1 p)
Generate and plot 100 data points, where x is drawn randomly in the interval [−5, 5], from
the process:
y = a + bx + r    (B-1)
where r is a normally distributed random number with mean 0 and variance 0.5, a is the
month you were born divided by ten, and b is the day of the month you were born divided by
ten. Complement your plot with a straight line given by Equation B-1 when r is ignored.
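One possible way to generate such data in MATLAB is sketched below. The birthday (24 March) is hypothetical, and drawing x from a uniform distribution is an assumption, since the task only says "randomly". Note that randn has variance 1, so the noise is scaled by the square root of the desired variance.

```matlab
% Hypothetical birthday: 24 March, so a = 3/10 and b = 24/10
a = 3/10;
b = 24/10;
N = 100;

% x drawn uniformly from [-5, 5] (uniform distribution is an assumption)
x = 10*rand(1, N) - 5;

% Noise with mean 0 and variance 0.5 (randn has variance 1, so scale by sqrt(0.5))
r = sqrt(0.5) * randn(1, N);

% The process in Equation B-1
y = a + b*x + r;

% Data points together with the noise-free line
figure(); hold on
plot(x, y, 'ko');
plot([-5 5], a + b*[-5 5], 'b-', 'LineWidth', 2);
xlabel('x'); ylabel('y');
legend('Data points', 'Process without noise');
```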
Surface plot of Eav (2 p)
Assuming that you will try to fit a model such as the one in Figure 3.1, plot a surface plot
of Eav as a function of both w0 and w1 in a suitable interval so that the minimum can be seen.
Gradient descent (3 p)
Implement Algorithm 1 and find the minimum of Eav. Plot the algorithm's progress on top
of the surface plot obtained from the previous task. Verify that the algorithm seems to be
moving in the direction of steepest descent.
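The following MATLAB sketch shows the general shape such an implementation could take. It is not a reproduction of Algorithm 1: it assumes plain batch gradient descent, assumes that Eav is the mean of half the squared error signals, and uses stand-in data and an arbitrarily chosen learning rate. The variable wPath is a hypothetical name for the stored weight trajectory.

```matlab
% Stand-in data (any data generated from Equation B-1 would do)
N = 100;
x = 10*rand(1, N) - 5;
y = 0.3 + 2.4*x + sqrt(0.5)*randn(1, N);

X = [ones(1, N); x];        % row of ones for x0, as elsewhere in the material
eta = 0.05;                 % learning rate (assumed value)
nEpochs = 200;
w = zeros(2, 1);            % [w0; w1]
Eav = nan(1, nEpochs);
wPath = nan(2, nEpochs);    % weights visited, for plotting on the surface

for i = 1:nEpochs
    e = y - w' * X;                   % error signals, 1 by N
    Eav(i) = 1/(2*N) * sum(e.^2);     % assumed definition of Eav
    wPath(:, i) = w;
    w = w + eta/N * X * e';           % step along the negative gradient
end
```

Plotting wPath(1,:), wPath(2,:) and Eav with plot3 on top of the surface gives the trajectory asked for in the task.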
Varying η (1 p)
When varying η, you should see that for some values the algorithm diverges, for other
values it moves in a zigzag pattern towards the minimum, and for yet other values it moves
in a smooth trajectory towards the minimum. Examine the limits for η where these three
different regimes occur and plot an example from each.
Linear algebra (1 p)
Linear algebra provides the following method to determine the optimal values for the
weights:
w = (XXᵀ)⁻¹ Xyᵀ    (B-2)
Why does the above calculation give us the right answer? You will probably have to look
up where this equation comes from, or find a book on linear algebra. The above definition
is adapted to how matrices are defined here, so the equation you will find in books or
online most likely looks something like:
x̂ = (AᵀA)⁻¹ Aᵀy    (B-3)
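A quick numerical check can make the relation between the two forms less abstract. The MATLAB sketch below uses stand-in data and evaluates both equations; in the textbook form the data points are the rows of A and y is a column vector, whereas in this material they are the columns of X and y is a row vector. The gradient expression assumes that the error being minimised is the mean squared error.

```matlab
% Stand-in data
N = 100;
x = 10*rand(1, N) - 5;
y = 0.3 + 2.4*x + sqrt(0.5)*randn(1, N);
X = [ones(1, N); x];

% Equation B-2, with data points stored as columns of X
w = (X*X') \ (X*y');        % backslash solves the system without forming an inverse

% Equation B-3, with data points stored as rows of A
A = X';
xHat = (A'*A) \ (A'*y');

% At the optimum the gradient of the squared error vanishes
grad = -1/N * X * (y - w'*X)';
```

Both forms give the same weights, and the gradient at that point is zero up to rounding, which is a good starting point for answering the question.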
Preprocessing (2 p)
Data is often preprocessed before any attempts to fit a model are done. A normal
preprocessing stage consists of mean centering followed up by variance scaling. In our case,
this means that when you calculate the means for the rows in X you should get a vector of
zeros as your answer. Similarly, when calculating the variance for the rows you should get
a vector of ones. However, this procedure does not have to be done for x0 , so here you just
have to do it for x1 .
Incorporate preprocessing into your previous solutions and do another surface plot of
Eav. What has now changed, and how does this change affect how the gradient descent
algorithm moves towards the minimum?
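The preprocessing step itself is short. A minimal MATLAB sketch, using stand-in data and the hypothetical variable name Xp for the preprocessed matrix, could look as follows.

```matlab
% Stand-in data
N = 100;
x = 10*rand(1, N) - 5;
X = [ones(1, N); x];

% Mean centering and variance scaling of x1 only (row 2); x0 stays equal to one
mu    = mean(X(2, :));
sigma = std(X(2, :));               % std normalises by N-1 by default
Xp = X;
Xp(2, :) = (X(2, :) - mu) / sigma;
```

If the model is later applied to new data, the same mu and sigma have to be reused rather than recomputed.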
B.3 Presentation
This presentation will strive to introduce you to people whose names are encountered quite
often when reading material on both artificial and real neural networks. In groups of 2 to 3
persons, you are to select one of the persons in the list below and give a short presentation
(10 min) about what this person has done.
1. Donald Hebb
2. Geoffrey Hinton
3. Andrew Ng
4. Yann LeCun
5. Tomaso Poggio
6. Eric Kandel
7. Henry Markram
8. Yoshua Bengio
9. Jeff Hawkins
10. Ramón Cajal
B.4 Homework 2
In this homework you will implement softmax regression to solve different classification
problems. As before, you are free to discuss your solutions with other students, but
everyone has to hand in an individual report containing their answers. Please make sure
that the report is well structured and follows the guidelines in Appendix B.1. Points will
be subtracted for poorly structured reports and ambiguous answers.
Basic math
a Explain how matrix multiplications work, and illustrate your understanding by
calculating an example of your own without a computer. (0.25 p)
b Explain and illustrate why matrix multiplications are useful for determining the
induced field for the artificial neuron in Figure 2.3. (0.25 p)
c Why are we not affecting the location of the minimum when we take the
logarithm of the likelihood function (L)? (0.25 p)
d In what way are your models restricted if you forget to add a row of ones to X?
(0.25 p)
A two dimensional two class example
a Generate two dimensional data with 100 data points for two classes (50 data
points per class) so that the classes are linearly separable. Apart from
this constraint, you are free to come up with your own function
for generating the data; just remember that Y is now two dimensional (one
output for each class). Finally, plot the data you generated. (1 p)
b Implement Algorithm 2 and make sure that you have calculated the partial
derivatives correctly by also implementing gradient checking. (1 p)
c Run Algorithm 2 to fit your model to the data and plot l and C for each epoch.
(0.5 p)
d Visualize the obtained model by plotting a decision surface together with the
data you used for training. (1 p)
e Use your knowledge from the previous homework to also fit a model using linear
regression to your data. Compare the obtained decision surfaces for these two
models. (0.75 p)
f Adjust your function for generating the two dimensional data so that the classes
no longer are linearly separable. Train an additional model for this new data
and see what kind of decision surface you obtain. (0.75 p)
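Gradient checking, mentioned in task b, compares the analytical gradient with a numerical estimate obtained from small perturbations of each weight. The MATLAB sketch below illustrates the idea for softmax regression on a small stand-in problem. It assumes the mean negative log likelihood used in the code in Appendix A; the names softmax, negLogLik, gradA, gradN, and relErr are hypothetical, and the sketch is not a reproduction of Algorithm 2.

```matlab
% Small stand-in problem: 2 inputs (+ bias), 2 classes, 20 data points
N = 20; K = 2;
X = [ones(1, N); randn(2, N)];
labels = 1 + (X(2, :) + X(3, :) > 0);           % class 1 or 2
Y = zeros(K, N);
Y(sub2ind([K N], labels, 1:N)) = 1;             % one output row per class
W = 0.1 * randn(3, K);

% Model output and mean negative log likelihood as functions of W
softmax   = @(V) exp(V) ./ (ones(size(V, 1), 1) * sum(exp(V), 1));
negLogLik = @(W) -1/N * sum(sum(Y .* log(softmax(W' * X))));

% Analytical gradient of l with respect to W
gradA = -1/N * X * (Y - softmax(W' * X))';

% Numerical gradient by central differences, one weight at a time
h = 1e-5;
gradN = zeros(size(W));
for idx = 1:numel(W)
    Wp = W; Wp(idx) = Wp(idx) + h;
    Wm = W; Wm(idx) = Wm(idx) - h;
    gradN(idx) = (negLogLik(Wp) - negLogLik(Wm)) / (2*h);
end

% Relative difference; a very small value suggests the derivatives agree
relErr = norm(gradA(:) - gradN(:)) / norm(gradA(:) + gradN(:));
```

A relative difference many orders of magnitude below one is what a correct implementation typically produces; a value close to one usually points to a sign or indexing error.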
Handwritten digits
a Download the MNIST dataset and the corresponding function for reading the
data files into MATLAB (see section 4.5 for details).
b Use the downloaded function to store the MNIST datasets into the matrices
Xtraining , Xtest , Ytraining , and Ytest . The original data only contains a label for
y, so you have to generate both Y matrices yourself. Show that this step is
accomplished by plotting the first 16 images from the training set. (1 p)
c Fit a softmax regression model to the MNIST data, illustrate how the training
progresses, and present how well your best found model performs. Use your
implementation of gradient checking here as well to convince yourself that your
code is correct. (2.5 p)
d Visualize 16 wrongly classified images. (0.5 p)
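Generating a Y matrix from labels, as required in task b, amounts to placing a single one in each column. A minimal MATLAB sketch is given below; the short label vector is a stand-in for the labels returned by the MNIST reader, and the convention that digit 0 maps to row 1 is an assumption.

```matlab
% Stand-in labels; with MNIST these would be the digit labels 0..9
labels = [5 0 4 1 9 2];
K = 10;
N = numel(labels);

% Column n gets a one in row labels(n)+1 (digit 0 -> row 1, ..., digit 9 -> row 10)
Y = zeros(K, N);
Y(sub2ind([K N], labels + 1, 1:N)) = 1;

% Recover the digits again as a check
[~, rows] = max(Y, [], 1);
recovered = rows - 1;
```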
B.5 Homework 3
In this homework you will investigate how multilayer perceptron networks perform a
non-linear projection of your data, followed by either linear or softmax regression inside
a new space spanned by neurons in the hidden layer. As before, you are free to discuss
your solutions with other students, but everyone has to hand in an individual report
containing their answers. Please make sure that the report is well structured and follows
the guidelines in Appendix B.1. Points will be subtracted for poorly structured reports
and ambiguous answers.
The XOR problem (3 p)
You are to replicate the experiment presented in section 5.2. Points are handed out for the
following steps:
a Plot your data points in the space spanned by your inputs (x1 and x2 ), and
in the space spanned by the two neurons in your hidden layer (ŷ1i and ŷ2i ) for
random initial weights. (0.5 p)
b Write a function that returns the gradient and the mean negative log likelihood,
and verify that the gradient is calculated correctly using gradient checking.
(0.5 p)
c Train the network using gradient descent and plot l and C for each epoch.
(0.5 p)
d Plot a decision surface for one of your output neurons and see how the model
generalises. (0.5 p)
e Show that your model has learnt to do a non-linear mapping from (x1 , x2 ) to
(ŷ1i , ŷ2i ) so that the classes now are linearly separable. (0.5 p)
f Increase the number of hidden neurons and see if you can get a different decision
surface. (0.5 p)
Non-linear regression (2 p)
You are to replicate the experiment presented in section 5.3. Points are handed out for the
following steps:
a Plot your original data points (x, y) and the non-linear projection (ŷ1i , ŷ2i , y) for
random initial weights. (0.5 p)
b Train the network using gradient descent and plot E for each epoch. (0.5 p)
c Plot your data points again after the non-linear projection (ŷ1i , ŷ2i , y), and add
the plane learnt by your linear regression neuron in the output layer. (0.75 p)
d Visualize the model learned together with your data points, (x, y) and (x, ŷ ii ).
(0.25 p)
Is this an eye? (5 p)
Your task is to tell if an image represents an eye or not (positive = image centred on
the eye, negative = eye not in centre or not at all in the image). To this end, you can
implement any method of your choosing (presented in this course), but you will get points
for how well you succeed. That is, points are awarded for presenting the data, presenting
your model, verifying that your model works, and for presenting how well it works.
You will receive a mat file containing the data. This file contains two matrices X
(15x15x2255) and Y (2x2255). Hence, each image is of size 15 by 15 pixels and there are
2255 images in total. Divide the images into training and testing sets in order to get an
unbiased estimate of how well your model works. Good luck!
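Dividing the images into training and test sets could, for instance, be done as in the MATLAB sketch below. The random arrays only mimic the stated sizes of X and Y, the 80/20 ratio is an assumption that the task does not prescribe, and Xflat is a hypothetical name for the matrix with one column per image.

```matlab
% Stand-in arrays with the stated sizes (replace with the contents of the mat file)
X = rand(15, 15, 2255);
Y = zeros(2, 2255); Y(1, :) = 1;

% One column per image, plus a row of ones for the bias input
N = size(X, 3);
Xflat = [ones(1, N); reshape(X, 15*15, N)];

% Random 80/20 split (the ratio is an assumption, not prescribed by the task)
idx = randperm(N);
nTrain = round(0.8 * N);
trainIdx = idx(1:nTrain);
testIdx  = idx(nTrain+1:end);

Xtraining = Xflat(:, trainIdx);  Ytraining = Y(:, trainIdx);
Xtest     = Xflat(:, testIdx);   Ytest     = Y(:, testIdx);
```

Shuffling before splitting matters if the images in the file are ordered by class.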
Notation
X        Matrix with input data (dimensions: M + 1 by N)
x        One column from X (represents a data point as a vector)
x_mn     One element in X
M        The dimensionality of x
m        Index for input dimensionality, going from 0 to M
N        The number of data points in X
n        Index for data points, going from 1 to N
V        Matrix with induced fields (dimensions: K by N)
v        One column from V (induced fields by x)
v_kn     One element in V
Y        Matrix with desired output data (dimensions: K by N)
y        One column from Y (desired output vector for x)
y_kn     One element in Y
Ŷ        Matrix with model output data (dimensions: K by N)
ŷ        One column from Ŷ (obtained model output vector for x)
ŷ_kn     One element in Ŷ
E        Error signals as defined by Equation 3.2 (dimensions: K by N)
e        One column from E (represents the observed error signal vector for x)
e_kn     One element in E
K        The dimensionality of y, ŷ, v, and e
k        Index for output dimensionality, going from 1 to K
W        Weight matrix (dimensions: # of layer inputs by # of layer outputs)
w        One column from W (represents weights to a single neuron)
w_mk     One element in W
∆W       Matrix with weight changes, same dimensions as W
∆w       One column from ∆W (represents weight changes for a single neuron)
∆w_mk    One element in ∆W
L        Network depth
l        Index for depth, going from 1 to L
ϕ        Activation function
ε        Error term
E        Error energy
Eav      Average error energy
η        Learning rate parameter
L        Likelihood function
l        Mean negative log likelihood
C        Classification error ratio
Acronyms
AI        Artificial Intelligence
ANN       Artificial Neural Network
CNN       Convolutional Neural Network
MATLAB    Matrix Laboratory
MLP       Multilayer Perceptron
SVM       Support Vector Machine
Index
Activation function
Axon
Classification
Classification error ratio
Connection weights
Curse of Dimensionality
Decision surface
Dendrite
Dot product
Error energy
Error signal
Error term
Global minima
Gradient
Gradient descent
Hidden layer
Induced field
Input layer
Layers
Likelihood function
Linearly separable
Log likelihood
Method of steepest descent
Output layer
Overfitting
Regression
Reinforcement learning
Suboptimal minima
Supervised learning
Synapse
Test set
Training set
Unsupervised learning
Novia University of Applied Sciences is the largest Swedish-speaking
UAS in Finland. Novia UAS has about 4000 students and a staff
workforce of 360 people. Novia has five educational units or
campuses in Vaasa (Seriegatan and Wolffskavägen), Jakobstad,
Raseborg and Turku. High-class and state-of-the-art degree programs
provide students with a proper platform for their future careers.
NOVIA UNIVERSITY OF
APPLIED SCIENCES
Tel +358 (0)6 328 5000
Fax +358 (0)6 328 5110
www.novia.fi
ADMISSIONS OFFICE
PO BOX 6
FI-65201 Vaasa, Finland
Tel +358 (0)6 328 5055
Fax +358 (0)6 328 5117
[email protected]
ISBN 978-952-5839-86-9
9 789525 839869
Read our latest publication at www.novia.fi/FoU/publikation-och-produktion