# [Slides_compressed] ```Learning Deep Structured Models
Raquel Urtasun
University of Toronto
August 21, 2015
R. Urtasun (UofT)
Deep Structured Models
August 21, 2015
1 / 128
1. Part I: Deep Learning
2. Part II: Deep Structured Models
Part I: Deep Learning
Deep Learning
Supervised models
Binary Classification
Given inputs $x$ and outputs $t \in \{-1, 1\}$
We want to fit a hyperplane that divides the space in half:
$y^* = \operatorname{sign}(w^T x^* + w_0)$
SVMs try to maximize the margin
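As a rough illustration of the decision rule $y^* = \operatorname{sign}(w^T x^* + w_0)$, the Python sketch below evaluates it for a hand-picked hyperplane. The weights `w`, the bias `w0`, the test points, and the function name `predict` are illustrative assumptions, not values from the slides.

```python
def predict(w, w0, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + w0
    # Ties (score == 0) are assigned to +1 here; this is an arbitrary choice.
    return 1 if score >= 0 else -1

# Hypothetical hyperplane x1 + x2 - 1 = 0 in two dimensions.
w, w0 = [1.0, 1.0], -1.0
labels = [predict(w, w0, x) for x in [(2.0, 2.0), (0.0, 0.0)]]
```

The sketch only evaluates a given hyperplane; choosing `w` and `w0` (for example by maximizing the margin) is a separate training step that is not shown.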
Non-linear Predictors
How can we make our classifier more powerful?
Compute non-linear functions of the input
$y^* = F(x^*, w)$
Two types of approaches:
Kernel Trick: fix the non-linear mapping $\phi$ and optimize linear parameters on top of it
$y^* = \operatorname{sign}(w^T \phi(x^*) + w_0)$
Deep Learning: learn parametric non-linear functions
$y^* = F(x^*, w)$
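To make the "fixed mapping plus linear parameters" idea concrete, here is a minimal Python sketch on the XOR problem, which no hyperplane in the original two inputs can separate. The feature map `phi`, the weights, and the function names are assumptions chosen for this example, and the sketch uses an explicit feature map rather than kernel evaluations.

```python
def phi(x):
    """Fixed non-linear feature map (an assumption chosen for this example)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def predict(w, w0, x):
    """Linear decision rule applied to the mapped input phi(x)."""
    score = sum(wi * fi for wi, fi in zip(w, phi(x))) + w0
    return 1 if score >= 0 else -1

# Hand-picked linear parameters in feature space (illustrative values).
w, w0 = (1.0, 1.0, -2.0), -0.5
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [predict(w, w0, x) for x in inputs]
```

Here only `w` and `w0` would be learned; in the deep learning approach the mapping itself would also have trainable parameters.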
Why "Deep"?
Supervised Learning: Examples
Classification: output "dog" (classification)
Denoising: a regression task
OCR: output "2 3 4 5" (structured prediction)
Supervised Deep Learning
The same three tasks: classification ("dog"), denoising, and OCR ("2 3 4 5")
[Slide Credit: Ranzato]
Neural Networks
Deep learning uses a composition of simpler functions, e.g., ReLU, sigmoid, tanh, max
Note: a composition of linear functions is linear!
Example: 2-layer NNet
$x \to \max(0, W_1^T x) \to h^1 \to \max(0, W_2^T h^1) \to h^2 \to W_3^T h^2 \to y$
$x$ is the input
$y$ is the output (what we want to predict)
$h^i$ is the $i$-th hidden layer
$W^i$ are the parameters of the $i$-th layer
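The remark that a composition of linear functions is linear can be checked numerically. The Python sketch below, with made-up matrices `W1`, `W2` and input `x` (all assumptions), compares two stacked linear layers against the single matrix $W_2 W_1$.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 3x2, illustrative values
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]     # 2x3, illustrative values
x = [0.5, -2.0]

two_layers = matvec(W2, matvec(W1, x))        # W2 (W1 x), no non-linearity
one_layer = matvec(matmul(W2, W1), x)         # (W2 W1) x, a single linear map
```

Both results agree, which is why a non-linearity such as the ReLU is needed between layers for depth to add expressive power.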
Evaluating the Function
Forward Propagation: compute the output given the input
$x \to \max(0, W_1^T x) \to h^1 \to \max(0, W_2^T h^1) \to h^2 \to W_3^T h^2 \to y$
Fully connected layer: each hidden unit takes as input all the units from the previous layer
The non-linearity is called a ReLU (rectified linear unit), with $x \in \mathbb{R}^D$, $b^i \in \mathbb{R}^{N_i}$ the biases and $W^i \in \mathbb{R}^{N_i \times N_{i-1}}$ the weights
Do it in a compositional way:
$h^1 = \max(0, W^1 x + b^1)$
$h^2 = \max(0, W^2 h^1 + b^2)$
$y = \max(0, W^3 h^2 + b^3)$
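Forward propagation can be sketched in a few lines of Python. The network sizes and parameter values below are hypothetical, and the output layer is kept linear as drawn in the diagram (the last equation on the slide applies a ReLU there as well); treat this as an assumption of the sketch.

```python
def relu(v):
    """Elementwise max(0, .)."""
    return [max(0.0, a) for a in v]

def affine(W, b, v):
    """Compute W v + b for W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]

def forward(x, params):
    """params is a list of (W, b) pairs; ReLU on all but the last layer."""
    h = x
    for W, b in params[:-1]:
        h = relu(affine(W, b, h))
    W, b = params[-1]
    return affine(W, b, h)   # linear output, as in the diagram

# Tiny hypothetical network: 2 inputs, two hidden layers of 2 units, 2 outputs.
params = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[1.0, 1.0], [-1.0, 2.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5]),
]
y = forward([2.0, 1.0], params)
```

Each layer only needs the output of the previous one, which is what "compositional" means here.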
Alternative Graphical Representation
[Figure: the layer $h^{k+1} = \max(0, W^{k+1} h^k)$ drawn as a graph, with units $h^k_1, \dots, h^k_4$ connected to units $h^{k+1}_1, \dots, h^{k+1}_3$ through the weights $w^{k+1}_{1,1}, \dots, w^{k+1}_{3,4}$]
[Slide Credit: Ranzato]
ReLU Interpretation
Piece-wise linear tiling: the mapping is locally linear.
Figure by M. Ranzato
Why Hierarchical?
Interpretation
[1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 … ] motorbike
[0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 … ] truck
Input image → low-level parts → mid-level parts → high-level parts → ... → prediction of class
distributed representations
feature sharing
compositionality
Lee et al. "Convolutional DBN's ..." ICML 2009
[Slide Credit: Ranzato]
Learning
$x \to \max(0, W_1^T x) \to h^1 \to \max(0, W_2^T h^1) \to h^2 \to W_3^T h^2 \to y$
We want to estimate the parameters, biases and hyper-parameters (e.g., number of layers, number of units) such that we make good predictions
Collect a training set of input-output pairs $\{x, t\}$
Encode the output with 1-of-K encoding $t = [0, \cdots, 1, \cdots, 0]$
Define a loss per training example and minimize the empirical risk
$L(w) = \frac{1}{N} \sum_{i=1}^{N} \ell(w, x^{(i)}, t^{(i)}) + R(w)$
with $N$ the number of examples, $R$ a regularizer, and $w$ containing all parameters
What do we want to use as $\ell$?
The task loss: how we are going to evaluate at test time
Loss Functions
$L(w) = \frac{1}{N} \sum_i \ell(w, x^{(i)}, t^{(i)}) + R(w)$
The task loss is too difficult to compute, so one uses a surrogate that is typically convex
Probability of class $k$ given input (softmax):
$p(c_k = 1 \mid x) = \frac{\exp(y_k)}{\sum_{j=1}^{C} \exp(y_j)}$
Cross entropy is the most used loss function for classification
$\ell(x, t, w) = -\sum_i t^{(i)} \log p(c_i \mid x)$
Use gradient descent to train the network
$\min_w \frac{1}{N} \sum_i \ell(w, x^{(i)}, t^{(i)}) + R(w)$
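The softmax and cross-entropy formulas translate almost directly into Python. This is a minimal sketch: the score vector and target are made-up values, and subtracting the maximum score before exponentiating is a standard numerical-stability step that is not on the slide.

```python
import math

def softmax(y):
    """Softmax with the max subtracted for numerical stability."""
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, t):
    """Cross entropy between predicted probabilities p and a 1-of-K target t."""
    return -sum(ti * math.log(pi) for pi, ti in zip(p, t) if ti > 0)

scores = [2.0, 1.0, 0.1]      # illustrative network outputs y
target = [1, 0, 0]            # 1-of-K encoding of the true class
probs = softmax(scores)
loss = cross_entropy(probs, target)
```

With a 1-of-K target the sum reduces to the negative log-probability of the correct class.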
Backpropagation
Efficient computation of the gradients by applying the chain rule
$x \to \max(0, W_1^T x) \to h^1 \to \max(0, W_2^T h^1) \to h^2 \to W_3^T h^2 \to y$
$p(c_k = 1 \mid x) = \frac{\exp(y_k)}{\sum_{j=1}^{C} \exp(y_j)}$
$\ell(x, t, w) = -\sum_i t^{(i)} \log p(c_i \mid x)$
Compute the derivative of the loss w.r.t. the output
$\frac{\partial \ell}{\partial y} = p(c \mid x) - t$
Note that the forward pass is necessary to compute $\frac{\partial \ell}{\partial y}$
Given $\frac{\partial \ell}{\partial y}$, if we can compute the Jacobian of each module:
$\frac{\partial \ell}{\partial W^3} = \frac{\partial \ell}{\partial y} \frac{\partial y}{\partial W^3} = (p(c \mid x) - t)(h^2)^T$
$\frac{\partial \ell}{\partial h^2} = \frac{\partial \ell}{\partial y} \frac{\partial y}{\partial h^2} = (W^3)^T (p(c \mid x) - t)$
Need to compute gradient w.r.t. inputs and parameters in each layer
Given $\frac{\partial \ell}{\partial h^2}$, if we can compute the Jacobian of each module:
$\frac{\partial \ell}{\partial W^2} = \frac{\partial \ell}{\partial h^2} \frac{\partial h^2}{\partial W^2}$
$\frac{\partial \ell}{\partial h^1} = \frac{\partial \ell}{\partial h^2} \frac{\partial h^2}{\partial h^1}$
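The closed-form gradients of the last layer can be sanity-checked against finite differences. The Python sketch below assumes a linear output $y = W^3 h^2$ followed by softmax and cross entropy; the values of `W3`, `h2`, and `t` are made up for illustration.

```python
import math

def softmax(y):
    m = max(y)
    e = [math.exp(v - m) for v in y]
    s = sum(e)
    return [v / s for v in e]

def loss(W3, h2, t):
    """Cross entropy of softmax(W3 h2) against a 1-of-K target t."""
    y = [sum(w * h for w, h in zip(row, h2)) for row in W3]
    p = softmax(y)
    return -sum(ti * math.log(pi) for pi, ti in zip(p, t) if ti > 0), p

W3 = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]]   # illustrative values
h2 = [1.0, 0.5, 2.0]
t = [0, 1]

_, p = loss(W3, h2, t)
dy = [pi - ti for pi, ti in zip(p, t)]                    # dl/dy = p - t
dW3 = [[d * h for h in h2] for d in dy]                   # (p - t)(h2)^T
dh2 = [sum(W3[i][j] * dy[i] for i in range(len(dy)))      # (W3)^T (p - t)
       for j in range(len(h2))]

# Central finite differences as an independent check of both gradients.
eps = 1e-6

def bumped_W(i, j, delta):
    W = [row[:] for row in W3]
    W[i][j] += delta
    return W

def bumped_h(j, delta):
    return [h + (delta if k == j else 0.0) for k, h in enumerate(h2)]

num_dW3 = [[(loss(bumped_W(i, j, eps), h2, t)[0]
             - loss(bumped_W(i, j, -eps), h2, t)[0]) / (2 * eps)
            for j in range(3)] for i in range(2)]
num_dh2 = [(loss(W3, bumped_h(j, eps), t)[0]
            - loss(W3, bumped_h(j, -eps), t)[0]) / (2 * eps) for j in range(3)]
```

The same pattern repeats for earlier layers: each module receives the gradient w.r.t. its output and returns gradients w.r.t. its input and its parameters.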
Gradient descent is a first-order method, where one takes steps proportional to the negative of the gradient of the function at the current point
$x_{n+1} = x_n - \gamma_n \nabla F(x_n)$
Example: $f(x) = x^4 - 3x^3 + 2$
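For the example function, $f'(x) = 4x^3 - 9x^2 = x^2(4x - 9)$, so the minimizer is $x = 9/4$. The Python sketch below runs the update rule on it; the starting point, the fixed step size, and the iteration count are assumptions chosen so that the iteration converges.

```python
def grad_f(x):
    """Derivative of f(x) = x**4 - 3*x**3 + 2."""
    return 4 * x**3 - 9 * x**2

x = 4.0          # starting point (an arbitrary choice)
gamma = 0.01     # fixed step size (an assumption; the slide allows gamma_n to vary)
for _ in range(1000):
    x = x - gamma * grad_f(x)   # step against the gradient
```

A much larger step size would overshoot on this function, which is one reason the step size is usually tuned or decayed in practice.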
Use gradient descent to train the network
$\min_w \frac{1}{N} \sum_{i=1}^{N} \ell(w, x^{(i)}, t^{(i)}) + R(w)$
We need to compute at each iteration
$w_{n+1} = w_n - \gamma_n \nabla L(w_n)$
Use the backward pass to compute $\nabla L(w_n)$ efficiently
Recall that the backward pass requires the forward pass first
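Toy Code (Matlab): Neural Net Trainer
% h{1} is the input batch (one example per column); h{i+1} is the output of layer i.
% nonlinearity and softmax are helper functions not shown on the slide.
% Wgrad/bgrad are reconstructed from the backpropagation equations.
% F-PROP
for i = 1 : nr_layers - 2
  [h{i+1}, jac{i+1}] = nonlinearity(W{i} * h{i} + b{i});
end
h{nr_layers} = W{nr_layers-1} * h{nr_layers-1} + b{nr_layers-1};  % last layer is linear
prediction = softmax(h{nr_layers});
% CROSS ENTROPY LOSS
loss = - sum(sum(log(prediction) .* target)) / batch_size;
% B-PROP
dh{nr_layers} = prediction - target;
for i = nr_layers - 1 : -1 : 1
  Wgrad{i} = dh{i+1} * h{i}';   % gradient w.r.t. the weights of layer i
  bgrad{i} = sum(dh{i+1}, 2);   % gradient w.r.t. the biases of layer i
  if i > 1
    dh{i} = (W{i}' * dh{i+1}) .* jac{i};
  end
end
% UPDATE
for i = 1 : nr_layers - 1
  W{i} = W{i} - (lr / batch_size) * Wgrad{i};
  b{i} = b{i} - (lr / batch_size) * bgrad{i};
end
[Slide Credit: Ranzato]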
Dealing with Big Data
$\min_w \frac{1}{N} \sum_{i=1}^{N} \ell(w, x^{(i)}, t^{(i)}) + R(w)$
We need to compute at each iteration
$w_{n+1} = w_n - \gamma_n \nabla L(w_n)$
with
$\nabla L(w_n) = \frac{1}{N} \sum_{i=1}^{N} \nabla \ell(w, x^{(i)}, t^{(i)}) + \nabla R(w)$
Too expensive when having millions of examples
$\frac{1}{N} \sum_{i=1}^{N} \nabla \ell(w, x^{(i)}, t^{(i)}) \approx \frac{1}{|S|} \sum_{i \in S} \nabla \ell(w, x^{(i)}, t^{(i)})$
with $S$ a subset of the training examples
This is called stochastic gradient descent
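The mini-batch approximation can be illustrated on a toy problem. In this Python sketch the per-example loss $0.5\,(w x - t)^2$, the synthetic data, and the batch size of 32 are all assumptions made for the example; the regularizer is omitted.

```python
import random

def grad_example(w, x, t):
    """Gradient of the per-example loss 0.5 * (w * x - t)**2 w.r.t. w."""
    return (w * x - t) * x

def batch_gradient(w, data, indices):
    """Average the per-example gradients over the given index set S."""
    return sum(grad_example(w, *data[i]) for i in indices) / len(indices)

random.seed(0)
# Synthetic 1-D regression data t = 3x + noise (purely illustrative).
data = [(x, 3.0 * x + random.gauss(0.0, 0.1))
        for x in (random.uniform(-1.0, 1.0) for _ in range(1000))]

w = 0.0
full = batch_gradient(w, data, range(len(data)))      # all N examples
S = random.sample(range(len(data)), 32)               # mini-batch S
estimate = batch_gradient(w, data, S)                 # cheaper, noisy estimate
```

The estimate costs $|S|$ gradient evaluations instead of $N$, at the price of noise that shrinks as $|S|$ grows.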
$w_{n+1} = w_n - \gamma_n \nabla L(w_n)$
with
$\nabla L(w_n) = \frac{1}{|S|} \sum_{i \in S} \nabla \ell(w, x^{(i)}, t^{(i)}) + \nabla R(w)$
We can also use momentum
$w \leftarrow w - \gamma \Delta$
$\Delta \leftarrow \kappa \Delta + \nabla L$
Many other variants exist
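A minimal Python sketch of the momentum update on a one-dimensional quadratic follows. The objective $L(w) = 0.5\,(w - 3)^2$ and the values of $\gamma$ and $\kappa$ are assumptions, and the sketch updates $\Delta$ before $w$, which is one common ordering of the two assignments.

```python
def grad_L(w):
    """Gradient of the toy objective L(w) = 0.5 * (w - 3)**2."""
    return w - 3.0

w, delta = 0.0, 0.0
gamma, kappa = 0.1, 0.9      # step size and momentum coefficient (assumed values)
for _ in range(500):
    delta = kappa * delta + grad_L(w)   # accumulate a running gradient direction
    w = w - gamma * delta               # step along the accumulated direction
```

With $\kappa = 0$ this reduces to plain gradient descent; larger $\kappa$ keeps more of the previous direction.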
How to deal with large Input Spaces
Images can have millions of pixels, i.e., $x$ is very high dimensional
Prohibitive to have a fully-connected layer
We can use a locally connected layer
This is good when the input is registered
Locally Connected Layer
STATIONARITY? Statistics are similar at different locations
Example: 200x200 image, 40K hidden units, filter size 10x10: 4M parameters
Note: This parameterization is good when the input image is registered (e.g., face recognition).
[Slide Credit: Ranzato]
Convolutional Neural Net
Idea: statistics are similar at different locations (LeCun 1998)
Connect each hidden unit to a small input patch and share the weights across space
This is called a convolution layer and the network is a convolutional network
Convolutional Layer
$h_j^n = \max\left(0, \sum_{k=1}^{K} h_k^{n-1} * w_{jk}^n\right)$
[Slide Credit: Ranzato]
Convolutional Layer
Learn multiple filters.
E.g.: 200x200 image, 100 filters, filter size 10x10: 10K parameters
[Slide Credit: Ranzato]
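The parameter counts quoted for the 200x200 example can be reproduced with simple arithmetic. In the Python sketch below the function names are hypothetical, biases are ignored, and the fully connected count assumes the same 40K hidden units as the locally connected layer (an assumption, since the slides only call that case prohibitive).

```python
def locally_connected_params(hidden_units, filter_h, filter_w):
    """Each hidden unit owns its own filter (no weight sharing, biases ignored)."""
    return hidden_units * filter_h * filter_w

def conv_params(num_filters, filter_h, filter_w):
    """Weights are shared across space, so only the filters count (biases ignored)."""
    return num_filters * filter_h * filter_w

fully_connected = (200 * 200) * 40_000               # every pixel to every unit
local = locally_connected_params(40_000, 10, 10)     # 4,000,000
conv = conv_params(100, 10, 10)                      # 10,000
```

Weight sharing makes the count independent of the image size, which is what brings it down from millions to thousands.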
Pooling Layer
By "pooling" (e.g., taking max) filter responses at different locations we gain robustness to the exact spatial location of features.
[Slide Credit: Ranzato]
Pooling Options
Max Pooling: return the maximal argument
Average Pooling: return the average of the arguments
Other types of pooling exist, e.g., L2 pooling
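The pooling options can be sketched in one dimension as follows. The non-overlapping windows, the function name `pool_1d`, and the definition of L2 pooling as the square root of the sum of squares (one common convention) are assumptions of this Python example.

```python
import math

def pool_1d(values, size, mode="max"):
    """Pool non-overlapping windows of `size` consecutive responses."""
    out = []
    for start in range(0, len(values) - size + 1, size):
        window = values[start:start + size]
        if mode == "max":
            out.append(max(window))
        elif mode == "average":
            out.append(sum(window) / size)
        elif mode == "l2":
            out.append(math.sqrt(sum(v * v for v in window)))
        else:
            raise ValueError("unknown pooling mode: " + mode)
    return out

responses = [1.0, 3.0, 0.0, 4.0, 2.0, 2.0]
max_pooled = pool_1d(responses, 2, "max")
avg_pooled = pool_1d(responses, 2, "average")
# Swapping responses inside each window leaves the max-pooled output unchanged.
shifted = pool_1d([3.0, 1.0, 4.0, 0.0, 2.0, 2.0], 2, "max")
```

The `shifted` case shows the robustness argument in miniature: moving a response within its pooling window does not change the pooled value.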
Pooling Layer: Receptive Field Size
$h^{n-1}$ → Conv. layer → $h^n$ → Pool. layer → $h^{n+1}$
If convolutional filters have size KxK and stride 1, and the pooling layer has pools of size PxP, then each unit in the pooling layer depends upon a patch (at the input of the preceding conv. layer) of size: (P+K-1)x(P+K-1)
[Slide Credit: Ranzato]
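The receptive field formula can be verified by brute-force counting along one dimension. The Python function names and the example sizes (K = 5, P = 2) below are assumptions for illustration.

```python
def pooled_receptive_field(K, P):
    """Side length of the input patch seen by one pooling unit (conv stride 1)."""
    return P + K - 1

def receptive_field_by_counting(K, P):
    """Brute-force check in 1-D: collect input indices touched by one pool."""
    touched = set()
    for conv_pos in range(P):              # conv outputs inside one pool
        touched.update(range(conv_pos, conv_pos + K))
    return len(touched)

side = pooled_receptive_field(K=5, P=2)    # a 6x6 input patch
```

Each of the P adjacent conv outputs sees K inputs, and neighbouring windows overlap in K-1 positions, which gives P+K-1 per dimension.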
Now let’s make this very deep
Convolutional Neural Networks (CNN)
filtering?
If our filter was $[-1, 1]$, we got a vertical edge detector
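The effect of the $[-1, 1]$ filter can be seen on a tiny synthetic image. The Python sketch below is an assumption-laden illustration: the image values and the function name are made up, and the filter is applied as a cross-correlation (no kernel flip), as is common in CNN implementations; a true convolution would flip the kernel and hence the sign of the response.

```python
def correlate_rows(image, kernel):
    """Slide a 1-D kernel along each row ('valid' positions only)."""
    k = len(kernel)
    return [[sum(kernel[j] * row[i + j] for j in range(k))
             for i in range(len(row) - k + 1)]
            for row in image]

# Small synthetic image: dark on the left, bright on the right.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
response = correlate_rows(image, [-1, 1])
```

The response is non-zero only in the column where the intensity changes, i.e. along the vertical edge.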
Now imagine we want to have many filters (e.g., vertical, horizontal, corners, one for dots). We will use a filterbank.
So applying a filterbank to an image yields a cube-like output, a 3D matrix in which each slice is an output of convolution with one filter.
Do some additional tricks. A popular one is called max pooling. Any idea why you would do this? To get invariance to small shifts in position.
Now add another "layer" of filters. For each filter again do convolution, but this time with the output cube of the previous layer.
Keep adding a few layers. Any idea what's the purpose of more layers? Why can't we just have a full bunch of filters in one layer?
In the end add one or two fully (or densely) connected layers. In this layer, we don't do convolution; we just do a dot-product between the "filter" and the output of the previous layer.
Add one final layer: a classification layer. Each dimension of this vector tells us the probability of the input image being of a certain class.
The trick is to not hand-fix the weights, but to train them. Train them such that when the network sees a picture of a dog, the last layer will say "dog".
Or when the network sees a picture of a cat, the last layer will say "cat".
Or when the network sees a picture of a boat, the last layer will say "boat"... The more pictures the network sees, the better.
[Slide Credit: Sanja Fidler, Pic adopted from: A. Krizhevsky]
Classification
Once trained, we feed in an image or a crop, run it through the network, and read out the class with the highest probability in the last (classification) layer.
[Slide Credit: Sanja Fidler]
Classification Performance
ImageNet, the main challenge for object classification:
http://image-net.org/
1000 classes, 1.2M training images, 150K for test
Architecture for Classification
Total nr. params: 60M. Total nr. flops: 832M
Layers from category prediction (top) down to input (bottom), with parameters and flops per layer:
LINEAR: 4M params, 4M flops
FULLY CONNECTED: 16M params, 16M flops
FULLY CONNECTED: 37M params, 37M flops
MAX POOLING
CONV: 442K params, 74M flops
CONV: 1.3M params, 224M flops
CONV: 884K params, 149M flops
MAX POOLING
LOCAL CONTRAST NORM
CONV: 307K params, 223M flops
MAX POOLING
LOCAL CONTRAST NORM
CONV: 35K params, 105M flops
input
Krizhevsky et al. "ImageNet Classification with deep CNNs" NIPS 2012
[Slide Credit: Ranzato]
The 2012 Computer Vision Crisis
[Figure panels: (Classification), (Detection)]
Neural Networks as Descriptors
What vision people like to do is take the already trained network (avoiding one week of training), and remove the last classification layer. Then take the top remaining layer (the 4096 dimensional vector here) and use it as a descriptor (feature vector).
Now train your own classifier on top of these features for arbitrary classes.
This is quite hacky, but works miraculously well.
Everywhere we were using SIFT (or anything else), you can use NNs.
[Slide Credit: Sanja Fidler]
Caltech 256
Caltech Results
Zeiler & Fergus, Visualizing and Understanding Convolutional Networks, arXiv 1311.2901, 2013
[Figure: accuracy (%) versus training images per class (0 to 60) on Caltech 256, comparing Our Model, Bo et al. and Sohn et al.; annotation: "6 training examples"]
And Detection?
For classification we feed in the full image to the network. But how can we
perform detection?
The Era Post-Alex Net: PASCAL VOC detection
Extract object proposals with bottom-up grouping
and then classify them using your big net
Detection Performance
PASCAL VOC challenge:
http://pascallin.ecs.soton.ac.uk/challenges/VOC/.
Figure : PASCAL has 20 object classes, 10K images for training, 10K for test
Detection Performance a Year Ago: 40.4%
A year ago, no networks:
Results on the main recognition benchmark, the PASCAL VOC challenge.
Figure : Leading method segDPM (ours). Those were the good times...
S. Fidler, R. Mottaghi, A. Yuille, R. Urtasun, Bottom-up Segmentation for Top-down Detection, CVPR’13
So Neural Networks are Great
So networks turn out to be great.
Everything is deep, even if it’s shallow!
Companies leading the competitions: ImageNet, KITTI, but not yet PASCAL
· · · and a lot of our good students :(
But to train the networks you need quite a bit of computational power. So
what do you do?
And train more layers. 16 instead of 7 before. 144 million parameters.
[Slide Credit: Sanja Fidler, Pic adopted from: A. Krizhevsky]
Figure: K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image
Recognition. arXiv 2014
What if we Want Semantic Segmentation?
Every layer, even a fully connected one, can be treated as a convolutional layer,
so the network can deal with inputs of arbitrary dimensions
The network can work on superpixels, or can operate directly on pixels
Due to pooling, the output typically has lower resolution than the input; use
interpolation to upsample it.
PASCAL VOC, 65% IOU
More to come in Part II
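The claim that a fully connected layer can be treated as a convolution can be checked numerically. The sketch below assumes a toy layer (3 input channels, 4 × 4 patches, 5 output units, all made-up sizes): the same weights are reshaped into 5 filters and slid over a larger image, which yields a spatial map of outputs instead of a single vector.

```python
import numpy as np

rng = np.random.default_rng(0)
C, k, D = 3, 4, 5                        # input channels, patch size, output units
W = rng.normal(size=(D, C * k * k))      # fully connected weights on a flattened C x k x k patch
b = rng.normal(size=D)

def fc(patch):
    """Fully connected layer applied to one C x k x k patch."""
    return W @ patch.reshape(-1) + b

def fc_as_conv(image):
    """Slide the same weights over a C x H x W image (stride 1, no padding)."""
    kernels = W.reshape(D, C, k, k)
    _, H, Wd = image.shape
    out = np.empty((D, H - k + 1, Wd - k + 1))
    for i in range(H - k + 1):
        for j in range(Wd - k + 1):
            window = image[:, i:i + k, j:j + k]
            out[:, i, j] = np.tensordot(kernels, window, axes=3) + b
    return out

patch = rng.normal(size=(C, k, k))
image = rng.normal(size=(C, 6, 7))       # larger than the patch the layer was built for
dense_map = fc_as_conv(image)
print(dense_map.shape)
```

On an input of exactly the original patch size the two functions agree; on the larger image each output location equals the fully connected layer applied to the corresponding window.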
Practical Tips
How to choose Hyperparameters?
Hyperparameters: architecture, learning rate, num layers, num features, etc.
How to choose them?
1. Cross-validation
2. Grid search (need lots of GPUs)
3. Random [Bergstra & Bengio JMLR 2012]
4. Bayesian optimization [Whetlab Toronto]
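Random search, the third option above, is simple enough to sketch. In the example below `validation_error` is a hypothetical stand-in for "train a network with this configuration and return its validation error"; the search ranges and the function itself are assumptions made only for illustration.

```python
import math
import random

def validation_error(lr, num_hidden):
    """Hypothetical stand-in for training a network and measuring validation error."""
    return (math.log10(lr) + 2.0) ** 2 + 0.1 * abs(num_hidden - 256) / 256

def random_search(num_trials, seed=0):
    """Sample configurations independently and keep the one with the lowest error."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_trials):
        config = {
            "lr": 10 ** rng.uniform(-5, 0),                 # learning rate on a log scale
            "num_hidden": rng.choice([64, 128, 256, 512]),
        }
        err = validation_error(**config)
        if best is None or err < best[0]:
            best = (err, config)
    return best

best_err, best_config = random_search(num_trials=50)
print(best_err, best_config)
```

Sampling the learning rate on a log scale is a common choice because useful values span several orders of magnitude.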
Good to Know
ALWAYS check gradients numerically by finite differences!
Measure error on both the training and validation sets, NEVER on the test set
Train on a small subset of the data and check that you can overfit (i.e., error
→ 0)
Visualize features: feature maps need to be uncorrelated and have high
variance.
Visualize parameters
Figure: from M. Ranzato
[Slide credit: M. Ranzato]
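A finite-difference gradient check, the first tip above, can be sketched as follows. The logistic loss for a single example with label t in {−1, +1} is used only as a convenient stand-in for a network's loss; the step size and the example data are arbitrary choices.

```python
import numpy as np

def loss_and_grad(w, x, t):
    """Logistic loss for one example with label t in {-1, +1}, and its analytic gradient."""
    margin = t * (w @ x)
    loss = np.log1p(np.exp(-margin))
    grad = -t * x / (1.0 + np.exp(margin))
    return loss, grad

def numerical_grad(f, w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
w, x, t = rng.normal(size=4), rng.normal(size=4), -1.0
_, analytic = loss_and_grad(w, x, t)
numeric = numerical_grad(lambda v: loss_and_grad(v, x, t)[0], w)
# Relative error between the two gradients; a large value signals a bug.
rel_err = np.linalg.norm(analytic - numeric) / (np.linalg.norm(analytic) + np.linalg.norm(numeric))
print(rel_err)
```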
What if it doesn’t work?
Training diverges
  Decrease learning rate
Parameters collapse / loss is minimized but accuracy is low
  Appropriate loss function?
  Does the loss function have degenerate solutions?
Network is underperforming
  Make it bigger
  Visualize hidden units/params and fix optimization
Network is too slow
  GPU, distributed framework, make net smaller
[Slide credit: M. Ranzato]
Improving Generalization
Weight sharing (reduces the number of parameters)
Data augmentation (e.g., jittering, noise injection, transformations)
Dropout [Hinton et al.]: randomly drop units (along with their
connections) from the neural network during training. Use for the
fully connected layers only
Regularization: Weight decay (L2, L1)
Sparsity in the hidden units
[Slide credit: M. Ranzato]
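Dropout as described above can be sketched in a few lines. The version below is the "inverted" variant, which rescales the surviving units during training so that nothing changes at test time; this is a common implementation choice and not necessarily the exact formulation of the cited paper. The activations and the drop probability are made-up values.

```python
import numpy as np

def dropout(h, p_drop, rng, train=True):
    """Inverted dropout: zero each unit with probability p_drop and rescale the rest."""
    if not train or p_drop == 0.0:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones((1000, 100))                  # hypothetical fully connected activations
h_train = dropout(h, p_drop=0.5, rng=rng)
h_test = dropout(h, p_drop=0.5, rng=rng, train=False)
print((h_train == 0).mean(), h_train.mean())
```

Roughly half of the units are zeroed during training, while the expected activation stays the same because of the rescaling.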
Software
Torch7: learning library that supports neural net training
  http://www.torch.ch
  http://code.cogbits.com/wiki/doku.php (tutorial with demos by C. Farabet)
  https://github.com/sermanet/OverFeat
Python-based learning library (U. Montreal)
  http://deeplearning.net/software/theano/ (does automatic differentiation)
Efficient CUDA kernels for ConvNets (Krizhevsky)
Caffe (Yangqing Jia)
  http://caffe.berkeleyvision.org
Deep Structured Models
  http://www.alexander-schwing.de/ (soon available)
[Slide Credit: M. Ranzato]
Part II: Deep Structured Learning
What’s next?
1. Theoretical Understanding
2. Unsupervised Learning
3. Structured models
Structure!
Many vision problems are complex and involve predicting many random
variables that are statistically related
Scene understanding: x = image, y: room layout
Tag prediction: x = image, y: tag "combo"
Segmentation: x = image, y: segmentation
Deep Learning
Complex mapping F(x, y, w) to predict output y given input x through a
series of matrix multiplications, non-linearities and pooling operations
Figure: Imagenet CNN [Krizhevsky et al. 13]
We typically train the network to predict one random variable (e.g.,
ImageNet) by minimizing cross-entropy
Multi-task extensions: sum the loss of each task, and share part of the
features (e.g., segmentation)
Use an MRF as a post processing step
PROBLEM: How can we take into account complex dependencies when
predicting multiple variables?
SOLUTION: Graphical models
Graphical Models
Convenient tool to illustrate dependencies among random variables
$$E(\mathbf{y}) = \underbrace{-\sum_{i} f_i(y_i)}_{\text{unaries}} \; \underbrace{-\sum_{i,j \in \mathcal{E}} f_{i,j}(y_i, y_j)}_{\text{pairwise}} \; \underbrace{-\sum_{\alpha} f_\alpha(y_\alpha)}_{\text{high-order}}$$
[Figure: graphical model with unary, pairwise and high-order potentials]
Widespread usage among different fields: vision, NLP, comp. bio, · · ·
Compact Notation
In Computer Vision we usually express
$$E(\mathbf{y}) = \underbrace{-\sum_{i} f_i(y_i)}_{\text{unaries}} \; \underbrace{-\sum_{i,j \in \mathcal{E}} f_{i,j}(y_i, y_j)}_{\text{pairwise}} \; \underbrace{-\sum_{\alpha} f_\alpha(y_\alpha)}_{\text{high-order}}$$
For the purpose of this talk we are going to use a more compact notation
$$E(\mathbf{y}, \mathbf{w}) = -\sum_{r \in \mathcal{R}} f_r(\mathbf{y}_r, \mathbf{w})$$
$r$ is a region and $\mathcal{R}$ is the set of all regions
$\mathbf{y}_r$ is of any order
The functions $f_r$ are a function of parameters $\mathbf{w}$
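The compact region notation maps naturally onto code. In the sketch below each region is represented as a pair (variable indices, score table), which is an illustrative data layout and not something prescribed by the talk; the 3-variable chain and its scores are made up.

```python
import numpy as np

def energy(y, regions):
    """E(y, w) = -sum_r f_r(y_r, w); each region is (variable indices, score table)."""
    return -sum(table[tuple(y[i] for i in idx)] for idx, table in regions)

# Hypothetical 3-variable chain with binary labels: three unaries and two pairwise terms.
unary = np.array([[0.0, 1.0], [0.5, 0.0], [0.0, 2.0]])
agree = np.array([[1.0, 0.0], [0.0, 1.0]])            # rewards equal neighbouring labels
regions = [((i,), unary[i]) for i in range(3)] + [((0, 1), agree), ((1, 2), agree)]

print(energy((1, 1, 1), regions))
```

Unary, pairwise and higher-order terms are all handled by the same loop, since a region is just an index tuple of any length.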
Continuous vs Discrete MRFs
$$E(\mathbf{y}, \mathbf{w}) = -\sum_{r \in \mathcal{R}} f_r(\mathbf{y}_r, \mathbf{w})$$
Discrete MRFs: $y_i \in \{1, \cdots, C_i\}$
Continuous MRFs: $y_i \in \mathcal{Y} \subseteq \mathbb{R}$
Hybrid MRFs with continuous and discrete variables
Today’s talk: only discrete MRFs
Probabilistic Interpretation
The energy is defined as
$$E(\mathbf{y}, \mathbf{w}) = -F(\mathbf{y}, \mathbf{w}) = -\sum_{r \in \mathcal{R}} f_r(\mathbf{y}_r, \mathbf{w})$$
We can construct a probability distribution over the outputs
$$p(\mathbf{y}; \mathbf{w}) = \frac{1}{Z(\mathbf{w})} \exp\left( \sum_{r \in \mathcal{R}} f_r(\mathbf{y}_r, \mathbf{w}) \right)$$
with $Z(\mathbf{w}) = \sum_{\mathbf{y}} \exp\left( \sum_{r \in \mathcal{R}} f_r(\mathbf{y}_r, \mathbf{w}) \right)$ the partition function
CRFs vs MRFs
$$p(\mathbf{y} \mid \mathbf{x}; \mathbf{w}) = \frac{1}{Z(\mathbf{x}, \mathbf{w})} \exp\left( \sum_{r \in \mathcal{R}} f_r(\mathbf{x}, \mathbf{y}_r, \mathbf{w}) \right)$$
with $Z(\mathbf{x}, \mathbf{w}) = \sum_{\mathbf{y}} \exp\left( \sum_{r \in \mathcal{R}} f_r(\mathbf{x}, \mathbf{y}_r, \mathbf{w}) \right)$ the partition function
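For a tiny model the distribution and its partition function can be computed by enumerating every configuration. The sketch below assumes a chain with unary and pairwise score tables (made-up values); the enumeration grows exponentially with the number of variables, so it only illustrates the definition.

```python
import itertools
import numpy as np

def score(y, unary, pairwise):
    """F(y, w) for a chain: sum of unary terms plus pairwise terms on neighbours."""
    s = sum(unary[i, yi] for i, yi in enumerate(y))
    return s + sum(pairwise[a, b] for a, b in zip(y[:-1], y[1:]))

def distribution(unary, pairwise):
    """Enumerate every configuration; return {y: p(y; w)} and the partition function Z."""
    n, num_labels = unary.shape
    configs = list(itertools.product(range(num_labels), repeat=n))
    weights = np.array([np.exp(score(y, unary, pairwise)) for y in configs])
    Z = weights.sum()
    return dict(zip(configs, weights / Z)), Z

unary = np.array([[0.0, 1.0], [0.5, 0.0], [0.0, 2.0]])   # hypothetical scores
pairwise = np.array([[1.0, 0.0], [0.0, 1.0]])
p, Z = distribution(unary, pairwise)
print(len(p), Z, max(p, key=p.get))
```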
MAP: maximum a posteriori estimate, or minimum energy configuration
$$\mathbf{y}^* = \arg\max_{\mathbf{y}} \sum_{r \in \mathcal{R}} f_r(\mathbf{y}_r, \mathbf{w})$$
Probabilistic Inference: We might want to compute $p(\mathbf{y}_r)$ for any possible
subset of variables $r$, or $p(\mathbf{y}_r \mid \mathbf{y}_p)$ for any subsets $r$ and $p$
M-best configurations (e.g., top-k)
Very difficult tasks in general (i.e., NP-hard). Some exceptions, e.g., low-treewidth
models and binary MRFs with sub-modular energies
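A chain is the simplest low-treewidth model for which exact MAP inference is tractable. The sketch below runs max-product (Viterbi) dynamic programming on a chain with random, made-up scores and compares the result with exhaustive enumeration; the array layout (one unary row per variable, one shared pairwise table) is an assumption for illustration.

```python
import itertools
import numpy as np

def chain_map(unary, pairwise):
    """Max-product (Viterbi) on a chain: exact MAP in O(n * L^2) time."""
    n, L = unary.shape
    best = unary[0].copy()                 # best[l]: top score of a prefix ending in label l
    back = np.zeros((n, L), dtype=int)
    for i in range(1, n):
        cand = best[:, None] + pairwise    # cand[a, b]: prefix ending in a, then label b
        back[i] = cand.argmax(axis=0)
        best = cand.max(axis=0) + unary[i]
    y = [int(best.argmax())]
    for i in range(n - 1, 0, -1):          # follow back-pointers from the last variable
        y.append(int(back[i, y[-1]]))
    return tuple(reversed(y)), float(best.max())

def brute_force_map(unary, pairwise):
    """Enumerate all L^n configurations; only feasible for tiny models."""
    n, L = unary.shape
    def score(y):
        return sum(unary[i, yi] for i, yi in enumerate(y)) + sum(
            pairwise[a, b] for a, b in zip(y[:-1], y[1:]))
    return max(itertools.product(range(L), repeat=n), key=score)

rng = np.random.default_rng(0)
unary, pairwise = rng.normal(size=(6, 3)), rng.normal(size=(3, 3))
y_star, best_score = chain_map(unary, pairwise)
print(y_star, best_score)
```

The dynamic program touches 6 × 3 × 3 table entries, whereas the enumeration visits all 3^6 configurations.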
Learning in CRFs
Given a training set of N pairs $(x, y) \in \mathcal{D}$, we want to estimate the functions
$f_r(x, y_r, w)$
As these functions are parametric, this is equivalent to estimating $w$
We would like to do this by minimizing the empirical loss
$$\min_{w} \; \frac{1}{N} \sum_{(x,y) \in \mathcal{D}} \ell_{\text{task}}(x, y, w)$$
where $\ell_{\text{task}}$ is the loss that we’ll be evaluated on
Very difficult, instead we minimize the sum of a surrogate (typically convex)
loss and a regularizer
$$\min_{w} \; R(w) + \frac{C}{N} \sum_{(x,y) \in \mathcal{D}} \bar{\ell}(x, y, w)$$
More on Learning in CRFs
Given a training set of N pairs $(x, y) \in \mathcal{D}$, we want to estimate the functions
$f_r(x, y_r, w)$
Minimize a surrogate (typically convex) loss and a regularizer
$$\min_{w} \; R(w) + \frac{C}{N} \sum_{(x,y) \in \mathcal{D}} \bar{\ell}(x, y, w)$$
The surrogate loss $\bar{\ell}$: hinge-loss, log-loss
$$\bar{\ell}_{\log}(x, y, w) = -\ln p_{(x,y)}(y; w)$$
$$\bar{\ell}_{\text{hinge}}(x, y, w) = \max_{\hat{y} \in \mathcal{Y}} \left\{ \ell(y, \hat{y}) + w^\top \phi(x, \hat{y}) \right\} - w^\top \phi(x, y)$$
The assumption is that the model is log-linear
$$E(x, y, w) = -F(x, y, w) = -w^\top \phi(x, y)$$
and the features decompose in a graph
$$w^\top \phi(x, y) = \sum_{r \in \mathcal{R}} w_r^\top \phi_r(x, y_r)$$
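Both surrogate losses can be evaluated by brute force on a tiny log-linear model. The sketch below is written in terms of the score F = wᵀφ (so the true output should score highest), uses the Hamming distance as the task loss, and assumes a simple feature map that adds each per-variable feature vector into the slot of its label; all of these are illustrative choices, not the talk's setup.

```python
import itertools
import numpy as np

def score(w, x, y):
    """Log-linear score F(x, y, w) = w^T phi(x, y); phi adds x_i into the slot of label y_i."""
    phi = np.zeros_like(w)
    for xi, yi in zip(x, y):
        phi[yi] += xi
    return float((w * phi).sum())

def surrogate_losses(w, x, y, num_labels):
    """Return (log-loss, hinge-loss) by enumerating every output configuration."""
    outputs = list(itertools.product(range(num_labels), repeat=len(y)))
    scores = np.array([score(w, x, yh) for yh in outputs])
    hamming = np.array([sum(a != b for a, b in zip(y, yh)) for yh in outputs])
    log_loss = np.log(np.exp(scores).sum()) - score(w, x, y)     # -ln p(y | x; w)
    hinge_loss = (hamming + scores).max() - score(w, x, y)       # loss-augmented maximum
    return log_loss, hinge_loss

x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # hypothetical per-variable features
y = (0, 1, 1)
w0 = np.zeros((2, 2))                                 # one weight vector per label
log0, hinge0 = surrogate_losses(w0, x, y, num_labels=2)
print(log0, hinge0)
```

With all-zero weights every configuration is equally likely, so the log-loss equals ln 8 and the hinge-loss equals the largest possible Hamming distance, 3.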
PROBLEM: How can we remove the log-linear restriction?
SOLUTION: Deep Structured Models
With Pictures ;)
[Figure: Standard CNN: a single CNN with output y1. Deep Structured Models: CNN1, CNN2, CNN3 with outputs y1, y2, y3, and CNN4, CNN5 with outputs y1,2 and y2,3]
Learning
Probability of a configuration y:
$$p(y \mid x; w) = \frac{1}{Z(x, w)} \exp F(x, y, w), \qquad Z(x, w) = \sum_{\hat{y} \in \mathcal{Y}} \exp F(x, \hat{y}, w)$$
Maximize the likelihood of training data via
$$w^* = \arg\max_{w} \prod_{(x,y) \in \mathcal{D}} p(y \mid x; w) = \arg\max_{w} \sum_{(x,y) \in \mathcal{D}} \left( F(x, y, w) - \ln \sum_{\hat{y} \in \mathcal{Y}} \exp F(x, \hat{y}, w) \right)$$
Maximum likelihood is equivalent to minimizing cross-entropy when the target
distribution is $p_{(x,y),\text{tg}}(\hat{y}) = \delta(\hat{y} = y)$
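Program of interest:
$$\max_{w} \sum_{(x,y) \in \mathcal{D}, \hat{y}} p_{(x,y),\text{tg}}(\hat{y}) \ln p(\hat{y} \mid x; w)$$
Its gradient:
$$\frac{\partial}{\partial w} \sum_{(x,y) \in \mathcal{D}, \hat{y}} p_{(x,y),\text{tg}}(\hat{y}) \ln p(\hat{y} \mid x; w) = \sum_{(x,y) \in \mathcal{D}, \hat{y}} \left( p_{(x,y),\text{tg}}(\hat{y}) - p(\hat{y} \mid x; w) \right) \frac{\partial}{\partial w} F(\hat{y}, x, w)$$
$$= \sum_{(x,y) \in \mathcal{D}} \underbrace{\left( \mathbb{E}_{p_{(x,y),\text{tg}}}\left[ \frac{\partial}{\partial w} F(\hat{y}, x, w) \right] - \mathbb{E}_{p_{(x,y)}}\left[ \frac{\partial}{\partial w} F(\hat{y}, x, w) \right] \right)}_{\text{moment matching}}$$
Compute predicted distribution $p(\hat{y} \mid x; w)$
Use chain rule to pass back difference between prediction and observation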
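The "target minus prediction" form of the gradient can be verified numerically. The sketch below treats the scores F(ŷ, x, w) of six hypothetical configurations as free parameters, so the derivative of F with respect to them is the identity and the gradient reduces to the delta target minus the predicted distribution; it is then compared against central finite differences.

```python
import numpy as np

def log_likelihood(F, y):
    """ln p(y | x; w) when F[k] is the score of the k-th configuration."""
    return F[y] - np.log(np.exp(F).sum())

def grad_wrt_scores(F, y):
    """Target distribution (a delta at y) minus the predicted distribution."""
    p = np.exp(F - F.max())
    p /= p.sum()
    target = np.zeros_like(F)
    target[y] = 1.0
    return target - p

rng = np.random.default_rng(0)
F, y, eps = rng.normal(size=6), 2, 1e-6
numeric = np.array([
    (log_likelihood(F + eps * e, y) - log_likelihood(F - eps * e, y)) / (2 * eps)
    for e in np.eye(F.size)
])
max_gap = np.abs(numeric - grad_wrt_scores(F, y)).max()
print(max_gap)
```

In a deep model this difference is exactly the quantity that the chain rule propagates back through the network that produced the scores.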
Deep Structured Learning (algo 1)
[Peng et al. NIPS’09]
Repeat until stopping criteria
1. Forward pass to compute F(y, x, w)
2. Compute p(y | x, w)
3. Backward pass via chain rule to obtain gradient
4. Update parameters w
What is the PROBLEM?
How do we even represent F(y, x, w) if Y is large?
How do we compute p(y | x, w)?
Use the Graphical Model Structure
1. Use the graphical model $F(y, x, w) = \sum_{r} f_r(y_r, x, w)$
$$\frac{\partial}{\partial w} \sum_{(x,y) \in \mathcal{D}, \hat{y}} p_{(x,y),\text{tg}}(\hat{y}) \ln p(\hat{y} \mid x; w) = \sum_{(x,y) \in \mathcal{D}, r} \left( \mathbb{E}_{p_{(x,y),r,\text{tg}}}\left[ \frac{\partial}{\partial w} f_r(\hat{y}_r, x, w) \right] - \mathbb{E}_{p_{(x,y),r}}\left[ \frac{\partial}{\partial w} f_r(\hat{y}_r, x, w) \right] \right)$$
2. Approximate marginals $p_r(\hat{y}_r \mid x, w)$ via beliefs $b_r(\hat{y}_r \mid x, w)$ computed by:
   Sampling methods
   Variational methods
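The region marginals needed in the gradient can be obtained by message passing. The sketch below runs sum-product (forward-backward) on a chain, where the resulting beliefs are exact, and checks them against exhaustive enumeration; on graphs with loops the same kind of update only yields approximate beliefs. The chain layout and random scores are assumptions for illustration.

```python
import itertools
import numpy as np

def chain_marginals(unary, pairwise):
    """Unary marginals p(y_i) on a chain via forward-backward message passing."""
    n, L = unary.shape
    psi, pair = np.exp(unary), np.exp(pairwise)
    fwd, bwd = np.ones((n, L)), np.ones((n, L))
    for i in range(1, n):                       # messages passed left to right
        fwd[i] = (fwd[i - 1] * psi[i - 1]) @ pair
    for i in range(n - 2, -1, -1):              # messages passed right to left
        bwd[i] = pair @ (bwd[i + 1] * psi[i + 1])
    beliefs = fwd * psi * bwd
    return beliefs / beliefs.sum(axis=1, keepdims=True)

def brute_force_marginals(unary, pairwise):
    """Sum exp(score) over all configurations; only feasible for tiny models."""
    n, L = unary.shape
    marg = np.zeros((n, L))
    for y in itertools.product(range(L), repeat=n):
        s = sum(unary[i, yi] for i, yi in enumerate(y))
        s += sum(pairwise[a, b] for a, b in zip(y[:-1], y[1:]))
        for i, yi in enumerate(y):
            marg[i, yi] += np.exp(s)
    return marg / marg.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
unary, pairwise = rng.normal(size=(5, 3)), rng.normal(size=(3, 3))
beliefs = chain_marginals(unary, pairwise)
print(beliefs.round(3))
```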
Deep Structured Learning (algo 2)
[Schwing & Urtasun Arxiv’15, Zheng et al. Arxiv’15]
Repeat until stopping criteria
1. Forward pass to compute the f_r(y_r, x, w)
2. Compute the b_r(y_r | x, w) by running approximate inference
3. Backward pass via chain rule to obtain gradient
4. Update parameters w
PROBLEM: We have to run inference in the graphical model every time we want
to update the weights
How to deal with Big Data
Dealing with large number |D| of training examples:
Parallelized across samples (any number of machines and GPUs)
Usage of mini batches
Dealing with large output spaces Y:
Variational approximations
Blending of learning and inference
Approximated Deep Structured Learning
[Schwing & Urtasun Arxiv’15]
Sample parallel implementation:
Partition data D onto compute nodes
Repeat until stopping criteria
1. Each compute node uses GPU for CNN forward pass to compute f_r(y_r, x, w)
2. Each compute node estimates beliefs b_r(y_r | x, w) for assigned samples
3. Backpropagation of difference using GPU to obtain the machine-local gradient
4. Synchronize gradient across all machines using MPI
5. Update parameters w
Better Option: Interleaving Learning and Inference
$$\min_{w} \sum_{(x,y) \in \mathcal{D}} \left( \max_{b_{(x,y)} \in \mathcal{C}_{(x,y)}} \left\{ \sum_{r, \hat{y}_r} b_{(x,y),r}(\hat{y}_r) f_r(x, \hat{y}_r; w) + \sum_{r} c_r H(b_{(x,y),r}) \right\} - F(x, y; w) \right)$$
More efficient algorithm by blending min. w.r.t. w and max. of the beliefs b
After introducing Lagrange multipliers λ, the dual becomes
$$\min_{w, \lambda} \sum_{(x,y), r} c_r \ln \sum_{\hat{y}_r} \exp \frac{f_r(x, \hat{y}_r; w) + \sum_{c \in C(r)} \lambda_{(x,y),c \to r}(\hat{y}_c) - \sum_{p \in P(r)} \lambda_{(x,y),r \to p}(\hat{y}_r)}{c_r} - F(w)$$
with $F(w) = \sum_{(x,y) \in \mathcal{D}} F(x, y; w)$ the sum of empirical function observations
We can then do block coordinate descent to solve the minimization problem,
and we get the following algorithm · · ·
Deep Structured Learning (algo 3)
[Chen & Schwing & Yuille & Urtasun ICML’15]
Repeat until stopping criteria
1. Forward pass to compute the fr(yr, x, w)
2. Update (some) messages λ
3. Backward pass via chain rule to obtain gradient
4. Update parameters w
Deep Structured Learning (algo 4)
[Chen & Schwing & Yuille & Urtasun ICML’15]
Sample parallel implementation:
Partition data D onto compute nodes
Repeat until stopping criteria
1. Each compute node uses GPU for CNN forward pass to compute fr(yr, x, w)
2. Each compute node updates (some) messages λ
3. Backpropagation of difference using GPU to obtain machine-local gradient
4. Synchronize gradient across all machines using MPI
5. Update parameters w
Application 1: Character Recognition
Task: Word Recognition from a fixed vocabulary of 50 words, 28 × 28 sized image patches
Characters have complex backgrounds and suffer many different distortions
Training, validation and test set sizes are 10k, 2k and 2k variations of words
Example words: banal, julep, resty, drein, yojan, mothy, snack, feize, porer
Results
Graphical model has 5 nodes, MLP for each unary and non-parametric pairwise potentials
Joint training, structure, depth and more capacity all help
Entries: word accuracy / character accuracy [%]

One-layer MLP
Graph      Method         H1 = 128       H1 = 256       H1 = 512       H1 = 768       H1 = 1024
-          Unary only      8.60 / 61.32  10.80 / 64.41  12.50 / 65.69  12.95 / 66.66  13.40 / 67.02
1st order  JointTrain     16.80 / 65.28  25.20 / 70.75  31.80 / 74.90  33.05 / 76.42  34.30 / 77.02
1st order  PwTrain        12.70 / 64.35  18.00 / 68.27  22.80 / 71.29  23.25 / 72.62  26.30 / 73.96
1st order  PreTrainJoint  20.65 / 67.42  25.70 / 71.65  31.70 / 75.56  34.50 / 77.14  35.85 / 78.05
2nd order  JointTrain     25.50 / 67.13  34.60 / 73.19  45.55 / 79.60  51.55 / 82.37  54.05 / 83.57
2nd order  PwTrain        10.05 / 58.90  14.10 / 63.44  18.10 / 67.31  20.40 / 70.14  22.20 / 71.25
2nd order  PreTrainJoint  28.15 / 69.07  36.85 / 75.21  45.75 / 80.09  50.10 / 82.30  52.25 / 83.39

Two-layer MLP, H1 = 512
Graph      Method         H2 = 32        H2 = 64        H2 = 128       H2 = 256       H2 = 512
-          Unary only     15.25 / 69.04  18.15 / 70.66  19.00 / 71.43  19.20 / 72.06  20.40 / 72.51
1st order  JointTrain     35.95 / 76.92  43.80 / 81.64  44.75 / 82.22  46.00 / 82.96  47.70 / 83.64
1st order  PwTrain        34.85 / 79.11  38.95 / 80.93  42.75 / 82.38  45.10 / 83.67  45.75 / 83.88
1st order  PreTrainJoint  42.25 / 81.10  44.85 / 82.96  46.85 / 83.50  47.95 / 84.21  47.05 / 84.08
2nd order  JointTrain     54.65 / 83.98  61.80 / 87.30  66.15 / 89.09  64.85 / 88.93  68.00 / 89.96
2nd order  PwTrain        39.95 / 81.14  48.25 / 84.45  52.65 / 86.24  57.10 / 87.61  62.90 / 89.49
2nd order  PreTrainJoint  62.60 / 88.03  65.80 / 89.32  68.75 / 90.47  68.60 / 90.42  69.35 / 90.75
Learned Weights
[Figure: learned weights over the characters a-z. Panels: Unary weights, distance-1 edges, distance-2 edges]
Example 2: Image Tagging
[Chen & Schwing & Yuille & Urtasun ICML’15]
Flickr dataset: 38 possible tags, |Y| = 2^38
10k training, 10k test examples
Training method             Prediction error [%]
Unary only                  9.36
Piecewise                   7.70
Joint (with pre-training)   7.25
[Figure: two plots against Time [s], each comparing "w/o blend" and "w blend": Neg. Log-Likelihood (axis scaled by 10^5) and Training error]
Visual results
female/indoor/portrait
female/indoor/portrait
sky/plant life/tree
sky/plant life/tree
animals/dog/indoor
animals/dog
water/animals/sea
water/animals/sky
indoor/flower/plant life
∅
Learned class correlations
Only part of the correlations are shown for clarity
Example 3: Semantic Segmentation
[Chen et al. ICLR’15; Krähenbühl & Koltun NIPS’11,ICML’13; Zheng et al. Arxiv’15;
Schwing & Urtasun Arxiv’15 ]
|Y| = 21^(350·500), ≈ 10k training, ≈ 1500 test examples
Oxford-net pre-trained on PASCAL, predicts 40 × 40 + upsampling
The graphical model is a fully connected CRF with Gaussian potentials
Inference using (algo2), with mean-field as approx. inference
[Figure: network pipeline with Pooling & Subsampling, Interpolation Layer and Fully Connected CRF]
Pascal VOC 2012 dataset
[Chen et al. ICLR’15; Krähenbühl & Koltun NIPS’11,ICML’13; Zheng et al. Arxiv’15;
Schwing & Urtasun Arxiv’15 ]
|Y| = 21^(350·500), ≈ 10k training, ≈ 1500 test examples
Oxford-net pre-trained on PASCAL, predicts 40 × 40 + upsampling
The graphical model is a fully connected CRF with Gaussian potentials
Inference using (algo2), with mean-field as approx. inference
Training method   Mean IoU [%]
Unary only        61.476
Joint             64.060
Disclaimer: Much better results now with a few tricks. Zheng et al. 15 is
now at 74.7%!
Visual results
Example 4: 3D Object Proposals for Detection
Use structured prediction to learn to propose object candidates (i.e., grouping)
[Figure panels: (image), (stereo), (depth-feat), (prior)]
Use deep learning to do final detection: OxfordNet
Only 1.2s to generate proposals
[Figure: proposal recall vs. # candidates (log scale) for BING, SS, EB, MCG, MCG-D and Ours. Rows: Car (recall at IoU threshold 0.7), Pedestrian and Cyclist (recall at IoU threshold 0.5). Columns: (a) Easy, (b) Moderate, (c) Hard.]
Figure: Proposal recall: 0.7 overlap threshold for Car, and 0.5 for rest.
[Figure: recall vs. IoU overlap threshold (0.5 to 1) for BING, SS, EB, MCG, MCG-D and Ours. Columns: (a) Easy, (b) Moderate, (c) Hard.]
Legend values, one row per panel (the panel each row belongs to is not indicated):
BING   SS     EB     MCG    MCG-D  Ours
11.8   15.9   21.9   25.4   49.6   65.6
6.7    3      5.4    8.3    19.6   44.9
5.9    4.2    3.2    5.5    10.8   55.4
4.5    3.4    2.7    4.3    10.7   39.7
4.1    3.6    2.6    4.4    10.2   40
5.4    2.8    4.2    6.8    14     36.4
5.8    2.8    4.5    7.3    16.1   41.1
6.8    10.3   13.7   17.6   32.8   58.6
7.4    10.9   15.4   19.9   38.8   59.5
Figure: Recall vs IoU for 500 proposals. (Top) Cars, (Middle) Pedestrians,
(Bottom) Cyclists.
KITTI Detection Results
[ X. Chen, K. Kundu, S. Fidler and R. Urtasun, On Arxiv soon]
Methods compared: LSVM-MDPM-sv, SquaresICF, DPM-C8B1, MDPM-un-BB, DPM-VOC+VP, OC-DPM, AOG, SubCat, DA-DPM, Fusion-DPM, R-CNN, FilteredICF, pAUCEnsT, MV-RGBD-RF, 3DVP, Regionlets, Ours
Not every method has an entry for every class, so the classes have different numbers of rows.

Cars
Easy    Moderate  Hard
68.02   56.48     44.18
74.33   60.99     47.16
71.19   62.16     48.43
74.95   64.71     48.76
74.94   65.95     53.86
84.36   71.88     59.27
84.14   75.46     59.71
87.46   75.77     65.38
84.75   76.45     59.70
88.33   87.14     76.11

Pedestrians
Easy    Moderate  Hard
47.74   39.36     35.95
57.33   44.42     40.08
38.96   29.03     25.61
59.48   44.86     40.37
54.67   42.34     37.95
56.36   45.51     41.08
59.51   46.67     42.05
61.61   50.13     44.79
61.14   53.98     49.29
65.26   54.49     48.60
70.21   54.56     51.25
73.14   61.15     55.21
70.16   59.35     52.76

Cyclists
Easy    Moderate  Hard
35.04   27.50     26.21
43.49   29.04     26.20
42.43   31.08     28.23
51.62   38.03     33.38
54.02   39.72     34.82
70.41   58.72     51.83
77.94   67.35     59.49

Table: Average Precision (AP) (in %) on the test set of the KITTI Object
Detection Benchmark.
KITTI Detection Results
[ X. Chen, K. Kundu, S. Fidler and R. Urtasun, On Arxiv soon]

Cars
Method         Easy    Mod.    Hard
AOG            43.81   38.21   31.53
DPM-C8B1       59.51   50.32   39.22
LSVM-MDPM-sv   67.27   55.77   43.59
DPM-VOC+VP     72.28   61.84   46.54
OC-DPM         73.50   64.42   52.40
SubCat         83.41   74.42   58.83
3DVP           86.92   74.59   64.11
Ours           83.03   80.21   69.60

Pedestrians (only some methods have entries)
Easy    Mod.    Hard
31.08   23.37   20.72
43.58   35.49   32.42
53.55   39.83   35.73
44.32   34.18   30.76
48.58   40.56   36.08

Cyclists (only some methods have entries)
Easy    Mod.    Hard
27.25   19.25   17.95
27.54   22.07   21.45
30.52   23.17   21.58
57.72   48.21   42.72

Table: AOS scores on the KITTI Object Detection and Orientation Benchmark
(test set).
Car Results
[Figure: Best prop., Ground truth, Top 100 prop., Images]
[ X. Chen, K. Kundu, Y. Zhu, S. Fidler and R. Urtasun, On Arxiv soon]
Pedestrian Results
[Figure: Best prop., Ground truth, Top 100 prop., Images]
[ X. Chen, K. Kundu, Y. Zhu, S. Fidler and R. Urtasun, On Arxiv soon]
Cyclist Results
[Figure: Best proposals, Ground truth, Top 100 Prop., 2D images]
[ X. Chen, K. Kundu, Y. Zhu, S. Fidler and R. Urtasun, On Arxiv soon]
Example 5: More Precise Grouping
Given a single image, we want to infer Instance-level Segmentation and
Depth Ordering
Use deep convolutional nets to do both tasks simultaneously
Trick: Encode both tasks with a single parameterization
Run the conv. net at multiple resolutions
Use MRF to form a single coherent explanation across all the image
combining the conv nets at multiple resolutions
Important: we do not use a single pixel-wise training example!
Results on KITTI
[Z. Zhang, A. Schwing, S. Fidler and R. Urtasun, ICCV ’15]
More Results (including failures/difficulties)
[Z. Zhang, A. Schwing, S. Fidler and R. Urtasun, ICCV ’15]
Example 6: Enhancing freely-available maps
[G. Matthyus, S. Wang, S. Fidler and R. Urtasun, ICCV ’15]
Toronto: Airport
San Francisco: Russian Hill
NYC: Times square
Kyoto: Kinkakuji
Sydney: At Harbour bridge
Monte Carlo: Casino
Enhancing OpenStreetMaps
Can be trained on a single image and tested on the whole world
Trick: Do not reason at the pixel level
Preserves topology and is state-of-the-art
Example 7: Fashion
[E. Simo-Serra, S. Fidler, F. Moreno, R. Urtasun, CVPR15]
Figure : An example of a post on http://www.chictopia.com. We crawled the
site for 180K posts.
How Fashionable Are You?
Figure: We ran a face detector that also predicts the beauty of the face, age,
ethnicity, and mood.
Face detector + attributes
http://www.rekognition.com
How Fashionable Are You?
Figure : Our model is a Conditional Random Field that uses many visual and
textual features, as well as meta-data features such as where the user is from.
How Fashionable Are You?
Figure : We predict fashionability of users.
Figure : We predict what kind of outfit the person wears.
How Fashionable Can You Become?
Figure: Examples of recommendations provided by our model. In parentheses
we show the fashionability scores.
Not a big deal... but
Appeared all over the Tech and News press
All over the Fashion press
International News and TV (Fox, BBC, Sky News, RTVE, etc.)
Best Quote Award
Cosmopolitan (UK): The technology scores your facial
attributes (this just keeps getting better, doesn’t it) from
before combining all the information using an equation
SO complex we won’t begin to go into it.
But the Most Important Impact
Previous Work
Use the hinge loss to optimize the unaries only, which are neural nets (Li and
Zemel 14). Correlations between variables are not used for learning
If inference is tractable, Conditional Neural Fields (Peng et al. 09) use
back-propagation on the log-loss
Decision Tree Fields (Nowozin et al. 11) use complex region potentials
(decision trees), but given the tree, the model is still linear in the parameters.
Trained using pseudo likelihood.
Restricted Boltzmann Machines (RBMs): Generative model that has a very
particular architecture so that inference is tractable via sampling
(Salakhutdinov 07). Problems with partition function.
(Domke 13) treats the problem as learning a set of logistic regressors
Fields of experts (Roth et al. 05), not deep, use CD training
Many ideas go back to (Bottou 91)
Conclusions and Future Work
Conclusions:
Modeling of correlations between variables
Non-linear dependence on parameters
Joint training of many convolutional neural networks
Parallel implementation
Wide range of applications: Word recognition, Tagging, Segmentation
Future work:
Latent Variables
More applications
Acknowledgments
Liang Chieh Chen (student)
Xiaozhi Chen (student)
Sanja Fidler
Gellert Matthyus (student)
Francesc Moreno
Alexander Schwing (postdoc)
Edgar Simo-Serra (student)
Shenlong Wang (student)
Allan Yuille
Ziyu Zhang (student)
Yukun Zhu (student)
The introductory slides on deep learning were inspired by M. Ranzato's tutorial on
deep learning and S. Fidler's lecture notes for CSC420
```
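
The dual objective on the "Interleaving Learning and Inference" slide contains, for every region r, a term of the form c_r ln Σ exp(·/c_r) over the region's states. Below is a minimal sketch of how that single term could be evaluated in a numerically stable way. The function name `region_dual_term` and the input layout are assumptions made for illustration, not the authors' implementation: each list holds one value per joint state of the region, with the incoming and outgoing messages already summed and aligned to those states.

```python
import math


def region_dual_term(f_r, msgs_in, msgs_out, c_r):
    """Evaluate c_r * ln sum_y exp((f_r(y) + msgs_in(y) - msgs_out(y)) / c_r).

    f_r      : scores f_r(x, y_r; w), one per state of region r (hypothetical layout)
    msgs_in  : summed messages from children, one per state
    msgs_out : summed messages to parents, one per state
    c_r      : positive counting number / temperature of the region
    """
    if c_r <= 0:
        raise ValueError("c_r must be positive")
    scores = [(f + a - b) / c_r for f, a, b in zip(f_r, msgs_in, msgs_out)]
    # Subtract the maximum before exponentiating so large scores do not overflow.
    m = max(scores)
    return c_r * (m + math.log(sum(math.exp(s - m) for s in scores)))


if __name__ == "__main__":
    f = [1.0, 3.0, 2.0]
    zeros = [0.0, 0.0, 0.0]
    for c in (1.0, 0.1, 0.001):
        print(c, region_dual_term(f, zeros, zeros, c))
```

As c_r shrinks, the term approaches the maximum reparameterized score, which is why the same expression covers both the soft-max (log-loss) and max-margin-like regimes.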