Learning Deep Structured Models
Raquel Urtasun, University of Toronto
August 21, 2015

Current Status of your Field?

Roadmap
1. Part I: Deep Learning
2. Part II: Deep Structured Models

Part I: Deep Learning

Deep Learning
- Supervised models
- Unsupervised learning (will not talk about this today)
- Generative models (will not talk about this today)

Binary Classification
- Given inputs x and outputs t ∈ {−1, 1}
- We want to fit a hyperplane that divides the space in half: y∗ = sign(w^T x∗ + w_0)
- SVMs try to maximize the margin

Non-linear Predictors
- How can we make our classifier more powerful? Compute non-linear functions of the input: y∗ = F(x∗, w)
- Two types of approaches:
  - Kernel trick: fix the non-linear functions and optimize the linear parameters on top of the non-linear mapping: y∗ = sign(w^T φ(x∗) + w_0)
  - Deep learning: learn parametric non-linear functions: y∗ = F(x∗, w)

Why "Deep"? Supervised Learning: Examples
[Figure (M. Ranzato): example tasks — classification ("dog"), denoising (regression), OCR ("2 3 4 5", structured prediction)]

Why "Deep"? Supervised Deep Learning
[Figure (M. Ranzato): the same tasks addressed with deep networks]

Neural Networks
- Deep learning uses a composite of simpler functions, e.g., ReLU, sigmoid, tanh, max
- Note: a composite of linear functions is linear!
- Example, a 2-layer NNet: x → h^1 = max(0, W_1^T x) → h^2 = max(0, W_2^T h^1) → y = W_3^T h^2
- x is the input
- y is the output (what we want to predict)
- h^i is the i-th hidden layer
- W^i are the parameters of the i-th layer

Evaluating the Function
- Forward propagation: compute the output given the input
- Fully connected layer: each hidden unit takes as input all the units from the previous layer
- The non-linearity is called a ReLU (rectified linear unit), with x ∈ ℝ^D, b^i ∈ ℝ^{N_i} the biases and W^i ∈ ℝ^{N_i×N_{i−1}} the weights
- Compute it in a compositional way:
  h^1 = max(0, W^1 x + b^1)
  h^2 = max(0, W^2 h^1 + b^2)
  y = max(0, W^3 h^2 + b^3)
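To make the forward pass concrete, here is a minimal Matlab sketch of the composition above; the layer sizes and the random weights are invented for the example:

    % Toy forward pass for the network above (made-up sizes and weights).
    D = 4; N1 = 8; N2 = 6; C = 3;            % input, hidden and output sizes (arbitrary)
    x  = randn(D, 1);
    W1 = randn(N1, D);  b1 = zeros(N1, 1);
    W2 = randn(N2, N1); b2 = zeros(N2, 1);
    W3 = randn(C, N2);  b3 = zeros(C, 1);
    h1 = max(0, W1 * x  + b1);               % first hidden layer (ReLU)
    h2 = max(0, W2 * h1 + b2);               % second hidden layer (ReLU)
    y  = W3 * h2 + b3;                       % output scores (linear here; the slides also show a ReLU variant)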
Alternative Graphical Representation
- A layer h^k = max(0, W^k h^{k−1}) can be drawn either as a function block between the vectors h^{k−1} and h^k, or as a bipartite graph in which each unit h_i^k is connected to the units of h^{k−1} through the weights w_{i,j}^k. [Figure: M. Ranzato]

ReLU Interpretation
- Piece-wise linear tiling: the mapping is locally linear. [Figure: M. Ranzato]

Why Hierarchical? Interpretation
[Figure (M. Ranzato): images mapped to binary codes, e.g., [1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 ...] → motorbike, [0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 ...] → truck]

Why Hierarchical?
- Input image → low-level parts → mid-level parts → high-level parts → prediction of class
- Distributed representations, feature sharing, compositionality
[Lee et al., "Convolutional DBN's ...", ICML 2009; slide: M. Ranzato]

Learning
- We want to estimate the parameters, biases and hyper-parameters (e.g., number of layers, number of units) such that we make good predictions
- Collect a training set of input-output pairs {x, t}
- Encode the output with 1-of-K encoding: t = [0, · · · , 1, · · · , 0]
- Define a loss per training example and minimize the empirical risk
  L(w) = (1/N) Σ_{i=1}^N ℓ(w, x^(i), t^(i)) + R(w)
  with N the number of examples, R a regularizer, and w containing all parameters
- What do we want to use as ℓ? The task loss: how we are going to be evaluated at test time

Loss Functions
- The task loss is too difficult to optimize directly, so one uses a surrogate that is typically convex
- Probability of class k given the input (softmax):
  p(c_k = 1 | x) = exp(y_k) / Σ_{j=1}^C exp(y_j)
- Cross entropy is the most used loss function for classification:
  ℓ(x, t, w) = − Σ_i t_i log p(c_i | x)
- Use gradient descent to train the network: min_w (1/N) Σ_i ℓ(w, x^(i), t^(i)) + R(w)
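A small Matlab sketch of the softmax and the cross-entropy loss for a single example; the scores and the 1-of-K target are toy values, and the max score is subtracted before exponentiating for numerical stability:

    % Softmax + cross-entropy for one training example (toy values).
    y = [2.0; -1.0; 0.5];                 % class scores produced by the network
    t = [1; 0; 0];                        % 1-of-K target encoding
    p = exp(y - max(y));                  % subtract max(y) for numerical stability
    p = p / sum(p);                       % softmax probabilities
    loss = -sum(t .* log(p));             % cross-entropy
    dy = p - t;                           % gradient w.r.t. the scores (used below)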
Backpropagation
- Efficient computation of the gradients by applying the chain rule
- With p(c_k = 1 | x) = exp(y_k) / Σ_{j=1}^C exp(y_j) and ℓ(x, t, w) = − Σ_i t_i log p(c_i | x), compute the derivative of the loss w.r.t. the output:
  ∂ℓ/∂y = p(c|x) − t
- Note that the forward pass is necessary to compute ∂ℓ/∂y
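This compact form follows from one differentiation step; spelled out for a 1-of-K target t (so Σ_i t_i = 1), using log p(c_i|x) = y_i − log Σ_j exp(y_j):

\[
\frac{\partial \ell}{\partial y_k}
= -\sum_i t_i \frac{\partial}{\partial y_k}\Big(y_i - \log\sum_j \exp(y_j)\Big)
= -t_k + \frac{\exp(y_k)}{\sum_j \exp(y_j)}
= p(c_k = 1 \mid x) - t_k .
\]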
- Given ∂ℓ/∂y, if we can compute the Jacobian of each module:
  ∂ℓ/∂W^3 = (∂ℓ/∂y)(∂y/∂W^3) = (p(c|x) − t)(h^2)^T
  ∂ℓ/∂h^2 = (∂ℓ/∂y)(∂y/∂h^2) = (W^3)^T (p(c|x) − t)
- We need to compute the gradient w.r.t. both the inputs and the parameters of each layer
- Given ∂ℓ/∂h^2, if we can compute the Jacobian of each module:
  ∂ℓ/∂W^2 = (∂ℓ/∂h^2)(∂h^2/∂W^2)
  ∂ℓ/∂h^1 = (∂ℓ/∂h^2)(∂h^2/∂h^1)

Gradient Descent
- Gradient descent is a first-order method, where one takes steps proportional to the negative of the gradient of the function at the current point:
  x_{n+1} = x_n − γ_n ∇F(x_n)
- Example: f(x) = x^4 − 3x^3 + 2
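A few lines of Matlab running gradient descent on this toy example; the starting point and step size are chosen by hand:

    % Gradient descent on f(x) = x^4 - 3x^3 + 2, with f'(x) = 4x^3 - 9x^2.
    x = 4;                                % initial point (arbitrary)
    gamma = 0.01;                         % fixed step size
    for n = 1:200
        x = x - gamma * (4*x^3 - 9*x^2);  % step along the negative gradient
    end
    % x approaches the minimizer 9/4 = 2.25, where f'(x) = 0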
Learning via Gradient Descent
- Use gradient descent to train the network: min_w (1/N) Σ_{i=1}^N ℓ(w, x^(i), t^(i)) + R(w)
- We need to compute at each iteration: w_{n+1} = w_n − γ_n ∇L(w_n)
- Use the backward pass to compute ∇L(w_n) efficiently
- Recall that the backward pass requires the forward pass first

Toy Code (Matlab): Neural Net Trainer [M. Ranzato]

    % F-PROP (slide convention: h{0} denotes the input; shift indices by one to run in Matlab)
    for i = 1 : nr_layers - 1
        [h{i}, jac{i}] = nonlinearity(W{i} * h{i-1} + b{i});
    end
    h{nr_layers-1} = W{nr_layers-1} * h{nr_layers-2} + b{nr_layers-1};
    prediction = softmax(h{nr_layers-1});

    % CROSS ENTROPY LOSS
    loss = -sum(sum(log(prediction) .* target)) / batch_size;

    % B-PROP
    dh{nr_layers-1} = prediction - target;
    for i = nr_layers - 1 : -1 : 1
        Wgrad{i} = dh{i} * h{i-1}';
        bgrad{i} = sum(dh{i}, 2);
        dh{i-1} = (W{i}' * dh{i}) .* jac{i-1};
    end

    % UPDATE
    for i = 1 : nr_layers - 1
        W{i} = W{i} - (lr / batch_size) * Wgrad{i};
        b{i} = b{i} - (lr / batch_size) * bgrad{i};
    end

Dealing with Big Data
- min_w (1/N) Σ_{i=1}^N ℓ(w, x^(i), t^(i)) + R(w)
- We need to compute at each iteration w_{n+1} = w_n − γ_n ∇L(w_n), with
  ∇L(w_n) = (1/N) Σ_{i=1}^N ∇ℓ(w, x^(i), t^(i)) + ∇R(w)
- Too expensive when we have millions of examples
- Instead, approximate the gradient with a mini-batch S (a subset of the examples):
  (1/N) Σ_{i=1}^N ∇ℓ(w, x^(i), t^(i)) ≈ (1/|S|) Σ_{i∈S} ∇ℓ(w, x^(i), t^(i))
- This is called stochastic gradient descent

Stochastic Gradient Descent with Momentum
- Stochastic gradient descent update: w_{n+1} = w_n − γ_n ∇L(w_n), with ∇L(w_n) = (1/|S|) Σ_{i∈S} ∇ℓ(w, x^(i), t^(i)) + ∇R(w)
- We can also use momentum:
  ∆ ← κ∆ + ∇L
  w ← w − γ∆
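A minimal sketch of the momentum update, run here on a toy quadratic loss so it is self-contained; in a real network the gradient would come from the backward pass, and the decay and step size are toy values:

    % Momentum SGD on the toy loss 0.5*||W||^2, whose gradient is W itself.
    kappa = 0.9; gamma = 0.1;             % momentum decay and learning rate (toy values)
    W = randn(3); delta = zeros(3);       % parameters and velocity
    for n = 1:100
        grad = W;                         % gradient of the toy loss; use backprop in practice
        delta = kappa * delta + grad;     % accumulate a running descent direction
        W = W - gamma * delta;
    end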
- Many other variants exist

How to deal with large Input Spaces
- Images can have millions of pixels, i.e., x is very high dimensional
- Prohibitive to have a fully connected layer
- We can use a locally connected layer
- This is good when the input is registered

Locally Connected Layer [M. Ranzato]
- Example: 200x200 image, 40K hidden units, filter size 10x10 → 4M parameters
- Note: this parameterization is good when the input image is registered (e.g., face recognition)

Stationarity?
- The statistics are similar at different locations

Convolutional Neural Net
- Idea: statistics are similar at different locations (LeCun 1998)
- Connect each hidden unit to a small input patch and share the weights across space
- This is called a convolution layer, and the network is a convolutional network

Convolutional Layer [figures: M. Ranzato]
- h_j^n = max(0, Σ_{k=1}^K h_k^{n−1} ∗ w_{jk}^n)
- Learn multiple filters. E.g.: 200x200 image, 100 filters, filter size 10x10 → 10K parameters

Pooling Layer [M. Ranzato]
- By "pooling" (e.g., taking the max of) filter responses at different locations, we gain robustness to the exact spatial location of features

Pooling Options
- Max pooling: return the maximal argument
- Average pooling: return the average of the arguments
- Other types of pooling exist: L2 pooling
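A toy Matlab sketch of one convolutional layer (two hand-picked filters on a single-channel input) followed by ReLU and non-overlapping 2x2 max pooling; all sizes and filters are invented for the example:

    % One conv layer + ReLU + 2x2 max pooling (toy example).
    x = rand(8, 8);                                   % single-channel input
    w = cat(3, ones(3)/9, [1 0 -1; 1 0 -1; 1 0 -1]);  % a blur filter and a vertical-edge filter
    for j = 1:size(w, 3)
        h = max(0, conv2(x, w(:,:,j), 'valid'));      % convolution + ReLU -> 6x6 feature map
        p = zeros(3, 3);                              % pooled output
        for r = 1:3
            for c = 1:3
                patch = h(2*r-1:2*r, 2*c-1:2*c);
                p(r, c) = max(patch(:));              % max over each 2x2 pool
            end
        end
    end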
Pooling Layer: Receptive Field Size [M. Ranzato]
- If the convolutional filters have size KxK and stride 1, and the pooling layer has pools of size PxP, then each unit in the pooling layer depends upon a patch (at the input of the preceding conv. layer) of size (P+K−1)x(P+K−1)

Now let's make this very deep

Convolutional Neural Networks (CNN)
- Remember filtering from your image processing / computer vision course? If our filter was [−1, 1], we got a vertical edge detector
- Now imagine we want to have many filters (e.g., vertical, horizontal, corners, one for dots). We will use a filterbank
- Applying a filterbank to an image yields a cube-like output, a 3D matrix in which each slice is the output of convolution with one filter
- Do some additional tricks. A popular one is called max pooling. Why would you do this? To get invariance to small shifts in position
- Now add another "layer" of filters. For each filter again do convolution, but this time with the output cube of the previous layer
- Keep adding a few layers. What's the purpose of more layers? Why can't we just have a full bunch of filters in one layer?
- In the end add one or two fully (or densely) connected layers. In these layers we don't do convolution, we just take a dot-product between the "filter" and the output of the previous layer
- Add one final layer: a classification layer. Each dimension of this vector tells us the probability of the input image being of a certain class
[Slide credit: Sanja Fidler; pictures adapted from A. Krizhevsky]
- The trick is to not hand-fix the weights, but to train them. Train them such that when the network sees a picture of a dog, the last layer will say "dog"; for a picture of a cat, "cat"; for a boat, "boat"...
- The more pictures the network sees, the better

Classification
- Once trained, we feed in an image or a crop, run it through the network, and read out the class with the highest probability in the last (classification) layer [Slide credit: Sanja Fidler]

Classification Performance
- ImageNet, the main challenge for object classification: http://image-net.org/
- 1000 classes, 1.2M training images, 150K for test

Architecture for Classification [Krizhevsky et al., "ImageNet Classification with deep CNNs", NIPS 2012; slide: M. Ranzato]
- From input to category prediction: CONV → LOCAL CONTRAST NORM → MAX POOLING → CONV → LOCAL CONTRAST NORM → MAX POOLING → CONV → CONV → CONV → MAX POOLING → FULLY CONNECTED → FULLY CONNECTED → LINEAR
- Total nr. of params: 60M; total nr. of flops: 832M. Per layer (params / flops): CONV 35K / 105M, CONV 307K / 223M, CONV 884K / 149M, CONV 1.3M / 224M, CONV 442K / 74M, FULLY CONNECTED 37M / 37M, FULLY CONNECTED 16M / 16M, LINEAR 4M / 4M
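As a sanity check on these counts: a conv layer has (filter height × filter width × input channels × number of filters) weights. For the first layer of this network (96 filters of size 11×11 over the 3 input channels, per Krizhevsky et al.):

\[
11 \times 11 \times 3 \times 96 = 34{,}848 \approx 35\text{K parameters},
\]

which matches the 35K above; weight sharing is what keeps the conv layers this small compared to the fully connected ones.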
The 2012 Computer Vision Crisis
[Figures: classification and detection benchmark results around 2012]

Neural Networks as Descriptors
- What vision people like to do is take the already trained network (avoid one week of training), remove the last classification layer, take the top remaining layer (the 4096-dimensional vector here) and use it as a descriptor (feature vector)
- Now train your own classifier on top of these features for arbitrary classes
- This is quite hacky, but works miraculously well
- Everywhere we were using SIFT (or anything else), you can use NNs
[Slide credit: Sanja Fidler]

Caltech 256
[Figure (Zeiler & Fergus, "Visualizing and Understanding Convolutional Networks", arXiv 1311.2901, 2013): accuracy vs. number of training images per class on Caltech-256; "Our Model" vs. Bo et al. and Sohn et al., with the gap already visible at 6 training examples]

And Detection?
- For classification we feed the full image to the network. But how can we perform detection?

The Era Post-AlexNet: PASCAL VOC detection
- Extract object proposals with bottom-up grouping and then classify them using your big net

Detection Performance
- PASCAL VOC challenge: http://pascallin.ecs.soton.ac.uk/challenges/VOC/
- PASCAL has 20 object classes, 10K images for training, 10K for test

Detection Performance a Year Ago: 40.4%
- A year ago, no networks: results on the main recognition benchmark, the PASCAL VOC challenge
- Leading method: segDPM (ours). Those were the good times...
[S. Fidler, R. Mottaghi, A. Yuille, R. Urtasun, "Bottom-up Segmentation for Top-down Detection", CVPR'13]

So Neural Networks are Great
- So networks turn out to be great. Everything is deep, even if it's shallow!
- Companies lead the competitions: ImageNet, KITTI, but not yet PASCAL
- At this point Google, Facebook, Microsoft, Baidu "steal" most neural network professors from academia... and a lot of our good students :(
- But to train the networks you need quite a bit of computational power. So what do you do? Buy even more.
- And train more layers: 16 instead of 7 before, 144 million parameters
[K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", arXiv 2014; slide credit: Sanja Fidler]

The Era Post-AlexNet: PASCAL VOC detection
[Figure: detection results after deep networks]
What if we Want Semantic Segmentation?
- Every layer, even a fully connected one, can be treated as a convolutional layer; then we can deal with arbitrary dimensions of the input
- The network can work on superpixels, or can directly operate on pixels
- Due to pooling, the output is typically lower dimensional than the input, so use interpolation
- PASCAL VOC, 65% IOU
- More to come in Part II

Practical Tips

How to choose Hyperparameters?
- Hyperparameters: architecture, learning rate, number of layers, number of features, etc. How to choose them?
1. Cross-validation
2. Grid search (need lots of GPUs)
3. Random search [Bergstra & Bengio, JMLR 2012]
4. Bayesian optimization [Whetlab, Toronto]

Good to Know [slide credit: M. Ranzato]
- ALWAYS check gradients numerically by finite differences (see the sketch below)!
- Measure error on both training and validation set, NEVER TEST
- Train on a small subset of the data and check that you can overfit (i.e., error → 0)
- Visualize features (feature maps should be uncorrelated and have high variance) [Figure: good vs. bad feature maps, M. Ranzato]
- Visualize parameters
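A minimal sketch of such a finite-difference check, on a toy loss f(w) = 0.5·||w||² whose analytic gradient is w itself; for a real network, f would run the forward pass and return the loss:

    % Central finite-difference gradient check on a toy loss.
    f = @(w) 0.5 * (w' * w);          % toy loss; its analytic gradient is w
    w = randn(5, 1);
    g_analytic = w;
    eps_fd = 1e-5;
    g_numeric = zeros(5, 1);
    for i = 1:5
        e = zeros(5, 1); e(i) = eps_fd;
        g_numeric(i) = (f(w + e) - f(w - e)) / (2 * eps_fd);
    end
    max(abs(g_numeric - g_analytic))  % should be tiny (~1e-10)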
What if it doesn't work? [slide credit: M. Ranzato]
- Training diverges:
  - Decrease the learning rate
  - Check the gradients
- Parameters collapse / loss is minimized but accuracy is low:
  - Is the loss function appropriate?
  - Does the loss function have degenerate solutions?
- Network is underperforming:
  - Make it bigger
  - Visualize hidden units/params and fix the optimization
- Network is too slow:
  - GPU, distributed framework, make the net smaller
Improving Generalization [slide credit: M. Ranzato]
- Weight sharing (reduce the number of parameters)
- Data augmentation (e.g., jittering, noise injection, transformations)
- Dropout [Hinton et al.]: randomly drop units (along with their connections) from the neural network during training; use for the fully connected layers only (see the sketch below)
- Regularization: weight decay (L2, L1)
- Sparsity in the hidden units
- Multi-task learning
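A minimal sketch of dropout on one hidden layer; this is the "inverted" variant, which rescales at training time so nothing changes at test time (the original formulation instead rescales the weights at test time), with an assumed drop probability of 0.5:

    % Inverted dropout on a hidden layer h (training time only; p_drop assumed 0.5).
    p_drop = 0.5;
    h = max(0, randn(8, 1));               % toy hidden activations
    mask = rand(size(h)) > p_drop;         % keep each unit with probability 1 - p_drop
    h_train = (h .* mask) / (1 - p_drop);  % rescale so the expectation matches h
    h_test = h;                            % at test time, use all units unchanged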
Software [slide credit: M. Ranzato]
- Torch7: learning library that supports neural net training
  http://www.torch.ch
  http://code.cogbits.com/wiki/doku.php (tutorial with demos by C. Farabet)
  https://github.com/sermanet/OverFeat
- Theano: Python-based learning library (U. Montreal) that does automatic differentiation
  http://deeplearning.net/software/theano/
- Efficient CUDA kernels for ConvNets (Krizhevsky): code.google.com/p/cuda-convnet
- Caffe (Yangqing Jia): http://caffe.berkeleyvision.org
- Deep Structured Models: http://www.alexander-schwing.de/ (soon available)

Part II: Deep Structured Learning

Your current Status?

What's next?
1. Theoretical understanding
2. Unsupervised learning
3. Structured models

Structure!
- Many vision problems are complex and involve predicting many random variables that are statistically related:
  - Scene understanding: x = image, y = room layout
  - Tag prediction: x = image, y = tag "combo"
  - Segmentation: x = image, y = segmentation

Deep Learning
- Complex mapping F(x, y, w) to predict output y given input x through a series of matrix multiplications, non-linearities and pooling operations [Figure: ImageNet CNN, Krizhevsky et al. 13]
- We typically train the network to predict one random variable (e.g., ImageNet) by minimizing cross-entropy
- Multi-task extensions: sum the loss of each task, and share part of the features (e.g., segmentation)
- Use an MRF as a post-processing step

PROBLEM: How can we take into account complex dependencies when predicting multiple variables?
SOLUTION: Graphical models

Graphical Models
- Convenient tool to illustrate dependencies among random variables:
  E(y) = − Σ_i f_i(y_i) − Σ_{i,j∈E} f_{ij}(y_i, y_j) − Σ_α f_α(y_α)
  (unary, pairwise and high-order potentials)
- Widespread usage among different fields: vision, NLP, computational biology, ...

Compact Notation
- For the purpose of this talk we use a more compact notation: E(y, w) = − Σ_{r∈R} f_r(y_r, w)
- r is a region and R is the set of all regions
- y_r is of any order
- The functions f_r are a function of the parameters w

Continuous vs Discrete MRFs
- Discrete MRFs: y_i ∈ {1, · · · , C_i}
- Continuous MRFs: y_i ∈ Y ⊆ ℝ
- Hybrid MRFs with continuous and discrete variables
- Today's talk: only discrete MRFs

Probabilistic Interpretation
- The energy is defined as E(y, w) = −F(y, w) = − Σ_{r∈R} f_r(y_r, w)
- We can construct a probability distribution over the outputs:
  p(y; w) = (1/Z(w)) exp(Σ_{r∈R} f_r(y_r, w)),
  with Z(w) = Σ_y exp(Σ_{r∈R} f_r(y_r, w)) the partition function
- CRFs vs MRFs: condition on the input x,
  p(y | x; w) = (1/Z(x, w)) exp(Σ_{r∈R} f_r(x, y_r, w)),
  with Z(x, w) = Σ_y exp(Σ_{r∈R} f_r(x, y_r, w)) the partition function
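For a model this small, Z can be computed by brute-force enumeration; a toy Matlab sketch for a 3-variable binary chain with made-up potentials (the 2^3 = 8 configurations also give us the MAP assignment):

    % Brute-force partition function and MAP for a 3-node binary chain MRF.
    fu = [0.5 -0.2; 0.1 0.3; -0.4 0.2];   % fu(i, yi): unary potentials (made up)
    fp = [0.8 -0.8; -0.8 0.8];            % fp(yi, yj): pairwise potential on edges (1,2), (2,3)
    Z = 0; best = -inf;
    for y1 = 1:2, for y2 = 1:2, for y3 = 1:2
        F = fu(1,y1) + fu(2,y2) + fu(3,y3) + fp(y1,y2) + fp(y2,y3);
        Z = Z + exp(F);                   % accumulate the partition function
        if F > best, best = F; ymap = [y1 y2 y3]; end
    end, end, end
    % p(y) = exp(F(y)) / Z; enumeration is exact but only feasible for tiny models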
Inference Tasks
- MAP: maximum a posteriori estimate, or minimum energy configuration:
  y∗ = argmax_y Σ_{r∈R} f_r(y_r, w)
- Probabilistic inference: we might want to compute p(y_r) for any possible subset of variables r, or p(y_r | y_p) for any subsets r and p
- M-best configurations (e.g., top-k)
- Very difficult tasks in general (i.e., NP-hard). Some exceptions, e.g., low tree-width models and binary MRFs with sub-modular energies

Learning in CRFs
- Given a training set of N pairs (x, y) ∈ D, we want to estimate the functions f_r(x, y_r, w). As these functions are parametric, this is equivalent to estimating w
- We would like to do this by minimizing the empirical loss
  min_w (1/N) Σ_{(x,y)∈D} ℓ_task(x, y, w),
  where ℓ_task is the loss that we'll be evaluated on
- Very difficult; instead we minimize the sum of a surrogate (typically convex) loss and a regularizer:
  min_w R(w) + (C/N) Σ_{(x,y)∈D} ℓ̄(x, y, w)

More on Learning in CRFs
- The surrogate loss ℓ̄ is, e.g., the log-loss or the hinge-loss:
  ℓ̄_log(x, y, w) = − ln p(y | x; w)
  ℓ̄_hinge(x, y, w) = max_{ŷ∈Y} [ℓ(y, ŷ) + w^T Φ(x, ŷ)] − w^T Φ(x, y)
In the classical setting the model is assumed to be log-linear,

E(x, y, w) = -F(x, y, w) = -w^T \phi(x, y)

with features that decompose over a graph,

w^T \phi(x, y) = \sum_{r \in R} w_r^T \phi_r(x, y_r)

PROBLEM: How can we remove the log-linear restriction?

SOLUTION: Deep Structured Models

With Pictures ;)

Standard CNN: a single CNN maps the input to a single output y_1. Deep structured models: one CNN per region (CNN_1, CNN_2, CNN_3 for the unaries y_1, y_2, y_3; CNN_4, CNN_5 for the pairwise terms y_{1,2}, y_{2,3}), combined by the graphical model.

Learning

Probability of a configuration y:

p(y | x; w) = (1/Z(x, w)) exp F(x, y, w), with Z(x, w) = \sum_{ŷ \in Y} exp F(x, ŷ, w)

Maximize the likelihood of the training data via

w* = argmax_w \prod_{(x,y) \in D} p(y | x; w)
   = argmax_w \sum_{(x,y) \in D} [ F(x, y, w) - \ln \sum_{ŷ \in Y} exp F(x, ŷ, w) ]

Maximum likelihood is equivalent to maximizing the cross-entropy when the target distribution is p_{(x,y),tg}(ŷ) = δ(ŷ = y).

Gradient Ascent on Cross Entropy

Program of interest:

max_w \sum_{(x,y) \in D, ŷ} p_{(x,y),tg}(ŷ) \ln p(ŷ | x; w)

Optimize via gradient ascent:

∂/∂w \sum_{(x,y) \in D, ŷ} p_{(x,y),tg}(ŷ) \ln p(ŷ | x; w)
  = \sum_{(x,y) \in D, ŷ} ( p_{(x,y),tg}(ŷ) - p(ŷ | x; w) ) ∂F(ŷ, x, w)/∂w
  = \sum_{(x,y) \in D} [ E_{p_{(x,y),tg}} ∂F(ŷ, x, w)/∂w - E_{p(ŷ|x;w)} ∂F(ŷ, x, w)/∂w ]

This is moment matching: compute the predicted distribution p(ŷ | x; w), then use the chain rule to pass back the difference between prediction and observation.

Deep Structured Learning (algo 1) [Peng et al. NIPS'09]

Repeat until stopping criteria:
1 Forward pass to compute F(y, x, w)
2 Compute p(y | x, w)
3 Backward pass via chain rule to obtain the gradient
4 Update parameters w

What is the PROBLEM? How do we even represent F(y, x, w) if Y is large, and how do we compute p(y | x, w)?
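The moment-matching form is easiest to see in the log-linear special case F(x, y, w) = w^T \phi(x, y), where ∂F/∂w = \phi(x, y) and the gradient is observed features minus expected features. The hedged sketch below (an illustrative feature map, with Y small enough to enumerate) runs algo 1 with exact inference; this enumeration over Y is precisely what breaks at scale and motivates the region decomposition next.

    import itertools
    import numpy as np

    def phi(x, y):
        # Tiny illustrative feature map: one unary indicator per (position, state).
        f = np.zeros(len(y) * 2)
        for i, yi in enumerate(y):
            f[i * 2 + yi] = x[i]
        return f

    def nll_gradient(x, y, w, n=3, states=(0, 1)):
        """Exact moment-matching gradient of ln p(y | x; w) for log-linear F."""
        configs = list(itertools.product(states, repeat=n))
        scores = np.array([w @ phi(x, c) for c in configs])
        p = np.exp(scores - scores.max())
        p /= p.sum()                                    # p(y_hat | x; w) by enumeration
        expected = sum(pi * phi(x, c) for pi, c in zip(p, configs))
        return phi(x, y) - expected                     # observed minus expected moments

    x, y, w = np.array([1.0, -0.5, 2.0]), (0, 1, 1), np.zeros(6)
    for _ in range(200):                                # algo 1 with exact inference
        w += 0.1 * nll_gradient(x, y, w)
    print("learned w:", w.round(2))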
Use the Graphical Model Structure

1. Use the graphical model: F(y, x, w) = \sum_r f_r(y_r, x, w). The gradient then decomposes over regions,

∂/∂w \sum_{(x,y) \in D, ŷ} p_{(x,y),tg}(ŷ) \ln p(ŷ | x; w)
  = \sum_{(x,y) \in D, r} [ E_{p_{(x,y),r,tg}} ∂f_r(ŷ_r, x, w)/∂w - E_{p_{(x,y),r}} ∂f_r(ŷ_r, x, w)/∂w ]

so only the (much smaller) region marginals are needed.

2. Approximate the marginals p_r(ŷ_r | x, w) via beliefs b_r(ŷ_r | x, w) computed by:
- Sampling methods
- Variational methods

Deep Structured Learning (algo 2) [Schwing & Urtasun Arxiv'15, Zheng et al. Arxiv'15]

Repeat until stopping criteria:
1 Forward pass to compute the f_r(y_r, x, w)
2 Compute the b_r(y_r | x, w) by running approximate inference
3 Backward pass via chain rule to obtain the gradient
4 Update parameters w

PROBLEM: We have to run inference in the graphical model every time we want to update the weights.

How to deal with Big Data

Dealing with a large number |D| of training examples:
- Parallelize across samples (any number of machines and GPUs)
- Use mini-batches

Dealing with large output spaces Y:
- Variational approximations
- Blending of learning and inference

Approximated Deep Structured Learning [Schwing & Urtasun Arxiv'15]

Sample-parallel implementation: partition the data D onto compute nodes, then repeat until stopping criteria:
1 Each compute node uses its GPU for the CNN forward pass to compute f_r(y_r, x, w)
2 Each compute node estimates the beliefs b_r(y_r | x, w) for its assigned samples
3 Backpropagate the difference on the GPU to obtain the machine-local gradient
4 Synchronize the gradient across all machines using MPI
5 Update parameters w
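Step 2 of algo 2 needs beliefs rather than exact marginals. A common variational choice is naive mean-field; here is a minimal sketch for a generic pairwise MRF (the shared pairwise table is an illustrative assumption, and this is not the talk's implementation).

    import numpy as np

    def mean_field(unary, pairwise, edges, n_iters=20):
        """Naive mean-field beliefs for a pairwise MRF.
        unary: (n, k) table of f_i(y_i); pairwise: (k, k) table shared by all edges;
        edges: list of (i, j) pairs. Returns beliefs b of shape (n, k)."""
        n, k = unary.shape
        b = np.full((n, k), 1.0 / k)
        nbrs = {i: [] for i in range(n)}
        for i, j in edges:
            nbrs[i].append(j)
            nbrs[j].append(i)
        for _ in range(n_iters):
            for i in range(n):
                s = unary[i].copy()
                for j in nbrs[i]:
                    s += pairwise @ b[j]               # E_{b_j}[ f_ij(y_i, y_j) ]
                b[i] = np.exp(s - s.max())             # normalize in a stable way
                b[i] /= b[i].sum()
        return b

    unary = np.array([[2.0, 0.0], [0.0, 0.1], [0.5, 0.4]])
    pairwise = np.array([[1.0, -1.0], [-1.0, 1.0]])    # smoothness-encouraging table
    print(mean_field(unary, pairwise, edges=[(0, 1), (1, 2)]).round(2))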
Better Option: Interleaving Learning and Inference

Use the LP relaxation instead:

min_w \sum_{(x,y) \in D} [ max_{b_{(x,y)} \in C_{(x,y)}} \sum_{r, ŷ_r} b_{(x,y),r}(ŷ_r) f_r(x, ŷ_r; w) + \sum_r c_r H(b_{(x,y),r}) - F(x, y; w) ]

A more efficient algorithm is obtained by blending the minimization w.r.t. w and the maximization over the beliefs b.

After introducing Lagrange multipliers λ, the dual becomes

min_{w,λ} \sum_{(x,y),r} c_r \ln \sum_{ŷ_r} exp( [ f_r(x, ŷ_r; w) + \sum_{c \in C(r)} λ_{(x,y),c→r}(ŷ_c) - \sum_{p \in P(r)} λ_{(x,y),r→p}(ŷ_r) ] / c_r ) - F(w)

with F(w) = \sum_{(x,y) \in D} F(x, y; w) the sum of empirical function observations.

We can then do block coordinate descent to solve the minimization problem, which yields the following algorithm.

Deep Structured Learning (algo 3) [Chen & Schwing & Yuille & Urtasun ICML'15]

Repeat until stopping criteria:
1 Forward pass to compute the f_r(y_r, x, w)
2 Update (some of) the messages λ
3 Backward pass via chain rule to obtain the gradient
4 Update parameters w

Deep Structured Learning (algo 4) [Chen & Schwing & Yuille & Urtasun ICML'15]

Sample-parallel implementation: partition the data D onto compute nodes, then repeat until stopping criteria:
1 Each compute node uses its GPU for the CNN forward pass to compute f_r(y_r, x, w)
2 Each compute node updates (some of) the messages λ
3 Backpropagate the difference on the GPU to obtain the machine-local gradient
4 Synchronize the gradient across all machines using MPI
5 Update parameters w
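The gradient synchronization in algos 2 and 4 is standard data parallelism. Below is a hedged sketch using mpi4py (an assumption; the talk does not name the MPI binding) in which each node computes a gradient on its shard and all nodes apply the identical averaged update. The local_gradient function is a hypothetical placeholder for steps 1-3.

    import numpy as np
    from mpi4py import MPI  # assumes an MPI environment, e.g. mpirun -n 4 python train.py

    comm = MPI.COMM_WORLD

    def local_gradient(w, batch):
        # Hypothetical stand-in for the forward pass, message updates and backprop.
        return -w + batch.mean() * np.ones_like(w)

    w = np.zeros(8)
    shard = np.random.RandomState(comm.Get_rank()).randn(32)  # this node's samples

    for step in range(100):
        g_local = local_gradient(w, shard)
        g_sum = np.empty_like(g_local)
        comm.Allreduce(g_local, g_sum, op=MPI.SUM)     # step 4: sum over machines
        w += 0.1 * (g_sum / comm.Get_size())           # step 5: identical update everywhere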
Application 1: Character Recognition

Task: word recognition from a fixed vocabulary of 50 words, with 28 × 28 image patches per character. Characters have complex backgrounds and suffer many different distortions. The training, validation and test sets contain 10k, 2k and 2k variations of words such as banal, julep, resty, drein, yojan, mothy, snack, feize and porer.

Results

The graphical model has 5 nodes, an MLP for each unary, and non-parametric pairwise potentials. Joint training, structure, depth and more capacity all help. Entries are word / character accuracy (%); "1st" and "2nd" denote first- and second-order graphical models.

One-layer MLPs:

Method             | H1=128      | H1=256      | H1=512      | H1=768      | H1=1024
Unary only         | 8.60/61.32  | 10.80/64.41 | 12.50/65.69 | 12.95/66.66 | 13.40/67.02
1st JointTrain     | 16.80/65.28 | 25.20/70.75 | 31.80/74.90 | 33.05/76.42 | 34.30/77.02
1st PwTrain        | 12.70/64.35 | 18.00/68.27 | 22.80/71.29 | 23.25/72.62 | 26.30/73.96
1st PreTrainJoint  | 20.65/67.42 | 25.70/71.65 | 31.70/75.56 | 34.50/77.14 | 35.85/78.05
2nd JointTrain     | 25.50/67.13 | 34.60/73.19 | 45.55/79.60 | 51.55/82.37 | 54.05/83.57
2nd PwTrain        | 10.05/58.90 | 14.10/63.44 | 18.10/67.31 | 20.40/70.14 | 22.20/71.25
2nd PreTrainJoint  | 28.15/69.07 | 36.85/75.21 | 45.75/80.09 | 50.10/82.30 | 52.25/83.39

Two-layer MLPs (H1 = 512):

Method             | H2=32       | H2=64       | H2=128      | H2=256      | H2=512
Unary only         | 15.25/69.04 | 18.15/70.66 | 19.00/71.43 | 19.20/72.06 | 20.40/72.51
1st JointTrain     | 35.95/76.92 | 43.80/81.64 | 44.75/82.22 | 46.00/82.96 | 47.70/83.64
1st PwTrain        | 34.85/79.11 | 38.95/80.93 | 42.75/82.38 | 45.10/83.67 | 45.75/83.88
1st PreTrainJoint  | 42.25/81.10 | 44.85/82.96 | 46.85/83.50 | 47.95/84.21 | 47.05/84.08
2nd JointTrain     | 54.65/83.98 | 61.80/87.30 | 66.15/89.09 | 64.85/88.93 | 68.00/89.96
2nd PwTrain        | 39.95/81.14 | 48.25/84.45 | 52.65/86.24 | 57.10/87.61 | 62.90/89.49
2nd PreTrainJoint  | 62.60/88.03 | 65.80/89.32 | 68.75/90.47 | 68.60/90.42 | 69.35/90.75

Learned Weights

[Figure: learned unary weights, distance-1 edge weights and distance-2 edge weights over the alphabet a-z.]

Example 2: Image Tagging [Chen & Schwing & Yuille & Urtasun ICML'15]

Flickr dataset: 38 possible tags, so |Y| = 2^38; 10k training and 10k test examples.

Training method            | Prediction error [%]
Unary only                 | 9.36
Piecewise                  | 7.70
Joint (with pre-training)  | 7.25

[Figure: negative log-likelihood and training error vs. time, with and without blending of learning and inference; blending converges substantially faster.]

[Figure: visual results, showing predicted vs. ground-truth tags, e.g., female/indoor/portrait vs. female/indoor/portrait, sky/plant life/tree vs. sky/plant life/tree, animals/dog/indoor vs. animals/dog, water/animals/sea vs. water/animals/sky, indoor/flower/plant life vs. ∅.]

[Figure: learned class correlations; only part of the correlations is shown for clarity.]

Example 3: Semantic Segmentation [Chen et al. ICLR'15; Krähenbühl & Koltun NIPS'11, ICML'13; Zheng et al. Arxiv'15; Schwing & Urtasun Arxiv'15]

|Y| = 21^(350·500); ≈ 10k training and ≈ 1500 test examples. An Oxford-net pre-trained on PASCAL predicts at 40 × 40 resolution, followed by upsampling (pooling & subsampling, an interpolation layer, and a fully connected CRF on top). The graphical model is a fully connected CRF with Gaussian potentials; inference uses (algo 2), with mean-field as the approximate inference step.

Pascal VOC 2012 dataset

Training method | Mean IoU [%]
Unary only      | 61.476
Joint           | 64.060

Disclaimer: much better results are obtained now with a few tricks; Zheng et al. '15 is now at 74.7%!
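The fully connected CRF uses the same mean-field update as the earlier sketch, except that every pixel interacts with every other pixel through a Gaussian kernel. The toy version below is a hedged O(n^2) sketch (real implementations, e.g., Krähenbühl & Koltun's, use fast high-dimensional filtering instead):

    import numpy as np

    def dense_crf_mean_field(unary, positions, theta=3.0, w_pair=1.0, n_iters=5):
        """Naive mean-field for a fully connected CRF with a Gaussian potential.
        unary: (n, k) scores; positions: (n, d) pixel coordinates.
        Brute-force O(n^2), for illustration only."""
        n, k = unary.shape
        d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
        kernel = np.exp(-d2 / (2.0 * theta ** 2))
        np.fill_diagonal(kernel, 0.0)                  # no self-interaction
        q = np.exp(unary - unary.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
        for _ in range(n_iters):
            msg = w_pair * (kernel @ q)                # expected label agreement
            q = np.exp(unary + msg)
            q /= q.sum(axis=1, keepdims=True)
        return q

    # Toy "image": 5 pixels on a line, 2 classes; the noisy middle unary gets smoothed.
    pos = np.arange(5, dtype=float)[:, None]
    unary = np.array([[2.0, 0.0], [1.5, 0.0], [0.0, 0.2], [1.0, 0.0], [0.0, 2.0]])
    print(dense_crf_mean_field(unary, pos).round(2))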
Visual results

[Figure: qualitative segmentation results on PASCAL VOC 2012.]

Example 4: 3D Object Proposals for Detection

Use structured prediction to learn to propose object candidates (i.e., grouping), combining image, stereo, depth-feature and prior potentials. Deep learning (OxfordNet) then performs the final detection. Only 1.2 s to generate proposals.
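The proposal energy is again of the compact form E = -\sum_r f_r. A hedged toy sketch of the scoring step is below (the feature values and weights are illustrative assumptions, not the paper's exact potentials): each candidate 3D box gets a linear structured score and the best candidates are kept, as evaluated in the recall figures that follow.

    import numpy as np

    def score_proposals(features, w):
        """features: (n_boxes, 4) with columns [image, stereo, depth-feat, prior].
        One f_r per potential type; the score is their weighted sum."""
        return features @ w

    rng = np.random.default_rng(0)
    feats = rng.random((1000, 4))                      # hypothetical candidate potentials
    w = np.array([0.5, 1.0, 0.8, 0.3])                 # hypothetical learned weights
    top100 = np.argsort(-score_proposals(feats, w))[:100]  # keep the best 100 proposals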
[Figure: proposal recall vs. number of candidates at a 0.7 IoU threshold for Car and 0.5 for the rest, with panels for Easy / Moderate / Hard and rows for Car, Pedestrian and Cyclist; Ours dominates the baselines BING, SS, EB, MCG and MCG-D across classes and difficulty levels.]

[Figure: recall vs. IoU overlap threshold for 500 proposals; (top) Cars, (middle) Pedestrians, (bottom) Cyclists, across Easy / Moderate / Hard. The legends report average recall: on Cars (Easy), Ours reaches 65.6 vs. 49.6 for MCG-D, 25.4 for MCG, 21.9 for EB, 15.9 for SS and 11.8 for BING.]

KITTI Detection Results [X. Chen, K. Kundu, S. Fidler and R. Urtasun, on Arxiv soon]

[Table: Average Precision (AP, in %) on the test set of the KITTI Object Detection Benchmark, comparing LSVM-MDPM-sv, SquaresICF, DPM-C8B1, MDPM-un-BB, DPM-VOC+VP, OC-DPM, AOG, SubCat, DA-DPM, Fusion-DPM, R-CNN, FilteredICF, pAUCEnsT, MV-RGBD-RF, 3DVP and Regionlets against Ours; not every baseline reports every class. Ours obtains 88.33 / 87.14 / 76.11 on Cars, 70.16 / 59.35 / 52.76 on Pedestrians and 77.94 / 67.35 / 59.49 on Cyclists (Easy / Moderate / Hard), the best AP on Cars and Cyclists.]

AOS scores on the KITTI Object Detection and Orientation Benchmark (test set), Cars:

Method        | Easy  | Moderate | Hard
AOG           | 43.81 | 38.21    | 31.53
DPM-C8B1      | 59.51 | 50.32    | 39.22
LSVM-MDPM-sv  | 67.27 | 55.77    | 43.59
DPM-VOC+VP    | 72.28 | 61.84    | 46.54
OC-DPM        | 73.50 | 64.42    | 52.40
SubCat        | 83.41 | 74.42    | 58.83
3DVP          | 86.92 | 74.59    | 64.11
Ours          | 83.03 | 80.21    | 69.60

Only a subset of the baselines report Pedestrians and Cyclists; there Ours obtains 48.58 / 40.56 / 36.08 and 57.72 / 48.21 / 42.72 respectively (Easy / Moderate / Hard), the best on Cyclists and on Pedestrians (Moderate and Hard).

Car Results [X. Chen, K. Kundu, Y. Zhu, S. Fidler and R. Urtasun, on Arxiv soon]

[Figure: qualitative car results showing the input images, the top 100 proposals, the best proposal and the ground truth.]

Pedestrian Results [X. Chen, K. Kundu, Y. Zhu, S. Fidler and R. Urtasun, on Arxiv soon]

[Figure: the same visualization for pedestrians.]
Cyclist Results [X. Chen, K. Kundu, Y. Zhu, S. Fidler and R. Urtasun, on Arxiv soon]

[Figure: the same visualization for cyclists.]

Example 5: More Precise Grouping

Given a single image, we want to infer instance-level segmentation and depth ordering.
- Use deep convolutional nets to do both tasks simultaneously
- Trick: encode both tasks with a single parameterization
- Run the conv net at multiple resolutions
- Use an MRF to form a single coherent explanation across the whole image, combining the conv nets at multiple resolutions
- Important: we do not use a single pixel-wise training example!

Results on KITTI [Z. Zhang, A. Schwing, S. Fidler and R. Urtasun, ICCV '15]

[Figure: instance-level segmentation and depth ordering results on KITTI.]

More Results (including failures/difficulties) [Z. Zhang, A. Schwing, S. Fidler and R. Urtasun, ICCV '15]

[Figure: additional qualitative results, including failure cases.]

Example 6: Enhancing Freely-Available Maps [G. Matthyus, S. Wang, S. Fidler and R. Urtasun, ICCV '15]

Examples: Toronto (Airport), San Francisco (Russian Hill), NYC (Times Square), Kyoto (Kinkakuji), Sydney (Harbour Bridge), Monte Carlo (Casino).
- Enhancing OpenStreetMap
- Can be trained on a single image and tested on the whole world
- Trick: do not reason at the pixel level
- Very efficient: 0.1 s/km of road
- Preserves topology and is state-of-the-art

Example 7: Fashion [E. Simo-Serra, S. Fidler, F. Moreno, R. Urtasun, CVPR15]

[Figure: an example of a post on http://www.chictopia.com; we crawled the site for 180K posts.]

How Fashionable Are You?

[Figure: we ran a face detector (face detector + attributes, http://www.rekognition.com) that also predicts the beauty of the face, age, ethnicity and mood.]

[Figure: our model is a Conditional Random Field that uses many visual and textual features, as well as meta-data features such as where the user is from.]

[Figure: we predict the fashionability of users, and what kind of outfit the person wears.]

How Fashionable Can You Become?

[Figure: examples of recommendations provided by our model; in parentheses we show the fashionability scores.]

Not a big deal... but

- Appeared all over the tech and news press
- All over the fashion press
- International news and TV (Fox, BBC, Sky News, RTVE, etc.)

Best Quote Award

Cosmopolitan (UK): "The technology scores your facial attributes (this just keeps getting better, doesn't it) from your looks, to your age, and the emotion you're showing, before combining all the information using an equation SO complex we won't begin to go into it."

But the Most Important Impact

[Figure.]

Previous Work

- (Li and Zemel 14) use the hinge loss to optimize only the unaries, which are neural nets; correlations between variables are not used for learning
- If inference is tractable, Conditional Neural Fields (Peng et al. 09) use back-propagation on the log-loss
- Decision Tree Fields (Nowozin et al. 11) use complex region potentials (decision trees), but given the tree the model is still linear in the parameters; trained using pseudo-likelihood
- Restricted Boltzmann Machines (RBMs): generative models with a very particular architecture so that inference is tractable via sampling (Salakhutdinov 07); problems with the partition function
- (Domke 13) treats the problem as learning a set of logistic regressors
- Fields of Experts (Roth et al. 05): not deep, uses contrastive divergence (CD) training
- Many of these ideas go back to (Bottou 91)

Conclusions and Future Work

Conclusions:
- Modeling of correlations between variables
- Non-linear dependence on the parameters
- Joint training of many convolutional neural networks
- Parallel implementation
- Wide range of applications: word recognition, tagging, segmentation

Future work:
- Latent variables
- More applications

Acknowledgments

Liang-Chieh Chen (student), Xiaozhi Chen (student), Sanja Fidler, Gellert Matthyus (student), Francesc Moreno, Alexander Schwing (postdoc), Edgar Simo-Serra (student), Shenlong Wang (student), Alan Yuille, Ziyu Zhang (student), Yukun Zhu (student).

The introductory slides on deep learning were inspired by M. Ranzato's tutorial on deep learning and S. Fidler's lecture notes for CSC420.
