Improving Neural Networks with Dropout

by Nitish Srivastava

A thesis submitted in conformity with the requirements for the degree of Master of Science, Graduate Department of Computer Science, University of Toronto

© Copyright 2013 by Nitish Srivastava

Abstract

Deep neural nets with a huge number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from a neural network during training. This prevents the units from co-adapting too much. Dropping units creates thinned networks during training. The number of possible thinned networks is exponential in the number of units in the network. At test time all possible thinned networks are combined using an approximate model averaging procedure. Dropout training followed by this approximate model combination significantly reduces overfitting and gives major improvements over other regularization methods. In this work, we describe models that improve the performance of neural networks using dropout, often obtaining state-of-the-art results on benchmark datasets.

Contents

1 Introduction
2 Dropout with feed forward neural nets
    2.1 Model Description
    2.2 Learning dropout nets
    2.3 Pretraining dropout nets
    2.4 Classification Results
    2.5 Comparison with Bayesian methods.
    2.6 Comparison with standard regularizers.
    2.7 Effect on features.
    2.8 Effect on sparsity.
    2.9 Effect of dropout rate.
    2.10 Effect of data set size.
    2.11 Monte-Carlo model averaging vs. weight scaling.
3 Dropout with Boltzmann Machines
    3.1 Dropout RBMs
    3.2 Learning Dropout RBMs
    3.3 Effect on features
    3.4 Effect on sparsity
4 Marginalizing dropout
    4.1 Linear Regression
    4.2 Logistic regression and deep networks
5 Conclusions
Bibliography

Chapter 1

Introduction

Neural networks are powerful computational models that are being used extensively for solving problems in vision, speech, natural language processing and many other areas. In spite of many successes, neural networks still suffer from a major weakness. The presence of nonlinear hidden layers makes deep networks very expressive models which are therefore prone to severe overfitting. A typical neural net training procedure involves early stopping to prevent this.
Several regularization schemes have also been proposed to prevent overfitting. These methods, combined with large datasets, have made it possible to apply neural networks to machine learning problems in several domains. However, overfitting remains a major challenge when training extremely large neural networks or working in domains which offer very small amounts of data.

Model combination typically improves the performance of machine learning models. Averaging the predictions of several models is most helpful when the individual models are different from each other and each model is fast to train and use at test time. However, large neural networks are hard to train and slow to use at test time. In order to make them different they must either have different hyperparameters or be trained on different data. This often makes it impractical to train many large networks and average all their predictions at test time.

"Dropout" is a technique that aims to address both these concerns. The term "dropout" refers to dropping out units (hidden and visible) in a neural network. By dropping a unit out, we mean removing it from the network, along with all its incoming and outgoing edges. The choice of which units to drop is random. In the simplest case, each unit is retained with a fixed probability p, where p can be chosen for the particular problem using a validation set (a typical value is p = 0.5). Dropping out is done independently for each hidden unit and for each training case. Thus, applying dropout to a neural network amounts to sub-sampling a "thinned" neural network from it. A neural net with n units can be seen as a collection of $2^n$ possible thinned neural networks. These networks all share weights so that the total number of parameters is still $O(n^2)$, or less. For large n, each time a training case is presented, it is likely to use a new thinned network.
So training a neural network with dropout can be seen as training a collection of $2^n$ thinned networks with massive weight sharing, where each thinned network gets trained very rarely, if at all.

When the model is being used at test time, it is not feasible to explicitly average the predictions from exponentially many thinned models. However, a very simple approximate averaging method works well. The idea is to use a single neural net at test time without dropout. The weights of this test network are scaled versions of the weights of the thinned networks used during training. The weights are scaled such that for any given input to a hidden unit the expected output (under the distribution used to drop units at training time) is the same as the output at test time. So, if a unit is retained with probability p, this amounts to multiplying the outgoing weights of that unit by p. With this approximate averaging method, $2^n$ networks with shared weights can be combined into a single neural network to be used at test time. Training a network with dropout and using the approximate averaging method at test time leads to significantly lower generalization error on a wide variety of classification problems.

Dropout can also be interpreted as a way of regularizing a neural network by adding noise to its hidden units. This idea has previously been used in the context of Denoising Autoencoders (DAEs) [26, 27], where noise is added to the inputs of an autoencoder and the target is kept noise-free. Our work extends this idea by dropping units in the hidden layers too and performing appropriate weight scaling at test time. Compared to the 5% noise that typically works best for DAEs, it is usually optimal to drop out 20% of the input units and 50% of the hidden units to obtain the most benefit from dropout.

A motivation for this method comes from a theory of the role of sex in evolution [10].
Sexual reproduction involves taking half the genes of one parent and half of the other and combining them to produce an offspring. The asexual alternative involves creating an offspring with a copy of the parent's genes. It seems plausible that asexual reproduction is a better optimizer of fitness, which is widely believed to be the criterion for natural selection, i.e., successful organisms would be able to create more copies of successful organisms. Sexual reproduction seems to be downgrading the genes by pairing up two randomly chosen halves. However, sexual reproduction is the way most advanced organisms evolved. One explanation is that the criterion for natural selection may not be individual fitness but rather mixability of genes. The ability of genes to work well with another random set of genes makes them more robust. Since a gene cannot rely on an exact partner to be present at all times, it must learn to do something useful on its own without relying on a partner to make up for its shortcomings. Similarly, hidden units in a neural network trained with dropout must learn to work with a randomly chosen sample of other units. This makes each hidden unit more robust and drives it towards creating useful features on its own without relying on other hidden units to correct its mistakes. Preventing co-adaptation in this manner improves neural networks.

The idea of dropout is not limited to feed forward neural nets. It can be more generally applied to graphical models such as Boltzmann Machines. The chapters that follow explore different aspects of dropout in detail, apply it to different problems and compare it with other forms of regularization and model combination.

Chapter 2

Dropout with feed forward neural nets

This chapter describes training and testing methods to be used when dropout is applied to feed forward neural nets.

2.1 Model Description

This section describes the dropout neural network model. Consider a neural network with L hidden layers.
Let $l \in \{1, \dots, L\}$ index the hidden layers of the network. Let $z^{(l)}$ denote the vector of inputs into layer l and $y^{(l)}$ the vector of outputs from layer l ($y^{(0)} = x$ is the input). $W^{(l)}$ and $b^{(l)}$ are the weights and biases at layer l. The feed forward operation of a neural network can be described as (for $l \in \{0, \dots, L-1\}$)

$$z^{(l+1)} = W^{(l+1)} y^{(l)} + b^{(l+1)} \tag{2.1}$$
$$y^{(l+1)} = f(z^{(l+1)}) \tag{2.2}$$

where f is any activation function. With dropout, the feed forward operation becomes

$$r^{(l)}_i \sim \mathrm{Bernoulli}(p) \tag{2.3}$$
$$\tilde{y}^{(l)} = r^{(l)} * y^{(l)} \tag{2.4}$$
$$z^{(l+1)} = W^{(l+1)} \tilde{y}^{(l)} + b^{(l+1)} \tag{2.5}$$
$$y^{(l+1)} = f(z^{(l+1)}) \tag{2.6}$$

Here $r^{(l)}$ is a vector of Bernoulli random variables each of which has probability p of being 1. This vector is sampled for each layer and multiplied element-wise with the outputs of that layer, $y^{(l)}$, to create the thinned outputs $\tilde{y}^{(l)}$. The thinned outputs are then used as input to the next layer. For learning, the derivatives of the loss function are backpropagated through the thinned network.

At test time, the weights are scaled as $W^{(l)}_{test} = p W^{(l)}$. The resulting neural network is run without dropout.

2.2 Learning dropout nets

Dropout neural networks can be trained with stochastic gradient descent. Dropout is done separately for each training case in every minibatch. Dropout can be used with any activation function, and our experiments with logistic, tanh and rectified linear units yielded similar results, though requiring different amounts of training time (rectified linear units were fastest to train). Several methods that have been used to improve stochastic gradient descent with standard neural networks, such as momentum, decaying learning rates and L2 weight decay, are useful for dropout neural networks as well.
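
The feed forward operation with dropout (Equations 2.3 to 2.6) and the test-time weight scaling of Section 2.1 can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions, not the GPU-based implementation used for the experiments: it uses plain lists, a rectified linear activation at every layer, a single retention probability p for all layers including the input, and hypothetical function names (`forward_train`, `forward_test`).

```python
import random


def relu(z):
    """Rectified linear activation, applied element-wise."""
    return [max(0.0, v) for v in z]


def affine(W, y, b):
    """Return W y + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, y)) + b_i
            for row, b_i in zip(W, b)]


def forward_train(x, weights, biases, p, rng):
    """One stochastic pass through a thinned network (assumed: same p everywhere)."""
    y = x
    for W, b in zip(weights, biases):
        r = [1.0 if rng.random() < p else 0.0 for _ in y]  # r ~ Bernoulli(p), Eq. 2.3
        y_thin = [r_i * y_i for r_i, y_i in zip(r, y)]      # thinned outputs, Eq. 2.4
        y = relu(affine(W, y_thin, b))                      # Eqs. 2.5 and 2.6
    return y


def forward_test(x, weights, biases, p):
    """Deterministic pass with W_test = p * W and no units dropped."""
    y = x
    for W, b in zip(weights, biases):
        W_test = [[p * w for w in row] for row in W]
        y = relu(affine(W_test, y, b))
    return y


if __name__ == "__main__":
    # Toy single-layer network; the weights are made up for illustration.
    W = [[[1.0, 2.0], [0.5, 0.5]]]
    b = [[0.0, 0.0]]
    x = [1.0, 1.0]
    print(forward_train(x, W, b, 0.5, random.Random(0)))  # depends on the sampled mask
    print(forward_test(x, W, b, 0.5))  # [1.5, 0.5]
```

In this toy case all weights and inputs are non-negative, so the rectifier acts as the identity and the average of many stochastic passes approaches the scaled-weight pass. With nonlinearities in between, the two generally agree only approximately.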
One particular form of regularization was found to be especially useful for dropout: constraining the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant c. In other words, if $w_i$ represents the vector of weights incident on hidden unit i, the neural network was optimized under the constraint $\|w_i\|_2 \le c$. This constraint was imposed during optimization by rescaling $w_i$ to lie on a ball of radius c whenever it violated the constraint. This kind of regularization is also called max-norm regularization and has previously been used in the context of collaborative filtering [20]. The constant c is a tuneable hyperparameter, which can be determined using a validation set. Although dropout alone gives significant improvements, optimizing under this constraint, coupled with a large decaying learning rate and high momentum, provides a significant boost over just using dropout. One explanation of this fact is that constraining the weight vector to lie inside a ball of fixed radius makes it possible to use a huge learning rate without the possibility of weights blowing up. The noise provided by dropout then allows the optimization process to explore different regions of the weight space that it would otherwise not have encountered. As the learning rate decays, the optimization takes shorter steps, gradually trades off exploration for exploitation and finally settles into a minimum.

2.3 Pretraining dropout nets

Neural networks can be pretrained using stacks of RBMs [6], autoencoders [27] or Deep Boltzmann Machines [17]. This pretraining followed by finetuning with backpropagation has been shown to give significant performance boosts over finetuning from random initializations in certain cases. Pretraining is also an effective way of making use of unlabeled data. Dropout nets can also be pretrained using these techniques.
The procedure is identical to standard pretraining [6] except for one small modification: the weights obtained from pretraining should be scaled up by a factor of 1/p. The reason is similar to that for scaling down the weights by a factor of p when testing (maintaining the same expected output at each unit). Compared to learning from random initializations, finetuning from pretrained weights typically requires a smaller learning rate so that the information in the pretrained weights is not entirely lost.

2.4 Classification Results

The above training and test procedure was applied to several datasets. The best results were consistently obtained when dropout was used. These datasets range over a variety of domains and tasks:

• MNIST is a standard toy dataset of handwritten digits.
• TIMIT is a standard speech benchmark for clean speech recognition.
• SVHN consists of images of house numbers collected by Google Street View.
• Reuters-RCV1 is a collection of Reuters newswire articles.
• Flickr-1M consists of multimodal data (1 million images and tags).
• Alternative Splicing dataset consists of biochemistry data for genes.

For completeness, we also report results obtained by other researchers who have used dropout. These include CIFAR-10 [7] and ImageNet-1K [9].

This section describes the results along with the neural net architectures used to obtain these results. All datasets are publicly available and all except TIMIT and WSJ are free. The code for reproducing these results can be obtained from http://www.cs.toronto.edu/~nitish/dropout. The implementation is GPU-based and uses cudamat [11]. The convolutional neural net implementation is based on kernels from cuda-convnet used for obtaining results in [9].

2.4.1 Results on MNIST

Figure 2.1: Comparison of training methods.

Method                              Unit Type   Error %
2 layer NN [19]                     Logistic    1.60
SVM gaussian kernel                 -           1.4
Dropout                             ReLU        1.25
Dropout + weight norm constraint    ReLU        1.05
DBN + finetuning                    Logistic    1.18
DBN + dropout finetuning            Logistic    0.92
DBM + finetuning                    Logistic    0.96
DBM + dropout finetuning            Logistic    0.79

Figure 2.2: Test error for different architectures. [Plot of classification error % against number of weight updates for dropout nets with 1000 or 2000 units per layer and 2, 3 or 4 layers.]

MNIST is a collection of 28 × 28 pixel handwritten digit images. There are 60,000 training and 10,000 test images. A validation set consisting of 10,000 images was held out from the training set. No input preprocessing was done. No spatial information or input distortions were used. Classification experiments were done with networks of many different architectures. Fig. 2.2 shows the test error curves obtained for some of these. All of these used rectified linear units. Fig. 2.1 compares the test classification results obtained by several different methods and their extensions using dropout. The pretrained dropout nets use logistic units and all other networks use rectified linear units.

The best performance without unsupervised pretraining for the permutation invariant setting using neural nets is 1.60% [19]. Adding dropout reduced the error to 1.25% and adding weight norm constraints further reduced that to 1.05%. Pretrained dropout nets also improved the performance for Deep Belief Nets and Deep Boltzmann Machines. DBM pretrained dropout nets achieve a test error of 0.79%, which is state-of-the-art for the permutation invariant setting.

2.4.2 Results on SVHN

The Street View House Numbers (SVHN) Dataset [14] consists of real-world images of house numbers obtained from Google Street View. The part of the dataset that we use in our experiments consists of 32 × 32 pixel color images centered on a digit in a house number. Fig. 2.3 shows some examples of images from this dataset. The task is to identify the digit in the center of the image.

Figure 2.3: Samples of images from the Street View House Numbers (SVHN) dataset.

For this dataset, dropout was applied in convolutional neural networks. The network consists of three convolutional layers, each followed by a max-pooling layer. The convolutional layers have 64, 64 and 128 filters respectively. Each convolutional layer has a 5 × 5 receptive field applied with a stride of 1 pixel. The max-pooling layers pool a 3 × 3 region and are applied at strides of 2 pixels. The convolutional layers are followed by two fully connected hidden layers having 3072 and 2048 units respectively. All units use the rectified linear activation function. Dropout was applied to all the layers of the network with the probability of retaining the unit being p = (0.9, 0.9, 0.9, 0.5, 0.5, 0.5) for the different layers of the network (going from input to convolutional layers to fully connected layers). These hyperparameters were tuned using a validation set. In addition, the weight norm constraint was used for hidden units in the fully connected layers.

Besides the test set, the SVHN dataset consists of a standard labelled training set and another set of labelled examples that are easy. The validation set was constructed by taking examples from both the sets. Two-thirds of it were taken from the standard set (400 per class) and one-third from the extra set (200 per class), a total of 6000 samples. This same process is used in [18]. The inputs were RGB pixels normalized to have zero mean and unit variance.

Table 2.1 compares the results obtained by using dropout with other methods. Dropout leads to a more than 35% relative improvement over the best previously published results. It bridges the distance to human-level performance by more than half. The additional gain in performance obtained by adding dropout in the convolutional layers besides doing dropout in the fully connected layers suggests that the utility of dropout is not limited to densely connected neural networks but can be more generally applied to other specialized architectures.

Table 2.1: Results on the Street View House Numbers dataset.

Method                                                               Error %
Binary Features (WDCH) [14]                                          36.7
HOG [14]                                                             15.0
Stacked Sparse Autoencoders [14]                                     10.3
KMeans [14]                                                          9.4
Multi-stage Conv Net with average pooling [18]                       9.06
Multi-stage Conv Net + L2 pooling [18]                               5.36
Multi-stage Conv Net + L4 pooling + padding [18]                     4.90
Conv Net + max-pooling                                               3.95
Conv Net + max pooling + dropout in fully connected layers           3.02
Conv Net + max pooling + dropout in all layers                       2.78
Conv Net + max pooling + dropout in all layers + input translations  2.68
Human Performance                                                    2.0

2.4.3 Results on TIMIT

TIMIT is a speech dataset with recordings from 680 speakers covering 8 major dialects of American English reading ten phonetically-rich sentences in a controlled noise-free environment. It has been used to benchmark many speech recognition systems. Table 2.2 compares dropout neural nets against some of them.

The open source Kaldi toolkit [16] was used to preprocess the data into log-filter banks and to get labels for speech frames. Dropout neural networks were trained on windows of 21 frames to predict the label of the central frame. No speaker dependent operations were performed. A 6-layer net gives a phone error rate of 23.4%. This is already a very good performance on this dataset. Dropout further improves it to 21.8%.
Similarly, adding dropout to a 4-layer pretrained net improves the phone error rate from 22.7% to 19.7%.

Table 2.2: Phone error rate on the TIMIT core test set.

Method                                             Phone Error Rate %
Neural Net (6 layers) [12]                         23.4
Dropout Neural Net (6 layers)                      21.8
DBN-pretrained Neural Net (4 layers)               22.7
DBN-pretrained Neural Net (6 layers) [12]          22.4
DBN-pretrained Neural Net (8 layers) [12]          20.7
mcRBM-DBN-pretrained Neural Net (5 layers) [2]     20.5
DBN-pretrained Neural Net (4 layers) + dropout     19.7
DBN-pretrained Neural Net (8 layers) + dropout     19.7

2.4.4 Results on Reuters-RCV1

Reuters-RCV1 is a collection of newswire articles from Reuters. We created a subset of this dataset consisting of 402,738 articles and a vocabulary of the 2000 most commonly used words after removing stop words. The subset was created so that the articles belong to 50 disjoint categories. The task is to identify the category that a document belongs to. The data was split into equal sized training and test sets.

A neural net with 2 hidden layers of 2000 units each obtained an error rate of 31.05%. Adding dropout reduced the error marginally to 29.62%.

2.4.5 Results on Flickr-1M

Real-world data often consists of multiple modalities: photographs on the web (images and text), videos (images and sound), sensory perception (images, sound, touch, internal feedbacks). Multimodal data raises interesting machine learning problems such as fusing multiple modalities into a joint representation and inferring missing modalities conditioned on observed ones. Recent efforts have been made in computer vision [4] and deep learning [15, 22, 21].

The Flickr-1M dataset [8] consists of 1 million pairs of images and tags (text attributed to the images by users) obtained from the social photography website Flickr. 25,000 pairs are labelled into 38 overlapping topics. The other 975,000 image-text pairs are unlabeled. The task is to identify the topics to which the labelled pairs belong.
Applying dropout to this dataset serves to demonstrate two ideas: first, the use of unlabeled data to pretrain dropout neural networks, and second, the applicability of dropout to the much less studied domain of multimodal data.

Table 2.3: Results on the Flickr-1M dataset.

Method                                 Mean Average Precision   Precision at 50
LDA [8]                                0.492                    0.754
SVM [8]                                0.475                    0.758
DBN [22]                               0.599                    0.867
Autoencoder (based on [15])            0.600                    0.875
DBM [22]                               0.609                    0.873
Multiple Kernel Learning SVMs [4]      0.623                    -
DBN with dropout finetuning            0.628                    0.891
DBM with dropout finetuning            0.632                    0.895

Table 2.3 compares the pretrained dropout neural networks with other models. The evaluation metrics are Mean Average Precision and Precision at 50. Mean Average Precision is the mean over all 38 topics of the recall-weighted precision for each topic. Precision at 50 is the mean over all 38 topics of the precision at a recall of 50 data points. The labelled set was split as 10K-5K-10K for training, validation and testing respectively. The unlabeled data was used for training DBN and DBM models as described in [22].

The discriminative model pretrained by a DBN has more than 10 million parameters. The DBM model, after being unrolled as described in [17], has around 16 million parameters. However, the training set is only 10,000 in size. This makes it hard to discriminatively finetune the models without causing overfitting. When dropout is applied, however, overfitting is drastically reduced. Dropout with pretrained models achieves state-of-the-art results, outperforming the best previously published results on this dataset, which were obtained with a Multiple Kernel Learning (MKL) based SVM model [4]. It is also interesting to note that the MKL model used over 30,000 standard computer vision features while our model used only 3857 features.

2.4.6 Results on ImageNet

ImageNet-1K is a collection of over 1 million images categorized into 1000 labels.
The system that was used to obtain state-of-the-art results on this dataset in the ILSVRC-2012 competition [9] used convolutional neural networks trained with dropout. The model achieved a top-5 error rate of 15.3% and won the competition by a massive margin (the second best entry stood at 26.2%).

2.5 Comparison with Bayesian methods.

Dropout can be seen as a way of doing an approximate equally-weighted averaging of exponentially many models. On the other hand, Bayesian neural networks [13] are the proper way of doing model averaging over a continuum of neural network models with appropriate weights. Unfortunately, Bayesian neural nets are slow to train and difficult to scale to very large neural nets. It is also expensive to get predictions from many large nets at test time. Dropout neural nets, by contrast, are much faster to train and use at test time. However, Bayesian neural nets are extremely useful for solving problems in domains where data is scarce, such as medical diagnosis, genetics, drug discovery and other bio-chemical applications. In this section we report experiments that compare Bayesian neural nets with dropout neural nets for small datasets where Bayesian neural networks are known to perform well and obtain state-of-the-art results. These datasets are mostly characterized by having a large number of dimensions relative to the number of examples.

2.5.1 Predicting tissue-regulated alternative splicing

Alternative splicing is a significant cause of cellular diversity in mammalian tissues. Predicting the occurrence of alternative splicing in certain tissues under different conditions is important for understanding many human diseases. The alternative splicing dataset consists of data for 3665 cassette exons, 1014 RNA features and 4 tissue types derived from 27 mouse tissues. Given the RNA features, the task is to predict the probability of three splicing related events that biologists care about.
See [29] for a full exposition. The evaluation metric is Code Quality, which is a measure of the negative KL divergence between the target and predicted probability distributions (higher is better).

Table 2.4: Results on the Alternative Splicing Dataset.

Method                                   Code Quality (bits)
Neural Network (early stopping) [29]     440
Regression, PCA [29]                     463
SVM, PCA [29]                            487
Neural Network (dropout)                 567
Bayesian Neural Network [29]             623

A two layer network with 1024 units in each layer was trained on this dataset. A value of p = 0.5 was used for the hidden layers and p = 0.7 for the input layer. Results were averaged across the same 5 folds used in [29]. Table 2.4 compares dropout neural nets with other models trained on this data. This experiment suggests that dropout improves the performance of neural networks significantly but not enough to match the performance of Bayesian neural networks. The dropout neural networks outperform SVMs and standard neural nets trained with early stopping. It is interesting to note that the dropout nets are very large (1000s of hidden units) compared to a few tens of units in the Bayesian network.

2.6 Comparison with standard regularizers.

Several regularization methods have been proposed for preventing overfitting in neural networks. These include L2 weight decay (more generally Tikhonov regularization [24]), lasso [23] and KL-sparsity regularization, which minimizes the KL-divergence between the distribution of hidden unit activations and a target Bernoulli distribution. Another regularization involves putting an upper bound on the norm of the incoming weight vector at each hidden unit. Dropout can be seen as another way of regularizing neural networks. In this section we compare dropout with some of these regularization methods.

The MNIST dataset is used to compare these regularizers. The same network architecture (784-1024-1024-2048-10) was used for all the methods. Table 2.5 shows the results. The KL-sparsity method used a target sparsity of 0.1 at each layer of the network. It is easy to see that dropout leads to lower generalization error than the other regularizers. An important observation is that weight norm regularization significantly improves the results obtained by dropout alone.

Table 2.5: Comparison of different regularization methods on MNIST.

Method                                MNIST Classification error %
L2                                    1.62
L1 (towards the end of training)      1.60
KL-sparsity                           1.55
Max-norm                              1.35
Dropout                               1.25
Dropout + Max-norm                    1.05

2.7 Effect on features.

In a standard neural network, each parameter individually tries to change so that it reduces the final loss function, given what all other units are doing. This conditioning may lead to complex co-adaptations which cause overfitting, since these co-adaptations do not generalize. We hypothesize that for each hidden unit, dropout prevents co-adaptation by making the presence of other hidden units unreliable. Therefore, no hidden unit can rely on other units to correct its mistakes and must perform well in a wide variety of different contexts provided by the other hidden units. The experimental results discussed in previous sections lend credence to this hypothesis. To observe this effect directly, we look at the features learned by neural networks trained on visual tasks with and without dropout.

Fig. 2.4a shows features learned by an autoencoder with a single hidden layer of 256 rectified linear units without dropout. Fig. 2.4b shows the features learned by an identical autoencoder which used dropout in the hidden layer with p = 0.5. It is apparent that the features shown in Fig. 2.4a have co-adapted in order to produce good reconstructions. Each hidden unit on its own does not seem to be detecting a meaningful feature. On the other hand, in Fig. 2.4b, the features seem to detect edges and spots in different parts of the image.

2.8 Effect on sparsity.
A curious side-effect of doing dropout training is that the activations of the hidden units become sparse, even when no sparsity inducing regularizers are present. Thus, dropout leads to sparser representations. To observe this effect, we take the autoencoders trained in the previous section and look at the histogram of hidden unit activations on a random mini-batch taken from the test set. We also look at the histogram of mean hidden unit activations over the minibatch. Fig. 2.5a and Fig. 2.5b show the histograms for the two models. For the dropout autoencoder, we do not scale down the weights since that would obviously 11 Chapter 2. Dropout with feed forward neural nets (a) Without dropout (b) Dropout with p = 0.5. Figure 2.4: Features learned on MNIST with one hidden layer autoencoders having 256 rectified linear units. increase the sparsity by making the weights smaller. To ensure a fair comparison, the weights used to obtain the histogram were the same as the ones learned during training. (a) Without dropout (b) Dropout with p = 0.5. Figure 2.5: Effect of dropout on sparsity: In each panel, the figure on the left shows a histogram of the mean activation of hidden units in a randomly chosen test minibatch. The figure on the right shows a histogram of the activations on the same minibatch. In Fig. 2.5a, there are many more hidden units that are in a non-zero state compared to those in Fig. 2.5b, as seen by the significant mass away from zero. The mean activation of hidden units is close to 2.0 for the autoencoder without dropout but drops to around 0.5 when dropout is used. 12 Chapter 2. Dropout with feed forward neural nets 2.9 Effect of dropout rate. Dropout has a tune-able hyperparameter p (the probability of retaining a hidden unit in the network). In this section, the effect of varying this hyperparameter is explored. The comparison is done in two situations 1. The number of hidden units is held constant. 2. 
The expected number of hidden units that will be retained is held constant.

In the first case, all the nets have the same architecture at test time but they are trained with different amounts of dropout. In our experiment we use a 784-2048-2048-2048-10 architecture. The inputs were not thinned. Fig. 2.6a shows the test error obtained as a function of p. The test error is insensitive to the value of p if 0.4 ≤ p ≤ 0.8, but rises sharply for small values of p. This is to be expected because, for the same number of hidden units, a small p means very few units will turn on during training. This has led to underfitting, since the training error is also high.

Therefore, a fairer comparison is the second case, in which the quantity pn is held constant, where n is the number of hidden units in any particular layer. This means that networks that have a small p will have a larger number of hidden units. This ensures that the expected number of units that will be present after dropout is the same. However, the test networks will be of different sizes. In our experiments, pn = 256 for the first two hidden layers and pn = 512 for the last hidden layer. Fig. 2.6b shows the test error obtained as a function of p. The magnitude of the errors for small values of p is reduced compared to Fig. 2.6a. Values of p that are close to 0.6 seem to perform best for this choice of pn, but our usual default value of 0.5 is close to optimal.

Figure 2.6: Effect of changing dropout rates on MNIST. (a) Keeping n fixed. (b) Keeping pn fixed. Each panel plots classification error % (test error and training error) against the probability of retaining a unit (p).

2.10 Effect of data set size.
One test of a good regularizer is that it should make it possible to train models with a large number of parameters even on small datasets. This section explores the effect of changing the dataset size when dropout is used with feed forward networks. Huge neural networks trained in the standard way overfit massively on small datasets. To see if dropout can help, we run classification experiments on MNIST and vary the amount of data given to the network.

Figure 2.7: Effect of varying dataset size. The plot shows classification error % against dataset size (log scale, from 10^2 to 10^5), with and without dropout.

The results of these experiments are shown in Fig. 2.7. The network was given datasets of size 100, 500, 1K, 5K, 10K and 50K, randomly sampled without replacement from the MNIST training set. The same network architecture (784-1024-1024-2048-10) was used for all datasets. Dropout with p = 0.5 was performed at all the hidden layers and p = 0.8 at the input layer. For extremely small datasets (100, 500), dropout does not give any improvement. The model has enough parameters that it can overfit the training data, even with all the noise coming from dropout. As the size of the dataset is increased, the gain from dropout increases up to a point and then declines. This suggests that for any given architecture and dropout rate, there is a “sweet spot” corresponding to some amount of data that is large enough not to be memorized in spite of the noise, but not so large that overfitting is not a problem anyway.

2.11 Monte-Carlo model averaging vs. weight scaling.

The test time procedure that was proposed is to do an approximate model combination by scaling down the weights of the trained neural network. Another expensive but reasonable way of averaging the models is to sample k neural nets using dropout for each test case and average their predictions.
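The two test-time procedures just described can be sketched on a toy network. This is a minimal illustration, not the experimental setup: the network size, the random weights, the input, and the function names (`forward`, `weight_scaling_predict`, `monte_carlo_predict`) are all assumptions made for the example. It assumes a single hidden ReLU layer without biases, dropout on the hidden units only, and weight scaling implemented as multiplying the outgoing weights of the hidden layer by the retention probability p.

```python
import math
import random


def relu(x):
    return x if x > 0.0 else 0.0


def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, W1, W2, mask, out_scale):
    """One hidden ReLU layer. `mask` zeroes dropped hidden units and
    `out_scale` multiplies the outgoing weights of the hidden layer."""
    hidden = [m * relu(sum(w * xi for w, xi in zip(row, x)))
              for row, m in zip(W1, mask)]
    logits = [out_scale * sum(w * h for w, h in zip(row, hidden))
              for row in W2]
    return softmax(logits)


def weight_scaling_predict(x, W1, W2, p):
    # Approximate averaging: keep every hidden unit, scale outgoing weights by p.
    return forward(x, W1, W2, [1.0] * len(W1), p)


def monte_carlo_predict(x, W1, W2, p, k, rng):
    # Sample k thinned networks and average their predicted distributions.
    avg = [0.0] * len(W2)
    for _ in range(k):
        mask = [1.0 if rng.random() < p else 0.0 for _ in W1]
        probs = forward(x, W1, W2, mask, 1.0)
        avg = [a + q / k for a, q in zip(avg, probs)]
    return avg


# Hypothetical toy network: 4 inputs, 32 hidden units, 3 classes.
rng = random.Random(0)
W1 = [[rng.gauss(0.0, 0.5) for _ in range(4)] for _ in range(32)]
W2 = [[rng.gauss(0.0, 0.2) for _ in range(32)] for _ in range(3)]
x = [0.5, -1.0, 0.25, 2.0]

scaled = weight_scaling_predict(x, W1, W2, p=0.5)
for k in (1, 10, 100, 1000):
    mc = monte_carlo_predict(x, W1, W2, 0.5, k, rng)
    gap = max(abs(a - b) for a, b in zip(mc, scaled))
    print(k, round(gap, 4))
```

The printed gap is the largest difference between the two predicted class distributions for each k. The Monte-Carlo estimate costs k forward passes per test case, whereas weight scaling costs one.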
As k → ∞, this Monte-Carlo model average gets close to the true model average. Finite values of k are also expected to give reasonable results. It is interesting to compare the performance of this method with the weight scaling method that has been used so far.

We again use the MNIST dataset and do classification by averaging the predictions of k randomly sampled neural networks. Fig. 2.8 shows the test error rate obtained for different values of k. This is compared with the error obtained using the weight scaling method (shown as a horizontal line). Around k = 50, the Monte-Carlo method becomes as good as the approximate method. Thereafter, the Monte-Carlo method is slightly better than the approximate method but well within one standard deviation of it. This suggests that the weight scaling method is a fairly good approximation of the true model average.

Figure 2.8: Monte-Carlo model averaging vs. weight scaling. The plot shows test classification error % against the number of samples used for Monte-Carlo averaging (k), with approximate averaging by weight scaling shown for comparison.

Chapter 3 Dropout with Boltzmann Machines

The core idea behind dropout is to sample smaller sub-models from a large model, train them and then combine them at test time. This idea can be generalized beyond feed forward networks. In this chapter, we explore dropout when applied to Restricted Boltzmann Machines. For clarity of exposition, we describe dropout for hidden units only. Extending dropout to visible units is straightforward.

3.1 Dropout RBMs

Consider an RBM with visible units $v \in \{0, 1\}^D$ and hidden units $h \in \{0, 1\}^F$. It defines the following probability distribution

$$P(h, v; \theta) = \frac{1}{Z(\theta)} \exp\left(v^\top W h + a^\top h + b^\top v\right)$$

where $\theta = (W, a, b)$ represents the model parameters and $Z$ is the partition function.
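To make the RBM definition concrete, the following minimal sketch evaluates P(h, v; θ) = exp(vᵀWh + aᵀh + bᵀv) / Z(θ) for a very small RBM, computing Z(θ) by brute-force enumeration of all 2^(D+F) joint states. The sizes (D = 3, F = 2), the parameter values and the function names are hypothetical choices for illustration; enumeration is only feasible at this toy scale.

```python
import itertools
import math


def unnormalized_log_prob(v, h, W, a, b):
    # v^T W h + a^T h + b^T v, where W has shape D x F.
    vWh = sum(v[i] * W[i][j] * h[j]
              for i in range(len(v)) for j in range(len(h)))
    return (vWh
            + sum(aj * hj for aj, hj in zip(a, h))
            + sum(bi * vi for bi, vi in zip(b, v)))


def all_states(D, F):
    # Every joint binary configuration (v, h).
    for v in itertools.product((0, 1), repeat=D):
        for h in itertools.product((0, 1), repeat=F):
            yield v, h


def partition_function(W, a, b):
    D, F = len(b), len(a)
    return sum(math.exp(unnormalized_log_prob(v, h, W, a, b))
               for v, h in all_states(D, F))


def joint_prob(v, h, W, a, b, Z):
    return math.exp(unnormalized_log_prob(v, h, W, a, b)) / Z


# Hypothetical parameters for an RBM with D = 3 visible and F = 2 hidden units.
W = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
a = [0.1, -0.2]
b = [0.0, 0.3, -0.1]

Z = partition_function(W, a, b)
total = sum(joint_prob(v, h, W, a, b, Z) for v, h in all_states(3, 2))
print(Z, total)
```

The second printed value is the sum of the joint probabilities over all states, which should equal 1 up to floating-point error.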
Dropout RBMs are RBMs augmented with a vector of binary random variables $r \in \{0, 1\}^F$. Each random variable $r_j$ takes the value 1 with probability p, independent of the others. If $r_j$ takes the value 1, the hidden unit $h_j$ is retained; otherwise it is dropped from the model. The joint distribution defined by a Dropout RBM can be expressed as

$$
\begin{aligned}
P(r, h, v; p, \theta) &= P(r; p)\, P(h, v \mid r; \theta) \qquad (3.1) \\
P(r; p) &= \prod_{j=1}^{F} p^{r_j} (1 - p)^{1 - r_j} \\
P(h, v \mid r; \theta) &= \frac{1}{Z'(\theta, r)} \exp\left(v^\top W h + a^\top h + b^\top v\right) \prod_{j=1}^{F} g(h_j, r_j) \\
g(h_j, r_j) &= \mathbb{1}(r_j = 1) + \mathbb{1}(r_j = 0)\, \mathbb{1}(h_j = 0)
\end{aligned}
$$

$Z'(\theta, r)$ is the normalization constant. $g(h_j, r_j)$ imposes the constraint that if $r_j = 0$, $h_j$ must be 0.

The distribution over h, conditioned on v and r, is factorial:

$$
\begin{aligned}
P(h \mid r, v) &= \prod_{j=1}^{F} P(h_j \mid r_j, v) \\
P(h_j = 1 \mid r_j, v) &= \mathbb{1}(r_j = 1)\, \sigma\Big(a_j + \sum_{i} W_{ij} v_i\Big)
\end{aligned}
$$

The distribution over v conditioned on h is the same as that of an RBM:

$$
\begin{aligned}
P(v \mid h) &= \prod_{i=1}^{D} P(v_i \mid h) \\
P(v_i = 1 \mid h) &= \sigma\Big(b_i + \sum_{j} W_{ij} h_j\Big)
\end{aligned}
$$

Conditioned on r, the distribution over {v, h} is the same as the distribution that an RBM would impose, except that the units for which $r_j = 0$ are dropped from h. Therefore, the Dropout RBM model can be seen as a mixture of exponentially many RBMs with shared weights, each using a different subset of h.

3.2 Learning Dropout RBMs

Learning algorithms developed for RBMs, such as Contrastive Divergence [5], can be directly applied for learning Dropout RBMs. The only difference is that r is first sampled and only the hidden units that are retained are used for training. Similar to dropout neural networks, a different r is sampled for each training case in every mini-batch. In our experiments, we use CD-1 for training dropout RBMs.

3.3 Effect on features

Dropout in feed forward networks improved the quality of features by reducing co-adaptations. This section explores whether this effect transfers to Dropout RBMs as well. Fig. 3.1a shows features learned by a binary RBM with 256 hidden units. Fig.
3.1b shows features learned by a dropout RBM with the same number of hidden units. Features learned by the dropout RBM appear qualitatively different: they seem to be coarser than the sharply defined stroke-like features in the standard RBM. There seem to be very few dead units in the dropout RBM relative to the standard RBM.

Figure 3.1: Features learned on MNIST by 256 hidden unit RBMs. (a) Without dropout. (b) Dropout with p = 0.5.

3.4 Effect on sparsity

Next, we investigate the effect of dropout RBM training on the sparsity of the hidden unit activations. Fig. 3.2a shows the histograms of hidden unit activations and their means on a test mini-batch after training an RBM. Fig. 3.2b shows the same for dropout RBMs. The histograms clearly indicate that the dropout RBMs learn much sparser representations than standard RBMs, even when no additional sparsity-inducing regularizer is present.

Figure 3.2: Effect of dropout on sparsity. (a) Without dropout. (b) Dropout with p = 0.5. In each panel, the figure on the left shows a histogram of the mean activation of hidden units in a randomly chosen test mini-batch. The figure on the right shows a histogram of the activations on the same mini-batch.

Chapter 4 Marginalizing dropout

Dropout can be seen as a way of adding noise to the states of hidden units in a neural network. In this chapter, we briefly explore the class of models that arise as a result of marginalizing this noise. These models can be seen as deterministic versions of dropout. In contrast to regular (“Monte-Carlo”) dropout, these models do not need random bits and it is possible to get gradients for the marginalized loss functions. Marginalization in the context of denoising autoencoders has been explored previously [1, 25].
Deterministic algorithms have been proposed that try to learn models that are robust to feature deletion at test time [3].

4.1 Linear Regression

First we explore a very simple case of applying dropout to the classical problem of linear regression. Let $X \in \mathbb{R}^{N \times D}$ be a data matrix of N data points and $y \in \mathbb{R}^{N}$ be a vector of targets. Linear regression tries to find a $w \in \mathbb{R}^{D}$ that minimizes

$$\|y - Xw\|^2$$

When the input X is dropped out such that any input dimension is retained with probability p, the input can be expressed as $R * X$, where $R \in \{0, 1\}^{N \times D}$ is a random matrix with $R_{ij} \sim \mathrm{Bernoulli}(p)$ and $*$ denotes an element-wise product. Marginalizing the noise, the objective function becomes

$$\underset{w}{\text{minimize}} \quad \mathbb{E}_{R \sim \mathrm{Bernoulli}(p)} \left[ \|y - (R * X)w\|^2 \right]$$

This reduces to

$$\underset{w}{\text{minimize}} \quad \|y - pXw\|^2 + p(1 - p)\|\Gamma w\|^2$$

where $\Gamma = (\mathrm{diag}(X^\top X))^{1/2}$. Therefore, dropout with linear regression is equivalent, in expectation, to ridge regression with a particular form for Γ. This form of Γ essentially scales the weight cost for weight $w_i$ by the standard deviation of the i-th dimension of the data.

Another interesting way to look at this objective is to absorb the factor of p into w. This leads to the following form

$$\underset{w}{\text{minimize}} \quad \|y - X\tilde{w}\|^2 + \frac{1 - p}{p}\|\Gamma \tilde{w}\|^2$$

where $\tilde{w} = pw$. This makes the dependence of the regularization constant on p explicit. For p close to 1, all the inputs are retained and the regularization constant is small. As more dropout is done (by decreasing p), the regularization constant grows larger.

4.2 Logistic regression and deep networks

For logistic regression and deep neural nets, it is hard to obtain a closed form marginalized model. However, Wang [28] showed that in the context of dropout applied to logistic regression, the corresponding marginalized model can be trained approximately. Under reasonable assumptions, the distributions over the inputs to the logistic unit and over the gradients of the marginalized model are Gaussian.
Their means and variances can be computed efficiently. This approximate marginalization outperforms Monte-Carlo dropout in terms of training time and generalization performance. However, the assumptions involved in this technique become successively weaker as more layers are added, and it would be interesting to see whether the same technique can be directly extended to deeper networks.

Chapter 5 Conclusions

Dropout is a technique for improving neural networks by reducing overfitting. The main idea is to prevent co-adaptation of hidden units. Dropout improves the performance of neural nets in a wide variety of application domains, including object classification, digit recognition, speech recognition, document classification and analysis of bio-medical data. This suggests that dropout as a technique is quite general and not specific to any domain. It has been used in models that achieve state-of-the-art results on ImageNet and SVHN.

The central idea of dropout is to take a large model that overfits easily and repeatedly sample and train smaller sub-models from it. Since all the sub-models share parameters with the large model, this process trains the large model, which is then used at test time. We demonstrated that this idea works in the context of feed forward neural networks. This idea can be extended to Restricted Boltzmann Machines and other graphical models which can be seen as composed of exponentially many sub-models with shared weights.

Marginalized versions of dropout models may offer some of the benefits of dropout training without having to deal with noise. These models are an interesting direction for future work.

Bibliography

[1] Minmin Chen, Zhixiang Xu, Kilian Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML ’12, pages 767–774. ACM, New York, NY, USA, July 2012.

[2] G.E. Dahl, M. Ranzato, A.
Mohamed, and G.E. Hinton. Phone recognition with the mean-covariance restricted Boltzmann machine. Advances in Neural Information Processing Systems, 23:469–477, 2010.

[3] A. Globerson and S. Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd International Conference on Machine Learning, pages 353–360. ACM, 2006.

[4] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning for image classification. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 902–909, June 2010.

[5] G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.

[6] Geoffrey Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[7] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

[8] Mark J. Huiskes, Bart Thomee, and Michael S. Lew. New trends and ideas in visual concept detection: the MIR Flickr retrieval evaluation initiative. In Multimedia Information Retrieval, pages 527–536, 2010.

[9] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. ImageNet classification with deep convolutional neural networks. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1106–1114. 2012.

[10] Adi Livnat, Christos Papadimitriou, Nicholas Pippenger, and Marcus W. Feldman. Sex, mixability, and modularity. Proceedings of the National Academy of Sciences, 107(4):1452–1457, 2010.

[11] Volodymyr Mnih. Cudamat: a CUDA-based matrix class for Python. Technical Report UTML TR 2009-004, Department of Computer Science, University of Toronto, November 2009.

[12] A. Mohamed, G. Dahl, and G. Hinton.
Acoustic modeling using deep belief networks. Audio, Speech, and Language Processing, IEEE Transactions on, (99):1–1, 2010.

[13] Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1996.

[14] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.

[15] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In International Conference on Machine Learning (ICML), Bellevue, USA, June 2011.

[16] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011. IEEE Catalog No.: CFP11SRW-USB.

[17] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 5, pages 448–455, 2009.

[18] Pierre Sermanet, Soumith Chintala, and Yann LeCun. Convolutional neural networks applied to house numbers digit classification. In International Conference on Pattern Recognition (ICPR 2012), 2012.

[19] P.Y. Simard, D. Steinkraus, and J.C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, volume 2, pages 958–962, 2003.

[20] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory, COLT’05, pages 545–560, Berlin, Heidelberg, 2005. Springer-Verlag.

[21] Nitish Srivastava and Ruslan Salakhutdinov.
Multimodal learning with deep belief nets. ICML 2012 Representation Learning Workshop, 2012.

[22] Nitish Srivastava and Ruslan Salakhutdinov. Multimodal learning with deep Boltzmann machines. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2231–2239. 2012.

[23] Robert Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.

[24] Andrey Nikolayevich Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39(5):195–198, 1943.

[25] Laurens van der Maaten, M. Chen, S. Tyree, and Kilian Weinberger. Learning with marginalized corrupted features. In Proceedings of the International Conference on Machine Learning, In Press.

[26] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 1096–1103, New York, NY, USA, 2008. ACM.

[27] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, December 2010.

[28] Sida Wang. Fast dropout training for logistic regression. In NIPS Workshop on Log-linear Models, 2012.

[29] Hui Yuan Xiong, Yoseph Barash, and Brendan J. Frey. Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics, 27(18):2554–2562, 2011.
