This article aims to provide an understanding of a very popular regularization technique called Dropout. It assumes a prior understanding of concepts like model training, creating training and test sets, overfitting, underfitting, and regularization.
The article starts by setting the context and making the case for dropout. It then explains how dropout works and how it affects the training of deep neural networks. Finally, it goes over Keras’s dropout layer and how to use it.
Deep neural networks are heavily parameterized models. Typically, they have tens of thousands or even millions of parameters to be learned. These parameters give the network enough capacity to fit a diverse set of complex datasets. This is not always a good thing. Such capacity often leads to overfitting, a scenario where the training set performance is high and the test set performance is worse (low bias, high variance). The model is likely to have a higher test error rate because it’s too dependent on the training data. To avoid this situation, we try to constrain the learning capacity of the model using various regularization techniques, which help the model generalize to unseen data. One such regularization technique is Dropout.
Fig. 1 shows the contrast between an overfitted model represented by the green margin and a regularized model represented by the black margin. Even though the green margin seems to better fit the training data, it’s not likely to perform well enough on unseen instances (test set). Fig. 1 provides a decent picture of what overfitting looks like.
Dropout is one of the most popular regularization techniques. It was proposed by Geoffrey Hinton et al. in 2012 in the paper “Improving neural networks by preventing co-adaptation of feature detectors”. It’s a fairly simple idea but a very potent one.
At every training step, each neuron has a probability p of being temporarily dropped, meaning it is ignored for that step. For instance, if p=0.5, a neuron has a 50% chance of dropping out at every training step. If a neuron doesn’t participate in a training step, all of its connections are severed for that step, which affects the downstream layers. This drastically reduces the density of connections in the neural network (shown in Fig. 2). Dropout can be applied to the input and the hidden layers but not to the output layer, because the model always has to generate an output for the loss function to enable training. Dropout is only carried out during the training phase. All the neurons in the network participate fully during the inference phase.
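The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `dropout_forward` and the toy batch are hypothetical, and the sketch uses the "inverted" variant of dropout, which rescales the surviving activations during training so that nothing has to change at inference time. The original formulation instead rescales at test time.

```python
import numpy as np


def dropout_forward(activations, p, training, rng):
    """Apply inverted dropout to a batch of activations.

    p is the dropout rate: the probability that a unit is dropped (0 <= p < 1).
    """
    if not training or p == 0.0:
        # Inference: every neuron participates and nothing is rescaled.
        return activations
    keep_prob = 1.0 - p
    # Sample an independent keep/drop decision for every unit.
    mask = rng.random(activations.shape) < keep_prob
    # Zero the dropped units and scale the survivors by 1 / keep_prob so the
    # expected activation matches what the next layer sees at inference.
    return activations * mask / keep_prob


rng = np.random.default_rng(seed=0)
hidden = np.ones((4, 8))  # hypothetical batch: 4 examples, 8 hidden units

train_out = dropout_forward(hidden, p=0.5, training=True, rng=rng)
test_out = dropout_forward(hidden, p=0.5, training=False, rng=rng)

print(train_out)  # each entry is either 0.0 (dropped) or 2.0 (kept, rescaled)
print(test_out)   # identical to the input
```

Because a fresh mask is sampled on every call, each training step effectively trains a different thinned version of the network.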
It may be astonishing that turning off neurons arbitrarily works at all. It’s reasonable to assume that it would make the training process highly unstable. In practice, however, it has proven to be very effective at reducing overfitting. To understand why, I would like to quote an example from the book “Hands-On Machine Learning with Scikit-Learn and TensorFlow”.
Would a company perform better if its employees were told to toss a coin every morning to decide whether or not to go to work? Well, who knows; perhaps it would! The company would obviously be forced to adapt its organization; it could not rely on any single person to fill in the coffee machine or perform any other critical tasks, so this expertise would have to be spread across several people. Employees would have to learn to cooperate with many of their coworkers, not just a handful of them. The company would become much more resilient. If one person quit, it wouldn’t make much of a difference. It’s unclear whether this idea would actually work for companies, but it certainly does for neural networks.
Similarly, in deep neural networks, at each training step the network architecture is different from the previous one. Each neuron is also forced not to rely too heavily on a few input connections, but rather to pay attention to all of its inputs. This makes the neurons more resilient to changes in their input connections and yields a more robust network that generalizes better.
The tunable hyperparameter in dropout is the dropout rate, denoted by p. Tuning it is fairly straightforward.
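One common way to tune p is to train the same architecture with a few candidate rates and compare the validation loss. The sketch below assumes TensorFlow's bundled Keras; the helper `build_model`, the candidate rates, the layer sizes, and the synthetic dataset are all hypothetical choices made only to keep the example self-contained, so the resulting numbers carry no meaning beyond showing the procedure.

```python
import numpy as np
import tensorflow as tf


def build_model(rate):
    """Small binary classifier with the same dropout rate after each hidden layer."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(rate),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(rate),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam")
    return model


# Synthetic stand-in data (an assumption): 512 examples with 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(512, 20)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

results = {}
for rate in (0.0, 0.2, 0.5):
    tf.keras.utils.set_random_seed(0)  # same initialization for a fair comparison
    model = build_model(rate)
    history = model.fit(X, y, epochs=3, validation_split=0.25, verbose=0)
    results[rate] = history.history["val_loss"][-1]

best_rate = min(results, key=results.get)
print(results, best_rate)
```

On a real dataset the same loop would use the actual training and validation splits, and usually more epochs.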
The effect of dropout can be clearly seen in the above graphs (Fig. 3 & 4). They come from a simple experiment using Keras in which a feed-forward neural network is trained on the MNIST dataset with and without dropout, keeping all other factors constant. The blue lines indicate the model with dropout and the orange lines indicate the model without dropout. In Fig. 3, “Effect of Dropout on Accuracy”, it can be clearly observed that dropout increased the loss of the model. This is not necessarily a bad thing, but the model may take longer to converge. Consequently, the accuracy in Fig. 4 is lower. The network architecture used in the experiment is given below.
Github: The code that generated the above graphs is available here.
Keras provides a dropout layer through keras.layers.Dropout. It takes the dropout rate as its first parameter. You can find more details in Keras’s documentation. Below is a small snippet showing the use of dropout from the Hands-on ML book.
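A model of this kind can be sketched as follows. This is an illustration in the spirit of the book's example, not necessarily its exact listing: it assumes TensorFlow's bundled Keras (`tf.keras`), 28x28 grayscale inputs with 10 classes, and an illustrative rate of 0.2; the layer sizes are likewise assumptions.

```python
import tensorflow as tf

# Illustrative classifier for 28x28 grayscale images with 10 classes.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(rate=0.2),  # dropout on the inputs
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dropout(rate=0.2),  # dropout on the first hidden layer
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dropout(rate=0.2),  # dropout on the second hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),  # no dropout after the output
])
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"],
)

# The layer only drops units when it is called in training mode.
layer = tf.keras.layers.Dropout(rate=0.5)
x = tf.ones((1, 10))
inference_out = layer(x, training=False).numpy()  # input passes through unchanged
training_out = layer(x, training=True).numpy()    # units are zeroed or scaled by 1 / (1 - rate)
```

Keras sets the training flag automatically: `model.fit` runs the dropout layers in training mode, while `model.predict` and `model.evaluate` run them in inference mode, so no extra code is needed to switch dropout off at test time.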
In addition to dropout, other regularization techniques can also be applied to neural networks. Some of the most popular ones are listed below.
Below are the references I used to write this article. The original papers (in the below list) on Dropout deal with the theory behind it and the experiments conducted to prove its effectiveness in great detail.
Thank you for your time. Please leave any suggestions in the comments section.