Techniques for Regularizing Deep Neural Network Training

Regularization is a set of strategies used in Machine Learning to reduce the generalization error. Most models, after training, perform very well on a specific subset of the overall population but fail to generalize well. This is also known as overfitting. Regularization strategies aim to reduce overfitting and keep, at the same time, the training error as low as possible.

TL;DR

In this article, we will review the most popular regularization techniques used when training Deep Neural Networks. We will group these techniques into broader families based on their similarities.

Why regularization?

You have probably heard of the famous ResNet CNN architecture. ResNets were originally proposed in 2015. A recent paper called “Revisiting ResNets: Improved Training and Scaling Strategies” applied modern regularization methods and improved test set accuracy on ImageNet by more than 3%. If the test set consists of 100K images, this means that 3K more images were classified correctly! Awesome, isn’t it?

Figure: Revisiting ResNets: Improved Training and Scaling Strategies, by Irwan Bello et al.

Now, let’s cut to the chase.

What is regularization?

According to Ian Goodfellow, Yoshua Bengio and Aaron Courville in their Deep Learning Book: “In the context of deep learning, most regularization strategies are based on regularizing estimators. Regularization of an estimator works by trading increased bias for reduced variance. An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.”

In simple terms, regularization results in simpler models. And as Occam’s razor argues, the simplest models are the most likely to perform well. In practice, each technique constrains the model to a smaller set of possible solutions.

The bias-variance tradeoff: overfitting and underfitting

First, let’s clarify that the bias-variance tradeoff and the overfitting-underfitting distinction describe the same phenomenon. The bias is an error that comes from wrong assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs. This is called underfitting. The variance is an error that comes from sensitivity to small fluctuations in the training set. High variance may result in modeling the random noise in the training data. This is called overfitting. The bias-variance tradeoff describes the fact that we can usually reduce the variance only by increasing the bias. Good regularization techniques strive to minimize the two sources of error jointly, thereby achieving better generalization.
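For squared-error loss, the tradeoff can be written down explicitly. The decomposition below is the standard textbook one and is not spelled out in the text above; it assumes the targets are generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and that $\hat{f}$ is the model fitted on a randomly drawn training set. The symbols $f$, $\hat{f}$ and $\sigma^2$ are notation introduced here for illustration.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Read this way, a regularizer is worth using when the variance term drops by more than the squared bias term grows.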

As supplementary material, I highly recommend the DeepLearning.AI course: Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization.

How to introduce regularization in deep learning models

Modify the loss function: add regularization terms

The most common family of approaches, used well before the Deep Learning era in estimators such as linear and logistic regression, is parameter norm penalties. Here we add a parameter norm penalty $\Omega(\theta)$ to the loss function $J(\theta; X, y)$:

$$J^\prime(\theta; X, y) = J(\theta; X, y) + a\Omega(\theta)$$

where $\theta$ denotes the trainable parameters, $X$ the input, and $y$ the target labels. $a$ is a hyperparameter that weights the contribution of the norm penalty, and hence the strength of the regularization.

Ok, the math looks good. But why exactly does this work? Let’s look at the two most popular methods, L2 and L1, to make that crystal clear.

L2 regularization

L2 regularization, also known as weight decay or ridge regression, adds a norm penalty of the form $\Omega(\theta) = \frac{1}{2}||w||^2_2$. The loss function is transformed to:

$$J^\prime(w; X, y) = J(w; X, y) + \frac{a}{2}||w||^2_2$$

If we compute the gradients we have:

$$\nabla_w J^\prime(w; X, y) = \nabla_w J(w; X, y) + aw$$

For a single training step and a learning rate $\lambda$, this can be written as:

$$w = (1-\lambda a)w - \lambda \nabla_w J(w; X, y)$$

The equation shows that, on each training step, every weight is first shrunk by the constant factor $(1-\lambda a)$ before the usual gradient update is applied.

Note here that we replaced $\theta$ with $w$. This is because we usually regularize only the actual weights of the network and not the biases $b$.

If we look at it from the viewpoint of the entire training, here is what happens: the L2 regularizer has a big impact on the directions of the weight vector that don’t “contribute” much to the loss function. On the other hand, it has a relatively small effect on the directions that do contribute to the loss function. As a result, we reduce the variance of our model, which makes it easier to generalize to unseen data.

L1 regularization

L1 regularization chooses a norm penalty of $\Omega(\theta) = ||w||_1 = \sum_i |w_i|$.
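To make the two penalties defined so far concrete, here is a minimal pure-Python sketch. The function names are made up for illustration, `lr` stands for the learning rate $\lambda$ and `a` for the penalty weight. It also checks that the "weight decay" form of the L2 update matches a plain gradient step on the penalized loss.

```python
def l2_penalty(w):
    """Omega(w) = 0.5 * ||w||_2^2."""
    return 0.5 * sum(wi * wi for wi in w)


def l1_penalty(w):
    """Omega(w) = ||w||_1 = sum_i |w_i|."""
    return sum(abs(wi) for wi in w)


def sgd_step_l2(w, grad, lr, a):
    """Weight-decay form of one step: w <- (1 - lr * a) * w - lr * grad.

    `grad` is the gradient of the unregularized loss J with respect to w.
    """
    return [(1.0 - lr * a) * wi - lr * gi for wi, gi in zip(w, grad)]


def sgd_step_explicit(w, grad, lr, a):
    """Same step written as plain gradient descent on J + (a / 2) * ||w||^2."""
    return [wi - lr * (gi + a * wi) for wi, gi in zip(w, grad)]


w = [1.0, -2.0, 0.5]
grad = [0.0, 0.0, 0.0]  # pretend the data loss is flat, to isolate the decay

decayed = sgd_step_l2(w, grad, lr=0.1, a=0.5)         # each weight times 0.95
explicit = sgd_step_explicit(w, grad, lr=0.1, a=0.5)  # same values
```

With a flat data loss, the only thing left is the shrinkage: every weight is multiplied by $1 - \lambda a = 0.95$, whatever its sign.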
For L1, the gradient of the loss function becomes:

$$\nabla_w J^\prime(w; X, y) = \nabla_w J(w; X, y) + a\,\text{sign}(w)$$

As we can see, the gradient of the regularization term does not scale linearly with $w$, contrary to L2 regularization. Instead, it is a constant of magnitude $a$ whose sign follows the sign of each weight.

How does this affect the overall training? The L1 regularizer introduces sparsity in the weights by forcing more weights to be exactly zero, instead of reducing the average magnitude of all weights (as the L2 regularizer does). In other words, L1 suggests that some features should be discarded altogether from the training process.

Elastic net

Elastic net is a method that linearly combines L1 and L2 regularization with the goal of getting the best of both worlds. More specifically, the penalty term is as follows:

$$\Omega(\theta) = \lambda_1 ||w||_1 + \lambda_2||w||^2_2$$

Elastic net can still drive some coefficients to zero, as L1 does, but the L2 term makes this elimination less aggressive. So it combines feature elimination from L1 with coefficient shrinkage from L2.

Entropy Regularization

Entropy regularization is another penalty-based method, and it applies to probabilistic models. It has also been used in different Reinforcement Learning techniques such as A3C and policy optimization techniques. Similarly to the previous methods, we add a term to the loss function. If we assume that the model outputs a probability distribution $p(x)$, then the term is the entropy of that distribution:

$$\Omega(X) = -\sum p(x)\log (p(x))$$

Since high entropy is what we want to encourage, this term enters the minimized loss with a negative sign (equivalently, we add the negative entropy).

The term “Entropy” has been taken from information theory and represents the average level of “information” inherent in the variable’s possible outcomes. An equivalent definition of entropy is the expected value of the information of a variable.
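As a quick illustration of the entropy term, the sketch below computes it for a uniform and for a confident distribution. Everything here is an illustrative assumption: the helper names, the weight `beta`, and the choice to subtract the entropy from the loss so that minimizing the loss rewards higher entropy (the convention used in A3C-style objectives).

```python
import math


def entropy(p):
    """Shannon entropy -sum_x p(x) * log p(x), using the natural log.

    Outcomes with p(x) == 0 are skipped: p * log p tends to 0 as p -> 0.
    """
    return -sum(px * math.log(px) for px in p if px > 0.0)


def entropy_regularized_loss(base_loss, p, beta):
    """Hypothetical helper: subtract a weighted entropy bonus from the loss."""
    return base_loss - beta * entropy(p)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

h_uniform = entropy(uniform)  # equals log(4), the maximum for four outcomes
h_peaked = entropy(peaked)    # much smaller: the distribution is confident

# For the same base loss, the uniform output gets the lower regularized loss.
loss_uniform = entropy_regularized_loss(1.0, uniform, beta=0.1)
loss_peaked = entropy_regularized_loss(1.0, peaked, beta=0.1)
```

The uniform distribution attains the largest possible entropy, so the regularized loss is lowest when the model does not commit too strongly to a single outcome.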
One very simple explanation of why entropy regularization works is that it pushes the probability distribution towards the uniform distribution, which reduces variance.

In the context of Reinforcement Learning, one can say that the entropy term added to the loss promotes action diversity and allows better exploration of the environment. For more information on policy gradients and A3C, check our previous articles: Unravel Policy Gradients and REINFORCE and The idea behind Actor-Critics and how A2C and A3C improve them.

Label smoothing

Noise injection is one of the most powerful regularization strategies. By adding randomness, we can reduce the variance of the models and lower the generalization error. The question is how and where do we inject noise?

Label smoothing is a way of adding noise to the output targets, aka labels. Let’s assume that we have a classification problem. In most of them, we use a form of cross-entropy loss such as $-\sum_{c=1}^M y_{o,c}\log(p_{o,c})$ and softmax to output the final probabilities.

The target vector has the form of [0, 1, 0, 0]. Because of the way softmax is formulated, $\sigma (\mathbf {z} )_{i}={\frac {e^{z_{i}}}{\sum _{j=1}^{K}e^{z_{j}}}}$, it can never output exactly 1 or 0. The best it can do is something like [0.0003, 0.999, 0.0003, 0.0003]. As a result, the model keeps being trained, pushing the output values as high and as low as possible, and never settles. That, of course, encourages overfitting.

To address that, label smoothing softens the hard 0 and 1 targets by a small margin. Specifically, each 0 is replaced with $\frac{\epsilon}{k-1}$ and the 1 with $1-\epsilon$, where $k$ is the number of classes. For example, with $k = 4$ and $\epsilon = 0.1$, the target [0, 1, 0, 0] becomes roughly [0.033, 0.9, 0.033, 0.033].

Dropout

Another strategy to regularize deep neural networks is dropout. Dropout belongs to the noise injection techniques and can be seen as noise injection into the hidden units of the network.
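To make "noise injection into the hidden units" concrete, here is a minimal pure-Python sketch. It assumes that `p` is the drop probability, so the keep probability is `1 - p`, and the function names are made up for illustration. During training each activation is zeroed at random; at test time every unit is kept and scaled by the keep probability so that the average matches what the next layer saw during training.

```python
import random


def dropout_train(h, p, rng):
    """Training pass: zero each activation independently with probability p."""
    return [0.0 if rng.random() < p else hi for hi in h]


def dropout_test(h, p):
    """Test pass for this (non-inverted) variant: keep every unit and scale
    by the keep probability 1 - p to match the training-time average."""
    return [(1.0 - p) * hi for hi in h]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
h = [1.0] * 10_000      # toy "hidden layer" with constant activations
p = 0.3                 # drop probability (a hyperparameter)

dropped = dropout_train(h, p, rng)
train_mean = sum(dropped) / len(dropped)       # close to 1 - p = 0.7
test_mean = sum(dropout_test(h, p)) / len(h)   # 0.7 up to float error
```

The two means agree up to sampling noise, which is the whole point of the test-time scaling.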
In practice, during training, some of the layer outputs are randomly ignored (dropped out) with probability $p$. At test time, all units are present, but their outputs are scaled down by the keep probability $1-p$. This is needed because, during training, dropout makes the next layers receive lower values on average. At test time we keep all units, so the values would be higher than what the network saw during training. That’s why we need to scale them down.

By using dropout, the same layer alters its connectivity and searches for alternative paths to convey the information to the next layer. As a result, each update to a layer during training is performed with a different “view” of the configured layer. Conceptually, it approximates training a large number of neural networks with different architectures in parallel.

“Dropping” values means temporarily removing them from the network for the current forward pass, along with all their incoming and outgoing connections. Dropout has the effect of making the training process noisy. The choice of the probability $p$ depends on the architecture.

Figure: dropout. Image by author.

This conceptualization suggests that perhaps dropout breaks up situations where network layers co-adapt to correct mistakes from prior layers, making the model more robust. It increases the sparsity of the network and, in general, encourages sparse representations! Sparsity can be added to any model with hidden units and is a powerful tool in our regularization arsenal.

Other Dropout variations

Many more variations of Dropout have been proposed over the years. To keep this article relatively digestible, I won’t go into many details for each one, but I will briefly mention a few of them. Feel free to check out paperswithcode.com for more details on each one, alongside the original paper and code.

Inverted dropout also randomly drops some units with a probability $p$.
The difference with traditional dropout is that, during training, it also scales the activations by the inverse of the keep probability $1-p$. The reason is to keep the expected activations unchanged during training, which removes the need to modify the network during the testing phase. The end result is similar to traditional dropout.

Gaussian dropout: instead of dropping units during training, it injects noise into the weights of each unit. The noise is, more often than not, Gaussian. This results in a reduction in computational effort at test time, no weight scaling, and faster training overall.

DropConnect follows a slightly different approach. Instead of zeroing out random activations (units), it zeros random weights during each forward pass. The weights are dropped with a probability of $1-p$ (so here $p$ plays the role of a keep probability). This essentially transforms a fully connected layer into a sparsely connected layer. Mathematically we can represent DropConnect as:

$$r = a \left(\left(M * W\right){v}\right)$$

where $r$ is the layer’s output, $v$ the input, $W$ the weights and $M$ a binary matrix. $M$ is a mask that instantiates a different connectivity pattern for each data sample; usually, a new mask is drawn for each training example. DropConnect can be seen as a generalization of Dropout to the full-connection structure of a layer.

Variational Dropout: we use the same dropout mask on each timestep. This means that we drop the same network units each time. This was initially introduced for Recurrent Neural Networks and it follows the same principles as variational inference.

Attention Dropout: popular over the past years because of the rapid advancements of attention-based models like Transformers. As you may have guessed, we randomly drop certain attention units with a probability $p$.

Adaptive Dropout: a technique that extends dropout by allowing the dropout probability to be different for different units.
The intuition is that there may be hidden units that can individually make confident predictions for the presence or absence of an important feature or combination of features.

Embedding Dropout: a strategy that performs dropout on the embedding matrix and is used for a full forward and backward pass.

DropBlock: used in Convolutional Neural Networks, it discards all units in a contiguous region of the feature map.

Stochastic Depth

Stochastic depth goes a step further. It drops entire network blocks during training while keeping the model intact during testing. The most popular application is in large ResNets, where we bypass certain blocks through their skip connections.

In particular, Stochastic depth (Huang et al., 2016) drops out each layer in the network that has residual connections around it. It does so with a specified probability $p$ that is a function of the layer depth.

Figure: stochastic depth. Source: Deep Networks with Stochastic Depth.

Mathematically we can express this as:

$$H_{l} = \text{ReLU}(b_{l}f_{l}(H_{l-1}) + \text{id}(H_{l-1}))$$
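The block-dropping idea described above can be sketched in a few lines of pure Python. Several things here are assumptions rather than statements from the text: $b_l$ is read as a 0/1 gate sampled during training, the sketch is parameterized by the survival (keep) probability instead of the drop probability, the test-time scaling and the linear schedule follow Huang et al. as I recall them, and all function names are hypothetical.

```python
import random


def relu(x):
    return [max(0.0, xi) for xi in x]


def linear_survival_prob(l, num_blocks, p_last):
    """Assumed linear schedule: survival probability decays from 1.0 at the
    input (l = 0) down to p_last at the last block (l = num_blocks)."""
    return 1.0 - (l / num_blocks) * (1.0 - p_last)


def residual_block(h_prev, f, survival_prob, training, rng):
    """H_l = ReLU(b_l * f_l(H_{l-1}) + id(H_{l-1})).

    Training: b_l is 1 with probability survival_prob, else 0, in which case
    the block is bypassed through its skip connection and f is never called.
    Testing: every block is kept and f's output is scaled by survival_prob.
    """
    if training:
        b = 1.0 if rng.random() < survival_prob else 0.0
    else:
        b = survival_prob
    f_out = f(h_prev) if b > 0.0 else [0.0] * len(h_prev)
    return relu([b * fi + hi for fi, hi in zip(f_out, h_prev)])


def double(h):
    """Toy stand-in for the residual branch f_l."""
    return [2.0 * x for x in h]


rng = random.Random(0)
h = [1.0, -1.0]
kept = residual_block(h, double, 1.0, True, rng)      # block always active
skipped = residual_block(h, double, 0.0, True, rng)   # block always bypassed
at_test = residual_block(h, double, 0.5, False, rng)  # scaled residual branch
```

Skipping the call to `f` when the gate is 0 is what saves computation during training: a bypassed block costs only the identity path.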
