Deep Learning Optimizers

Antony Christopher
11 min read · Jul 8, 2021


Optimizers play an important role in Deep Learning: their job is to reduce the loss. How do they do that? In a deep learning model, a change to the weights in the input layer changes the output of every layer that uses those weights, and the output layer changes with them. If the weights are poorly chosen, we can end up with a model whose accuracy is very low. In such a situation, we use an optimizer to adjust the weights and reduce the loss. Let’s start with the basics of Gradient Descent, which leads to the different optimization algorithms.

What is Gradient?

“A gradient measures how much the output of a function changes if you change the inputs a little bit.” — Lex Fridman (MIT)

A gradient simply measures how much the error changes with respect to a change in the weights. The gradient can be thought of as the slope of a function: the higher the gradient, the steeper the slope and the faster the model can learn. If the slope is zero, the model stops learning. In mathematical terms, the gradient is the vector of partial derivatives of the function with respect to its inputs.

Gradient Descent

Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function. It is used to find the values of a function’s parameters that make a cost function as small as possible.

You define initial values for the parameters, and gradient descent adjusts them iteratively so that the cost function is driven toward its minimum.

Let’s make this more concrete. Imagine a person at the top of a mountain whose goal is to reach the base camp at the bottom. The problem is that the person is blindfolded. How can they reach the bottom? They take small steps in the direction in which the ground descends most steeply, and they repeat this step iteratively until they reach the base camp at the bottom. This is exactly how Gradient Descent works.

Imagine the image below illustrates our hill from a top-down view and the red arrows are the steps of our climber. Think of a gradient in this context as a vector that contains the direction of the steepest step the blindfolded man can take and also how long that step should be.

Note that the gradient ranging from X0 to X1 is much longer than the one reaching from X3 to X4. This is because the steepness/slope of the hill, which determines the length of the vector, is smaller near X3 and X4. This matches the hill example, because the hill gets less steep the higher it is climbed. Therefore a reduced gradient goes along with a reduced slope and a reduced step size for the hill climber.

How Does Gradient Descent Work?

The equation below describes what gradient descent does: b is the next position of our climber, while a represents his current position. The minus sign refers to the minimization part of gradient descent, the gamma (γ) in the middle is a weighting factor (the learning rate), and the gradient term ∇f(a) points in the direction of steepest ascent, so stepping against it takes us in the direction of steepest descent.

b = a − γ∇f(a)

So this formula basically tells us the next position we need to go, which is the direction of the steepest descent. Let’s look at another example to really drive the concept home.

Imagine you have a machine learning problem and want to train your algorithm with gradient descent to minimize your cost function J(w, b) and reach its local minimum by tweaking its parameters (w and b). In the image below, the horizontal axes represent the parameters (w and b), while the cost function J(w, b) is plotted on the vertical axis. The cost function shown here is convex (bowl-shaped).

We know we want to find the values of w and b that correspond to the minimum of the cost function (marked with the red arrow). To start finding the right values we initialize w and b with some random numbers. Gradient descent then starts at that point (somewhere around the top of our illustration), and it takes one step after another in the steepest downside direction (i.e., from the top to the bottom of the illustration) until it reaches the point where the cost function is as small as possible.
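To make the update rule concrete, here is a minimal sketch in plain Python (not from the original article). It applies b = a − γ∇f(a) to the toy convex function f(x) = x², whose gradient is 2x; the function, starting point, and learning rate are illustrative assumptions.

def f(x):
    return x ** 2              # toy convex cost function

def grad_f(x):
    return 2 * x               # its gradient

gamma = 0.1                    # learning rate (weighting factor)
a = 5.0                        # initial position of the climber

for step in range(50):
    a = a - gamma * grad_f(a)  # b = a - gamma * grad_f(a)

print(a, f(a))                 # a ends up very close to the minimum at x = 0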

What is the Learning Rate?

The learning rate plays a vital role in the Gradient Descent algorithm: it controls how big a step is taken at each iteration. In the climber example above, the size of each step is determined by the learning rate; a step is simply (learning rate × gradient magnitude). If the gradient magnitude is 4.2 and the learning rate is 0.01, the next step will be 0.042 away from the previous position. Hope this clarifies why the learning rate is important in GD.

If the learning rate is chosen too large, the climber overshoots and may never reach the bottom, i.e., the model never converges to a local minimum. If the learning rate is small, the model does reach a local minimum as expected, although it may take many more steps to get there.
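To see this numerically, here is a toy sketch (an illustration with made-up values, not code from the article) that runs gradient descent on f(x) = x² with a small, a moderate, and an overly large learning rate:

for lr in (0.01, 0.1, 1.1):        # small, moderate, too large
    x = 5.0
    for _ in range(20):
        x = x - lr * 2 * x         # step = learning rate * gradient of x**2
    print(f"lr={lr}: x after 20 steps = {x:.4f}")

# lr=0.01 creeps toward the minimum slowly, lr=0.1 converges nicely,
# and lr=1.1 overshoots further on every step and diverges.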

Gradient Descent variants

There are three variants of gradient descent based on the amount of data used to calculate the gradient:

  • Batch gradient descent
  • Stochastic gradient descent
  • Mini-batch gradient descent

Batch Gradient Descent

Batch Gradient Descent, aka Vanilla gradient descent, calculates the error for each observation in the dataset but performs an update only after all observations have been evaluated.

Batch gradient descent is not often used, because it represents a huge consumption of computational resources, as the entire dataset needs to remain in memory.

Stochastic Gradient Descent

Stochastic gradient descent (SGD) performs a parameter update for each observation. So instead of looping over each observation, it just needs one to perform the parameter update. SGD is usually faster than batch gradient descent, but its frequent updates cause a higher variance in the error rate, which can sometimes jump around instead of decreasing.

Mini-Batch Gradient Descent

It is a combination of both batch gradient descent and stochastic gradient descent. Mini-batch gradient descent performs an update for each small batch of observations. It is the algorithm of choice for neural networks, and batch sizes are usually between 50 and 256.
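The three variants differ only in how many observations contribute to each parameter update. The sketch below contrasts them on a made-up linear regression problem; the data, the gradient function, and the batch size of 64 are illustrative assumptions, not code from the article.

import numpy as np

X, y = np.random.randn(1000, 5), np.random.randn(1000)   # toy dataset
theta, lr = np.zeros(5), 0.01

def gradient(theta, X_batch, y_batch):
    # gradient of the mean squared error for a linear model (toy example)
    return 2 * X_batch.T @ (X_batch @ theta - y_batch) / len(y_batch)

# Batch GD: one update per pass over the entire dataset
theta -= lr * gradient(theta, X, y)

# Stochastic GD: one update per single observation
for i in range(len(y)):
    theta -= lr * gradient(theta, X[i:i+1], y[i:i+1])

# Mini-batch GD: one update per small batch of observations (here 64)
for start in range(0, len(y), 64):
    theta -= lr * gradient(theta, X[start:start+64], y[start:start+64])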

SGD with Momentum

SGD with momentum usually converges faster than plain SGD, and the momentum term is the key. Momentum accelerates the updates in the persistent direction of descent, which leads to faster convergence. Imagine a bowling ball rolling down a gentle slope on a smooth surface: it starts out slowly, but it quickly picks up momentum until it eventually reaches terminal velocity (if there is some friction or air resistance). This is the very simple idea behind momentum. Gradient Descent, in contrast, simply takes small regular steps down the slope, so it takes much more time to reach the bottom.

Recall that Gradient Descent simply updates the weights θ by directly subtracting the gradient of the cost function J(θ) with regard to the weights (∇θJ(θ)) multiplied by the learning rate η. The equation is θ ← θ − η∇θJ(θ). It does not care what the earlier gradients were. If the local gradient is tiny, it goes very slowly.

Momentum optimization cares a great deal about what previous gradients were: at each iteration, it subtracts the local gradient from the momentum vector m (multiplied by the learning rate η), and it updates the weights by simply adding this momentum vector (see the equations below). In other words, the gradient is used for acceleration, not for speed. To simulate some sort of friction mechanism and prevent the momentum from growing too large, the algorithm introduces a new hyperparameter β, simply called the momentum, which must be set between 0 (high friction) and 1 (no friction). A typical momentum value is 0.9.

Momentum algorithm

1. m ← βm − η∇θJ(θ)

2. θ ← θ + m
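Written out in plain Python on a toy quadratic cost (J(θ) = ½θ², so ∇θJ(θ) = θ), these two steps might look as follows; the cost function and all values are illustrative assumptions, not code from the article.

def grad_J(theta):
    return theta               # gradient of the toy cost J(theta) = 0.5 * theta**2

eta, beta = 0.1, 0.9           # learning rate and momentum
theta, m = 5.0, 0.0            # parameter and momentum vector

for _ in range(200):
    m = beta * m - eta * grad_J(theta)   # step 1: m <- beta*m - eta*gradient
    theta = theta + m                    # step 2: theta <- theta + m

print(theta)                   # ends up very close to the minimum at theta = 0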

You can easily verify that if the gradient remains constant, the terminal velocity (i.e., the maximum size of the weight updates) is equal to that gradient multiplied by the learning rate η multiplied by 1/(1 − β) (ignoring the sign). For example, if β = 0.9, then the terminal velocity is equal to 10 times the gradient times the learning rate, so Momentum optimization ends up going 10 times faster than Gradient Descent! This allows Momentum optimization to escape from plateaus much faster than Gradient Descent.

Gradient Descent goes down the steep slope quite fast, but then it takes a very long time to go down the valley. In contrast, Momentum optimization will roll down the valley faster and faster until it reaches the bottom (the optimum).

Implementing Momentum optimization in Keras is a no-brainer: just use the SGD optimizer and set its momentum hyperparameter, then lie back and profit!

from tensorflow import keras  # assuming TensorFlow's bundled Keras
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

Nesterov Accelerated Gradient

The idea of Nesterov Momentum optimization, or Nesterov Accelerated Gradient (NAG), is to measure the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum. The only difference from regular Momentum optimization is that the gradient is measured at θ + βm rather than at θ.

Nesterov Accelerated Gradient algorithm

1. m ← βm − η∇θJ(θ + βm)

2. θ ← θ + m
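Compared with the momentum sketch above, the only change in code is where the gradient is evaluated: at the look-ahead position θ + βm. Again a toy illustration on J(θ) = ½θ², with assumed values, not code from the article.

def grad_J(theta):
    return theta               # gradient of the toy cost J(theta) = 0.5 * theta**2

eta, beta = 0.1, 0.9
theta, m = 5.0, 0.0

for _ in range(200):
    m = beta * m - eta * grad_J(theta + beta * m)   # gradient at the look-ahead point
    theta = theta + m

print(theta)                   # converges toward the minimum at theta = 0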

This small tweak works because in general the momentum vector will be pointing in the right direction (i.e., toward the optimum), so it will be slightly more accurate to use the gradient measured a bit farther in that direction rather than the gradient at the original position, as you can see in the figure below, where ∇1 represents the gradient of the cost function measured at the starting point θ, and ∇2 represents the gradient at the point located at θ + βm. As you can see, the Nesterov update ends up slightly closer to the optimum. After a while, these small improvements add up and NAG ends up being significantly faster than regular Momentum optimization. Moreover, note that when the momentum pushes the weights across a valley, ∇1 continues to push further across the valley, while ∇2 pushes back toward the bottom of the valley. This helps reduce oscillations and thus converge faster.

NAG will almost always speed up training compared to regular Momentum optimization.

To use it, simply set nesterov=True when creating the SGD optimizer:

optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

AdaGrad

Consider the elongated bowl problem again: Gradient Descent starts by quickly going down the steepest slope, then slowly proceeds down to the bottom of the valley. It would be nice if the algorithm could detect this early on and correct its direction to point a bit more toward the global optimum.

The AdaGrad algorithm achieves this by scaling down the gradient vector along the steepest dimensions.

AdaGrad algorithm
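1. s ← s + ∇θJ(θ) ⊗ ∇θJ(θ)

2. θ ← θ − η∇θJ(θ) ⊘ √(s + ϵ)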

The first step accumulates the squares of the gradients into the vector s (the ⊗ symbol represents element-wise multiplication). In other words, each si accumulates the squares of the partial derivative of the cost function with regard to parameter θi. If the cost function is steep along the ith dimension, then si will get larger and larger at each iteration.

The second step is almost identical to Gradient Descent, but with one big difference: the gradient vector is scaled down by a factor of √(s + ϵ) (the ⊘ symbol represents element-wise division, and ϵ is a smoothing term to avoid division by zero, typically set to 10⁻¹⁰). This vectorized form is equivalent to simultaneously computing the update θi ← θi − η (∂J(θ)/∂θi) / √(si + ϵ) for all parameters θi.
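In NumPy, on a toy cost with one steep and one gentle dimension, the two steps might look like this (an illustrative sketch; the cost function and all values are assumptions, not code from the article):

import numpy as np

def grad_J(theta):
    # gradient of the toy cost J = 0.5 * (10*theta[0]**2 + theta[1]**2)
    return np.array([10.0, 1.0]) * theta

eta, eps = 0.1, 1e-10
theta = np.array([5.0, 5.0])
s = np.zeros_like(theta)

for _ in range(200):
    g = grad_J(theta)
    s = s + g * g                                # step 1: accumulate squared gradients
    theta = theta - eta * g / np.sqrt(s + eps)   # step 2: scaled-down update

print(theta)   # the steep dimension's effective learning rate decays faster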

In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. It helps point the resulting updates more directly toward the global optimum. One additional benefit is that it requires much less tuning of the learning rate hyperparameter η.

AdaGrad often performs well for simple quadratic problems, but unfortunately, it often stops too early when training neural networks. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum. So even though Keras has an Adagrad optimizer, you should not use it to train deep neural networks (it may be efficient for simpler tasks such as Linear Regression, though). However, understanding Adagrad is helpful to grasp the other adaptive learning rate optimizers.

RMSProp

AdaGrad slows down a bit too fast and can end up never converging to the global optimum. The RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay in the first step.

RMSProp algorithm
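1. s ← βs + (1 − β)∇θJ(θ) ⊗ ∇θJ(θ)

2. θ ← θ − η∇θJ(θ) ⊘ √(s + ϵ)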

The decay rate β is typically set to 0.9. Yes, it is once again a new hyperparameter, but this default value often works well, so you may not need to tune it at all.

As you might expect, Keras has an RMSProp optimizer:

optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

Except for very simple problems, this optimizer almost always performs much better than AdaGrad. In fact, it was the preferred optimization algorithm of many researchers until Adam optimization came around.

Adam

Adam, which stands for adaptive moment estimation, combines the ideas of Momentum optimization and RMSProp: just like Momentum optimization, it keeps track of an exponentially decaying average of past gradients, and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients.

Adam Algorithm
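1. m ← β1m − (1 − β1)∇θJ(θ)

2. s ← β2s + (1 − β2)∇θJ(θ) ⊗ ∇θJ(θ)

3. m̂ ← m / (1 − β1^t)

4. ŝ ← s / (1 − β2^t)

5. θ ← θ + ηm̂ ⊘ √(ŝ + ϵ)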

Here t represents the iteration number (starting at 1).

If you just look at steps 1, 2, and 5, you will notice Adam’s close similarity to both Momentum optimization and RMSProp. The only difference is that step 1 computes an exponentially decaying average rather than an exponentially decaying sum, but these are actually equivalent except for a constant factor (the decaying average is just 1 − β1 times the decaying sum). Steps 3 and 4 are somewhat of a technical detail: since m and s are initialized at 0, they will be biased toward 0 at the beginning of training, so these two steps help boost m and s at the beginning of training.

The momentum decay hyperparameter β1 is typically initialized to 0.9, while the scaling decay hyperparameter β2 is often initialized to 0.999. As earlier, the smoothing term ϵ is usually initialized to a tiny number such as 10⁻⁷.

optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

Since Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), it requires less tuning of the learning rate hyperparameter η. You can often use the default value η = 0.001, making Adam even easier to use than Gradient Descent.

Hope this article gives you a better understanding of optimizers. See you soon with a new one :)
