## Pages

Machine Learning Quiz (134 Objective Questions) Start ML Quiz

Deep Learning Quiz (205 Objective Questions) Start DL Quiz

## Wednesday, 12 June 2019

1. Momentum based (Nesterov Momentum)

Lets first understand something about momentum.

Momentum

Momentum helps in accelerating SGD in a relevant direction. So, its a good idea to also consider momentum for every parameter. It has following advantages:

1. Avoids local minima: As momentum adds up speed and hence increases the step size, optimizer will not get trapped in local minima.

2. Faster convergence: Momentum makes the convergence faster as it increases the step size due to the gained speed.

Now, lets see some flavors of SGD.

1. Nesterov Momentum

It finds out the current momentum and based upon that approximates the next position. And then, it calculates the gradient w.r.t next approximated position instead of calculating gradient w.r.t current position. This thing prevents us from going too fast and results in increased responsiveness, which significantly increases the performance of SGD.

In standard SGD, learning rate is always constant. It means, we have to go with same speed irrespective of the slope. This seems impractical in real life.

What happen if we know that we should slow down or speed up? What happen if we know that we should accelerate more in this direction and decelerate in that direction? Its not possible using the standard SGD.

Adagrad keeps updating the learning rate instead of using constant learning rate. It accumulates the sum of squared of all of the gradient, and use that to normalize the learning rate, so that now the learning rate could be smaller or larger depending on how the past gradients behaved.

It adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data.

As discussed in Adagrad section, Adagrad accumulates the sum of squared of all of the gradient, and use that to normalize the learning rate. Due to this, Adagrad encounters an issue. The issue is that learning rate in Adagrad keeps on decreasing due to which at a point learning almost stops.

To handle this issue AdaDelta and RMSprop decay the past accumulated gradient, so only a portion of past gradients are considered. Now, instead of considering all of the past gradients, we consider the moving average.

Adam is the finest Gradient Descent Optimizer and is widely used. It uses powers of both momentum and adaptive learning. In other words, Adam is RMSprop or AdaDelta with momentum. It considers momentum and also normalize the learning rate using the moving average squared gradient.

Conclusion: Most of the above Gradient Descent methods are already implemented in the popular Deep Learning frameworks like Tensorflow, Keras, Caffe etc. However, Adam is currently the default recommended algorithm to be used as it utilizes both momentum and adaptive learning features.

For more details on above algorithms, I strongly refer this and this article.