To understand it better, consider the following diagram.
The lowest point in the above diagram is referred to as the global minima while other lower points are referred to as local minima. Ideally our SGD should reach till global minima but sometimes it gets stuck in the local minima and it becomes very hard to know that whether our SGD is in global minima or stuck in local minima.
How to avoid local minima?
Local minima is a major issue with gradient descent. Hyper-parameter tuning plays a vital role in avoiding local minima. There is no universal solution to this problem, but there are some methods which we can use to avoid local minima.
1. Increasing the learning rate: If the learning rate of the algorithm is too small, then it is more likely that SGD will get stuck in a local minima.
2. Add some noise while updating weights: Adding random noise to weights also sometimes helps in finding out global minima.
3. Assign random weights: Repeated training with random starting weights is among the popular methods to avoid this problem, but it requires extensive computational time.
4. Use large number of hidden layers: Each hidden node in a layer starts out in a different random starting state. This allows each hidden node to converge to different patterns in the network. Parameterizing this size allows the neural network user to potentially try thousands (or tens of billions) of different local minima in a single neural network.
5. MOST EFFECTIVE ONE: Using momentum and adaptive learning based SGD: Instead of using conventional gradient descent optimizers, try using optimizers like Adagrad, AdaDelta, RMSprop and Adam. Adam uses momentum and adaptive learning rate to reach the global minima. You can find out more detail about momentum and adaptive learning based algorithms in my this article.
Sometimes local minimas are as good as global minimas
Usually, it is not always necessary to reach the true global minimum. It is generally agreed upon that most of the local minimas have values which are close to the global minimum.
There are a lot of papers and research which shows sometimes reaching to global minima is not easy. So, in these cases, if we manage to find an optimal local minima which is as good as global minima, we should use that.