Batch normalization occurs per batch, that is why, it is called batch normalization. We normalize (mean = 0, standard deviation = 1) the output of a layer before applying the activation function, and then feed it into the next layer in a neural network. So, instead of just normalizing the inputs to the network, we normalize the inputs to each hidden layer within the network.
Advantages of Batch Normalization
1. Solves internal covariate shift: In a neural network, each hidden unit’s input distribution changes every time when there is a parameter update in the previous layer. This is called internal covariate shift. This makes training slow and requires a very small learning rate and a good parameter initialization. This problem is solved by normalizing the layer’s inputs over a mini-batch.
2. Solves Vanishing and Exploding Gradient issues: Unstable gradients like vanishing gradients and exploding gradients are the common issues which occur while training a neural network. By normalizing the outputs of each layer, we can significantly deal with this issue.
3. Training becomes faster: In traditional neural networks, if we use higher learning rate, we may face exploding gradient issue. Also, if we use higher learning rate, there are possibilities that network may not converge at all and keeps oscillating around the global minima. Due to this, we usually prefer lower learning rate in traditional neural networks. But, with lower learning rate, as the networks get deeper, gradients get smaller during back propagation, and so require even more iterations which increases training period. But, if we normalize the output of each layer, we can safely use higher learning rate due to which we can drastically reduce the training period.
4. Solves dying ReLU problem: ReLUs often die out during training the deep neural networks and many neurons stop contributing. But, with batch normalization, we can regulate the output of each hidden layer, which prevents this issue in deep neural networks. For more detail, on dying ReLU, you can refer my this article.
5. Introduces regularization: Batch normalization all provides some sort of regularization to the neural network which increases the generalization capabilities on the network.