Hyperparameters are the counterpart of learnable parameters. Learnable parameters, such as weights and biases, are learned and optimized by the neural network itself. They are also called trainable parameters because they are optimized during the training process using gradient descent. Hyperparameters, by contrast, are set before training and are not updated by it.
It is our responsibility to provide good values for these hyperparameters, drawing on our experience, domain knowledge, and cross-validation. We need to tune these hyperparameters manually to get better accuracy from the neural network.
The following is a list of hyperparameters commonly used in neural networks:
1. Number of hidden layers: Keep adding hidden layers until the loss (ideally measured on validation data) stops decreasing appreciably. As a general rule, we should use a large number of hidden layers together with a proper regularization technique.
2. Number of units or neurons in a layer: Too many units in a layer may cause overfitting, while too few may cause underfitting. So, try to strike a balance, and use dropout to keep overfitting in check.
3. Dropout: Dropout is a regularization technique that reduces overfitting and so improves the generalization capability of the neural network. During training, we deliberately drop (zero out) a random subset of the units in a hidden layer on each pass. The dropout rate should typically be between 20% and 50% of the neurons in a layer.
For more information on dropout, please see my article on dropout.
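As a rough illustration of what dropping units means, here is a minimal sketch in plain Python. It assumes the commonly used "inverted dropout" variant, in which the surviving units are scaled up so that the expected activation stays the same. The scaling step and the function name dropout are assumptions made for this example, not something described above.

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate` and
    scale the survivors by 1 / (1 - rate) (assumed variant)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)          # fixed seed so the run is repeatable
hidden = [1.0] * 10_000         # toy hidden-layer activations
dropped = dropout(hidden, rate=0.3, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"fraction dropped: {zero_fraction:.3f}, mean activation: {mean_after:.3f}")
```

With a rate of 0.3, roughly 30% of the units are zeroed and the mean activation stays close to its original value of 1. Dropout is applied only during training. At prediction time all units are kept.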
4. Activation Function: Activation functions introduce non-linearity into a neural network. Sigmoid, step, tanh, ReLU, and softmax are common activation functions. We mainly use the ReLU activation function for hidden layers and softmax for the output layer of a classifier.
For more details on activation functions, please see my article on activation functions.
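To make these functions concrete, here is a small sketch of sigmoid, ReLU, and softmax in plain Python. The function names and the sample inputs are chosen only for illustration. Real frameworks provide vectorized versions of all of them.

```python
import math

def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Pass positive values through unchanged and clip negatives to 0."""
    return max(0.0, x)

def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [relu(z) for z in [-2.0, 0.0, 3.0]]
probs = softmax([1.0, 2.0, 3.0])
print(hidden)  # [0.0, 0.0, 3.0]
print(probs)
```

ReLU leaves positive values untouched, which is one reason it is popular for hidden layers, while softmax produces a probability for each class at the output.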
5. Learning Rate: The learning rate determines how quickly the weights and biases are updated in a neural network. If the learning rate is very small, the learning process slows down significantly and the model converges too slowly. It may also get stuck in a local minimum and never reach the global minimum. A larger learning rate speeds up learning, but the training may fail to converge.
The learning rate is normally set somewhere between 0.0001 and 0.01. A decaying learning rate is usually preferred.
For more details on local and global minima, please refer to my article on that topic.
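The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on the loss f(w) = w^2, whose minimum is at w = 0, and also shows one simple way to decay the learning rate. The toy loss, the exponential schedule, and the function names are assumptions made for this example.

```python
def gradient_descent(learning_rate, steps=50, w=1.0):
    """Minimize the toy loss f(w) = w**2, whose gradient is 2 * w."""
    for _ in range(steps):
        grad = 2.0 * w
        w -= learning_rate * grad
    return w

def exponential_decay(initial_rate, decay, epoch):
    """One simple decay schedule: multiply the rate by `decay` every epoch."""
    return initial_rate * decay ** epoch

w_small = gradient_descent(0.001)   # too small: barely moves away from 1.0
w_good = gradient_descent(0.1)      # reasonable: ends up very close to 0
w_large = gradient_descent(1.1)     # too large: overshoots and diverges

schedule = [exponential_decay(0.01, 0.5, epoch) for epoch in range(4)]
print(w_small, w_good, w_large)
print(schedule)
```

With the same number of steps, the smallest rate is still far from the minimum, the middle rate has essentially converged, and the largest rate has moved further away with every step. The schedule starts at 0.01 and halves after each epoch.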
6. Momentum: Momentum helps accelerate SGD in the relevant direction. It uses knowledge of the previous steps to choose the direction of the next step, and it dampens oscillations by accumulating velocity along directions that stay consistent. A typical choice of momentum is between 0.5 and 0.9.
For more details on learning rate and momentum, please see my article on momentum and adaptive learning.
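Here is a minimal sketch of the classical momentum update, again on the toy loss f(w) = w^2. The velocity term carries over a fraction of the previous step. The toy loss, the step counts, and the function name are assumptions for illustration only.

```python
def sgd_momentum(momentum, learning_rate=0.01, steps=100, w=1.0):
    """Gradient descent with classical momentum on f(w) = w**2."""
    velocity = 0.0
    for _ in range(steps):
        grad = 2.0 * w
        # Keep a fraction of the previous step and add the new gradient step.
        velocity = momentum * velocity - learning_rate * grad
        w += velocity
    return w

w_plain = sgd_momentum(momentum=0.0)      # ordinary gradient descent
w_momentum = sgd_momentum(momentum=0.9)   # same learning rate, with momentum
print(abs(w_plain), abs(w_momentum))
```

With the same small learning rate and the same number of steps, the run with momentum ends up much closer to the minimum at 0 than the run without it.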
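7. Number of epochs: The number of epochs is the number of times the whole training set is shown to the network during training. Some libraries default to a single epoch (in Keras, for example, fit runs 1 epoch unless told otherwise), so this value usually needs to be set explicitly.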
8. Batch size: The batch size is the number of samples passed to the network at one time, after which a parameter update happens. Such a group of samples is also called a mini-batch. By convention, the batch size is chosen as a power of 2. A batch size of 128 is a common starting point.
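Epochs and batch size together determine how many parameter updates happen during training. The sketch below works through the arithmetic and shows one way to split a dataset into mini-batches. The dataset size of 50,000 samples and the helper names are made up for this example.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of parameter updates in one pass over the data."""
    return math.ceil(num_samples / batch_size)

def mini_batches(samples, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

num_samples, batch_size, epochs = 50_000, 128, 10   # hypothetical values
per_epoch = updates_per_epoch(num_samples, batch_size)
total_updates = per_epoch * epochs
print(per_epoch, total_updates)  # 391 3910

batch_sizes = [len(batch) for batch in mini_batches(list(range(10)), 4)]
print(batch_sizes)  # [4, 4, 2]
```

A smaller batch size means more updates per epoch, each computed from fewer samples. A larger batch size means fewer, smoother updates.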
9. Weight Initialization: Biases are typically initialized to 0 (or close to 0), but weights must be initialized carefully. Their initialization can have a big impact on the local minimum found by the training algorithm.
If the weights are too large: During back-propagation, this leads to the exploding gradient problem. The gradients of the cost with respect to the parameters become too big, which causes the cost to oscillate around its minimum value.
If the weights are too small: During back-propagation, this leads to the vanishing gradient problem. The gradients of the cost with respect to the parameters become too small, so the cost stops improving before it has reached its minimum value.
So, initializing weights with inappropriate values will lead to divergence or a slow-down in the training of the neural network.
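To prevent the vanishing and exploding gradient problems, we usually draw the weights at random from a normal distribution with mean 0 and a small standard deviation. A standard deviation of 1 is usually too large for wide layers, so in practice it is scaled by the layer size, as in Xavier or He initialization.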
For more details on weight initialization, please see my post on that topic.
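The effect of the weight scale can be seen in a small experiment. The sketch below pushes a random input through a stack of purely linear layers and measures the size of the output for three choices of weight standard deviation. The layer width, the depth, the absence of activation functions, and the 1/sqrt(width) scaling (a common heuristic in the spirit of Xavier initialization) are all assumptions made for this toy example. It tracks the forward signal only. In a linear network, gradients flowing backward pass through the same weight matrices and scale in a similar way.

```python
import math
import random

def output_rms(weight_std, width=64, depth=5, seed=0):
    """Root-mean-square size of the output of `depth` random linear layers."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output unit is a weighted sum of all inputs,
        # with weights drawn from N(0, weight_std**2).
        x = [sum(rng.gauss(0.0, weight_std) * xi for xi in x)
             for _ in range(width)]
    return math.sqrt(sum(v * v for v in x) / width)

width = 64
rms_large = output_rms(1.0, width)                      # weights too large
rms_small = output_rms(0.01, width)                     # weights too small
rms_scaled = output_rms(1.0 / math.sqrt(width), width)  # scaled by layer width
print(rms_large, rms_small, rms_scaled)
```

After only five layers, the signal has exploded with the large weights and nearly vanished with the small ones, while the scaled weights keep it at roughly its original size.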
10. Loss Function: The loss function compares the network's output for a training example against the intended output. A common general-purpose loss function is the squared error loss. When the output of the neural network is treated as a probability distribution (e.g. when a softmax output layer is used), we generally use cross-entropy as the loss function.
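Both loss functions are short enough to write out. The sketch below assumes the mean of the squared errors for the first one, and a single training example with a known correct class for the second. The function names and sample values are for illustration only.

```python
import math

def squared_error(predictions, targets):
    """Mean of the squared differences between outputs and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(probabilities, true_class, eps=1e-12):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(max(probabilities[true_class], eps))

mse = squared_error([1.0, 2.0], [1.0, 4.0])
print(mse)  # 2.0

confident_right = cross_entropy([0.1, 0.9], true_class=1)
confident_wrong = cross_entropy([0.9, 0.1], true_class=1)
print(confident_right, confident_wrong)
```

Cross-entropy is small when the network puts a high probability on the correct class and grows quickly when it is confidently wrong.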
Hyperparameter Tuning: The following are some ways to tune hyperparameters in a neural network:
1. Coordinate Descent: It keeps all hyperparameters fixed except for one, and adjusts that hyperparameter to minimize the validation error. It then repeats the process for each of the other hyperparameters in turn.
2. Grid Search: Grid search tries every hyperparameter setting over a specified range of values. This involves a cross-product of all the value ranges, so the computational expense is exponential in the number of hyperparameters. The good part is that it can be easily parallelized.
3. Random Search: This is an alternative to grid search. Instead of taking the cross-product of all the ranges, it samples the hyperparameter space randomly. It often performs better than grid search on the same budget, because grid search can take an exponentially long time to reach a good hyperparameter subspace. It can also be parallelized.
4. Cross-validation: We can also use cross-validation, training and evaluating each hyperparameter setting on different portions of the dataset and comparing the averaged results.
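Grid search and random search can be sketched in a few lines. In the example below, validation_error is a hypothetical stand-in for "train the network with these settings and measure the validation error". Its formula, the value ranges, and the log-uniform sampling of the learning rate are assumptions made purely for illustration.

```python
import itertools
import math
import random

def validation_error(learning_rate, dropout):
    """Hypothetical stand-in for training a model and scoring it."""
    return (math.log10(learning_rate) + 3.0) ** 2 + (dropout - 0.3) ** 2

# Grid search: try the cross-product of all candidate values.
learning_rates = [1e-4, 1e-3, 1e-2]
dropouts = [0.2, 0.3, 0.4, 0.5]
grid = list(itertools.product(learning_rates, dropouts))
best_grid = min(grid, key=lambda setting: validation_error(*setting))

# Random search: sample the same number of settings from the same ranges.
rng = random.Random(0)
samples = [(10 ** rng.uniform(-4, -2), rng.uniform(0.2, 0.5))
           for _ in range(len(grid))]
best_random = min(samples, key=lambda setting: validation_error(*setting))

print(len(grid), best_grid)
print(best_random)
```

The grid here has 3 x 4 = 12 settings, and adding a third hyperparameter with 4 candidate values would grow it to 48. Random search uses a fixed budget of trials no matter how many hyperparameters there are. Since every trial is independent, both loops could be run in parallel.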