Online Machine Learning Quiz

100+ Objective Machine Learning Questions. Lets see how many can you answer?

Start Quiz

Saturday, 15 June 2019

Difference between Sigmoid and Softmax function in deep learning

Softmax function can be understood as a generalized version of a sigmoid function or an extension of a sigmoid function. Sigmoid function is usually used in the output layers of neural networks. 

Following are some of the differences between Sigmoid and Softmax function:

1. The sigmoid function is used for the two-class (binary) classification problem, whereas the softmax function is used for the multi-class classification problem.

2. Sum of all softmax units are supposed to be 1. In sigmoid, it’s not really necessary. Sigmoid just makes output between 0 to 1. The softmax enforces that the sum of the probabilities of all the output classes are equal to one, so in order to increase the probability of a particular class, softmax must correspondingly decrease the probability of at least one of the other classes. 

When you use a softmax, basically you get a probability of each class (join distribution and a multinomial likelihood) whose sum is bound to be one. In case, you use sigmoid for multi class classification, it’d be like a marginal distribution and a Bernoulli likelihood.

3. Formula for Sigmoid and Softmax

Sigmoid function:

Softmax function:

Let me illustrate the point 2 with an example here. Lets say, we have 6 inputs: 


If we pass these inputs through the sigmoid function, we will get following output:

[0.5, 0.73, 0.88, 0.95, 0.98, 0.99] 

Sum of the above output units is 5.03 which is greater than 1. 

But in case of softmax, the sum of output units is always 1. Lets see how? Pass the same input to softmax function, and we get following output:

[0.001, 0.009, 0.03, 0.06, 0.1, 0.8] which sums up to 1.

4. Sigmoid is usually used as an activation function in hidden layers (but we use ReLU nowadays) while Softmax is used in output layers.

A general rule of thumb is to use ReLU as an activation function in hidden layers and softmax in output layer in a neural networks. For more information on activation functions, please visit my this post.

Friday, 14 June 2019

Regularization Techniques used in Neural Networks in Deep Learning

Ideally, the neural networks should never underfit and overfit and maintain good generalization capabilities. For this purpose, we use various regularization techniques in our neural networks. Below is the list of some of the regularization techniques which are commonly used to improve the performance and accuracy of the neural networks in deep learning.

1. L1 and L2 Regularization

L1 and L2 are the most common types of regularization techniques used in machine learning as well as in deep learning algorithms. These update the general cost function by adding another term known as the regularization penalty. 

For more details, please go through my this article.

2. Dropout

Dropout can be seen as temporarily deactivating or ignoring neurons in the hidden layers of a network. Probabilistically dropping out nodes in the network is a simple and effective regularization method. We can switch off some neurons in a layer so that they do not contribute any information or learn any information and the responsibility falls on other active neurons to learn harder and reduce the error.

For more details on dropout, please consider visiting my this post.

3. Data Augmentation

If we come to know that our model is performing poorly due to overfitting, we can increase the training data to handle this situation. In many cases in deep learning, increasing the amount of data is not a difficult task. Lets take an example of our MNIST dataset (hand written digits). We can easily generate thousands of other similar images by rotating, flipping, scaling and shifting the existing images. In machine learning, this task is not that easy as we need labelled data which is not easily available. This phenomenon of increasing the training data to reduce overfitting is called data augmentation.

4. Early Stopping

While training a neural network, there will be a point during training when the model will stop generalizing and start learning the noise in the training dataset. This leads to overfitting.

One approach to solve this problem is to treat the number of training epochs as a hyperparameter and train the model multiple times with different values, then select the number of epochs that result in the best performance. 

The downside of this approach is that it requires multiple models to be trained and discarded. This can be computationally inefficient and time-consuming.

Another approach is early stopping. The model is evaluated on a validation dataset after each epoch. If the performance of the model on the validation dataset starts to degrade (e.g. loss begins to increase or accuracy begins to decrease), then the training process is stopped. The model at the time when the training is stopped, is then used and is known to have good generalization performance.

Thursday, 13 June 2019

Hyperparameter Tuning in Neural Networks in Deep Learning

In order to minimize the loss and determine optimal values of weight and bias, we need to tune our neural network hyper-parameters. Hyperparameters are the parameters that the neural network can’t learn itself via gradient descent or some other variant. 

Hyper-parameters are opposite of learnable parameters. Learnable parameters are automatically learned and then optimized by the neural network. For example, weights and bias are learnable by the neural networks. These are also called trainable parameters as these are optimized during the training process using gradient descent.

This is our responsibility to provide optimal values for these hyper-parameters from our experience, domain knowledge and cross-validation. We need to manually tweak these hyperparameters to get better accuracy from the neural network.

Following is the list of hyperparameters used in neural networks:

1. Number of hidden layers: Keep adding the hidden layers until the loss function does not minimize to a certain extent. General rule is that we should a use a large number of hidden layers with proper regularization technique.

2. Number of units or neurons in a layer: Larger number of units in a layer may cause overfitting. Smaller number of units may cause underfitting. So, try to maintain a balance and use dropout technique.

3. Dropout: Dropout is regularization technique to avoid overfitting thus increasing the generalizing capabilities of the neural network. In this technique, we deliberately drop some units in a hidden layer to introduce generalization capabilities into it. Dropout value should range in between 20%-50% of number of neurons in a layer. 

For more information on dropout, please consider going through my this article on dropout.

4. Activation Function: Activation functions introduce non-linearity in a neural network. Sigmoid, Step, Tanh, ReLU, Softmax are the activation functions. Mainly we use ReLU activation function for hidden layers and softmax for output layer. 

For more details on activation functions, please consider going through my this article on activation functions.

5. Learning Rate: Learning rate determines how quickly weights and bias are updated in a neural network. If the learning rate is very small, learning process will significantly slow down and the model will converge too slowly. It may also also end up in local minima and never reach global minima. Larger learning rate speeds up the learning but may not converge. Usually a decaying learning rate is preferred.

For more details on local and global minima, please refer my this article.

6. Momentum: Momentum helps in accelerating SGD in a relevant direction. Momentum helps to know the direction of the next step with the knowledge of the previous steps. It helps to prevent oscillations by adding up the speed. A typical choice of momentum should be between 0.5 to 0.9.

For more details on learning rate and momentum, please consider going through my this article on momentum and adaptive learning.

7. Number of epochs: Number of epochs is the number of times the whole training data is shown to the network while training. Default number of epochs is 1.

8. Batch size: Batch size is the number of sub samples given to the network after which parameter update happens. It should be in power of 2. Default batch size is 128.

9. Weight Initialization: Biases are typically initialized to 0 (or close to 0), but weights must be initialized carefully. Their initialization can have a big impact on the local minimum found by the training algorithm. Usually we assign random numbers for weights in such a way that weights are normally distributed (mean = 0, standard deviation = 1).

10. Loss Function: The loss function compares the network's output for a training example against the intended output. A common general-purpose loss function is the Squared Errors loss function. When the output of the neural network is being treated as a probability distribution (e.g. a softmax output layer is being used), we generally use the cross-entropy as a loss function.

Hyperparameter Tuning: Following are some ways to tune hyperparameters in a neural network:

1. Coordinate Descent: It keeps all hyperparameters fixed except for one, and adjust that hyperparameter to minimize the validation error.

2. Grid Search: Grid search tries each and every hyperparameter setting over a specified range of values. This involves a cross-product of all intervals, so the computational expense is exponential in the number of parameters. Good part is that it can be easily parallelized.

3. Random Search: This is opposite of grid search. Instead of taking cross-product of all the intervals, it samples the hyperparameter space randomly. It performs better than grid search because grid search can take an exponentially long time to reach a good hyperparameter subspace. This can also be parallelized.

4. Cross-validation: We can also try cross-validation by trying different portions of dataset during training and testing. 

Wednesday, 12 June 2019

Global and Local Minima in Gradient Descent in Deep Learning

Task of a Gradient Descent optimizer is to find out optimal weights for the parameters. But sometimes, it may end up in finding weights which are less than the optimal value which leads to inaccuracy of the model. 

To understand it better, consider the following diagram.

The lowest point in the above diagram is referred to as the global minima while other lower points are referred to as local minima. Ideally our SGD should reach till global minima but sometimes it gets stuck in the local minima and it becomes very hard to know that whether our SGD is in global minima or stuck in local minima.

How to avoid local minima?

Local minima is a major issue with gradient descent. Hyper-parameter tuning plays a vital role in avoiding local minima. There is no universal solution to this problem, but there are some methods which we can use to avoid local minima.

1. Increasing the learning rate: If the learning rate of the algorithm is too small, then it is more likely that SGD will get stuck in a local minima.

2. Add some noise while updating weights: Adding random noise to weights also sometimes helps in finding out global minima.

3. Assign random weights: Repeated training with random starting weights is among the popular methods to avoid this problem, but it requires extensive computational time.

4. Use large number of hidden layers: Each hidden node in a layer starts out in a different random starting state. This allows each hidden node to converge to different patterns in the network. Parameterizing this size allows the neural network user to potentially try thousands (or tens of billions) of different local minima in a single neural network.

5. MOST EFFECTIVE ONE: Using momentum and adaptive learning based SGD: Instead of using conventional gradient descent optimizers, try using optimizers like Adagrad, AdaDelta, RMSprop and Adam. Adam uses momentum and adaptive learning rate to reach the global minima. You can find out more detail about momentum and adaptive learning based algorithms in my this article.

Sometimes local minimas are as good as global minimas

Usually, it is not always necessary to reach the true global minimum. It is generally agreed upon that most of the local minimas have values which are close to the global minimum. 

There are a lot of papers and research which shows sometimes reaching to global minima is not easy. So, in these cases, if we manage to find an optimal local minima which is as good as global minima, we should use that.

Momentum and Adaptive Learning based Gradient Descent Optimizers: Adagrad and Adam

In my previous article on Gradient Descent Optimizers, we had discussed about three types of Gradient Descent algorithms:

1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini Batch Gradient Descent

In this article, we will see some advanced versions of Gradient Descent which can be categorized as:

1. Momentum based (Nesterov Momentum)
2. Based on adaptive learning rate (Adagrad, Adadelta, RMSprop)
3. Combination of momentum and adaptive learning rate (Adam)

Lets first understand something about momentum.


Momentum helps in accelerating SGD in a relevant direction. So, its a good idea to also consider momentum for every parameter. It has following advantages:

1. Avoids local minima: As momentum adds up speed and hence increases the step size, optimizer will not get trapped in local minima.

2. Faster convergence: Momentum makes the convergence faster as it increases the step size due to the gained speed.

Now, lets see some flavors of SGD.

1. Nesterov Momentum

It finds out the current momentum and based upon that approximates the next position. And then, it calculates the gradient w.r.t next approximated position instead of calculating gradient w.r.t current position. This thing prevents us from going too fast and results in increased responsiveness, which significantly increases the performance of SGD.

2. Adagrad

It mainly focuses on adaptive learning rate instead of momentum

In standard SGD, learning rate is always constant. It means, we have to go with same speed irrespective of the slope. This seems impractical in real life. 

What happen if we know that we should slow down or speed up? What happen if we know that we should accelerate more in this direction and decelerate in that direction? Its not possible using the standard SGD.

Adagrad keeps updating the learning rate instead of using constant learning rate. It accumulates the sum of squared of all of the gradient, and use that to normalize the learning rate, so that now the learning rate could be smaller or larger depending on how the past gradients behaved.

It adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data.

2A. AdaDelta and RMSprop

AdaDelta and RMSprop are an extension of Adagrad.

As discussed in Adagrad section, Adagrad accumulates the sum of squared of all of the gradient, and use that to normalize the learning rate. Due to this, Adagrad encounters an issue. The issue is that learning rate in Adagrad keeps on decreasing due to which at a point learning almost stops. 

To handle this issue AdaDelta and RMSprop decay the past accumulated gradient, so only a portion of past gradients are considered. Now, instead of considering all of the past gradients, we consider the moving average.

3. Adam

Adam is the finest Gradient Descent Optimizer and is widely used. It uses powers of both momentum and adaptive learning. In other words, Adam is RMSprop or AdaDelta with momentum. It considers momentum and also normalize the learning rate using the moving average squared gradient.

Conclusion: Most of the above Gradient Descent methods are already implemented in the popular Deep Learning frameworks like Tensorflow, Keras, Caffe etc. However, Adam is currently the default recommended algorithm to be used as it utilizes both momentum and adaptive learning features.

For more details on above algorithms, I strongly refer this and this article.