Thursday, 27 June 2019

100+ Basic Deep Learning Interview Questions and Answers

Below is a list of basic deep learning interview questions with answers. The questions cover perceptrons, neural networks, weights and biases, activation functions, the gradient descent algorithm, CNN (ConvNets), CapsNets, RNN, LSTM, regularization techniques, dropout, hyperparameters, transfer learning, fine-tuning a model, autoencoders, and deep learning frameworks such as TensorFlow and Keras. I will keep adding more questions to this list, so stay tuned.

Note: For Machine Learning Interview Questions, refer to this link.

Introduction

1. What is Deep Learning? How is it different from machine learning? What are the pros and cons of deep learning over machine learning? Answer

2. How does deep learning mimic the behavior of the human brain? How would you compare an artificial neuron to a biological neuron?

Perceptron

3. What is a Perceptron? How does it work? What is a multi-layer perceptron?

4. What are the various limitations of a Perceptron? Why can't we implement an XOR gate using a single Perceptron?

Answers to above questions

Neural Networks

5. What are the various layers in a neural network?

6. What are the various types of a neural network?

7. What are Deep and Shallow neural networks? What are the advantages and disadvantages of deep neural networks over shallow neural networks?

Answers to above questions

Weights and Bias

8. What is the importance of weights and biases in a neural network? What are the things to keep in mind while initializing weights and biases? Answer

9. What is Xavier Weight Initialization technique? How is it helpful in initializing the weights? How does weight initialization vary for different types of activation functions? Answer 

10. Explain forward and backward propagation in a neural network. How does a neural network update weights and biases during back propagation? (See Gradient Descent section for answer)

Activation Functions

11. What do you mean by activation functions in neural networks? Why do we call them squashing functions? How do activation functions bring non-linearity in neural networks?

12. Explain various activation functions like Step (Threshold), Logistic (Sigmoid), Hyperbolic Tangent (Tanh), and ReLU (Rectified Linear Unit). What are the various advantages and disadvantages of using these activation functions?

Answers to above questions

13. Dying and Leaky ReLU: What do you mean by Dying ReLU? When is a neuron considered dead in a neural network? How does leaky ReLU help in dealing with dying ReLU? Answer

14. What is the difference between Sigmoid and Softmax activation functions? Answer

Batches

15. Explain the terms: Epochs, Batches and Iterations in neural networks.

16. What do you mean by Batch Normalization? What are its various advantages? Answer

Loss Function

17. What is the difference between categorical_crossentropy and sparse_categorical_crossentropy? Which one to use and when?

Hint: For one hot encoded labels, use categorical_crossentropy. Otherwise, use sparse_categorical_crossentropy.
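The hint above can be checked with a small sketch in plain Python. The two functions below are simplified, hypothetical re-implementations for a single sample (not the Keras code itself), and the probability vector is made up for illustration. They show that both losses compute the same value and differ only in how the label is encoded.

```python
import math

def categorical_crossentropy(one_hot, probs):
    """Cross-entropy for a one-hot encoded label such as [0, 1, 0]."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(class_index, probs):
    """Cross-entropy for an integer class label such as 1."""
    return -math.log(probs[class_index])

# Hypothetical softmax output of a network for 3 classes.
probs = [0.1, 0.7, 0.2]

loss_one_hot = categorical_crossentropy([0, 1, 0], probs)
loss_sparse = sparse_categorical_crossentropy(1, probs)
print(loss_one_hot, loss_sparse)  # the two values are equal
```

The sparse variant avoids building one-hot vectors, which saves memory when there are many classes.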

Gradient Descent

18. What is Gradient Descent? How is it helpful in minimizing the loss function? What are its various types? 

19. Explain Batch, Stochastic, and Mini Batch Gradient Descent. What are the advantages and disadvantages of these Gradient Descent methods? Answer

20. Explain these terms in context of SGD: Momentum, Nesterov Momentum, AdaGrad, AdaDelta, RMSprop, Adam. Answer

21. What is the difference between Local and Global Minima? What are the ways to avoid local minima? Answer

22. Explain Vanishing and Exploding Gradients.

23. What is Learning Rate? How does low and high learning rate affect the performance and accuracy of a neural network? Answer

24. If the loss in a neural network is not decreasing after many training iterations, what could be the possible reasons?

Hint: Think of a learning rate that is too low or too high, local and global minima (training may be stuck at a local minimum), a high regularization parameter etc.

CNN (ConvNets)

25. What is Convolutional Neural Network? Explain various layers in a CNN? 

26. What are the Filters (Kernels) in CNN? What is Stride?

27. What do you mean by Padding in CNN? What is the difference between Zero Padding and Valid Padding?

28. What do you mean by Pooling in CNN? What are the various types of pooling? Explain Max Pooling, Min Pooling, Average Pooling and Sum Pooling.

29. What are the various hyperparameters in CNN which need to be tuned during the training process?

30. How is CNN different from traditional fully connected neural networks? Why can't we use fully connected neural networks for image recognition?

31. Suppose we have an input of n X n dimension and filter of f X f dimension. If we slide this filter over the input in the convolutional layer, what will be the dimension of the resulting output?

Answers to above questions
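As a quick aid for question 31: with no padding and a stride of 1, an n X n input and an f X f filter give an (n - f + 1) X (n - f + 1) output. The sketch below uses the general form of that rule. The function name and the padding and stride parameters are additions for illustration and are not part of the question.

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Side length of the output when an f x f filter slides over an n x n input."""
    return (n + 2 * padding - f) // stride + 1

print(conv_output_size(6, 3))             # no padding, stride 1
print(conv_output_size(6, 3, padding=1))  # padding of 1 keeps the input size
print(conv_output_size(7, 3, stride=2))   # stride 2 skips every other position
```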

CapsNets

32. What is Capsule Neural Network (CapsNets)? How is it different from CNN (ConvNets)? Answer

Computer Vision

33. What is computer vision? How does deep learning help in solving various computer vision problems? Answer

RNN

34. Explain RNN (Recurrent Neural Network). Why is RNN best suited for sequential data?

35. What do you mean by feedback loop in RNN?

36. What are the various types of RNN? Explain with example: One to One, One to Many, Many to One, and Many to Many RNN.

37. What is Bidirectional RNN?

38. What are the various issues with RNN? Explain Vanishing and Exploding Gradients. What are the various ways to solve these gradient issues in RNN?

39. What are the various advantages and disadvantages of RNN?

40. What are the various applications of RNN?

41. What are the differences between CNN and RNN?

LSTM

42. How does LSTM (Long Short Term Memory) solve Vanishing Gradient issue in RNN?

43. What are the gated cells in LSTM? What are the various types of gates used in LSTM?

44. What are the various applications of LSTM?

Answers to all questions of RNN and LSTM

Regularization

45. What are the main causes of overfitting and underfitting in a neural network?

46. What are the various regularization techniques used in a neural network?

47. Explain L1 and L2 Regularization techniques used in a neural network.

48. What is Dropout? How does it prevent overfitting in a neural network? What are its various advantages and disadvantages? Answer

49. What is Data Augmentation? How does it prevent overfitting in a neural network?

50. What is Early Stopping? How does it prevent overfitting in a neural network?

Answers to above questions

Learnable Parameters and Hyperparameters

51. What are the learnable parameters in a neural network? Explain with an example.

52. What are the various hyperparameters used in a neural network? What are the various ways to optimize these hyper-parameters?

Answers to above questions

53. How will you manually calculate number of weights and biases in a fully connected neural network? Explain with an example. YouTube video

54. How will you manually calculate number of weights and biases in a convolutional neural network (CNN)? Explain with an example. YouTube video

Transfer Learning

55. What do you mean by Transfer Learning and Fine-tuning a model? What are its various advantages? What are the various steps to fine-tune a model? Answer

Autoencoders

56. What are Autoencoders? What are the various components of an autoencoder? Explain encoder, decoder and bottleneck. How does an autoencoder work?

57. What do you mean by latent space representation and reconstruction loss in an autoencoder?

58. What are the various properties of an autoencoder?

59. What are the various types of an autoencoder? Explain Undercomplete autoencoder, Sparse autoencoder, Denoising autoencoder, Convolutional autoencoder, Contractive autoencoders and Deep autoencoders.

60. How do we add regularization capabilities to autoencoders?

61. What are the various applications of an autoencoder?

62. What are the various hyperparameters we need to tune in an autoencoder?

63. How will you compare Autoencoders with PCA (Principal Component Analysis)?

64. What is RBM (Restricted Boltzmann Machine)? What is the difference between an Autoencoder and RBM?

Answers to above questions

Frameworks

65. What are the various frameworks available to implement deep learning models? What should be the characteristics of an ideal deep learning framework? Answer

TensorFlow

66. Explain TensorFlow architecture.

67. What is a Tensor? Explain Tensor Datatypes and Ranks.

68. What are Constants, Placeholders and Variables in a TensorFlow? Why do we need to initialize variables explicitly?

69. What is a Computational Graph? What are the nodes and edges in it? How to build and run the graph using session? What are its various advantages?

70. What is a Tensor Board? How is it useful?

71. What is a TensorFlow Pipeline? How is it useful?

72. Explain these terms: Feed Dictionary and Estimators

Answers to above questions

73. Write a sample code to demonstrate constants, placeholders and variables in TensorFlow? Answer

74. Write a sample code using TensorFlow to demonstrate gradient descent? Answer

75. Implement a Linear Classification Model using TensorFlow Estimator. Answer

Keras

76. What do you know about Keras framework? What are its various advantages and limitations? Answer

77. How will you build a basic sequential model using Keras? Answer 

78. How will you solve a regression problem using sequential model in Keras? Answer

79. How will you build a basic CNN model using Keras? Answer 

80. How will you build a basic LSTM model using Keras?

81. What are the various pre-trained models available in Keras? How are these pre-trained models useful for us?

82. How will you fine-tune VGG16 model for image classification? Answer

83. How will you fine-tune MobileNet model for image classification? What is the difference between VGG16 and MobileNet model?

Some of the above questions don't have answers yet. I am still writing answers for these questions and will keep this list updated. Although the above list does not yet contain 100+ questions as claimed in the title of the post, I will take the count beyond 100 very soon.

Tuesday, 25 June 2019

Xavier Weight Initialization Technique in Neural Networks

Weight initialization is one of the most important steps when training a neural network. If the weights are too large, they may lead to exploding gradients. If the weights are too small, they may lead to vanishing gradients. Due to these issues, our model may take a long time to converge, or it may never converge at all. So, weight initialization should be done with care.

Normally, weights are randomly initialized at the beginning. A simple approach is to draw them from a Gaussian distribution with zero mean and a standard deviation of one. The problem with this approach is that the variance of the activations tends to change from one layer to the next, which can cause the gradients to explode or vanish.

Xavier Weight Initialization Technique

With each passing layer, we want the variance or standard deviation to remain the same. This helps us keep the signal from exploding to a high value or vanishing to zero. In other words, we need to initialize the weights in such a way that the variance remains the same with each passing layer. This initialization process is known as Xavier initialization. 

In Xavier initialization technique, we need to pick the weights from a Gaussian distribution with zero mean and a variance of 1/N (instead of 1), where N specifies the number of input neurons.

Notes

1. The original Xavier (Glorot) proposal uses a variance of 2/(Nin + Nout) instead of 1/N. Nin is the number of inputs coming into the layer (fan-in) and Nout is the number of outputs going out of it (fan-out). The 1/N form, which uses only the fan-in, is a common simplification.

2. In Keras, the Xavier technique (the glorot_uniform initializer) is used by default to initialize the weights of layers such as Dense and Conv2D.

For ReLU activation function

If we are using ReLU as the activation function in hidden layers, the variance is scaled by 2/n instead of 1/n. This variant is commonly known as He initialization. The steps are:

1. Generate random weights from a Gaussian distribution having mean 0 and a standard deviation of 1.

2. Multiply those random weights with the square root of (2/n). Here n is number of input units for that layer.

For other activation functions like Sigmoid or Hyperbolic Tangent

If we are using Sigmoid or Tanh as the activation function in hidden layers, we go through the following steps to implement the Xavier initialization technique:

1. Generate random weights from a Gaussian distribution having mean 0 and a standard deviation of 1.

2. Multiply those random weights with the square root of (1/n). Here n is number of input units for that layer.
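The two recipes above can be sketched in plain Python as follows. This is a minimal illustration, not a framework implementation: the function name init_weights, the layer sizes and the seed are assumptions chosen for the example. It draws standard Gaussian weights, multiplies them by the square root of (1/n) or (2/n), and then checks the resulting variance.

```python
import math
import random
import statistics

def init_weights(n_in, n_out, activation="tanh", seed=0):
    """Draw N(0, 1) weights and rescale them by sqrt(1/n_in) or sqrt(2/n_in)."""
    rng = random.Random(seed)
    # sqrt(2/n) for ReLU, sqrt(1/n) for Sigmoid or Tanh, as in the steps above.
    scale = math.sqrt(2.0 / n_in) if activation == "relu" else math.sqrt(1.0 / n_in)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)] for _ in range(n_out)]

def weight_variance(weights):
    return statistics.pvariance([w for row in weights for w in row])

n_in, n_out = 400, 50  # hypothetical layer sizes
var_tanh = weight_variance(init_weights(n_in, n_out, "tanh"))
var_relu = weight_variance(init_weights(n_in, n_out, "relu"))

print(var_tanh, 1.0 / n_in)  # empirical variance is close to 1/n
print(var_relu, 2.0 / n_in)  # empirical variance is close to 2/n
```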

Monday, 24 June 2019

Batch Normalization in Neural Networks in Deep Learning

Batch normalization (batchnorm) is a technique to improve the performance and accuracy of a neural network. The terms normalization and standardization are often used interchangeably. For more details on normalization and standardization, you can visit my article on this topic.

Batch normalization is computed per mini-batch, which is why it is called batch normalization. We normalize (mean = 0, standard deviation = 1) the output of a layer before applying the activation function, rescale and shift the result with two learnable parameters (usually called gamma and beta), and then feed it into the next layer of the network. So, instead of just normalizing the inputs to the network, we normalize the inputs to each hidden layer within the network.
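The normalization step can be sketched for a single feature in plain Python. This is a simplified training-time version under stated assumptions: the function name and sample values are made up, gamma and beta are the learnable scale and shift of the standard batch normalization formulation, and eps is a small constant that avoids division by zero. The running statistics used at inference time are left out.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift it."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# Hypothetical pre-activation outputs of one neuron for a mini-batch of 4 samples.
pre_activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(pre_activations)
print(normalized)  # zero mean and (almost exactly) unit variance
```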

Advantages of Batch Normalization

1. Reduces internal covariate shift: In a neural network, the input distribution of each hidden unit changes every time the parameters of the previous layer are updated. This is called internal covariate shift. It makes training slow and requires a very small learning rate and careful parameter initialization. Normalizing the layer’s inputs over a mini-batch reduces this problem.

2. Reduces Vanishing and Exploding Gradient issues: Unstable gradients, such as vanishing and exploding gradients, are common issues while training a neural network. Normalizing the outputs of each layer keeps the activations in a stable range, which significantly reduces this issue.

3. Training becomes faster: In traditional neural networks, a higher learning rate may cause exploding gradients, or the network may not converge at all and keep oscillating around the minimum. Due to this, we usually prefer a lower learning rate in traditional neural networks. But with a lower learning rate, as the networks get deeper, gradients get smaller during back propagation and more iterations are required, which increases the training period. If we normalize the output of each layer, we can usually use a higher learning rate safely, which can drastically reduce the training period.

4. Reduces the dying ReLU problem: ReLUs often die out while training deep neural networks, and many neurons stop contributing. With batch normalization, we can regulate the output of each hidden layer, which helps prevent this issue in deep neural networks. For more detail on dying ReLU, you can refer to my article on this topic.

5. Introduces regularization: Batch normalization also provides some regularization to the neural network, which improves the generalization capabilities of the network.

Thursday, 20 June 2019

Transfer Learning and Fine Tuning a model in Deep Learning

The terms transfer learning and fine-tuning are similar in many ways and are widely used almost interchangeably.

Fine-tuning: Suppose you already have an efficient deep learning model which performs task A. Now you have to perform a task B which is quite similar to task A. You don't need to create a separate model from scratch for task B. Just fine-tune the existing model which is efficiently performing task A. 

Example: You have a well-trained model which identifies all types of cars. The car model has already learned a lot of features like edges, shapes, textures, headlights, door handles, tyres, windshields etc. Now you have to create a model which can identify trucks. We know that many features of cars and trucks are similar. So, why create a new model for trucks from scratch? Let's just tweak the existing car model to create a new model for trucks.

Transfer Learning: We can transfer the learning from the existing model on cars to the new model on trucks. So, transfer learning happens while fine-tuning an existing model.

Advantages of Transfer Learning and Fine Tuning:

Creating a new model is a tough and time-consuming task. We need to decide a lot of things while creating a model, such as:

1. Different types of layers to use (fully connected, convolutional, capsule, LSTM etc.)
2. How many layers to use?
3. How many nodes in a layer?
4. Which activation function to use in which layer?
5. Which regularization techniques to use?
6. Which optimizer to use?
7. Tuning various hyperparameters like initializing weights, learning rate, momentum, batch size, number of epochs etc.

So, if we can fine-tune an existing model, we can skip most of the above decisions and save time and effort.

How to fine-tune a model? 

We need to make some reasonable changes and tweaks to our existing model to create a new model. Below are some basic steps to fine-tune an existing model:

1. Remove output layer: First of all remove output layer which was identifying cars. Add a new output layer which will now identify trucks.

2. Add and remove hidden layers: Trucks have some features different from cars. So accordingly, add some hidden layers which will learn new features of trucks. Remove those hidden layers which are not required in case of trucks.

3. Freeze the unchanged layers: Freeze the layers which are kept (not changed) so that no weight update happens on them when we train this model again on the new data with trucks. Weights should only be updated in the new layers (the new hidden layers and the new output layer).
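The freezing step can be sketched with a toy model in plain Python. Everything here is a hypothetical stand-in: base_w plays the role of a pre-trained layer that is kept frozen, head_w and head_b play the role of the new output layer, and the data is synthetic. Only the new layer receives gradient updates.

```python
def relu(x):
    return max(0.0, x)

base_w = 0.5                # "pre-trained" layer weight: frozen, never updated
head_w, head_b = 0.0, 0.0   # new output layer: trainable

# Synthetic data for the new task, generated from the frozen features.
data = [(x, 3.0 * relu(base_w * x) + 1.0) for x in [1.0, 2.0, 3.0, 4.0]]

learning_rate = 0.05
for _ in range(2000):
    grad_w = grad_b = 0.0
    for x, y in data:
        feature = relu(base_w * x)              # forward pass through the frozen layer
        error = head_w * feature + head_b - y   # prediction error of the new layer
        grad_w += 2 * error * feature / len(data)
        grad_b += 2 * error / len(data)
    # Only the new layer is updated; base_w is deliberately left untouched.
    head_w -= learning_rate * grad_w
    head_b -= learning_rate * grad_b

print(base_w, head_w, head_b)
```

In Keras, the same effect is obtained by setting `layer.trainable = False` on the layers that should stay frozen before compiling the model.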

Tuesday, 18 June 2019

Capsule Neural Networks: An enhancement of Convolutional Neural Networks (ConvNets vs CapsNets)

Capsule Neural Networks can be seen as an enhancement of Convolutional Neural Networks. In order to understand capsule neural networks, let's first recap convolutional neural networks (CNN). In a CNN, the initial layers detect simple features like edges, curves, color gradients etc. Deeper convolutional layers combine the simple features into comparatively complex features, and so on. But in doing so, a CNN does not keep track of the orientation of the features or components, or of the relative spatial relationships between them. So, a CNN can sometimes be easily tricked.

For example, in face recognition, CNN does not take care of placements of eyes, nose, mouth, lips etc. Even if lips are near to eyes or eyes are below the mouth, it will still consider it a face. If all the features or components of face are available, it will consider it as a face without taking care of the orientation and placement of those components. Capsule networks take care of this.

I have written a separate post on CNN. Please go through it for detailed information on CNN.

Pooling layer problem in CNN: The pooling layer down-samples the data, due to which a lot of information is lost. These layers reduce the spatial resolution, so their outputs are invariant to small changes in the inputs. This is a problem when detailed information must be preserved throughout the network. With CapsNets, detailed pose information (such as precise object position, rotation, thickness, skew, size, and so on) is preserved throughout the network. Small changes to the inputs result in small changes to the outputs, so information is preserved. This is called "equivariance."

Capsule: The concept of a capsule was put forward by Hinton, inspired by the hypothesis that the human brain is organized into modules, which can be thought of as capsules. A capsule can be considered as a group of neurons. We can add as many neurons to a capsule as needed to capture different dimensions of an image like scale and thickness, stroke thickness, width, skew, translation etc. It can maintain information such as pose, hue, albedo, texture, deformation, speed, and location of the object.

Dynamic Routing Algorithm: The human brain is thought to have a mechanism to route information among capsules. Following a similar mechanism, the dynamic routing algorithm was suggested by Hinton and his co-authors. This algorithm allows capsules to communicate with each other. For more details, please visit this article:

Dynamic Routing Between Capsules

Squashing Function: Instead of ReLU, Hinton and his co-authors suggested a new non-linear squashing function. It normalizes the magnitude (length) of a capsule's output vector so that it falls between 0 and 1, while keeping its direction unchanged. The lengths of these squashed vectors are used to decide how to route data through the various capsules, which are trained to learn different concepts.
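The squashing function from the "Dynamic Routing Between Capsules" paper scales a vector v by ||v||^2 / (1 + ||v||^2) along its own direction. Below is a minimal sketch in plain Python. The helper names and the small eps term (added to avoid division by zero) are assumptions for illustration.

```python
import math

def length(vector):
    return math.sqrt(sum(v * v for v in vector))

def squash(vector, eps=1e-9):
    """Shrink short vectors towards length 0 and long vectors towards length 1."""
    sq_norm = sum(v * v for v in vector)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * v for v in vector]

short = squash([0.1, 0.0])     # a weak capsule output
long = squash([30.0, 40.0])    # a strong capsule output

print(length(short))  # close to 0
print(length(long))   # close to, but below, 1
```

The direction of each vector is unchanged, so the pose information it encodes is kept while its length can be read as a probability-like score.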

Limitations of Capsule Neural Networks

1. As compared to the CNN, the training time for the capsule network is slower because of its computational complexity.

2. It has been tested on the MNIST dataset, but how it will behave on more complex datasets is still not well understood.

3. This concept is still under research. So, it has a lot of scope for improvement.

I would suggest going through this PDF for more details on Capsule Neural Networks.

Saturday, 15 June 2019

Difference between Sigmoid and Softmax function in deep learning

Softmax function can be understood as a generalized version of a sigmoid function or an extension of a sigmoid function. Softmax function is usually used in the output layers of neural networks. 

Following are some of the differences between Sigmoid and Softmax function:

1. The sigmoid function is used for the two-class (binary) classification problem, whereas the softmax function is used for the multi-class classification problem.

2. The outputs of all softmax units sum to 1. With sigmoid, this is not necessary; each sigmoid output simply lies between 0 and 1. Softmax enforces that the probabilities of all the output classes sum to one, so in order to increase the probability of a particular class, softmax must correspondingly decrease the probability of at least one of the other classes.

When you use softmax, you get a probability for each class (a joint distribution and a multinomial likelihood), and these probabilities are bound to sum to one. If you use sigmoid for multi-class classification, each output is treated independently, like a marginal distribution with a Bernoulli likelihood.

3. Formula for Sigmoid and Softmax

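Sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Softmax function (for an input vector x with K components):

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}, \quad i = 1, \dots, K$$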

Let me illustrate point 2 with an example. Let's say we have 6 inputs:

[1,2,3,4,5,6]

If we pass these inputs through the sigmoid function, we get the following output (rounded to two decimals):

[0.73, 0.88, 0.95, 0.98, 0.99, 1.00]

The sum of the above output units is approximately 5.54, which is greater than 1.

But in the case of softmax, the sum of the output units is always 1. Let's see how. Pass the same input to the softmax function, and we get the following output (rounded to three decimals):

[0.004, 0.012, 0.032, 0.086, 0.233, 0.634], which sums to 1 (up to rounding).
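The comparison can be reproduced with a short sketch in plain Python. The helper functions are minimal versions written for this example. Subtracting the maximum inside softmax is a common numerical-stability trick and does not change the result.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(values):
    shifted = [v - max(values) for v in values]  # avoids overflow for large inputs
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

inputs = [1, 2, 3, 4, 5, 6]
sigmoid_out = [sigmoid(x) for x in inputs]
softmax_out = softmax(inputs)

# Sigmoid treats each input independently, so the outputs need not sum to 1.
print([round(s, 2) for s in sigmoid_out], round(sum(sigmoid_out), 2))
# Softmax normalizes across all inputs, so the outputs always sum to 1.
print([round(s, 3) for s in softmax_out], round(sum(softmax_out), 2))
```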

4. Sigmoid was traditionally used as an activation function in hidden layers (ReLU is preferred nowadays), while Softmax is used in output layers.

A general rule of thumb is to use ReLU as the activation function in hidden layers and softmax in the output layer of a neural network used for multi-class classification. For more information on activation functions, please visit my post on this topic.

Friday, 14 June 2019

Regularization Techniques used in Neural Networks in Deep Learning

Ideally, a neural network should neither underfit nor overfit, and should maintain good generalization capabilities. For this purpose, we use various regularization techniques in our neural networks. Below is a list of some of the regularization techniques which are commonly used to improve the performance and accuracy of neural networks in deep learning.

1. L1 and L2 Regularization

L1 and L2 are the most common types of regularization techniques used in machine learning as well as in deep learning algorithms. These update the general cost function by adding another term known as the regularization penalty. 

For more details, please go through my article on this topic.

2. Dropout

Dropout can be seen as temporarily deactivating or ignoring neurons in the hidden layers of a network. Probabilistically dropping out nodes in the network is a simple and effective regularization method. We can switch off some neurons in a layer so that they do not contribute any information or learn any information and the responsibility falls on other active neurons to learn harder and reduce the error.

For more details on dropout, please consider visiting my post on this topic.

3. Data Augmentation

Creating new data by making reasonable modifications to the existing data is called data augmentation. Let's take the example of the MNIST dataset (handwritten digits). We can easily generate thousands of new similar images by slightly rotating, scaling, shifting, zooming in and out of, or cropping the existing images. For other image datasets, flipping and varying the colors are also common, as long as the transformation does not change the label of the image.

We can use data augmentation when our model is overfitting due to too little data.

In many deep learning problems, such as the MNIST case discussed above, increasing the amount of data in this way is not difficult. In classical machine learning problems, this task is often not that easy, as we need new labeled data, which is not easily available.

4. Early Stopping

While training a neural network, there will be a point during training when the model will stop generalizing and start learning the noise in the training dataset. This leads to overfitting.

One approach to solve this problem is to treat the number of training epochs as a hyperparameter and train the model multiple times with different values, then select the number of epochs that result in the best performance. 

The downside of this approach is that it requires multiple models to be trained and discarded. This can be computationally inefficient and time-consuming.

Another approach is early stopping. The model is evaluated on a validation dataset after each epoch. If the performance of the model on the validation dataset starts to degrade (e.g. loss begins to increase or accuracy begins to decrease), the training process is stopped. The model from the point where validation performance was best is then used, as it is expected to generalize well.
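The early stopping rule can be sketched as a small loop in plain Python. The function name, the patience parameter (how many epochs without improvement are tolerated before stopping) and the validation losses are assumptions for illustration. In a real run, each loss would come from evaluating the model after an epoch.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember this epoch and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, one per epoch (0-indexed).
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stopped_at, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stopped_at, best_epoch)
```

Keeping the weights from best_epoch, rather than from the epoch at which training stopped, is a common design choice, since the last epochs before stopping are already slightly worse on the validation set.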

Thursday, 13 June 2019

Hyperparameter Tuning in Neural Networks in Deep Learning

In order to minimize the loss and find good values for the weights and biases, we need to tune the hyperparameters of our neural network. Hyperparameters are the parameters that the neural network can’t learn by itself via gradient descent or one of its variants.

Hyperparameters are the opposite of learnable parameters. Learnable parameters are automatically learned and optimized by the neural network. For example, weights and biases are learnable. These are also called trainable parameters, as they are optimized during the training process using gradient descent.

It is our responsibility to provide good values for these hyperparameters, based on our experience, domain knowledge and cross-validation. We need to tweak these hyperparameters to get better accuracy from the neural network.

Following is the list of hyperparameters used in neural networks:

1. Number of hidden layers: Keep adding hidden layers until the validation loss stops improving. A general rule is to use a large number of hidden layers together with a proper regularization technique.

2. Number of units or neurons in a layer: A larger number of units in a layer may cause overfitting. A smaller number of units may cause underfitting. So, try to maintain a balance and use the dropout technique.

3. Dropout: Dropout is a regularization technique to avoid overfitting, thus increasing the generalization capabilities of the neural network. In this technique, we deliberately drop some units in a hidden layer at random during training. The dropout rate typically ranges between 20% and 50% of the neurons in a layer.

For more information on dropout, please consider going through my article on dropout.

4. Activation Function: Activation functions introduce non-linearity in a neural network. Sigmoid, Step, Tanh, ReLU and Softmax are common activation functions. We mainly use the ReLU activation function for hidden layers and softmax for the output layer.

For more details on activation functions, please consider going through my article on activation functions.

5. Learning Rate: Learning rate determines how quickly the weights and biases are updated in a neural network. If the learning rate is very small, the learning process slows down significantly and the model converges too slowly. It may also end up in a local minimum and never reach the global minimum. A larger learning rate speeds up learning, but the model may not converge.

The learning rate is normally set somewhere between 0.0001 and 0.01. Usually a decaying learning rate is preferred.

For more details on local and global minima, please refer to my article on this topic.

6. Momentum: Momentum helps accelerate SGD in the relevant direction. It uses knowledge of the previous steps to decide the direction of the next step, which helps dampen oscillations. A typical choice of momentum is between 0.5 and 0.9.

For more details on learning rate and momentum, please consider going through my article on momentum and adaptive learning.

7. Number of epochs: Number of epochs is the number of times the whole training data is shown to the network while training. In Keras, the default number of epochs is 1.

8. Batch size: Batch size is the number of samples passed to the network at one time, after which a parameter update happens. Such a group of samples is also called a mini-batch. The batch size is commonly chosen as a power of 2. In Keras, the default batch size is 32.

9. Weight Initialization: Biases are typically initialized to 0 (or close to 0), but weights must be initialized carefully. Their initialization can have a big impact on the local minimum found by the training algorithm. 

If weight is too large: During back-propagation, it will lead to the exploding gradient problem. This means that the gradients of the cost with respect to the parameters are too big, which makes the cost oscillate around its minimum value.

If weight is too small: During back-propagation, it will lead to the vanishing gradient problem. The gradients of the cost with respect to the parameters are too small, so the cost stops improving before it has reached the minimum value.

So, initializing weights with inappropriate values will lead to divergence or a slow-down in the training of the neural network.

To prevent this vanishing and exploding problem, we usually assign random numbers to the weights in such a way that they are normally distributed with zero mean and a standard deviation scaled to the size of the layer (for example, using Xavier initialization).

For more details on weight initialization, please visit my post on this topic.

10. Loss Function: The loss function compares the network's output for a training example against the intended output. A common general-purpose loss function is the squared error (mean squared error) loss function. When the output of the neural network is being treated as a probability distribution (e.g. a softmax output layer is being used), we generally use cross-entropy as the loss function.
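The effect of the learning rate and momentum (items 5 and 6 in the list above) can be seen on a toy problem. The sketch below minimizes f(w) = w^2 with plain gradient descent and a classical momentum term. The function, starting point and step counts are assumptions chosen only to make the behavior visible.

```python
def gradient_descent(lr, momentum=0.0, steps=50, w=5.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final w."""
    velocity = 0.0
    for _ in range(steps):
        grad = 2.0 * w
        # Classical momentum: keep a fraction of the previous step.
        velocity = momentum * velocity - lr * grad
        w += velocity
    return w

slow = gradient_descent(lr=0.001)                         # too small: barely moves
good = gradient_descent(lr=0.1)                           # reasonable: close to 0
diverged = gradient_descent(lr=1.1)                       # too large: blows up
with_momentum = gradient_descent(lr=0.001, momentum=0.9)  # same small lr, faster progress

print(slow, good, diverged, with_momentum)
```

The minimum is at w = 0, so the closer the final value is to zero, the better the setting worked on this toy problem.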

Hyperparameter Tuning: Following are some ways to tune hyperparameters in a neural network:

1. Coordinate Descent: It keeps all hyperparameters fixed except for one, and adjusts that hyperparameter to minimize the validation error.

2. Grid Search: Grid search tries each and every hyperparameter setting over a specified range of values. This involves a cross-product of all intervals, so the computational expense is exponential in the number of parameters. Good part is that it can be easily parallelized.

3. Random Search: This is opposite of grid search. Instead of taking cross-product of all the intervals, it samples the hyperparameter space randomly. It performs better than grid search because grid search can take an exponentially long time to reach a good hyperparameter subspace. This can also be parallelized.

4. Cross-validation: We can also try cross-validation by trying different portions of dataset during training and testing. 
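
To make the difference between grid search and random search concrete, here is a small sketch. Note that `validation_error` is a made-up stand-in for "train the model and measure the validation error", and the candidate values are assumptions chosen only for the example.

```python
import itertools
import random

def validation_error(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and returning validation error."""
    return (learning_rate - 0.01) ** 2 * 1e4 + abs(batch_size - 64) / 64

def grid_search(learning_rates, batch_sizes):
    """Try every combination (cross-product) and keep the best one."""
    candidates = itertools.product(learning_rates, batch_sizes)
    return min(candidates, key=lambda c: validation_error(*c))

def random_search(n_trials, rng):
    """Sample settings at random and keep the best one seen."""
    best = None
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-4, -1)              # learning rate on a log scale
        bs = rng.choice([16, 32, 64, 128, 256])
        candidate = (lr, bs)
        if best is None or validation_error(*candidate) < validation_error(*best):
            best = candidate
    return best

grid_best = grid_search([0.001, 0.01, 0.1], [32, 64, 128])   # 3 x 3 = 9 trials
random_best = random_search(n_trials=20, rng=random.Random(0))
```

Grid search can only ever return a value that was put on the grid, while random search can land anywhere in the range. In both cases the trials are independent of each other, which is why they are easy to run in parallel.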

Wednesday, 12 June 2019

Global and Local Minima in Gradient Descent in Deep Learning

The task of a Gradient Descent optimizer is to find the optimal values for the weights. But sometimes it ends up with weights that are suboptimal, which reduces the accuracy of the model.

To understand it better, consider the following diagram.

[Figure: loss curve with one global minimum and several local minima]

The lowest point in the above diagram is referred to as the global minimum, while the other low points are referred to as local minima. Ideally, SGD should reach the global minimum, but sometimes it gets stuck in a local minimum, and it is very hard to know whether SGD has reached the global minimum or is stuck in a local one.
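
The same behaviour can be reproduced with a toy example. The sketch below runs plain gradient descent on a one-dimensional function that I picked only for illustration, f(x) = x^4 - 3x^2 + x, which has a global minimum near x = -1.30 and a local minimum near x = 1.13. The learning rate and step count are also assumptions.

```python
def loss(x):
    return x ** 4 - 3 * x ** 2 + x

def grad(x):
    return 4 * x ** 3 - 6 * x + 1      # derivative of the loss above

def gradient_descent(x, lr=0.01, steps=500):
    """Plain gradient descent: repeatedly step against the gradient."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_left = gradient_descent(-2.0)    # starts on the left, reaches the global minimum
x_right = gradient_descent(2.0)    # starts on the right, gets stuck in the local minimum
```

Both runs stop at a point where the gradient is zero, so from the optimizer's point of view both look like a success. Only comparing `loss(x_left)` and `loss(x_right)` reveals that the second run ended in the worse valley.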

How to avoid local minima?

Local minima are a major issue with gradient descent. Hyperparameter tuning plays a vital role in avoiding them. There is no universal solution to this problem, but there are some methods we can use to avoid local minima.

1. Increase the learning rate: If the learning rate of the algorithm is too small, SGD is more likely to get stuck in a local minimum.

2. Add some noise while updating weights: Adding random noise to the weights sometimes helps in escaping a local minimum.

3. Assign random weights: Repeated training with different random starting weights is among the popular methods to avoid this problem, but it requires extensive computational time.

4. Use a large number of hidden layers: Each hidden node in a layer starts out in a different random state, which allows each node to converge to a different pattern in the network. Increasing the size of the network in this way lets a single neural network potentially explore thousands (or far more) of different local minima.

5. MOST EFFECTIVE ONE: Use momentum and adaptive learning rate based SGD: Instead of using conventional gradient descent optimizers, try optimizers like Adagrad, AdaDelta, RMSprop and Adam. Adam uses momentum and an adaptive learning rate, which helps it move past poor local minima. You can find more detail about momentum and adaptive learning rate based algorithms in my article on that topic.
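
As a small illustration of method 3 (repeated training from random starting points), here is a sketch on a toy one-dimensional loss, f(x) = x^4 - 3x^2 + x, which has a local minimum near x = 1.13 and a global minimum near x = -1.30. The function, the range of starting points and the number of restarts are all assumptions made for the example.

```python
import random

def loss(x):
    return x ** 4 - 3 * x ** 2 + x

def grad(x):
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=500):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def random_restarts(n_restarts, rng):
    """Run gradient descent from several random starts and keep the best result."""
    finals = [gradient_descent(rng.uniform(-2.0, 2.0)) for _ in range(n_restarts)]
    return min(finals, key=loss)

best_x = random_restarts(n_restarts=10, rng=random.Random(0))
```

Each restart costs a full training run, which is exactly the "extensive computational time" mentioned above. On a real network this is far more expensive than on this toy function.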

Sometimes local minima are as good as the global minimum

It is not always necessary to reach the true global minimum. It is generally agreed that most local minima have values close to the global minimum.

Many papers show that reaching the global minimum is sometimes not easy. In such cases, if we manage to find a local minimum that is as good as the global minimum, we should use it.

Momentum and Adaptive Learning based Gradient Descent Optimizers: Adagrad and Adam

In my previous article on Gradient Descent Optimizers, we discussed three types of Gradient Descent algorithms:

1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini Batch Gradient Descent

In this article, we will see some advanced versions of Gradient Descent which can be categorized as:

1. Momentum based (Nesterov Momentum)
2. Based on adaptive learning rate (Adagrad, Adadelta, RMSprop)
3. Combination of momentum and adaptive learning rate (Adam)

Let's first understand momentum.

Momentum

Momentum helps accelerate SGD in the relevant direction by carrying over a fraction of the previous update into the current one. So, it's a good idea to use momentum for every parameter. It has the following advantages:

1. Helps avoid local minima: As momentum builds up speed and hence increases the step size, the optimizer is less likely to get trapped in a local minimum.

2. Faster convergence: Momentum makes convergence faster, as the step size increases due to the gained speed.
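
Here is a minimal sketch of the classical momentum update next to plain gradient descent, on the toy loss f(x) = x^2. The momentum coefficient `mu=0.9`, the learning rate and the step count are assumed values for illustration.

```python
def grad(x):
    return 2.0 * x          # gradient of the toy loss f(x) = x^2

def sgd(x, lr=0.01, steps=100):
    """Plain gradient descent, no momentum."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def sgd_momentum(x, lr=0.01, mu=0.9, steps=100):
    """Gradient descent with classical momentum."""
    velocity = 0.0
    for _ in range(steps):
        # Keep a fraction mu of the old velocity and add the new gradient step.
        velocity = mu * velocity - lr * grad(x)
        x += velocity
    return x

x_plain = sgd(5.0)
x_momentum = sgd_momentum(5.0)
```

With the same learning rate and the same number of steps, the momentum version ends up much closer to the minimum at x = 0, because consecutive steps in the same direction add up.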

Now, let's see some flavors of SGD.

1. Nesterov Momentum

It first uses the current momentum to approximate the next position, and then calculates the gradient with respect to that approximated position instead of the current position. This look-ahead prevents us from going too fast and results in increased responsiveness, which can significantly improve the performance of SGD.
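
The look-ahead step can be sketched as follows, again on the toy loss f(x) = x^2. This is one common way of writing the Nesterov update; the coefficient `mu=0.9` and the learning rate are assumptions for the example.

```python
def grad(x):
    return 2.0 * x          # gradient of the toy loss f(x) = x^2

def nesterov(x, lr=0.01, mu=0.9, steps=100):
    """Gradient descent with Nesterov momentum."""
    velocity = 0.0
    for _ in range(steps):
        lookahead = x + mu * velocity            # where momentum alone would take us
        velocity = mu * velocity - lr * grad(lookahead)
        x += velocity
    return x

x_final = nesterov(5.0)
```

The only difference from classical momentum is the point at which the gradient is evaluated: at `lookahead` instead of at `x`.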

2. Adagrad

It mainly focuses on an adaptive learning rate instead of momentum.

In standard SGD, the learning rate is constant: every parameter uses the same learning rate at every step, irrespective of how its past gradients behaved. This seems impractical in real life.

What if we knew that we should slow down or speed up? What if we knew that we should accelerate more in one direction and decelerate in another? This is not possible with standard SGD.

Adagrad keeps updating the learning rate instead of using a constant one. It accumulates the sum of the squares of all past gradients and uses it to normalize the learning rate, so that the effective learning rate of each parameter depends on how its past gradients behaved.

It adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data.
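
A minimal sketch of the Adagrad update for two parameters is shown below. One parameter receives a gradient at every step (a frequent feature) and the other only once (an infrequent feature). The gradient values, the base learning rate of 0.1 and the small `eps` term are assumptions made for the illustration.

```python
import math

def adagrad_step(params, grads, cache, lr=0.1, eps=1e-8):
    """One Adagrad update; cache holds the running sum of squared gradients."""
    for i, g in enumerate(grads):
        cache[i] += g * g
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)

params = [1.0, 1.0]
cache = [0.0, 0.0]
for step in range(10):
    frequent_grad = 1.0                        # this feature shows up every step
    rare_grad = 1.0 if step == 9 else 0.0      # this one only in the last step
    adagrad_step(params, [frequent_grad, rare_grad], cache)

# Effective learning rate each parameter would get on its next update.
effective_lr = [0.1 / (math.sqrt(c) + 1e-8) for c in cache]
```

After ten steps the frequently updated parameter has a much smaller effective learning rate than the rarely updated one, which is the behaviour described above for sparse data.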

2A. AdaDelta and RMSprop

AdaDelta and RMSprop are extensions of Adagrad.

As discussed in the Adagrad section, Adagrad accumulates the sum of the squares of all past gradients and uses it to normalize the learning rate. This causes an issue: the learning rate in Adagrad keeps decreasing, and at some point learning almost stops.

To handle this issue, AdaDelta and RMSprop decay the accumulated past gradients, so that only a portion of the past gradients is considered. Instead of the sum of all past squared gradients, they use a moving average.
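
The difference between the two accumulators can be seen in a short sketch. It shows RMSprop only (AdaDelta additionally keeps a moving average of the updates themselves, which is left out here). The decay factor of 0.9, the learning rate and the constant toy gradient are assumptions for the example.

```python
import math

def rmsprop_step(x, g, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update; avg_sq is a moving average of squared gradients."""
    avg_sq = decay * avg_sq + (1 - decay) * g * g
    x -= lr * g / (math.sqrt(avg_sq) + eps)
    return x, avg_sq

adagrad_sum = 0.0    # Adagrad: keeps growing with every gradient
rms_avg = 0.0        # RMSprop: old gradients fade away
x = 1.0
for _ in range(1000):
    g = 1.0                              # constant toy gradient
    adagrad_sum += g * g
    x, rms_avg = rmsprop_step(x, g, rms_avg)

adagrad_lr = 0.01 / math.sqrt(adagrad_sum)   # effective learning rate under Adagrad
rmsprop_lr = 0.01 / math.sqrt(rms_avg)       # effective learning rate under RMSprop
```

With the same stream of gradients, Adagrad's sum keeps growing and its effective learning rate keeps shrinking, while RMSprop's moving average settles at a stable value, so learning does not stall.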

3. Adam

Adam is one of the most widely used Gradient Descent optimizers. It combines the strengths of momentum and adaptive learning rates. In other words, Adam is RMSprop or AdaDelta with momentum: it uses momentum and also normalizes the learning rate using the moving average of the squared gradients.
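
Putting both ideas together, a minimal sketch of the Adam update on the toy loss f(x) = x^2 could look like this. The values `beta1=0.9`, `beta2=0.999` and `eps=1e-8` are the commonly quoted defaults. The bias-correction step comes from the standard formulation of Adam and is not discussed above; the learning rate and step count are assumptions for the example.

```python
import math

def grad(x):
    return 2.0 * x          # gradient of the toy loss f(x) = x^2

def adam(x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = 0.0     # moving average of gradients (the momentum part)
    v = 0.0     # moving average of squared gradients (the adaptive part)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)     # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

x_final = adam(5.0)
```

The `m` term plays the role of momentum, and dividing by the square root of `v_hat` is the same normalization used by RMSprop.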

Conclusion: Most of the above Gradient Descent methods are already implemented in popular Deep Learning frameworks like TensorFlow, Keras, Caffe etc. Adam is often recommended as the default choice, as it uses both momentum and adaptive learning rates.

For more details on the above algorithms, I strongly recommend this and this article.