Saturday, 15 June 2019

Difference between Sigmoid and Softmax function in deep learning

The softmax function can be understood as a generalized version of the sigmoid function: it extends the sigmoid to more than two classes. The sigmoid function is usually used in the output layer of a neural network for binary classification.

Following are some of the differences between the sigmoid and softmax functions:

1. The sigmoid function is used for the two-class (binary) classification problem, whereas the softmax function is used for the multi-class classification problem.

2. The sum of all softmax outputs is supposed to be 1, whereas for sigmoid this is not necessary. Sigmoid just squashes each output to a value between 0 and 1. Softmax enforces that the sum of the probabilities of all the output classes equals one, so in order to increase the probability of a particular class, softmax must correspondingly decrease the probability of at least one of the other classes.

When you use softmax, you basically get a probability for each class (a joint distribution and a multinomial likelihood) whose sum is bound to be one. If you instead use sigmoid for multi-class classification, each output behaves like a marginal distribution with a Bernoulli likelihood.

3. Formula for Sigmoid and Softmax

Sigmoid function:

sigmoid(z) = 1 / (1 + e^(-z))

Softmax function (for the i-th of K classes):

softmax(z_i) = e^(z_i) / (e^(z_1) + e^(z_2) + ... + e^(z_K))

Let me illustrate point 2 with an example. Let's say we have 6 inputs:

[0, 1, 2, 3, 4, 5]

If we pass these inputs through the sigmoid function, we will get the following output:

[0.5, 0.73, 0.88, 0.95, 0.98, 0.99]

The sum of the above outputs is 5.03, which is greater than 1.

But in the case of softmax, the sum of the output units is always 1. Let's see how. Pass the same inputs to the softmax function, and we get the following output:

[0.004, 0.012, 0.032, 0.086, 0.233, 0.634], which sums to 1 (up to rounding).
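
To make this concrete, below is a minimal NumPy sketch that reproduces the numbers above.

import numpy as np

def sigmoid(z):
    # Squashes each input independently into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Exponentiates and normalizes, so the outputs sum to 1
    exp_z = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return exp_z / exp_z.sum()

z = np.array([0, 1, 2, 3, 4, 5], dtype=float)
print(sigmoid(z).round(2))        # [0.5  0.73 0.88 0.95 0.98 0.99]
print(sigmoid(z).round(2).sum())  # ~5.03 -- well above 1
print(softmax(z).round(3))        # [0.004 0.012 0.032 0.086 0.233 0.634]
print(softmax(z).sum())           # ~1.0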

4. Sigmoid was traditionally used as an activation function in hidden layers (but we mostly use ReLU nowadays), while softmax is used in output layers.

A general rule of thumb is to use ReLU as the activation function in hidden layers and softmax in the output layer of a neural network. For more information on activation functions, please visit this post of mine.

Friday, 14 June 2019

Regularization Techniques used in Neural Networks in Deep Learning

Ideally, a neural network should neither underfit nor overfit, and should maintain good generalization capabilities. For this purpose, we use various regularization techniques in our neural networks. Below is a list of some of the regularization techniques commonly used to improve the performance and accuracy of neural networks in deep learning.

1. L1 and L2 Regularization

L1 and L2 are the most common types of regularization techniques used in machine learning as well as in deep learning algorithms. They update the general cost function by adding another term known as the regularization penalty: L1 adds the sum of the absolute values of the weights, while L2 adds the sum of their squares.

For more details, please go through this article of mine.
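
As a minimal Keras sketch (assuming a network with 784 input features, e.g. flattened MNIST images), the penalty is attached per layer:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    # L2 penalty: adds 0.01 * sum(W^2) of this layer's weights to the loss
    layers.Dense(64, activation='relu', input_shape=(784,),
                 kernel_regularizer=regularizers.l2(0.01)),
    # L1 penalty: adds 0.01 * sum(|W|), which encourages sparse weights
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l1(0.01)),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')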

2. Dropout

Dropout can be seen as temporarily deactivating or ignoring neurons in the hidden layers of a network. Probabilistically dropping out nodes in the network is a simple and effective regularization method. We can switch off some neurons in a layer so that they do not contribute or learn any information, and the responsibility falls on the other active neurons to learn harder and reduce the error.

For more details on dropout, please consider visiting this post of mine.

3. Data Augmentation

If we come to know that our model is performing poorly due to overfitting, we can increase the training data to handle this situation. In many cases in deep learning, increasing the amount of data is not a difficult task. Take the MNIST dataset (handwritten digits) as an example: we can easily generate thousands of similar images by slightly rotating, scaling and shifting the existing images. In classical machine learning, this task is not that easy, as we need labelled data, which is not easily available. This technique of increasing the training data to reduce overfitting is called data augmentation.
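
A minimal Keras sketch, assuming x_train and y_train hold MNIST images of shape (num_samples, 28, 28, 1) and their labels:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=10,       # rotate each image by up to 10 degrees
    width_shift_range=0.1,   # shift horizontally by up to 10% of the width
    height_shift_range=0.1,  # shift vertically by up to 10% of the height
    zoom_range=0.1)          # zoom in or out by up to 10%

# Each epoch now sees freshly transformed copies of the training images:
# model.fit_generator(datagen.flow(x_train, y_train, batch_size=128), ...)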

4. Early Stopping

While training a neural network, there comes a point when the model stops generalizing and starts learning the noise in the training dataset. This leads to overfitting.

One approach to solve this problem is to treat the number of training epochs as a hyperparameter and train the model multiple times with different values, then select the number of epochs that results in the best performance.

The downside of this approach is that it requires multiple models to be trained and discarded. This can be computationally inefficient and time-consuming.

Another approach is early stopping. The model is evaluated on a validation dataset after each epoch. If the performance of the model on the validation dataset starts to degrade (e.g. the loss begins to increase or the accuracy begins to decrease), the training process is stopped. The model as it was when training stopped is then used, and is known to have good generalization performance.
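
A minimal Keras sketch of early stopping, assuming model is a compiled model and x_train, y_train are the training data:

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',         # watch the validation loss after each epoch
    patience=5,                 # tolerate 5 epochs without improvement
    restore_best_weights=True)  # roll back to the best weights seen

model.fit(x_train, y_train, validation_split=0.2,
          epochs=100, callbacks=[early_stop])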

Thursday, 13 June 2019

Hyperparameter Tuning in Neural Networks in Deep Learning

In order to minimize the loss and determine optimal values of the weights and biases, we need to tune our neural network hyperparameters. Hyperparameters are the parameters that the neural network can't learn itself via gradient descent or some other variant.

Hyperparameters are the opposite of learnable parameters. Learnable parameters are automatically learned and then optimized by the neural network. For example, weights and biases are learnable by the neural network. These are also called trainable parameters, as they are optimized during the training process using gradient descent.

It is our responsibility to provide optimal values for these hyperparameters based on our experience, domain knowledge and cross-validation. We need to manually tweak these hyperparameters to get better accuracy from the neural network.

Following is the list of hyperparameters used in neural networks (a short Keras sketch after the list shows where several of them appear):

1. Number of hidden layers: Keep adding hidden layers until the validation loss stops improving. A general rule is that we can use a large number of hidden layers as long as we apply a proper regularization technique.

2. Number of units or neurons in a layer: A larger number of units in a layer may cause overfitting, while a smaller number may cause underfitting. So, try to maintain a balance and use the dropout technique.

3. Dropout: Dropout is a regularization technique used to avoid overfitting and thus increase the generalization capabilities of the neural network. In this technique, we deliberately drop some units in a hidden layer to introduce generalization capabilities into it. The dropout rate should typically range between 20% and 50% of the neurons in a layer.

For more information on dropout, please consider going through this article of mine on dropout.

4. Activation Function: Activation functions introduce non-linearity into a neural network. Sigmoid, step, tanh, ReLU and softmax are common activation functions. Mainly, we use the ReLU activation function for hidden layers and softmax for the output layer.

For more details on activation functions, please consider going through this article of mine on activation functions.

5. Learning Rate: The learning rate determines how quickly the weights and biases are updated in a neural network. If the learning rate is very small, the learning process slows down significantly and the model converges too slowly. It may also end up stuck in a local minimum and never reach the global minimum. A larger learning rate speeds up learning but may not converge. Usually a decaying learning rate is preferred.

For more details on local and global minima, please refer to this article of mine.

6. Momentum: Momentum helps accelerate SGD in the relevant direction. It uses knowledge of the previous steps to inform the direction of the next step, and helps prevent oscillations. A typical choice of momentum is between 0.5 and 0.9.

For more details on learning rate and momentum, please consider going through this article of mine on momentum and adaptive learning.

7. Number of epochs: The number of epochs is the number of times the whole training dataset is shown to the network during training. The default in most frameworks (e.g. Keras) is 1.

8. Batch size: The batch size is the number of samples given to the network after which a parameter update happens. It is usually chosen as a power of 2; common choices are 32, 64 and 128.

9. Weight Initialization: Biases are typically initialized to 0 (or close to 0), but weights must be initialized carefully. Their initialization can have a big impact on the local minimum found by the training algorithm. Usually we assign random numbers to the weights in such a way that they are normally distributed with mean 0 and a small standard deviation.

10. Loss Function: The loss function compares the network's output for a training example against the intended output. A common general-purpose loss function is the squared error loss. When the output of the neural network is being treated as a probability distribution (e.g. a softmax output layer is being used), we generally use cross-entropy as the loss function.
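
Here is a minimal Keras sketch showing where several of these hyperparameters appear. The values are illustrative, not tuned, and x_train and y_train are assumed to hold the training data with 784 features per example.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(784,)),  # units, activation
    layers.Dropout(0.3),                                       # dropout rate
    layers.Dense(64, activation='relu'),                       # another hidden layer
    layers.Dense(10, activation='softmax'),                    # output layer
])
model.compile(
    optimizer=keras.optimizers.SGD(lr=0.01, momentum=0.9),  # learning rate, momentum
    loss='sparse_categorical_crossentropy')                 # loss function

model.fit(x_train, y_train, epochs=20, batch_size=128)  # epochs, batch size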

Hyperparameter Tuning: Following are some ways to tune hyperparameters in a neural network:

1. Coordinate Descent: Keep all hyperparameters fixed except one, and adjust that hyperparameter to minimize the validation error.

2. Grid Search: Grid search tries each and every hyperparameter setting over a specified range of values. This involves a cross-product of all intervals, so the computational expense grows exponentially with the number of hyperparameters. The good part is that it can be easily parallelized.

3. Random Search: This is the opposite of grid search. Instead of taking the cross-product of all the intervals, it samples the hyperparameter space randomly. It often performs better than grid search, because grid search can take an exponentially long time to reach a good hyperparameter subspace. Random search can also be parallelized (see the sketch after this list).

4. Cross-validation: We can also use cross-validation, training and evaluating on different portions of the dataset.
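
A minimal random-search sketch; build_and_evaluate is a hypothetical helper that trains a model with the given hyperparameters and returns its validation accuracy:

import random

best_params, best_score = None, 0.0
for _ in range(20):  # 20 random trials
    params = {
        'learning_rate': 10 ** random.uniform(-4, -1),  # sample on a log scale
        'batch_size': random.choice([32, 64, 128, 256]),
        'dropout': random.uniform(0.2, 0.5),
    }
    score = build_and_evaluate(params)  # hypothetical helper, returns accuracy
    if score > best_score:
        best_params, best_score = params, score

print(best_params, best_score)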

Wednesday, 12 June 2019

Global and Local Minima in Gradient Descent in Deep Learning

The task of a Gradient Descent optimizer is to find optimal values for the weights. But sometimes it may end up finding weights that are suboptimal, which leads to inaccuracy in the model.

To understand this better, consider the following diagram.

[Diagram: a loss curve with one global minimum and several local minima]

The lowest point in the above diagram is referred to as the global minimum, while the other low points are referred to as local minima. Ideally our SGD should reach the global minimum, but sometimes it gets stuck in a local minimum, and it becomes very hard to know whether our SGD is at the global minimum or stuck in a local one.

How to avoid local minima?

Getting stuck in local minima is a major issue with gradient descent. Hyperparameter tuning plays a vital role in avoiding local minima. There is no universal solution to this problem, but there are some methods we can use to avoid local minima.

1. Increasing the learning rate: If the learning rate of the algorithm is too small, SGD is more likely to get stuck in a local minimum.

2. Add some noise while updating weights: Adding random noise to the weights also sometimes helps in finding the global minimum.

3. Assign random weights: Repeated training with random starting weights is among the popular methods to avoid this problem, but it requires extensive computation time.

4. Use a large number of hidden layers: Each hidden node in a layer starts out in a different random state. This allows each hidden node to converge to different patterns in the network. Parameterizing this size lets the neural network user explore thousands of different local minima within a single neural network.

5. MOST EFFECTIVE ONE: Use momentum and adaptive-learning-based SGD: Instead of using conventional gradient descent optimizers, try optimizers like Adagrad, AdaDelta, RMSprop and Adam. Adam uses momentum and an adaptive learning rate to reach the global minimum. You can find more detail about momentum and adaptive-learning-based algorithms in this article of mine.

Sometimes local minima are as good as the global minimum

It is not always necessary to reach the true global minimum. It is generally agreed that many local minima have values close to the global minimum.

There is a lot of research showing that reaching the global minimum is sometimes not easy. So, in these cases, if we manage to find a local minimum that is as good as the global minimum, we should use it.

Momentum and Adaptive Learning based Gradient Descent Optimizers: Adagrad and Adam

In my previous article on Gradient Descent Optimizers, we discussed three types of Gradient Descent algorithms:

1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini Batch Gradient Descent

In this article, we will see some advanced versions of Gradient Descent which can be categorized as:

1. Momentum based (Nesterov Momentum)
2. Based on adaptive learning rate (Adagrad, Adadelta, RMSprop)
3. Combination of momentum and adaptive learning rate (Adam)

Let's first understand something about momentum.

Momentum

Momentum helps accelerate SGD in the relevant direction. So, it's a good idea to also maintain momentum for every parameter. It has the following advantages (a small sketch after the list shows the update rule):

1. Avoids local minima: As momentum adds up speed and hence increases the step size, the optimizer is less likely to get trapped in local minima.

2. Faster convergence: Momentum makes convergence faster, as it increases the step size due to the gained speed.
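
A minimal sketch of the classical momentum update on a toy loss f(w) = (w - 3)^2, whose gradient is 2 * (w - 3):

w, velocity = 0.0, 0.0
momentum, lr = 0.9, 0.01  # a typical momentum value lies between 0.5 and 0.9

for _ in range(100):
    grad = 2 * (w - 3)
    # Past gradients accumulate in the velocity: consistent directions gain
    # speed, while oscillating directions cancel out
    velocity = momentum * velocity - lr * grad
    w += velocity

print(w)  # close to 3, the minimum of the toy loss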

Now, let's see some flavors of SGD.

1. Nesterov Momentum

It finds the current momentum and, based on that, approximates the next position. It then calculates the gradient w.r.t. the approximated next position instead of the current position. This prevents us from going too fast and increases responsiveness, which significantly improves the performance of SGD.

2. Adagrad

Adagrad mainly focuses on an adaptive learning rate rather than on momentum.

In standard SGD, the learning rate is always constant. It means we have to move at the same speed irrespective of the slope. This seems impractical in real life.

What happens if we know that we should slow down or speed up? What happens if we know that we should accelerate more in this direction and decelerate in that direction? That's not possible with standard SGD.

Adagrad keeps updating the learning rate instead of using a constant learning rate. It accumulates the sum of the squares of all past gradients, and uses that to normalize the learning rate, so that the learning rate can be smaller or larger depending on how the past gradients behaved.

It adapts the learning rate to the parameters, performing smaller updates (i.e. lower learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. higher learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data.
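
A minimal sketch of the Adagrad update on the same toy loss f(w) = (w - 3)^2 used in the momentum sketch above:

w, grad_squared_sum = 0.0, 0.0
lr, eps = 0.5, 1e-8  # eps avoids division by zero

for _ in range(100):
    grad = 2 * (w - 3)
    # Accumulate the sum of squared gradients...
    grad_squared_sum += grad ** 2
    # ...and use it to shrink the effective learning rate over time
    w -= lr * grad / (grad_squared_sum ** 0.5 + eps)

print(w)  # approaches 3, taking ever-smaller steps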

2A. AdaDelta and RMSprop

AdaDelta and RMSprop are extensions of Adagrad.

As discussed in the Adagrad section, Adagrad accumulates the sum of the squares of all past gradients and uses that to normalize the learning rate. Due to this, Adagrad encounters an issue: the learning rate keeps decreasing, until at some point learning almost stops.

To handle this issue, AdaDelta and RMSprop decay the past accumulated gradients, so only a portion of the past gradients is considered. Instead of considering all of the past gradients, we consider a moving average.
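
A minimal sketch of the RMSprop-style fix on the same toy loss: an exponential moving average replaces the ever-growing sum, so the effective learning rate no longer shrinks toward zero:

w, avg_squared_grad = 0.0, 0.0
lr, decay, eps = 0.01, 0.9, 1e-8

for _ in range(500):
    grad = 2 * (w - 3)
    # Exponential moving average instead of an ever-growing sum
    avg_squared_grad = decay * avg_squared_grad + (1 - decay) * grad ** 2
    w -= lr * grad / (avg_squared_grad ** 0.5 + eps)

print(w)  # close to 3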

3. Adam

Adam is one of the most effective Gradient Descent optimizers and is widely used. It uses the powers of both momentum and adaptive learning. In other words, Adam is RMSprop or AdaDelta with momentum: it applies momentum and also normalizes the learning rate using a moving average of the squared gradients.

Conclusion: Most of the above Gradient Descent methods are already implemented in popular deep learning frameworks like TensorFlow, Keras, Caffe etc. However, Adam is currently the default recommended algorithm, as it utilizes both momentum and adaptive learning.

For more details on the above algorithms, I strongly recommend this and this article.

Monday, 10 June 2019

Implement a Linear Classification Model using TensorFlow Estimator

Let's see how we can perform linear classification using the TensorFlow library in Python. We will use the LinearClassifier function from the TensorFlow Estimator API. We will use US Census income data and try to predict which income class (>50K or <=50K) people belong to. You can download this dataset from here. The dataset has 32561 observations and 15 columns. You can also download my Jupyter notebook containing the below code from here. So, let's get started.

Step 1: Import required libraries

import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

Step 2: Load and explore the dataset

dataset = pd.read_csv('adult.csv')
dataset.head()
dataset.size
dataset.shape
dataset.columns
dataset.dtypes
dataset.describe()

Step 3: Drop fnlwgt column

We are not going to use this column, as it does not seem to contribute any relevant information to our prediction. So, it is better to drop it.

dataset.drop('fnlwgt', axis=1, inplace=True)

Step 4: Convert label into 0 and 1

dataset['income'].unique()

Output: array(['<=50K', '>50K'], dtype=object)

So we have only two string labels. Let's convert these into numeric labels (0 and 1).

def label_fix(label):
    # Map the two string labels to numeric labels: '<=50K' -> 0, '>50K' -> 1
    if label == '<=50K':
        return 0
    else:
        return 1

dataset['income'] = dataset['income'].apply(label_fix)
dataset.head()
dataset['income'].unique()
dataset['income'].value_counts()

Step 5: Split dataset into training and testing set

X = dataset.drop('income', axis=1)
y = dataset['income']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

Step 6: Create Feature Columns

All the independent variables need to be converted into tensors. The estimator requires a list of feature columns to train the model; hence, each column's data needs to be converted into a tensor.

We need to create feature columns for our numeric and categorical data. Feature columns act as intermediaries between raw data and TensorFlow Estimators.

Convert numeric columns into feature columns:

tf.feature_column.numeric_column: Use this to convert a numeric column into a feature column.

Convert categorical columns into feature columns:

tf.feature_column.categorical_column_with_hash_bucket: Use this if you don't know the set of possible values for a categorical column in advance and there are too many of them.

tf.feature_column.categorical_column_with_vocabulary_list: Use this if you know the set of all possible feature values of a column and there are only a few of them.

So, let's convert all of our columns into feature columns as discussed above.

workclass = tf.feature_column.categorical_column_with_hash_bucket('workclass', hash_bucket_size=1000)

education = tf.feature_column.categorical_column_with_hash_bucket('education', hash_bucket_size=1000)

marital_status = tf.feature_column.categorical_column_with_hash_bucket('marital_status', hash_bucket_size=1000)

occupation = tf.feature_column.categorical_column_with_hash_bucket('occupation', hash_bucket_size=1000)

relationship = tf.feature_column.categorical_column_with_hash_bucket('relationship', hash_bucket_size=1000)

race = tf.feature_column.categorical_column_with_hash_bucket('race', hash_bucket_size=1000)

sex = tf.feature_column.categorical_column_with_vocabulary_list('sex', ['Female', 'Male'])

native_country = tf.feature_column.categorical_column_with_hash_bucket('native_country', hash_bucket_size=1000)

age = tf.feature_column.numeric_column('age')

education_num = tf.feature_column.numeric_column('education_num')

capital_gain = tf.feature_column.numeric_column('capital_gain')

capital_loss = tf.feature_column.numeric_column('capital_loss')

hours_per_week = tf.feature_column.numeric_column('hours_per_week')

feature_columns = [workclass, education, marital_status, occupation, relationship, race, sex, native_country, age, education_num, capital_gain, capital_loss, hours_per_week]

Step 7: Create Input Function

We now create an input function that feeds the Pandas DataFrame into our classifier model. It requires you to specify the features, labels and batch size. It also has a special argument called shuffle, which allows the model to read the records in a random order, thereby improving model performance. You can also specify the number of epochs you want to use.

input_fn = tf.estimator.inputs.pandas_input_fn(x=X_train, y=y_train, batch_size=128, num_epochs=None, shuffle=True)

I have set a batch size of 128 and num_epochs=None (the default is 1). With num_epochs=None, the input function cycles through the data indefinitely, and the length of training is instead controlled by the steps argument passed to train in the next step.

Step 8: Create a model using feature columns and input function

model = tf.estimator.LinearClassifier(feature_columns = feature_columns)
model.train(input_fn = input_fn, steps=1000)

Let the optimizer perform 1000 steps.

Step 9: Make predictions

pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test, batch_size=len(X_test), shuffle=False)
predictions = list(model.predict(input_fn = pred_fn))
predictions[0]
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])
final_preds[:10]
df=pd.DataFrame({'Actual':y_test, 'Predicted':final_preds})  
df 

Step 10: Check accuracy

print(classification_report(y_test, final_preds))
print(confusion_matrix(y_test, final_preds))
print(accuracy_score(y_test, final_preds))

We got around 82.5% accuracy. You can play around with hyperparameters like the number of epochs, number of steps, batch size etc. to improve the accuracy.

Sunday, 9 June 2019

Implement a simple Gradient Descent Optimizer in TensorFlow

Let's implement a simple Gradient Descent optimizer in TensorFlow for a linear model. We will use the GradientDescentOptimizer function present in TensorFlow to find the optimal values of weight and bias so that the loss is minimized. You can download my Jupyter notebook containing the below code on Gradient Descent from here.

Step 1: Import TensorFlow library

import tensorflow as tf

Step 2: Declare all the variables and placeholders

W = tf.Variable([0.3], tf.float32)
b = tf.Variable([-0.3], tf.float32)

We have initialized the weight and bias with the arbitrary values 0.3 and -0.3 respectively. The task of the Gradient Descent optimizer is to find optimal values for both of these variables so that our loss is minimal. Let's see how this happens in the next steps.

x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)

Step 3: Create a linear model

linear_model = W * x + b

Step 4: Create a loss function

squared_delta = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_delta)

1. We are using the sum of squared errors as our loss function.

2. The "linear_model - y" expression computes the error: "linear_model" contains the predicted values and "y" contains the actual values.

3. The "square" function squares all the errors.

4. The "reduce_sum" function sums up all the squared errors.

Step 5: Create a Gradient Descent Optimizer

optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

We are passing 0.01 as the learning rate.

Step 6: Initialize all the variables

init = tf.global_variables_initializer()

Step 7: Create a session and run the graph

session = tf.Session()
session.run(init)
print(session.run(loss, {x:[1,2,3,4], y:[0,-1,-2,-3]}))

Output: 23.66

So, our loss is 23.66, which is quite high. It means the initial values of weight and bias that we chose in step 2 (0.3 and -0.3) are not optimal. We need to take the help of Gradient Descent to optimize our weight and bias.

In the next step, we will run the Gradient Descent optimizer (with 1000 iterations and a learning rate of 0.01) and try to minimize this loss.

for _ in range(1000):
    session.run(train, {x:[1,2,3,4], y:[0,-1,-2,-3]})
print(session.run([W,b]))
session.close()

Output: [array([-0.9999969], dtype=float32), array([0.9999908], dtype=float32)]

Now we get W as -0.9999969 (approx -1) and b as 0.9999908 (approx 1). So, the final conclusion is that the optimized value of W is -1 and of b is 1. If we had initialized W and b as -1 and 1 in step 2, we would have got zero loss. So, our Gradient Descent optimizer has done a pretty decent job for us.

Saturday, 8 June 2019

Constants, Placeholders, Variables and Sessions in TensorFlow

In this article on TensorFlow, we will see how to build and run a graph, taking simple examples of constants, placeholders and variables. We will also learn something about sessions and the feed dictionary. To learn some theory about TensorFlow, you can look at this post of mine. You can download my Jupyter notebook containing the following code from here.

First thing first, lets import the TensorFlow library.

import tensorflow as tf

Constants

Below, we declare three nodes (a, b and c). The operations at nodes "a" and "b" declare constant values, while node "c" performs a multiplication.

a = tf.constant(5.0)
b = tf.constant(6.0)
c = a * b

So far, we have only built a graph. We need to run it. In order to run a graph in TensorFlow, we need to create a session. Following is the way to create a session and run the graph:

session = tf.Session()
result = session.run(c)
print(result)
session.close()

Output: 30.0

Another example

We can also pass the data type to a node.

a = tf.constant(5.0, tf.float32)
b = tf.constant(6.0)
print(a,b)

If we print the nodes before running the graph, we get the following output:

Output: Tensor("Const_2:0", shape=(), dtype=float32) Tensor("Const_3:0", shape=(), dtype=float32)

To print the actual values, we need to create a session and run the graph.

session = tf.Session()
result = session.run([a,b])
print(result)
session.close()

Output: [5.0, 6.0]

Placeholders

We provide values to placeholders while running the graph.

a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
c = a * b
d = 2 * a

session = tf.Session()
result = session.run(c, {a:[1,3], b:[2,4]})
print(result)

Output: [ 2. 12.]

Feed Dictionary: We can also create a dictionary and feed it into a placeholder while running a graph. We use the feed_dict parameter for this.

dictionary = {a:[[[1,2,3],[4,5,6],[7,8,9]],[[1,2,3],[4,5,6],[7,8,9]]]}
result = session.run(d, feed_dict=dictionary)
print(result)
session.close()

Output:

[[[ 2.  4.  6.]
  [ 8. 10. 12.]
  [14. 16. 18.]]

 [[ 2.  4.  6.]
  [ 8. 10. 12.]
  [14. 16. 18.]]]

Naming a node

We can also provide a name to a node. This is helpful in visualizing the nodes in TensorBoard.

a = tf.placeholder(tf.float32, name="A")
b = tf.placeholder(tf.float32, name="B")
c = tf.multiply(a, b, name="C")

with tf.Session() as session:
    result = session.run(c, feed_dict={a:[1,2,3], b:[4,5,6]})
    print(result)

Output: [ 4. 10. 18.]

Variables

Lets create some nodes and assign them some operations.

zero = tf.Variable(0)
one = tf.constant(1)
new_value = tf.add(zero, one)
updated_variable = tf.assign(zero, new_value)

We must initialize all the variables before running the graph, so the following line is a must:

init = tf.global_variables_initializer()

Now, create a session and run the graph.

session = tf.Session()
session.run(init)
print(session.run(zero))
print(session.run(one))
print(session.run(new_value))
print(session.run(updated_variable))

Output:

0
1
1
1

Let's run the "updated_variable" node 5 times and observe the results.

for _ in range(5):
    session.run(updated_variable)
    print(session.run(zero))

Output:

2
3
4
5
6

Finally, close the session.

session.close()  

Strings

Below is an illustration of the string concatenation operation in TensorFlow.

hello = tf.constant('hello')
world = tf.constant('world')
hello_world = tf.add(hello, world)
with tf.Session() as session:
    print(session.run(hello_world))

Output: b'helloworld'

Friday, 7 June 2019

What is Dropout? How does it prevent overfitting in a neural network?

Dropout is an effective regularization technique used in neural networks which increases the generalization capabilities of a deep learning model and prevents it from overfitting.

Overfitting in neural networks

Large neural networks trained on relatively small datasets can overfit the training data. Overfitted neural networks result in poor performance when the model is evaluated on new data. Dropout is an efficient solution to this overfitting problem in neural networks.

What happens in dropout?

Dropout can be seen as temporarily deactivating or ignoring neurons in the hidden layers of a network. Probabilistically dropping out nodes in the network is a simple and effective regularization method. We can switch off some neurons in a layer so that they do not contribute or learn any information, and the responsibility falls on the other active neurons to learn harder and reduce the error.

Points to note about dropout

1. Dropout is implemented per-layer in a neural network. Dropout can be applied to hidden and input layers, but not to the output layer (see the Keras sketch after this list).

We can use different dropout probabilities for each layer. As mentioned previously, dropout should not be applied to the output layer, so the output layer always has keep_prob = 1, while the input layer uses a high keep_prob such as 0.9 or 1.

If a hidden layer has keep_prob = 0.8, this means that on each iteration, each unit has an 80% probability of being included and a 20% probability of being dropped out.

This probability acts as a hyperparameter, and we should carefully decide how many neurons we want to deactivate in a given hidden layer.

2. Dropout can be used with many types of layers, such as dense fully connected layers, convolutional layers, and recurrent layers such as the long short-term memory network (LSTM) layers.

3. Dropout should be implemented only during training phase, not in testing phase. 

4. Dropout can be compared to the bagging technique in machine learning. In bagging, no single tree is trained on all of the data. Similarly, with dropout, no single sub-network is trained on all of the features.
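
A minimal Keras sketch (assuming 784 input features): dropout after each hidden layer, none on the output layer. Note that Keras' Dropout takes the fraction to drop, so keep_prob = 0.8 corresponds to Dropout(0.2).

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(784,)),
    layers.Dropout(0.2),  # drop 20% of this layer's units during training
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(10, activation='softmax'),  # no dropout on the output layer
])

# Keras applies dropout only during training; at test time the full
# network is used automatically, which matches point 3 above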

Advantages of dropout

1. Reduces overfitting and hence increases the accuracy of the model

2. Improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark datasets.

3. Computationally cheap as compared to other regularization methods.

Disadvantages of dropout

1. Introduces sparsity: If we apply dropout too aggressively, activations inside the hidden layers may become sparse. You can compare this with sparse autoencoders.

2. Dropout makes the training process noisy, as it forces nodes within a layer to probabilistically take on more or less responsibility for the inputs.

Thursday, 6 June 2019

Dying ReLU: Causes and Solutions (Leaky ReLU)

ReLU (Rectified Linear Unit) is a widely used activation function in neural networks which outputs zero if the input is negative or zero, and outputs the input unchanged if it is positive.

Mathematically, relu(z) = max(0, z)

For more details on ReLU and other activation functions, you can visit this post of mine on activation functions in neural networks.

What is a Dying ReLU?

The dying ReLU problem refers to ReLU neurons that become inactive and output 0 for every input. Once a neuron reaches a state where its input is negative for every training example, it always outputs zero, its gradient is zero as well, and it is unlikely to recover; it becomes inactive forever. Such neurons play no role in discriminating the input and become useless in the neural network. If this process continues, over time you may end up with a large part of your network doing nothing.

What is the cause of Dying ReLU?

Let's see why the dying ReLU problem occurs. It is likely to occur when:

1. The learning rate is too high, or
2. There is a large negative bias.

Consider the weight update performed during back-propagation:

new_weight = old_weight - (learning_rate * gradient of the loss w.r.t. the weight)

If the learning rate is too high, an update can be so large that the weights jump to values for which the neuron's input (the weighted sum plus the bias) is negative for every training example. Similarly, a large negative bias keeps the neuron's input negative for all examples.

Once the neuron's input is negative for every example, its ReLU output is zero and its gradient is zero too, so the weights stop updating and the neuron dies forever.

What is the solution to Dying ReLU?

Leaky ReLU is the most common and effective method to alleviate a dying ReLU. It adds a slight slope in the negative range to prevent the dying ReLU issue.

Leaky ReLU has a small slope for negative values, instead of altogether zero. For example, leaky ReLU may have y = 0.0001x when x < 0.

Parametric ReLU (PReLU) is a type of leaky ReLU that, instead of having a predetermined slope like 0.0001, makes it a parameter for the neural network to figure out itself: y = αx when x < 0.
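
A minimal NumPy sketch of ReLU vs. leaky ReLU (using a typical slope of 0.01 for illustration):

import numpy as np

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    # The small slope (alpha) in the negative range keeps the gradient
    # nonzero, so the neuron can recover instead of dying
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -1.0, 0.0, 2.0])
print(relu(z))        # [0. 0. 0. 2.]
print(leaky_relu(z))  # [-0.03 -0.01  0.    2.  ]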

Using a lower learning rate also often mitigates the problem.