Thursday 14 March 2019

Difference between Ridge Regression (L2 Regularization) and Lasso Regression (L1 Regularization)

Regularization is mainly used to reduce overfitting in Machine Learning algorithms, which helps them generalize better to unseen data.

If a model is too simple, it may fail to capture the underlying pattern in the training data and underfit. Such a model performs poorly on both the training data and unseen data.

A model that is too complex can also fit the noise in the training data, which is irrelevant to our predictions. Such a model may perform well on the training data but poorly on test data due to overfitting.

We need to choose the right model somewhere between the too-simple and the too-complex. Regularization helps control model complexity, so that the model does not overfit and generalizes better.

Three common types of regularization for linear models are:

1. Ridge Regression (L2 Regularization)
2. Lasso Regression (L1 Regularization)
3. Elastic Net Regression

Regularization adds a penalty term (called the Regularization Penalty) to the objective function. This introduces some bias, and in return the algorithm usually gets a significant drop in variance.

For example, Linear Regression tries to minimize the Loss Function (let's say the Sum of the Squared Errors) to get the best fit line. To prevent this model from overfitting, we can add a Regularization Penalty to the Loss Function. Now the model has to minimize the sum of the Loss Function and the Regularization Penalty.
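As a rough illustration of the idea above, here is a minimal Python sketch of a penalized loss. The function names (sse, penalty, penalized_loss), the parameter lam for the severity of the penalty, and the toy data are all made up for this example, and the model has no intercept to keep things short.

```python
def sse(weights, X, y):
    """Sum of squared errors for a linear model without an intercept."""
    total = 0.0
    for row, target in zip(X, y):
        prediction = sum(w * x for w, x in zip(weights, row))
        total += (target - prediction) ** 2
    return total


def penalty(weights, kind):
    """L1 penalty sums absolute values; L2 penalty sums squares."""
    if kind == "l1":
        return sum(abs(w) for w in weights)
    if kind == "l2":
        return sum(w * w for w in weights)
    raise ValueError("kind must be 'l1' or 'l2'")


def penalized_loss(weights, X, y, lam, kind):
    """Objective = SSE + lam * penalty; lam = 0 recovers plain least squares."""
    return sse(weights, X, y) + lam * penalty(weights, kind)


# Toy data, made up for illustration only.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, -1.0, 1.0]
weights = [2.0, -1.0]  # fits the toy data exactly, so the SSE is 0

print(penalized_loss(weights, X, y, lam=0.0, kind="l2"))
print(penalized_loss(weights, X, y, lam=0.5, kind="l1"))
print(penalized_loss(weights, X, y, lam=0.5, kind="l2"))
```

Since these weights fit the toy data perfectly, any non-zero value in the last two lines comes purely from the penalty. Large weights now cost something even when the fit is perfect.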

The severity of the penalty is usually chosen by cross validation, which makes the final model much less likely to overfit. The severity can vary from 0 to positive infinity. If the severity is zero, regularization has no effect on the model; the larger it gets, the more strongly the weights are pushed towards zero.
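To make the cross validation step concrete, here is a small sketch in plain Python. It assumes the simplest possible setting: one feature, no intercept, and ridge as the penalty, so the weight has a closed form. The helper names (fit_ridge_1d, cv_error), the candidate severities, and the synthetic data are assumptions for illustration, not a recommended setup.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge weight for one feature, no intercept:
    minimizes sum((y - w * x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def cv_error(xs, ys, lam, k=5):
    """Mean squared validation error of ridge, averaged over k folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        val_idx = set(range(fold, n, k))  # every k-th point is held out
        train_x = [xs[i] for i in range(n) if i not in val_idx]
        train_y = [ys[i] for i in range(n) if i not in val_idx]
        w = fit_ridge_1d(train_x, train_y, lam)
        errs = [(ys[i] - w * xs[i]) ** 2 for i in val_idx]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k


# Synthetic data: true weight 3.0 plus Gaussian noise (made up for the demo).
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(40)]
ys = [3.0 * x + random.gauss(0, 0.5) for x in xs]

# Try a grid of severities and keep the one with the lowest validation error.
candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(xs, ys, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)
print(best_lam, scores[best_lam])
```

The same loop works for any penalty: only the fitting function changes. In practice a library routine would normally do this search for you.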

Difference between Ridge Regression (L2 Regularization) and Lasso Regression (L1 Regularization)

1. In L1 regularization, we penalize the absolute value of the weights while in L2 regularization, we penalize the squared value of the weights.

2. In L1 regularization, some parameters can shrink to exactly zero, while in L2 regularization the parameters shrink towards zero but generally do not reach it exactly. So, L1 can simply discard the useless features in the dataset and make the model simpler.
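The second difference can be seen in a small sketch. It assumes the simplest case, where each weight can be treated on its own (a single feature, or features that are orthonormal), so both penalized problems have a closed-form answer. Here z stands for the unpenalized least-squares weight; the names ridge_shrink and lasso_shrink and the example weights are hypothetical.

```python
def ridge_shrink(z, lam):
    """Minimizer of (z - w)^2 + lam * w^2: rescales z towards zero."""
    return z / (1.0 + lam)


def lasso_shrink(z, lam):
    """Minimizer of (z - w)^2 + lam * |w|: soft-thresholding at lam / 2."""
    threshold = lam / 2.0
    if z > threshold:
        return z - threshold
    if z < -threshold:
        return z + threshold
    return 0.0  # weights inside the threshold are set exactly to zero


unpenalized = [3.0, -0.4, 0.1]  # hypothetical least-squares weights
lam = 1.0

print([ridge_shrink(z, lam) for z in unpenalized])
print([lasso_shrink(z, lam) for z in unpenalized])
```

With lam = 1.0, ridge halves every weight, so the small ones get smaller but stay non-zero. The L1 version subtracts a fixed amount instead, so the two small weights land exactly on zero and only the large one survives.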

When to use what?

There is no hard and fast rule. If you need to eliminate some useless features from the dataset, L1 should be preferred. But if you cannot afford to eliminate any feature from your dataset, use L2. In fact, we should try both L1 and L2 regularization and check which one generalizes better. We can also use Elastic Net Regression, which combines the L1 and L2 penalties.
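One common way to write the Elastic Net objective is sketched below. The symbols are assumptions introduced here: \lambda is the overall severity of the penalty and \alpha is the mix between the two penalties. Libraries differ in how they scale these terms, so treat this as one convention rather than the definition.

```latex
\min_{w} \; \sum_{i=1}^{n} \bigl( y_i - x_i^{\top} w \bigr)^2
  + \lambda \Bigl( \alpha \sum_{j} |w_j| + (1 - \alpha) \sum_{j} w_j^2 \Bigr),
  \qquad \lambda \ge 0, \quad 0 \le \alpha \le 1
```

In this form, \alpha = 1 gives Lasso Regression, \alpha = 0 gives Ridge Regression, and values in between blend the two penalties.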