Friday, 15 February 2019

What is the cause of Overfitting and how can we avoid it in Machine Learning?

How can we avoid overfitting in Machine Learning? What are the various ways to deal with overfitting? These are very important questions in the world of Data Science and Machine Learning.

Let's start this discussion with a small story of two brothers: Tom and Harry.

Tom and Harry study in the First standard. Tomorrow is their Maths exam. The syllabus contains 20 questions (5 questions each from Addition, Subtraction, Multiplication and Division).

The 5 Addition questions in the syllabus are:

2+5=7
3+2=5
1+8=9
4+2=6
7+1=8

Tom learnt only Addition and Subtraction.
Harry memorized all the 20 questions.

Exam Result: Tom just managed to pass while Harry became one of the toppers.

Next day, Harry was given the following Addition problems to solve:

3+2 (Harry answered 5 because this was in the syllabus and he had memorized it)
1+1 (Harry failed to answer this simple question because it was not in the syllabus)
2+1 (Harry failed to answer this simple question because it was not in the syllabus)

Tom was able to solve all of the above, but he was not able to solve Multiplication and Division problems.

The story is finished. Now let's relate it to the concepts of Underfitting and Overfitting in Machine Learning.

Consider the Maths syllabus as the training dataset.

Tom learnt only a small portion of the syllabus. This is Underfitting. The algorithm has not learnt all the patterns in the training dataset, so it cannot predict those scenarios when they appear in the test dataset. It can, however, generalize the scenarios it did learn.

Harry memorized all the 20 questions. This is Overfitting. The algorithm fits the training data perfectly but cannot generalize to new data in the real environment, which results in high variance.
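
To make the story concrete, here is a small Python sketch of the two brothers on the Addition questions only. The names `harry`, `tom` and `accuracy` are my own illustration and not part of any library: Harry is modelled as a lookup table of the syllabus, while Tom applies the rule of addition.

```python
# Training data: the five Addition questions from the syllabus.
syllabus = {(2, 5): 7, (3, 2): 5, (1, 8): 9, (4, 2): 6, (7, 1): 8}

# Unseen data: the three Addition questions asked the next day.
new_questions = {(3, 2): 5, (1, 1): 2, (2, 1): 3}


def harry(a, b):
    """Memorizer: returns the stored answer, or None for an unseen question."""
    return syllabus.get((a, b))


def tom(a, b):
    """Learner: applies the rule of addition to any pair of numbers."""
    return a + b


def accuracy(model, questions):
    """Fraction of questions the model answers correctly."""
    correct = sum(model(a, b) == answer for (a, b), answer in questions.items())
    return correct / len(questions)


print("Harry on syllabus:     ", accuracy(harry, syllabus))
print("Harry on new questions:", accuracy(harry, new_questions))
print("Tom on new questions:  ", accuracy(tom, new_questions))
```

Harry scores 100% on the syllabus but gets only 1 of the 3 new questions right, while Tom answers all of them. The sketch covers Addition only, so it does not show Tom failing at Multiplication and Division.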

So, the following points should be noted regarding Overfitting and Underfitting:

1. A good algorithm should have a low (but not necessarily zero) error on the training dataset, i.e. low bias. If its error on the test dataset is similarly low, it also has low variance, which is a good sign of a consistent algorithm.

2. If an algorithm overfits the training dataset (close to zero training error, i.e. very low bias), there is a high chance that it will perform much worse on the test dataset (high variance), which is bad.

3. When a model overfits, it loses its capacity to generalize, so it performs poorly on the test dataset.

4. A model that overfits the training set is usually too complex.

5. A model that underfits the training set is usually too simple.
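
The points above can be seen on a toy example. The sketch below is only an illustration under my own assumptions: it uses a hand-written k-nearest-neighbour regressor (a technique I picked for the demo, not one discussed in this post), a made-up training set where the true rule is y = x plus alternating noise of +1 and -1, and a test set labelled with the true rule. A small k gives a very complex model, a large k gives a very simple one.

```python
def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda j: abs(train_x[j] - x))[:k]
    return sum(train_y[j] for j in nearest) / k


def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN model on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Toy training set: true rule y = x, plus alternating noise of +1 / -1.
train_x = list(range(10))
train_y = [x + (1 if x % 2 == 0 else -1) for x in train_x]

# Toy test set: new inputs between the training points, labelled with the true rule.
test_x = [x + 0.25 for x in train_x]
test_y = list(test_x)

results = {}
for k in (1, 3, 10):
    train_error = mse(train_x, train_y, train_x, train_y, k)
    test_error = mse(train_x, train_y, test_x, test_y, k)
    results[k] = (train_error, test_error)
    print(f"k={k:2d}  train MSE={train_error:.3f}  test MSE={test_error:.3f}")
```

With k=1 the model memorizes the noise: its training error is exactly zero (like Harry), yet its test error is higher than with k=3. With k=10 the model predicts the same average for every input, so it is too simple and does badly on both sets.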

What causes Overfitting? 

1. Small training dataset
2. Large number of features relative to the number of training samples
3. Noise in the dataset

How to avoid Overfitting? 

We need to find a middle ground between overfitting and underfitting. The following techniques help achieve this:

1. Dimensionality Reduction

2. Regularization Techniques (Ridge Regression - L2, Lasso Regression - L1, Elastic Net Regression - L1 and L2 combined)

3. Cross Validation Techniques

4. Ensemble Learning Techniques (Bagging and Boosting)
  • Bagging: Random Forest
  • Boosting: AdaBoost, Gradient Boosting Machine (GBM), XGBoost 
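
As a small taste of techniques 2 and 3, here is a rough sketch of L2 regularization and k-fold cross validation in plain Python. Everything in it is an assumption made for illustration: the helper names `ridge_slope`, `k_fold_indices` and `cv_mse` are hypothetical, the model is a single-feature line through the origin (y ≈ w * x), and the data is made up. Real projects would normally use a library implementation instead.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimizing sum((y - w*x)**2) + lam * w**2 (closed form, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def k_fold_indices(n, k):
    """Yield (train, validation) index lists; fold i holds out samples i, i+k, i+2k, ..."""
    for i in range(k):
        validation = [j for j in range(n) if j % k == i]
        train = [j for j in range(n) if j % k != i]
        yield train, validation


def cv_mse(xs, ys, lam, k=3):
    """Average validation MSE of the ridge model over k folds."""
    fold_errors = []
    for train, validation in k_fold_indices(len(xs), k):
        w = ridge_slope([xs[j] for j in train], [ys[j] for j in train], lam)
        errors = [(ys[j] - w * xs[j]) ** 2 for j in validation]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


# Made-up toy data, roughly following y = 2x.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lam={lam:6.1f}  slope={ridge_slope(xs, ys, lam):.3f}  CV MSE={cv_mse(xs, ys, lam):.3f}")
```

The larger the penalty `lam`, the more the slope is shrunk towards zero, which makes the model simpler. Cross validation gives an error estimate on data the model was not trained on, so one reasonable way to choose `lam` is to pick the value with the lowest cross-validated error.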

I will discuss all of these techniques in detail in my next post.
