Friday, 10 May 2019

Difference between Decision Tree and Random Forest in Machine Learning

Random Forest is a collection of Decision Trees. A Decision Tree bases its final decision on a single tree, while a Random Forest combines the outputs of a large number of small trees when making its final prediction. Following is a detailed list of the differences between Decision Tree and Random Forest:

1. Random Forest is an Ensemble Learning (Bagging) technique, unlike Decision Tree: In a Decision Tree, only one tree is grown, using all the features and observations. In a Random Forest, each tree is grown on a random bootstrap sample of the observations, with a random subset of features considered at each split, so a lot of small trees are grown instead of one big tree. The larger the number of trees, the better the accuracy and generalization capability; but at some point adding more trees no longer improves accuracy, and one should stop growing the forest there. A minimal sketch follows.
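
Here is a minimal sketch with scikit-learn (an assumption on my part, though the post later uses its n_estimators parameter), on a purely synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# One big tree grown on all features and all observations
tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# 100 small trees, each grown on a bootstrap sample of the observations,
# with a random subset of features considered at every split
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)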

2. Random Forest uses a voting system, unlike Decision Tree: The trees grown in a Random Forest are called weak learners, and each weak learner casts a vote as per its prediction. The class which gets the maximum votes is taken as the final prediction; you can think of it like a democratic system. (For regression, the trees' outputs are averaged instead.) On the other hand, there is no voting system in a Decision Tree: only one tree predicts the outcome. No democracy at all! The sketch below shows the idea.
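
A hedged sketch of the voting idea, collecting each tree's prediction by hand and taking the majority class. (Note that scikit-learn's forests actually average the trees' class probabilities rather than counting hard votes, but the effect is usually the same.)

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Each inner tree predicts a class index; stack them into (n_trees, n_samples)
votes = np.stack([t.predict(X) for t in forest.estimators_]).astype(int)

# The class with the most votes wins, per sample
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), axis=0, arr=votes)
print(forest.classes_[majority][:10])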

3. Random Forest rarely overfits, unlike Decision Tree: A Decision Tree is very prone to overfitting, as only one tree is responsible for predicting the outcome. If there is a lot of noise in the dataset, the tree will fit that noise into the model, driving the bias down to almost nothing on the training data but producing a lot of variance in its predictions on real-world data. This scenario is called overfitting. In a Random Forest, noise has very little room to spoil the model: each tree sees a different random sample, so the noise cannot affect all the trees in the same way, and the errors tend to cancel out in the vote. A rough illustration follows.
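
A rough illustration under assumed settings: inject label noise, then compare train versus test accuracy for a single deep tree and a forest.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# flip_y randomly flips 20% of the labels to simulate a noisy dataset
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# The lone tree fits the training set almost perfectly, noise included,
# then drops sharply on held-out data; the forest usually holds up better.
print("tree:  ", tree.score(X_train, y_train), tree.score(X_test, y_test))
print("forest:", forest.score(X_train, y_train), forest.score(X_test, y_test))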

4. Random Forest reduces variance instead of bias: Random Forest reduces the variance part of the error rather than the bias part, so on a given training dataset a Decision Tree may be more accurate than a Random Forest. But on an unseen validation dataset, the Random Forest usually wins in terms of accuracy, as sketched below.
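
One way to see this, using fold-to-fold spread of cross-validation scores as a rough proxy for variance (an illustrative setup of my own, not a formal variance decomposition):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=1)

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=10)
forest_scores = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=10)

# The forest's scores are typically higher on average and less spread
# out from fold to fold, reflecting the reduced variance.
print("tree:  ", tree_scores.mean(), tree_scores.std())
print("forest:", forest_scores.mean(), forest_scores.std())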

5. Performance: The downside of Random Forest is that it can be slow to train and predict, since many trees must be grown and evaluated. On a single process this adds up, but because the trees are independent of one another, the work parallelizes easily across CPU cores, as shown below.
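
In scikit-learn, the parallelization is a single parameter:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# n_jobs=-1 grows the trees on all available CPU cores; prediction is
# parallelized across cores the same way.
forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0).fit(X, y)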

6. Decision Tree is easier to understand and interpret: A Decision Tree is simple and easy to interpret: you can see exactly which variable, and which value of that variable, is used to split the data and predict the outcome. A Random Forest, on the other hand, is more of a black box. You can specify the number of trees you want in your forest (n_estimators), and you can specify the maximum number of features to be considered at each split (max_features). But you cannot control the randomness itself: you cannot control which feature ends up in which tree of the forest, and you cannot control which data points each tree is trained on. The sketch below contrasts the two.
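
A short contrast in scikit-learn: the single tree prints as readable if/else rules, while the best you get out of the forest is an aggregate importance score per feature.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# The tree's exact split variables and thresholds, as plain text rules
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))

# The forest only exposes a summary ranking of features, not its rules
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)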
