Friday, 29 March 2019

What are Outliers? How to find and remove outliers using JointPlot in Seaborn Library?

Outliers are data points that deviate a lot from the rest of the observations. They can drastically degrade the performance and accuracy of a model, so it is very important to remove them from our dataset to get consistent results from Machine Learning algorithms.
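To make the effect concrete, here is a minimal sketch (not part of the original walkthrough) that fits a straight line with NumPy's polyfit to a handful of made-up points, once without and once with a single extreme point. All the numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical, perfectly linear data: y = 2 * x
x_clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_clean = 2.0 * x_clean

# Same data plus one made-up extreme point at (10, 0)
x_out = np.append(x_clean, 10.0)
y_out = np.append(y_clean, 0.0)

# np.polyfit with degree 1 returns [slope, intercept]
clean_slope, clean_intercept = np.polyfit(x_clean, y_clean, 1)
outlier_slope, outlier_intercept = np.polyfit(x_out, y_out, 1)

print("slope without outlier:", clean_slope)
print("slope with outlier:   ", outlier_slope)
```

In this toy case a single point is enough to flip the sign of the fitted slope, which is the kind of distortion we want to avoid.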

We will use the Ames Housing dataset and concentrate on the "GrLivArea" feature. "GrLivArea" refers to the living area (in sq ft.) above ground. We will try to find and remove the outliers in this feature.

Step 1: Load the dataset

import pandas as pd
dataset = pd.read_csv("C:/datasets/train.csv")

Step 2: Data Exploration

dataset.shape 
dataset[["GrLivArea", "SalePrice"]]

We see that the dataset has 1460 rows and 81 columns. Please note that I am not going to explore the entire dataset. I have written a complete post on data exploration here. I will only concentrate on the "GrLivArea" feature.

Step 3: Draw a plot between GrLivArea and SalePrice

import seaborn as sns
sns.jointplot(x=dataset['GrLivArea'], y=dataset['SalePrice'])

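[Figure: joint plot of GrLivArea against SalePrice, with the points where GrLivArea > 4000 highlighted in a red box]
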
We can see from the above plot that living area is positively correlated with sale price. We can also spot 4 outliers, i.e. the points with GrLivArea > 4000 (see the data points in the red highlighted box).
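Before removing anything, it can help to confirm the count in code rather than only by eye. The following is a minimal sketch that assumes a tiny made-up DataFrame named sample in place of the real data (the values are hypothetical); on the real data you would apply the same mask to dataset.

```python
import pandas as pd

# Hypothetical stand-in for the Ames data; the values are made up for illustration.
sample = pd.DataFrame({
    "GrLivArea": [1710, 1262, 4316, 2198, 5642, 1786],
    "SalePrice": [208500, 181500, 755000, 250000, 160000, 223500],
})

# Boolean mask marking the rows above the threshold read off the plot
is_outlier = sample["GrLivArea"] > 4000

# Inspect the flagged rows before deciding to drop them
outliers = sample[is_outlier]
n_outliers = int(is_outlier.sum())

print(n_outliers)
print(outliers)
```

Looking at the flagged rows first makes it easier to check that the threshold picks out the points you expect.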

Step 4: Remove the outliers

dataset.drop(dataset[dataset['GrLivArea'] > 4000].index, inplace=True)
dataset.shape

Now we get 1456 rows, which means we have successfully removed the 4 outliers from our dataset.
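If you would rather keep the original DataFrame untouched, the same filter can be written with boolean indexing instead of an in-place drop. This is a sketch of an alternative, not the method used above, and it again assumes a small made-up DataFrame named sample (hypothetical values).

```python
import pandas as pd

# Hypothetical stand-in for the Ames data; the values are made up for illustration.
sample = pd.DataFrame({
    "GrLivArea": [1710, 1262, 4316, 2198, 5642, 1786],
    "SalePrice": [208500, 181500, 755000, 250000, 160000, 223500],
})

# Keep every row that is NOT above the threshold; sample itself is not modified.
keep = ~(sample["GrLivArea"] > 4000)
cleaned = sample[keep].reset_index(drop=True)

# Compare the shapes before and after to verify how many rows were removed
print(sample.shape, cleaned.shape)
```

Because cleaned is a new DataFrame, you can still compare results with and without the outliers later on.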

Note: Tree-based algorithms are usually robust to outliers, so explicit outlier removal is often less critical for them.
