Outliers are data points that deviate markedly from the rest of the observations. They can substantially degrade the performance and accuracy of a model, so it is important to remove them from the dataset to get consistent results from Machine Learning algorithms.
We will use the Ames Housing dataset and concentrate on the "GrLivArea" feature, which is the above-ground living area in square feet. We will find and remove the outliers in this feature.
Step 1: Load the dataset
import pandas as pd
dataset = pd.read_csv("C:/datasets/train.csv")
Step 2: Data Exploration
dataset.shape
dataset[["GrLivArea", "SalePrice"]]
We see that the dataset has 1460 rows and 81 columns. Please note that I am not going to explore the entire dataset; I have written a separate, complete post on data exploration. Here I will concentrate only on the "GrLivArea" feature.
Step 3: Draw a plot between GrLivArea and SalePrice
import seaborn as sns
sns.jointplot(x=dataset['GrLivArea'], y=dataset['SalePrice'])
We can see from the above plot that living area is positively correlated with sale price. We can also spot 4 outliers, i.e. the points with GrLivArea > 4000 (see the data points in the red highlighted box).
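Before dropping anything, it can help to confirm in code what the plot suggests. The snippet below is a minimal sketch that counts and lists the rows above the threshold using a boolean mask. It runs on a small made-up DataFrame (the name `toy` and its values are hypothetical, not taken from the Ames data), so the count it gives is 2 rather than 4.

```python
import pandas as pd

# Hypothetical toy data standing in for the two Ames columns; values are made up.
toy = pd.DataFrame({
    "GrLivArea": [1200, 1710, 2198, 4476, 5642],
    "SalePrice": [150000, 208500, 250000, 745000, 160000],
})

THRESHOLD = 4000  # the same cut-off read off the plot

# Boolean mask: True where the living area exceeds the threshold
is_outlier = toy["GrLivArea"] > THRESHOLD

# Summing a boolean Series counts the True values
n_outliers = int(is_outlier.sum())

# Indexing with the mask returns only the candidate outlier rows
outlier_rows = toy[is_outlier]

print(n_outliers)
print(outlier_rows)
```

Running the same mask on the real `dataset` should report the 4 rows seen in the plot.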
Step 4: Remove the outliers
dataset.drop(dataset[dataset['GrLivArea'] > 4000].index, inplace=True)
dataset.shape
Now we have 1456 rows, which means we have removed the 4 outliers from our dataset.
Note: Tree-based algorithms are usually much less sensitive to outliers in the input features, so removing them matters less for such models.
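The drop call in Step 4 modifies `dataset` in place, so the original rows are gone unless the file is reloaded. As an alternative, here is a minimal sketch that keeps the original frame untouched and builds a filtered copy instead. It again uses a hypothetical toy DataFrame (`toy`, with made-up values), and the names `cleaned` and `rows_removed` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the two Ames columns; values are made up.
toy = pd.DataFrame({
    "GrLivArea": [1200, 1710, 2198, 4476, 5642],
    "SalePrice": [150000, 208500, 250000, 745000, 160000],
})

rows_before = len(toy)

# Keep every row that is NOT above the threshold; negating the mask selects
# exactly the rows that drop() would have left in place.
is_outlier = toy["GrLivArea"] > 4000
cleaned = toy[~is_outlier].reset_index(drop=True)

# Sanity check: how many rows were filtered out?
rows_removed = rows_before - len(cleaned)

print(rows_removed)
print(cleaned.shape)
```

Because `toy` itself is unchanged, the filtered and unfiltered versions can be compared side by side, for example by training a model on each.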