
Wednesday, 3 April 2019

How to visualize outliers in categorical variables using boxplots?

Outliers can distort an analysis, so they should be identified and, where appropriate, removed from a dataset. In my last post, we saw how to visualize outliers in numeric variables. In this post, we will use boxplots to visualize the outliers in the categorical variables.

Consider the Ames Housing dataset.

Step 1: Load the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Step 2: Load the dataset
dataset = pd.read_csv("C:/datasets/train.csv")

Step 3: Create boxplots for all the categorical variables

First separate out all the categorical variables from the dataset, and then draw a boxplot of SalePrice for each variable.

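def boxplot(x, y, **kwargs):
    # Draw one boxplot per facet and rotate the category labels so they stay readable
    sns.boxplot(x=x, y=y)
    plt.xticks(rotation=90)

# Categorical columns are the ones pandas stores with dtype 'object'
cat_vars = [f for f in dataset.columns if dataset.dtypes[f] == 'object']
# Reshape to long format: one row per (SalePrice, variable, value)
p = pd.melt(dataset, id_vars='SalePrice', value_vars=cat_vars)
# 'height' replaces the older 'size' argument, which was renamed in seaborn 0.9
g = sns.FacetGrid(p, col='variable', col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(boxplot, 'value', 'SalePrice')
plt.show()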
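To see what the pd.melt call hands to FacetGrid, here is a minimal sketch on a tiny made-up table. The rows and the two column names picked as categorical are illustrative assumptions, not values read from train.csv.

```python
import pandas as pd

# Toy stand-in for the housing data (hypothetical values, not from train.csv)
toy = pd.DataFrame({
    "SalePrice": [200000, 150000, 320000],
    "Street": ["Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "Reg"],
    "LotArea": [8450, 9600, 11250],
})

# Categorical columns are listed by hand here to keep the sketch short
toy_cat_vars = ["Street", "LotShape"]

# Long format: each categorical column is stacked into 'variable'/'value' pairs,
# while SalePrice is repeated alongside every pair
long_df = pd.melt(toy, id_vars="SalePrice", value_vars=toy_cat_vars)
print(long_df)
```

Each original row appears once per categorical column, so FacetGrid can split on 'variable' and plot 'value' against 'SalePrice' in every facet. The numeric column LotArea is dropped because it is neither an id variable nor a value variable.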

This draws 43 plots, one per categorical variable, each showing the outliers in SalePrice within the categories of that variable. Examine each plot carefully and decide which outliers to remove.
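One possible way to act on what the plots show is to flag the same points in code. The sketch below assumes the usual boxplot convention, where a point is an outlier if it lies more than 1.5 times the interquartile range beyond the quartiles of its category. The helper name flag_boxplot_outliers and the toy prices are hypothetical, and whether a flagged row should really be dropped is still a judgment call.

```python
import pandas as pd

def flag_boxplot_outliers(df, category, target, whis=1.5):
    """Return a boolean Series that is True where `target` falls outside
    the whiskers (quartiles -/+ whis * IQR) of its own `category` group."""
    grouped = df.groupby(category)[target]
    # transform broadcasts each group's quartile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (df[target] < q1 - whis * iqr) | (df[target] > q3 + whis * iqr)

# Toy data (hypothetical values): one 'Pave' price sits far above the rest
toy = pd.DataFrame({
    "Street": ["Pave"] * 5 + ["Grvl"] * 3,
    "SalePrice": [100, 110, 120, 130, 1000, 50, 55, 60],
})

mask = flag_boxplot_outliers(toy, "Street", "SalePrice")
cleaned = toy[~mask]  # keep only the rows inside the whiskers
print(toy[mask])
```

On the real data the same call could be repeated for each categorical column, but removing rows for every variable in turn can discard a lot of data, so it is worth checking how many rows each pass flags first.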


About the Author

I have more than 10 years of experience in the IT industry.

I am currently experimenting with neural networks in deep learning. I am learning Python, TensorFlow and Keras.

Author: I am an author of a book on deep learning.

Quiz: I run an online quiz on machine learning and deep learning.