Wednesday, 3 April 2019

How to visualize skewness of numeric variables by plotting histograms?

It is important to reduce the skewness of variables before applying many Machine Learning algorithms. Skewed variables often contain outliers, which should be dealt with; otherwise the accuracy of the model can be adversely affected.
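To make the idea of skewness concrete, here is a minimal sketch that compares a symmetric sample with a right-skewed one using pandas' `Series.skew()`. The two samples are synthetic and purely illustrative; they are not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic samples (illustrative only, not real housing data)
rng = np.random.default_rng(0)
symmetric = pd.Series(rng.normal(loc=50, scale=5, size=10_000))
right_skewed = pd.Series(rng.lognormal(mean=3, sigma=1, size=10_000))

# Series.skew() returns the sample skewness:
# close to 0 for a symmetric sample, positive for a long right tail
skew_symmetric = symmetric.skew()
skew_right = right_skewed.skew()

print(f"symmetric sample skewness:    {skew_symmetric:.2f}")
print(f"right-skewed sample skewness: {skew_right:.2f}")
```

A value near zero suggests a roughly symmetric distribution, while a large positive value points to a long right tail, which is where extreme values tend to sit.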

Let's draw a distribution plot for each numeric variable and examine its skewness.

Consider the Ames Housing dataset.

Step 1: Load the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Step 2: Load the dataset
dataset = pd.read_csv("C:/datasets/train.csv")

Step 3: Create histograms for all the numeric variables

First, separate out all the numeric variables from the dataset. Remove the Id column and then draw the distribution plots.

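# Keep only the numeric columns (every column whose dtype is not 'object')
num_vars = [f for f in dataset.columns if dataset.dtypes[f] != 'object']
# Id is just a row identifier, so drop it
num_vars.remove('Id')
# Reshape to long format: one row per (variable, value) pair
nd = pd.melt(dataset, value_vars=num_vars)
# One subplot per variable, each with its own x and y scale
n1 = sns.FacetGrid(nd, col='variable', col_wrap=4, sharex=False, sharey=False)
# histplot with kde=True replaces distplot, which is deprecated since seaborn 0.11
n1 = n1.map(sns.histplot, 'value', kde=True)
plt.show()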

This draws 37 plots, one per numeric variable, each showing the skewness of that variable. Examine each graph carefully and try to remove the outliers it reveals. One way to reduce the skewness of a variable is a log transformation. I have written a detailed article on log transformation in this post.
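As a rough sketch of that idea, the snippet below ranks columns by skewness and applies a log transformation to the heavily skewed ones. The `toy` DataFrame, its column names, and the `threshold` value are assumptions made for this illustration; with the real data you would use the numeric columns of `dataset` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the numeric columns of the dataset
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "area": rng.lognormal(mean=9, sigma=0.8, size=2_000),       # long right tail
    "quality": rng.integers(1, 11, size=2_000).astype(float),   # roughly flat
})

# Rank columns by skewness, most skewed first
skew_before = toy.skew().sort_values(ascending=False)

# Arbitrary cutoff chosen for this sketch; tune it for your own data
threshold = 0.75
skewed_cols = skew_before[skew_before.abs() > threshold].index

# log1p computes log(1 + x), which stays defined when a value is 0
transformed = toy.copy()
transformed[skewed_cols] = np.log1p(transformed[skewed_cols])
skew_after = transformed.skew()

print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```

Note that a log transformation only makes sense for non-negative, right-skewed values; columns with negative values or a left skew would need a different treatment.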

Related: What are Outliers? How to find and remove outliers using JointPlot in Seaborn Library?
