
Friday, 5 April 2019

What is Skewness? How to visualize it with a histogram and how to remove it?

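Skewness is a measure of the asymmetry of a variable's distribution. It can be positive (right-skewed, with a long tail toward large values), negative (left-skewed, with a long tail toward small values), or zero (symmetric). Ideally a variable should have skewness close to zero. A large absolute skewness usually means a long tail of extreme values on one side, and those values often show up as outliers.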
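
To make the sign convention concrete, here is a small sketch that measures skewness with pandas' `Series.skew()`. The three samples are synthetic and exist only for illustration; they are not part of the loan dataset used later in this post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Three synthetic samples (illustrative only)
right_skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))  # long right tail
left_skewed = -right_skewed                                        # mirror image: long left tail
symmetric = pd.Series(rng.normal(size=10_000))                     # roughly symmetric

for name, s in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    # Positive value: right-skewed, negative: left-skewed, near zero: symmetric
    print(name, round(s.skew(), 2))
```

The first sample should report a clearly positive value, the second a clearly negative one, and the third a value close to zero.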

How to remove skewness from variables? 

Our aim should be to have near-zero skewness in the variables of our dataset. Taking the log of a right-skewed variable with positive values often reduces its skewness considerably. Let's see how to do that.

Consider a Loan Prediction dataset. We will analyze the skewness of the LoanAmount variable.

Step 1: Import the required libraries

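import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # the plotting API lives in matplotlib.pyplot
# Jupyter magic: show plots inside the notebook
%matplotlib inline
import seaborn as sns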

Step 2: Load the dataset

dataset = pd.read_csv("C:/train_loan_prediction.csv")

Step 3: Draw a histogram of the LoanAmount variable with 20 bins

dataset['LoanAmount'].hist(bins=20)

Step 4: Create a new variable by taking the log of the LoanAmount variable

dataset['LoanAmount_Log'] = np.log(dataset['LoanAmount'])
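
One caveat with a plain log: it is undefined at zero. If a column contains zeros, `np.log` returns `-inf` for them, while missing values simply stay missing. The sketch below uses a made-up series (not the loan data) to show the difference between `np.log` and `np.log1p`, which computes log(1 + x). Whether LoanAmount needs this treatment depends on the data, so treat it as an option rather than a required step.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts, including a zero and a missing value
amounts = pd.Series([0.0, 50.0, 120.0, np.nan, 700.0])

# np.log(0) evaluates to -inf; silence NumPy's divide-by-zero warning for the demo
with np.errstate(divide="ignore"):
    plain_log = np.log(amounts)

# log(1 + x) maps 0 to 0 and leaves NaN as NaN
shifted_log = np.log1p(amounts)

print(plain_log.tolist())
print(shifted_log.tolist())
```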

Step 5: Draw a histogram of the newly created variable

dataset['LoanAmount_Log'].hist(bins=20)
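
Besides comparing the two histograms by eye, you can put a number on the change with `Series.skew()`. The sketch below uses a synthetic log-normal column as a stand-in for LoanAmount (an assumption made so that the example runs without the CSV file); with the real data you would call `.skew()` on `dataset['LoanAmount']` and `dataset['LoanAmount_Log']` instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for a right-skewed amount column (log-normal by construction)
df = pd.DataFrame({"LoanAmount": rng.lognormal(mean=5.0, sigma=0.5, size=5_000)})
df["LoanAmount_Log"] = np.log(df["LoanAmount"])

skew_before = df["LoanAmount"].skew()
skew_after = df["LoanAmount_Log"].skew()
print(f"skewness before log: {skew_before:.2f}, after log: {skew_after:.2f}")
```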

We can see that the distribution of the values in the LoanAmount_Log variable is much more symmetrical and closer to a normal distribution, with skewness near zero. In the same way, you should check the skewness of the other numeric variables and reduce it where needed.
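
Checking every variable by hand gets tedious, so here is a minimal sketch of how that check could be automated. The helper `log_transform_skewed`, the threshold of 1.0, and the demo columns are all assumptions made for illustration; they are not part of the original walkthrough, and the threshold is only a rule of thumb.

```python
import numpy as np
import pandas as pd

def log_transform_skewed(df, threshold=1.0):
    """Return a copy of df with a log1p column for each strongly right-skewed numeric column."""
    out = df.copy()
    skews = out.select_dtypes(include="number").skew()
    for col, value in skews.items():
        # Only right-skewed, non-negative columns are candidates for a log transform
        if value > threshold and (out[col].dropna() >= 0).all():
            out[col + "_Log"] = np.log1p(out[col])
    return out

rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "income": rng.lognormal(mean=8.0, sigma=1.0, size=2_000),  # strongly right-skewed
    "age": rng.normal(loc=40, scale=10, size=2_000),           # roughly symmetric
})

result = log_transform_skewed(demo)
print(result.skew().round(2))
```

In this demo only `income` gets a companion `income_Log` column, because `age` is already close to symmetric. The helper uses `np.log1p` rather than `np.log` so that zeros do not turn into `-inf`.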
