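Skewness is a measure of the asymmetry of a variable's distribution. It can be positive (right-skewed, with a longer tail toward high values), negative (left-skewed, with a longer tail toward low values), or zero (symmetric). Many statistical methods work best when the skewness of a variable is close to zero. A large absolute skewness usually means a long tail of extreme values, which often show up as outliers.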
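As a rough illustration of the three cases, skewness can be computed numerically with the pandas Series.skew() method. The data below is synthetic and generated only for this sketch; the variable names are made up and are not part of the loan dataset.

```python
import numpy as np
import pandas as pd

# Synthetic samples (assumption: illustrative data only, not the loan dataset)
rng = np.random.default_rng(0)
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))
right_skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))
left_skewed = -right_skewed  # mirroring the values flips the sign of the skewness

# Series.skew() returns the sample skewness of the values
skew_symmetric = symmetric.skew()
skew_right = right_skewed.skew()
skew_left = left_skewed.skew()

print("symmetric:   ", skew_symmetric)
print("right skewed:", skew_right)
print("left skewed: ", skew_left)
```

The symmetric sample should give a value close to zero, while the other two should give clearly positive and clearly negative values.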
How to remove skewness from variables?
Our aim should be to bring the skewness of the variables in the dataset close to zero. Taking the log of a right-skewed variable with strictly positive values often reduces its skewness considerably. Let's see how to do that.
Consider a Loan Prediction dataset. We will analyze the skewness of the LoanAmount variable.
Step 1: Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
Step 2: Load the dataset
dataset = pd.read_csv("C:/train_loan_prediction.csv")
Step 3: Draw histogram of LoanAmount variable with 20 bins
dataset['LoanAmount'].hist(bins=20)
Step 4: Create a new variable by taking log of LoanAmount variable
dataset['LoanAmount_Log'] = np.log(dataset['LoanAmount'])
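One caveat with this step: np.log returns -inf for a value of zero, and missing values stay missing. A possible workaround, shown here as a sketch and not as part of the original steps, is np.log1p, which computes log(1 + x). The values below are hypothetical stand-ins for LoanAmount.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for LoanAmount, including a zero and a missing entry
loan_amount = pd.Series([0.0, 50.0, 120.0, np.nan, 700.0])

# log1p(x) = log(1 + x), so a zero maps to 0.0 instead of -inf; NaN stays NaN
loan_amount_log1p = np.log1p(loan_amount)
print(loan_amount_log1p)
```

Whether log or log1p is the better choice depends on whether the variable can actually contain zeros.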
Step 5: Draw histogram of newly created variable
dataset['LoanAmount_Log'].hist(bins=20)
We can see that the distribution of the values in the LoanAmount_Log variable is much more symmetrical and closer to normal, with skewness near zero. In the same way, you should check the skewness of the other numeric variables in the dataset and reduce it where needed.
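Besides reading the histograms, the change can be checked numerically. The sketch below assumes a synthetic, log-normally distributed LoanAmount column as a stand-in, since the CSV file is not available here; with the real data, the same skew() calls would be run on the loaded dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the LoanAmount column (assumption: right-skewed, strictly positive)
rng = np.random.default_rng(42)
dataset = pd.DataFrame({"LoanAmount": rng.lognormal(mean=4.8, sigma=0.5, size=600)})

# Same transformation as in Step 4
dataset["LoanAmount_Log"] = np.log(dataset["LoanAmount"])

# Compare skewness before and after the log transform
skew_before = dataset["LoanAmount"].skew()
skew_after = dataset["LoanAmount_Log"].skew()
print("before:", skew_before)
print("after: ", skew_after)

# Skewness of every numeric column at once
print(dataset.skew(numeric_only=True))
```

A value much closer to zero after the transform supports what the second histogram shows.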
Related: Log Transforming the Skewed Data to get Normal Distribution