How to remove skewness from variables?
Our aim should be to have near-zero skewness in the variables of our dataset. Taking the log of a right-skewed variable with strictly positive values often reduces its skewness considerably. So, let's see how to do that.
Consider a Loan Prediction dataset. We will analyze the skewness of the LoanAmount variable.
Step 1: Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Load the dataset
dataset = pd.read_csv("C:/train_loan_prediction.csv")
Step 3: Draw histogram of LoanAmount variable with 20 bins
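A minimal sketch of this step, assuming pandas' built-in Series.hist method is used for the plot. The synthetic loan_amount series is a hypothetical stand-in (not the real data) so that the snippet runs without the CSV file; with the loaded dataset, the equivalent call would be dataset['LoanAmount'].hist(bins=20).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for dataset['LoanAmount']: synthetic right-skewed values,
# used only so the snippet runs without the CSV file.
rng = np.random.default_rng(0)
loan_amount = pd.Series(rng.lognormal(mean=4.8, sigma=0.5, size=500), name="LoanAmount")

# Histogram with 20 bins; Series.hist returns the matplotlib Axes it drew on.
ax = loan_amount.hist(bins=20)
ax.set_xlabel("LoanAmount")
ax.set_ylabel("Frequency")
# plt.show()  # uncomment when running as a script
```

For a right-skewed variable, most bars should cluster on the left with a long tail stretching to the right.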
Step 4: Create a new variable by taking log of LoanAmount variable
dataset['LoanAmount_Log'] = np.log(dataset['LoanAmount'])
Step 5: Draw histogram of newly created variable
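One possible way to draw this histogram is to place it next to the original one so the change in shape is easy to compare. This is only a sketch: the DataFrame df holds synthetic, hypothetical values in place of the real dataset, and the side-by-side layout is an illustrative choice, not something the steps above require.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the loan dataset (synthetic right-skewed values).
rng = np.random.default_rng(0)
df = pd.DataFrame({"LoanAmount": rng.lognormal(mean=4.8, sigma=0.5, size=500)})
df["LoanAmount_Log"] = np.log(df["LoanAmount"])

# Original and log-transformed histograms side by side, 20 bins each.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df["LoanAmount"].hist(bins=20, ax=axes[0])
axes[0].set_title("LoanAmount")
df["LoanAmount_Log"].hist(bins=20, ax=axes[1])
axes[1].set_title("LoanAmount_Log")
fig.tight_layout()
# plt.show()  # uncomment when running as a script
```

With the real data, replace df with dataset; the column names are the same as in Step 4.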
We can see that the distribution of the values in the LoanAmount_Log variable is much closer to normal and roughly symmetrical, with skewness near zero. In this way, you should check the skewness of every numeric variable in the dataset and reduce it where needed.
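A histogram gives only a visual impression, so it can help to confirm the result numerically. The sketch below uses pandas' Series.skew(), which returns the sample skewness of a column. The data is again a synthetic, hypothetical stand-in, so the printed numbers will differ from those of the real LoanAmount column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loan dataset (synthetic right-skewed values).
rng = np.random.default_rng(0)
df = pd.DataFrame({"LoanAmount": rng.lognormal(mean=4.8, sigma=0.5, size=500)})
df["LoanAmount_Log"] = np.log(df["LoanAmount"])

# Sample skewness: values near 0 suggest a roughly symmetrical distribution,
# positive values indicate a longer right tail.
skew_before = df["LoanAmount"].skew()
skew_after = df["LoanAmount_Log"].skew()
print(f"Skewness before log: {skew_before:.2f}")
print(f"Skewness after log:  {skew_after:.2f}")
```

To check every numeric column at once, dataset.skew(numeric_only=True) returns the skewness of each one.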
Related: Log Transforming the Skewed Data to get Normal Distribution