Skewness is a measure of the asymmetry in a variable. It can be
**positive (right skewed)**, **negative (left skewed)**, or zero. Ideally, a variable's skewness should be close to zero. The larger the absolute skewness, the longer one tail of the distribution, which often points to outliers on that side.
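To make the three cases concrete, here is a minimal sketch that measures skewness with the pandas `Series.skew()` method. The samples are synthetic and purely illustrative (they are not taken from the loan dataset), and the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples, for illustration only
right_skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))   # long right tail
left_skewed = -right_skewed                                          # mirrored, long left tail
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))   # roughly symmetric

# Positive value: right skewed, negative value: left skewed, near zero: symmetric
print("right skewed:", right_skewed.skew())
print("left skewed: ", left_skewed.skew())
print("symmetric:   ", symmetric.skew())
```

The sign of the result tells you the direction of the skew, and its magnitude tells you how strong it is.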

**How to remove skewness from variables?**

Our aim should be to have near-zero skewness in the variables of our dataset. Taking the log of a right-skewed variable with positive values often reduces its skewness considerably. So, let's see how to do that.

Consider a Loan Prediction dataset. We will analyze the skewness of the **LoanAmount** variable.

**Step 1: Import the required libraries**

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

import seaborn as sns

**Step 2: Load the dataset**

dataset = pd.read_csv("C:/train_loan_prediction.csv")

**Step 3: Draw histogram of LoanAmount variable with 20 bins**

dataset['LoanAmount'].hist(bins=20)

**Step 4: Create a new variable by taking log of LoanAmount variable**

dataset['LoanAmount_Log'] = np.log(dataset['LoanAmount'])

**Step 5: Draw histogram of newly created variable**

dataset['LoanAmount_Log'].hist(bins=20)
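A histogram shows the change visually, and the effect can also be checked numerically with `Series.skew()`. The sketch below uses a synthetic, right-skewed series as a hypothetical stand-in for `dataset['LoanAmount']`, since the CSV file is not available here. With the real data you would call `.skew()` on the two columns directly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in for dataset['LoanAmount']: positive, right-skewed values
loan_amount = pd.Series(rng.lognormal(mean=4.8, sigma=0.5, size=600), name="LoanAmount")

# Same transform as Step 4
loan_amount_log = np.log(loan_amount)

skew_before = loan_amount.skew()
skew_after = loan_amount_log.skew()
print(f"skewness before log: {skew_before:.2f}")
print(f"skewness after log:  {skew_after:.2f}")
```

Note that `np.log` is only defined for positive values. Missing values stay missing after the transform, and `skew()` ignores them by default.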

We can see that the distribution of the values in the **LoanAmount_Log** variable is approximately normal and symmetrical, with skewness close to zero. In this way, you should check the skewness of all the variables and reduce it where needed.

**Related**: Log Transforming the Skewed Data to get Normal Distribution
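Checking every variable by hand gets tedious, so one possible way to automate it is sketched below. The helper name `log_transform_skewed`, the threshold of 1.0, and the demo columns are all assumptions made for this example, not part of the original steps. It also swaps in `np.log1p` (log of 1 + x) instead of `np.log`, so that zero values do not produce negative infinity.

```python
import numpy as np
import pandas as pd

def log_transform_skewed(df, threshold=1.0):
    """Return a copy of df with a '<col>_Log' column added for each numeric,
    non-negative column whose skewness is above `threshold` (hypothetical helper)."""
    out = df.copy()
    skewness = df.select_dtypes(include="number").skew()
    for col, value in skewness.items():
        # Only transform strongly right-skewed columns without negative values
        if value > threshold and (df[col].dropna() >= 0).all():
            out[f"{col}_Log"] = np.log1p(df[col])  # log(1 + x), defined at x = 0
    return out

# Synthetic demo data, for illustration only
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Skewed": rng.lognormal(mean=5.0, sigma=0.8, size=500),
    "Symmetric": rng.normal(loc=100.0, scale=10.0, size=500),
})

result = log_transform_skewed(demo)
print(result.skew())
```

Only the strongly right-skewed column gets a `_Log` counterpart, while the roughly symmetric column is left alone. The threshold is a judgment call and should be tuned to your data.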