Saturday, 6 April 2019

Frequency Table: How to use pandas value_counts() function to impute missing values?

The value_counts() function is part of the pandas library and is very useful in the data wrangling step. It computes a frequency table, which lets you analyze the frequency distribution of the values in a variable.

value_counts() returns a Series containing the counts of unique values. The result is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.
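As a quick illustration of this behaviour, here is a minimal sketch on a made-up Series (the values are hypothetical and not taken from the loan dataset). It shows the default call, plus the dropna and normalize parameters of value_counts():

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not taken from the loan dataset
s = pd.Series(["No", "Yes", "No", np.nan, "No", "Yes", np.nan])

counts = s.value_counts()                      # NaN excluded, sorted descending
counts_with_na = s.value_counts(dropna=False)  # NaN counted as its own category
shares = s.value_counts(normalize=True)        # relative frequencies, not counts

print(counts)
print(counts_with_na)
print(shares)
```

Passing dropna=False is a convenient way to see the missing values next to the real categories in a single table.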

This makes the pandas value_counts() function very useful for imputing missing values in a categorical column, because it shows which value dominates.

Consider a Loan Prediction dataset. We will try to impute the missing values in the Self_Employed variable.

Step 1: Import the required libraries

import pandas as pd
import numpy as np

Step 2: Load the dataset

dataset = pd.read_csv("C:/train_loan_prediction.csv")
dataset.shape
Output: (614, 13)

So, this dataset has 614 observations. The Self_Employed column contains either "Yes" or "No". Let's see how many missing values there are in this column:

dataset['Self_Employed'].isnull().sum()
Output: 32

So, this column contains 32 missing observations (out of 614 observations in total). Now let's use the pandas value_counts() function to count the number of "Yes" and "No" values:

dataset['Self_Employed'].value_counts()
No     500
Yes     82

We can clearly see that out of the 582 non-missing observations, 500 are "No", which is around 86%. So we can reasonably impute "No" for the 32 missing values.

dataset['Self_Employed'] = dataset['Self_Employed'].fillna('No')

So, in this way, we can use value_counts to impute missing values.
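The same idea can be written without hard-coding "No". The sketch below is only an illustration under assumptions: the small DataFrame is a made-up stand-in for the loan data, and impute_with_mode is a hypothetical helper name, not a pandas function. It picks the most frequent value from value_counts() and fills the gaps with it:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loan data; the real CSV is not loaded here
df = pd.DataFrame({"Self_Employed": ["No", "No", np.nan, "Yes", "No", np.nan]})

def impute_with_mode(frame, column):
    """Fill missing values in `column` with its most frequent value."""
    counts = frame[column].value_counts()    # NaN excluded by default
    most_frequent = counts.idxmax()          # label with the highest count
    share = counts.max() / counts.sum()      # its share among non-missing values
    filled = frame.copy()                    # leave the input frame untouched
    filled[column] = filled[column].fillna(most_frequent)
    return filled, most_frequent, share

filled, value, share = impute_with_mode(df, "Self_Employed")
print(value, share)
print(filled["Self_Employed"].isnull().sum())
```

Returning the share alongside the filled frame makes it easy to check whether one value really dominates before trusting the imputation. If the categories are close to evenly split, filling with the most frequent value may distort the distribution.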

