Saturday, 6 April 2019

Frequency Table: How to use pandas value_counts() function to impute missing values?

The value_counts() function in the pandas library is very useful in the data wrangling step. It summarizes the frequency distribution of the values in a variable as a frequency table.

value_counts() returns a Series containing the counts of unique values. The result is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.
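
As a minimal sketch of this behavior, the snippet below runs value_counts() on a small made-up Series (the values are an illustration only, not the loan data used later in this post). It also uses the dropna=False parameter, which keeps missing values as their own row instead of excluding them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, invented for illustration.
s = pd.Series(["No", "Yes", "No", np.nan, "No", "Yes", np.nan])

# Default: NaN is excluded and the result is sorted by count, descending.
counts = s.value_counts()

# dropna=False keeps NaN as its own row in the frequency table.
counts_with_na = s.value_counts(dropna=False)

print(counts)
print(counts_with_na)
```

The first index label of counts is the most frequent value, which is what makes the function convenient for choosing a fill value.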

This makes value_counts() a handy way to choose a fill value when imputing missing values in a categorical column.

Consider a Loan Prediction dataset. We will try to impute missing values in the Self_Employed variable.

Step 1: Import the required libraries

import pandas as pd
import numpy as np

Step 2: Load the dataset

dataset = pd.read_csv("C:/train_loan_prediction.csv")
dataset.shape
Output: (614, 13)

So, this dataset has 614 observations. The Self_Employed column contains either "Yes" or "No". Let's see how many missing values there are in this column:

dataset["Self_Employed"].isnull().sum()
Output: 32

So, this column contains 32 missing observations (out of 614 in total). Now let's use the pandas value_counts() function to count the number of "Yes" and "No" values.

dataset['Self_Employed'].value_counts()
Output:
No     500
Yes     82

We can clearly see that out of the 582 non-missing observations, 500 are "No", which is around 86%. Since "No" is by far the most common value, we can reasonably impute "No" for the 32 missing values.
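
The 86% figure can also be read off directly with the normalize=True parameter of value_counts(). The sketch below does not load the CSV. Instead it assumes a stand-in column rebuilt from the counts reported above (500 "No", 82 "Yes", 32 missing), so the variable names col and shares are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in column rebuilt from the counts reported in this post;
# the real data would come from dataset["Self_Employed"].
col = pd.Series(["No"] * 500 + ["Yes"] * 82 + [np.nan] * 32, name="Self_Employed")

# normalize=True returns fractions of the non-missing values instead of counts.
shares = col.value_counts(normalize=True)

print(shares.round(3))
```

Because missing values are excluded by default, the fractions are computed over the 582 non-missing rows, not all 614.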

dataset['Self_Employed'] = dataset['Self_Employed'].fillna('No')
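
The fill value does not have to be typed by hand. As a hedged sketch, the most frequent value can be taken from value_counts() itself and passed to fillna(). The toy DataFrame df and the variable most_common below are assumptions made for illustration; only the column name comes from this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column name mirrors the loan dataset.
df = pd.DataFrame({"Self_Employed": ["No", "Yes", "No", np.nan, "No", np.nan]})

# idxmax() returns the index label with the highest count, i.e. the mode.
most_common = df["Self_Employed"].value_counts().idxmax()

# Assign the result back instead of using inplace=True on a column selection.
df["Self_Employed"] = df["Self_Employed"].fillna(most_common)

print(df["Self_Employed"].value_counts())
```

Assigning the filled column back avoids relying on in-place modification of a column selection, which newer pandas versions warn about.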

So, in this way, we can use value_counts to impute missing values.
