The value_counts() function in the pandas library is very useful in the data wrangling step. It is used to analyze the frequency distribution of the values in a variable by building a frequency table.
value_counts() returns a Series containing the counts of unique values. The result is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.
So, value_counts() is a handy way to find the most frequent value in a column, which we can then use to impute the missing values.
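To make the sorting and NA handling concrete, here is a minimal sketch on a small made-up column (the `answers` Series is hypothetical and not part of the loan dataset). It shows the default behaviour and the `dropna=False` option, which counts missing values as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, invented for illustration only
answers = pd.Series(["No", "Yes", "No", np.nan, "No", "Yes"])

# Default: NaN is excluded and the most frequent value comes first
counts = answers.value_counts()

# dropna=False also reports how many values are missing
counts_with_na = answers.value_counts(dropna=False)

print(counts)
print(counts_with_na)
```

Here the first entry of `counts` is "No" with a count of 3, and `counts_with_na` has one extra row for the single missing value.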
Consider a Loan Prediction dataset. We will try to impute the missing values in the Self_Employed variable.
Step 1: Import the required libraries
import pandas as pd
import numpy as np
Step 2: Load the dataset
dataset = pd.read_csv("C:/train_loan_prediction.csv")
dataset.shape
Output: (614, 13)
So, this dataset has 614 observations. The Self_Employed column contains either "Yes" or "No". Let's see how many missing values there are in this column:
dataset["Self_Employed"].isnull().sum()
Output: 32
So, this column contains 32 missing observations (out of 614 in total). Now let's use the pandas value_counts() function to count the number of "Yes" and "No" values.
dataset['Self_Employed'].value_counts()
Output:
No 500
Yes 82
We can clearly see that of the 582 non-missing observations, 500 are "No", which is around 86%. Since "No" is by far the most common value, we can simply impute "No" for the 32 missing values.
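The 86% figure can also be obtained directly with `normalize=True`, which returns proportions instead of counts. The sketch below does not read the CSV file; it assumes a synthetic Series (`self_employed`, a hypothetical name) rebuilt from the counts reported above: 500 "No", 82 "Yes" and 32 missing.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dataset["Self_Employed"], built from the reported counts
self_employed = pd.Series(["No"] * 500 + ["Yes"] * 82 + [np.nan] * 32)

# normalize=True gives each value's share of the non-missing observations
shares = self_employed.value_counts(normalize=True)

# Share of "No" as a percentage, rounded to one decimal place (500 / 582)
no_share = round(shares["No"] * 100, 1)
print(no_share)
```

Because missing values are excluded by default, the shares are computed over the 582 non-missing observations, not over all 614 rows.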
dataset['Self_Employed'] = dataset['Self_Employed'].fillna('No')
So, in this way, we can use value_counts() to choose the value for imputing missing values.
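The same idea can be wrapped in a small helper so the most frequent value does not have to be typed in by hand. This is only a sketch: the function name `impute_with_most_frequent` and the toy DataFrame are hypothetical, and it assumes the column has at least one non-missing value (otherwise `index[0]` raises an IndexError).

```python
import numpy as np
import pandas as pd

def impute_with_most_frequent(df, column):
    """Return a copy of df with missing values in `column` replaced by its most frequent value."""
    # value_counts() sorts in descending order, so index[0] is the most frequent value
    most_frequent = df[column].value_counts().index[0]
    result = df.copy()
    result[column] = result[column].fillna(most_frequent)
    return result

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({"Self_Employed": ["No", "Yes", np.nan, "No", np.nan]})
filled = impute_with_most_frequent(toy, "Self_Employed")
print(filled)
```

The helper returns a new DataFrame and leaves the original unchanged, which makes it easy to compare the column before and after imputation.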