Pages

Sunday, 17 March 2019

Data Wrangling: How to handle missing values in numeric column in a dataset?

Handling missing values in a dataset is a very common practice in data wrangling and there are various ways to deal with it. Today, we will see how to handle missing values in the numeric column? For this, we will consider the "Age" column (which contains numeric values) from the Titanic dataset. 

Note: Please go through my previous post where I have loaded this Titanic dataset into pandas data frame.

Lets see how many missing values are there in the "Age" column of Titanic dataset?

dataset['Age'].isnull().sum()

There are 177 missing values out of 891 observations. Now how to handle these 177 missing values? 

The general method to handle such kind of scenarios is to replace the missing values with some meaningful value. This meaningful values can be obtained by taking the mean, median or mode of all the not null values in the "Age" column. This is a statistical approach of handling the missing values and is well suited for linear data. 

Step 1: Calculate mean, median and mode 

mean_age = dataset['Age'].mean()
median_age = dataset['Age'].median()
mode_age = dataset['Age'].mode()
display('Mean Age: ' + str(mean_age))
display('Median Age: ' + str(median_age))
display('Mode Age: ' + str(mode_age))

Output

'Mean Age: 29.69911764705882'
'Median Age: 28.0'
'Mode Age: 0    24.0\ndtype: float64'

Step 2: Replace the missing values 

Replace the missing values in the "Age" column with any of the above calculated values. In this case, I am going to replace the missing values with the mean value.

dataset['Age'].replace(np.NaN, mean_age)

Please note that, this is just an approximation of the missing values and it may lead to variance in the prediction but we have to deal with it. There is no way around. But this approach is far better than dropping the "Age" column due to which we will lose a lot of significant data.

No comments:

Post a Comment