
Sunday, 17 March 2019

Data Wrangling: How to handle missing values in a categorical column of a dataset?

In my previous article, we saw how to impute missing values in numeric columns. Today, we will see how to impute missing values in categorical columns.

Again, we will take the Titanic dataset as an example. There are two categorical columns (Cabin and Embarked) in the Titanic dataset which have missing values. Cabin has 687 missing values (out of 891), so it's better to drop this column, as more than 77% of its values are null. So, let's concentrate on the Embarked column, which has only 2 missing values.
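
The check described above can be sketched roughly as follows. This is only an illustration: the small DataFrame is a made-up stand-in for the Titanic data, and the 0.5 threshold is an assumed cut-off, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; replace with your own DataFrame.
dataset = pd.DataFrame({
    "Cabin": ["C85", np.nan, np.nan, "C123", np.nan],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
})

# Number and share of missing values per column.
missing_count = dataset.isnull().sum()
missing_share = dataset.isnull().mean()
print(missing_count)
print(missing_share)

# Drop columns whose share of missing values exceeds the chosen threshold.
threshold = 0.5  # illustrative cut-off (assumption)
columns_to_drop = missing_share[missing_share > threshold].index
reduced = dataset.drop(columns=columns_to_drop)
print(reduced.columns.tolist())
```

On the real dataset, the same two calls (isnull().sum() and isnull().mean()) should show which columns are worth keeping.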

In categorical columns, a common approach is to introduce a new category, usually called "Unknown", to stand in for the missing values. As this column has the categories 'S', 'C' and 'Q', let's impute 'U' (Unknown) as a new category for the 2 missing values.

dataset['Embarked'] = dataset['Embarked'].fillna('U')
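
Keep in mind that fillna returns a new Series and does not change the column in place, so the result has to be assigned back to the DataFrame. Here is a minimal sketch on a made-up column (the sample values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (illustrative data, not the real Titanic file).
dataset = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q", np.nan]})

# fillna returns a new Series, so assign the result back to the column.
dataset["Embarked"] = dataset["Embarked"].fillna("U")

# Confirm that no missing values remain and that 'U' now appears as a category.
print(dataset["Embarked"].isnull().sum())
print(dataset["Embarked"].value_counts())
```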

If you think you should not drop the "Cabin" column, you can try imputing the missing cabin values in the same way we did for the "Embarked" column. It's up to you. Please note that introducing a new category ("Unknown") which is not part of the original dataset may introduce additional variance in the predictions.

As it is a categorical variable, the next step is to one hot encode this column. I have already explained this step here. You can also use the get_dummies method of Pandas to one hot encode this categorical variable. So, before one hot encoding any categorical column, you should check it for missing values and, if any are found, impute them, for example with the unknown category.
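
The order of the two steps can be sketched like this, again on a made-up column. By default get_dummies does not create a column for NaN (dummy_na=False), so a row that is still missing would simply get zeros in every dummy column, which is why imputing first matters.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (illustrative data).
dataset = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q"]})

# Step 1: impute the missing value with the 'U' (Unknown) category.
dataset["Embarked"] = dataset["Embarked"].fillna("U")

# Step 2: one hot encode the column; each category becomes its own column.
encoded = pd.get_dummies(dataset, columns=["Embarked"], prefix="Embarked")
print(encoded)
```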

One more thing to note: instead of creating a new category, you can also impute missing values with the most frequently occurring category in that variable.
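
A minimal sketch of this alternative, assuming a made-up column in which 'S' is the most frequent category. Series.mode ignores missing values by default and can return several values when there is a tie, so the first one is taken here.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (illustrative data).
dataset = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# mode() returns a Series of the most frequent value(s); take the first one.
most_frequent = dataset["Embarked"].mode()[0]

# Replace the missing values with that category and assign the result back.
dataset["Embarked"] = dataset["Embarked"].fillna(most_frequent)
print(dataset["Embarked"].value_counts())
```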
