Saturday, 16 March 2019

Data Wrangling: Convert Categorical Variables into Dummies (Numbers: 0 and 1) in Python using Pandas Library

Categorical variables are variables whose values are categories rather than quantities. For example, consider the "Sex" column in our Titanic dataset. It is a categorical variable containing the values male and female. We need to convert these categories (male and female) into numbers (0 and 1) because most machine learning algorithms don't accept string values.

I found three categorical columns (Sex, Embarked and Pclass) in the Titanic dataset. So, let's convert them into numbers (0 and 1). In other words, let's one hot encode these variables.

Note: Before going through this article, you should read my previous article, where I loaded the Titanic dataset into a pandas data frame and removed the null values from it. This article continues from that article on removing null values from a dataset.

The pandas library in Python provides the get_dummies function, which one hot encodes categorical variables (converts them into numbers: 0 and 1). get_dummies returns a new data frame with one indicator column per category. Note that in pandas 2.0 and later the indicator values are True/False by default; pass dtype=int if you want 0 and 1.
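As a quick illustration of what get_dummies returns, here is a minimal sketch. The small `toy` data frame is made up for this example and only stands in for the "Sex" column; it is not the real Titanic data.

```python
import pandas as pd

# Made-up sample standing in for the Titanic "Sex" column (an assumption, not real data).
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# One indicator column is created per category.
# dtype=int asks for 0/1 values; pandas 2.0+ returns True/False by default.
sex_dummies = pd.get_dummies(toy["Sex"], dtype=int)

print(sex_dummies)
```

Each row has a 1 in the column matching its original category and a 0 in the other.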

Step 1: Convert categorical variables to their respective one hot encoded representation

# drop_first=True drops the first category of each column (it is implied by the others)
sex = pd.get_dummies(dataset['Sex'], drop_first=True)
embark = pd.get_dummies(dataset['Embarked'], drop_first=True)
pclass = pd.get_dummies(dataset['Pclass'], drop_first=True)

With drop_first=True, the column for the first category is dropped, because its value can be inferred from the remaining columns. If you want to keep all the newly created columns, omit the drop_first parameter. To view the converted data, use the head() method.

sex.head()
embark.head()
pclass.head()
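To see what drop_first changes, the following sketch encodes a made-up "Embarked" sample both ways. The `toy`, `full` and `reduced` names are hypothetical and used only for illustration.

```python
import pandas as pd

# Made-up sample standing in for the "Embarked" column (an assumption, not real data).
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Without drop_first: one column per category (C, Q, S).
full = pd.get_dummies(toy["Embarked"], dtype=int)

# With drop_first=True: the first category (C) is dropped.
reduced = pd.get_dummies(toy["Embarked"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

In the reduced frame, a row with 0 in both Q and S can only be category C, so no information is lost by dropping that column.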

Step 2: Concatenate all the one hot encoded columns to the original dataset

dataset = pd.concat([dataset, sex, embark, pclass], axis=1)
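pd.concat with axis=1 places the frames side by side and lines rows up by their index labels, which works here because get_dummies keeps the index of the column it encodes. The sketch below shows this on a made-up frame. The `prefix` argument is an optional addition that is not used in the steps above; it gives the Pclass indicator columns readable names instead of the bare numbers 2 and 3.

```python
import pandas as pd

# Made-up sample with a non-default index (an assumption, not real data).
toy = pd.DataFrame(
    {"Sex": ["male", "female", "male"], "Pclass": [3, 1, 2]},
    index=[10, 11, 12],
)

sex = pd.get_dummies(toy["Sex"], drop_first=True, dtype=int)
# prefix="Pclass" yields columns named Pclass_2 and Pclass_3.
pclass = pd.get_dummies(toy["Pclass"], prefix="Pclass", drop_first=True, dtype=int)

# axis=1 appends the new columns; rows are matched on index labels 10, 11, 12.
combined = pd.concat([toy, sex, pclass], axis=1)

print(combined)
```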

Step 3: Drop original columns

As we have already one hot encoded the Sex, Embarked and Pclass columns, let's drop these columns from the original dataset.

dataset.drop(['Sex', 'Embarked', 'Pclass'], axis=1, inplace=True)
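As an alternative to doing the three steps by hand, get_dummies also accepts a whole data frame together with a `columns` list. It then encodes those columns, removes the originals, and prefixes each new column with the source column name. This is a sketch of that variant on made-up data, not the approach used in this article; the `toy` frame and its values are assumptions.

```python
import pandas as pd

# Made-up sample with the three categorical columns plus one numeric column.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Pclass": [3, 1, 2],
    "Fare": [7.25, 71.28, 8.05],
})

# Encode, concatenate and drop the original columns in a single call.
encoded = pd.get_dummies(
    toy,
    columns=["Sex", "Embarked", "Pclass"],
    drop_first=True,
    dtype=int,
)

print(list(encoded.columns))
```

The untouched columns come first, followed by the indicator columns in the order given in `columns`.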

There are also some columns (like PassengerId, Name and Ticket) which are unlikely to contribute to any kind of prediction. In addition, the Name and Ticket columns contain string values. So, it's better to remove them.

dataset.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)

I will write a complete post on Label Encoding and One Hot Encoding in an upcoming article. So, stay tuned.
