Saturday, 16 March 2019

Difference between Label Encoder and One Hot Encoder in Python (Scikit Learn Library)

The scikit-learn library provides Label Encoder and One Hot Encoder. Both are used to convert categorical data into numbers: Label Encoder maps each category to an integer, and One Hot Encoder produces columns of zeros and ones. We will implement both and see the differences between them.

Consider the Titanic dataset, which has a "Sex" column containing the values "male" and "female". As these are string values, we need to convert them to numbers before using this data in any machine learning algorithm.

Note: The Pandas library has a similar function called get_dummies, which does the same thing. You can see more details on get_dummies in my earlier post.

In my previous article, I used get_dummies to generate new columns "male" and "female" containing zeros and ones. Now we will use Label Encoder and One Hot Encoder for the same purpose. Implementing both encoders will also make the difference between them clear.
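For comparison, here is a minimal sketch of the get_dummies approach. The four-row DataFrame is a made-up sample, not the real Titanic data.

```python
import pandas as pd

# Hypothetical sample of the "Sex" column (not the real Titanic rows)
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# get_dummies creates one indicator column per category, in sorted order
dummies = pd.get_dummies(df["Sex"])

print(dummies.columns.tolist())
print(dummies)
```

Each row is flagged in exactly one of the two new columns, "female" or "male".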

Step 1: Use LabelEncoder to convert "male" and "female" to zeros and ones

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
x[:, 0] = labelencoder.fit_transform(x[:, 0])

In the above line, I am assuming "Sex" is the first column in my dataset. You can change the index as per your dataset. 

After running the above code, the "Sex" column contains only zeros and ones. LabelEncoder assigns the integers in sorted order of the categories, so "female" becomes 0 and "male" becomes 1. That is all LabelEncoder does: it does not create a new column for each categorical value. For that, we need to further apply One Hot Encoder to the label encoded values.
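A minimal, self-contained sketch of this step, using a made-up four-row sample in place of the real "Sex" column:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the "Sex" column (not the real Titanic rows)
sex = np.array(["male", "female", "female", "male"])

encoder = LabelEncoder()
sex_encoded = encoder.fit_transform(sex)

print(encoder.classes_)  # the categories, in sorted order
print(sex_encoded)       # one integer per row, still a single column
```

The classes_ attribute shows which category got which integer: the position of a category in classes_ is its encoded value.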

Why do we need to one hot encode the label encoded values?

Consider a situation where I have more than two categorical values. For example, in our Titanic dataset, there is a column called Embarked which has 3 categorical values ('S', 'C', 'Q'). Label Encoder will convert these values into 0, 1 and 2. In the original dataset there is no ordering between 'S', 'C' and 'Q', but after label encoding it appears that there is one, like 'S' > 'Q' > 'C' (which is not true), because Label Encoder assigns integers in sorted order: 'C' is encoded to 0, 'Q' to 1 and 'S' to 2. A machine learning algorithm may treat these numbers as ordered quantities. To remove this confusion, we need to further apply one hot encoding to create separate columns for 'S', 'C' and 'Q', each containing only zeros and ones.
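The integer mapping can be checked directly. This is a small sketch on a made-up Embarked sample; the variable names are only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the "Embarked" column
embarked = ["S", "C", "Q", "S"]

encoder = LabelEncoder()
codes = encoder.fit_transform(embarked)

# Build a readable category -> integer mapping from the fitted encoder
mapping = {
    str(label): int(code)
    for label, code in zip(encoder.classes_, encoder.transform(encoder.classes_))
}

print(mapping)
print(codes)
```

The integers follow the sorted order of the labels, not any real ordering of the ports, which is why the encoded column should not be fed to a model as if it were a numeric quantity.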

Step 2: Convert the Label Encoded values to One Hot Encoded values

One Hot Encoder takes a column which has been label encoded and splits it into multiple columns, one per category. Each row gets a 1 in the column for its own category and a 0 in all the others.

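from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# categorical_features was removed in scikit-learn 0.22; ColumnTransformer selects column 0 instead
ct = ColumnTransformer([("onehot", OneHotEncoder(), [0])], remainder="passthrough", sparse_threshold=0)
x = ct.fit_transform(x)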
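To see the splitting on its own, here is a minimal sketch that one hot encodes a small, made-up Embarked sample. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string categories directly, so the label encoding step is not needed for this toy input.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of the "Embarked" column, shaped as a 2D array
# with one feature column, which is what OneHotEncoder expects
embarked = np.array([["S"], ["C"], ["Q"], ["S"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
one_hot = encoder.fit_transform(embarked).toarray()

print(encoder.categories_)  # column order of the output
print(one_hot)
```

The output has one column per category, in the order given by categories_ ('C', 'Q', 'S'), and every row contains exactly one 1.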
