
Sunday, 31 March 2019

How to separate numeric and categorical variables in a dataset using Pandas and Numpy Libraries in Python?

Numeric and categorical variables are treated differently during data wrangling, so it helps to split a dataset into at least two subsets: one containing the numeric variables and the other containing the categorical variables. We will use the "select_dtypes" method of a pandas DataFrame to separate the two.

Consider the Ames Housing dataset.

Step 1: Load the required libraries

import pandas as pd
import numpy as np

Step 2: Load the dataset

dataset = pd.read_csv("C:/datasets/train.csv")

Step 3: Separate numeric and categorical variables

numeric_data = dataset.select_dtypes(include=[np.number])
categorical_data = dataset.select_dtypes(exclude=[np.number])

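# Print (rows, columns) for each subset; outside a notebook, bare expressions display nothing
print(numeric_data.shape)
print(categorical_data.shape)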

There are 38 numeric and 43 categorical columns in the dataset. 
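To see the split without needing the CSV file, here is a minimal sketch on a small hand-made DataFrame. The column names and values are made up for illustration and only loosely resemble the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing file (values are invented)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500.0, 181500.0, 223500.0],
    "Street": ["Pave", "Pave", "Grvl"],
    "CentralAir": ["Y", "N", "Y"],
})

# Integer and float columns both count as np.number
toy_numeric = toy.select_dtypes(include=[np.number])

# Everything that is not a number ends up here
toy_categorical = toy.select_dtypes(exclude=[np.number])

print(toy_numeric.columns.tolist())
print(toy_categorical.columns.tolist())
print(toy_numeric.shape, toy_categorical.shape)
```

Note that exclude=[np.number] keeps every non-numeric column, not only text columns. Boolean and datetime columns, if a dataset has any, would also land in the second subset, so it is worth checking the dtypes before treating all of them as categorical.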

With numeric variables, you can impute missing values using the mean, median or mode, replace invalid values, remove outliers, study the correlations among the variables, group values into bins, and apply feature engineering steps such as standardization and normalization.
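A few of these numeric steps can be sketched as follows. This is only an illustration on invented values: the column names, the choice of the median for imputation, and the three bin labels are assumptions, not a recommended recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with one missing value
nums = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "YearBuilt": [2003, 1976, 2001, 1915],
})

# Impute missing values with each column's median
filled = nums.fillna(nums.median())

# Standardization: zero mean and unit standard deviation per column
standardized = (filled - filled.mean()) / filled.std()

# Normalization (min-max scaling): rescale each column to the range [0, 1]
normalized = (filled - filled.min()) / (filled.max() - filled.min())

# Binning: split YearBuilt into three equal-width intervals (labels are made up)
age_bins = pd.cut(filled["YearBuilt"], bins=3, labels=["old", "mid", "new"])

# Pairwise correlation between the numeric columns
corr = filled.corr()

print(filled)
print(age_bins.tolist())
print(corr)
```

The median is used here because it is less sensitive to outliers than the mean; swapping in filled.mean() or a column's mode works the same way.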

With categorical variables, you can impute missing values with a new category or with the most frequently occurring category, and then encode the variables using label encoding or one-hot encoding (dummy variables).
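The categorical options can be sketched in the same way. The data below is invented, and "Missing" is just a placeholder name for the new category.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical data with missing values
cats = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
    "Street": ["Pave", "Pave", np.nan, "Grvl"],
})

# Option 1: treat missing values as their own category
as_new_category = cats.fillna("Missing")

# Option 2: fill with the most frequent value of each column
most_frequent = cats.mode().iloc[0]
as_mode = cats.fillna(most_frequent)

# Label encoding: map each category to an integer code
label_encoded = as_mode["GarageType"].astype("category").cat.codes

# One-hot encoding: one indicator (dummy) column per category
one_hot = pd.get_dummies(as_mode, columns=["GarageType", "Street"])

print(as_new_category)
print(label_encoded.tolist())
print(one_hot.columns.tolist())
```

Label encoding implies an order among the categories, which some models will pick up on, while one-hot encoding avoids that at the cost of extra columns.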

For the detailed data wrangling steps, please see this post.
