Numeric and categorical variables are handled differently during data wrangling, so it helps to split the dataset into two sets up front: one containing the numeric variables and the other containing the categorical variables. We will use the "select_dtypes" method of the pandas library to separate them.
Consider the Ames Housing dataset.
Step 1: Load the required libraries
import pandas as pd
import numpy as np
Step 2: Load the dataset
dataset = pd.read_csv("C:/datasets/train.csv")
Step 3: Separate numeric and categorical variables
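# Columns with a numeric dtype (integers and floats)
numeric_data = dataset.select_dtypes(include=[np.number])
# Everything else (in this dataset, text columns)
categorical_data = dataset.select_dtypes(exclude=[np.number])
# Print (rows, columns) for each subset
print(numeric_data.shape)
print(categorical_data.shape)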
There are 38 numeric and 43 categorical columns in the dataset.
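To see what each call returns, here is a minimal sketch on a small made-up DataFrame. The column names echo the Ames dataset, but the values and the "HasPool" column are hypothetical. Note that excluding np.number keeps every non-numeric dtype, including booleans, so it is worth checking the result before treating all of it as categorical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative, not from train.csv.
toy = pd.DataFrame({
    "LotArea": [8450, 9600],          # integer column
    "LotFrontage": [65.0, 80.0],      # float column
    "Street": ["Pave", "Pave"],       # text column
    "HasPool": [False, True],         # boolean column (not a numeric dtype)
})

numeric_part = toy.select_dtypes(include=[np.number])
other_part = toy.select_dtypes(exclude=[np.number])

print(list(numeric_part.columns))
print(list(other_part.columns))
```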
With numeric variables, you can impute missing values using the mean, median or mode, replace invalid values, remove outliers, study the correlations among them, group values into bins, and apply feature scaling such as standardization or normalization.
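A few of these numeric steps can be sketched as follows. This is only an illustration on made-up values: the toy frame, the choice of median imputation, the 1.5 * IQR capping rule and the three bin labels are all assumptions, not steps prescribed by this post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for numeric_data; values are illustrative only.
numeric_data = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0, 215245.0],
    "GrLivArea": [1710.0, 1262.0, 1786.0, 1717.0, 2198.0],
})

# Impute missing values with each column's median.
imputed = numeric_data.fillna(numeric_data.median())

# Cap outliers at 1.5 * IQR beyond the quartiles (one common rule of thumb).
q1 = imputed.quantile(0.25)
q3 = imputed.quantile(0.75)
iqr = q3 - q1
capped = imputed.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)

# Standardize each column to zero mean and unit standard deviation.
standardized = (capped - capped.mean()) / capped.std()

# Pairwise correlations between the numeric columns.
correlations = capped.corr()

# Split one column into three equal-width bins.
lot_area_bins = pd.cut(capped["LotArea"], bins=3, labels=["small", "medium", "large"])

print(capped)
print(correlations)
```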
With categorical variables, you can impute missing values with a new category or with the most frequent category, and encode the values using label encoding or one-hot encoding (dummy variables).
For detailed data wrangling steps, please visit this post of mine.
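The categorical options can be sketched in the same way. The toy frame below and the placeholder label "Missing" are assumptions for illustration; label encoding is done here with the pandas category dtype, which is one of several ways to do it.

```python
import pandas as pd

# Hypothetical stand-in for categorical_data; values are illustrative only.
categorical_data = pd.DataFrame({
    "MSZoning": ["RL", "RM", None, "RL"],
    "Alley": [None, "Grvl", None, "Pave"],
})

# Option 1: treat missing values as their own category.
with_new_category = categorical_data.fillna("Missing")

# Option 2: fill missing values with each column's most frequent value.
most_frequent = categorical_data.mode().iloc[0]
with_mode = categorical_data.fillna(most_frequent)

# Label encoding: map each distinct value to an integer code.
label_codes = with_new_category["MSZoning"].astype("category").cat.codes

# One-hot encoding: one indicator column per distinct value.
one_hot = pd.get_dummies(with_new_category)

print(with_mode)
print(one_hot.columns.tolist())
```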