Thursday 21 March 2019

Implement Imputer in Python using Scikit Learn Library

The Imputer class in the Scikit Learn library replaces missing values in a numeric feature with a meaningful value such as the mean, median or mode. Let's see how to use it in Python with the sklearn library.

You can download my Jupyter notebook implementing Imputer from here.

Step 1: Import the required libraries like pandas, numpy and sklearn

import pandas as pd
import numpy as np
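# Note: Imputer was deprecated in scikit-learn 0.20 and removed in 0.22, so this import needs an older release
from sklearn.preprocessing import Imputer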

Step 2: Create a pandas data frame

dataframe = pd.DataFrame()
dataframe['Feature_1'] = [0.42,0.56,0.36,0.90,0.98,0.64,0.76,0.56,0.39,0.77]
dataframe['Feature_2'] = [np.nan,0.90,0.75,0.45,np.nan,0.88,0.67,0.34,0.72,0.28]

I have added two features of 10 values each to this data frame and deliberately put two NaN values in the second feature. We will impute these NaN values using the Imputer class from the sklearn library.
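
Before imputing, it can help to confirm where the missing values actually are. The following is a minimal sketch using only pandas and numpy. It rebuilds the same data frame as Step 2, and the names missing_counts and missing_rows are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as in Step 2
dataframe = pd.DataFrame()
dataframe['Feature_1'] = [0.42, 0.56, 0.36, 0.90, 0.98, 0.64, 0.76, 0.56, 0.39, 0.77]
dataframe['Feature_2'] = [np.nan, 0.90, 0.75, 0.45, np.nan, 0.88, 0.67, 0.34, 0.72, 0.28]

# Count the NaN values in each feature
missing_counts = dataframe.isna().sum()
print(missing_counts)

# Row positions of the missing values in Feature_2
missing_rows = dataframe.index[dataframe['Feature_2'].isna()].tolist()
print(missing_rows)
```

Feature_1 should report no missing values, and Feature_2 should report two, at rows 0 and 4.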

Step 3: Impute nan values with mean value using Imputer class

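# Replace each NaN with the mean of its column (axis=0)
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)
# fit_transform returns a NumPy array, so dataframe is no longer a pandas DataFrame after this line
dataframe = imputer.fit_transform(dataframe.values)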
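
For readers on a recent scikit-learn release (0.22 or later, where Imputer is no longer available), the same step can be sketched with SimpleImputer from sklearn.impute. This is an assumption about the reader's environment, not the code used in the original notebook. SimpleImputer always works column-wise and has no axis argument. The names imputed and result are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # assumes scikit-learn >= 0.20

# Same data as in Step 2
dataframe = pd.DataFrame()
dataframe['Feature_1'] = [0.42, 0.56, 0.36, 0.90, 0.98, 0.64, 0.76, 0.56, 0.39, 0.77]
dataframe['Feature_2'] = [np.nan, 0.90, 0.75, 0.45, np.nan, 0.88, 0.67, 0.34, 0.72, 0.28]

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed = imputer.fit_transform(dataframe.values)

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
result = pd.DataFrame(imputed, columns=dataframe.columns)
print(result)
```

The two NaN entries in Feature_2 should be filled with the mean of the eight observed values, 4.99 / 8 = 0.62375, while Feature_1 is left unchanged.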

Explanation of Imputer class parameters:

missing_values — This is the value which has to be replaced in the dataset. This could either be an integer, or NaN. If you don’t pass this value, NaN will be the default value. So, wherever we have NaN in our dataset, the Imputer object will replace it with a new value.
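
As an illustration of a non-NaN marker, here is a small sketch where -1 stands for a missing reading. It assumes a scikit-learn version that provides sklearn.impute.SimpleImputer (0.20 or newer) rather than the older Imputer class, and the data is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # assumes scikit-learn >= 0.20

# Hypothetical column in which -1 marks a missing reading
X = np.array([[1.0], [-1.0], [3.0]])

# Treat -1 as the missing value and replace it with the column mean
imputer = SimpleImputer(missing_values=-1, strategy='mean')
X_filled = imputer.fit_transform(X)
print(X_filled)
```

The mean is computed from the remaining values 1.0 and 3.0, so the -1 entry should become 2.0.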

strategy — This is the strategy we’ll be using to calculate the value which has to replace the NaN occurrences in the dataset. There are three different strategies we can use: mean, median and most_frequent.
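
To see how the three strategies differ, the sketch below fills the same missing value with each one. It assumes sklearn.impute.SimpleImputer (scikit-learn 0.20 or newer) in place of the older Imputer class, and the column X and the dictionary filled are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # assumes scikit-learn >= 0.20

# Hypothetical single column with one missing value in the last row
X = np.array([[1.0], [2.0], [2.0], [7.0], [np.nan]])

filled = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Keep only the value that replaced the NaN in the last row
    filled[strategy] = imputer.fit_transform(X)[-1, 0]
    print(strategy, filled[strategy])
```

For the observed values 1, 2, 2 and 7, the mean is 3.0, while the median and the most frequent value are both 2.0.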

axis — This can take one of two values — 0 and 1. This will decide if the Imputer will apply the strategy along the rows or along the columns. 0 for columns, and 1 for rows.
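
To make the axis distinction concrete, here is a NumPy-only sketch of what column-wise and row-wise mean imputation compute. It does not call Imputer itself, and the array X is made up for illustration.

```python
import numpy as np

# Hypothetical 3x2 array with one missing value in the first row
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [5.0, 8.0]])

# axis=0: one mean per column, ignoring NaN
col_means = np.nanmean(X, axis=0)
by_column = np.where(np.isnan(X), col_means, X)

# axis=1: one mean per row, ignoring NaN
row_means = np.nanmean(X, axis=1)
by_row = np.where(np.isnan(X), row_means[:, np.newaxis], X)

print(by_column)
print(by_row)
```

Column-wise, the missing entry is filled with 6.0, the mean of 4.0 and 8.0 in its column. Row-wise, it is filled with 1.0, the only other value in its row.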

verbose — This will just decide the verbosity of the Imputer. By default, it’s set to 0.

copy — This will decide if a copy of the original object has to be made, or if the Imputer should change the dataset in-place. By default, it is set to True.
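
The effect of copy=True can be checked with a short sketch. It assumes sklearn.impute.SimpleImputer (scikit-learn 0.20 or newer) in place of the older Imputer class, and the array X is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # assumes scikit-learn >= 0.20

# Hypothetical column with one missing value
X = np.array([[1.0], [np.nan], [3.0]])

# copy=True: the imputer works on a copy and leaves X untouched
imputer = SimpleImputer(missing_values=np.nan, strategy='mean', copy=True)
X_filled = imputer.fit_transform(X)

original_still_has_nan = bool(np.isnan(X).any())
print(original_still_has_nan)
print(X_filled.ravel())
```

With copy=False, the scikit-learn documentation says imputation is done in place whenever possible, so X itself may end up modified.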

About the Author

I have more than 10 years of experience in the IT industry. LinkedIn Profile

I am currently experimenting with neural networks in deep learning. I am learning Python, TensorFlow and Keras.

Author: I am an author of a book on deep learning.

Quiz: I run an online quiz on machine learning and deep learning.