Handling missing values in a dataset is a very common practice in data wrangling and there are various ways to deal with it. Today, we will see how to handle missing values in the numeric column? For this, we will consider the "Age" column (which contains numeric values) from the Titanic dataset.

Lets see how many missing values are there in the "Age" column of Titanic dataset?

There are

The general method to handle such kind of scenarios is to replace the missing values with some meaningful value. This meaningful values can be obtained by taking the mean, median or mode of all the not null values in the "Age" column. This is a

mean_age = dataset['Age'].

median_age = dataset['Age'].

mode_age = dataset['Age'].

display('Mean Age: ' + str(mean_age))

display('Median Age: ' + str(median_age))

display('Mode Age: ' + str(mode_age))

'Mean Age: 29.69911764705882'

'Median Age: 28.0'

'Mode Age: 0 24.0\ndtype: float64'

Replace the missing values in the "Age" column with any of the above calculated values. In this case, I am going to replace the missing values with the mean value.

Please note that, this is just an approximation of the missing values and it may lead to variance in the prediction but we have to deal with it. There is no way around. But this approach is far better than dropping the "Age" column due to which we will lose a lot of significant data.

**Note:**Please go through my previous post where I have loaded this Titanic dataset into pandas data frame.Lets see how many missing values are there in the "Age" column of Titanic dataset?

**dataset['Age'].isnull().sum()**There are

**177**missing values out of**891**observations. Now how to handle these 177 missing values?The general method to handle such kind of scenarios is to replace the missing values with some meaningful value. This meaningful values can be obtained by taking the mean, median or mode of all the not null values in the "Age" column. This is a

**statistical approach of handling the missing values**and is well suited for linear data.**Step 1: Calculate mean, median and mode**mean_age = dataset['Age'].

**mean**()median_age = dataset['Age'].

**median**()mode_age = dataset['Age'].

**mode**()display('Mean Age: ' + str(mean_age))

display('Median Age: ' + str(median_age))

display('Mode Age: ' + str(mode_age))

**Output**'Mean Age: 29.69911764705882'

'Median Age: 28.0'

'Mode Age: 0 24.0\ndtype: float64'

**Step 2: Replace the missing values**Replace the missing values in the "Age" column with any of the above calculated values. In this case, I am going to replace the missing values with the mean value.

**dataset['Age'].replace(np.NaN, mean_age)**Please note that, this is just an approximation of the missing values and it may lead to variance in the prediction but we have to deal with it. There is no way around. But this approach is far better than dropping the "Age" column due to which we will lose a lot of significant data.

## No comments:

## Post a Comment