Domain knowledge plays a crucial role in data wrangling. Sometimes a dataset has no missing values, yet it contains a lot of invalid values that we need to identify and handle manually.
For example, consider the "Pima Indians Diabetes" dataset, where the task is to predict the onset of diabetes within 5 years in Pima Indians, given medical details. This dataset has a lot of invalid values, which we will try to clean up in this article.
Let's load this dataset. You can download it from here.
import pandas as pd
import numpy as np

# The CSV file has no header row, so we pass the column names explicitly.
names = ['PregCount', 'Glucose', 'BP', 'SkinFold', 'Insulin', 'BMI', 'Pedigree', 'Age', 'Class']
dataset = pd.read_csv('pima-indians-diabetes.csv', names=names)
dataset.shape
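Running this should print the shape of the DataFrame:
(768, 9)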
This dataset has 768 observations, with 8 input parameters and a class label. The input parameters are:
1. Number of times pregnant.
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
3. Diastolic blood pressure (mm Hg).
4. Triceps skinfold thickness (mm).
5. 2-Hour serum insulin (mu U/ml).
6. Body mass index (weight in kg/(height in m)^2).
7. Diabetes pedigree function.
8. Age (years).
Let's see the count of null or missing values in this dataset:
dataset.isnull().sum()
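PregCount - 0
Glucose - 0
BP - 0
SkinFold - 0
Insulin - 0
BMI - 0
Pedigree - 0
Age - 0
Class - 0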
Every count comes back as zero, so there are no missing values in the dataset. Now let's look at the statistics of the data:
dataset.describe()
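If you only want the minimums, one way is to pull just the min row out of the summary:

# describe() returns a DataFrame indexed by statistic name, so we can select the 'min' row.
dataset.describe().loc['min']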
Looking at the min row, we find that the following columns have a minimum value of zero:
1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
2. Diastolic blood pressure (mm Hg).
3. Triceps skinfold thickness (mm).
4. 2-Hour serum insulin (mu U/ml).
5. Body mass index (weight in kg/(height in m)^2).
Here domain knowledge plays a vital role. Although there are no null or missing values in this dataset, there are a lot of invalid values (zeros) in the above columns: a measurement like blood pressure or BMI cannot realistically be zero. Let's see how many zero values there are in these columns:
print((dataset[['Glucose', 'BP', 'SkinFold', 'Insulin', 'BMI']] == 0).sum())
Glucose - 5
BP - 35
SkinFold - 227
Insulin - 374
BMI - 11
To handle these invalid zero values, we will mark them as NaN.
# Replace the invalid zeros with NaN (np.nan) in the affected columns.
dataset[['Glucose', 'BP', 'SkinFold', 'Insulin', 'BMI']] = dataset[['Glucose', 'BP', 'SkinFold', 'Insulin', 'BMI']].replace(0, np.nan)
Now print the null counts again:
dataset.isnull().sum()
PregCount - 0
Glucose - 5
BP - 35
SkinFold - 227
Insulin - 374
BMI - 11
Pedigree - 0
Age - 0
Class - 0
NaN values are ignored by operations like sum, count, etc. As these are numeric columns, we can take the mean, median, or mode of the valid values and replace the NaN entries with those statistics, as sketched below. To know more about it, please go through this post.
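For example, here is a minimal sketch using the median, which is a reasonable choice when the distributions are skewed, as Insulin and SkinFold are here (mean or mode would work the same way):

# Fill each column's NaN values with that column's median;
# median() skips NaN by default, so it is computed over valid values only.
cols = ['Glucose', 'BP', 'SkinFold', 'Insulin', 'BMI']
dataset[cols] = dataset[cols].fillna(dataset[cols].median())

# Verify that no missing values remain.
print(dataset.isnull().sum())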