Removing null values from a dataset is one of the important steps in data wrangling. Null values adversely affect the performance and accuracy of most machine learning algorithms, so it is very important to deal with them before applying any machine learning algorithm to the dataset. Although some algorithms like XGBoost have a built-in mechanism to handle null values, it is still good practice to handle them ourselves while preparing the data.
We will use the Python library pandas to remove null values from the Titanic dataset. Let's try it out.
Step 1: Import the required Python libraries
import pandas as pd
Step 2: Load and examine the dataset (Data Exploration)
dataset = pd.read_csv('titanic.csv')  # load the CSV file into a DataFrame
dataset.shape   # number of rows and columns
dataset.info()  # column names, data types and non-null counts
dataset.head()  # first five rows
You can download the Titanic dataset from here. There are 891 observations and 12 features in this dataset.
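If you want to sanity-check what was loaded, here is a small optional sketch (it assumes the file follows the standard Kaggle Titanic layout with 891 rows and 12 columns):

print(dataset.shape)             # expected: (891, 12)
print(dataset.columns.tolist())  # names of the 12 features
print(dataset.dtypes)            # data type of each column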
Step 3: Data Wrangling (Removing null values)
To keep this article short and crisp, the only data wrangling step we will perform on the Titanic dataset in this tutorial is removing null values.
Step 3.1: Let's see where the null values are in our dataset
dataset.isnull()
This displays the entire dataset as True and False values: non-null values are represented by False and null values by True.
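You can also use isnull() as a boolean filter to inspect only the problematic rows. For example, a small illustrative sketch that looks at passengers whose Age is missing:

# Show the first few rows where the Age column is null
dataset[dataset['Age'].isnull()].head()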
Step 3.2: Let's display how many null values each feature contains
dataset.isnull().sum()
We get each feature along with its count of null values. From the output of the above code, it is clear that the Age column contains 177 null values and the Cabin column contains 687 null values.
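It can also help to look at nulls as a percentage of all rows when deciding what to do with a column. This is not part of the original steps, just a small optional sketch:

# Percentage of null values per column, highest first
(dataset.isnull().mean() * 100).sort_values(ascending=False)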
Step 3.3: Let's drop the Cabin column
The Cabin column contains 687 null values out of 891 observations, so it makes sense to drop this column from the dataset. Let's drop it.
dataset.drop('Cabin', axis=1, inplace=True)
Please note that the inplace parameter is used to modify our dataset permanently. By default it is False; if we don't set it to True explicitly, drop() returns a new DataFrame and the Cabin column is not removed from our original dataset.
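If you prefer to avoid inplace, an equivalent sketch simply reassigns the returned DataFrame:

# Equivalent without inplace: drop() returns a new DataFrame, so reassign it
dataset = dataset.drop('Cabin', axis=1)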
Step 3.4: Let's drop all the rows in the dataset which contain null values
dataset.dropna(inplace=True)
This removes every row that contains a null value, so the dataset no longer contains any null values. However, this step is not recommended and is included only for illustration, because we can lose significant information by executing it. There are methods to replace the null values with meaningful values instead; I will explore those methods in my future posts on data wrangling.
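As a small preview of those methods, here is a minimal imputation sketch that fills the nulls instead of dropping rows (it assumes the usual Titanic columns Age and Embarked):

# Fill missing Age values with the median age instead of dropping the rows
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())
# Fill missing Embarked values with the most frequent port of embarkation
dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])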