Saturday, 6 April 2019

Boxplot Grouping: Visualizing one variable based on another variable using boxplot

Boxplots are mainly used to visualize the distribution of the variables in a dataset. Drawing a boxplot for a variable makes its outliers easy to spot. We can also group the plot by another variable in the dataset. Let's see how.
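As a rough sketch of what a boxplot treats as an outlier: with matplotlib's default settings (which the pandas boxplot relies on), points lying more than 1.5 times the interquartile range beyond the quartiles are drawn as individual markers. The helper name iqr_outliers and the income values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, whisker: float = 1.5) -> pd.Series:
    """Return the values that fall outside the boxplot whisker limits."""
    q1 = values.quantile(0.25)  # lower edge of the box
    q3 = values.quantile(0.75)  # upper edge of the box
    iqr = q3 - q1               # height of the box
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical incomes, for illustration only
incomes = pd.Series([2500, 3000, 3200, 3500, 3800, 4000, 4200, 4500, 5000, 81000])
outliers = iqr_outliers(incomes)
print(outliers.tolist())
```

Here only the single very large value falls outside the limits, so it is the one point a boxplot of these numbers would draw separately.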

Consider a Loan Prediction dataset. We will analyze the ApplicantIncome and Education variables in this dataset.

Step 1: Import the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

Step 2: Load the dataset

dataset = pd.read_csv("C:/train_loan_prediction.csv")

Step 3: Draw boxplot for ApplicantIncome

dataset.boxplot(column='ApplicantIncome')

We can see a lot of outliers/extreme values in the applicant income column, which suggests considerable income disparity among the applicants. But hold on: we are analyzing the income of all applicants while disregarding their education levels, which hides a likely source of the variation. There is a good chance that graduates have higher incomes than less educated applicants. Let's segregate the income by education:

dataset.boxplot(column='ApplicantIncome', by='Education')

We can see that there is no substantial difference between the median income of graduates and non-graduates (the line inside each box marks the median, not the mean). But there are more graduates with very high incomes, and these appear as outliers.
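The line inside each box is the median, so a per-group numeric summary is a handy cross-check on what the grouped plot suggests. The sketch below uses a small made-up DataFrame with the same column names; the rows and the 'Graduate' / 'Not Graduate' labels are assumptions for illustration, not values read from the CSV.

```python
import pandas as pd

# Hypothetical rows, for illustration only; the real values come from the CSV
sample = pd.DataFrame({
    "Education": ["Graduate"] * 5 + ["Not Graduate"] * 5,
    "ApplicantIncome": [3000, 4000, 5000, 6000, 40000,
                        2500, 3500, 4500, 5500, 6500],
})

# Median, mean and maximum income for each education level
summary = sample.groupby("Education")["ApplicantIncome"].agg(["median", "mean", "max"])
print(summary)
```

In this toy data the two medians are close (5000 and 4500), while a single very high graduate income pulls the graduate mean up to 11600. Running the same groupby on dataset would show whether the real data follows that pattern.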
