Friday, 19 April 2019

Data Visualization using Pair Grid and Pair Plot (Seaborn Library)

Let's visualize our data with Pair Grid and Pair Plot, both of which are provided by the Seaborn library. We will use the Iris dataset. We can pass various parameters to pair grid and pair plot, such as palette, hue, hue_kws, vars, x_vars, y_vars, height, kind and diag_kind. The plotting functions mapped onto the grid also accept styling arguments such as color, marker (diamond, plus sign, circle, square), linewidth and edgecolor. Let's explore pair grid and pair plot in detail:

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#matplotlib inline

Step 2: Load Iris dataset

iris=sns.load_dataset('iris')
iris.head()

Step 3: Explore data using Pair Grid

Please note that I am not displaying the resulting plots in this post. Please explore it yourself in your Jupyter notebook.

x = sns.PairGrid(iris)
x = x.map(plt.scatter)  #draw scatter plot on pair grid

x = sns.PairGrid(iris)
x = x.map_diag(plt.hist)  #draw histogram on diagonal
x = x.map_offdiag(plt.scatter) #draw scatter plot on rest of the grid

x = sns.PairGrid(iris)
x = x.map_diag(plt.hist)  #draw histogram on diagonal
x = x.map_upper(plt.scatter)  #draw scatter plot on upper grid w.r.t diagonal
x = x.map_lower(sns.kdeplot)  #draw kde plot on lower grid w.r.t diagonal

x = sns.PairGrid(iris, hue='species', palette='Blues_d')
x = x.map_diag(plt.hist)
x = x.map_offdiag(plt.scatter)
x = x.add_legend()

x = sns.PairGrid(iris, hue='species')
x = x.map_diag(plt.hist, histtype='step', linewidth=2, edgecolor='black')
x = x.map_offdiag(plt.scatter, edgecolor='black')
x = x.add_legend()

x = sns.PairGrid(iris, vars=['petal_length', 'petal_width'])
x = x.map_diag(plt.hist)
x = x.map_offdiag(plt.scatter)

x = sns.PairGrid(iris, x_vars=['petal_length', 'petal_width'], y_vars=['sepal_length', 'sepal_width'])
x = x.map(plt.scatter)

x = sns.PairGrid(iris, hue='species')
x = x.map_diag(plt.hist)
x = x.map_upper(plt.scatter)
x = x.map_lower(sns.kdeplot)
x = x.add_legend()

x = sns.PairGrid(iris, hue='species', hue_kws={'marker' : ['D', 's', '+']})
x = x.map(plt.scatter, s=30, edgecolor='black')
x = x.add_legend()
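The PairGrid calls above depend on sns.load_dataset, which fetches the Iris data over the network. As a minimal sketch that runs offline, the snippet below builds a small synthetic data frame (the column names a, b, c and the random values are made up for illustration) and maps a histogram to the diagonal and a scatter plot to the other cells.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for a numeric dataset (hypothetical columns, random values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "a": rng.normal(size=30),
    "b": rng.normal(size=30),
    "c": rng.normal(size=30),
})

grid = sns.PairGrid(demo)
grid.map_diag(plt.hist)        # histograms on the diagonal
grid.map_offdiag(plt.scatter)  # scatter plots in every other cell

# The grid holds one row and one column of axes per numeric variable
print(grid.axes.shape)
```

With three numeric columns the grid is 3 x 3, which is why the Iris examples (four numeric columns) produce a 4 x 4 grid.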

Step 4: Explore data using Pair Plot

sns.pairplot(iris)

sns.pairplot(iris, kind='reg', diag_kind='kde')

sns.pairplot(iris, kind='reg', diag_kind='kde', hue='species')

sns.pairplot(iris, vars=['petal_length', 'petal_width'], height=4)

sns.pairplot(iris, x_vars=['petal_length', 'petal_width'], y_vars=['sepal_length', 'sepal_width'])

Thursday, 18 April 2019

Data Visualization using Regression Plot (Seaborn Library)

Let's visualize our data with Regression Plot, which is provided by the Seaborn library. We will use the Tips dataset. We can pass various parameters to regplot like color, marker (diamond, plus sign, circle), ci, x_jitter, x_estimator etc. We can also style the scatter points and the regression line separately using the scatter_kws and line_kws parameters. Let's explore regplot in detail:

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#matplotlib inline

Step 2: Load Tips dataset

tips=sns.load_dataset('tips')
tips.head()

Step 3: Explore data using Regression Plot

Please note that I am not displaying the resulting plots in this post. Please explore it yourself in your Jupyter notebook.

sns.regplot(x='total_bill', y='tip', data=tips)

sns.regplot(x='total_bill', y='tip', data=tips, color='purple')

sns.regplot(x='total_bill', y='tip', data=tips, marker='D')  #diamond
sns.regplot(x='total_bill', y='tip', data=tips, marker='+')  #plus sign
sns.regplot(x='total_bill', y='tip', data=tips, marker='o')  #circle

sns.regplot(x='total_bill', y='tip', data=tips, marker='D', \
           scatter_kws={'color' : 'blue'}, \
           line_kws={'color' : 'red', 'linewidth' : 3.1})

sns.regplot(x='total_bill', y='tip', data=tips, ci=64)  #confidence interval

sns.regplot(x='size', y='total_bill', data=tips)
sns.regplot(x='size', y='total_bill', data=tips, x_jitter=0.3)

sns.regplot(x='size', y='total_bill', data=tips, x_estimator=np.mean)
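With x_estimator=np.mean, regplot collapses all observations that share the same x value into a single point at their mean. The sketch below reproduces only that aggregation step with pandas on a tiny made-up table (the values are hypothetical, not taken from the Tips dataset), so you can see which points would be drawn.

```python
import pandas as pd

# Hypothetical party sizes and bills, chosen only for illustration
demo = pd.DataFrame({
    "size": [1, 1, 2, 2, 3],
    "total_bill": [10.0, 14.0, 20.0, 24.0, 30.0],
})

# One mean per unique x value, which is what x_estimator=np.mean summarizes
means = demo.groupby("size")["total_bill"].mean()
print(means)
```

This is only the summary that the points represent; regplot additionally draws a confidence interval around each point.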

Data Visualization using FacetGrid (Seaborn Library)

Let's visualize our data with FacetGrid, which is provided by the Seaborn library. We will use the Tips dataset. FacetGrid can be used with Histogram, Scatter Plot, Regression Plot, Box Plot etc. We can pass various parameters to FacetGrid like row, col, height, aspect, hue, palette, col_order etc. To add a legend to the grid, you can use the add_legend() method. Let's explore FacetGrid in detail:

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#matplotlib inline

Step 2: Load Tips dataset

tips=sns.load_dataset('tips')
tips.head()

Step 3: Explore data using Facet Grid

Please note that I am not displaying the resulting plots in this post. Please explore it yourself in your Jupyter notebook.

#Facet Grid with Histogram

x = sns.FacetGrid(tips, row='smoker', col='time')
x = x.map(plt.hist, 'total_bill')

x = sns.FacetGrid(tips, row='smoker', col='time')
x = x.map(plt.hist, 'total_bill', color='green', bins=15)

#Facet Grid with Scatter Plot

x = sns.FacetGrid(tips, row='smoker', col='time')
x = x.map(plt.scatter, 'total_bill', 'tip')

#Facet Grid with Regression Plot

x = sns.FacetGrid(tips, row='smoker', col='time', height=6, aspect=0.7)
x = x.map(sns.regplot, 'total_bill', 'tip')

x = sns.FacetGrid(tips, col='time', hue='smoker', palette='husl')
x = x.map(sns.regplot, 'total_bill', 'tip')

x = sns.FacetGrid(tips, col='time', hue='smoker')
x = x.map(sns.regplot, 'total_bill', 'tip').add_legend()  #add legend

#Facet Grid with Box Plot

x = sns.FacetGrid(tips, col='day', height=10, aspect=0.2)
x = x.map(sns.boxplot, 'time', 'total_bill')

x = sns.FacetGrid(tips, col='day', height=10, aspect=0.2, col_order=['Sat', 'Sun', 'Thur', 'Fri'])
x = x.map(sns.boxplot, 'time', 'total_bill', color='red')

Wednesday, 17 April 2019

Data Visualization using Violin Plot (Seaborn Library)

Let's visualize our data with Violin Plot, which is provided by the Seaborn library. We will use the Tips dataset. We can pass various parameters to violinplot like hue, split, palette, order, inner (quartile, stick), scale, scale_hue, bandwidth (bw) etc. Let's explore violinplot in detail:

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#matplotlib inline

Step 2: Load Tips dataset

tips=sns.load_dataset('tips')
tips.head()

Step 3: Explore data using Violin Plot

Please note that I am not displaying the resulting plots in this post. Please explore it yourself in your Jupyter notebook.

sns.violinplot(x=tips['tip'])

sns.violinplot(x='day', y='total_bill', data=tips)

sns.violinplot(x='day', y='total_bill', data=tips, hue='sex')

sns.violinplot(x='day', y='total_bill', data=tips, hue='sex', palette='RdBu')

sns.violinplot(x='day', y='total_bill', data=tips, hue='sex', split=True)

sns.violinplot(x='day', y='total_bill', data=tips, hue='sex', order=['Sat', 'Sun', 'Thur', 'Fri'])

sns.violinplot(x='day', y='total_bill', data=tips, hue='smoker', inner='quartile')

sns.violinplot(x='day', y='total_bill', data=tips, hue='smoker', inner='quartile', split=True)

sns.violinplot(x='day', y='total_bill', data=tips, hue='smoker', inner='quartile', split=True, scale='count')

sns.violinplot(x='day', y='total_bill', data=tips, hue='smoker', inner='stick', split=True, scale='count')

sns.violinplot(x='day', y='total_bill', data=tips, hue='smoker', inner='stick', split=True, scale='count', scale_hue=False)

sns.violinplot(x='day', y='total_bill', data=tips, hue='smoker', inner='stick', split=True, scale='count', scale_hue=False, bw=0.1)

Data Visualization using Heatmap (Seaborn Library)

Let's visualize our data with Heatmap, which is provided by the Seaborn library. A heatmap encodes each value of a matrix as a color. How colors map to values depends on the colormap and is shown by the color bar next to the plot. By default, the color range is taken from the minimum and maximum of the data. We will use the Flights dataset and analyze it through heatmap. We can pass various parameters to heatmap like annot, fmt, vmin, vmax, cbar, cmap, linewidths, center etc. Let's explore heatmap in detail:

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#matplotlib inline

Step 2: Load Flights datasets

flights = sns.load_dataset('flights')
flights.head()
flights.tail()

Step 3: Explore data using Heat Map

Please note that I am not displaying the resulting maps in this post. Please explore it yourself in your Jupyter notebook.

Before exploring Flights dataset with Heatmap, lets first analyze some random numbers using Heatmap:

numbers = np.random.randn(12, 15)
numbers

sns.heatmap(numbers)

sns.heatmap(numbers, annot=True)  #to show actual values in the heatmap

sns.heatmap(numbers, annot=True, vmin=0, vmax=2)  #to set the range of the color bar; by default the limits are inferred from the data

sns.heatmap(numbers, cbar=False)  #to hide the color bar

Now, let's jump to our Flights dataset. Let's pivot this dataset so that we have "year" on the x-axis and "month" on the y-axis.

flights = flights.pivot(index='month', columns='year', values='passengers')
flights
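The pivot step turns a long table (one row per year and month) into a wide matrix that heatmap can draw. A minimal sketch with a hand-made long table follows. The column names mirror the Flights dataset, but the four rows and their passenger counts are invented for illustration.

```python
import pandas as pd

# Hypothetical long-form data: one row per (year, month) pair
long_form = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [100, 110, 120, 130],
})

# Rows become months, columns become years, cells hold passenger counts
wide = long_form.pivot(index="month", columns="year", values="passengers")
print(wide)
print(wide.loc["Feb", 1950])
```

Using keyword arguments for index, columns and values keeps the call working on recent pandas versions, which no longer accept them positionally.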

sns.heatmap(flights)

sns.heatmap(flights, annot=True)

sns.heatmap(flights, annot=True, fmt='d')  #format the annotations as integers

sns.heatmap(flights, annot=True, fmt='d', linewidths=0.9)  #add linewidth to heatmap

sns.heatmap(flights, annot=True, fmt='d', linewidths=0.9, cmap='RdBu')  #add color map to heatmap to change the color

sns.heatmap(flights, annot=True, fmt='d', linewidths=0.9, cmap='summer')

sns.heatmap(flights, annot=True, fmt='d', linewidths=0.9, cmap='winter_r')

sns.heatmap(flights, annot=True, fmt='d', linewidths=0.9, cmap='coolwarm')

sns.heatmap(flights, annot=True, fmt='d', center=flights.loc['June', 1954])  #center the colormap on the value of a particular cell; if your copy of the dataset abbreviates month names, use 'Jun' instead

Tuesday, 16 April 2019

Data Visualization using Joint Plot (Seaborn Library)

Let's visualize our data with Joint Plot, which is provided by the Seaborn library. By default, joint plot shows a scatter plot and histograms. We will use the Tips and Iris datasets. We can pass various parameters to jointplot like kind (reg, hex, kde), stat_func (spearmanr), color, ratio, height (called size in older Seaborn versions) etc. Let's explore jointplot in detail:

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#matplotlib inline
from scipy.stats import spearmanr

Step 2: Load Tips and Iris datasets

tips=sns.load_dataset('tips')
tips.head()

iris=sns.load_dataset('iris')
iris.head()

Step 3: Explore data using Joint Plot

Please note that I am not displaying the resulting plots in this post. Please explore it yourself in your Jupyter notebook.

sns.jointplot(x='total_bill', y='tip', data=tips)

sns.jointplot(x='sepal_length', y='sepal_width', data=iris)

sns.jointplot(x='total_bill', y='tip', data=tips, kind='reg')  #add regression line to scatter plot and kernel density estimate to histogram

sns.jointplot(x='total_bill', y='tip', data=tips, kind='hex')  #hexagonal representation

sns.jointplot(x='total_bill', y='tip', data=tips, kind='kde')  #kernel density estimate instead of scatter plot and histogram

sns.jointplot(x='total_bill', y='tip', data=tips, kind='reg', color='green')

sns.jointplot(x='sepal_length', y='sepal_width', data=iris, kind='kde')

sns.jointplot(x='total_bill', y='tip', data=tips, stat_func=spearmanr)  #works only on older Seaborn versions; newer releases no longer accept stat_func

sns.jointplot(x='total_bill', y='tip', data=tips, ratio=4, height=6)  #height was called size before Seaborn 0.9
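The stat_func argument only annotated the plot with a statistic. If your Seaborn version does not accept it, the same number can be computed directly with SciPy. The sketch below does this on made-up values (not rows from the Tips dataset).

```python
from scipy.stats import spearmanr

# Hypothetical bills and tips; y rises whenever x rises
total_bill = [10.0, 15.0, 20.0, 25.0, 30.0]
tip = [1.0, 1.5, 3.0, 3.2, 5.0]

# Spearman's rank correlation and its p-value
rho, p_value = spearmanr(total_bill, tip)
print(rho, p_value)
```

Because the made-up tip values increase monotonically with the bill, the rank correlation here is 1. On real data, pass the two columns of the data frame instead.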

Monday, 15 April 2019

Data Visualization using Distribution Plot (Seaborn Library)

Let's visualize our data with Distribution Plot, which is provided by the Seaborn library. By default, distribution plot shows a histogram with a kernel density estimate drawn over it. We will create 150 random numbers and plot them on a distribution plot. We can pass various parameters to distplot like color, hist, rug, bins, vertical etc. Note that distplot is deprecated from Seaborn 0.11 onwards in favor of histplot and displot. Let's explore distplot in detail:

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#matplotlib inline

Step 2: Create 150 random numbers

num = np.random.randn(150)
num

Step 3: Explore data using Distribution Plot

Please note that I am not displaying the resulting plots in this post. Please explore it yourself in your Jupyter notebook.

sns.distplot(num)

sns.distplot(num, color='red')

label_dist = pd.Series(num, name="variable x")  #convert random numbers into pandas series

sns.distplot(label_dist)

sns.distplot(label_dist, vertical=True)  #vertical histogram

sns.distplot(label_dist, hist=False)  #remove histogram from distribution plot

sns.distplot(label_dist, hist=False, rug=True)  #specify rug parameter

sns.distplot(label_dist, bins=20)  #specify number of bins you want to create
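The bins parameter controls how many equal-width intervals the histogram uses. As a rough sketch of what that means numerically, NumPy's histogram function returns the count per bin and the bin edges for the same kind of random sample (the seed is an assumption added here so the run is repeatable).

```python
import numpy as np

# 150 standard-normal draws; the seed is only there for repeatability
rng = np.random.default_rng(0)
num = rng.standard_normal(150)

# 20 equal-width bins spanning the range of the data
counts, edges = np.histogram(num, bins=20)
print(counts)
print(edges)
```

Twenty bins always come with twenty-one edges, and the counts add up to the number of samples.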

Data Visualization using Bar Plot (Seaborn Library)

Let's visualize our data with Bar Plot, which is provided by the Seaborn library. We will use the Tips dataset. We can pass various parameters to barplot like palette, color, saturation, estimator, hue, order, ci, capsize etc. Let's explore barplot in detail:

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#matplotlib inline

Step 2: Load Tips dataset

tips=sns.load_dataset('tips')
tips.head()

Step 3: Explore data using Bar Plot

Please note that I am not displaying the resulting plots in this post. Please explore it yourself in your Jupyter notebook.

sns.barplot(x='day', y='total_bill', data=tips)

sns.barplot(x='day', y='total_bill', data=tips, color='green')  #pass color parameter if you want to display all the bars in same color

sns.barplot(x='day', y='total_bill', data=tips, color='green', saturation=0.3)  #you can also set saturation level of the color

sns.barplot(x='day', y='total_bill', data=tips, estimator=np.median)  #by default, estimator is mean, you can also set it to median or anything else

sns.barplot(x='day', y='total_bill', data=tips, hue='sex')

sns.barplot(x='day', y='total_bill', data=tips, hue='sex', palette='autumn')

sns.barplot(x='day', y='total_bill', data=tips, hue='sex', color='green')

sns.barplot(x='day', y='total_bill', data=tips, hue='sex', palette='spring', order=['Sat', 'Sun', 'Thur', 'Fri'])

sns.barplot(x='sex', y='total_bill', data=tips, hue='sex', palette='spring', order=['Male', 'Female'])

Note: The black lines in the bar plot are error bars. We can set the cap size and the confidence interval (ci) of the error bars. A confidence interval is a range of values, derived from sample statistics, that is likely to contain the value of an unknown population parameter.

sns.barplot(x='day', y='total_bill', data=tips, ci=99)
sns.barplot(x='day', y='total_bill', data=tips, ci=34)

sns.barplot(x='day', y='total_bill', data=tips, capsize=0.3)
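Seaborn estimates these confidence intervals by bootstrapping, that is, by resampling the data with replacement many times. The snippet below is a simplified sketch of that idea with NumPy, not Seaborn's exact implementation. The bill values, the seed and the 1000 resamples are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical bills for a single day
bills = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 30.0])

rng = np.random.default_rng(0)
# 1000 resamples of the same size, drawn with replacement; one mean per resample
boot_means = rng.choice(bills, size=(1000, len(bills)), replace=True).mean(axis=1)

# A 95% interval is the middle 95% of the bootstrap means
low, high = np.percentile(boot_means, [2.5, 97.5])
print(bills.mean(), low, high)
```

A smaller ci value (such as 34) keeps a narrower middle slice of the bootstrap means, which is why the error bars shrink.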

Data Visualization using Box Plot (Seaborn Library)

Let's visualize our data with Box Plot, which is provided by the Seaborn library. We will use the Tips dataset. We can pass various parameters to boxplot like palette, color, hue, order, orient etc. Let's explore boxplot in detail:

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#matplotlib inline

Step 2: Load Tips dataset

tips=sns.load_dataset('tips')
tips.head()

Step 3: Explore data using Box Plot

Please note that I am not displaying the resulting plots in this post. Please explore it yourself in your Jupyter notebook.

#Visualizing one variable using Box Plot

sns.boxplot(x=tips['tip'])

sns.boxplot(x=tips['total_bill'])

sns.boxplot(x='total_bill', data=tips)

#Visualizing two variables using Box Plot

sns.boxplot(x='sex', y='total_bill', data=tips)

sns.boxplot(x='day', y='total_bill', data=tips)

sns.boxplot(x='day', y='total_bill', data=tips, hue='sex')

sns.boxplot(x='day', y='total_bill', data=tips, hue='sex', palette='husl')

sns.boxplot(x='day', y='total_bill', data=tips, hue='smoker', palette='coolwarm')

sns.boxplot(x='day', y='total_bill', data=tips, hue='time', palette='coolwarm')

sns.boxplot(x='day', y='total_bill', data=tips, order=['Sat', 'Sun', 'Thur', 'Fri'])

sns.boxplot(data=tips)

sns.boxplot(data=tips, orient='horizontal')
sns.boxplot(data=tips, orient='h')

sns.boxplot(data=tips, orient='vertical')
sns.boxplot(data=tips, orient='v')

#Combining Box Plot and Swarm Plot

sns.boxplot(x='day', y='total_bill', data=tips, palette='husl')
sns.swarmplot(x='day', y='total_bill', data=tips, color='black')

sns.boxplot(x='day', y='total_bill', data=tips, palette='husl')
sns.swarmplot(x='day', y='total_bill', data=tips, color='0.35')

Sunday, 14 April 2019

Data Visualization using Strip Plot (Seaborn Library)

Let's visualize our data with Strip Plot, which is provided by the Seaborn library. We will use the Tips dataset. We can pass various parameters to stripplot like palette, color, jitter, linewidth, hue, dodge, marker, size, order, edgecolor, alpha etc. Let's explore stripplot in detail:

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#matplotlib inline

Step 2: Load Tips dataset

tips=sns.load_dataset('tips')
tips.head()

Step 3: Explore data using Strip Plot

Please note that I am not displaying the resulting plots in this post. Please explore it yourself in your Jupyter notebook.

#Visualizing one variable using Strip Plot

sns.stripplot(x=tips['tip'])

sns.stripplot(x=tips['total_bill'])

sns.stripplot(x='total_bill', data=tips)

sns.stripplot(x='total_bill', data=tips, color='green')

#Visualizing two variables using Strip Plot

sns.stripplot(x='day', y='total_bill', data=tips)

sns.stripplot(x='total_bill', y='day', data=tips)

sns.stripplot(x='day', y='total_bill', data=tips, jitter=False)

sns.stripplot(x='day', y='total_bill', data=tips, jitter=0.3)

sns.stripplot(x='day', y='total_bill', data=tips, jitter=0.3, linewidth=1.2)

sns.stripplot(x='day', y='total_bill', data=tips, hue='sex')

sns.stripplot(x='day', y='total_bill', data=tips, hue='sex', jitter=False)

sns.stripplot(x='day', y='total_bill', data=tips, hue='sex', dodge=True)

sns.stripplot(x='day', y='total_bill', data=tips, hue='sex', dodge=True, palette='winter_r')

sns.stripplot(x='day', y='total_bill', data=tips, hue='sex', dodge=True, palette='winter_r', order=['Sat', 'Sun', 'Thur', 'Fri'])

sns.stripplot(x='day', y='total_bill', data=tips, hue='sex', dodge=True, marker='D')

sns.stripplot(x='day', y='total_bill', data=tips, hue='sex', dodge=True, marker='D', size=10)

sns.stripplot(x='day', y='total_bill', data=tips, hue='sex', dodge=True, marker='D', size=10, edgecolor='gray', alpha=0.3)

#Combining Strip Plot and Box Plot

sns.stripplot(x='day', y='total_bill', data=tips)
sns.boxplot(x='day', y='total_bill', data=tips)

sns.stripplot(x='day', y='total_bill', data=tips, jitter=False, palette='husl', color='0.1')  #gray levels must be passed as strings
sns.boxplot(x='day', y='total_bill', data=tips)

#Combining Strip Plot and Violin Plot

sns.stripplot(x='day', y='total_bill', data=tips, jitter=False, palette='husl', color='0.1')  #gray levels must be passed as strings
sns.violinplot(x='day', y='total_bill', data=tips)

Friday, 12 April 2019

Creating Pandas DataFrame using CSV, Excel, Dictionary, List and Tuple

We can create a pandas data frame in different ways. We can load data from CSV and Excel files. We can also create a data frame from a dictionary, a list of dictionaries or a list of tuples. Following are some examples of loading data into a pandas data frame:

Creating Pandas DataFrame using CSV

data_frame_csv = pd.read_csv("dataset.csv")
data_frame_csv 

Creating Pandas DataFrame using Excel Sheet

data_frame_xlsx = pd.read_excel("dataset.xlsx", "Sheet1")
data_frame_xlsx 

Note: The second argument is the sheet name. If you omit it, pandas reads the first sheet of the Excel file.

Creating Pandas DataFrame using Python Dictionary

dataset={
'day' : ['Sunday', 'Monday', 'Tuesday'],
'temperature' : [31, 25, 32],
'windspeed' : [6, 7, 5],
'event' : ['Rain', 'Sunny', 'Humid']
}

data_frame_dictionary = pd.DataFrame(dataset)
data_frame_dictionary

Creating Pandas DataFrame using Python List of Dictionary

dataset=[
{'day' : 'Sunday',  'temperature' : 31, 'windspeed' : 6, 'event' : 'Rain'},
{'day' : 'Monday', 'temperature' : 25, 'windspeed' : 7, 'event' : 'Sunny'},
{'day' : 'Tuesday', 'temperature' : 32, 'windspeed' : 5, 'event' : 'Humid'}
]

data_frame_dictionary_list = pd.DataFrame(dataset)
data_frame_dictionary_list

Creating Pandas DataFrame using Python List of Tuples

dataset=[
('Sunday',  31, 6, 'Rain'),
('Monday',  25, 7, 'Sunny'),
('Tuesday', 32, 5, 'Humid')
]

data_frame_tuple_list = pd.DataFrame(dataset, columns=['day', 'temperature', 'windspeed', 'event'])
data_frame_tuple_list

Note: You need to specify column names explicitly, otherwise the columns are labeled 0, 1, 2 and so on.
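As a quick check that the dictionary, list of dictionaries and list of tuples approaches produce the same data frame, the sketch below builds all three from the weather data used above and compares them. The variable names are chosen here for illustration.

```python
import pandas as pd

columns = ["day", "temperature", "windspeed", "event"]

from_dict = pd.DataFrame({
    "day": ["Sunday", "Monday", "Tuesday"],
    "temperature": [31, 25, 32],
    "windspeed": [6, 7, 5],
    "event": ["Rain", "Sunny", "Humid"],
})

from_records = pd.DataFrame([
    {"day": "Sunday", "temperature": 31, "windspeed": 6, "event": "Rain"},
    {"day": "Monday", "temperature": 25, "windspeed": 7, "event": "Sunny"},
    {"day": "Tuesday", "temperature": 32, "windspeed": 5, "event": "Humid"},
])

rows = [("Sunday", 31, 6, "Rain"), ("Monday", 25, 7, "Sunny"), ("Tuesday", 32, 5, "Humid")]
from_tuples = pd.DataFrame(rows, columns=columns)
unnamed = pd.DataFrame(rows)  # no column names given

print(from_dict.equals(from_records), from_dict.equals(from_tuples))
print(list(unnamed.columns))
```

The last line shows what happens when the column names are left out for a list of tuples: pandas falls back to integer labels.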

Documentation: Pandas IO Tools

Wednesday, 10 April 2019

How to create bins for continuous numeric variables using cut function of Pandas library?

In the binning technique, we divide continuous numeric values into groups or ranges called bins. It helps in better understanding some continuous numeric features. To know more about the binning technique, you can visit my earlier post, where I have covered the theory. Today, we will see how to create bins using the cut function of the pandas library.

Consider a Loan Prediction dataset. We will create bins of the LoanAmount variable. We will divide it into four bins: low, medium, high, very high.

Step 1: Import the required libraries

import pandas as pd
import numpy as np

Step 2: Load the dataset

dataset = pd.read_csv("C:/train_loan_prediction.csv")

Step 3: Create bins of a numeric variable using cut function

We will define cut points for our variable and pass them to a binning function, which creates the bins based on those cut points.

#Create a binning function
def binning(col, cut_points, labels=None):
  
  #Define min and max values:
  minval = col.min()
  maxval = col.max()

  #Create a list by adding min and max to cut_points
  break_points = [minval] + cut_points + [maxval]

  #If no labels provided, use default labels 0 ... (n-1)
  if not labels:
    labels = range(len(cut_points)+1)

  #Binning using cut function of pandas
  colBin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True)
  return colBin

#Binning LoanAmount variable:
cut_points = [90,140,190]
labels = ["low","medium","high","very high"]
dataset["LoanAmount_Bin"] = binning(dataset["LoanAmount"], cut_points, labels)
print(dataset["LoanAmount_Bin"].value_counts(sort=False))

In the above code, we have passed 3 cut points, which creates 4 bins. The cut function uses right-inclusive intervals by default, and include_lowest=True makes the first bin include the minimum value:
First bin contains all the values from the minimum value up to and including 90 (Label: low).
Second bin contains all the values greater than 90 and up to 140 (Label: medium).
Third bin contains all the values greater than 140 and up to 190 (Label: high).
Fourth bin contains all the values greater than 190 and up to the maximum value (Label: very high).

Instead of "low", "medium", "high" and "very high" labels, you can pass numeric values like 0, 1, 2 and 3 etc.

Now print the new variable dataset["LoanAmount_Bin"] and see the results. Instead of actual values, you will see labels in the data.
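To check how values that fall exactly on a cut point are assigned, here is a minimal sketch that applies the same cut points to a handful of made-up loan amounts (the values are hypothetical, not from the Loan Prediction file).

```python
import pandas as pd

# Hypothetical loan amounts, including values exactly on the cut points
amounts = pd.Series([50, 90, 91, 140, 141, 190, 191, 300])

cut_points = [90, 140, 190]
labels = ["low", "medium", "high", "very high"]

# Same construction as in the binning function: min + cut points + max
break_points = [amounts.min()] + cut_points + [amounts.max()]
binned = pd.cut(amounts, bins=break_points, labels=labels, include_lowest=True)

print(binned.astype(str).tolist())
```

The values 90, 140 and 190 land in the lower of their two neighboring bins, because each interval includes its right edge.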

How to encode and transform all the categorical variables to numeric variables using LabelEncoder?

Most Machine Learning algorithms, including those in the Scikit Learn library, require all inputs to be numeric, so we should convert all our categorical variables into numeric variables by encoding the categories. Before that, please make sure that you have imputed all the missing values in all the categorical variables. We will use LabelEncoder, which is present in the Scikit Learn library, to encode and transform categorical variables.

Consider a Loan Prediction dataset. We will encode and transform all the categorical variables to numeric variables.

Step 1: Import the required libraries

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

Step 2: Load the dataset

dataset = pd.read_csv("C:/train_loan_prediction.csv")

Step 3: Encode categorical variables using LabelEncoder

Categorical variables are Gender, Married, Dependents, Education, Self_Employed, Property_Area, Loan_Status. Let's encode and transform all these categorical variables to numeric variables in one go using the following Python code.

categorical_vars = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
label_encoder = LabelEncoder()
for i in categorical_vars:
    dataset[i] = label_encoder.fit_transform(dataset[i])

Now, look at the datatypes of variables:

dataset.dtypes 

You will see that the datatype of all the categorical variables has changed from object to an integer datatype such as int32 or int64. So, now our dataset is ready for Machine Learning algorithms.
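The loop above reuses a single LabelEncoder, so after it finishes the encoder only remembers the categories of the last column. If you want to map the numbers back to the original labels later, one option is to keep a separate encoder per column. The following is a minimal sketch of that variation on a made-up two-column data frame (the rows and the encoders dictionary are assumptions for illustration).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows; only the column names follow the Loan Prediction dataset
demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes"],
})

encoders = {}  # one fitted encoder per column
for col in ["Gender", "Married"]:
    encoders[col] = LabelEncoder()
    demo[col] = encoders[col].fit_transform(demo[col])

print(demo["Gender"].tolist())            # integer codes
print(list(encoders["Gender"].classes_))  # categories in sorted order

# Map the codes back to the original labels
restored = encoders["Gender"].inverse_transform(demo["Gender"])
print(list(restored))
```

LabelEncoder assigns codes in sorted order of the category names, so "Female" becomes 0 and "Male" becomes 1 here.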

Related: Difference between Label Encoder and One Hot Encoder

Tuesday, 9 April 2019

Boolean Indexing: How to filter Pandas Data Frame?

We can easily filter out any subset of data from a pandas data frame. For example, we can select the values of some columns based on conditions on another set of columns. Boolean indexing is very useful here.

Consider a Loan Prediction dataset. We will filter out the data based on some condition using boolean indexing.

Step 1: Import the required libraries

import pandas as pd
import numpy as np

Step 2: Load the dataset

dataset = pd.read_csv("C:/train_loan_prediction.csv")

Step 3: Filter data using boolean indexing

Suppose we want a list of all females who are not graduates and got a loan. Let's use boolean indexing to filter out the data. You can use the following code:

dataset.loc[(dataset["Gender"]=="Female") & (dataset["Education"]=="Not Graduate") & (dataset["Loan_Status"]=="Y"), ["Gender","Education","Loan_Status"]]

The above code selects all the females who are not graduates and whose loan status is approved. It displays only three columns: "Gender", "Education" and "Loan_Status". You can display any number of columns based on your requirement. Please try other conditions to filter the data for the sake of practice.
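To see what the combined condition looks like before it is passed to loc, the sketch below builds the boolean mask separately on a four-row made-up data frame (the rows are hypothetical; the column names follow the Loan Prediction dataset).

```python
import pandas as pd

# Hypothetical applicants
demo = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Female"],
    "Education": ["Not Graduate", "Not Graduate", "Graduate", "Not Graduate"],
    "Loan_Status": ["Y", "Y", "Y", "N"],
})

# Each comparison yields a True/False Series; & combines them row by row
mask = (
    (demo["Gender"] == "Female")
    & (demo["Education"] == "Not Graduate")
    & (demo["Loan_Status"] == "Y")
)
print(mask.tolist())

# Keep only the rows where the mask is True, and only the listed columns
result = demo.loc[mask, ["Gender", "Education", "Loan_Status"]]
print(result)
```

The parentheses around each comparison are required, because & binds more tightly than == in Python.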

How to find missing values in each row and column using Apply function in Pandas library?

The apply function passes each row or column of a data frame to a function and collects the returned values. The function can be built-in, user-defined or a lambda. We will create a user-defined function which counts missing values and returns the count. First we will call this function for all columns and then for all rows using the apply function.

Consider a Loan Prediction dataset. We will try to find out the count of missing values in each row and column using the apply function.

Step 1: Import the required libraries

import pandas as pd
import numpy as np

Step 2: Load the dataset

dataset = pd.read_csv("C:/train_loan_prediction.csv")

Step 3: Create a function which returns count of missing values

def num_missing(x):
  return sum(x.isnull())
  
Step 4: Find out number of missing values in each column
  
print("Missing values per column:")
print(dataset.apply(num_missing, axis=0)) 

axis=0 means that the function is applied to each column.

Step 5: Find out number of missing values in each row

print("Missing values per row:")
print(dataset.apply(num_missing, axis=1).head()) 

axis=1 means that the function is applied to each row.
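The two apply calls can be tried without the CSV file. The sketch below runs the same num_missing function on a small made-up data frame with a few missing entries (the rows are hypothetical; the column names follow the Loan Prediction dataset).

```python
import numpy as np
import pandas as pd

def num_missing(x):
    # Count the missing entries in one row or one column
    return sum(x.isnull())

# Hypothetical data with some gaps
demo = pd.DataFrame({
    "Gender": ["Male", None, "Female"],
    "LoanAmount": [120.0, np.nan, np.nan],
})

per_column = demo.apply(num_missing, axis=0)  # one count per column
per_row = demo.apply(num_missing, axis=1)     # one count per row

print(per_column)
print(per_row)
```

The column counts are indexed by column name, while the row counts are indexed by the row labels of the data frame.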

You can also use lambda function with apply. Here is an example.