The Professionals Point: April 2019

Tuesday, 30 April 2019

Data Exploration and Visualization Techniques in Python

Data exploration and visualization is the first step in building a robust Machine Learning model. We need to understand and explore the data using the graphs and plots available in the matplotlib and seaborn libraries. This step takes a lot of time and patience.

Plots and graphs help us to analyze relationships among various variables present in the dataset. We can visualize and analyze missing values, outliers, skewed data, correlation among variables etc.

The main Python libraries used in data exploration and visualization are pandas, matplotlib and seaborn.

Plots are mainly used for three types of analysis: univariate, bivariate and multivariate.

Some commonly used plots and graphs are: Joint Plot, Distribution Plot, Box Plot, Bar Plot, Regression Plot, Strip Plot, Heatmap, Violin Plot, Pair Plot and Grid, Facet Grid.

Visualize missing values

Visualize missing values in Bar Plot using Seaborn Library

Visualize outliers

What are Outliers? How to find and remove outliers using JointPlot in Seaborn Library?

What is Boxplot? How is it used to find outliers in a dataset?

How to visualize outliers in categorical variables using boxplots?

Visualize skewed data

What is Skewness? How to visualize it with Histogram and how to remove it?

How to visualize skewness of numeric variables by plotting histograms?

Log Transforming the Skewed Data to get Normal Distribution

Visualize correlation among variables

How to find Correlation Score and plot Correlation Heatmap using Seaborn Library in Python?

Other links

Boxplot Grouping: Visualizing one variable based on another variable using boxplot

Wednesday, 24 April 2019

Tuples in Python: Indexing, Slicing, Packing, Unpacking, Concatenation, Repetition, Comparison, Membership, Iteration

Tuples are one of the basic data types in Python. Tuples are widely used in Python programming and have very simple syntax. In this article, we will see various Tuple operations like indexing, slicing, packing, unpacking, comparison, concatenation, repetition, updating and deleting a tuple, in-built functions for tuples, membership, iteration etc.

You can also download my Jupyter notebook containing the code below.

Declaration

tup1 = ()
tup2 = (50, )
tup3 = (50, 8)
tup4 = 'a', 'b', 'c', 'd'
x, y = 1, 2
print(tup1, tup2, tup3, tup4)
print(x, y)

Output

() (50,) (50, 8) ('a', 'b', 'c', 'd')
1 2

Indexing and Slicing

tup5 = ('a', 'b', 100, 'abc', 'xyz', 2.5)
tup6 = (1, 2, 3, 4, 5, 6, 7)
print(tup5[0], tup5[1], tup5[-1], tup5[-2])
print(tup6[0:4])
print(tup6[:])
print(tup6[:4])
print(tup6[1:])
print(tup6[1:4])
print(tup6[1:-1])
print(tup6[1:-2])

Output

a b 2.5 xyz
(1, 2, 3, 4)
(1, 2, 3, 4, 5, 6, 7)
(1, 2, 3, 4)
(2, 3, 4, 5, 6, 7)
(2, 3, 4)
(2, 3, 4, 5, 6)
(2, 3, 4, 5)

Packing and Unpacking

In packing, we place values into a new tuple, while in unpacking we extract those values back into variables.

x = ('Google', 208987, 'Software Engineer')
print(x[1])
print(x[-1])
(company, emp_no, profile) = x
print(company)
print(emp_no)
print(profile)

Output

208987
Software Engineer
Google
208987
Software Engineer
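
Unpacking does not need one variable per element. As a small sketch (the record below is made up for illustration), a starred name collects whatever values are left over, and the same mechanism lets two variables be swapped in one line:

```python
# Hypothetical record used only for illustration.
record = ('Google', 208987, 'Software Engineer', 'Bangalore', 'Full-time')

# The starred name collects every remaining value into a list.
company, emp_no, *details = record
print(company)   # Google
print(emp_no)    # 208987
print(details)   # ['Software Engineer', 'Bangalore', 'Full-time']

# Packing on the right and unpacking on the left swaps two variables.
p, q = 1, 2
p, q = q, p
print(p, q)      # 2 1
```

Note that the starred variable is always a list, even though the source is a tuple.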

Comparison

a = (5, 6)
b = (1, 4)
if(a > b): print('a is bigger')
else: print('b is bigger')

Output: a is bigger

a = (5, 6)
b = (5, 4)
if(a > b): print('a is bigger')
else: print('b is bigger')

Output: a is bigger

a = (5, 6)
b = (6, 4)
if(a > b): print('a is bigger')
else: print('b is bigger')

Output: b is bigger
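
The three results above follow from the fact that Python compares tuples element by element, from left to right, and stops at the first position where the values differ. The sketch below imitates that rule with a hypothetical helper, compare_tuples, written only to make the rule visible; in real code you would simply use the comparison operators.

```python
def compare_tuples(a, b):
    """Return 1 if a > b, -1 if a < b, 0 if equal (illustrative helper)."""
    for x, y in zip(a, b):
        if x != y:
            # The first differing position decides the result.
            return 1 if x > y else -1
    # All shared positions are equal: the shorter tuple counts as smaller.
    return (len(a) > len(b)) - (len(a) < len(b))

print(compare_tuples((5, 6), (1, 4)))     # 1  (5 > 1)
print(compare_tuples((5, 6), (5, 4)))     # 1  (5 == 5, then 6 > 4)
print(compare_tuples((5, 6), (6, 4)))     # -1 (5 < 6)
print(compare_tuples((5, 6), (5, 6, 0)))  # -1 (equal prefix, shorter tuple)
```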

Concatenation

a = (1, 1.5)
b = ('abc', 'xyz')
c = a + b
print(c)

Output: (1, 1.5, 'abc', 'xyz')

Repetition

a = (1, 1.5)
b = a * 3
print(b)

Output: (1, 1.5, 1, 1.5, 1, 1.5)

Update Tuple

Tuples are immutable which means you cannot update or change the values of tuple elements. It does not support item assignment.

a = (1, 1.5)
b = ('abc', 'xyz')
a[0] = 2; #TypeError: 'tuple' object does not support item assignment
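
Although a tuple cannot be changed in place, you can build a new tuple that contains the change. The following is a minimal sketch of two common ways to do that; the variable names are chosen for illustration only.

```python
a = (1, 1.5)

# Option 1: slice the part to keep and concatenate it with the new value.
a_sliced = (2,) + a[1:]
print(a_sliced)   # (2, 1.5)

# Option 2: convert to a list, modify it, and convert back to a tuple.
temp = list(a)
temp[0] = 2
a_from_list = tuple(temp)
print(a_from_list)  # (2, 1.5)

# The original tuple is untouched in both cases.
print(a)          # (1, 1.5)
```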

Delete Tuple

Because tuples are immutable, individual elements cannot be deleted, but an entire tuple can be deleted by using the keyword "del".

a = (5, 6)
print(a)
del a
print(a) #NameError: name 'a' is not defined

In-built Functions

a = (5, 2, 8, 3, 6, 2, 5, 5)
print('Length:', len(a))
print('Min:', min(a))
print('Max:', max(a))
print('Count of 5:', a.count(5))
print('Index of 2:', a.index(2))
print('Sorted:', sorted(a))
print('Tuple:', tuple(a))
print('List:', list(a))

Output

Length: 8
Min: 2
Max: 8
Count of 5: 3
Index of 2: 1
Sorted: [2, 2, 3, 5, 5, 5, 6, 8]
Tuple: (5, 2, 8, 3, 6, 2, 5, 5)
List: [5, 2, 8, 3, 6, 2, 5, 5]

Membership

3 in (1, 2, 3)

Output: True

tuple_alphabets = ('a', 'b', 'c', 'd', 'e')
if 'c' in tuple_alphabets:
    print('Found')
else:
    print('Not Found')

Output: Found

Iteration

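Iterating through a tuple works the same way as iterating through a list. Tuples are often said to be slightly faster to iterate over, but in practice any difference is usually small.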

for x in (1, 2, 3):
    print(x)

Output

1
2
3

Tuple in Dictionary

Calling items() on a dictionary returns a view of its key-value pairs, where each pair is a tuple. Wrap the view in list() to get a list of tuples.

a = {'x':100, 'y':200}
b = (a.items())
c = list(a.items())
print(a)
print(b)
print(c)

Output

{'x': 100, 'y': 200}
dict_items([('x', 100), ('y', 200)])
[('x', 100), ('y', 200)]
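
Because each item is a two-element tuple, it can be unpacked directly in a loop. The sketch below also shows the reverse direction (a list of tuples back to a dictionary) and a tuple used as a dictionary key; the grid data is hypothetical.

```python
a = {'x': 100, 'y': 200}

# Each item is a (key, value) tuple, so it can be unpacked in the loop header.
pairs = []
for key, value in a.items():
    print(key, value)
    pairs.append((key, value))

# A list of (key, value) tuples can be turned back into a dictionary.
rebuilt = dict(pairs)
print(rebuilt == a)   # True

# Tuples are hashable when their elements are, so they can serve as keys.
grid = {(0, 0): 'start', (2, 3): 'goal'}  # hypothetical coordinates
print(grid[(2, 3)])   # goal
```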

Friday, 19 April 2019

Data Visualization using Pair Grid and Pair Plot (Seaborn Library)

Let's visualize our data with Pair Grid and Pair Plot, both of which are available in the Seaborn library. We can draw various plots (like scatter plot, histogram and KDE plot) in Pair Grid. By default, Pair Plot shows histograms on the diagonal and scatter plots in the rest of the grid cells.

We can pass various parameters to PairGrid like hue, hue_kws, vars, x_vars, y_vars, palette, marker (diamond, plus sign, circle, square) etc.

We can pass various parameters to pairplot like kind, diag_kind, hue, vars, x_vars, y_vars, height etc.

Let's explore Pair Grid and Pair Plot using the Iris dataset.

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Step 2: Load Iris dataset

iris=sns.load_dataset('iris')
iris.head()

Step 3: Explore data using Pair Grid

Draw scatter plots on all grid cells

x = sns.PairGrid(iris)
x = x.map(plt.scatter)

Draw histograms on diagonals and scatter plots on rest of the grid cells

x = sns.PairGrid(iris)
x = x.map_diag(plt.hist)
x = x.map_offdiag(plt.scatter)

Draw histograms on diagonals, scatter plots at top and KDE plots at bottom

x = sns.PairGrid(iris)
x = x.map_diag(plt.hist)
x = x.map_upper(plt.scatter)
x = x.map_lower(sns.kdeplot)

Add hue and legend

x = sns.PairGrid(iris, hue='species')
x = x.map_diag(plt.hist)
x = x.map_offdiag(plt.scatter)

x = sns.PairGrid(iris, hue='species')
x = x.map_diag(plt.hist)
x = x.map_offdiag(plt.scatter)
x = x.add_legend()

x = sns.PairGrid(iris, hue='species')
x = x.map_diag(plt.hist)
x = x.map_upper(plt.scatter)
x = x.map_lower(sns.kdeplot)
x = x.add_legend()

x = sns.PairGrid(iris, hue='species', palette='Blues_d')
x = x.map_diag(plt.hist, histtype='step', linewidth=2, edgecolor='black')
x = x.map_offdiag(plt.scatter, edgecolor='black')
x = x.add_legend()

x = sns.PairGrid(iris, hue='species', hue_kws={'marker' : ['D', 's', '+']})
x = x.map(plt.scatter, s=30, edgecolor='black')
x = x.add_legend()

Add specific variables

x = sns.PairGrid(iris, vars=['petal_length', 'petal_width'])
x = x.map_diag(plt.hist)
x = x.map_offdiag(plt.scatter)

x = sns.PairGrid(iris, x_vars=['petal_length', 'petal_width'], y_vars=['sepal_length', 'sepal_width'])
x = x.map(plt.scatter)

Step 4: Explore data using Pair Plot

sns.pairplot(iris)

Add regression line to scatter plot

sns.pairplot(iris, kind='reg')

Change the diagonal to KDE (by default it is a histogram)

sns.pairplot(iris, diag_kind='kde')

Add hue parameter

sns.pairplot(iris, hue='species')

sns.pairplot(iris, hue='species', kind='reg')

sns.pairplot(iris, hue='species', kind='reg', diag_kind='kde')

sns.pairplot(iris, hue='species', kind='reg', diag_kind='hist')

Add specific variables

sns.pairplot(iris, vars=['petal_length', 'petal_width'], height=4)

sns.pairplot(iris, x_vars=['petal_length', 'petal_width'], y_vars=['sepal_length', 'sepal_width'])

You can download my Jupyter notebook from here. I also recommend trying the above code with the Tips dataset.

Thursday, 18 April 2019

Data Visualization using Regression Plot (Seaborn Library)

Let's visualize our data with Regression Plot, which is available in the Seaborn library. By default, Regression Plot draws a scatter plot together with a best-fit line (regression line) through the data points.

We can pass various parameters to regplot like confidence interval (ci), estimators (mean, median etc.), jitter, color, marker (diamond, plus sign, circle, square) etc. We can also style the scatter points and the regression line separately using the scatter_kws and line_kws parameters (for example, linewidth is passed through line_kws).
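
To get a feel for what a best-fit line is, here is a minimal sketch of an ordinary least-squares fit in plain Python. It is not regplot's own code: fit_line is a hypothetical helper, and the points are made up so that they lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

With real, noisy data the points no longer sit on the line, and the fitted line is the one that minimizes the sum of squared vertical distances to them.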

Let's explore Regression Plot using the Tips dataset.

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Step 2: Load Tips dataset

tips=sns.load_dataset('tips')
tips.head()

Step 3: Explore data using Regression Plot

sns.regplot(x='total_bill', y='tip', data=tips)

Specify confidence interval

sns.regplot(x='total_bill', y='tip', data=tips, ci=95)

Specify estimators like mean, median etc.

sns.regplot(x='total_bill', y='tip', data=tips, x_estimator=np.mean)
sns.regplot(x='total_bill', y='tip', data=tips, x_estimator=np.median)

Specify jitter parameter

sns.regplot(x='size', y='total_bill', data=tips)
sns.regplot(x='size', y='total_bill', data=tips, x_jitter=True)
sns.regplot(x='size', y='total_bill', data=tips, x_jitter=0.3)

Cosmetic parameters like color, marker, line width etc.

sns.regplot(x='total_bill', y='tip', data=tips, color='purple')

sns.regplot(x='total_bill', y='tip', data=tips, marker='D') #diamond
sns.regplot(x='total_bill', y='tip', data=tips, marker='+') #plus sign
sns.regplot(x='total_bill', y='tip', data=tips, marker='o') #circle
sns.regplot(x='total_bill', y='tip', data=tips, marker='s') #square

sns.regplot(x='total_bill', y='tip', data=tips, marker='D',
            scatter_kws={'color': 'blue'},
            line_kws={'color': 'red', 'linewidth': 3.1})

You can download my Jupyter notebook from here. I also recommend trying the above code with the Iris dataset.

Data Visualization using FacetGrid (Seaborn Library)

Let's visualize our data with Facet Grid, which is available in the Seaborn library. Facet Grid can be used with Histogram, Scatter Plot, Regression Plot, Box Plot etc.

We can pass various parameters to FacetGrid like row, col, col_order, hue, palette, height, aspect etc. To add a legend to Facet Grid, you can use the add_legend() method.

Let's explore Facet Grid with the Tips dataset.

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Step 2: Load Tips dataset

tips=sns.load_dataset('tips')
tips.head()

Step 3: Explore data using Facet Grid

Facet Grid with Histogram

x = sns.FacetGrid(tips, row='smoker', col='time')
x = x.map(plt.hist, 'total_bill')

x = sns.FacetGrid(tips, row='smoker', col='time')
x = x.map(plt.hist, 'total_bill', bins=15, color='green')

Facet Grid with Scatter Plot

x = sns.FacetGrid(tips, row='smoker', col='time')
x = x.map(plt.scatter, 'total_bill', 'tip')

Facet Grid with Regression Plot

x = sns.FacetGrid(tips, row='smoker', col='time', height=6, aspect=0.7)
x = x.map(sns.regplot, 'total_bill', 'tip')

x = sns.FacetGrid(tips, col='time', hue='smoker', palette='husl')
x = x.map(sns.regplot, 'total_bill', 'tip')

x = sns.FacetGrid(tips, col='time', hue='smoker')
x = x.map(sns.regplot, 'total_bill', 'tip').add_legend()

Facet Grid with Box Plot

x = sns.FacetGrid(tips, col='day', height=10, aspect=0.2)
x = x.map(sns.boxplot, 'time', 'total_bill')

x = sns.FacetGrid(tips, col='day', height=10, aspect=0.2, col_order=['Sat', 'Sun', 'Thur', 'Fri'])
x = x.map(sns.boxplot, 'time', 'total_bill', color='red')

You can download my Jupyter notebook from here. I also recommend trying the above code with the Iris dataset.

Wednesday, 17 April 2019

Data Visualization using Violin Plot (Seaborn Library)

Let's visualize our data with Violin Plot, which is available in the Seaborn library.

We can pass various parameters to violinplot like hue, split, inner (quartile, stick), scale, scale_hue, bandwidth (bw), palette, order etc.

Let's explore Violin Plot using the Tips dataset.

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Step 2: Load Tips dataset

tips=sns.load_dataset('tips')
tips.head()

Step 3: Explore data using Violin Plot

sns.violinplot(x=tips['tip'])

sns.violinplot(x='day', y='total_bill', data=tips)

Add hue and split parameter

sns.violinplot(x='day', y='total_bill', data=tips, hue='sex')

sns.violinplot(x='day', y='total_bill', data=tips, hue='sex', split=True)

sns.violinplot(x='day', y='total_bill', data=tips, hue='sex', palette='RdBu')

sns.violinplot(x='day', y='total_bill', data=tips, hue='sex', order=['Sat', 'Sun', 'Thur', 'Fri'])

Add inner and scale parameter

sns.violinplot(x='day', y='total_bill', data=tips, hue='smoker', inner='quartile')

sns.violinplot(x='day', y='total_bill', data=tips, hue='smoker', inner='quartile', split=True)

sns.violinplot(x='day', y='total_bill', data=tips, hue='smoker', inner='quartile', split=True, scale='count')

sns.violinplot(x='day', y='total_bill', data=tips, hue='smoker', inner='stick', split=True, scale='count')

sns.violinplot(x='day', y='total_bill', data=tips, hue='smoker', inner='stick', split=True, scale='count', scale_hue=False)

sns.violinplot(x='day', y='total_bill', data=tips, hue='smoker', inner='stick', split=True, scale='count', scale_hue=False, bw=0.1)

You can download my Jupyter notebook from here. I also recommend trying the above code with the Iris dataset.

Data Visualization using Heatmap (Seaborn Library)

Let's visualize our data with Heatmap, which is available in the Seaborn library. A heatmap draws each value of a matrix as a colored cell. Which colors stand for higher and lower values depends on the color map (cmap) in use, so read the color bar next to the plot rather than assuming that darker always means higher. We will use the Flights dataset and analyze it through a heatmap. We can pass various parameters to heatmap like annot, fmt, vmin, vmax, cbar, cmap, linewidths, center etc. Let's explore heatmap in detail:

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Step 2: Load Flights dataset

flights = sns.load_dataset('flights')
flights.head()
flights.tail()

Step 3: Explore data using Heat Map

Please note that I am not displaying the resulting maps in this post. Please explore it yourself in your Jupyter notebook.

Before exploring the Flights dataset with Heatmap, let's first analyze some random numbers using Heatmap:

numbers = np.random.randn(12, 15)
numbers

sns.heatmap(numbers)

sns.heatmap(numbers, annot=True) #to show actual values in the heatmap

sns.heatmap(numbers, annot=True, vmin=0, vmax=2) #to set the range of the color scale; by default it is inferred from the data

sns.heatmap(numbers, cbar=False) #to hide the color bar

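Now, let's jump to our Flights dataset. Let's pivot this dataset so that we have "year" on x-axis and "month" on y-axis.

flights = flights.pivot(index='month', columns='year', values='passengers') #keyword arguments; recent pandas versions no longer accept these positionally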
flights
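
To see what this kind of reshaping does, here is a minimal pure-Python sketch of turning long-form records into a wide table with one row per month and one column per year. It only illustrates the idea and is not how pandas implements pivot; the records are made up.

```python
# Made-up long-form records: (month, year, passengers).
records = [
    ('Jan', 2001, 10), ('Feb', 2001, 12),
    ('Jan', 2002, 14), ('Feb', 2002, 17),
]

# Wide form: one row per month, one column per year.
wide = {}
for month, year, passengers in records:
    wide.setdefault(month, {})[year] = passengers

# Print the table with years as column headers.
years = sorted({year for _, year, _ in records})
print('month', *years)
for month, row in wide.items():
    print(month, *(row[y] for y in years))
```

Each cell of the wide table is what the heatmap later draws as one colored square.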

sns.heatmap(flights)

sns.heatmap(flights, annot=True)

sns.heatmap(flights, annot=True, fmt='d') #format the annotation to contain only digits

sns.heatmap(flights, annot=True, fmt='d', linewidths=0.9) #add linewidth to heatmap

sns.heatmap(flights, annot=True, fmt='d', linewidths=0.9, cmap='RdBu') #add color map to heatmap to change the color

sns.heatmap(flights, annot=True, fmt='d', linewidths=0.9, cmap='summer')

sns.heatmap(flights, annot=True, fmt='d', linewidths=0.9, cmap='winter_r')

sns.heatmap(flights, annot=True, fmt='d', linewidths=0.9, cmap='coolwarm')

sns.heatmap(flights, annot=True, fmt='d', center=flights.loc['June', 1954]) #center the color map on a particular cell; use 'Jun' if your seaborn version abbreviates month names

Tuesday, 16 April 2019

Data Visualization using Joint Plot (Seaborn Library)

Let's visualize our data with Joint Plot, which is available in the Seaborn library. By default, Joint Plot uses Scatter Plot and Histogram. Joint Plot can also display data using Kernel Density Estimate (KDE) and Hexagons. We can also draw a Regression Line in Scatter Plot. By using the spearmanr function from scipy.stats, we can display the correlation between two variables.

We can pass various parameters to jointplot like kind (reg, hex, kde), stat_func (spearmanr), color, height, ratio etc.

Spearmanr Parameter

The spearmanr function computes the Spearman rank correlation between two variables.

Value varies between -1 and +1 with 0 implying no correlation.

Correlations of -1 or +1 imply an exact monotonic relationship.

Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

Spearmanr correlation does not assume that both variables are normally distributed.

For more details on spearmanr parameter, please visit documentation.
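
The properties listed above can be checked with a small sketch. The code below is a plain-Python illustration of Spearman correlation (Pearson correlation computed on ranks), not scipy's implementation; average_ranks, pearson and spearman are hypothetical helpers and the data is made up.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

x = [1, 2, 3, 4, 5]
y_cubed = [v ** 3 for v in x]          # monotonic but not linear
y_reversed = list(reversed(y_cubed))   # strictly decreasing

print(round(spearman(x, y_cubed), 6))     # 1.0
print(round(spearman(x, y_reversed), 6))  # -1.0
```

Because only the ranks matter, the cubic relationship still scores +1 even though it is far from a straight line.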

Let's explore Joint Plot using the Tips dataset.

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import spearmanr

Step 2: Load Tips dataset

tips=sns.load_dataset('tips')
tips.head()

Step 3: Explore data using Joint Plot

sns.jointplot(x='total_bill', y='tip', data=tips)

Add regression line to scatter plot and kernel density estimate to histogram

sns.jointplot(x='total_bill', y='tip', data=tips, kind='reg')

Display kernel density estimate instead of scatter plot and histogram

sns.jointplot(x='total_bill', y='tip', data=tips, kind='kde')

Display hexagons instead of points in scatter plot

sns.jointplot(x='total_bill', y='tip', data=tips, kind='hex')

Display correlation using spearmanr function

sns.jointplot(x='total_bill', y='tip', data=tips, stat_func=spearmanr) #stat_func works only in older seaborn versions; later releases removed it

Cosmetic parameters like color, height and ratio

sns.jointplot(x='total_bill', y='tip', data=tips, color='green')

sns.jointplot(x='total_bill', y='tip', data=tips, ratio=4, height=6) #height replaces the older size parameter

You can download my Jupyter notebook from here. I also recommend trying the above code with the Iris dataset.

Monday, 15 April 2019

Data Visualization using Distribution Plot (Seaborn Library)

Let's visualize our data with Distribution Plot, which is available in the Seaborn library. By default, Distribution Plot draws a Histogram with a KDE (Kernel Density Estimate) curve on top of it. We can specify the number of bins for the histogram as per our requirement. Please note that Distribution Plot is a univariate plot.

We can pass various parameters to distplot like bins, hist, kde, rug, vertical, color etc.
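
For readers new to KDE, the sketch below shows the idea in plain Python: a small Gaussian bump is placed on every data point and the bumps are averaged into one smooth curve. This is an illustration only, not seaborn's implementation; gaussian_kde is a hypothetical helper, the samples are made up, and the bandwidth is fixed by hand, whereas seaborn picks one automatically.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of samples at a point x."""
    n = len(samples)

    def density(x):
        total = 0.0
        for s in samples:
            z = (x - s) / bandwidth
            # Standard normal bump centered on the sample s.
            total += math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        return total / (n * bandwidth)

    return density

samples = [-1.2, -0.4, 0.0, 0.3, 0.5, 1.8]  # made-up data
density = gaussian_kde(samples, bandwidth=0.5)

# The curve is higher where the samples cluster.
print(density(0.0) > density(3.0))  # True

# A density should integrate to about 1; check with a simple Riemann sum.
step = 0.01
area = sum(density(-10 + i * step) * step for i in range(2001))
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one gives a smoother curve.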

Let's explore Distribution Plot by generating 150 random numbers.

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Step 2: Generate 150 random numbers

num = np.random.randn(150)
num

Step 3: Explore data using Distribution Plot

sns.distplot(num)

Specify number of bins

sns.distplot(num, bins=20)

Remove histogram from distribution plot

sns.distplot(num, hist=False)

Remove KDE from distribution plot

sns.distplot(num, kde=False)

Add rug parameter to distribution plot

sns.distplot(num, hist=False, rug=True)

Add label to distribution plot

label_dist = pd.Series(num, name="variable x")
sns.distplot(label_dist)

Change orientation of distribution plot

sns.distplot(label_dist, vertical=True)

Add cosmetic parameter: color

sns.distplot(label_dist, color='red')

You can download my Jupyter notebook from here. I also recommend trying the above code with the Tips and Iris datasets.

Data Visualization using Bar Plot (Seaborn Library)

Let's visualize our data with Bar Plot, which is available in the Seaborn library.

We can pass various parameters to barplot like hue, confidence interval (ci), capsize, estimator (mean, median etc.), order, palette, color, saturation etc.

Let's explore Bar Plot using the Tips dataset.

Step 1: Import required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Step 2: Load Tips dataset

tips=sns.load_dataset('tips')
tips.head()

Step 3: Explore data using Bar Plot

sns.barplot(x='day', y='total_bill', data=tips)

Horizontal Bar Plot

sns.barplot(x='total_bill', y='day', data=tips)

Set color and saturation level

sns.barplot(x='day', y='total_bill', data=tips, color='green')

sns.barplot(x='day', y='total_bill', data=tips, color='green', saturation=0.3)

By default, estimator is mean, you can also set it to median or anything else

sns.barplot(x='day', y='total_bill', data=tips, estimator=np.median)

Add hue parameter

sns.barplot(x='day', y='total_bill', data=tips, hue='sex')

sns.barplot(x='day', y='total_bill', data=tips, hue='sex', palette='autumn')

sns.barplot(x='day', y='total_bill', data=tips, hue='sex', color='green')

sns.barplot(x='day', y='total_bill', data=tips, hue='sex', palette='spring', order=['Sat', 'Sun', 'Thur', 'Fri'])

sns.barplot(x='sex', y='total_bill', data=tips, hue='sex', palette='spring', order=['Male', 'Female'])

Add confidence interval and capsize parameter

Black lines in the bar plot represent error bars. We can set the capsize and confidence interval (ci) of the error bars. A confidence interval is a range of values, derived from sample statistics, that is likely to contain the value of an unknown population parameter.

sns.barplot(x='day', y='total_bill', data=tips, ci=99)
sns.barplot(x='day', y='total_bill', data=tips, ci=34)

sns.barplot(x='day', y='total_bill', data=tips, capsize=0.3)
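
Seaborn estimates these intervals by bootstrapping (resampling the data with replacement). The sketch below shows the general idea in plain Python so you can see why ci=99 gives longer error bars than ci=34. It is an illustration under simplifying assumptions, not seaborn's own code; bootstrap_ci is a hypothetical helper and the bills are made up.

```python
import random

def bootstrap_ci(values, ci=95, n_boot=1000, seed=0):
    """Percentile bootstrap interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    tail = (100 - ci) / 200  # fraction of resampled means cut from each end
    low = means[int(tail * n_boot)]
    high = means[min(n_boot - 1, int((1 - tail) * n_boot))]
    return low, high

bills = [12.5, 18.0, 21.3, 9.8, 30.1, 16.4, 24.9, 14.2]  # made-up data

low99, high99 = bootstrap_ci(bills, ci=99)
low34, high34 = bootstrap_ci(bills, ci=34)
print(low99, high99)  # wide interval
print(low34, high34)  # narrow interval inside the wide one
```

A higher confidence level keeps more of the resampled means, so the interval, and therefore the error bar, gets longer.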

You can download my Jupyter notebook from here. I also recommend trying the above code with the Iris dataset.