Monday, 25 March 2019

Data Exploration using Pandas Library in Python

Exploratory data analysis is an essential step when building a Machine Learning model. The pandas library provides methods and attributes such as head, tail, shape, columns, info, dtypes, describe, mean, var, std and corr for data exploration in Python.

We will load the Ames Housing dataset into a pandas DataFrame and then explore it.

Load the dataset
import pandas as pd
dataset = pd.read_csv('train.csv')
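
If train.csv is not at hand, the same call can be tried on a small in-memory sample. The sketch below assumes a hypothetical four-row extract: the column names follow the Ames data, but the name `sample` and the values are illustrative only.

```python
import io
import pandas as pd

# Hypothetical sample rows; values are for illustration only.
csv_text = """Id,LotArea,LotShape,Street,PoolQC,SalePrice
1,8450,Reg,Pave,,208500
2,9600,Reg,Pave,,181500
3,11250,IR1,Pave,Gd,223500
4,9550,IR1,Pave,,140000
"""

# read_csv accepts a file path or any file-like object.
# Empty fields (here in PoolQC) are read as missing values (NaN).
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```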

Display rows and columns of the dataset
dataset  #displays all the rows and columns
dataset[['LotArea', 'LotShape']]  #displays all the rows and two columns

dataset.head()  #displays top 5 rows and all the columns
dataset.head(20)  #displays top 20 rows and all the columns
dataset['LotArea'].head()  #displays top 5 rows and one column
dataset[['LotArea', 'LotShape']].head() #displays top 5 rows and two columns

Similarly, tail() is used to display the bottom rows of the dataset.
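
As a rough illustration of tail(), the sketch below uses a small made-up DataFrame (`df` is a hypothetical stand-in for the housing data) and mirrors the head() calls shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for the housing data
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "LotShape": ["Reg", "Reg", "IR1", "IR1", "IR1", "IR1"],
})

last_five = df.tail()              # bottom 5 rows (the default) and all the columns
last_two = df.tail(2)              # bottom 2 rows and all the columns
last_area = df["LotArea"].tail(3)  # bottom 3 rows of one column

print(last_two)
```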

Display number of rows and columns in the dataset
dataset.shape

Display number of rows in the dataset
dataset.shape[0]

Display number of columns in the dataset
dataset.shape[1]
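
Since shape is a plain tuple, both counts can also be unpacked in one step. A minimal sketch on a hypothetical three-row DataFrame (`df` is made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the housing data
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotShape": ["Reg", "Reg", "IR1"],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# len() on a DataFrame also gives the number of rows
row_count = len(df)
```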

Display the list of all the columns in the dataset
dataset.columns

Display summary of the columns in the dataset
dataset.info()

Display datatypes of all the variables
dataset.dtypes
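
Once the datatypes are known, it is often handy to split the columns into numeric and non-numeric groups. One possible way, sketched here on a hypothetical DataFrame, is select_dtypes:

```python
import pandas as pd

# Hypothetical stand-in for the housing data
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotShape": ["Reg", "Reg", "IR1"],
    "SalePrice": [208500.0, 181500.0, 223500.0],
})

# Columns holding integers or floats
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Everything else (text, categories, ...)
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```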

Display statistical summary of the dataset like count, mean, standard deviation, min, max etc.
dataset.describe() 
dataset['LotArea'].describe()  #for one column
dataset[['Id', 'LotArea', 'Street', 'MSSubClass', 'SaleType']].describe()  #for some of the columns; the non-numeric columns are simply ignored
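
The non-numeric columns can be summarised too. When a DataFrame contains no numeric columns, describe() reports count, unique, top (the most frequent value) and freq instead. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical stand-in for the housing data
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotShape": ["Reg", "IR1", "IR1"],
    "Street": ["Pave", "Pave", "Pave"],
})

# Keep only the non-numeric columns, then summarise them
cat_summary = df.select_dtypes(exclude="number").describe()
print(cat_summary)
```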

Display mean, variance and standard deviation
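dataset.mean(numeric_only=True)  #numeric_only=True skips text columns; pandas 2.0+ raises an error without it
dataset.var(numeric_only=True)
dataset.std(numeric_only=True)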

Display correlation
dataset.select_dtypes(include='number').corr()  #correlation is computed on the numeric columns only
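
The full correlation matrix is large for a dataset with many columns, so a common follow-up is to look at one column of it, for example the correlation of every numeric feature with the sale price. The sketch below assumes a hypothetical DataFrame with made-up values; `target_corr` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for the housing data
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "OverallQual": [7, 6, 7, 7, 8],
    "LotShape": ["Reg", "Reg", "IR1", "IR1", "IR1"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Correlation is only defined for numeric columns
numeric = df.select_dtypes(include="number")

# Take the SalePrice column of the matrix, drop its self-correlation,
# and sort from the strongest positive correlation downwards
target_corr = (
    numeric.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(target_corr)
```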

Display all the columns having null values
dataset.columns[dataset.isnull().any()]

Display columns with count of null values
dataset.isnull().sum()

Display count of null values in a particular column
dataset['PoolQC'].isnull().sum() 

Display count of null values in some selected columns
dataset[['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu']].isnull().sum() 

Display percentage of null values in each column
miss = dataset.isnull().sum() / len(dataset) * 100
miss = miss[miss > 0]
miss = miss.sort_values()
miss
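
The count and the percentage of null values can also be placed side by side in one table. The following is a minimal sketch on a hypothetical DataFrame; the names `null_count` and `summary` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "Fence": [np.nan, "MnPrv", np.nan, "GdWo"],
})

null_count = df.isnull().sum()

# One row per column: number of nulls and their share of all rows
summary = pd.DataFrame({
    "null_count": null_count,
    "null_percent": null_count / len(df) * 100,
})

# Keep only the columns that have nulls, worst first
summary = summary[summary["null_count"] > 0].sort_values(
    "null_percent", ascending=False
)
print(summary)
```

Here the sort is descending so that the columns with the most missing values come first.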
