Exploratory data analysis is an essential step when building a machine learning model. The pandas library provides methods and attributes such as head, tail, shape, columns, info, dtypes, describe, mean, var, std and corr for exploring data in Python.
We will load the Ames Housing dataset into a pandas DataFrame and then explore it.
Load the dataset
import pandas as pd
dataset = pd.read_csv('train.csv')
Display rows and columns of the dataset
dataset #displays all the rows and columns
dataset[['LotArea', 'LotShape']] #displays all the rows and two columns
dataset.head() #displays top 5 rows and all the columns
dataset.head(20) #displays top 20 rows and all the columns
dataset['LotArea'].head() #displays top 5 rows and one column
dataset[['LotArea', 'LotShape']].head() #displays top 5 rows and two columns
Similarly, tail() displays the bottom rows of the dataset.
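As a small illustration of tail(), here is a sketch on a tiny made-up frame. The name demo and its values are assumptions for the example only, not rows taken from train.csv.

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset (values are illustrative only)
demo = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550, 14260, 14115, 10084],
    'LotShape': ['Reg', 'Reg', 'IR1', 'IR1', 'IR1', 'IR1', 'Reg'],
})

last_five = demo.tail()               # bottom 5 rows and all the columns
last_two = demo.tail(2)               # bottom 2 rows and all the columns
last_area = demo['LotArea'].tail(3)   # bottom 3 rows and one column
print(last_two)
```

Like head(), tail() returns 5 rows when no argument is given.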
Display number of rows and columns in the dataset
dataset.shape
Display number of rows in the dataset
dataset.shape[0]
Display number of columns in the dataset
dataset.shape[1]
Display the list of all the columns in the dataset
dataset.columns
Display summary of the columns in the dataset
dataset.info()
Display datatypes of all the variables
dataset.dtypes
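Once the datatypes are known, it is often handy to separate numeric columns from the rest. A minimal sketch using select_dtypes follows; the frame demo and its values are made up for illustration.

```python
import pandas as pd

# Made-up sample with two numeric columns and one text column
demo = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'Street': ['Pave', 'Pave', 'Grvl'],
    'MSSubClass': [60, 20, 60],
})

# Names of the numeric columns
numeric_cols = demo.select_dtypes(include='number').columns.tolist()
# Names of everything that is not numeric
other_cols = demo.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(other_cols)
```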
Display a statistical summary of the dataset, such as count, mean, standard deviation, min and max
dataset.describe()
dataset['LotArea'].describe() #for one column
dataset[['Id', 'LotArea', 'Street', 'MSSubClass', 'SaleType']].describe() #for some of the columns; non-numeric columns are left out by default
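Since describe() leaves out non-numeric columns by default, one way to summarise them as well is to pass include='all'. The sketch below uses a made-up frame (demo is a hypothetical name), not the real dataset.

```python
import pandas as pd

# Made-up sample mixing numeric and text columns
demo = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
    'MSSubClass': [60, 20, 60, 70],
})

default_summary = demo.describe()            # numeric columns only
full_summary = demo.describe(include='all')  # adds unique, top and freq for text columns

print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example the mean of Street) are shown as NaN.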
Display mean, variance and standard deviation
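dataset.mean(numeric_only=True) #numeric_only=True skips text columns, which raise an error in pandas 2.0 and later
dataset.var(numeric_only=True)
dataset.std(numeric_only=True)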
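The following sketch shows the three statistics on a made-up frame and checks that the variance is the square of the standard deviation. The frame demo and its values are assumptions for illustration; numeric_only=True is passed so that the text column is skipped.

```python
import pandas as pd

# Made-up sample with one text column that the statistics should skip
demo = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'MSSubClass': [60, 20, 60, 70],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
})

means = demo.mean(numeric_only=True)
variances = demo.var(numeric_only=True)   # sample variance (divides by n - 1)
spreads = demo.std(numeric_only=True)     # sample standard deviation

print(means)
print(variances)
print(spreads)
```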
Display correlation
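dataset.corr(numeric_only=True) #numeric_only=True skips text columns, which raise an error in pandas 2.0 and later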
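A common follow-up is to look at how each numeric column correlates with a single column of interest. The sketch below assumes SalePrice is that column and uses made-up values; the numeric_only argument of corr() is available in pandas 1.5 and later.

```python
import pandas as pd

# Made-up sample; the values are illustrative and not taken from train.csv
demo = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550, 14260],
    'OverallQual': [7, 6, 7, 7, 8],
    'SalePrice': [208500, 181500, 223500, 140000, 250000],
    'Street': ['Pave', 'Pave', 'Pave', 'Grvl', 'Pave'],
})

corr = demo.corr(numeric_only=True)   # pairwise correlation of the numeric columns

# Correlation of every other numeric column with SalePrice, strongest first
target_corr = corr['SalePrice'].drop('SalePrice').sort_values(ascending=False)
print(target_corr)
```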
Display all the columns having null values
dataset.columns[dataset.isnull().any()]
Display columns with count of null values
dataset.isnull().sum()
Display count of null values in a particular column
dataset['PoolQC'].isnull().sum()
Display count of null values in some selected columns
dataset[['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu']].isnull().sum()
Display percentage of null values in each column
miss = dataset.isnull().sum() / len(dataset) * 100
miss = miss[miss > 0]
miss.sort_values(inplace=True)
miss
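The count and the percentage of null values can also be put side by side in one table. This is a minimal sketch on a made-up frame; the names demo, null_count and summary are chosen for the example.

```python
import numpy as np
import pandas as pd

# Made-up sample with missing values in two columns
demo = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'Alley': [np.nan, 'Grvl', np.nan, np.nan],
    'Fence': ['MnPrv', np.nan, np.nan, 'GdWo'],
})

null_count = demo.isnull().sum()
summary = pd.DataFrame({
    'null_count': null_count,
    'null_percent': null_count / len(demo) * 100,
})

# Keep only the columns that have nulls, most affected first
summary = summary[summary['null_count'] > 0].sort_values('null_percent', ascending=False)
print(summary)
```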