Wednesday, 10 April 2019

How to create bins for continuous numeric variables using the cut function of the pandas library

In binning, we divide continuous numeric values into groups or ranges called bins. Binning can make a continuous numeric feature easier to understand and summarize. To learn more about the technique, you can visit this post, where I have covered the theory in detail. Today, we will see how to create bins using the cut function of the pandas library.

Consider a Loan Prediction dataset. We will create bins for the LoanAmount variable, dividing it into four bins: low, medium, high, and very high.

Step 1: Import the required libraries

import pandas as pd
import numpy as np

Step 2: Load the dataset

dataset = pd.read_csv("C:/train_loan_prediction.csv")

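Step 3: Create bins of a numeric variable using the cut function

We will define cut points for our variable and pass them to a binning function, which adds the column's minimum and maximum as the outer edges and creates the bins with pd.cut. The cut points must be in increasing order and must lie strictly between the column's minimum and maximum, because pd.cut requires bin edges that increase monotonically.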

# Create a binning function
def binning(col, cut_points, labels=None):
    # Outer bin edges: the smallest and largest values in the column
    # (min() and max() skip missing values)
    minval = col.min()
    maxval = col.max()

    # Build the full list of bin edges by adding min and max to cut_points
    break_points = [minval] + cut_points + [maxval]

    # If no labels are provided, use default labels 0 ... (n-1)
    if labels is None:
        labels = list(range(len(cut_points) + 1))

    # Bin using the cut function of pandas; include_lowest=True puts
    # the minimum value itself into the first bin
    col_bin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True)
    return col_bin

# Binning the LoanAmount variable:
cut_points = [90, 140, 190]
labels = ["low", "medium", "high", "very high"]
dataset["LoanAmount_Bin"] = binning(dataset["LoanAmount"], cut_points, labels)

# Count the rows in each bin, keeping the bins in their defined order
print(dataset["LoanAmount_Bin"].value_counts(sort=False))

In the above code, we have passed 3 cut points, so 4 bins are created. By default, pd.cut includes the right edge of each bin, and include_lowest=True makes the first bin include the minimum value as well:
First bin contains all the values from the minimum value up to and including 90 (Label: low).
Second bin contains all the values greater than 90 and up to 140 (Label: medium).
Third bin contains all the values greater than 140 and up to 190 (Label: high).
Fourth bin contains all the values greater than 190 and up to the maximum value (Label: very high).
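
To check where the boundary values land, here is a minimal sketch on a few hand-picked numbers. The values in amounts are hypothetical and are not taken from the loan dataset; only pd.cut and its bins, labels and include_lowest parameters come from the code above.

```python
import pandas as pd

# Hypothetical sample values chosen to sit on and around the cut points
amounts = pd.Series([9, 90, 90.5, 140, 141, 190, 191, 700])

# Same edges as in the binning function: min, cut points, max
break_points = [amounts.min(), 90, 140, 190, amounts.max()]
labels = ["low", "medium", "high", "very high"]

binned = pd.cut(amounts, bins=break_points, labels=labels, include_lowest=True)

# Show each value next to the bin it was assigned to
result = pd.DataFrame({"amount": amounts, "bin": binned})
print(result)
```

Here 90 falls into "low" while 90.5 already falls into "medium", because each bin is closed on the right. The minimum value 9 is placed in "low" only because include_lowest=True is set.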

Instead of the "low", "medium", "high" and "very high" labels, you can pass numeric values such as 0, 1, 2 and 3.
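
As a rough sketch of the numeric option, the snippet below calls pd.cut directly on hypothetical values. It shows two variants: passing a list of numbers as labels, and passing labels=False, which makes pd.cut return plain integer bin codes instead of a categorical column.

```python
import pandas as pd

# Hypothetical sample values, one per bin
amounts = pd.Series([50, 120, 160, 300])
break_points = [amounts.min(), 90, 140, 190, amounts.max()]

# Variant 1: numeric labels, result is still a categorical column
numeric_labels = pd.cut(amounts, bins=break_points, labels=[0, 1, 2, 3],
                        include_lowest=True)

# Variant 2: labels=False returns the integer position of each bin
codes = pd.cut(amounts, bins=break_points, labels=False, include_lowest=True)

print(numeric_labels.tolist())
print(codes.tolist())
```

Integer codes are convenient when the binned column is fed straight into a model, while the categorical result keeps the bin order attached to the column.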

Now print the new variable dataset["LoanAmount_Bin"] and see the results. Instead of actual values, you will see labels in the data.
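
If the column contains missing values, pd.cut leaves them as missing in the binned column. The following sketch uses a small made-up DataFrame (the four LoanAmount values are assumptions for illustration) to print the original and binned columns side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing loan amount
df = pd.DataFrame({"LoanAmount": [66.0, 128.0, np.nan, 267.0]})

# min() and max() skip the missing value when building the edges
break_points = [df["LoanAmount"].min(), 90, 140, 190, df["LoanAmount"].max()]
labels = ["low", "medium", "high", "very high"]

df["LoanAmount_Bin"] = pd.cut(df["LoanAmount"], bins=break_points,
                              labels=labels, include_lowest=True)

# Original values next to their labels; the missing row stays NaN
print(df)

# dropna=False also counts the rows that could not be binned
print(df["LoanAmount_Bin"].value_counts(sort=False, dropna=False))
```

Whether to fill in missing amounts before binning or to keep them as a separate group depends on the analysis; this sketch only shows the default behaviour.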
