Most Machine Learning algorithms expect every feature in the dataset to be numeric before we can use them to make predictions. In the real world, however, datasets contain many other types of features (strings, categories, dates, etc.). Today, we will see how to deal with dates in a dataset and how to convert them into numbers.
When you load data into a pandas DataFrame (for example, from a CSV file), dates are loaded as strings by default. We need to convert them into numeric columns.
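Before deriving any numeric feature, it usually helps to turn the string column into a proper datetime column. The snippet below is a minimal sketch of that step; the CSV content and the column names (`customer_id`, `signup_date`) are made up for illustration and are not from any real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content; the column names are assumptions for this example.
csv_text = "customer_id,signup_date\n1,2021-03-05\n2,2021-11-20\n"

# read_csv does not parse the date column unless asked to, so it arrives as text.
df = pd.read_csv(io.StringIO(csv_text))
loaded_as_datetime = pd.api.types.is_datetime64_any_dtype(df["signup_date"])

# Convert the text column to datetime, stating the expected format explicitly.
df["signup_date"] = pd.to_datetime(df["signup_date"], format="%Y-%m-%d")
converted = pd.api.types.is_datetime64_any_dtype(df["signup_date"])
```

Alternatively, `pd.read_csv` accepts a `parse_dates` argument (for example `parse_dates=["signup_date"]`) that does the conversion while loading.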
The first approach is to split the date into multiple columns such as year, month, day, and hour.
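As a rough sketch of this splitting approach, the `.dt` accessor of a datetime column exposes each component as an integer. The `order_time` column and its values are hypothetical and only serve the example.

```python
import pandas as pd

# Hypothetical datetime column; the name and values are assumptions.
df = pd.DataFrame(
    {"order_time": pd.to_datetime(["2021-03-05 14:30:00", "2021-11-20 08:15:00"])}
)

# Each component becomes its own numeric column.
df["year"] = df["order_time"].dt.year
df["month"] = df["order_time"].dt.month
df["day"] = df["order_time"].dt.day
df["hour"] = df["order_time"].dt.hour

# Drop the original datetime column so that only numeric features remain.
features = df.drop(columns=["order_time"])
```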
The second approach is to convert dates into numbers based on the nature of the feature and domain knowledge.
Consider a scenario where the dataset has a date of birth column. We can't simply drop this column if it has a significant impact on the dependent variable. Instead, we can create a new feature from it: an age column, obtained by subtracting the date of birth from today's date. This gives us a numeric column.
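The age calculation can be sketched as follows. The helper `age_in_years` and the column name `date_of_birth` are hypothetical, and a fixed reference date is used here only so the result is reproducible; in practice `pd.Timestamp.today()` could stand in for it.

```python
import pandas as pd


def age_in_years(dob: pd.Series, reference_date: pd.Timestamp) -> pd.Series:
    """Completed years between each date of birth and reference_date."""
    years = reference_date.year - dob.dt.year
    # Subtract one year when the birthday has not yet occurred in the reference year.
    birthday_pending = (dob.dt.month > reference_date.month) | (
        (dob.dt.month == reference_date.month) & (dob.dt.day > reference_date.day)
    )
    return years - birthday_pending.astype(int)


# Hypothetical data; the column name and values are assumptions.
df = pd.DataFrame({"date_of_birth": pd.to_datetime(["1990-06-15", "2000-01-01"])})

# Fixed reference date (an assumption) instead of today's date, for reproducibility.
reference_date = pd.Timestamp("2024-03-01")
df["age"] = age_in_years(df["date_of_birth"], reference_date)
```

Counting completed years this way avoids the small errors that come from dividing a day count by 365, which ignores leap years.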
Consider another scenario where we have a dataset of credit card users. We have to find out which customers generally delay their credit card payments and which customers pay on or before the due date. We have two columns called "Payment Due Date" and "Payment Date". These two date features are crucial for our prediction, but we cannot use them as they are. So, we can create a new feature (say payment_on_time) by subtracting "Payment Due Date" from "Payment Date".
payment_on_time (in days) = Payment Date - Payment Due Date
The more positive the value of payment_on_time (in days), the longer the delay in the payment.
A value of zero means the customer paid exactly on the due date, and the more negative the value, the earlier the customer paid before the due date.
For example, suppose the Payment Due Date is 5th of March and the Payment Date is 2nd of March. The customer paid three days before the due date, so the value of "payment_on_time" will be -3 (2 - 5).
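The payment feature can be computed in pandas roughly as shown below. The column names follow the text; the year (2023) and the second and third rows are assumptions added for illustration, since the example above only gives the day and month of one payment.

```python
import pandas as pd

# First row mirrors the example above (due 5 March, paid 2 March).
# The year and the other two rows are assumptions for illustration.
df = pd.DataFrame(
    {
        "Payment Due Date": pd.to_datetime(["2023-03-05", "2023-03-05", "2023-03-05"]),
        "Payment Date": pd.to_datetime(["2023-03-02", "2023-03-05", "2023-03-09"]),
    }
)

# Subtracting two datetime columns gives a timedelta; .dt.days turns it into whole days.
df["payment_on_time"] = (df["Payment Date"] - df["Payment Due Date"]).dt.days
```

The three rows produce -3 (paid early), 0 (paid on the due date), and 4 (paid four days late).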
I found some useful articles on the web about handling dates using pandas:
Article 1, Article 2, Article 3