How to handle date variable in machine learning data pre-processing


Solution 1

Some random thoughts:

Dates are a good source for feature engineering; I don't think there is a single right way to use dates in a model. Business-user expertise would be great: are there observed trends that can be coded into the data?

Possible features include:

  • weekends vs weekdays
  • business hours and time of day
  • seasons
  • week of year number
  • month
  • year
  • beginning/end of month (pay days)
  • quarter
  • days to/from an action event (distance)
  • missing or incomplete data
  • etc.

All of this depends on the data set, and most of these won't apply.
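
A minimal pandas sketch of a few of these features, assuming a DataFrame df with a datetime64 column named ts (both names are just placeholders):

import pandas as pd

# Toy data; in practice ts would already be a parsed datetime column
df = pd.DataFrame({'ts': pd.to_datetime(['26-09-2017 15:29:32', '23-09-2015 00:00:00'],
                                         dayfirst=True)})

df['is_weekend']   = (df.ts.dt.dayofweek >= 5).astype(int)   # Saturday/Sunday
df['hour']         = df.ts.dt.hour
df['month']        = df.ts.dt.month
df['year']         = df.ts.dt.year
df['quarter']      = df.ts.dt.quarter
df['week_of_year'] = df.ts.dt.isocalendar().week.astype(int)
df['is_month_end'] = df.ts.dt.is_month_end.astype(int)       # rough pay-day proxy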

Some links:

http://appliedpredictivemodeling.com/blog/2015/7/28/feature-engineering-versus-feature-extraction

https://www.salford-systems.com/blog/dan-steinberg/using-dates-in-data-mining-models

http://trevorstephens.com/kaggle-titanic-tutorial/r-part-4-feature-engineering/

Solution 2

Cyclic Feature Encoding

Data that take values from a fixed set that repeats in a cycle are known as cyclic data. Time-related features are mainly cyclic in nature: for example, months of the year, days of the week, hours of the day, minutes of the hour, and so on. Every observation takes its value from that fixed set. We encounter such features in many ML problems, and handling them properly has been shown to help improve accuracy.

Implementation

import numpy as np

def encode(data, col, max_val):
    # Map the cyclic column onto a circle: add sine and cosine components
    data[col + '_sin'] = np.sin(2 * np.pi * data[col] / max_val)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col] / max_val)
    return data

# `data` is assumed to be a pandas DataFrame with a datetime64 column named 'datetime'
data['month'] = data.datetime.dt.month
data = encode(data, 'month', 12)

data['day'] = data.datetime.dt.day
data = encode(data, 'day', 31)
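
The same helper extends naturally to any other cycle; for example, an hour-of-day feature (assuming the same data frame as above):

data['hour'] = data.datetime.dt.hour   # 0-23
data = encode(data, 'hour', 24)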

The Logic

A common method for encoding cyclical data is to transform it into two dimensions using a sine and cosine transformation. Map each cyclical variable onto a circle so that the lowest value for that variable appears right next to the largest value. We then compute the x- and y-components of that point using the sine and cosine trigonometric functions.

For handling months, index them from 0 to 11 and place them evenly around a circle, so that month 11 (December) sits right next to month 0 (January).


We can do that using the following transformations: month_sin = sin(2π * month / 12) and month_cos = cos(2π * month / 12), with months indexed 0-11 as above.
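
As a quick numerical check (a small sketch using the 0-11 month index above), December and January end up next to each other on the circle even though their raw values, 11 and 0, are far apart:

import numpy as np

def to_circle(month, period=12):
    # Map a month index onto the unit circle and return its (sin, cos) point
    angle = 2 * np.pi * month / period
    return np.array([np.sin(angle), np.cos(angle)])

jan, jun, dec = to_circle(0), to_circle(6), to_circle(11)

print(np.linalg.norm(dec - jan))  # ~0.52: adjacent months are close
print(np.linalg.norm(jun - jan))  # 2.0: opposite months are far apart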

More on Feature Engineering Cyclic Features

Author: yppdgr

Updated on December 27, 2021

Comments

  • yppdgr over 2 years

    I have a data set that contains, among other variables, the time stamp of the transaction in the format 26-09-2017 15:29:32. I need to find possible correlations and predictions of the sales (let's say with logistic regression). My questions are:

    1. How should I handle the date format? Shall I convert it to one number (like Excel does automatically)? Shall I split it into more variables like day, month, year, hour, minutes, seconds? Any other suggestions?
    2. What if I would like to add a distinct week number per year? Shall I add a variable like 342017 (week 34 of year 2017)?
    3. Shall I do the same as in question 2 for the quarter of the year?
    #         Datetime               Gender        Purchase
    1    23/09/2015 00:00:00           0             1
    2    23/09/2015 01:00:00           1             0
    3    25/09/2015 02:00:00           1             0
    4    27/09/2015 03:00:00           1             1
    5    28/09/2015 04:00:00           0             0
    
  • Pleastry almost 4 years
    @Charles I am currently attempting the days from an action event. However, some of the entries don't have the action of interest, or that action is occurring for the first time. How do I represent this as a feature? Surely using a 0 implies that there are 0 days between the previous and the current action, so I can't use it in that way.
  • Ryan John almost 4 years
    This depends on the model you're building. Some models will accept NULL values, others won't. For a regression you may need a flag - see: stats.stackexchange.com/questions/299663/… (a small sketch of the flag idea appears after these comments).
  • Gautam J over 2 years
    This answer deserves more credit!
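
A minimal sketch of the flag idea from the comment above (the column name and fill value are just placeholders): when there is no prior action event, add an indicator column and fill the missing distance with a neutral value so the model can still consume it.

import numpy as np
import pandas as pd

# Hypothetical feature: days since the last action event, NaN when none exists yet
df = pd.DataFrame({'days_since_event': [3.0, np.nan, 10.0, np.nan]})

# Indicator flag marks rows with no prior event; the NaNs are then filled
df['no_prior_event'] = df['days_since_event'].isna().astype(int)
df['days_since_event'] = df['days_since_event'].fillna(0)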