How to handle date variable in machine learning data pre-processing
Solution 1
Some random thoughts:
Dates are a good source for feature engineering, and I don't think there is a single right way to use them in a model. Input from business users would be great: are there observed trends that can be coded into the data?
Possible suggestions of features include:
- weekends vs weekdays
- business hours and time of day
- seasons
- week of year number
- month
- year
- beginning/end of month (pay days)
- quarter
- days to/from an action event (distance)
- missing or incomplete data
- etc.
All of this depends on the data set, and most of these won't apply to any one problem.
Some links:
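As a rough illustration of the list above, here is a minimal pandas sketch that derives a few of these features. The function name `add_date_features`, the column name `Datetime`, and the 9:00-17:00 definition of business hours are assumptions made for the example, not something your data set dictates.

```python
import pandas as pd

def add_date_features(df, col="Datetime"):
    """Return a copy of df with a few calendar features derived from df[col]."""
    out = df.copy()
    ts = out[col]
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["quarter"] = ts.dt.quarter
    # ISO week number (1-53); requires pandas >= 1.1
    out["week_of_year"] = ts.dt.isocalendar().week.astype(int)
    out["hour"] = ts.dt.hour
    # dayofweek: Monday = 0 ... Sunday = 6
    out["is_weekend"] = ts.dt.dayofweek >= 5
    # Assumed business hours: 09:00 to 16:59 on weekdays
    out["is_business_hours"] = ts.dt.hour.between(9, 16) & ~out["is_weekend"]
    out["is_month_start"] = ts.dt.is_month_start
    out["is_month_end"] = ts.dt.is_month_end
    return out

# Timestamps in the day-first format quoted in the question
raw = pd.DataFrame({"Datetime": ["26-09-2017 15:29:32", "30-09-2017 23:05:00"]})
raw["Datetime"] = pd.to_datetime(raw["Datetime"], format="%d-%m-%Y %H:%M:%S")
features = add_date_features(raw)
```

Keeping year and week as separate columns avoids a combined code such as 342017, which a model would read as one large number with no useful ordering.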
http://appliedpredictivemodeling.com/blog/2015/7/28/feature-engineering-versus-feature-extraction
https://www.salford-systems.com/blog/dan-steinberg/using-dates-in-data-mining-models
http://trevorstephens.com/kaggle-titanic-tutorial/r-part-4-feature-engineering/
Solution 2
Cyclic Feature Encoding
Data whose values come from a fixed set that repeats in a cycle is known as cyclic data. Time-related features are mostly cyclic in nature: months of a year, days of a week, hours of a day, minutes of an hour, and so on. Every observation takes a value from this fixed set. We encounter such features in many ML problems, and handling them properly has been shown to help improve accuracy.
Implementation
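import numpy as np

def encode(data, col, max_val):
    # Place the column on a circle of period max_val and keep both coordinates.
    data[col + '_sin'] = np.sin(2 * np.pi * data[col] / max_val)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col] / max_val)
    return data

# Assumes data is a pandas DataFrame with a datetime64 column named 'datetime'.
data['month'] = data.datetime.dt.month
data = encode(data, 'month', 12)

data['day'] = data.datetime.dt.day
data = encode(data, 'day', 31)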
The Logic
A common method for encoding cyclical data is to transform the data into two dimensions using a sine and cosine transformation. We map each cyclical variable onto a circle such that the lowest value for that variable appears right next to the largest value, and compute the x- and y-components of that point using the sine and cosine functions.
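For months, we number them 0-11 and place them evenly around a circle, so that 11 sits right next to 0.
We can do that using the following transformations:
x_sin = sin(2 * pi * x / max_val)
x_cos = cos(2 * pi * x / max_val)
where max_val is the length of the cycle (12 for months), as in the encode function above.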
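One way to sanity-check this mapping is to encode the twelve months and compare distances between the resulting points. The sketch below is only an illustration; the helper names `encode_cyclic` and `distance` are made up for the example.

```python
import numpy as np

def encode_cyclic(values, period):
    """Return (sin, cos) coordinates of values placed on a circle of the given period."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

months = np.arange(12)  # 0 = January ... 11 = December
sin_m, cos_m = encode_cyclic(months, 12)
points = np.column_stack([sin_m, cos_m])

def distance(a, b):
    """Euclidean distance between the encoded points of months a and b."""
    return float(np.linalg.norm(points[a] - points[b]))

dec_to_jan = distance(11, 0)  # neighbours across the year boundary
jan_to_feb = distance(0, 1)   # ordinary neighbours
jan_to_jul = distance(0, 6)   # opposite sides of the circle
```

With the raw month number, December and January are 11 apart. After encoding, they are exactly as close as January and February, while January and July end up at opposite ends of a diameter.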
More on Feature Engineering Cyclic Features
yppdgr
Updated on December 27, 2021
Comments
-
yppdgr over 2 years
I have a data set that contains, among other variables, the timestamp of the transaction in the format 26-09-2017 15:29:32. I need to find possible correlations and make predictions of the sales (let's say with logistic regression). My questions are:
- How should I handle the date format? Shall I convert it to one number (like Excel does automatically)? Shall I split it into more variables like day, month, year, hour, minutes, seconds? Any other suggestions?
- What if I would like to add a distinct week number per year? Shall I add a variable like 342017 (week 34 of year 2017)?
- Shall I do the same as in question 2 for the quarter of the year?
#  Datetime             Gender  Purchase
1  23/09/2015 00:00:00  0       1
2  23/09/2015 01:00:00  1       0
3  25/09/2015 02:00:00  1       0
4  27/09/2015 03:00:00  1       1
5  28/09/2015 04:00:00  0       0
-
Pleastry almost 4 years
@Charles I am currently attempting the days from an action event. However, some of the entries don't have the action of interest, or that action is occurring for the first time. How do I represent this as a feature? Surely using a 0 implies that there are 0 days between the previous and the current action, so I can't use it in that way.
-
Ryan John almost 4 years
This depends on the model you're building. Some models will accept NULL values, others won't. For a regression you may need a flag - see: stats.stackexchange.com/questions/299663/…
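A minimal sketch of the flag idea, assuming a hypothetical column `days_since_action` where NaN means there was no previous action. The column names and the fill value of 0 are choices made for the example only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: days since the previous action; NaN means there was no
# previous action (first occurrence, or the action never happened).
df = pd.DataFrame({"days_since_action": [3.0, np.nan, 0.0, 12.0, np.nan]})

# The flag records whether a previous action exists at all.
df["has_prior_action"] = df["days_since_action"].notna().astype(int)

# With the flag in place, the fill value no longer has to carry the "missing"
# meaning; 0 is used here, so a true 0 is told apart by has_prior_action == 1.
df["days_since_action_filled"] = df["days_since_action"].fillna(0.0)
```

A regression can then fit a separate effect for the rows without a previous action instead of treating them as 0 days apart.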
-
Gautam J over 2 years
This answer deserves more credit!