Pandas: Subtracting two date columns and the result being an integer

160,692

Solution 1

How about:

df_test['Difference'] = (df_test['First_Date'] - df_test['Second Date']).dt.days

This will return difference as int if there are no missing values(NaT) and float if there is.

Pandas have a rich documentation on Time series / date functionality and Time deltas

Solution 2

You can divide column of dtype timedelta by np.timedelta64(1, 'D'), but output is not int, but float, because NaN values:

df_test['Difference'] = df_test['Difference'] / np.timedelta64(1, 'D')
print (df_test)
  First_Date Second Date  Difference
0 2016-02-09  2015-11-19        82.0
1 2016-01-06  2015-11-30        37.0
2        NaT  2015-12-04         NaN
3 2016-01-06  2015-12-08        29.0
4        NaT  2015-12-09         NaN
5 2016-01-07  2015-12-11        27.0
6        NaT  2015-12-12         NaN
7        NaT  2015-12-14         NaN
8 2016-01-06  2015-12-14        23.0
9        NaT  2015-12-15         NaN

Frequency conversion.

Solution 3

You can use datetime module to help here. Also, as a side note, a simple date subtraction should work as below:

import datetime as dt
import numpy as np
import pandas as pd

#Assume we have df_test:
In [222]: df_test
Out[222]: 
   first_date second_date
0  2016-01-31  2015-11-19
1  2016-02-29  2015-11-20
2  2016-03-31  2015-11-21
3  2016-04-30  2015-11-22
4  2016-05-31  2015-11-23
5  2016-06-30  2015-11-24
6         NaT  2015-11-25
7         NaT  2015-11-26
8  2016-01-31  2015-11-27
9         NaT  2015-11-28
10        NaT  2015-11-29
11        NaT  2015-11-30
12 2016-04-30  2015-12-01
13        NaT  2015-12-02
14        NaT  2015-12-03
15 2016-04-30  2015-12-04
16        NaT  2015-12-05
17        NaT  2015-12-06

In [223]: df_test['Difference'] = df_test['first_date'] - df_test['second_date'] 

In [224]: df_test
Out[224]: 
   first_date second_date  Difference
0  2016-01-31  2015-11-19     73 days
1  2016-02-29  2015-11-20    101 days
2  2016-03-31  2015-11-21    131 days
3  2016-04-30  2015-11-22    160 days
4  2016-05-31  2015-11-23    190 days
5  2016-06-30  2015-11-24    219 days
6         NaT  2015-11-25         NaT
7         NaT  2015-11-26         NaT
8  2016-01-31  2015-11-27     65 days
9         NaT  2015-11-28         NaT
10        NaT  2015-11-29         NaT
11        NaT  2015-11-30         NaT
12 2016-04-30  2015-12-01    151 days
13        NaT  2015-12-02         NaT
14        NaT  2015-12-03         NaT
15 2016-04-30  2015-12-04    148 days
16        NaT  2015-12-05         NaT
17        NaT  2015-12-06         NaT

Now, change type to datetime.timedelta, and then use the .days method on valid timedelta objects.

In [226]: df_test['Diffference'] = df_test['Difference'].astype(dt.timedelta).map(lambda x: np.nan if pd.isnull(x) else x.days)

In [227]: df_test
Out[227]: 
   first_date second_date  Difference  Diffference
0  2016-01-31  2015-11-19     73 days           73
1  2016-02-29  2015-11-20    101 days          101
2  2016-03-31  2015-11-21    131 days          131
3  2016-04-30  2015-11-22    160 days          160
4  2016-05-31  2015-11-23    190 days          190
5  2016-06-30  2015-11-24    219 days          219
6         NaT  2015-11-25         NaT          NaN
7         NaT  2015-11-26         NaT          NaN
8  2016-01-31  2015-11-27     65 days           65
9         NaT  2015-11-28         NaT          NaN
10        NaT  2015-11-29         NaT          NaN
11        NaT  2015-11-30         NaT          NaN
12 2016-04-30  2015-12-01    151 days          151
13        NaT  2015-12-02         NaT          NaN
14        NaT  2015-12-03         NaT          NaN
15 2016-04-30  2015-12-04    148 days          148
16        NaT  2015-12-05         NaT          NaN
17        NaT  2015-12-06         NaT          NaN

Hope that helps.

Solution 4

I feel that the overall answer does not handle if the dates 'wrap' around a year. This would be useful in understanding proximity to a date being accurate by day of year. In order to do these row operations, I did the following. (I had this used in a business setting in renewing customer subscriptions).

def get_date_difference(row, x, y):
    try:
        # Calcuating the smallest date difference between the start and the close date
        # There's some tricky logic in here to calculate for determining date difference
        # the other way around (Dec -> Jan is 1 month rather than 11)

        sub_start_date = int(row[x].strftime('%j')) # day of year (1-366)
        close_date = int(row[y].strftime('%j')) # day of year (1-366)

        later_date_of_year = max(sub_start_date, close_date) 
        earlier_date_of_year = min(sub_start_date, close_date)
        days_diff = later_date_of_year - earlier_date_of_year

# Calculates the difference going across the next year (December -> Jan)
        days_diff_reversed = (365 - later_date_of_year) + earlier_date_of_year
        return min(days_diff, days_diff_reversed)

    except ValueError:
        return None

Then the function could be:

dfAC_Renew['date_difference'] = dfAC_Renew.apply(get_date_difference, x = 'customer_since_date', y = 'renewal_date', axis = 1)
Share:
160,692
Kevin
Author by

Kevin

Updated on August 21, 2021

Comments

  • Kevin
    Kevin over 2 years

    I have two columns in a Pandas data frame that are dates.

    I am looking to subtract one column from another and the result being the difference in numbers of days as an integer.

    A peek at the data:

    df_test.head(10)
    Out[20]: 
      First_Date Second Date
    0 2016-02-09  2015-11-19
    1 2016-01-06  2015-11-30
    2        NaT  2015-12-04
    3 2016-01-06  2015-12-08
    4        NaT  2015-12-09
    5 2016-01-07  2015-12-11
    6        NaT  2015-12-12
    7        NaT  2015-12-14
    8 2016-01-06  2015-12-14
    9        NaT  2015-12-15
    

    I have created a new column successfully with the difference:

    df_test['Difference'] = df_test['First_Date'].sub(df_test['Second Date'], axis=0)
    df_test.head()         
    Out[22]: 
      First_Date Second Date  Difference
    0 2016-02-09  2015-11-19     82 days
    1 2016-01-06  2015-11-30     37 days
    2        NaT  2015-12-04         NaT
    3 2016-01-06  2015-12-08     29 days
    4        NaT  2015-12-09         NaT
    

    However I am unable to get a numeric version of the result:

    df_test['Difference'] = df_test[['Difference']].apply(pd.to_numeric)     
    
    df_test.head()
    Out[25]: 
      First_Date Second Date    Difference
    0 2016-02-09  2015-11-19  7.084800e+15
    1 2016-01-06  2015-11-30  3.196800e+15
    2        NaT  2015-12-04           NaN
    3 2016-01-06  2015-12-08  2.505600e+15
    4        NaT  2015-12-09           NaN
    
  • jezrael
    jezrael almost 8 years
    Yes, it is one possible solution, but I think it is not recommended approach, because output of column Diffference is object and next processing (adding, substraction...) is impossible.
  • clocker
    clocker almost 8 years
    @jesrael, there are other ways of doing, eg, your solution. However, adding/subtracting is not a problem with NaNs mixed in with int types in a column. They will be automatically be casted to float operations as needed.
  • Robert Beatty
    Robert Beatty almost 5 years
    Agree with @AllenWang. This is the superior answer.
  • user2256085
    user2256085 over 4 years
    @ Make that at least 3 suggesting this be the accepted answer
  • Nitro
    Nitro over 3 years
    This may have changed in recent versions. It works for me using .days now while .dt.days throws an error
  • Prayson W. Daniel
    Prayson W. Daniel over 3 years
    It appears that it depends on the resulting value. If they are a datetime series then .dt is required. Can you check the results of expression. Is it a DataFrame or a Series? I am still trying to figure out when dt is required
  • GenDemo
    GenDemo almost 3 years
    this seems to only work for days, not for weeks or years.