How to join two dataframes for which column values are within a certain range?

Solution 1

One simple solution is to create an IntervalIndex from start and end with closed='both', then use get_loc to look up the event for each timestamp (this assumes all the datetime columns have a timestamp dtype):

df_2.index = pd.IntervalIndex.from_arrays(df_2['start'],df_2['end'],closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x : df_2.iloc[df_2.index.get_loc(x)]['event'])

Output :

            timestamp         A         B event
0 2016-05-14 10:54:33  0.020228  0.026572    E1
1 2016-05-14 10:54:34  0.057780  0.175499    E2
2 2016-05-14 10:54:35  0.098808  0.620986    E2
3 2016-05-14 10:54:36  0.158789  1.014819    E2
4 2016-05-14 10:54:39  0.038129  2.384590    E3
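
As a comment below points out, get_loc raises a KeyError when a timestamp is not inside any interval. The following is a minimal sketch of one way to guard against that, not part of the original answer: the helper name lookup_event and the extra out-of-range row (10:55:00) are made up for illustration, and it assumes the intervals do not overlap.

```python
import pandas as pd

# Sample data modelled on the question, plus one timestamp outside every interval.
df_1 = pd.DataFrame({"timestamp": pd.to_datetime([
    "2016-05-14 10:54:33", "2016-05-14 10:54:34", "2016-05-14 10:54:35",
    "2016-05-14 10:54:36", "2016-05-14 10:54:39", "2016-05-14 10:55:00",
])})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

intervals = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")

def lookup_event(ts):
    """Hypothetical helper: return the event whose interval contains ts, else None."""
    try:
        pos = intervals.get_loc(ts)  # integer position when intervals do not overlap
    except KeyError:
        return None  # ts is not covered by any interval
    return df_2["event"].iloc[pos]

df_1["event"] = df_1["timestamp"].apply(lookup_event)
print(df_1)
```

This is still a row-by-row apply, so it stays as slow as the original; it only changes what happens for unmatched timestamps.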

Solution 2

First use IntervalIndex to create a reference index based on the interval of interest, then use get_indexer to slice the dataframe which contains the discrete events of interest.

idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
event = df_2.loc[idx.get_indexer(df_1.timestamp), 'event']

event
0    E1
1    E2
1    E2
1    E2
2    E3
Name: event, dtype: object

df_1['event'] = event.to_numpy()
df_1
            timestamp         A         B event
0 2016-05-14 10:54:33  0.020228  0.026572    E1
1 2016-05-14 10:54:34  0.057780  0.175499    E2
2 2016-05-14 10:54:35  0.098808  0.620986    E2
3 2016-05-14 10:54:36  0.158789  1.014819    E2
4 2016-05-14 10:54:39  0.038129  2.384590    E3

Reference: A question on IntervalIndex.get_indexer.
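
One detail worth knowing: get_indexer returns -1 for values that fall in no interval, and using -1 directly as a position would silently pick the last row. Below is a minimal sketch of masking those rows, assuming non-overlapping intervals; the extra 10:55:00 row is invented for illustration.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({"timestamp": pd.to_datetime([
    "2016-05-14 10:54:33", "2016-05-14 10:54:34", "2016-05-14 10:54:35",
    "2016-05-14 10:54:36", "2016-05-14 10:54:39", "2016-05-14 10:55:00",
])})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

idx = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")

# Position of the containing interval for each timestamp; -1 means "no match".
pos = idx.get_indexer(df_1["timestamp"])

events = df_2["event"].to_numpy()
# events[pos] would wrap around for -1, so unmatched rows are overwritten with None.
df_1["event"] = np.where(pos >= 0, events[pos], None)
print(df_1)
```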

Solution 3

You can use the module pandasql

import pandasql as ps

sqlcode = '''
select df_1.timestamp
,df_1.A
,df_1.B
,df_2.event
from df_1 
inner join df_2 
on df_1.timestamp between df_2.start and df_2.end
'''

newdf = ps.sqldf(sqlcode,locals())
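
If installing pandasql is not an option, the same range join can probably be done with the standard-library sqlite3 module and pandas' own to_sql / read_sql_query. This is a sketch under a few assumptions of mine: timestamps are written as fixed-format strings so that BETWEEN compares them correctly, a left join is used (instead of the inner join above) so unmatched rows are kept, and the 10:55:00 row is invented.

```python
import sqlite3
import pandas as pd

df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-05-14 10:54:33", "2016-05-14 10:54:34", "2016-05-14 10:54:35",
        "2016-05-14 10:54:36", "2016-05-14 10:54:39", "2016-05-14 10:55:00",
    ]),
    "A": [0.020228, 0.057780, 0.098808, 0.158789, 0.038129, 0.5],
})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

fmt = "%Y-%m-%d %H:%M:%S"  # fixed-width format, so string order equals time order
con = sqlite3.connect(":memory:")
df_1.assign(timestamp=df_1["timestamp"].dt.strftime(fmt)).to_sql("df_1", con, index=False)
df_2.assign(start=df_2["start"].dt.strftime(fmt),
            end=df_2["end"].dt.strftime(fmt)).to_sql("df_2", con, index=False)

# "end" is quoted because END is an SQL keyword.
query = '''
select df_1.timestamp, df_1.A, df_2.event
from df_1
left join df_2
  on df_1.timestamp between df_2.start and df_2."end"
order by df_1.timestamp
'''
newdf = pd.read_sql_query(query, con, parse_dates=["timestamp"])
con.close()
print(newdf)
```

Unlike the IntervalIndex approaches, a SQL join also copes with overlapping intervals: a timestamp inside two intervals simply produces two output rows.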

Solution 4

Option 1

idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_2.index=idx
df_1['event']=df_2.loc[df_1.timestamp,'event'].values

Option 2

df_2['timestamp']=df_2['end']
pd.merge_asof(df_1,df_2[['timestamp','event']],on='timestamp',direction ='forward',allow_exact_matches =True)
Out[405]: 
            timestamp         A         B event
0 2016-05-14 10:54:33  0.020228  0.026572    E1
1 2016-05-14 10:54:34  0.057780  0.175499    E2
2 2016-05-14 10:54:35  0.098808  0.620986    E2
3 2016-05-14 10:54:36  0.158789  1.014819    E2
4 2016-05-14 10:54:39  0.038129  2.384590    E3
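
Note that Option 2 only compares each timestamp against end, so a timestamp that lies before an interval's start (in a gap, or before the first interval) still picks up the next event. A possible fix, sketched here with an invented 10:54:30 row, is to carry start through the merge and blank out rows where the timestamp precedes it. It assumes non-overlapping intervals.

```python
import pandas as pd

df_1 = pd.DataFrame({"timestamp": pd.to_datetime([
    "2016-05-14 10:54:30",  # before the first interval starts
    "2016-05-14 10:54:33", "2016-05-14 10:54:34", "2016-05-14 10:54:39",
])})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

# merge_asof needs both sides sorted on their join keys.
merged = pd.merge_asof(
    df_1.sort_values("timestamp"),
    df_2.sort_values("end"),
    left_on="timestamp",
    right_on="end",
    direction="forward",  # nearest interval whose end is >= timestamp
)

# Keep the event only if the timestamp is also at or after that interval's start.
merged["event"] = merged["event"].where(merged["timestamp"] >= merged["start"])
result = merged.drop(columns=["start", "end"])
print(result)
```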

Solution 5

This method assumes the datetime columns hold Timestamp objects.

df2  start                end                  event    
   0 2016-05-14 10:54:31  2016-05-14 10:54:33  E1
   1 2016-05-14 10:54:34  2016-05-14 10:54:37  E2
   2 2016-05-14 10:54:38  2016-05-14 10:54:42  E3

import numpy as np

event_num = len(df2.event)

def get_event(t):    
    event_idx = ((t >= df2.start) & (t <= df2.end)).dot(np.arange(event_num))
    return df2.event[event_idx]

df1["event"] = df1.timestamp.transform(get_event)

Explanation of get_event

For each timestamp in df1, say t0 = 2016-05-14 10:54:33,

(t0 >= df2.start) & (t0 <= df2.end) will contain exactly one True, provided the intervals do not overlap and t0 falls inside one of them (see Example 1). Taking the dot product with np.arange(event_num) then gives the index of the event that t0 belongs to.

Examples:

Example 1

    t0 >= df2.start    t0 <= df2.end     After &     np.arange(3)    
0     True                True         ->  T              0        event_idx
1    False                True         ->  F              1     ->     0
2    False                True         ->  F              2

Take t2 = 2016-05-14 10:54:35 for another example

    t2 >= df2.start    t2 <= df2.end     After &     np.arange(3)    
0     True                False        ->  F              0        event_idx
1     True                True         ->  T              1     ->     1
2    False                True         ->  F              2

We finally use transform to transform each timestamp into an event.
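
The same comparison can also be done for all timestamps at once with NumPy broadcasting, which avoids calling get_event per row. This is only a sketch of that variation, not the answer's own code. It also handles a case where, as far as I can tell, the dot-product version returns index 0 (and therefore E1): a timestamp that matches no interval. The 10:55:00 row is invented to show this.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"timestamp": pd.to_datetime([
    "2016-05-14 10:54:33", "2016-05-14 10:54:34", "2016-05-14 10:54:35",
    "2016-05-14 10:54:36", "2016-05-14 10:54:39", "2016-05-14 10:55:00",
])})
df2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

# Column vector of timestamps against row vectors of bounds:
# inside[i, j] is True when timestamp i lies within interval j.
t = df1["timestamp"].to_numpy()[:, None]
inside = (t >= df2["start"].to_numpy()) & (t <= df2["end"].to_numpy())

event_idx = inside.argmax(axis=1)  # first matching interval (0 if none match)
has_match = inside.any(axis=1)     # distinguishes a real match from "none"

df1["event"] = np.where(has_match, df2["event"].to_numpy()[event_idx], None)
print(df1)
```

The boolean matrix has len(df1) x len(df2) entries, so this trades memory for speed and is best suited to a modest number of intervals.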

Author: DougKruger

Updated on June 12, 2022

Comments

  • DougKruger
    DougKruger almost 2 years

    Given two dataframes df_1 and df_2, how to join them such that datetime column df_1 is in between start and end in dataframe df_2:

    print(df_1)
    
      timestamp              A          B
    0 2016-05-14 10:54:33    0.020228   0.026572
    1 2016-05-14 10:54:34    0.057780   0.175499
    2 2016-05-14 10:54:35    0.098808   0.620986
    3 2016-05-14 10:54:36    0.158789   1.014819
    4 2016-05-14 10:54:39    0.038129   2.384590
    
    
    print(df_2)
    
      start                end                  event    
    0 2016-05-14 10:54:31  2016-05-14 10:54:33  E1
    1 2016-05-14 10:54:34  2016-05-14 10:54:37  E2
    2 2016-05-14 10:54:38  2016-05-14 10:54:42  E3
    

    Get the corresponding event where df_1.timestamp is between df_2.start and df_2.end:

      timestamp              A          B          event
    0 2016-05-14 10:54:33    0.020228   0.026572   E1
    1 2016-05-14 10:54:34    0.057780   0.175499   E2
    2 2016-05-14 10:54:35    0.098808   0.620986   E2
    3 2016-05-14 10:54:36    0.158789   1.014819   E2
    4 2016-05-14 10:54:39    0.038129   2.384590   E3
    
  • Erick 3E
    Erick 3E about 5 years
    I didn't know this was an option, thank you! It solved my problem
  • PascalVKooten
    PascalVKooten about 5 years
    It's very slow.
  • TaL
    TaL over 3 years
    I know it's been a while since you answered the question but maybe you can elaborate \ explain the second line in the code? I'm having a similar problem and do not know how to adjust it to my code. Thank you
  • Bharath
    Bharath over 3 years
    @TaL, it's just mapping the data. df_2.index.get_loc(x) returns the index of time x based on the upper and lower bounds of the interval index; that index is then used to get the event from the table.
  • Joe Ferndz
    Joe Ferndz over 2 years
    @Bharath, I know we are going back on an old post. Question: what if we have multiple values for event. Can I use nunique() to count the number of events? I am unable to adjust the code based on your input. Any recommendations?
  • Bharath
    Bharath over 2 years
    @JoeFerndz it's been a while, you can post a new question in SO explaining your requirements, this is an old answer there might be better approaches out there.
  • sammywemmy
    sammywemmy over 2 years
    this works great, if the intervals do not overlap, else you might have to revert to Bharath's solution
  • rdmolony
    rdmolony over 2 years
    this thread demos the join using only pandas and sqlite
  • Olsgaard
    Olsgaard about 2 years
    As far as I can tell, this fails if some events are outside of the intervals. While the supplied code works on the example data, I don't think it fully answers the question of how to join on a time range, as that question implies the answer will work more like a SQL join using the between keyword.
  • Olsgaard
    Olsgaard about 2 years
    I think this is a better answer than the current accepted. The code is shorter and it works even if some of the timestamps are not inside the timeintervals. This method also works using the assign-method, e.g. df_1.assign(events = df_2['event'])
  • Bharath
    Bharath about 2 years
    @Olsgaard it's been a while since I answered here, mostly during the early stages of my career, and there has to be a better solution out there than this. Will certainly update this solution when I get time.