How to join two dataframes for which column values are within a certain range?
Solution 1
One simple solution is to create an IntervalIndex from start and end with closed='both', then use get_loc to look up the event for each timestamp (this assumes all the datetime columns are of datetime64 dtype):
df_2.index = pd.IntervalIndex.from_arrays(df_2['start'],df_2['end'],closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x : df_2.iloc[df_2.index.get_loc(x)]['event'])
Output:

             timestamp         A         B event
0  2016-05-14 10:54:33  0.020228  0.026572    E1
1  2016-05-14 10:54:34  0.057780  0.175499    E2
2  2016-05-14 10:54:35  0.098808  0.620986    E2
3  2016-05-14 10:54:36  0.158789  1.014819    E2
4  2016-05-14 10:54:39  0.038129  2.384590    E3
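Solution 1 can be run end to end on the sample data from the question; a minimal self-contained sketch:

```python
import pandas as pd

df_1 = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2016-05-14 10:54:33', '2016-05-14 10:54:34', '2016-05-14 10:54:35',
        '2016-05-14 10:54:36', '2016-05-14 10:54:39']),
    'A': [0.020228, 0.057780, 0.098808, 0.158789, 0.038129],
    'B': [0.026572, 0.175499, 0.620986, 1.014819, 2.384590],
})
df_2 = pd.DataFrame({
    'start': pd.to_datetime(['2016-05-14 10:54:31', '2016-05-14 10:54:34',
                             '2016-05-14 10:54:38']),
    'end':   pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:37',
                             '2016-05-14 10:54:42']),
    'event': ['E1', 'E2', 'E3'],
})

# Build an interval index over [start, end] and look up each timestamp in it;
# get_loc returns the integer position of the interval containing the scalar.
df_2.index = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_1['event'] = df_1['timestamp'].apply(
    lambda x: df_2.iloc[df_2.index.get_loc(x)]['event'])
print(df_1['event'].tolist())  # → ['E1', 'E2', 'E2', 'E2', 'E3']
```

Note that get_loc raises a KeyError for a timestamp that falls in no interval, so this approach assumes every timestamp is covered.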
Solution 2
First use IntervalIndex to create a reference index based on the interval of interest, then use get_indexer to slice the dataframe which contains the discrete events of interest.
idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
event = df_2.iloc[idx.get_indexer(df_1.timestamp)]['event']
event
0 E1
1 E2
1 E2
1 E2
2 E3
Name: event, dtype: object
df_1['event'] = event.to_numpy()
df_1
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
Reference: A question on IntervalIndex.get_indexer.
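One detail worth knowing about get_indexer: timestamps that fall in no interval come back as position -1, which, if fed straight into iloc, silently picks the last row. A small sketch of the guard (data adapted from the question; the None placeholder is my choice, not part of the original answer):

```python
import pandas as pd

df_2 = pd.DataFrame({
    'start': pd.to_datetime(['2016-05-14 10:54:31', '2016-05-14 10:54:34']),
    'end':   pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:37']),
    'event': ['E1', 'E2'],
})
idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')

# 10:54:39 lies outside every interval: get_indexer reports it as -1
ts = pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:39'])
pos = idx.get_indexer(ts)
print(list(pos))  # → [0, -1]

# Guard before indexing, otherwise iloc[-1] silently returns the last event
events = [df_2['event'].iloc[p] if p != -1 else None for p in pos]
print(events)  # → ['E1', None]
```

get_indexer also requires the intervals to be non-overlapping; with overlapping intervals it raises, and you would fall back to the get_loc approach from Solution 1.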
Solution 3
You can use the module pandasql:
import pandasql as ps
sqlcode = '''
select df_1.timestamp
      ,df_1.A
      ,df_1.B
      ,df_2.event
from df_1
inner join df_2
on df_1.timestamp between df_2.start and df_2.end
'''
newdf = ps.sqldf(sqlcode,locals())
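If pandasql is not available, the same BETWEEN join can be sketched in plain pandas with a cross merge plus a filter (fine for small frames; sample data abbreviated from the question):

```python
import pandas as pd

df_1 = pd.DataFrame({'timestamp': pd.to_datetime(
    ['2016-05-14 10:54:33', '2016-05-14 10:54:39'])})
df_2 = pd.DataFrame({
    'start': pd.to_datetime(['2016-05-14 10:54:31', '2016-05-14 10:54:38']),
    'end':   pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:42']),
    'event': ['E1', 'E3'],
})

# Cross join every timestamp with every interval, then keep the rows
# where timestamp falls between start and end (SQL's inclusive BETWEEN).
merged = df_1.merge(df_2, how='cross')
newdf = merged[merged['timestamp'].between(merged['start'], merged['end'])]
print(newdf['event'].tolist())  # → ['E1', 'E3']
```

The cross merge materializes len(df_1) * len(df_2) rows before filtering, so like the SQL version it does not scale to large frames.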
Solution 4
Option 1
idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_2.index = idx
df_1['event'] = df_2.loc[df_1.timestamp, 'event'].values
Option 2
df_2['timestamp'] = df_2['end']
pd.merge_asof(df_1, df_2[['timestamp', 'event']], on='timestamp', direction='forward', allow_exact_matches=True)
Out[405]:
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
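Option 2 works because direction='forward' matches each timestamp to the nearest interval end at or after it. A minimal runnable sketch on the sample data; note it only checks the end side, so a timestamp that precedes an interval's start would still pick up that interval's event:

```python
import pandas as pd

df_1 = pd.DataFrame({'timestamp': pd.to_datetime(
    ['2016-05-14 10:54:33', '2016-05-14 10:54:35', '2016-05-14 10:54:39'])})
df_2 = pd.DataFrame({
    'start': pd.to_datetime(['2016-05-14 10:54:31', '2016-05-14 10:54:34',
                             '2016-05-14 10:54:38']),
    'end':   pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:37',
                             '2016-05-14 10:54:42']),
    'event': ['E1', 'E2', 'E3'],
})

# Key both frames on a sorted timestamp column; for each left row,
# merge_asof picks the first right key >= the left key.
df_2['timestamp'] = df_2['end']
out = pd.merge_asof(df_1, df_2[['timestamp', 'event']], on='timestamp',
                    direction='forward', allow_exact_matches=True)
print(out['event'].tolist())  # → ['E1', 'E2', 'E3']
```

merge_asof requires both frames to be sorted on the key column, which the end timestamps here already are.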
Solution 5
In this method, we assume Timestamp objects are used.
df2:

                 start                  end event
0  2016-05-14 10:54:31  2016-05-14 10:54:33    E1
1  2016-05-14 10:54:34  2016-05-14 10:54:37    E2
2  2016-05-14 10:54:38  2016-05-14 10:54:42    E3
import numpy as np

event_num = len(df2.event)

def get_event(t):
    # the mask has exactly one True; dotting with arange gives its position
    event_idx = ((t >= df2.start) & (t <= df2.end)).dot(np.arange(event_num))
    return df2.event[event_idx]

df1["event"] = df1.timestamp.transform(get_event)
Explanation of get_event
For each timestamp in df1, say t0 = 2016-05-14 10:54:33, the mask (t0 >= df2.start) & (t0 <= df2.end) contains exactly one True (see Example 1). Taking its dot product with np.arange(event_num) then yields the index of the event that t0 belongs to.
Examples:
Example 1 (t0 = 2016-05-14 10:54:33):

   t0 >= df2.start   t0 <= df2.end   after &   np.arange(3)
0       True             True           T           0
1       False            True           F           1
2       False            True           F           2

event_idx = 0

Example 2, take t2 = 2016-05-14 10:54:35:

   t2 >= df2.start   t2 <= df2.end   after &   np.arange(3)
0       True             False          F           0
1       True             True           T           1
2       False            True           F           2

event_idx = 1
Finally, we use transform to map each timestamp to its event.
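The boolean/dot-product trick above, as a runnable sketch on the sample data (apply is used here in place of transform, since it is guaranteed to call get_event once per scalar timestamp and gives the same result):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'timestamp': pd.to_datetime([
    '2016-05-14 10:54:33', '2016-05-14 10:54:34', '2016-05-14 10:54:35',
    '2016-05-14 10:54:36', '2016-05-14 10:54:39'])})
df2 = pd.DataFrame({
    'start': pd.to_datetime(['2016-05-14 10:54:31', '2016-05-14 10:54:34',
                             '2016-05-14 10:54:38']),
    'end':   pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:37',
                             '2016-05-14 10:54:42']),
    'event': ['E1', 'E2', 'E3'],
})
event_num = len(df2.event)

def get_event(t):
    # exactly one interval contains t, so the boolean mask has one True;
    # dotting it with 0..n-1 extracts the position of that True
    event_idx = ((t >= df2.start) & (t <= df2.end)).dot(np.arange(event_num))
    return df2.event[event_idx]

df1['event'] = df1.timestamp.apply(get_event)
print(df1['event'].tolist())  # → ['E1', 'E2', 'E2', 'E2', 'E3']
```

If a timestamp matched no interval, the all-False mask would dot to 0 and wrongly return the first event, so this trick also assumes full coverage.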
DougKruger
Updated on June 12, 2022

Comments
-
DougKruger almost 2 years
Given two dataframes df_1 and df_2, how to join them such that the datetime column in df_1 is between start and end in dataframe df_2:

print df_1
             timestamp         A         B
0  2016-05-14 10:54:33  0.020228  0.026572
1  2016-05-14 10:54:34  0.057780  0.175499
2  2016-05-14 10:54:35  0.098808  0.620986
3  2016-05-14 10:54:36  0.158789  1.014819
4  2016-05-14 10:54:39  0.038129  2.384590

print df_2
                 start                  end event
0  2016-05-14 10:54:31  2016-05-14 10:54:33    E1
1  2016-05-14 10:54:34  2016-05-14 10:54:37    E2
2  2016-05-14 10:54:38  2016-05-14 10:54:42    E3

Get the corresponding event where df_1.timestamp is between df_2.start and df_2.end:

             timestamp         A         B event
0  2016-05-14 10:54:33  0.020228  0.026572    E1
1  2016-05-14 10:54:34  0.057780  0.175499    E2
2  2016-05-14 10:54:35  0.098808  0.620986    E2
3  2016-05-14 10:54:36  0.158789  1.014819    E2
4  2016-05-14 10:54:39  0.038129  2.384590    E3
-
Erick 3E about 5 years
I didn't know this was an option, thank you! It solved my problem.
-
PascalVKooten about 5 years
It's very slow.
-
TaL over 3 years
I know it's been a while since you answered the question, but maybe you can elaborate/explain the second line in the code? I'm having a similar problem and do not know how to adjust it to my code. Thank you.
-
Bharath over 3 years
@TaL, it's just mapping the data. df_2.index.get_loc(x) will return the index of time x based on the upper and lower bounds of the interval index; that index is then used to get the event from the table.
-
Joe Ferndz over 2 years
@Bharath, I know we are going back on an old post. Question: what if we have multiple values for event? Can I use nunique() to count the number of events? I am unable to adjust the code based on your input. Any recommendations?
-
Bharath over 2 years
@JoeFerndz it's been a while; you can post a new question on SO explaining your requirements. This is an old answer and there might be better approaches out there.
-
sammywemmy over 2 years
This works great if the intervals do not overlap; else you might have to revert to Bharath's solution.
-
rdmolony over 2 years
This thread demos the join using only pandas and sqlite.
-
Olsgaard about 2 years
As far as I can tell, this fails if some timestamps are outside of the intervals. While the supplied code works on the example data, I don't think it fully fulfils the question of how to join on a time range, as that question implies that the answer will work more similarly to how SQL joins using the between keyword.
-
Olsgaard about 2 years
I think this is a better answer than the currently accepted one. The code is shorter and it works even if some of the timestamps are not inside the time intervals. This method also works using the assign method, e.g. df_1.assign(events=df_2['event']).
-
Bharath about 2 years
@Olsgaard it's been a while since I answered here, mostly during the early stages of my career, and there has to be a better solution out there than this. Will certainly update this solution when I get time.