Best way to join / merge by range in pandas
Solution 1
Setup
Consider the dataframes A
and B
A = pd.DataFrame(dict(
A_id=range(10),
A_value=range(5, 105, 10)
))
B = pd.DataFrame(dict(
B_id=range(5),
B_low=[0, 30, 30, 46, 84],
B_high=[10, 40, 50, 54, 84]
))
A
A_id A_value
0 0 5
1 1 15
2 2 25
3 3 35
4 4 45
5 5 55
6 6 65
7 7 75
8 8 85
9 9 95
B
B_high B_id B_low
0 10 0 0
1 40 1 30
2 50 2 30
3 54 3 46
4 84 4 84
numpy
The ✌easiest✌ way is to use numpy
broadcasting.
We look for every instance of A_value
being greater than or equal to B_low
while at the same time A_value
is less than or equal to B_high
.
a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.DataFrame(
np.column_stack([A.values[i], B.values[j]]),
columns=A.columns.append(B.columns)
)
A_id A_value B_high B_id B_low
0 0 5 10 0 0
1 3 35 40 1 30
2 3 35 50 2 30
3 4 45 50 2 30
To address the comments and give something akin to a left join, I appended the part of A
that doesn't match.
pd.DataFrame(
np.column_stack([A.values[i], B.values[j]]),
columns=A.columns.append(B.columns)
).append(
A[~np.in1d(np.arange(len(A)), np.unique(i))],
ignore_index=True, sort=False
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 3 35 1.0 30.0 40.0
2 3 35 2.0 30.0 50.0
3 4 45 2.0 30.0 50.0
4 1 15 NaN NaN NaN
5 2 25 NaN NaN NaN
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
Solution 2
Not sure that is more efficient, however you can use sql directly (from the module sqlite3 for instance) with pandas (inspired from this question) like:
conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 and df1.col1<0.5"
tt = pd.read_sql_query(qry,conn)
You can adapt the query as needed in your application
Solution 3
I don't know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects. That's called pandasql. The documentation explicitly states that joins are supported. This might be at least easier to read since SQL syntax is very readable.
Solution 4
conditional_join from pyjanitor may be helpful in the abstraction/convenience;:
# pip install pyjanitor
import pandas as pd
import janitor
inner join
A.conditional_join(B,
('A_value', 'B_low', '>='),
('A_value', 'B_high', '<=')
)
A_id A_value B_id B_low B_high
0 0 5 0 0 10
1 3 35 1 30 40
2 3 35 2 30 50
3 4 45 2 30 50
left join
A.conditional_join(
B,
('A_value', 'B_low', '>='),
('A_value', 'B_high', '<='),
how = 'left'
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 1 15 NaN NaN NaN
2 2 25 NaN NaN NaN
3 3 35 1.0 30.0 40.0
4 3 35 2.0 30.0 50.0
5 4 45 2.0 30.0 50.0
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
Dimgold
Data Science Researcher and RDBMS / Data Warehouse Lab TA
Updated on June 02, 2022Comments
-
Dimgold almost 2 years
I'm frequently using pandas for merge (join) by using a range condition.
For instance if there are 2 dataframes:
A (A_id, A_value)
B (B_id,B_low, B_high, B_name)
which are big and approximately of the same size (let's say 2M records each).
I would like to make an inner join between A and B, so A_value would be between B_low and B_high.
Using SQL syntax that would be:
SELECT * FROM A,B WHERE A_value between B_low and B_high
and that would be really easy, short and efficient.
Meanwhile in pandas the only way (that's not using loops that I found), is by creating a dummy column in both tables, join on it (equivalent to cross-join) and then filter out unneeded rows. That sounds heavy and complex:
A['dummy'] = 1 B['dummy'] = 1 Temp = pd.merge(A,B,on='dummy') Result = Temp[Temp.A_value.between(Temp.B_low,Temp.B_high)]
Another solution that I had is by applying on each of A value a search function on B by using
B[(x>=B.B_low) & (x<=B.B_high)]
mask, but it sounds inefficient as well and might require index optimization.Is there a more elegant and/or efficient way to perform this action?
-
Dimgold almost 7 yearsthat's not a join, just stacking
-
joel.wilson about 6 yearsamazing solution.. can we say that this is a cross join... If I wanted to keep all rows of
A
only(basically left join onA
) then what change would I need to make ? -
joel.wilson about 6 yearsI wanted to reduce the bursting of rows thats happening too. Any thoughts?
-
brandog almost 6 yearsThis is great, I was have the same question as Joel, would it be possible to keep all the values of table A, like a left join?
-
yeye almost 5 years@piRSquared how would you do to keep only rows where A_id == B_id ? We can do it afterwards but I don't think that's the most efficient. In my case, I have an original df of 79k rows, it goes after your operation to 2.3m rows, then when I keep only rows where A_id == B_id, I have 74k rows which is what I expect. Can't this be done all at once ?
-
yeye almost 5 yearsFound my way: a = A.A_value.values aId = A.A_id.values bId = B.B_id.values bh = B.B_high.values bl = B.B_low.values i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh) & (aId[:, None] == bId)
-
Fenrir over 4 yearsI'm trying to figure out how to join an aggregate of the values for A joined to B: Example: in the last dataframe shown I would like to get the average of A_value for line 2 and 3 so that only one record remains for B_id=2 unfortunately my pandas/numpy knowledge lacks here
-
hjonez over 4 yearsThe more accurate description of a SQL join with ALL of A and matching Bs (a_val BETWEEN b_low AND b_high), like in the second example is a LEFT OUTER JOIN. LEFT JOIN with neither INNER (like the first example) nor OUTER specified, defaults to INNER.
-
lollerskates over 4 yearshello @piRSquared, in addition to joining on overlapping values, how can I add an extra condition that each overlapping values must also have the same ID value. Something similar to the SQL equivalent of
join A B on A.value BETWEEN B.low and B.high AND A.id = B.id
-
xicocaio over 3 yearsTo very big datasets with may rows and columns, I got a memory overflow with this answer. =/
-
Quixotic22 over 2 yearsI get the following error:
Cannot interpret '<PandasArray> [ 5, 15, 25, 35, 35, 35, 45, 45, 45, 55, 55, 55, 55, 65, 65, 65, 65, 75, 75, 75, 75] Length: 21, dtype: int64' as a data type
Any ideas? -
sammywemmy over 2 yearsHi @Quixotic22, a new version of
pyjanitor
was recently released; you should not have that error anymore. You can leave an issue on the GitHub page if you encounter any issues. -
Joe about 2 years@xicocaio I edited the answer to use pandas methods instead of numpy in the generation of the final DataFrame. This works much faster and consumes less memory, so maybe this will solve your overflow issue
-
xicocaio about 2 years@joel.wilson, great I will test it. Thanks!