Best way to join / merge by range in pandas

Solution 1

Setup
Consider the dataframes A and B

import numpy as np
import pandas as pd

A = pd.DataFrame(dict(
        A_id=range(10),
        A_value=range(5, 105, 10)
    ))
B = pd.DataFrame(dict(
        B_id=range(5),
        B_low=[0, 30, 30, 46, 84],
        B_high=[10, 40, 50, 54, 84]
    ))

A

   A_id  A_value
0     0        5
1     1       15
2     2       25
3     3       35
4     4       45
5     5       55
6     6       65
7     7       75
8     8       85
9     9       95

B

   B_high  B_id  B_low
0      10     0      0
1      40     1     30
2      50     2     30
3      54     3     46
4      84     4     84

numpy
The "easiest" way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.

a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values

# Broadcast a (shape (10, 1)) against bl/bh (shape (5,)) to get a 10x5 mask;
# i, j are the row positions in A and B of every matching pair.
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(
    np.column_stack([A.values[i], B.values[j]]),
    columns=A.columns.append(B.columns)
)

   A_id  A_value  B_high  B_id  B_low
0     0        5      10     0      0
1     3       35      40     1     30
2     3       35      50     2     30
3     4       45      50     2     30
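
As noted in the comments below, building the final frame with pandas indexing instead of np.column_stack is faster, uses less memory, and keeps each column's dtype (column_stack funnels mixed dtypes through one object array). A minimal sketch, reusing the same i and j:

# i, j as computed above; .iloc preserves the original column dtypes
pd.concat(
    [A.iloc[i].reset_index(drop=True), B.iloc[j].reset_index(drop=True)],
    axis=1
)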

To address the comments and give something akin to a left join, I appended the part of A that doesn't match.

pd.concat([
    pd.DataFrame(
        np.column_stack([A.values[i], B.values[j]]),
        columns=A.columns.append(B.columns)
    ),
    A[~np.isin(np.arange(len(A)), np.unique(i))]
], ignore_index=True, sort=False)

    A_id  A_value  B_id  B_low  B_high
0      0        5   0.0    0.0    10.0
1      3       35   1.0   30.0    40.0
2      3       35   2.0   30.0    50.0
3      4       45   2.0   30.0    50.0
4      1       15   NaN    NaN     NaN
5      2       25   NaN    NaN     NaN
6      5       55   NaN    NaN     NaN
7      6       65   NaN    NaN     NaN
8      7       75   NaN    NaN     NaN
9      8       85   NaN    NaN     NaN
10     9       95   NaN    NaN     NaN
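
One caveat, echoed in the comments: the broadcast comparison materializes a len(A) x len(B) boolean matrix, so memory can blow up on large inputs. A possible workaround is to process A in chunks, so only a slice of the mask exists at any one time (a rough sketch; chunk_size is an arbitrary tuning knob):

chunk_size = 100_000  # arbitrary; tune to available memory
parts = []
for start in range(0, len(a), chunk_size):
    a_chunk = a[start:start + chunk_size]
    # the mask is only chunk_size x len(B) at a time
    ii, jj = np.where((a_chunk[:, None] >= bl) & (a_chunk[:, None] <= bh))
    parts.append((ii + start, jj))
i = np.concatenate([p[0] for p in parts])
j = np.concatenate([p[1] for p in parts])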

Solution 2

I'm not sure it is more efficient, but you can use SQL directly with pandas (via the sqlite3 module, for instance), inspired by this question:

import sqlite3

import numpy as np
import pandas as pd

conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 AND df1.col1 < 0.5"
tt = pd.read_sql_query(qry, conn)

You can adapt the query as needed for your application.
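
For instance, applied to the A and B frames from the setup above, the range join from the question could be written like this (a sketch; the table names are arbitrary):

conn = sqlite3.connect(":memory:")
A.to_sql("A", conn, index=False)
B.to_sql("B", conn, index=False)
result = pd.read_sql_query(
    "SELECT * FROM A JOIN B ON A.A_value BETWEEN B.B_low AND B.B_high",
    conn
)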

Solution 3

I don't know how efficient it is, but someone wrote a wrapper that lets you use SQL syntax with pandas objects, called pandasql. The documentation explicitly states that joins are supported. If nothing else, this should be easier to read, since SQL syntax is very readable.
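
As a rough sketch, assuming the A and B frames from the setup above (pandasql runs the query against an in-memory SQLite database by default):

# pip install pandasql
from pandasql import sqldf

result = sqldf("""
    SELECT *
    FROM A JOIN B
      ON A.A_value BETWEEN B.B_low AND B.B_high
""", locals())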

Solution 4

conditional_join from pyjanitor may be helpful for its abstraction/convenience:

# pip install pyjanitor
import pandas as pd
import janitor

inner join

A.conditional_join(B, 
                   ('A_value', 'B_low', '>='), 
                   ('A_value', 'B_high', '<=')
                  )

   A_id  A_value  B_id  B_low  B_high
0     0        5     0      0      10
1     3       35     1     30      40
2     3       35     2     30      50
3     4       45     2     30      50

left join

A.conditional_join(
       B, 
       ('A_value', 'B_low', '>='), 
       ('A_value', 'B_high', '<='), 
       how = 'left'
    )

    A_id  A_value  B_id  B_low  B_high
0      0        5   0.0    0.0    10.0
1      1       15   NaN    NaN     NaN
2      2       25   NaN    NaN     NaN
3      3       35   1.0   30.0    40.0
4      3       35   2.0   30.0    50.0
5      4       45   2.0   30.0    50.0
6      5       55   NaN    NaN     NaN
7      6       65   NaN    NaN     NaN
8      7       75   NaN    NaN     NaN
9      8       85   NaN    NaN     NaN
10     9       95   NaN    NaN     NaN

Comments

  • Dimgold
    Dimgold almost 2 years

    I'm frequently using pandas for merge (join) by using a range condition.

    For instance if there are 2 dataframes:

    A (A_id, A_value)

    B (B_id,B_low, B_high, B_name)

    which are big and approximately of the same size (let's say 2M records each).

    I would like to make an inner join between A and B, so A_value would be between B_low and B_high.

    Using SQL syntax that would be:

    SELECT *
    FROM A,B
    WHERE A_value between B_low and B_high
    

    and that would be really easy, short and efficient.

    Meanwhile in pandas the only way (that's not using loops that I found), is by creating a dummy column in both tables, join on it (equivalent to cross-join) and then filter out unneeded rows. That sounds heavy and complex:

    A['dummy'] = 1
    B['dummy'] = 1
    Temp = pd.merge(A, B, on='dummy')  # cross join
    Result = Temp[Temp.A_value.between(Temp.B_low, Temp.B_high)]
    

    Another solution I had is to apply, for each A value, a search function on B using a B[(x >= B.B_low) & (x <= B.B_high)] mask (a sketch follows the question below), but that sounds inefficient as well and might require index optimization.

    Is there a more elegant and/or efficient way to perform this action?
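
    For concreteness, a rough sketch of that masking approach (a plain loop, so likely slow at 2M rows):

    rows = []
    for _, a_row in A.iterrows():
        # scan B once per row of A
        hits = B[(a_row.A_value >= B.B_low) & (a_row.A_value <= B.B_high)]
        for _, b_row in hits.iterrows():
            rows.append({**a_row.to_dict(), **b_row.to_dict()})
    result = pd.DataFrame(rows)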

  • Dimgold
    Dimgold almost 7 years
    that's not a join, just stacking
  • joel.wilson
    joel.wilson about 6 years
    Amazing solution. Can we say that this is a cross join? If I wanted to keep only all rows of A (basically a left join on A), what change would I need to make?
  • joel.wilson
    joel.wilson about 6 years
    I wanted to reduce the bursting of rows that's happening, too. Any thoughts?
  • brandog
    brandog almost 6 years
    This is great. I have the same question as Joel: would it be possible to keep all the values of table A, like a left join?
  • yeye
    yeye almost 5 years
    @piRSquared how would you keep only rows where A_id == B_id? We can do it afterwards, but I don't think that's the most efficient. In my case, I have an original df of 79k rows; after your operation it goes to 2.3m rows, and when I keep only rows where A_id == B_id I get 74k rows, which is what I expect. Can't this be done all at once?
  • yeye
    yeye almost 5 years
    Found my way: a = A.A_value.values; aId = A.A_id.values; bId = B.B_id.values; bh = B.B_high.values; bl = B.B_low.values; i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh) & (aId[:, None] == bId))
  • Fenrir
    Fenrir over 4 years
    I'm trying to figure out how to join an aggregate of the values of A to B. Example: in the last dataframe shown, I would like to get the average of A_value for lines 2 and 3, so that only one record remains for B_id=2. Unfortunately my pandas/numpy knowledge is lacking here.
  • hjonez
    hjonez over 4 years
    The more accurate description of a SQL join returning ALL of A and the matching Bs (a_val BETWEEN b_low AND b_high), like in the second example, is a LEFT OUTER JOIN. A JOIN with neither LEFT (like the first example) nor OUTER specified defaults to INNER.
  • lollerskates
    lollerskates over 4 years
    Hello @piRSquared, in addition to joining on overlapping values, how can I add an extra condition that each pair of overlapping values must also have the same ID? Something like the SQL equivalent of JOIN A B ON A.value BETWEEN B.low AND B.high AND A.id = B.id.
  • xicocaio
    xicocaio over 3 years
    For very big datasets with many rows and columns, I got a memory overflow with this answer. =/
  • Quixotic22
    Quixotic22 over 2 years
    I get the following error: Cannot interpret '<PandasArray> [ 5, 15, 25, 35, 35, 35, 45, 45, 45, 55, 55, 55, 55, 65, 65, 65, 65, 75, 75, 75, 75] Length: 21, dtype: int64' as a data type. Any ideas?
  • sammywemmy
    sammywemmy over 2 years
    Hi @Quixotic22, a new version of pyjanitor was recently released; you should not have that error anymore. You can leave an issue on the GitHub page if you encounter any issues.
  • Joe
    Joe about 2 years
    @xicocaio I edited the answer to use pandas methods instead of numpy in the generation of the final DataFrame. This works much faster and consumes less memory, so maybe this will solve your overflow issue
  • xicocaio
    xicocaio about 2 years
    @joel.wilson, great, I will test it. Thanks!