Best way to join / merge by range in pandas

Solution 1

Setup
Consider the dataframes A and B

import numpy as np
import pandas as pd

A = pd.DataFrame(dict(
        A_id=range(10),
        A_value=range(5, 105, 10)
    ))
B = pd.DataFrame(dict(
        B_id=range(5),
        B_low=[0, 30, 30, 46, 84],
        B_high=[10, 40, 50, 54, 84]
    ))

A

   A_id  A_value
0     0        5
1     1       15
2     2       25
3     3       35
4     4       45
5     5       55
6     6       65
7     7       75
8     8       85
9     9       95

B

   B_high  B_id  B_low
0      10     0      0
1      40     1     30
2      50     2     30
3      54     3     46
4      84     4     84

numpy
The "easiest" way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.

a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values

# Broadcast a (shape (10, 1)) against bl/bh (shape (5,)) to get a 10x5 mask;
# i, j are the row positions in A and B of every matching pair.
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(
    np.column_stack([A.values[i], B.values[j]]),
    columns=A.columns.append(B.columns)
)

   A_id  A_value  B_high  B_id  B_low
0     0        5      10     0      0
1     3       35      40     1     30
2     3       35      50     2     30
3     4       45      50     2     30
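
As noted in the comments below, building the final frame with pandas indexing instead of np.column_stack is faster, uses less memory, and keeps each column's dtype (column_stack funnels mixed dtypes through one object array). A minimal sketch, reusing the same i and j:

# i, j as computed above; .iloc preserves the original column dtypes
pd.concat(
    [A.iloc[i].reset_index(drop=True), B.iloc[j].reset_index(drop=True)],
    axis=1
)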

To address the comments and give something akin to a left join, I appended the part of A that doesn't match.

pd.concat([
    pd.DataFrame(
        np.column_stack([A.values[i], B.values[j]]),
        columns=A.columns.append(B.columns)
    ),
    A[~np.isin(np.arange(len(A)), np.unique(i))]
], ignore_index=True, sort=False)

    A_id  A_value  B_id  B_low  B_high
0      0        5   0.0    0.0    10.0
1      3       35   1.0   30.0    40.0
2      3       35   2.0   30.0    50.0
3      4       45   2.0   30.0    50.0
4      1       15   NaN    NaN     NaN
5      2       25   NaN    NaN     NaN
6      5       55   NaN    NaN     NaN
7      6       65   NaN    NaN     NaN
8      7       75   NaN    NaN     NaN
9      8       85   NaN    NaN     NaN
10     9       95   NaN    NaN     NaN
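
One caveat, echoed in the comments: the broadcast comparison materializes a len(A) x len(B) boolean matrix, so memory can blow up on large inputs. A possible workaround is to process A in chunks, so only a slice of the mask exists at any one time (a rough sketch; chunk_size is an arbitrary tuning knob):

chunk_size = 100_000  # arbitrary; tune to available memory
parts = []
for start in range(0, len(a), chunk_size):
    a_chunk = a[start:start + chunk_size]
    # the mask is only chunk_size x len(B) at a time
    ii, jj = np.where((a_chunk[:, None] >= bl) & (a_chunk[:, None] <= bh))
    parts.append((ii + start, jj))
i = np.concatenate([p[0] for p in parts])
j = np.concatenate([p[1] for p in parts])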

Solution 2

I'm not sure it is more efficient, but you can use SQL directly with pandas (via the sqlite3 module, for instance), inspired by this question:

import sqlite3

import numpy as np
import pandas as pd

conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 AND df1.col1 < 0.5"
tt = pd.read_sql_query(qry, conn)

You can adapt the query as needed for your application.
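
For instance, applied to the A and B frames from the setup above, the range join from the question could be written like this (a sketch; the table names are arbitrary):

conn = sqlite3.connect(":memory:")
A.to_sql("A", conn, index=False)
B.to_sql("B", conn, index=False)
result = pd.read_sql_query(
    "SELECT * FROM A JOIN B ON A.A_value BETWEEN B.B_low AND B.B_high",
    conn
)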

Solution 3

I don't know how efficient it is, but someone wrote a wrapper that lets you use SQL syntax with pandas objects, called pandasql. The documentation explicitly states that joins are supported. If nothing else, this should be easier to read, since SQL syntax is very readable.
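
As a rough sketch, assuming the A and B frames from the setup above (pandasql runs the query against an in-memory SQLite database by default):

# pip install pandasql
from pandasql import sqldf

result = sqldf("""
    SELECT *
    FROM A JOIN B
      ON A.A_value BETWEEN B.B_low AND B.B_high
""", locals())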

Solution 4

conditional_join from pyjanitor may be helpful for its abstraction/convenience:

# pip install pyjanitor
import pandas as pd
import janitor

inner join

A.conditional_join(B, 
                   ('A_value', 'B_low', '>='), 
                   ('A_value', 'B_high', '<=')
                  )

   A_id  A_value  B_id  B_low  B_high
0     0        5     0      0      10
1     3       35     1     30      40
2     3       35     2     30      50
3     4       45     2     30      50

left join

A.conditional_join(
       B, 
       ('A_value', 'B_low', '>='), 
       ('A_value', 'B_high', '<='), 
       how = 'left'
    )

    A_id  A_value  B_id  B_low  B_high
0      0        5   0.0    0.0    10.0
1      1       15   NaN    NaN     NaN
2      2       25   NaN    NaN     NaN
3      3       35   1.0   30.0    40.0
4      3       35   2.0   30.0    50.0
5      4       45   2.0   30.0    50.0
6      5       55   NaN    NaN     NaN
7      6       65   NaN    NaN     NaN
8      7       75   NaN    NaN     NaN
9      8       85   NaN    NaN     NaN
10     9       95   NaN    NaN     NaN

Comments

  • Dimgold
    Dimgold almost 2 years

    I'm frequently using pandas for merge (join) by using a range condition.

    For instance if there are 2 dataframes:

    A (A_id, A_value)

    B (B_id,B_low, B_high, B_name)

    which are big and approximately of the same size (let's say 2M records each).

    I would like to make an inner join between A and B, so A_value would be between B_low and B_high.

    Using SQL syntax that would be:

    SELECT *
    FROM A,B
    WHERE A_value between B_low and B_high
    

    and that would be really easy, short and efficient.

    Meanwhile in pandas the only way (that's not using loops that I found), is by creating a dummy column in both tables, join on it (equivalent to cross-join) and then filter out unneeded rows. That sounds heavy and complex:

    A['dummy'] = 1
    B['dummy'] = 1
    Temp = pd.merge(A, B, on='dummy')  # cross join
    Result = Temp[Temp.A_value.between(Temp.B_low, Temp.B_high)]
    

    Another solution I had is to apply, for each A value, a search function on B using a B[(x >= B.B_low) & (x <= B.B_high)] mask (a sketch follows the question below), but that sounds inefficient as well and might require index optimization.

    Is there a more elegant and/or efficient way to perform this action?
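
    For concreteness, a rough sketch of that masking approach (a plain loop, so likely slow at 2M rows):

    rows = []
    for _, a_row in A.iterrows():
        # scan B once per row of A
        hits = B[(a_row.A_value >= B.B_low) & (a_row.A_value <= B.B_high)]
        for _, b_row in hits.iterrows():
            rows.append({**a_row.to_dict(), **b_row.to_dict()})
    result = pd.DataFrame(rows)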

  • Dimgold
    Dimgold almost 7 years
    that's not a join, just stacking
  • joel.wilson
    joel.wilson about 6 years
    Amazing solution. Can we say that this is a cross join? If I wanted to keep only all rows of A (basically a left join on A), what change would I need to make?
  • joel.wilson
    joel.wilson about 6 years
    I wanted to reduce the bursting of rows that's happening, too. Any thoughts?
  • brandog
    brandog almost 6 years
    This is great. I have the same question as Joel: would it be possible to keep all the values of table A, like a left join?
  • yeye
    yeye almost 5 years
    @piRSquared how would you keep only rows where A_id == B_id? We can do it afterwards, but I don't think that's the most efficient. In my case, I have an original df of 79k rows; after your operation it goes to 2.3m rows, and when I keep only rows where A_id == B_id I get 74k rows, which is what I expect. Can't this be done all at once?
  • yeye
    yeye almost 5 years
    Found my way: a = A.A_value.values; aId = A.A_id.values; bId = B.B_id.values; bh = B.B_high.values; bl = B.B_low.values; i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh) & (aId[:, None] == bId))
  • Fenrir
    Fenrir over 4 years
    I'm trying to figure out how to join an aggregate of the values of A to B. Example: in the last dataframe shown, I would like to get the average of A_value for lines 2 and 3, so that only one record remains for B_id=2. Unfortunately my pandas/numpy knowledge is lacking here.
  • hjonez
    hjonez over 4 years
    The more accurate description of a SQL join returning ALL of A and the matching Bs (a_val BETWEEN b_low AND b_high), like in the second example, is a LEFT OUTER JOIN. A JOIN with neither LEFT (like the first example) nor OUTER specified defaults to INNER.
  • lollerskates
    lollerskates over 4 years
    Hello @piRSquared, in addition to joining on overlapping values, how can I add an extra condition that each pair of overlapping values must also have the same ID? Something like the SQL equivalent of JOIN A B ON A.value BETWEEN B.low AND B.high AND A.id = B.id.
  • xicocaio
    xicocaio over 3 years
    For very big datasets with many rows and columns, I got a memory overflow with this answer. =/
  • Quixotic22
    Quixotic22 over 2 years
    I get the following error: Cannot interpret '<PandasArray> [ 5, 15, 25, 35, 35, 35, 45, 45, 45, 55, 55, 55, 55, 65, 65, 65, 65, 75, 75, 75, 75] Length: 21, dtype: int64' as a data type. Any ideas?
  • sammywemmy
    sammywemmy over 2 years
    Hi @Quixotic22, a new version of pyjanitor was recently released; you should not have that error anymore. You can leave an issue on the GitHub page if you encounter any issues.
  • Joe
    Joe about 2 years
    @xicocaio I edited the answer to use pandas methods instead of numpy in the generation of the final DataFrame. This works much faster and consumes less memory, so maybe this will solve your overflow issue
  • xicocaio
    xicocaio about 2 years
    @joel.wilson, great, I will test it. Thanks!