JOIN two dataframes on common column in python

60,292

Solution 1

Use merge:

print (pd.merge(df1, df2, left_on='id', right_on='id1', how='left').drop('id1', axis=1))
   id name  count  price  rating
0   1    a     10  100.0     1.0
1   2    b     20  200.0     2.0
2   3    c     30  300.0     3.0
3   4    d     40    NaN     NaN
4   5    e     50  500.0     5.0

Another solution is simple rename column:

print (pd.merge(df1, df2.rename(columns={'id1':'id'}), on='id',  how='left'))
   id name  count  price  rating
0   1    a     10  100.0     1.0
1   2    b     20  200.0     2.0
2   3    c     30  300.0     3.0
3   4    d     40    NaN     NaN
4   5    e     50  500.0     5.0

If need only column price the simpliest is map:

df1['price'] = df1.id.map(df2.set_index('id1')['price'])
print (df1)
   id name  count  price
0   1    a     10  100.0
1   2    b     20  200.0
2   3    c     30  300.0
3   4    d     40    NaN
4   5    e     50  500.0

Another 2 solutions:

print (pd.merge(df1, df2, left_on='id', right_on='id1', how='left')
         .drop(['id1', 'rating'], axis=1))
   id name  count  price
0   1    a     10  100.0
1   2    b     20  200.0
2   3    c     30  300.0
3   4    d     40    NaN
4   5    e     50  500.0

print (pd.merge(df1, df2[['id1','price']], left_on='id', right_on='id1', how='left')
         .drop('id1', axis=1))
   id name  count  price
0   1    a     10  100.0
1   2    b     20  200.0
2   3    c     30  300.0
3   4    d     40    NaN
4   5    e     50  500.0

Solution 2

join utilizes the index to merge on unless we specify a column to use instead. However, we can only specify a column instead of the index for the 'left' dataframe.

Strategy:

  • set_index on df2 to be id1
  • use join with df as the left dataframe and id as the on parameter. Note that I could have set_index('id') on df to avoid having to use the on parameter. However, this allowed me leave the column in the dataframe rather than having to reset_index later.

df.join(df2.set_index('id1'), on='id')

   id name  count  price  rating
0   1    a     10  100.0     1.0
1   2    b     20  200.0     2.0
2   3    c     30  300.0     3.0
3   4    d     40    NaN     NaN
4   5    e     50  500.0     5.0

If you only want price from df2

df.join(df2.set_index('id1')[['price']], on='id')


   id name  count  price
0   1    a     10  100.0
1   2    b     20  200.0
2   3    c     30  300.0
3   4    d     40    NaN
4   5    e     50  500.0
Share:
60,292
Shubham R
Author by

Shubham R

Updated on July 25, 2022

Comments

  • Shubham R
    Shubham R almost 2 years

    I have a dataframe df:

    id   name   count
    1    a       10
    2    b       20
    3    c       30
    4    d       40
    5    e       50
    

    Here I have another dataframe df2:

    id1  price   rating
     1     100     1.0
     2     200     2.0
     3     300     3.0
     5     500     5.0
    

    I want to join these two dataframes on column id and id1(both refer same). Here is an example of df3:

    id   name   count   price   rating
    1    a       10      100      1.0
    2    b       20      200      2.0
    3    c       30      300      3.0
    4    d       40      Nan      Nan
    5    e       50      500      5.0
    

    Should I use df.merge or pd.concat?

  • Shubham R
    Shubham R over 7 years
    keeping this answer, if from df2 i have to select only 1 column 'price' then?
  • Shubham R
    Shubham R over 7 years
    keeping this answer, if from df2 i have to select only 1 column 'price' then?
  • jezrael
    jezrael over 7 years
    I dont understand, can you explain more?
  • Shubham R
    Shubham R over 7 years
    final table has id name count id1 price rating but i want only price from df2 not rating, then?
  • Shubham R
    Shubham R over 7 years
    the 2 ways that you suggested both are correct,right?
  • jezrael
    jezrael over 7 years
    Yes, all solutions are correct. If need add more column, nicer and better is join (not necessary delete column, left join by default), but if need add only one column map is faster.
  • Shubham R
    Shubham R over 7 years
    for two huge dataframe(say each df has around 4 million rows, should i use merge or map? which one would take less time to complete
  • jezrael
    jezrael over 7 years
    do you need add one or 2 columns?
  • Shubham R
    Shubham R over 7 years
    just a question, my requirements may change, sometimes one column sometimes 2 columns. which one would be best in both the cases
  • jezrael
    jezrael over 7 years
    hmmm, I try test it, but join or merge is more universal.
  • jezrael
    jezrael over 7 years
    I try test if join or merge is faster.