Pandas merge dataframes with shared column, fillna in left with right

25,168

Solution 1

The problem confusing merge is that both dataframes have a 'b' column, but the left and right versions have NaNs in mismatched places. You want to avoid getting unwanted multiple 'b' columns 'b_x', 'b_y' from merge in the first place:

  • slice the non-shared columns 'a','e' from df1
  • do merge(df2, 'left'), this will pick up 'b' from the right dataframe (since it only exists in the right df)
  • finally do df1.update(...) , this will update the NaNs in the column 'b' taken from df2 with df1['b']

Solution:

df1.update(df1[['a', 'e']].merge(df2, 'left'))

df1

   a    b  e
0  1  0.0  a
1  2  1.0  1
2  3  0.0  2
3  4  1.0  b

Note: Because I used merge(..., how='left'), I preserve the row order of the calling dataframe. If my df1 had values of a that were not in order

   a    b  e
0  1  0.0  a
1  2  1.0  1
2  4  1.0  b
3  3  NaN  2

The result would be

df1.update(df1[['a', 'e']].merge(df2, 'left'))

df1

   a    b  e
0  1  0.0  a
1  2  1.0  1
2  4  1.0  b
3  3  0.0  2

Which is as expected.


Further...

If you want to be more explicit when there may be more columns involved

df1.update(df1.drop('b', 1).merge(df2, 'left', 'a'))

Even Further...

If you don't want to update the dataframe, we can use combine_first

Quick

df1.combine_first(df1[['a', 'e']].merge(df2, 'left'))

Explicit

df1.combine_first(df1.drop('b', 1).merge(df2, 'left', 'a'))

EVEN FURTHER!...

The 'left' merge may preserve order but NOT the index. This is the ultra conservative approach:

df3 = df1.drop('b', 1).merge(df2, 'left', on='a').set_index(df1.index)
df1.combine_first(df3)

Solution 2

Short version

df1.b.fillna(df1.a.map(df2.set_index('a').b),inplace=True)
df1
Out[173]: 
   a    b  e
0  1  0.0  a
1  2  1.0  1
2  3  0.0  2
3  4  1.0  b

Since you mentioned there will be multiple columns

df = df1.combine_first(df1[['a']].merge(df2, on='a', how='left'))
df
Out[184]: 
   a    b  e
0  1  0.0  a
1  2  1.0  1
2  3  0.0  2
3  4  1.0  b

Also we can pass to fillna with df

df1.fillna(df1[['a']].merge(df2, on='a', how='left'))
Out[185]: 
   a    b  e
0  1  0.0  a
1  2  1.0  1
2  3  0.0  2
3  4  1.0  b

Solution 3

Only if the indices are alligned (important note), we can use update:

df1['b'].update(df2['b'])


   a    b  e
0  1  0.0  a
1  2  1.0  1
2  3  0.0  2
3  4  1.0  b

Or simply fillna:

df1['b'].fillna(df2['b'], inplace=True)

If you're indices are not alligned, see WenNYoBen's answer or comment underneath.

Solution 4

You can mask the data.

original data:

print(df)
   one  two  three
0    1  1.0    1.0
1    2  NaN    2.0
2    3  3.0    NaN

print(df2)
   one  two  three
0    4    4      4
1    4    2      4
2    4    4      3

See below, mask just fills based on condition.

# mask values where isna()
df1[['two','three']] = df1[['two','three']]\
        .mask(df1[['two','three']].isna(),df2[['two','three']])

output:

   one  two  three
0    1  1.0    1.0
1    2  2.0    2.0
2    3  3.0    3.0
Share:
25,168
Kenan
Author by

Kenan

I started my professional career as a Geotechnical Engineer for 3 years, mostly playing with sand, gravel, clay and silt. After that I worked in the financial industry as a Python developer mostly using Pandas for data wrangling. Now I am working at Arlo on the smart cameras teams. Stack overflow has been an invaluable tool throughout my career as a programmer. I look forward to returning my thanks to the community in years to follow.

Updated on July 08, 2020

Comments

  • Kenan
    Kenan almost 4 years

    I am trying to merge two dataframes and replace the nan in the left df with the right df, I can do it with three lines of code as below, but I want to know if there is a better/shorter way?

    # Example data (my actual df is ~500k rows x 11 cols)
    df1 = pd.DataFrame({'a': [1,2,3,4], 'b': [0,1,np.nan, 1], 'e': ['a', 1, 2,'b']})
    df2 = pd.DataFrame({'a': [1,2,3,4], 'b': [np.nan, 1, 0, 1]})
    
    # Merge the dataframes...
    df = df1.merge(df2, on='a', how='left')
    
    # Fillna in 'b' column of left df with right df...
    df['b'] = df['b_x'].fillna(df['b_y'])
    
    # Drop the columns no longer needed
    df = df.drop(['b_x', 'b_y'], axis=1)
    
  • BENY
    BENY almost 5 years
    df1.set_index('a',inplace=True); df1.update(df2.set_index('a')); df1.reset_index()
  • piRSquared
    piRSquared almost 5 years
    Operating on the dataframe df1 without any inplace mutation... If you wanted to: df1.fillna({'b': df1.a.map(df2.set_index('a').b)})
  • Erfan
    Erfan almost 5 years
    If you look good, this can be variant on merge? Or better combine_first?
  • BENY
    BENY almost 5 years
    @piRSquared yep , already + for that , that is a great answer sir :-)
  • Kenan
    Kenan almost 5 years
    Note how my example has 3 columns, so you have to merge somehow
  • krewsayder
    krewsayder almost 5 years
    I edited to account for this. You can mask more than once column at a time. No merge required. Still requires the order to be the same.
  • Kenan
    Kenan almost 5 years
    update function doesn't work for me; however the combine_first did exactly what I wanted, thank you
  • smci
    smci almost 5 years
    kslookall: update works fine, please retry in a clean session with pandas 0.24, and confirm it works.
  • smci
    smci almost 5 years
    @piRSquared you really need to explain that update is filling the NAs, before the left-merge, so they don't cause unwanted multiple 'b' columns 'b_x', 'b_y'.
  • piRSquared
    piRSquared almost 5 years
    That’s fair. I’ll add explanation when I have a chance
  • smci
    smci almost 5 years
    I edited in some explanation. It's confusing that you present so many alternatives, please edit concise one-line explanations of when you need each one.
  • smci
    smci almost 5 years
    The comment "The 'left' merge may preserve order but NOT the index." seems irrelevant and confusing, clearly both dfs have a shared column a that can be used to merge(..., on='a'), so their indexes are irrelevant, and good solutions shouldn't rely on it. It's better to explain why something is a solution, rather than worry the user by discussing lots of possible problems.
  • piRSquared
    piRSquared almost 5 years
    I disagree with the last bit. People using merge might not be aware that it blows up the index and they might’ve cared about it. OP never states one way or the other but I implicitly decide for them with an answer that disregards the index. I’m not perfectly consistent but my answers tend to addressing the details of the data, index, and column elements of a dataframe