Pandas: replace column values based on match from another column

10,236

Solution 1

Use map

All the logic you need:

def update_type(t1, t2, dropna=False):
    return t1.map(t2).dropna() if dropna else t1.map(t2).fillna(t1)

Let's make 'ItemType2' the index of Dataframe2

update_type(Dataframe1.ItemType1,
            Dataframe2.set_index('ItemType2').newType)

0                Tomato
1                Potato
2                Potato
3      greenCauliflower
4     yellowCauliflower
5                Squash
6                Onions
7                Onions
8                Onions
9         yellowCabbage
10         GreenCabbage
Name: ItemType1, dtype: object

update_type(Dataframe1.ItemType1,
            Dataframe2.set_index('ItemType2').newType,
            dropna=True)

0                Tomato
1                Potato
2                Potato
3      greenCauliflower
4     yellowCauliflower
5                Squash
6                Onions
7                Onions
8                Onions
Name: ItemType1, dtype: object

Verify

updated = update_type(Dataframe1.ItemType1, Dataframe2.set_index('ItemType2').newType)

pd.concat([Dataframe1, updated], axis=1, keys=['old', 'new'])

enter image description here


Timing

def root(Dataframe1, Dataframe2):
    return Dataframe1['ItemType1'].replace(Dataframe2.set_index('ItemType2')['newType'].dropna())

def piRSquared(Dataframe1, Dataframe2):
    t1 = Dataframe1.ItemType1
    t2 = Dataframe2.set_index('ItemType2').newType
    return update_type(t1, t2)

enter image description here

Solution 2

You can convert df2 into a Series indexed by 'ItemType2', and then use replace on df1:

# Make df2 a Series indexed by 'ItemType'.
df2 = df2.set_index('ItemType2')['newType'].dropna()

# Replace values in df1.
df1['ItemType1'] = df1['ItemType1'].replace(df2)

Or in a single line, if you don't want to alter df2:

df1['ItemType1'] = df1['ItemType1'].replace(df2.set_index('ItemType2')['newType'].dropna())

Solution 3

This method requires you set your column names to 'type', then you can set off using merge and np.where

df3 = df1.merge(df2,how='inner',on='type')['type','newType']

df3['newType'] = np.where(df['newType'].isnull(),df['type'],df['newType'])
Share:
10,236

Related videos on Youtube

Anil_M
Author by

Anil_M

I love to solve problems, especially challenging one's , Help others and learn something new along the way. I believe SO is a great platform to learn and share knowledge.

Updated on September 15, 2022

Comments

  • Anil_M
    Anil_M about 1 year

    I've a column in first data-frame df1["ItemType"] as below,

    Dataframe1

    ItemType1
    redTomato
    whitePotato
    yellowPotato
    greenCauliflower
    yellowCauliflower
    yelloSquash
    redOnions
    YellowOnions
    WhiteOnions
    yellowCabbage
    GreenCabbage
    

    I need to replace that based on a dictionary created from another data-frame.

    Dataframe2

    ItemType2          newType
    whitePotato        Potato
    yellowPotato       Potato
    redTomato          Tomato
    yellowCabbage   
    GreenCabbage    
    yellowCauliflower   yellowCauliflower
    greenCauliflower    greenCauliflower
    YellowOnions        Onions
    WhiteOnions         Onions
    yelloSquash         Squash
    redOnions           Onions
    

    Notice that,

    • In dataframe2 some of the ItemType are same as ItemType in dataframe1.
    • Some ItemType in dataframe2 have null values like yellowCabbage.
    • ItemType in dataframe2 are out of order with respect toItemType in dataframe

    I need to replace values in Dataframe1 ItemType column if there is a match for value in the corresponding Dataframe2 ItemType with newType keeping above exceptions listed in bullet-points in mind.
    If there is no match, then values needs to be as they are [ no change].

    So far I got is.

    import pandas as pd
    
    #read second `csv-file`
    df2 = pd.read_csv('mappings.csv',names = ["ItemType", "newType"])
    #conver to dict
    df2=df2.set_index('ItemType').T.to_dict('list')
    

    Below given replace on match are not working. They are inserting NaN values instead of actual. These are based on discussion here on SO.

    df1.loc[df1['ItemType'].isin(df2['ItemType'])]=df2[['NewType']]
    

    OR

    df1['ItemType']=df2['ItemType'].map(df2)
    

    Thanks in advance

    EDIT
    Two column headers in both data frames have different names. So dataframe1 column on is ItemType1 and first column in second data-frame is ItemType2. Missed that on first edit.

  • Anil_M
    Anil_M over 7 years
    Hi - Thanks for quick replay. I have done slight changes to question. Two column headers in both data frames have different names. So dataframe1 column on is ItemType1 and first column in second dataframe is ItemType2. Also , above solution is giving error as KeyError: 'type'
  • draco_alpine
    draco_alpine over 7 years
    the error on 'type' and the issues with ItemType1 and ItemType2 are one and the same. Specifically, I'm trying to join on 'type' when actually neither df has column 'type' instead they have 'ItemType1' and 'ItemType2'. Personally I would rename the columns to ItemType in both df's and proceed. But the other solutions offered may be more suitable to your specific needs.
  • mechanical_meat
    mechanical_meat over 7 years
    Hooray for time measurements... +1
  • Anil_M
    Anil_M over 7 years
    Hi- thx for response, still checking out. The solution also needs to omit/drop ItemType1 items for which corresponding values are empty , null in second dataframe newType columns. So here it should drop yellowCabbage and GreenCabbage such that they shouldnt appear in the final table.
  • piRSquared
    piRSquared over 7 years
    @Anil_M that's easy, I forcibly put that back in. Updated post with optional dropna paramter.
  • Anil_M
    Anil_M over 7 years
    @piRSquared - I am getting NameError: global name 'dropna' is not defined for return t1.map(t2).dropna() if dropna else t1.map(t2).fillna(t1)
  • piRSquared
    piRSquared over 7 years
    @Anil_M did you put the dropna=False in the signature of the function definition?
  • Anil_M
    Anil_M over 7 years
    Yes, that fixed it.
  • bernando_vialli
    bernando_vialli about 6 years
    2 layer question here. Firstly, when I am trying to run this now on 1 column, I get a MemoryError, what can be done about that. Second question, I am trying to use this in the work I am doing now but I need something more complicated. I want to apply the one column to match to a huge dataframe with a bunch of columns (about 100) and rows. How would I modify the code to achieve that?