Use dictionary to replace a string within a string in Pandas columns

14,836

Solution 1

You can create dictionary and then replace:

ids = {'Id':['NYC','LA','UK'],
      'City':['New York City','Los Angeles','United Kingdom']}

ids = dict(zip(ids['Id'], ids['City']))
print (ids)
{'UK': 'United Kingdom', 'LA': 'Los Angeles', 'NYC': 'New York City'}

df['commentTest'] = df['Comment'].replace(ids, regex=True)
print (df)
  Categories                       Comment  Type  \
0     animal      The NYC tree is very big  tree   
1      plant  The cat from the UK is small   dog   
2     object     The rock was found in LA.  rock   

                                commentTest  
0        The New York City tree is very big  
1  The cat from the United Kingdom is small  
2        The rock was found in Los Angeles.  

Solution 2

It's actually much faster to use str.replace() than replace(), even though str.replace() requires a loop:

ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}

for old, new in ids.items():
    df['Comment'] = df['Comment'].str.replace(old, new, regex=False)

#   Categories  Type                                   Comment
# 0     animal  tree        The New York City tree is very big
# 1      plant   dog  The cat from the United Kingdom is small
# 2     object  rock         The rock was found in Los Angeles

The only time replace() outperforms a str.replace() loop is with small dataframes:

timings for str.replace vs replace

The timing functions for reference:

def Series_replace(df):
    df['Comment'] = df['Comment'].replace(ids, regex=True)
    return df

def Series_str_replace(df):
    for old, new in ids.items():
        df['Comment'] = df['Comment'].str.replace(old, new, regex=False)
    return df

Note that if ids is a dataframe instead of dictionary, you can get the same performance with itertuples():

ids = pd.DataFrame({'Id': ['NYC', 'LA', 'UK'], 'City': ['New York City', 'Los Angeles', 'United Kingdom']})

for row in ids.itertuples():
    df['Comment'] = df['Comment'].str.replace(row.Id, row.City, regex=False)
Share:
14,836
owwoow14
Author by

owwoow14

Updated on June 07, 2022

Comments

  • owwoow14
    owwoow14 almost 2 years

    I am trying to use a dictionary key to replace strings in a pandas column with its values. However, each column contains sentences. Therefore, I must first tokenize the sentences and detect whether a Word in the sentence corresponds with a key in my dictionary, then replace the string with the corresponding value.

    However, the result that I continue to get it none. Is there a better pythonic way to approach this problem?

    Here is my MVC for the moment. In the comments, I specified where the issue is happening.

    import pandas as pd
    
    data = {'Categories': ['animal','plant','object'],
        'Type': ['tree','dog','rock'],
            'Comment': ['The NYC tree is very big','The cat from the UK is small','The rock was found in LA.']
    }
    
    ids = {'Id':['NYC','LA','UK'],
          'City':['New York City','Los Angeles','United Kingdom']}
    
    
    df = pd.DataFrame(data)
    ids = pd.DataFrame(ids)
    
    def col2dict(ids):
        data = ids[['Id', 'City']]
        idDict = data.set_index('Id').to_dict()['City']
        return idDict
    
    def replaceIds(data,idDict):
        ids = idDict.keys()
        types = idDict.values()
        data['commentTest'] = data['Comment']
        words = data['commentTest'].apply(lambda x: x.split())
        for (i,word) in enumerate(words):
            #Here we can see that the words appear
            print word
            print ids
            if word in ids:
            #Here we can see that they are not being recognized. What happened?
                print ids
                print word
                words[i] = idDict[word]
                data['commentTest'] = ' '.apply(lambda x: ''.join(x))
        return data
    
    idDict = col2dict(ids)
    results = replaceIds(df, idDict)
    

    Results:

    None
    

    I am using python2.7 and when I am printing out the dict, there are u' of Unicode.

    My expected outcome is:

    Categories

    Comment

    Type

    commentTest

      Categories  Comment  Type commentTest
    0 animal  The NYC tree is very big tree The New York City tree is very big 
    1 plant The cat from the UK is small dog  The cat from the United Kingdom is small 
    2 object  The rock was found in LA. rock  The rock was found in Los Angeles.