python pandas replacing strings in dataframe with numbers

Solution 1

What about DataFrame.replace?

In [9]: mapping = {'set': 1, 'test': 2}

In [10]: df.replace({'set': mapping, 'tesst': mapping})
Out[10]: 
   Unnamed: 0 respondent  brand engine  country  aware  aware_2  aware_3  age  \
0           0          a  volvo      p      swe      1        0        1   23   
1           1          b  volvo   None      swe      0        0        1   45   
2           2          c    bmw      p       us      0        0        1   56   
3           3          d    bmw      p       us      0        1        1   43   
4           4          e    bmw      d  germany      1        0        1   34   
5           5          f   audi      d  germany      1        0        1   59   
6           6          g  volvo      d      swe      1        0        0   65   
7           7          h   audi      d      swe      1        0        0   78   
8           8          i  volvo      d       us      1        1        1   32   

  tesst set  
0     2   1  
1     1   2  
2     2   1  
3     1   2  
4     2   1  
5     1   2  
6     2   1  
7     1   2  
8     2   1  

As @Jeff pointed out in the comments, in pandas versions < 0.11.1, manually tack .convert_objects() onto the end to properly convert tesst and set to int64 columns, in case that matters in subsequent operations.
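
For readers on current pandas, here is a minimal self-contained sketch of the same nested-dict replace. The toy data is illustrative only, and the explicit astype call is my assumption for how to get integer columns today, since as far as I know convert_objects was deprecated and later removed from pandas.

```python
import pandas as pd

# Toy frame with the two string columns from the question (values are illustrative)
df = pd.DataFrame({
    "age": [23, 45, 56],
    "tesst": ["set", "test", "test"],
    "set": ["set", "set", "test"],
})

mapping = {"set": 1, "test": 2}

# Nested dict: outer keys are column names, inner dict maps old value -> new value
replaced = df.replace({"tesst": mapping, "set": mapping})

# Cast explicitly rather than relying on replace() to infer integer dtypes
replaced = replaced.astype({"tesst": "int64", "set": "int64"})
```

replace returns a new DataFrame, so df itself still holds the original strings.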

Solution 2

I know this is old, but I'm adding this for anyone searching as I was. Assuming your pandas DataFrame is called df and has a source_ip column:

ip_addresses = df.source_ip.unique()
ip_dict = dict(zip(ip_addresses, range(len(ip_addresses))))

That gives you a dictionary mapping each unique IP address to an integer, without having to write the mapping out by hand.
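
A short sketch of how the generated dictionary might then be applied. The sample addresses and the source_ip_id column name are made up for illustration; pd.factorize is shown as an alternative that should produce the same first-appearance numbering in one call.

```python
import pandas as pd

# Hypothetical data; only the source_ip column name comes from the answer above
df = pd.DataFrame({"source_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"]})

ip_addresses = df.source_ip.unique()
ip_dict = dict(zip(ip_addresses, range(len(ip_addresses))))

# Apply the generated mapping to get one integer per row
df["source_ip_id"] = df.source_ip.map(ip_dict)

# pd.factorize numbers values in order of first appearance, like the dict above
codes, uniques = pd.factorize(df.source_ip)
```

Either way the numbers depend on the order in which values first appear, so they are not stable across different datasets.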

Solution 3

You can use the applymap DataFrame function to do this:

In [26]: df = DataFrame({"A": [1,2,3,4,5], "B": ['a','b','c','d','e'],
                         "C": ['b','a','c','c','d'], "D": ['a','c',7,9,2]})
In [27]: df
Out[27]:
   A  B  C  D
0  1  a  b  a
1  2  b  a  c
2  3  c  c  7
3  4  d  c  9
4  5  e  d  2

In [28]: mymap = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}

In [29]: df.applymap(lambda s: mymap.get(s) if s in mymap else s)
Out[29]:
   A  B  C  D
0  1  1  2  1
1  2  2  1  3
2  3  3  3  7
3  4  4  3  9
4  5  5  4  2
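
Two things are easy to miss here: the call returns a new DataFrame rather than changing df, and dict.get accepts a default that shortens the lambda. The sketch below swaps in apply with Series.map instead of applymap. That substitution is my own choice, made because recent pandas releases deprecate applymap in favor of DataFrame.map, and this form should run on both old and new versions.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "b", "c", "d", "e"],
                   "C": ["b", "a", "c", "c", "d"], "D": ["a", "c", 7, 9, 2]})
mymap = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}

# dict.get(s, s) returns the mapped number, or s itself when s is not a key.
# apply + Series.map stands in for applymap here (an assumption for portability).
df2 = df.apply(lambda col: col.map(lambda s: mymap.get(s, s)))
```

The result has to be assigned (df2 here); df keeps its original strings.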

Solution 4

The simplest way to replace any value in the dataframe:

df = df.replace(to_replace="set", value=1)
df = df.replace(to_replace="test", value=2)

Hope this will help.
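
The two calls can also be collapsed into one by passing lists, as a commenter suggests further down. A minimal sketch with made-up data; the replacement values are integers rather than the strings "1" and "2", since the goal is numeric columns.

```python
import pandas as pd

df = pd.DataFrame({"tesst": ["set", "test", "set"], "set": ["test", "set", "set"]})

# Lists are matched by position: 'set' -> 1, 'test' -> 2
df = df.replace(to_replace=["set", "test"], value=[1, 2])
```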

Solution 5

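To convert strings like 'volvo' and 'bmw' into numeric columns, load the data into a DataFrame and pass it to pandas.get_dummies(), which one-hot encodes each string column (one indicator column per distinct value):

  import pandas as pd

  # DataFrame.from_csv is gone from current pandas; it used the first column as the index
  df = pd.read_csv("myFile.csv", index_col=0)
  df_transform = pd.get_dummies(df)
  print(df_transform)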

A better alternative: pass a dictionary to map() on a single column (a pandas Series such as df.myCol), here the brand column:

df.brand = df.brand.map( {'volvo':0 , 'bmw':1, 'audi':2} )
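
To make the difference between the two approaches concrete, here is a small sketch on made-up brand data. get_dummies produces one indicator column per brand, while map produces a single integer column and yields NaN for any brand missing from the dictionary.

```python
import pandas as pd

df = pd.DataFrame({"brand": ["volvo", "bmw", "audi", "volvo"]})

# One-hot encoding: one indicator column per distinct brand
dummies = pd.get_dummies(df["brand"])

# Label mapping: a single integer column; brands missing from the dict become NaN
codes = df["brand"].map({"volvo": 0, "bmw": 1, "audi": 2})
```

The integers from map imply an ordering (audi > bmw > volvo) that the brands do not really have, which may matter if the column feeds a model.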

Author: jonas
Updated on July 09, 2022

Comments

  • jonas
    jonas almost 2 years

    Is there any way to use the mapping function or something better to replace values in an entire dataframe?

    I only know how to perform the mapping on series.

    I would like to replace the strings in the 'tesst' and 'set' columns with numbers, for example set = 1 and test = 2.

    Here is an example of my dataset (the original dataset is very large):

    ds_r
      respondent  brand engine  country  aware  aware_2  aware_3  age tesst   set
    0          a  volvo      p      swe      1        0        1   23   set   set
    1          b  volvo   None      swe      0        0        1   45   set   set
    2          c    bmw      p       us      0        0        1   56  test  test
    3          d    bmw      p       us      0        1        1   43  test  test
    4          e    bmw      d  germany      1        0        1   34   set   set
    5          f   audi      d  germany      1        0        1   59   set   set
    6          g  volvo      d      swe      1        0        0   65  test   set
    7          h   audi      d      swe      1        0        0   78  test   set
    8          i  volvo      d       us      1        1        1   32   set   set
    

    Final result should be

     ds_r
      respondent  brand engine  country  aware  aware_2  aware_3  age  tesst  set
    0          a  volvo      p      swe      1        0        1   23      1    1
    1          b  volvo   None      swe      0        0        1   45      1    1
    2          c    bmw      p       us      0        0        1   56      2    2
    3          d    bmw      p       us      0        1        1   43      2    2
    4          e    bmw      d  germany      1        0        1   34      1    1
    5          f   audi      d  germany      1        0        1   59      1    1
    6          g  volvo      d      swe      1        0        0   65      2    1
    7          h   audi      d      swe      1        0        0   78      2    1
    8          i  volvo      d       us      1        1        1   32      1    1
    
  • Jeff
    Jeff almost 11 years
    note that you might want to do a df.convert_objects() after the replacement to coerce to proper dtypes
  • Jeff
    Jeff almost 11 years
    @Dan Allan this will be default in 0.11.1, FYI (to convert_objects)
  • SRS
    SRS almost 9 years
    I am working on a problem like this and followed the exact steps in your answer, but I am not getting the expected output. Code: wc = pd.read_csv('PATH', usecols = ['Workclass'])
  • SRS
    SRS almost 9 years
    df = pd.DataFrame(wc)
    wcdict = {"?":0,"Federal-gov":1,"Local-gov":2,"Never-worked":3,"Private":4,"Self-emp-inc":5, "Self-emp-n-inc":6,"State-gov":7,"Without-pay":8}
    df.applymap(lambda s: wcdict.get(s) if s in wcdict else s)
    print(df)
  • bdiamante
    bdiamante almost 9 years
    df.applymap(lambda s: mymap.get(s) if s in mymap else s) does not modify df in place, so your print df statement will not reflect the results of the applymap. You need to assign the result, like df2 = df.applymap(lambda s: mymap.get(s) if s in mymap else s). print df2 will then reflect the changes.
  • SRS
    SRS almost 9 years
    That worked!! Thanks :) I have one more question: I need to work with pyspark rather than plain Python. Does the implementation of this logic differ in pyspark? When I created the data frame I gave it a file path [as shown in the comments above], but I would like to give it an RDD as input instead. I couldn't do that. Do you have any idea how?
  • bdiamante
    bdiamante almost 9 years
    Glad it worked. I'm really not sure... perhaps this might be a start?
  • SRS
    SRS almost 9 years
    Thanks for your help :)
  • HerrIvan
    HerrIvan over 5 years
    in general, what is this category type for?
  • tsando
    tsando over 5 years
    @HerrIvan there's plenty of documentation here pandas.pydata.org/pandas-docs/stable/categorical.html
  • Ishnark
    Ishnark about 5 years
    This is super old but you can also do this now: df.replace(to_replace=['set', 'test'], value=[1, 2])
  • H S Rathore
    H S Rathore about 4 years
    I think we shouldn't have to hardcode the names of the values; they should be picked up dynamically at run time and assigned numbers.