python pandas replacing strings in dataframe with numbers

python replace dataframe pandas

121,619

Solution 1

What about DataFrame.replace?

In [9]: mapping = {'set': 1, 'test': 2}

In [10]: df.replace({'set': mapping, 'tesst': mapping})
Out[10]: 
   Unnamed: 0 respondent  brand engine  country  aware  aware_2  aware_3  age  \
0           0          a  volvo      p      swe      1        0        1   23   
1           1          b  volvo   None      swe      0        0        1   45   
2           2          c    bmw      p       us      0        0        1   56   
3           3          d    bmw      p       us      0        1        1   43   
4           4          e    bmw      d  germany      1        0        1   34   
5           5          f   audi      d  germany      1        0        1   59   
6           6          g  volvo      d      swe      1        0        0   65   
7           7          h   audi      d      swe      1        0        0   78   
8           8          i  volvo      d       us      1        1        1   32   

  tesst set  
0     2   1  
1     1   2  
2     2   1  
3     1   2  
4     2   1  
5     1   2  
6     2   1  
7     1   2  
8     2   1

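As @Jeff pointed out in the comments, in pandas versions before 0.11.1 you need to append .convert_objects() to the call so that tesst and set become proper int64 columns, in case that matters in subsequent operations. (convert_objects was later deprecated and removed from pandas; .infer_objects() serves a similar purpose in recent versions.)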
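
A self-contained sketch of the same nested-dictionary replace follows. The three-row frame is made up for illustration (it is not the asker's full dataset), and infer_objects() is used as a present-day stand-in for the dtype conversion; whether that step changes anything depends on the pandas version.

```python
import pandas as pd

# Small made-up frame reusing the column names from the question.
# dtype=object is spelled out so the strings are stored the same way
# regardless of how the installed pandas version handles strings by default.
df = pd.DataFrame(
    {
        "brand": ["volvo", "bmw", "audi"],
        "tesst": ["set", "test", "set"],
        "set": ["set", "set", "test"],
    },
    dtype=object,
)

mapping = {"set": 1, "test": 2}

# Outer keys are column names, inner dicts map old value -> new value,
# so only the 'tesst' and 'set' columns are touched.
result = df.replace({"tesst": mapping, "set": mapping})

# Let pandas pick a better dtype for columns that now hold only integers.
result = result.infer_objects()

print(result)
```

The brand column is left alone because it is not named in the outer dictionary.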

Solution 2

I know this is old, but I'm adding this for those searching as I was. Assuming you already have a pandas dataframe named df with a source_ip column:

ip_addresses = df.source_ip.unique()
ip_dict = dict(zip(ip_addresses, range(len(ip_addresses))))

That gives you a dictionary mapping each unique IP address to an integer, without having to write it out by hand.
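
The dictionary still has to be applied to the column. A minimal sketch, assuming a made-up source_ip column (the sample addresses and the source_ip_id column name are hypothetical): Series.map performs the lookup, and pandas.factorize is shown as a built-in alternative that numbers values in order of first appearance.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the answer above.
df = pd.DataFrame({"source_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"]})

# Build the value -> integer dictionary from whatever values are present.
ip_addresses = df["source_ip"].unique()
ip_dict = dict(zip(ip_addresses, range(len(ip_addresses))))

# Look every row up in the dictionary and store the result in a new column.
df["source_ip_id"] = df["source_ip"].map(ip_dict)

# Alternative: factorize returns the integer codes and the unique values directly.
codes, uniques = pd.factorize(df["source_ip"])

print(df)
print(codes, list(uniques))
```

Because the mapping is derived from the data, no value names need to be hard-coded.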

Solution 3

You can use the DataFrame applymap function to do this (from pandas 2.1 onward the same method is available as DataFrame.map, and applymap is deprecated):

In [25]: from pandas import DataFrame

In [26]: df = DataFrame({"A": [1,2,3,4,5], "B": ['a','b','c','d','e'],
                         "C": ['b','a','c','c','d'], "D": ['a','c',7,9,2]})
In [27]: df
Out[27]:
   A  B  C  D
0  1  a  b  a
1  2  b  a  c
2  3  c  c  7
3  4  d  c  9
4  5  e  d  2

In [28]: mymap = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}

In [29]: df.applymap(lambda s: mymap.get(s) if s in mymap else s)
Out[29]:
   A  B  C  D
0  1  1  2  1
1  2  2  1  3
2  3  3  3  7
3  4  4  3  9
4  5  5  4  2
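
Two details are worth spelling out in a sketch. The call returns a new frame, so the result has to be assigned, and mymap.get(s, s) is a shorter equivalent of the conditional expression. The hasattr check is an assumption about the installed version: it uses DataFrame.map where that method exists and falls back to applymap otherwise.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "b", "c", "d", "e"],
                   "C": ["b", "a", "c", "c", "d"], "D": ["a", "c", 7, 9, 2]})
mymap = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}


def lookup(s):
    # Return the mapped number, or the original value if it has no mapping.
    return mymap.get(s, s)


# Element-wise application; the method name depends on the pandas version.
if hasattr(pd.DataFrame, "map"):
    df2 = df.map(lookup)
else:
    df2 = df.applymap(lookup)

print(df2)  # df itself is unchanged
```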

Solution 4

The simplest way to replace any value in the dataframe:

df = df.replace(to_replace="set", value=1)
df = df.replace(to_replace="test", value=2)

Hope this will help.
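
A hedged sketch of the same idea in a single call, using a made-up two-column frame: a flat dictionary (or a pair of equal-length lists) passed to replace is applied to every column. The replacement values are integers here; quoted values such as "1" would leave strings in the frame.

```python
import pandas as pd

# Made-up sample; dtype=object keeps string storage explicit across versions.
df = pd.DataFrame(
    {"tesst": ["set", "test", "set"], "set": ["test", "set", "set"]},
    dtype=object,
)

# Flat dict: old value -> new value, applied across the whole frame.
by_dict = df.replace({"set": 1, "test": 2})

# Equivalent list form: the i-th entry of to_replace becomes the i-th entry of value.
by_list = df.replace(to_replace=["set", "test"], value=[1, 2])

print(by_dict)
```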

Solution 5

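To convert strings like 'volvo' and 'bmw' into numeric columns, first load the data into a dataframe, then pass it to pandas.get_dummies(). Note that this produces one indicator column per distinct value rather than a single integer code per string.

  import pandas as pd

  # DataFrame.from_csv has been removed from pandas; read_csv with these arguments mirrors its defaults
  df = pd.read_csv("myFile.csv", index_col=0, parse_dates=True)
  df_transform = pd.get_dummies(df)
  print(df_transform)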

Better alternative: pass a dictionary to the map() method of a pandas Series (here the brand column, df['brand']):

df['brand'] = df['brand'].map({'volvo': 0, 'bmw': 1, 'audi': 2})
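
To make the difference between the two approaches concrete, here is a small sketch on a made-up brand column (the sample values, including the unmapped 'saab', are assumptions for illustration). get_dummies yields one indicator column per distinct value, while map yields a single column of codes and leaves NaN wherever a value has no entry in the dictionary.

```python
import pandas as pd

# Hypothetical sample; 'saab' is deliberately absent from the mapping below.
df = pd.DataFrame({"brand": ["volvo", "bmw", "audi", "saab"]})

# One indicator column per distinct brand.
dummies = pd.get_dummies(df["brand"])

# One column of integer codes; unmapped values become NaN.
codes = df["brand"].map({"volvo": 0, "bmw": 1, "audi": 2})

print(dummies)
print(codes)
```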

Author by

jonas

Updated on July 09, 2022

Comments

jonas almost 2 years

Is there any way to use the mapping function or something better to replace values in an entire dataframe?

I only know how to perform the mapping on series.

I would like to replace the strings in the 'tesst' and 'set' columns with numbers, for example set = 1 and test = 2.

Here is an example of my dataset (the original dataset is very large):

ds_r
  respondent  brand engine  country  aware  aware_2  aware_3  age tesst   set
0          a  volvo      p      swe      1        0        1   23   set   set
1          b  volvo   None      swe      0        0        1   45   set   set
2          c    bmw      p       us      0        0        1   56  test  test
3          d    bmw      p       us      0        1        1   43  test  test
4          e    bmw      d  germany      1        0        1   34   set   set
5          f   audi      d  germany      1        0        1   59   set   set
6          g  volvo      d      swe      1        0        0   65  test   set
7          h   audi      d      swe      1        0        0   78  test   set
8          i  volvo      d       us      1        1        1   32   set   set

Final result should be

 ds_r
  respondent  brand engine  country  aware  aware_2  aware_3  age  tesst  set
0          a  volvo      p      swe      1        0        1   23      1    1
1          b  volvo   None      swe      0        0        1   45      1    1
2          c    bmw      p       us      0        0        1   56      2    2
3          d    bmw      p       us      0        1        1   43      2    2
4          e    bmw      d  germany      1        0        1   34      1    1
5          f   audi      d  germany      1        0        1   59      1    1
6          g  volvo      d      swe      1        0        0   65      2    1
7          h   audi      d      swe      1        0        0   78      2    1
8          i  volvo      d       us      1        1        1   32      1    1

Jeff almost 11 years

note that you might want to do a df.convert_objects() after the replacement to coerce to proper dtypes

Jeff almost 11 years

@Dan Allan this will be default in 0.11.1, FYI (to convert_objects)

SRS almost 9 years

I am working on a problem like this and followed the exact steps mentioned in your answer, but I am not getting the expected output. Code: wc = pd.read_csv('PATH', usecols = ['Workclass'])

SRS almost 9 years

df = pd.DataFrame(wc)
wcdict = {"?":0,"Federal-gov":1,"Local-gov":2,"Never-worked":3,"Private":4,"Self-emp-inc":5, "Self-emp-n-inc":6,"State-gov":7,"Without-pay":8}
df.applymap(lambda s: wcdict.get(s) if s in wcdict else s)
print(df)

bdiamante almost 9 years

df.applymap(lambda s: mymap.get(s) if s in mymap else s) does not modify df in place, so your print(df) statement will not reflect the results of the applymap. You need to do an assignment like df2 = df.applymap(lambda s: mymap.get(s) if s in mymap else s). print(df2) will then reflect the changes.

SRS almost 9 years

That worked!! Thanks :) I have one more question: I need to work with pyspark rather than normal python. Does the implementation of this logic differ in pyspark? When I created the data frame, I gave the file path [as shown in the above comments], but I would like to give an RDD as the input to the data frame. I couldn't do that. Do you have any idea about this?

bdiamante almost 9 years

Glad it worked. I'm really not sure... perhaps this might be a start?

SRS almost 9 years

Thanks for your help :)

HerrIvan over 5 years

in general, what is this category type for?

tsando over 5 years

@HerrIvan there's plenty of documentation here pandas.pydata.org/pandas-docs/stable/categorical.html

Ishnark about 5 years

This is super old but you can also do this now: df.replace(to_replace=['set', 'test'], value=[1, 2])

H S Rathore about 4 years

I think we shouldn't have to hardcode the names of the values; they should be picked up dynamically at run time and assigned numbers.