Iterating over a column and replacing a value with an extracted string [Pandas]

Solution 1

The easiest and fastest way is to use the pandas .str.get() method and put the result in another column.

Solution #1: This first solution works if your values in B are integers, e.g. [1234, 3456, [1357, 2468], 8901].

df['C'] = df['B'].str.get(0).astype(float)      # first list element; NaN for plain integers
df['C'] = df['C'].fillna(df['B']).astype(int)   # fill the NaN rows from B, then cast to int

Output:

    A             B     C
0  aa          1234  1234
1  ab          3456  3456
2  bc  [1357, 2468]  1357
3  cc          8901  8901

Then, you can delete column B if you don't need it.
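For anyone who wants to try this end to end, here is a minimal self-contained sketch. The sample frame is an assumption reconstructed from the question (integers in B, with one row holding a list); it follows the same .str.get(0) plus fillna() idea as above.

```python
import pandas as pd

# Sample data assumed from the question: integers, with one row holding a list.
df = pd.DataFrame({
    "A": ["aa", "ab", "bc", "cc"],
    "B": [1234, 3456, [1357, 2468], 8901],
})

# .str.get(0) returns the first element of list cells and NaN for plain integers.
first = df["B"].str.get(0)

# Fill the NaN rows from the original column, then cast everything to int.
df["C"] = first.fillna(df["B"]).astype(int)

print(df)
```

Column C should then hold plain integers in every row, so B can be dropped with df.drop(columns="B") if it is no longer needed.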

Solution #2: This solution works if your values in B are strings, e.g. ['1234', '3456', ['1357', '2468'], '8901'].

import re
df['digits'] = df['B'].apply(lambda x: re.findall(r'\d+', str(x)))  # raw string avoids an invalid-escape warning
df['digits'] = df['digits'].str.get(0)
print(df)

Output:

   A             B    digits
0  aa          1234   1234
1  ab          3456   3456
2  bc  [1357, 2468]   1357
3  cc          8901   8901

Again, you can delete column B if you don't need it.
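A possible variation on the same idea, swapping re.findall() for the pandas .str.extract() method so everything stays vectorized. The sample frame below is an assumption based on the string case described above, and this is only a sketch, not the only way to do it.

```python
import pandas as pd

# Sample data assumed from the question: strings, with one row holding a list of strings.
df = pd.DataFrame({
    "A": ["aa", "ab", "bc", "cc"],
    "B": ["1234", "3456", ["1357", "2468"], "8901"],
})

# astype(str) turns the list cell into the text "['1357', '2468']";
# str.extract() then returns the first run of digits found in each cell.
df["digits"] = df["B"].astype(str).str.extract(r"(\d+)", expand=False)

print(df)
```

Note that the values in digits are still strings; add .astype(int) at the end if you need numbers.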

Solution 2

In your code, you are looping over the items in column B of the dataframe, but you don't have a way to index back into your original dataframe. Specifically, the line:

df['B'][item] = item_v

doesn't do what you want. It is placing a new item in column B with an index of item. If you try it with a small dataframe, you will probably see some odd values at the end of the frame. When I try this, I get:

In[36]: df
Out[36]: 
    A     B
0  aa  1234
1  ab  3456
2  bc  1357
3  cc  8901

In[37]: df['B'][item] = item_v

In[38]: df['B']
Out[38]: 
0       1234
1       3456
2       1357
3       8901
8901    8901 <-- ???
Name: B, dtype: object

To make matters worse, this line doesn't insert the value into the dataframe where you would expect. You will only see the new element when you look at df['B']. If you look at only df you will see the original dataframe without the extra item.

The correct way to set elements in a dataframe is to use .loc[], like:

df.loc[item,'B'] = item_v

This still doesn't address the original problem, which is how to get the correct index. One fix for your original code is to accumulate values for each item in column B in a list and then assign it back to column B like this:

newB = []
for item in df['B']:
    if len(item) > 4:
        item_v = str(item[1:5])
    else:
        item_v = item
    newB.append(item_v)
print(newB)
df.loc[:, 'B'] = newB 
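If you would rather keep the loop and write each value in place, one option is to iterate with Series.items(), which yields (index label, value) pairs, so .loc[] has the row label it needs. This is a sketch that assumes the bracketed values are stored as strings, as the len(item) > 4 check in the question suggests.

```python
import pandas as pd

# Sample data assumed from the question; the bracketed value is a plain string here.
df = pd.DataFrame({
    "A": ["aa", "ab", "bc", "cc"],
    "B": ["1234", "3456", "[1357, 2468]", "8901"],
})

# items() gives the index label together with the value,
# so each write can target the correct row through .loc[].
for idx, item in df["B"].items():
    if len(item) > 4:
        df.loc[idx, "B"] = item[1:5]

print(df)
```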

However, with pandas there are also solutions that don't require directly iterating over the items in column B.

For example, you can use .where() to replace only the strings longer than 4 characters, along with the .str methods to manipulate the text elements. This one-liner will do the job:

df.loc[:,'B'] = df['B'].where((df['B'].str.len() <= 4), df['B'].str[1:5])

This statement creates a Series that contains the item from column B if it is 4 or fewer characters, or the slice [1:5] of the item in column B if it is longer than 4 characters. This series is then assigned to replace column B in df.
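Here is the same .where() approach split into steps on a small sample frame, in case it helps to see the intermediate mask. The frame and the name short_enough are assumptions for illustration; the values in B are taken to be strings.

```python
import pandas as pd

# Sample data assumed from the question; all values in B are strings.
df = pd.DataFrame({
    "A": ["aa", "ab", "bc", "cc"],
    "B": ["1234", "3456", "[1357, 2468]", "8901"],
})

# Boolean mask: True where the string already has 4 or fewer characters.
short_enough = df["B"].str.len() <= 4

# where() keeps values where the mask is True and takes the slice [1:5] elsewhere.
df["B"] = df["B"].where(short_enough, df["B"].str[1:5])

print(df)
```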

Author: Feyzi Bagirov

Updated on March 12, 2020

Comments

  • Feyzi Bagirov, about 4 years ago

    I have a dataset that looks like this:

      A   B
    1 aa  1234
    2 ab  3456
    3 bc  [1357, 2468]
    4 cc  8901
    ...
    

    I need to iterate over column B and replace every value in square brackets ([]) with the first four digits inside those brackets, so the dataset would look like this:

      A   B
    1 aa  1234
    2 ab  3456
    3 bc  1357
    4 cc  8901
    ...
    

    I have this code:

    for item in df['B']:
        if len(item) > 4:
            item_v = str(item[1:5])
            df['B'][item] = item_v
            print(df['B'][item])
    

    This prints the truncated values. However, if I check the head of the df, it still has the old values:

       > df['B'].head()
    
       >  A   B
        1 aa  1234
        2 ab  3456
        3 bc  [1357, 2468]
        4 cc  8901
        ...
    

    What am I doing wrong?