Iterating over a column and replacing a value with an extracted string [Pandas]
Solution 1
The easiest and fastest way is to use the pandas str.get() method and create another column for the desired results.
Solution #1
This first solution works if your values in B are integers: [1234, 3456, [1357, 2468], 8901]
df['C'] = df['B'].str.get(0).astype(float)
df['C'] = df['C'].fillna(df['B'])
df['C'] = df['C'].astype(int)
Output:
A B C
0 aa 1234 1234
1 ab 3456 3456
2 bc [1357, 2468] 1357
3 cc 8901 8901
Then, you can delete column B if you don't need it.
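For reference, here is a minimal self-contained sketch of Solution #1. The sample frame is an assumption reconstructed from the output shown above, and the intermediate name first is only illustrative.

```python
import pandas as pd

# Assumed sample data: column B mixes plain integers with a list of integers.
df = pd.DataFrame({
    "A": ["aa", "ab", "bc", "cc"],
    "B": [1234, 3456, [1357, 2468], 8901],
})

# .str.get(0) picks the first element of list entries;
# plain integers have no first element and come back as NaN.
first = df["B"].str.get(0).astype(float)

# Fill the NaN rows with the original scalar values, then convert to int.
df["C"] = first.fillna(df["B"]).astype(int)

print(df)
```

Assigning the result back to df['C'] avoids relying on inplace=True, which is not reliable on a column selected from a dataframe in recent pandas versions.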
Solution #2
This solution works if your values in B are strings: ['1234', '3456', ['1357', '2468'], '8901']
import re
df['digits'] = df['B'].apply(lambda x: re.findall(r'\d+', str(x)))
df['digits'] = df['digits'].str.get(0)
print(df)
Output:
A B digits
0 aa 1234 1234
1 ab 3456 3456
2 bc [1357, 2468] 1357
3 cc 8901 8901
Again, you can delete column B if you don't need it.
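A vectorized variant of the same idea is sketched below. It swaps re.findall for Series.str.extract, which is not what the answer above uses, and the sample frame is an assumption based on the values listed for this solution.

```python
import pandas as pd

# Assumed sample data: column B mixes strings with a list of strings.
df = pd.DataFrame({
    "A": ["aa", "ab", "bc", "cc"],
    "B": ["1234", "3456", ["1357", "2468"], "8901"],
})

# Convert every cell to its string form, then capture the first run of digits.
# expand=False returns a Series because the pattern has a single group.
df["digits"] = df["B"].astype(str).str.extract(r"(\d+)", expand=False)

print(df)
```

Note that the extracted values are still strings; add .astype(int) if you need numbers.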
Solution 2
In your code, you are looping over the items in column B of the dataframe, but you don't have a way to index back into your original dataframe. Specifically, the line:
df['B'][item] = item_v
doesn't do what you want. It places a new item in column B with an index of item. If you try it with a small dataframe, you will probably see some odd values at the end of the frame. When I try this, I get:
In[36]: df
Out[36]:
A B
0 aa 1234
1 ab 3456
2 bc 1357
3 cc 8901
In[37]: df['B'][item] = item_v
In[38]: df['B']
Out[38]:
0 1234
1 3456
2 1357
3 8901
8901 8901 <-- ???
Name: B, dtype: object
To make matters worse, this line doesn't insert the value into the dataframe where you would expect. You will only see the new element when you look at df['B']. If you look at only df, you will see the original dataframe without the extra item.
The correct way to set elements in a dataframe is to use .loc[], like:
df.loc[item,'B'] = item_v
This still doesn't address the original problem, which is how to get the correct index. One fix for your original code is to accumulate values for each item in column B in a list and then assign it back to column B like this:
newB = []
for item in df['B']:
    if len(item) > 4:
        item_v = str(item[1:5])
    else:
        item_v = item
    newB.append(item_v)
print(newB)
df.loc[:, 'B'] = newB
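If you would rather keep the loop and write each value back in place, one option is to iterate with Series.items(), which yields the index label together with the value, so that .loc has the label it needs. This is a sketch, assuming column B holds strings and the frame is indexed 1 to 4 as in the question.

```python
import pandas as pd

# Assumed sample data: every value in B is a string, as in the question.
df = pd.DataFrame(
    {"A": ["aa", "ab", "bc", "cc"],
     "B": ["1234", "3456", "[1357, 2468]", "8901"]},
    index=[1, 2, 3, 4],
)

# items() yields (index label, value) pairs, so idx can be passed to .loc.
for idx, item in df["B"].items():
    if len(item) > 4:
        # Drop the opening bracket and keep the four digits that follow it.
        df.loc[idx, "B"] = item[1:5]

print(df)
```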
However, with pandas there are also solutions that don't require directly iterating over the items in column B.
For example, you can use .where() to replace only the strings longer than 4 characters, along with the .str functions to manipulate the text elements. This one-liner will do the job:
df.loc[:,'B'] = df['B'].where((df['B'].str.len() <= 4), df['B'].str[1:5])
This statement creates a Series that contains the item from column B if it is 4 or fewer characters, or the slice [1:5] of the item in column B if it is longer than 4 characters. This series is then assigned to replace column B in df.
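Here is the same one-liner as a runnable sketch, split into two steps for readability. The sample frame and the name is_short are assumptions for illustration; every value in B is taken to be a string.

```python
import pandas as pd

# Assumed sample data: every value in B is a string, as in the question.
df = pd.DataFrame({
    "A": ["aa", "ab", "bc", "cc"],
    "B": ["1234", "3456", "[1357, 2468]", "8901"],
})

# Boolean mask: True where the string is already 4 characters or shorter.
is_short = df["B"].str.len() <= 4

# where() keeps values where the mask is True and takes the replacement
# (characters 1 to 4 of the string) where it is False.
df["B"] = df["B"].where(is_short, df["B"].str[1:5])

print(df)
```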
Feyzi Bagirov
Updated on March 12, 2020
I have a dataset that looks like this:
A B
1 aa 1234
2 ab 3456
3 bc [1357, 2468]
4 cc 8901
...
I need to iterate over column B and replace every value in square brackets ([]) with the first four digits inside those brackets, so the dataset would look like this:
A B
1 aa 1234
2 ab 3456
3 bc 1357
4 cc 8901
...
I have this code:
for item in df['B']:
    if len(item) > 4:
        item_v = str(item[1:5])
        df['B'][item] = item_v
        print(df['B'][item])
This prints the truncated values; however, if I check the head of the df, it still has the old values:
> df['B'].head()
A B
1 aa 1234
2 ab 3456
3 bc [1357, 2468]
4 cc 8901
...
What am I doing wrong?