Extracting text from elements in Pandas column, writing to new column
It isn't clear, why second parentheses doesn't match. Maybe because char !
.
Then you can use extract with regular expression.
Regular expression \(([A-Za-z0-9 _]+)\)
means:
-
\(
matches a literal(
character -
(
begins a new group -
[A-Za-z0-9 _]
is a character set matching any letter (capital or lower case), digit or underscore and space -
+
matches the preceding element (the character set) one or more times. -
)
ends the group -
\)
matches a literal)
character
Second parentheses isn't matched, because regex exclude character !
- it isn't in brackets [A-Za-z0-9 _]
.
import pandas as pd
import numpy as np
import io
temp=u"""(info) text (yay!)
I love text
Text is fun
(more info) more text
lotsa text (boo!)"""
df = pd.read_csv(io.StringIO(temp), header=None, names=['original'])
print df
# original
#0 (info) text (yay!)
#1 I love text
#2 Text is fun
#3 (more info) more text
#4 lotsa text (boo!)
df['col1'] = df['original'].str.extract(r"\(([A-Za-z0-9 _]+)\)")
df['col2'] = df['original'].str.replace(r"\(([A-Za-z0-9 _]+)\)", "")
print df
# original col1 col2
#0 (info) text (yay!) info text (yay!)
#1 I love text NaN I love text
#2 Text is fun NaN Text is fun
#3 (more info) more text more info more text
#4 lotsa text (boo!) NaN lotsa text (boo!)
Related videos on Youtube
user770901
Updated on September 16, 2022Comments
-
user770901 over 1 year
I've got some data in a column (COL_NAME) of a Pandas DataFrame. I'd like to extract some text between '(' and ')' (this data either exists, or the parens don't exist at all, although there may be more than one set of parens in the data). I'd then like to write the data in the paren to another column, and then remove the '(XXX)' from the original string.
I.e.
COL_NAME ======== (info) text (yay!) I love text Text is fun (more info) more text lotsa text (boo!)
turns in to:
COL_NAME NEW_COL ======== ======= text (yay!) info i love text None Text is fun None more text more info lots text (boo!) None
I can do this by isolating the column, iterating through its elements, splitting on the (, creating two new lists and then adding them to the DataFrame, but there's assuredly a way more Pythonic/Pandic way of doing this, right?
Thanks!
-
WoodChopperI highly doubt that he coded
lotsa text (boo!)
tolots text (boo!)
-
-
user770901 over 8 yearsThis works great! Thank you so much. One note: @jezrael, is that it leaves a leading space on the "leftover" text in col2. Now there's the issue of applying lstrip to an entire column of a dataframe...But thank you for the well explained answer!