Concatenate strings from several rows using Pandas groupby
Solution 1
You can groupby the 'name'
and 'month'
columns, then call transform
which will return data aligned to the original df and apply a lambda where we join
the text entries:
In [119]:
df['text'] = df[['name','text','month']].groupby(['name','month'])['text'].transform(lambda x: ','.join(x))
df[['name','text','month']].drop_duplicates()
Out[119]:
name text month
0 name1 hej,du 11
2 name1 aj,oj 12
4 name2 fin,katt 11
6 name2 mycket,lite 12
I sub the original df by passing a list of the columns of interest df[['name','text','month']]
here and then call drop_duplicates
EDIT actually I can just call apply
and then reset_index
:
In [124]:
df.groupby(['name','month'])['text'].apply(lambda x: ','.join(x)).reset_index()
Out[124]:
name month text
0 name1 11 hej,du
1 name1 12 aj,oj
2 name2 11 fin,katt
3 name2 12 mycket,lite
update
the lambda
is unnecessary here:
In[38]:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
Out[38]:
name month text
0 name1 11 du
1 name1 12 aj,oj
2 name2 11 fin,katt
3 name2 12 mycket,lite
Solution 2
We can groupby the 'name' and 'month' columns, then call agg()
functions of Panda’s DataFrame objects.
The aggregation functionality provided by the agg()
function allows multiple statistics to be calculated per group in one calculation.
df.groupby(['name', 'month'], as_index = False).agg({'text': ' '.join})
Solution 3
The answer by EdChum provides you with a lot of flexibility but if you just want to concateate strings into a column of list objects you can also:
output_series = df.groupby(['name','month'])['text'].apply(list)
Solution 4
If you want to concatenate your "text" in a list:
df.groupby(['name', 'month'], as_index = False).agg({'text': list})
Solution 5
For me the above solutions were close but added some unwanted /n's
and dtype:object
, so here's a modified version:
df.groupby(['name', 'month'])['text'].apply(lambda text: ''.join(text.to_string(index=False))).str.replace('(\\n)', '').reset_index()
Related videos on Youtube
Comments
-
mattiasostmar about 2 years
I want to merge several strings in a dataframe based on a groupedby in Pandas.
This is my code so far:
import pandas as pd from io import StringIO data = StringIO(""" "name1","hej","2014-11-01" "name1","du","2014-11-02" "name1","aj","2014-12-01" "name1","oj","2014-12-02" "name2","fin","2014-11-01" "name2","katt","2014-11-02" "name2","mycket","2014-12-01" "name2","lite","2014-12-01" """) # load string as stream into dataframe df = pd.read_csv(data,header=0, names=["name","text","date"],parse_dates=[2]) # add column with month df["month"] = df["date"].apply(lambda x: x.month)
I want the end result to look like this:
I don't get how I can use groupby and apply some sort of concatenation of the strings in the column "text". Any help appreciated!
-
Matthias Fripp almost 4 yearsIn
pandas < 1.0
,.drop_duplicates()
ignores the index, which may give unexpected results. You can avoid this by using.agg(lambda x: ','.join(x))
instead of.transform().drop_duplicates()
. -
Alex Fedotov almost 4 yearsMan, you've just saved me a lot of time. Thank you. This is the best way to assemble the chronological lists of registrations/user ids into 'cohorts' that I know of. Thank you once again.
-
Raghavan vmvs over 3 yearsNeat and uncomplicated. Eminently fleixible also
-
IAmBotmaker over 3 years
drop_duplicates()
might not work if you do not include parameterdrop_duplicates(inplace=True)
or just rewrite the line of code asdf = df[['name','text','month']].drop_duplicates()
-
Thomas almost 3 yearsWhat ensures that the text e.g. in the first column is actually "hej du" and not "du hej"? Is there an implicit sort somewhere? How can I make this explicit, e.g. sort by the date column?
-
kağan hazal koçdemir over 2 yearshi, any ideas for dropping duplicates with agg function ?
-
Nicolas78 over 2 years@kağanhazalkoçdemir
agg({'text': lambda x: ' '.join(set(x))})
-
constantstranger about 2 yearsWhy did 'hej,du' change to just 'du' in the "update" section?
-
whydoesntwork about 2 yearsThis solution worked for me very well for getting the unique appearances too. I just used “set” instead of “list” and then daisy chained a join and presto. Note that it doesn’t work if there are nan values, so I had to use fillna() on the text field first. In my case the command ended: df.groupby(['doc_id'])['author'].apply(set).apply(", ".join).reset_index()
-
Bill about 2 yearsI don't think this adds spaces between the strings does it?