Concatenate strings from several rows using Pandas groupby

208,047

Solution 1

You can groupby the 'name' and 'month' columns, then call transform which will return data aligned to the original df and apply a lambda where we join the text entries:

In [119]:

df['text'] = df[['name','text','month']].groupby(['name','month'])['text'].transform(lambda x: ','.join(x))
df[['name','text','month']].drop_duplicates()
Out[119]:
    name         text  month
0  name1       hej,du     11
2  name1        aj,oj     12
4  name2     fin,katt     11
6  name2  mycket,lite     12

I sub the original df by passing a list of the columns of interest df[['name','text','month']] here and then call drop_duplicates

EDIT actually I can just call apply and then reset_index:

In [124]:

df.groupby(['name','month'])['text'].apply(lambda x: ','.join(x)).reset_index()

Out[124]:
    name  month         text
0  name1     11       hej,du
1  name1     12        aj,oj
2  name2     11     fin,katt
3  name2     12  mycket,lite

update

the lambda is unnecessary here:

In[38]:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()

Out[38]: 
    name  month         text
0  name1     11           du
1  name1     12        aj,oj
2  name2     11     fin,katt
3  name2     12  mycket,lite

Solution 2

We can groupby the 'name' and 'month' columns, then call agg() functions of Panda’s DataFrame objects.

The aggregation functionality provided by the agg() function allows multiple statistics to be calculated per group in one calculation.

df.groupby(['name', 'month'], as_index = False).agg({'text': ' '.join})

enter image description here

Solution 3

The answer by EdChum provides you with a lot of flexibility but if you just want to concateate strings into a column of list objects you can also:

output_series = df.groupby(['name','month'])['text'].apply(list)

Solution 4

If you want to concatenate your "text" in a list:

df.groupby(['name', 'month'], as_index = False).agg({'text': list})

Solution 5

For me the above solutions were close but added some unwanted /n's and dtype:object, so here's a modified version:

df.groupby(['name', 'month'])['text'].apply(lambda text: ''.join(text.to_string(index=False))).str.replace('(\\n)', '').reset_index()
Share:
208,047

Related videos on Youtube

mattiasostmar
Author by

mattiasostmar

Computational Media Analyst

Updated on February 15, 2022

Comments

  • mattiasostmar
    mattiasostmar about 2 years

    I want to merge several strings in a dataframe based on a groupedby in Pandas.

    This is my code so far:

    import pandas as pd
    from io import StringIO
    
    data = StringIO("""
    "name1","hej","2014-11-01"
    "name1","du","2014-11-02"
    "name1","aj","2014-12-01"
    "name1","oj","2014-12-02"
    "name2","fin","2014-11-01"
    "name2","katt","2014-11-02"
    "name2","mycket","2014-12-01"
    "name2","lite","2014-12-01"
    """)
    
    # load string as stream into dataframe
    df = pd.read_csv(data,header=0, names=["name","text","date"],parse_dates=[2])
    
    # add column with month
    df["month"] = df["date"].apply(lambda x: x.month)
    

    I want the end result to look like this:

    enter image description here

    I don't get how I can use groupby and apply some sort of concatenation of the strings in the column "text". Any help appreciated!

  • Matthias Fripp
    Matthias Fripp almost 4 years
    In pandas < 1.0, .drop_duplicates() ignores the index, which may give unexpected results. You can avoid this by using .agg(lambda x: ','.join(x)) instead of .transform().drop_duplicates().
  • Alex Fedotov
    Alex Fedotov almost 4 years
    Man, you've just saved me a lot of time. Thank you. This is the best way to assemble the chronological lists of registrations/user ids into 'cohorts' that I know of. Thank you once again.
  • Raghavan vmvs
    Raghavan vmvs over 3 years
    Neat and uncomplicated. Eminently fleixible also
  • IAmBotmaker
    IAmBotmaker over 3 years
    drop_duplicates() might not work if you do not include parameter drop_duplicates(inplace=True) or just rewrite the line of code as df = df[['name','text','month']].drop_duplicates()
  • Thomas
    Thomas almost 3 years
    What ensures that the text e.g. in the first column is actually "hej du" and not "du hej"? Is there an implicit sort somewhere? How can I make this explicit, e.g. sort by the date column?
  • kağan hazal koçdemir
    kağan hazal koçdemir over 2 years
    hi, any ideas for dropping duplicates with agg function ?
  • Nicolas78
    Nicolas78 over 2 years
    @kağanhazalkoçdemir agg({'text': lambda x: ' '.join(set(x))})
  • constantstranger
    constantstranger about 2 years
    Why did 'hej,du' change to just 'du' in the "update" section?
  • whydoesntwork
    whydoesntwork about 2 years
    This solution worked for me very well for getting the unique appearances too. I just used “set” instead of “list” and then daisy chained a join and presto. Note that it doesn’t work if there are nan values, so I had to use fillna() on the text field first. In my case the command ended: df.groupby(['doc_id'])['author'].apply(set).apply(", ".join).reset_index()
  • Bill
    Bill about 2 years
    I don't think this adds spaces between the strings does it?