Pandas - possible to aggregate two columns using two different aggregations?

21,394

The agg method can accept a dict, in which case the keys indicate the column to which the function is applied:

grouped.agg({'numberA':'sum', 'numberB':'min'})

For example,

import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'number A': np.arange(8),
                   'number B': np.arange(8) * 2})
grouped = df.groupby('A')

print(grouped.agg({
    'number A': 'sum',
    'number B': 'min'}))

yields

     number B  number A
A                      
bar         2         9
foo         0        19

This also shows that Pandas can handle spaces in column names. I'm not sure what the origin of the problem was, but literal spaces should not have posed a problem. If you wish to investigate this further,

print(df.columns)

without reassigning the column names, will show show us the repr of the names. Maybe there was a hard-to-see character in the column name that looked like a space (or some other character) but was actually a u'\xa0' (NO-BREAK SPACE), for example.

Share:
21,394
marcus adamski
Author by

marcus adamski

Updated on September 19, 2020

Comments

  • marcus adamski
    marcus adamski almost 4 years

    I'm loading a csv file, which has the following columns: date, textA, textB, numberA, numberB

    I want to group by the columns: date, textA and textB - but want to apply "sum" to numberA, but "min" to numberB.

    data = pd.read_table("file.csv", sep=",", thousands=',')
    grouped = data.groupby(["date", "textA", "textB"], as_index=False)
    

    ...but I cannot see how to then apply two different aggregate functions, to two different columns? I.e. sum(numberA), min(numberB)

  • marcus adamski
    marcus adamski almost 11 years
    thanks - for the quick response. I had tried this, but I get a "KeyError: u'no item named KeyError: u'no item named Number A '" Maybe that's related to my column name having a space in? I'll experiment to see if that makes a difference
  • Andy Hayden
    Andy Hayden almost 11 years
    Woah, you can also do stuff like this: g.agg({'A': ['min', 'max'], 'B': 'mean'})
  • marcus adamski
    marcus adamski almost 11 years
    Just added a "names" param to read_table, and renamed all my columns by removing any spaces in the column names. That did the trick - the method I had previously tried (and Andy suggested) now works.
  • Andy Hayden
    Andy Hayden almost 11 years
    @marcusadamski this should work fine with spaces in the columns names.