get first and last values in a groupby

36,352

Solution 1

Option 1

def first_last(df):
    return df.ix[[0, -1]]

df.groupby(level=0, group_keys=False).apply(first_last)

enter image description here


Option 2 - only works if index is unique

idx = df.index.to_series().groupby(level=0).agg(['first', 'last']).stack()
df.loc[idx]

Option 3 - per notes below, this only makes sense when there are no NAs

I also abused the agg function. The code below works, but is far uglier.

df.reset_index(1).groupby(level=0).agg(['first', 'last']).stack() \
    .set_index('level_1', append=True).reset_index(1, drop=True) \
    .rename_axis([None, None])

Note

per @unutbu: agg(['first', 'last']) take the firs non-na values.

I interpreted this as, it must then be necessary to run this column by column. Further, forcing index level=1 to align may not even make sense.

Let's include another test

df = pd.DataFrame(np.arange(20).reshape(10, -1),
                  [list('aaaabbbccd'),
                   list('abcdefghij')],
                  list('XY'))

df.loc[tuple('aa'), 'X'] = np.nan

def first_last(df):
    return df.ix[[0, -1]]

df.groupby(level=0, group_keys=False).apply(first_last)

enter image description here

df.reset_index(1).groupby(level=0).agg(['first', 'last']).stack() \
    .set_index('level_1', append=True).reset_index(1, drop=True) \
    .rename_axis([None, None])

enter image description here

Sure enough! This second solution is taking the first valid value in column X. It is now nonsensical to have forced that value to align with the index a.

Solution 2

This could be on of the easy solution.

df.groupby(level = 0, as_index= False).nth([0,-1])

      X   Y
a a   0   1
  d   6   7
b e   8   9
  g  12  13
c h  14  15
  i  16  17
d j  18  19

Hope this helps. (Y)

Solution 3

Please try this:

For last value: df.groupby('Column_name').nth(-1),

For first value: df.groupby('Column_name').nth(0)

Share:
36,352
Brian
Author by

Brian

Updated on June 24, 2020

Comments

  • Brian
    Brian almost 4 years

    I have a dataframe df

    df = pd.DataFrame(np.arange(20).reshape(10, -1),
                      [['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
                       ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']],
                      ['X', 'Y'])
    

    How do I get the first and last rows, grouped by the first level of the index?

    I tried

    df.groupby(level=0).agg(['first', 'last']).stack()
    

    and got

              X   Y
    a first   0   1
      last    6   7
    b first   8   9
      last   12  13
    c first  14  15
      last   16  17
    d first  18  19
      last   18  19
    

    This is so close to what I want. How can I preserve the level 1 index and get this instead:

          X   Y
    a a   0   1
      d   6   7
    b e   8   9
      g  12  13
    c h  14  15
      i  16  17
    d j  18  19
      j  18  19
    
  • unutbu
    unutbu over 7 years
    Even though it looks more complicated, the reset_index/agg solution is significantly faster than the groupby/apply solution when there are a lot of groups. For example, when df = pd.DataFrame(np.random.randint(100, size=(10**3,4)), columns=list('ABCD')).set_index(['A','B']).rename_axis([None‌​, None]).
  • unutbu
    unutbu over 7 years
    Nice solutions! Might also be good to note that agg(['first', 'last']) returns the first and last non-NaN value if available. apply(first_last) will return the first and last values even if they are NaN.
  • Richard DiSalvo
    Richard DiSalvo almost 4 years
    i think for latest pandas version we need df.iloc instead of df.ix see e.g. stackoverflow.com/questions/59991397/…
  • jtlz2
    jtlz2 about 2 years
    But simultaneously?
  • GregOliveira
    GregOliveira almost 2 years
    Helped a lot! :-)