Pandas combine two group by's, filter and merge the groups(counts)
Solution 1
Use concat to merge them together:
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
df = pd.concat([df1, df_success],axis=1)
print (df)
    DELETE  POST  PUT  SUCCESS
ID
1        1     0    1        2
2        0     1    1        1
Another solution with value_counts:
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS')
df = pd.concat([df1, df_success],axis=1)
print (df)
    DELETE  POST  PUT  SUCCESS
ID
1        1     0    1        2
2        0     1    1        1
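One caveat with the value_counts variant: an ID that has no 'Y' rows is missing from df_success, so concat leaves NaN in SUCCESS for that ID and the column turns into floats. A minimal sketch of one way around this, assuming sample data of my own (ID 3 is made up for illustration) and aligning the counts to df1's index with reindex:

```python
import pandas as pd

# Hypothetical sample data: ID 3 has no successful events.
df = pd.DataFrame({
    'ID': [1, 2, 2, 1, 3],
    'EVENT': ['PUT', 'POST', 'PUT', 'DELETE', 'PUT'],
    'SUCCESS': ['Y', 'Y', 'N', 'Y', 'N'],
})

df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)

# Count 'Y' rows per ID, then align to every ID present in df1,
# filling IDs without any success with 0 instead of NaN.
df_success = (df.loc[df['SUCCESS'] == 'Y', 'ID']
                .value_counts()
                .rename('SUCCESS')
                .reindex(df1.index, fill_value=0))

out = pd.concat([df1, df_success], axis=1)
print(out)
```

The first variant (the boolean sum per ID) does not need this step, because every ID takes part in the groupby.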
Finally, it is possible to convert the index to a column with reset_index and remove any leftover columns name with rename_axis:
df = df.reset_index().rename_axis(None, axis=1)
print (df)
   ID  DELETE  POST  PUT  SUCCESS
0   1       1     0    1        2
1   2       0     1    1        1
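Putting the steps of Solution 1 together, here is a self-contained sketch that rebuilds the four rows from the question and runs the whole pipeline. The helper name event_summary is my own and is not part of pandas.

```python
import pandas as pd

def event_summary(df):
    """Per-ID EVENT counts plus the number of 'Y' rows (hypothetical helper)."""
    # Count each EVENT per ID and spread the events into columns.
    counts = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
    # Summing the boolean mask per ID counts the successes.
    success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
    return (pd.concat([counts, success], axis=1)
              .reset_index()
              .rename_axis(None, axis=1))

# Sample data copied from the question.
df = pd.DataFrame({
    'ID': [1, 2, 2, 1],
    'EVENT': ['PUT', 'POST', 'PUT', 'DELETE'],
    'SUCCESS': ['Y', 'Y', 'N', 'Y'],
})

result = event_summary(df)
print(result)
```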
Solution 2
pandas
pd.get_dummies(df.EVENT) \
.assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)) \
.groupby(df.ID).sum().reset_index()
   ID  DELETE  POST  PUT  SUCCESS
0   1       1     0    1        2
1   2       0     1    1        1
numpy and pandas
f, u = pd.factorize(df.EVENT.values)
n = u.size
d = np.eye(n)[f]
s = (df.SUCCESS.values == 'Y').astype(int)
d1 = pd.DataFrame(
np.column_stack([d, s]),
df.index, np.append(u, 'SUCCESS')
)
d1.groupby(df.ID).sum().reset_index()
   ID  DELETE  POST  PUT  SUCCESS
0   1       1     0    1        2
1   2       0     1    1        1
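The numpy version relies on two steps that may not be obvious: pd.factorize maps each EVENT to an integer code, and indexing an identity matrix with those codes yields one one-hot row per record. A small illustration on the question's four events (variable names follow the code above; passing dtype=int to np.eye is my addition so the indicators stay integers):

```python
import numpy as np
import pandas as pd

events = np.array(['PUT', 'POST', 'PUT', 'DELETE'], dtype=object)

# f holds one integer code per row; u holds the unique labels
# in order of first appearance.
f, u = pd.factorize(events)
print(f)  # [0 1 0 2]
print(u)  # ['PUT' 'POST' 'DELETE']

# Row i of the identity matrix is the one-hot vector for code i,
# so indexing with f builds the full indicator matrix in one step.
d = np.eye(u.size, dtype=int)[f]
print(d)
```

Summing the rows of d per ID then gives the same counts as the groupby/unstack approach.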
Timing
small data
%%timeit
f, u = pd.factorize(df.EVENT.values)
n = u.size
d = np.eye(n)[f]
s = (df.SUCCESS.values == 'Y').astype(int)
d1 = pd.DataFrame(
np.column_stack([d, s]),
df.index, np.append(u, 'SUCCESS')
)
d1.groupby(df.ID).sum().reset_index()
1000 loops, best of 3: 1.32 ms per loop
%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
pd.concat([df1, df_success],axis=1).reset_index()
100 loops, best of 3: 3.3 ms per loop
%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS')
pd.concat([df1, df_success],axis=1).reset_index()
100 loops, best of 3: 3.28 ms per loop
%timeit pd.get_dummies(df.EVENT).assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)).groupby(df.ID).sum().reset_index()
100 loops, best of 3: 2.62 ms per loop
large data
df = pd.DataFrame(dict(
ID=np.random.randint(100, size=100000),
EVENT=np.random.choice('PUT POST DELETE'.split(), size=100000),
SUCCESS=np.random.choice(list('YN'), size=100000)
))
%%timeit
f, u = pd.factorize(df.EVENT.values)
n = u.size
d = np.eye(n)[f]
s = (df.SUCCESS.values == 'Y').astype(int)
d1 = pd.DataFrame(
np.column_stack([d, s]),
df.index, np.append(u, 'SUCCESS')
)
d1.groupby(df.ID).sum().reset_index()
100 loops, best of 3: 10.8 ms per loop
%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
pd.concat([df1, df_success],axis=1).reset_index()
100 loops, best of 3: 17.7 ms per loop
%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS')
pd.concat([df1, df_success],axis=1).reset_index()
100 loops, best of 3: 17.4 ms per loop
%timeit pd.get_dummies(df.EVENT).assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)).groupby(df.ID).sum().reset_index()
100 loops, best of 3: 16.8 ms per loop
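The %timeit and %%timeit lines above are IPython magics. Outside IPython, a rough equivalent with the standard-library timeit module might look like the sketch below. The function names, the seed, and the repeat counts are my own choices, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape of data as the "large data" case, with a fixed seed.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'ID': rng.integers(100, size=100_000),
    'EVENT': rng.choice(['PUT', 'POST', 'DELETE'], size=100_000),
    'SUCCESS': rng.choice(['Y', 'N'], size=100_000),
})

def via_concat(df):
    # Solution 1: groupby/unstack plus a per-ID boolean sum.
    df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
    s = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
    return pd.concat([df1, s], axis=1).reset_index()

def via_dummies(df):
    # Solution 2 (pandas): indicator columns summed per ID.
    return (pd.get_dummies(df['EVENT'])
              .assign(SUCCESS=df['SUCCESS'].eq('Y').astype(int))
              .groupby(df['ID']).sum().reset_index())

for fn in (via_concat, via_dummies):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(df), number=10, repeat=3)) / 10
    print(f'{fn.__name__}: {best * 1e3:.2f} ms per loop')
```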
Sheepy
Updated on June 14, 2022
I have a dataframe for which I need to combine two different groupbys, one of them filtered.
ID  EVENT   SUCCESS
1   PUT     Y
2   POST    Y
2   PUT     N
1   DELETE  Y
The table below shows how I would like the data to look. The first part groups the 'EVENT' counts per ID; the second counts the number of successes ('Y') per ID.
ID  PUT  POST  DELETE  SUCCESS
1   1    0     1       2
2   1    1     0       1
I've tried a few techniques, and the closest I've found is two separate methods, which yield the following:
group_df = df.groupby(['ID', 'EVENT'])
count_group_df = group_df.size().unstack()
Which yields the following for the 'EVENT' counts
ID  PUT  POST  DELETE
1   1    0     1
2   1    1     0
For the successes with filters, I don't know whether I can join this to the first set on 'ID':
df_success = df.loc[df['SUCCESS'] == 'Y', ['ID', 'SUCCESS']]
count_group_df_2 = df_success.groupby(['ID', 'SUCCESS'])

ID  SUCCESS
1   2
2   1
I need to combine these somehow?
Additionally, I'd like to merge the counts of two of the 'EVENT' values, for example PUT and POST, into one column.
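For that last point, one possible approach (a sketch, not taken from the answers above) is to build the per-ID counts first and then add the two columns together. The column name PUT_POST is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID': [1, 2, 2, 1],
    'EVENT': ['PUT', 'POST', 'PUT', 'DELETE'],
    'SUCCESS': ['Y', 'Y', 'N', 'Y'],
})

counts = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
out = pd.concat([counts, success], axis=1)

# Sum the PUT and POST counts into one column, then drop the originals.
# 'PUT_POST' is a hypothetical column name.
out['PUT_POST'] = out['PUT'] + out['POST']
out = out.drop(columns=['PUT', 'POST']).reset_index()
print(out)
```

This assumes both PUT and POST occur somewhere in the data; if one of them never appears, unstack will not create its column and the addition will raise a KeyError.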