pandas groupby two columns and summarize by mean
You need reset_index
or parameter as_index=False
in groupby
, because you get MuliIndex
and by default the higher levels of the indexes are sparsified to make the console output a bit easier on the eyes:
np.random.seed(100)
df = pd.DataFrame()
df['id'] = [1,1,1,2,2,3,3,3,3,4,4,5]
df['view'] = ['A', 'B', 'A', 'A','B', 'A', 'B', 'A', 'A','B', 'A', 'B']
df['value'] = np.random.random(12)
print (df)
id view value
0 1 A 0.543405
1 1 B 0.278369
2 1 A 0.424518
3 2 A 0.844776
4 2 B 0.004719
5 3 A 0.121569
6 3 B 0.670749
7 3 A 0.825853
8 3 A 0.136707
9 4 B 0.575093
10 4 A 0.891322
11 5 B 0.209202
res = df.groupby(['id', 'view'])['value'].mean().reset_index()
print (res)
id view value
0 1 A 0.483961
1 1 B 0.278369
2 2 A 0.844776
3 2 B 0.004719
4 3 A 0.361376
5 3 B 0.670749
6 4 A 0.891322
7 4 B 0.575093
8 5 B 0.209202
res = df.groupby(['id', 'view'], as_index=False)['value'].mean()
print (res)
id view value
0 1 A 0.483961
1 1 B 0.278369
2 2 A 0.844776
3 2 B 0.004719
4 3 A 0.361376
5 3 B 0.670749
6 4 A 0.891322
7 4 B 0.575093
8 5 B 0.209202
Related videos on Youtube

spore234
Updated on September 15, 2022Comments
-
spore234 10 months
I have a data frame like this:
df = pd.DataFrame() df['id'] = [1,1,1,2,2,3,3,3,3,4,4,5] df['view'] = ['A', 'B', 'A', 'A','B', 'A', 'B', 'A', 'A','B', 'A', 'B'] df['value'] = np.random.random(12) id view value 0 1 A 0.625781 1 1 B 0.330084 2 1 A 0.024532 3 2 A 0.154651 4 2 B 0.196960 5 3 A 0.393941 6 3 B 0.607217 7 3 A 0.422823 8 3 A 0.994323 9 4 B 0.366650 10 4 A 0.649585 11 5 B 0.513923
I now want to summarize for each
id
eachview
by mean of 'value'. Think of this as some ids have repeated observations for view, and I want to summarize them. For example, id 1 has two observations for A.I tried
res = df.groupby(['id', 'view'])['value'].mean()
This actually almost what I want, but pandas combines the
id
andview
column into one, which I do not want.id view 1 A 0.325157 B 0.330084 2 A 0.154651 B 0.196960 3 A 0.603696 B 0.607217 4 A 0.649585 B 0.366650 5 B 0.513923
also res.shape is of dimension (9,)
my desired output would be this:
id view value 1 A 0.325157 1 B 0.330084 2 A 0.154651 2 B 0.196960 3 A 0.603696 3 B 0.607217 4 A 0.649585 4 B 0.366650 5 B 0.513923
where the column names and dimensions are kept and where the id is repeated. Each id should have only 1 row for A and B.
How can I achieve this?
-
spore234 over 6 yearscool, thanks. Which of the two ways is 'better' (e.g. faster,..)
-
jezrael over 6 yearsFirst is more general, second works with
mean
,sum
. But I think performance is same.