Calculate average of every x rows in a table and create new table
Solution 1
You can create an artificial group using df.index//2
(or as @DSM pointed out, using np.arange(len(df))//2
- so that it works for all indices) and then use groupby:
df.groupby(np.arange(len(df))//2).mean()
Out[13]:
a b c d
0 3.0 30.5 31.5 35.0
1 7.0 35.0 21.5 25.0
2 11.0 37.5 41.5 38.5
3 15.0 10.0 16.0 18.5
4 19.0 15.5 27.0 38.0
Solution 2
You can approach this problem using pd.rolling()
to create a rolling average and then just grab every second element using iloc
df = df.rolling(2).mean()
df = df.iloc[::2, :]
Note that the first observation will be missing (i.e. the rolling starts at the top) so make sure to check that your data is sorted how you need it.
Solution 3
NumPythonic way would be to extract the elements as a NumPy array with df.values
, then reshape to a 3D
array with 2
elements along axis=1
and 4
along axis=2
and perform the average reduction along axis=1
and finally convert back to a dataframe, like so -
pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
As it turns out, you can introduce NumPy's very efficient tool : np.einsum
to do this average-reduction
as a combination of sum-reduction
and scaling-down
, like so -
pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
Please note that the proposed approaches assume that the number of rows is divisible by 2
.
Also as noted by @DSM
, to preserve the column names, you need to add columns=df.columns
when converting back to Dataframe, i.e. -
pd.DataFrame(...,columns=df.columns)
Sample run -
>>> df
0 1 2 3
0 2 50 25 26
1 4 11 38 44
2 6 33 16 25
3 8 37 27 25
4 10 28 48 32
5 12 47 35 45
6 14 8 16 7
7 16 12 16 30
8 18 22 39 29
9 20 9 15 47
>>> pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
0 1 2 3
0 3 30.5 31.5 35.0
1 7 35.0 21.5 25.0
2 11 37.5 41.5 38.5
3 15 10.0 16.0 18.5
4 19 15.5 27.0 38.0
>>> pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
0 1 2 3
0 3 30.5 31.5 35.0
1 7 35.0 21.5 25.0
2 11 37.5 41.5 38.5
3 15 10.0 16.0 18.5
4 19 15.5 27.0 38.0
Runtime tests -
In this section, let's test out all the three approaches listed thus far to solve the problem for performance, including @ayhan's solution with groupby
.
In [24]: A = np.random.randint(0,9,(200,50))
In [25]: df = pd.DataFrame(A)
In [26]: %timeit df.groupby(df.index//2).mean() # @ayhan's solution
1000 loops, best of 3: 1.61 ms per loop
In [27]: %timeit pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))
1000 loops, best of 3: 317 µs per loop
In [28]: %timeit pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)
1000 loops, best of 3: 266 µs per loop
Solution 4
df.set_index(np.arange(len(df)) // 2).mean(level=0)
Solution 5
In your case, as you want to average the rows, assuming your dataframe name is new
new = new.groupby(np.arange(len(new)) // 2).mean()
If one wants to do the average for the columns
new = new.groupby(np.arrange(len(new.columns)) // 2, axis=1).mean()
Gnu
Updated on March 14, 2021Comments
-
Gnu about 3 years
I have a long table of data (~200 rows by 50 columns) and I need to create a code that can calculate the mean values of every two rows and for each column in the table with the final output being a new table of the mean values. This is obviously crazy to do in Excel! I use python3 and I am aware of some similar questions:here, here and here. But none of these helps as I need some elegant code to work with multiple columns and produces an organised data table. By the way my original datatable has been imported using pandas and is defined as a dataframe but could not find an easy way to do this in pandas. Help is much appreciated.
An example of the table (short version) is:
a b c d 2 50 25 26 4 11 38 44 6 33 16 25 8 37 27 25 10 28 48 32 12 47 35 45 14 8 16 7 16 12 16 30 18 22 39 29 20 9 15 47
Expected mean table:
a b c d 3 30.5 31.5 35 7 35 21.5 25 11 37.5 41.5 38.5 15 10 16 18.5 19 15.5 27 38