Postgres window function and group by exception
You are not, in fact, using aggregate functions. You are using window functions. That's why PostgreSQL demands sp.payout
and s.buyin
to be included in the GROUP BY
clause.
By appending an OVER
clause, the aggregate function sum()
is turned into a window function, which aggregates values per partition while keeping all rows.
You can combine window functions and aggregate functions. Aggregations are applied first. I did not understand from your description how you want to handle multiple payouts / buyins per event. As a guess, I calculate a sum of them per event. Now I can remove sp.payout
and s.buyin
from the GROUP BY
clause and get one row per player
and event
:
SELECT p.name
, e.event_id
, e.date
, sum(sum(sp.payout)) OVER w
- sum(sum(s.buyin )) OVER w AS "Profit/Loss"
FROM player p
JOIN result r ON r.player_id = p.player_id
JOIN game g ON g.game_id = r.game_id
JOIN event e ON e.event_id = g.event_id
JOIN structure s ON s.structure_id = g.structure_id
JOIN structure_payout sp ON sp.structure_id = g.structure_id
AND sp.position = r.position
WHERE p.player_id = 17
GROUP BY e.event_id
WINDOW w AS (ORDER BY e.date, e.event_id)
ORDER BY e.date, e.event_id;
In this expression: sum(sum(sp.payout)) OVER w
, the outer sum()
is a window function, the inner sum()
is an aggregate function.
Assuming p.player_id
and e.event_id
are PRIMARY KEY
in their respective tables.
I added e.event_id
to the ORDER BY
of the WINDOW
clause to arrive at a deterministic sort order. (There could be multiple events on the same date.) Also included event_id
in the result to distinguish multiple events per day.
While the query restricts to a single player (WHERE p.player_id = 17
), we don't need to add p.name
or p.player_id
to GROUP BY
and ORDER BY
. If one of the joins would multiply rows unduly, the resulting sum would be incorrect (partly or completely multiplied). Grouping by p.name
could not repair the query then.
I also removed e.date
from the GROUP BY
clause. The primary key e.event_id
covers all columns of the input row since PostgreSQL 9.1.
If you change the query to return multiple players at once, adapt:
...
WHERE p.player_id < 17 -- example - multiple players
GROUP BY p.name, p.player_id, e.date, e.event_id -- e.date and p.name redundant
WINDOW w AS (ORDER BY p.name, p.player_id, e.date, e.event_id)
ORDER BY p.name, p.player_id, e.date, e.event_id;
Unless p.name
is defined unique (?), group and order by player_id
additionally to get correct results in a deterministic sort order.
I only kept e.date
and p.name
in GROUP BY
to have identical sort order in all clauses, hoping for a performance benefit. Else, you can remove the columns there. (Similar for just e.date
in the first query.)
Martin
Currently a Architecture/Tech guy at a SaaS company based in Altrincham. Looking at anything that scales, I'm not language agnostic, I speak C#, general .NET/ASP.NET, linux, and bash. I'm an OpenSource advocate, and I'm working on various home based projects. Personal philosophies: "Trust, like spandex, is a privilege not a right" (combination of me and Matthew Lillard, I think) "Trust, but verify" (Ronald Reagan)
Updated on November 12, 2021Comments
-
Martin over 2 years
I'm trying to put together a query that will retrieve the statistics of a user (profit/loss) as a cumulative result, over a period of time.
Here's the query I have so far:
SELECT p.name, e.date, sum(sp.payout) OVER (ORDER BY e.date) - sum(s.buyin) OVER (ORDER BY e.date) AS "Profit/Loss" FROM result r JOIN game g ON r.game_id = g.game_id JOIN event e ON g.event_id = e.event_id JOIN structure s ON g.structure_id = s.structure_id JOIN structure_payout sp ON g.structure_id = sp.structure_id AND r.position = sp.position JOIN player p ON r.player_id = p.player_id WHERE p.player_id = 17 GROUP BY p.name, e.date, e.event_id, sp.payout, s.buyin ORDER BY p.name, e.date ASC
The query will run. However, the result is slightly incorrect. The reason is that an
event
can have multiple games (with differentsp.payouts
). Therefore, the above comes out with multiple rows if a user has 2 results in an event with different payouts (i.e. there are 4 games per event, and a user gets £20 from one, and £40 from another).The obvious solution would be to amend the
GROUP BY
to:GROUP BY p.name, e.date, e.event_id
However, Postgres complains at this as it doesn't appear to be recognizing that
sp.payout
ands.buyin
are inside an aggregate function. I get the error:column "sp.payout" must appear in the GROUP BY clause or be used in an aggregate function
I'm running 9.1 on Ubuntu Linux server.
Am I missing something, or could this be a genuine defect in Postgres? -
Martin over 12 yearsThe first query is working, however, the output of the query isn't giving the required results. I can see what amendment would work in theory, but Postgres doesn't like it. I'll try the above later and let you know. however, it looks like there will be 2 rows in the output of your query if an "event_id" has more than one "payout" amount.
-
Martin over 12 yearsI just tried it with the amendments you've suggested, and it does come back with multiple rows where there are multiple sp.payout values for a single event_id.
-
Erwin Brandstetter over 12 years@Martin: See my amended answer.
-
ryantuck over 7 yearswouldn't you need to
partition by p.name, e.event_id
in your window function as well for the grouping to work as expected? -
Erwin Brandstetter over 7 years@RyanTuck: There were several unexplained / fuzzy bits. I improved and clarified (or at least hope so).
-
David V McKay over 2 years@ErwinBrandstetter , may I just say ! I LOVE the formatting of your first sql snippet in this answer. Utilizing both of the 2 dimensions available to the human visual system for organizing code REALLY makes it digestible, and MORE coders should do this to achieve higher code density simultaneously with faster visual parsing