Keeping rows from double-counting in a GROUP BY
Solution 1
The underlying problem (joining to multiple tables with multiple matches) is explained in this related answer:
To fix, I first simplified & formatted your query:
select pe.year
, sum(pe.wins) AS wins
, sum(pe.losses) AS losses
, sum(pe.ties) AS ties
, array_agg(po.id) AS position_id
, array_agg(po.name) AS position_names
from periods_positions_coaches_linking pp
join positions po ON po.id = pp.position
join periods pe ON pe.id = pp.period
where pp.coach = 1
group by pe.year
order by pe.year;
Yields the same, incorrect result as your original, but simpler / faster / easier to read.
-
No point in joining the table
coach
as long as you don't use columns in theSELECT
list. I removed it completely and replaced theWHERE
condition withwhere pp.coach = 1
. -
You don't need
COALESCE
.NULL
values are ignored in the aggregate functionsum()
. No need to substitute0
. -
Use table aliases to make it easier to read.
Next, I solved your problem like this:
SELECT *
FROM (
SELECT pe.year
, array_agg(DISTINCT po.id) AS position_id
, array_agg(DISTINCT po.name) AS position_names
FROM periods_positions_coaches_linking pp
JOIN positions po ON po.id = pp.position
JOIN periods pe ON pe.id = pp.period
WHERE pp.coach = 1
GROUP BY pe.year
) po
LEFT JOIN (
SELECT pe.year
, sum(pe.wins) AS wins
, sum(pe.losses) AS losses
, sum(pe.ties) AS ties
FROM (
SELECT period
FROM periods_positions_coaches_linking
WHERE coach = 1
GROUP BY period
) pp
JOIN periods pe ON pe.id = pp.period
GROUP BY pe.year
) pe USING (year)
ORDER BY year;
-
Aggregate positions and periods separately before joining them.
-
In the first sub-query
po
list positions only once witharray_agg(DISTINCT ...)
. -
In the second sub-query
pe
...-
GROUP BY period
, because a coach can have multiple positions per period. -
JOIN
to periods-data after that, and then aggregate to get sums.
-
Solution 2
use distinct
as shown here
code:
select periods.year as year,
sum(coalesce(periods.wins, 0)) as wins,
sum(coalesce(periods.losses, 0)) as losses,
sum(coalesce(periods.ties, 0)) as ties,
array_agg( distinct positions.id) as position_id,
array_agg( distinct positions.name) as position_names
from periods_positions_coaches_linking
join coaches on coaches.id = periods_positions_coaches_linking.coach
join positions on positions.id = periods_positions_coaches_linking.position
join periods on periods.id = periods_positions_coaches_linking.period
where coaches.id = 1
group by periods.year, positions.id
order by periods.year;
Solution 3
In your case, the easiest way might be to divide out the positions:
select periods.year as year,
sum(coalesce(periods.wins, 0))/COUNT(distinct positions.id) as wins,
sum(coalesce(periods.losses, 0))/COUNT(distinct positions.id) as losses,
sum(coalesce(periods.ties, 0))/COUNT(distinct positions.id) as ties,
array_agg(distinct positions.id) as position_id,
array_agg(distinct positions.name) as position_names
from periods_positions_coaches_linking join
coaches
on coaches.id = periods_positions_coaches_linking.coach join
positions
on positions.id = periods_positions_coaches_linking.position join
periods
on periods.id = periods_positions_coaches_linking.period
where coaches.id = 1
group by periods.year
order by periods.year;
The number of positions scales the wins, losses, and ties, so dividing it out adjusts the counts.
ldrg
Updated on June 26, 2022Comments
-
ldrg almost 2 years
Here's the basic guts of my schema and problem: http://sqlfiddle.com/#!1/72ec9/4/2
Note that the periods table can refer to a variable range of time - it could be an entire season, it could be a few games or one game. For a given team and year all period rows represent exclusive ranges of time.
I've got a query written which joins up tables and uses a GROUP BY periods.year to aggregate scores for a season (see sqlfiddle). However, if a coach had two positions in the same year the GROUP BY will count the same period row twice. How can I ditch the duplicates when a coach held two positions but still sum up periods when a year is comprised of multiple periods? If there's a better way to do the schema I'd also appreciate it if you pointed it out to me.
-
ldrg over 11 yearsThis now returns two rows for year 2014. I'd prefer to get a single row returned for each year but I understand if SQL can't do that. If I'm using this query is there a guarantee that the two rows will have the same scores and I don't have to do any summing/processing outside of the query? (I'm worried about corner cases).
-
Teena Thomas over 11 yearsto guarantee that the scores are same, you need to add that criteria in the
where
clause -
Erwin Brandstetter over 11 yearsThis fails when a coach has multiple positions within per period, thereby multiplying values in the aggregation. You need to get distinct periods per coach first ...