Keeping rows from double-counting in a GROUP BY

sql postgresql join aggregate-functions

15,232

Solution 1

The underlying problem (joining to multiple tables with multiple matches) is explained in this related answer:

Two SQL LEFT JOINS produce incorrect result

To fix, I first simplified & formatted your query:

select pe.year
     , sum(pe.wins)       AS wins
     , sum(pe.losses)     AS losses
     , sum(pe.ties)       AS ties
     , array_agg(po.id)   AS position_id
     , array_agg(po.name) AS position_names
from   periods_positions_coaches_linking pp
join   positions po ON po.id = pp.position
join   periods   pe ON pe.id = pp.period
where  pp.coach = 1
group  by pe.year
order  by pe.year;

Yields the same, incorrect result as your original, but simpler / faster / easier to read.

No point in joining the table coach as long as you don't use columns in the SELECT list. I removed it completely and replaced the WHERE condition with where pp.coach = 1.
You don't need COALESCE. NULL values are ignored in the aggregate function sum(). No need to substitute 0.
Use table aliases to make it easier to read.

Next, I solved your problem like this:

SELECT *
FROM  (
   SELECT pe.year
        , array_agg(DISTINCT po.id)   AS position_id
        , array_agg(DISTINCT po.name) AS position_names
   FROM   periods_positions_coaches_linking pp
   JOIN   positions                         po ON po.id = pp.position
   JOIN   periods                           pe ON pe.id = pp.period
   WHERE  pp.coach = 1
   GROUP  BY pe.year
   ) po
LEFT   JOIN (
   SELECT pe.year
        , sum(pe.wins)   AS wins
        , sum(pe.losses) AS losses
        , sum(pe.ties)   AS ties
   FROM  (
      SELECT period
      FROM   periods_positions_coaches_linking
      WHERE  coach = 1
      GROUP  BY period
      ) pp
   JOIN   periods pe ON pe.id = pp.period
   GROUP  BY pe.year
   ) pe USING (year)
ORDER  BY year;

Aggregate positions and periods separately before joining them.
In the first sub-query po list positions only once with array_agg(DISTINCT ...).
In the second sub-query pe ...
- GROUP BY period, because a coach can have multiple positions per period.
- JOIN to periods-data after that, and then aggregate to get sums.

db<>fiddle here
_{Old sqlfiddle}

Solution 2

use distinct as shown here

code:

select periods.year as year,
sum(coalesce(periods.wins, 0)) as wins,
sum(coalesce(periods.losses, 0)) as losses,
sum(coalesce(periods.ties, 0)) as ties,
array_agg( distinct positions.id) as position_id,
array_agg( distinct positions.name) as position_names

from periods_positions_coaches_linking

join coaches on coaches.id = periods_positions_coaches_linking.coach
join positions on positions.id = periods_positions_coaches_linking.position
join periods on periods.id = periods_positions_coaches_linking.period

where coaches.id = 1

group by periods.year, positions.id
order by periods.year;

Solution 3

In your case, the easiest way might be to divide out the positions:

select periods.year as year,
       sum(coalesce(periods.wins, 0))/COUNT(distinct positions.id) as wins,
       sum(coalesce(periods.losses, 0))/COUNT(distinct positions.id) as losses,
       sum(coalesce(periods.ties, 0))/COUNT(distinct positions.id) as ties,
       array_agg(distinct positions.id) as position_id,
       array_agg(distinct positions.name) as position_names
from periods_positions_coaches_linking join
     coaches
     on coaches.id = periods_positions_coaches_linking.coach join
     positions
     on positions.id = periods_positions_coaches_linking.position join
     periods
     on periods.id = periods_positions_coaches_linking.period
where coaches.id = 1
group by periods.year
order by periods.year;

The number of positions scales the wins, losses, and ties, so dividing it out adjusts the counts.

15,232

Author by

ldrg

Updated on June 26, 2022

Comments

ldrg almost 2 years

Here's the basic guts of my schema and problem: http://sqlfiddle.com/#!1/72ec9/4/2

Note that the periods table can refer to a variable range of time - it could be an entire season, it could be a few games or one game. For a given team and year all period rows represent exclusive ranges of time.

I've got a query written which joins up tables and uses a GROUP BY periods.year to aggregate scores for a season (see sqlfiddle). However, if a coach had two positions in the same year the GROUP BY will count the same period row twice. How can I ditch the duplicates when a coach held two positions but still sum up periods when a year is comprised of multiple periods? If there's a better way to do the schema I'd also appreciate it if you pointed it out to me.
ldrg over 11 years

This now returns two rows for year 2014. I'd prefer to get a single row returned for each year but I understand if SQL can't do that. If I'm using this query is there a guarantee that the two rows will have the same scores and I don't have to do any summing/processing outside of the query? (I'm worried about corner cases).
Teena Thomas over 11 years

to guarantee that the scores are same, you need to add that criteria in the where clause
Erwin Brandstetter over 11 years

This fails when a coach has multiple positions within per period, thereby multiplying values in the aggregation. You need to get distinct periods per coach first ...