Keeping rows from double-counting in a GROUP BY

15,232

Solution 1

The underlying problem (joining to multiple tables with multiple matches) is explained in this related answer:

To fix, I first simplified & formatted your query:

select pe.year
     , sum(pe.wins)       AS wins
     , sum(pe.losses)     AS losses
     , sum(pe.ties)       AS ties
     , array_agg(po.id)   AS position_id
     , array_agg(po.name) AS position_names
from   periods_positions_coaches_linking pp
join   positions po ON po.id = pp.position
join   periods   pe ON pe.id = pp.period
where  pp.coach = 1
group  by pe.year
order  by pe.year;

Yields the same, incorrect result as your original, but simpler / faster / easier to read.

  • No point in joining the table coach as long as you don't use columns in the SELECT list. I removed it completely and replaced the WHERE condition with where pp.coach = 1.

  • You don't need COALESCE. NULL values are ignored in the aggregate function sum(). No need to substitute 0.

  • Use table aliases to make it easier to read.

Next, I solved your problem like this:

SELECT *
FROM  (
   SELECT pe.year
        , array_agg(DISTINCT po.id)   AS position_id
        , array_agg(DISTINCT po.name) AS position_names
   FROM   periods_positions_coaches_linking pp
   JOIN   positions                         po ON po.id = pp.position
   JOIN   periods                           pe ON pe.id = pp.period
   WHERE  pp.coach = 1
   GROUP  BY pe.year
   ) po
LEFT   JOIN (
   SELECT pe.year
        , sum(pe.wins)   AS wins
        , sum(pe.losses) AS losses
        , sum(pe.ties)   AS ties
   FROM  (
      SELECT period
      FROM   periods_positions_coaches_linking
      WHERE  coach = 1
      GROUP  BY period
      ) pp
   JOIN   periods pe ON pe.id = pp.period
   GROUP  BY pe.year
   ) pe USING (year)
ORDER  BY year;
  • Aggregate positions and periods separately before joining them.

  • In the first sub-query po list positions only once with array_agg(DISTINCT ...).

  • In the second sub-query pe ...

    • GROUP BY period, because a coach can have multiple positions per period.
    • JOIN to periods-data after that, and then aggregate to get sums.

db<>fiddle here
Old sqlfiddle

Solution 2

use distinct as shown here

code:

select periods.year as year,
sum(coalesce(periods.wins, 0)) as wins,
sum(coalesce(periods.losses, 0)) as losses,
sum(coalesce(periods.ties, 0)) as ties,
array_agg( distinct positions.id) as position_id,
array_agg( distinct positions.name) as position_names

from periods_positions_coaches_linking

join coaches on coaches.id = periods_positions_coaches_linking.coach
join positions on positions.id = periods_positions_coaches_linking.position
join periods on periods.id = periods_positions_coaches_linking.period

where coaches.id = 1

group by periods.year, positions.id
order by periods.year;

Solution 3

In your case, the easiest way might be to divide out the positions:

select periods.year as year,
       sum(coalesce(periods.wins, 0))/COUNT(distinct positions.id) as wins,
       sum(coalesce(periods.losses, 0))/COUNT(distinct positions.id) as losses,
       sum(coalesce(periods.ties, 0))/COUNT(distinct positions.id) as ties,
       array_agg(distinct positions.id) as position_id,
       array_agg(distinct positions.name) as position_names
from periods_positions_coaches_linking join
     coaches
     on coaches.id = periods_positions_coaches_linking.coach join
     positions
     on positions.id = periods_positions_coaches_linking.position join
     periods
     on periods.id = periods_positions_coaches_linking.period
where coaches.id = 1
group by periods.year
order by periods.year;

The number of positions scales the wins, losses, and ties, so dividing it out adjusts the counts.

Share:
15,232
ldrg
Author by

ldrg

Updated on June 26, 2022

Comments

  • ldrg
    ldrg almost 2 years

    Here's the basic guts of my schema and problem: http://sqlfiddle.com/#!1/72ec9/4/2

    Note that the periods table can refer to a variable range of time - it could be an entire season, it could be a few games or one game. For a given team and year all period rows represent exclusive ranges of time.

    I've got a query written which joins up tables and uses a GROUP BY periods.year to aggregate scores for a season (see sqlfiddle). However, if a coach had two positions in the same year the GROUP BY will count the same period row twice. How can I ditch the duplicates when a coach held two positions but still sum up periods when a year is comprised of multiple periods? If there's a better way to do the schema I'd also appreciate it if you pointed it out to me.

  • ldrg
    ldrg over 11 years
    This now returns two rows for year 2014. I'd prefer to get a single row returned for each year but I understand if SQL can't do that. If I'm using this query is there a guarantee that the two rows will have the same scores and I don't have to do any summing/processing outside of the query? (I'm worried about corner cases).
  • Teena Thomas
    Teena Thomas over 11 years
    to guarantee that the scores are same, you need to add that criteria in the where clause
  • Erwin Brandstetter
    Erwin Brandstetter over 11 years
    This fails when a coach has multiple positions within per period, thereby multiplying values in the aggregation. You need to get distinct periods per coach first ...