DISTINCT with PARTITION BY vs. GROUPBY

16,225

Solution 1

Performance:

Winner: GROUP BY

Some very rudimentary testing on a large table with unindexed columns showed that at least in my case the two queries generated a completely different query plan. The one for PARTITION BY was significantly slower.

The GROUP BY query plan included only a table scan and aggregation operation while the PARTITION BY plan had two nested loop self-joins. The PARTITION BY took about 2800ms on the second run, the GROUP BY took only 500ms.

Readability / Maintainability:

Winner: GROUP BY

Based on the opinions of the commenters here the PARTITION BY is less readable for most developers so it will be probably also harder to maintain in the future.

Flexibility

Winner: PARTITION BY

PARTITION BY gives you more flexibility in choosing the grouping columns. With GROUP BY you can have only one set of grouping columns for all aggregated columns. With DISTINCT + PARTITION BY you can have different column in each partition. Also on some DBMSs you can chose from more aggregation/analytic functions in the OVER clause.

Solution 2

Using sum() as an analytic function with over partition by is not necessary. I don't think there is a big difference between them in any sense. In oracle there are lot more analytic function than aggregation function. I think ms-sql is the same case. And for example lag(), lead(), rank(), dense rank(), etc are much harder to implement with only group by. Of course this argument is not really for defending the first version...

Maybe there were previously more computed fields in the result set which are not implementable with group by.

Share:
16,225
Andris
Author by

Andris

Updated on June 25, 2022

Comments

  • Andris
    Andris about 2 years

    I have found some SQL queries in an application I am examining like this:

    SELECT DISTINCT
    Company, Warehouse, Item,
    SUM(quantity) OVER (PARTITION BY Company, Warehouse, Item) AS stock
    

    I'm quite sure this gives the same result as:

    SELECT
    Company, Warehouse, Item,
    SUM(quantity) AS stock
    GROUP BY Company, Warehouse, Item
    

    Is there any benefit (performance, readability, additional flexibility in writing the query, maintainability, etc.) of using the first approach over the later?

  • Andris
    Andris over 10 years
    In the first query there is a DISTINCT after SELECT so it return only one row for each company/warehouse/item like the second.
  • SergeFantino
    SergeFantino over 10 years
    Ok, fair enough... still the DISTINCT is applied to every stock row, and need to take into account each value: company/warehouse/item and sum(quantity). If you look at the exec plan (ok, it may depend on your database) the DISTINCT cost adds on the analytical query, which is already a twice as costly as the simple group-by.