SQL Performance: SELECT DISTINCT versus GROUP BY

68,230

Solution 1

The performance difference is probably due to the execution of the subquery in the SELECT clause. I am guessing that it is re-executing this query for every row before the distinct. For the group by, it would execute once after the group by.

Try replacing it with a join, instead:

select . . .,
       parentcnt
from . . . left outer join
      (SELECT PARENT_ITEM_ID, COUNT(PKID) as parentcnt
       FROM ITEM_PARENTS 
      ) p
      on items.item_id = p.parent_item_id

Solution 2

I'm fairly sure that GROUP BY and DISTINCT have roughly the same execution plan.

The difference here since we have to guess (since we don't have the explain plans) is IMO that the inline subquery gets executed AFTER the GROUP BY but BEFORE the DISTINCT.

So if your query returns 1M rows and gets aggregated to 1k rows:

  • The GROUP BY query would have run the subquery 1000 times,
  • Whereas the DISTINCT query would have run the subquery 1000000 times.

The tkprof explain plan would help demonstrate this hypothesis.


While we're discussing this, I think it's important to note that the way the query is written is misleading both to the reader and to the optimizer: you obviously want to find all rows from item/item_transactions that have a TASK_INVENTORY_STEP.STEP_TYPE with a value of "TYPE A".

IMO your query would have a better plan and would be more easily readable if written like this:

SELECT ITEMS.ITEM_ID,
       ITEMS.ITEM_CODE,
       ITEMS.ITEMTYPE,
       ITEM_TRANSACTIONS.STATUS,
       (SELECT COUNT(PKID) 
          FROM ITEM_PARENTS 
         WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID) AS CHILD_COUNT
  FROM ITEMS
  JOIN ITEM_TRANSACTIONS 
    ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID 
   AND ITEM_TRANSACTIONS.FLAG = 1
 WHERE EXISTS (SELECT NULL
                 FROM JOB_INVENTORY   
                 JOIN TASK_INVENTORY_STEP 
                   ON JOB_INVENTORY.JOB_ITEM_ID=TASK_INVENTORY_STEP.JOB_ITEM_ID
                WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
                  AND ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID)

In many cases, a DISTINCT can be a sign that the query is not written properly (because a good query shouldn't return duplicates).

Note also that 4 tables are not used in your original select.

Solution 3

The first thing that should be noted is the use of Distinct indicates a code smell, aka anti-pattern. It generally means that there is a missing join or an extra join that is generating duplicate data. Looking at your query above, I am guessing that the reason why group by is faster (without seeing the query), is that the location of the group by reduces the number of records that end up being returned. Whereas distinct is blowing out the result set and doing row by row comparisons.

Update to approach

Sorry, I should have been more clear. Records are generated when users perform certain tasks in the system, so there is no schedule. A user could generate a single record in a day or hundreds per-hour. The important things is that each time a user runs a search, up-to-date records must be returned, which makes me doubtful that a materialized view would work here, especially if the query populating it would take long to run.

I do believe this is the exact reason to use a materialized view. So the process would work this way. You take the long running query as the piece that builds out your materialized view, since we know the user only cares about "new" data after they perform some arbitrary task in the system. So what you want to do is query against this base materialized view, which can be refreshed constantly on the back-end, the persistence strategy involved should not choke out the materialized view (persisting a few hundred records at a time won't crush anything). What this will allow is Oracle to grab a read lock (note we don't care how many sources read our data, we only care about writers). In the worst case a user will have "stale" data for microseconds, so unless this is a financial trading system on Wall Street or a system for a nuclear reactor, these "blips" should go unnoticed by even the most eagle eyed users.

Code example of how to do this:

create materialized view dept_mv FOR UPDATE as select * from dept; 

Now the key to this is as long as you don' t invoke refresh you won't lose any of the persisted data. It will be up to you to determine when you want to "base line" your materialized view again (midnight perhaps?)

Share:
68,230

Related videos on Youtube

woemler
Author by

woemler

Bioinformatics software developer & all-around nerd.

Updated on September 05, 2020

Comments

  • woemler
    woemler over 3 years

    I have been trying to improve query times for an existing Oracle database-driven application that has been running a little sluggish. The application executes several large queries, such as the one below, which can take over an hour to run. Replacing the DISTINCT with a GROUP BY clause in the query below shrank execution time from 100 minutes to 10 seconds. My understanding was that SELECT DISTINCT and GROUP BY operated in pretty much the same way. Why such a huge disparity between execution times? What is the difference in how the query is executed on the back-end? Is there ever a situation where SELECT DISTINCT runs faster?

    Note: In the following query, WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A' represents just one of a number of ways that results can be filtered. This example was provided to show the reasoning for joining all of the tables that do not have columns included in the SELECT and would result in about a tenth of all available data

    SQL using DISTINCT:

    SELECT DISTINCT 
        ITEMS.ITEM_ID,
        ITEMS.ITEM_CODE,
        ITEMS.ITEMTYPE,
        ITEM_TRANSACTIONS.STATUS,
        (SELECT COUNT(PKID) 
            FROM ITEM_PARENTS 
            WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID
            ) AS CHILD_COUNT
    FROM
        ITEMS
        INNER JOIN ITEM_TRANSACTIONS 
            ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID 
            AND ITEM_TRANSACTIONS.FLAG = 1
        LEFT OUTER JOIN ITEM_METADATA 
            ON ITEMS.ITEM_ID = ITEM_METADATA.ITEM_ID
        LEFT OUTER JOIN JOB_INVENTORY 
            ON ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID     
        LEFT OUTER JOIN JOB_TASK_INVENTORY 
            ON JOB_INVENTORY.JOB_ITEM_ID = JOB_TASK_INVENTORY.JOB_ITEM_ID
        LEFT OUTER JOIN JOB_TASKS 
            ON JOB_TASK_INVENTORY.TASKID = JOB_TASKS.TASKID                              
        LEFT OUTER JOIN JOBS 
            ON JOB_TASKS.JOB_ID = JOBS.JOB_ID
        LEFT OUTER JOIN TASK_INVENTORY_STEP 
            ON JOB_INVENTORY.JOB_ITEM_ID = TASK_INVENTORY_STEP.JOB_ITEM_ID 
        LEFT OUTER JOIN TASK_STEP_INFORMATION 
            ON TASK_INVENTORY_STEP.JOB_ITEM_ID = TASK_STEP_INFORMATION.JOB_ITEM_ID
    WHERE 
        TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
    ORDER BY 
        ITEMS.ITEM_CODE
    

    SQL using GROUP BY:

    SELECT
        ITEMS.ITEM_ID,
        ITEMS.ITEM_CODE,
        ITEMS.ITEMTYPE,
        ITEM_TRANSACTIONS.STATUS,
        (SELECT COUNT(PKID) 
            FROM ITEM_PARENTS 
            WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID
            ) AS CHILD_COUNT
    FROM
        ITEMS
        INNER JOIN ITEM_TRANSACTIONS 
            ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID 
            AND ITEM_TRANSACTIONS.FLAG = 1
        LEFT OUTER JOIN ITEM_METADATA 
            ON ITEMS.ITEM_ID = ITEM_METADATA.ITEM_ID
        LEFT OUTER JOIN JOB_INVENTORY 
            ON ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID     
        LEFT OUTER JOIN JOB_TASK_INVENTORY 
            ON JOB_INVENTORY.JOB_ITEM_ID = JOB_TASK_INVENTORY.JOB_ITEM_ID
        LEFT OUTER JOIN JOB_TASKS 
            ON JOB_TASK_INVENTORY.TASKID = JOB_TASKS.TASKID                              
        LEFT OUTER JOIN JOBS 
            ON JOB_TASKS.JOB_ID = JOBS.JOB_ID
        LEFT OUTER JOIN TASK_INVENTORY_STEP 
            ON JOB_INVENTORY.JOB_ITEM_ID = TASK_INVENTORY_STEP.JOB_ITEM_ID 
        LEFT OUTER JOIN TASK_STEP_INFORMATION 
            ON TASK_INVENTORY_STEP.JOB_ITEM_ID = TASK_STEP_INFORMATION.JOB_ITEM_ID
    WHERE 
        TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
    GROUP BY
        ITEMS.ITEM_ID,
        ITEMS.ITEM_CODE,
        ITEMS.ITEMTYPE,
        ITEM_TRANSACTIONS.STATUS
    ORDER BY 
        ITEMS.ITEM_CODE
    

    Here is the Oracle query plan for the query using DISTINCT:

    Oracle query plan for query using DISTINCT

    Here is the Oracle query plan for the query using GROUP BY:

    Oracle query plan for query using GROUP BY

    • Hamlet Hakobyan
      Hamlet Hakobyan over 11 years
      Show the query with group by.
    • symcbean
      symcbean over 11 years
      I don't have the answer to your question, but I expect that seeing BOTH queries, their explain plans and the number of logical GETs might help in understanding (FWIW I would have expected DISTINCT to have a performance advantage, if at all).
    • Sam Axe
      Sam Axe over 11 years
      In SQL Server you can get Query Execution Plans.. can you get something similar in Oracle? That would tell you where the difference was.
    • symcbean
      symcbean over 11 years
      BTW: why the big long chain of LEFT joins when you only want records with a 'TYPE A' at the end?
    • Robotic Cat
      Robotic Cat over 11 years
      Two things; 1) Put your GROUP BY query in your question and 2) Run an EXPLAIN PLAN on each query and also add the output to the question.
    • Charles Bretana
      Charles Bretana over 11 years
      ITEM_PARENTS has the children in it ?? that's interesting.
    • woemler
      woemler over 11 years
      Thanks for the comments, I have added the same query with the GROUP BY and clarified the reasoning for the query structure a little. I will generate query plans and add them shortly.
    • woemler
      woemler over 11 years
      Thanks @Dan-o, i have added the query plan.
    • woemler
      woemler over 11 years
      Thanks, @HamletHakobyan, I have added the group by.
    • woemler
      woemler over 11 years
      @symcbean: This is just one example of how results are filtered for this query. Columns from most of these tables can be referenced in the WHERE clause.
    • Woot4Moo
      Woot4Moo over 11 years
      How often are these data sets refreshed? This seems like a good candidate for materialized views
    • woemler
      woemler over 11 years
      @Woot4Moo: Typically, several hundred records are added a day. The records displayed in the output have to be up-to-date, so the materialized view would have to be refreshed quite often.
    • Woot4Moo
      Woot4Moo over 11 years
      hmm, define quite often. When I hear a several hundred a day that implies some type of hourly or scheduled based insert.
    • woemler
      woemler over 11 years
      @Woot4Moo: Sorry, I should have been more clear. Records are generated when users perform certain tasks in the system, so there is no schedule. A user could generate a single record in a day or hundreds per-hour. The important things is that each time a user runs a search, up-to-date records must be returned, which makes me doubtful that a materialized view would work here, especially if the query populating it would take long to run.
    • Woot4Moo
      Woot4Moo over 11 years
      I updated my post about how you could utilize a materialized view. Please let me know if it isn't clear. The key point is that you can insert data into the materialized view.
    • Ciro Santilli OurBigBook.com
      Ciro Santilli OurBigBook.com over 8 years
  • Clockwork-Muse
    Clockwork-Muse over 11 years
    +1 - This is exactly what I was thinking of too (including the potential solution), but I don't know enough about Oracle to be sure.
  • Vincent Malgrat
    Vincent Malgrat over 11 years
    +1 for code smell. Queries joining tables through PK shouldn't return duplicates ; if they do maybe something is amiss :)
  • woemler
    woemler over 11 years
    This would seem to be the bottleneck. I tried removing the subquery and the query executed as quickly as the GROUP BY version (100 min vs 20 sec). Thanks!
  • woemler
    woemler over 11 years
    Thanks for the response. The query given is just an example that shows one of the distantly joined tables being used to filter results. Columns from almost every table joined in this query could potentially be used in the WHERE clause.
  • woemler
    woemler over 11 years
    You are definitely right on this point. The schema is pretty poorly designed, with a lot of redundancy, from many years of having modules with new tables tacked on without a schema overhaul. Unfortunately, I have to live with what I have.
  • Vincent Malgrat
    Vincent Malgrat over 11 years
    You should still use SEMI-JOIN (EXISTS or IN) when appropriate instead of DISTINCT, it is clearer to both future reader and perhaps more importantly to the optimizer.