SQL Performance: SELECT DISTINCT versus GROUP BY
Solution 1
The performance difference is probably due to the execution of the subquery in the SELECT
clause. I am guessing that it is re-executing this query for every row before the distinct. For the group by
, it would execute once after the group by.
Try replacing it with a join, instead:
select . . .,
parentcnt
from . . . left outer join
(SELECT PARENT_ITEM_ID, COUNT(PKID) as parentcnt
FROM ITEM_PARENTS
) p
on items.item_id = p.parent_item_id
Solution 2
I'm fairly sure that GROUP BY
and DISTINCT
have roughly the same execution plan.
The difference here since we have to guess (since we don't have the explain plans) is IMO that the inline subquery gets executed AFTER the GROUP BY
but BEFORE the DISTINCT
.
So if your query returns 1M rows and gets aggregated to 1k rows:
- The
GROUP BY
query would have run the subquery 1000 times, - Whereas the
DISTINCT
query would have run the subquery 1000000 times.
The tkprof explain plan would help demonstrate this hypothesis.
While we're discussing this, I think it's important to note that the way the query is written is misleading both to the reader and to the optimizer: you obviously want to find all rows from item/item_transactions that have a TASK_INVENTORY_STEP.STEP_TYPE
with a value of "TYPE A".
IMO your query would have a better plan and would be more easily readable if written like this:
SELECT ITEMS.ITEM_ID,
ITEMS.ITEM_CODE,
ITEMS.ITEMTYPE,
ITEM_TRANSACTIONS.STATUS,
(SELECT COUNT(PKID)
FROM ITEM_PARENTS
WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID) AS CHILD_COUNT
FROM ITEMS
JOIN ITEM_TRANSACTIONS
ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID
AND ITEM_TRANSACTIONS.FLAG = 1
WHERE EXISTS (SELECT NULL
FROM JOB_INVENTORY
JOIN TASK_INVENTORY_STEP
ON JOB_INVENTORY.JOB_ITEM_ID=TASK_INVENTORY_STEP.JOB_ITEM_ID
WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
AND ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID)
In many cases, a DISTINCT can be a sign that the query is not written properly (because a good query shouldn't return duplicates).
Note also that 4 tables are not used in your original select.
Solution 3
The first thing that should be noted is the use of Distinct
indicates a code smell, aka anti-pattern. It generally means that there is a missing join or an extra join that is generating duplicate data. Looking at your query above, I am guessing that the reason why group by
is faster (without seeing the query), is that the location of the group by
reduces the number of records that end up being returned. Whereas distinct
is blowing out the result set and doing row by row comparisons.
Update to approach
Sorry, I should have been more clear. Records are generated when users perform certain tasks in the system, so there is no schedule. A user could generate a single record in a day or hundreds per-hour. The important things is that each time a user runs a search, up-to-date records must be returned, which makes me doubtful that a materialized view would work here, especially if the query populating it would take long to run.
I do believe this is the exact reason to use a materialized view. So the process would work this way. You take the long running query as the piece that builds out your materialized view, since we know the user only cares about "new" data after they perform some arbitrary task in the system. So what you want to do is query against this base materialized view, which can be refreshed constantly on the back-end, the persistence strategy involved should not choke out the materialized view (persisting a few hundred records at a time won't crush anything). What this will allow is Oracle to grab a read lock (note we don't care how many sources read our data, we only care about writers). In the worst case a user will have "stale" data for microseconds, so unless this is a financial trading system on Wall Street or a system for a nuclear reactor, these "blips" should go unnoticed by even the most eagle eyed users.
Code example of how to do this:
create materialized view dept_mv FOR UPDATE as select * from dept;
Now the key to this is as long as you don' t invoke refresh you won't lose any of the persisted data. It will be up to you to determine when you want to "base line" your materialized view again (midnight perhaps?)
Related videos on Youtube
woemler
Bioinformatics software developer & all-around nerd.
Updated on September 05, 2020Comments
-
woemler over 3 years
I have been trying to improve query times for an existing Oracle database-driven application that has been running a little sluggish. The application executes several large queries, such as the one below, which can take over an hour to run. Replacing the
DISTINCT
with aGROUP BY
clause in the query below shrank execution time from 100 minutes to 10 seconds. My understanding was thatSELECT DISTINCT
andGROUP BY
operated in pretty much the same way. Why such a huge disparity between execution times? What is the difference in how the query is executed on the back-end? Is there ever a situation whereSELECT DISTINCT
runs faster?Note: In the following query,
WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
represents just one of a number of ways that results can be filtered. This example was provided to show the reasoning for joining all of the tables that do not have columns included in theSELECT
and would result in about a tenth of all available dataSQL using
DISTINCT
:SELECT DISTINCT ITEMS.ITEM_ID, ITEMS.ITEM_CODE, ITEMS.ITEMTYPE, ITEM_TRANSACTIONS.STATUS, (SELECT COUNT(PKID) FROM ITEM_PARENTS WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID ) AS CHILD_COUNT FROM ITEMS INNER JOIN ITEM_TRANSACTIONS ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID AND ITEM_TRANSACTIONS.FLAG = 1 LEFT OUTER JOIN ITEM_METADATA ON ITEMS.ITEM_ID = ITEM_METADATA.ITEM_ID LEFT OUTER JOIN JOB_INVENTORY ON ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID LEFT OUTER JOIN JOB_TASK_INVENTORY ON JOB_INVENTORY.JOB_ITEM_ID = JOB_TASK_INVENTORY.JOB_ITEM_ID LEFT OUTER JOIN JOB_TASKS ON JOB_TASK_INVENTORY.TASKID = JOB_TASKS.TASKID LEFT OUTER JOIN JOBS ON JOB_TASKS.JOB_ID = JOBS.JOB_ID LEFT OUTER JOIN TASK_INVENTORY_STEP ON JOB_INVENTORY.JOB_ITEM_ID = TASK_INVENTORY_STEP.JOB_ITEM_ID LEFT OUTER JOIN TASK_STEP_INFORMATION ON TASK_INVENTORY_STEP.JOB_ITEM_ID = TASK_STEP_INFORMATION.JOB_ITEM_ID WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A' ORDER BY ITEMS.ITEM_CODE
SQL using
GROUP BY
:SELECT ITEMS.ITEM_ID, ITEMS.ITEM_CODE, ITEMS.ITEMTYPE, ITEM_TRANSACTIONS.STATUS, (SELECT COUNT(PKID) FROM ITEM_PARENTS WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID ) AS CHILD_COUNT FROM ITEMS INNER JOIN ITEM_TRANSACTIONS ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID AND ITEM_TRANSACTIONS.FLAG = 1 LEFT OUTER JOIN ITEM_METADATA ON ITEMS.ITEM_ID = ITEM_METADATA.ITEM_ID LEFT OUTER JOIN JOB_INVENTORY ON ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID LEFT OUTER JOIN JOB_TASK_INVENTORY ON JOB_INVENTORY.JOB_ITEM_ID = JOB_TASK_INVENTORY.JOB_ITEM_ID LEFT OUTER JOIN JOB_TASKS ON JOB_TASK_INVENTORY.TASKID = JOB_TASKS.TASKID LEFT OUTER JOIN JOBS ON JOB_TASKS.JOB_ID = JOBS.JOB_ID LEFT OUTER JOIN TASK_INVENTORY_STEP ON JOB_INVENTORY.JOB_ITEM_ID = TASK_INVENTORY_STEP.JOB_ITEM_ID LEFT OUTER JOIN TASK_STEP_INFORMATION ON TASK_INVENTORY_STEP.JOB_ITEM_ID = TASK_STEP_INFORMATION.JOB_ITEM_ID WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A' GROUP BY ITEMS.ITEM_ID, ITEMS.ITEM_CODE, ITEMS.ITEMTYPE, ITEM_TRANSACTIONS.STATUS ORDER BY ITEMS.ITEM_CODE
Here is the Oracle query plan for the query using
DISTINCT
:Here is the Oracle query plan for the query using
GROUP BY
:-
Hamlet Hakobyan over 11 yearsShow the query with
group by
. -
symcbean over 11 yearsI don't have the answer to your question, but I expect that seeing BOTH queries, their explain plans and the number of logical GETs might help in understanding (FWIW I would have expected DISTINCT to have a performance advantage, if at all).
-
Sam Axe over 11 yearsIn SQL Server you can get Query Execution Plans.. can you get something similar in Oracle? That would tell you where the difference was.
-
symcbean over 11 yearsBTW: why the big long chain of LEFT joins when you only want records with a 'TYPE A' at the end?
-
Robotic Cat over 11 yearsTwo things; 1) Put your GROUP BY query in your question and 2) Run an EXPLAIN PLAN on each query and also add the output to the question.
-
Charles Bretana over 11 yearsITEM_PARENTS has the children in it ?? that's interesting.
-
woemler over 11 yearsThanks for the comments, I have added the same query with the GROUP BY and clarified the reasoning for the query structure a little. I will generate query plans and add them shortly.
-
woemler over 11 yearsThanks @Dan-o, i have added the query plan.
-
woemler over 11 yearsThanks, @HamletHakobyan, I have added the group by.
-
woemler over 11 years@symcbean: This is just one example of how results are filtered for this query. Columns from most of these tables can be referenced in the WHERE clause.
-
Woot4Moo over 11 yearsHow often are these data sets refreshed? This seems like a good candidate for materialized views
-
woemler over 11 years@Woot4Moo: Typically, several hundred records are added a day. The records displayed in the output have to be up-to-date, so the materialized view would have to be refreshed quite often.
-
Woot4Moo over 11 yearshmm, define quite often. When I hear a several hundred a day that implies some type of hourly or scheduled based insert.
-
woemler over 11 years@Woot4Moo: Sorry, I should have been more clear. Records are generated when users perform certain tasks in the system, so there is no schedule. A user could generate a single record in a day or hundreds per-hour. The important things is that each time a user runs a search, up-to-date records must be returned, which makes me doubtful that a materialized view would work here, especially if the query populating it would take long to run.
-
Woot4Moo over 11 yearsI updated my post about how you could utilize a materialized view. Please let me know if it isn't clear. The key point is that you can insert data into the materialized view.
-
Ciro Santilli OurBigBook.com over 8 years
-
-
Clockwork-Muse over 11 years+1 - This is exactly what I was thinking of too (including the potential solution), but I don't know enough about Oracle to be sure.
-
Vincent Malgrat over 11 years+1 for code smell. Queries joining tables through PK shouldn't return duplicates ; if they do maybe something is amiss :)
-
woemler over 11 yearsThis would seem to be the bottleneck. I tried removing the subquery and the query executed as quickly as the GROUP BY version (100 min vs 20 sec). Thanks!
-
woemler over 11 yearsThanks for the response. The query given is just an example that shows one of the distantly joined tables being used to filter results. Columns from almost every table joined in this query could potentially be used in the WHERE clause.
-
woemler over 11 yearsYou are definitely right on this point. The schema is pretty poorly designed, with a lot of redundancy, from many years of having modules with new tables tacked on without a schema overhaul. Unfortunately, I have to live with what I have.
-
Vincent Malgrat over 11 yearsYou should still use SEMI-JOIN (EXISTS or IN) when appropriate instead of DISTINCT, it is clearer to both future reader and perhaps more importantly to the optimizer.