Huge performance difference when using GROUP BY vs DISTINCT
The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct
approach is executed like:
- Copy all
business_key
values to a temporary table - Sort the temporary table
- Scan the temporary table, returning each item that is different from the one before it
The group by
could be executed like:
- Scan the full table, storing each value of
business key
in a hashtable - Return the keys of the hashtable
The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.
Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.
Comments
-
Martin Dimitrov almost 2 years
I am performing some tests on a
HSQLDB
server with a table containing 500 000 entries. The table has no indices. There are 5000 distinct business keys. I need a list of them.Naturally I started with a
DISTINCT
query:SELECT DISTINCT business_key FROM memory WHERE concept <> 'case' OR attrib <> 'status' OR value <> 'closed';
It takes around 90 seconds!!!
Then I tried using
GROUP BY
:SELECT business_key FROM memory WHERE concept <> 'case' OR attrib <> 'status' OR value <> 'closed'; GROUP BY business_key
And it takes 1 second!!!
Trying to figure out the difference I ran
EXLAIN PLAN FOR
but it seems to give the same information for both queries.EXLAIN PLAN FOR DISTINCT ...
isAggregated=[false] columns=[ COLUMN: PUBLIC.MEMORY.BUSINESS_KEY ] [range variable 1 join type=INNER table=MEMORY alias=M access=FULL SCAN condition = [ index=SYS_IDX_SYS_PK_10057_10058 other condition=[ OR arg_left=[ OR arg_left=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[ VALUE = case, TYPE = CHARACTER]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[ VALUE = status, TYPE = CHARACTER]]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[ VALUE = closed, TYPE = CHARACTER]]] ] ]] PARAMETERS=[] SUBQUERIES[] Object References PUBLIC.MEMORY PUBLIC.MEMORY.CONCEPT PUBLIC.MEMORY.ATTRIB PUBLIC.MEMORY.VALUE PUBLIC.MEMORY.BUSINESS_KEY Read Locks PUBLIC.MEMORY WriteLocks
EXLAIN PLAN FOR SELECT ... GROUP BY ...
isDistinctSelect=[false] isGrouped=[true] isAggregated=[false] columns=[ COLUMN: PUBLIC.MEMORY.BUSINESS_KEY ] [range variable 1 join type=INNER table=MEMORY alias=M access=FULL SCAN condition = [ index=SYS_IDX_SYS_PK_10057_10058 other condition=[ OR arg_left=[ OR arg_left=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[ VALUE = case, TYPE = CHARACTER]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[ VALUE = status, TYPE = CHARACTER]]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[ VALUE = closed, TYPE = CHARACTER]]] ] ]] groupColumns=[ COLUMN: PUBLIC.MEMORY.BUSINESS_KEY] PARAMETERS=[] SUBQUERIES[] Object References PUBLIC.MEMORY PUBLIC.MEMORY.CONCEPT PUBLIC.MEMORY.ATTRIB PUBLIC.MEMORY.VALUE PUBLIC.MEMORY.BUSINESS_KEY Read Locks PUBLIC.MEMORY WriteLocks
EDIT
I did additional tests. With 500 000 records in
HSQLDB
with all distinct business keys, the performance ofDISTINCT
is now better - 3 seconds, vsGROUP BY
which took around 9 seconds.In
MySQL
both queries preform the same:MySQL: 500 000 rows - 5 000 distinct business keys: Both queries: 0.5 second MySQL: 500 000 rows - all distinct business keys:
SELECT DISTINCT ...
- 11 secondsSELECT ... GROUP BY business_key
- 13 secondsSo the problem is only related to
HSQLDB
.I will be very grateful if someone can explain why there is such a drastic difference.
-
Martin Dimitrov over 12 yearsThanks for the reply. Are your guesses evident from the
EXPLAIN
output? Both look the same to me. -
Andomar over 12 yearsAs far as I can see, the plan doesn't specify how it will execute the join. I'm not even sure why it would execute a join. It probably takes an HSQLDB specialist to read the explain output.
-
fredt over 12 yearsAs the answer indicates, the second method uses more memory and may hit garbage collection (GC) too often. If you increase the JVM memory allocation, there shouldn't be a huge difference between the two query times.
-
Martin Dimitrov over 12 yearsI did additional test by entering all distinct keys in the table (see above). Do you thing the result proves your point? Thanks a lot.
-
Andomar over 12 yearsThere are to much variables in database optimization to really prove anything. However, it seems consistent in the sense that the first approach would be relatively faster when all keys are different.
-
Martin Dimitrov over 12 yearsFair enough. Thanks for the help.
-
Hossein over 10 yearsin 200,000 record group_by spends 500ms and distinct spends 60ms.
-
singhswat over 8 yearsCan a SME- expert please explain this in more details with examples... I've had this issue a many times but don't seem to get around it... I know the fix but I want to know how and WHY