What's faster, SELECT DISTINCT or GROUP BY in MySQL?

856

Solution 1

They are essentially equivalent to each other (in fact this is how some databases implement DISTINCT under the hood).

If one of them is faster, it's going to be DISTINCT. This is because, although the two are the same, a query optimizer would have to catch the fact that your GROUP BY is not taking advantage of any group members, just their keys. DISTINCT makes this explicit, so you can get away with a slightly dumber optimizer.

When in doubt, test!

Solution 2

If you have an index on profession, these two are synonyms.

If you don't, then use DISTINCT.

GROUP BY in MySQL sorts results. You can even do:

SELECT u.profession FROM users u GROUP BY u.profession DESC

and get your professions sorted in DESC order.

DISTINCT creates a temporary table and uses it for storing duplicates. GROUP BY does the same, but sortes the distinct results afterwards.

So

SELECT DISTINCT u.profession FROM users u

is faster, if you don't have an index on profession.

Solution 3

All of the answers above are correct, for the case of DISTINCT on a single column vs GROUP BY on a single column. Every db engine has its own implementation and optimizations, and if you care about the very little difference (in most cases) then you have to test against specific server AND specific version! As implementations may change...

BUT, if you select more than one column in the query, then the DISTINCT is essentially different! Because in this case it will compare ALL columns of all rows, instead of just one column.

So if you have something like:

// This will NOT return unique by [id], but unique by (id,name)
SELECT DISTINCT id, name FROM some_query_with_joins

// This will select unique by [id].
SELECT id, name FROM some_query_with_joins GROUP BY id

It is a common mistake to think that DISTINCT keyword distinguishes rows by the first column you specified, but the DISTINCT is a general keyword in this manner.

So people you have to be careful not to take the answers above as correct for all cases... You might get confused and get the wrong results while all you wanted was to optimize!

Solution 4

Go for the simplest and shortest if you can -- DISTINCT seems to be more what you are looking for only because it will give you EXACTLY the answer you need and only that!

Solution 5

well distinct can be slower than group by on some occasions in postgres (dont know about other dbs).

tested example:

postgres=# select count(*) from (select distinct i from g) a;

count 

10001
(1 row)

Time: 1563,109 ms

postgres=# select count(*) from (select i from g group by i) a;

count
10001
(1 row)

Time: 594,481 ms

http://www.pgsql.cz/index.php/PostgreSQL_SQL_Tricks_I

so be careful ... :)

Share:
856

Related videos on Youtube

Nancy Goyal
Author by

Nancy Goyal

Updated on October 10, 2021

Comments

  • Nancy Goyal
    Nancy Goyal over 2 years

    I am still new to regular expressions, as in the Python library re.

    I want to extract all the proper nouns as a whole word if they are separated by space.

    I tried

    result = re.findall(r'(\w+)\w*/NNP (\w+)\w*/NNP', tagged_sent_str)
    

    Input: I have a string like

    tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB" 
    

    Output expected:

    [('European Community'), ('European')]
    

    Current output:

    [('European','Community')]
    

    But this will only give the pairs not the single ones. I want all the kinds

    • Strawberry
      Strawberry almost 10 years
      You could test for yourself as quickly as ask the question. Irritatingly, it is almost impossible to construct a scenario in which DISTINCT outperforms GROUP BY - which is annoying because clearly this is not the purpose of GROUP BY. However, GROUP BY can produce misleading results, which I think is reason enough for avoiding it.
    • Aung Myo Linn
      Aung Myo Linn almost 8 years
      There's another duplicate with a different answer. see MySql - Distinct vs Group By <<< it says GROUP BY is better
    • Aung Myo Linn
      Aung Myo Linn almost 8 years
      Please see here if you want to measure the time difference between DISTINCT and GROUP BY running your query.
    • Chris
      Chris over 4 years
      What should happen if there are three (or more) consecutive NNPs?
    • Nancy Goyal
      Nancy Goyal over 4 years
      It should give all the consecutive NNPs together
  • SquareCog
    SquareCog about 15 years
    They are the same in terms of what they get, not in terms of how they get it. An ideal optimizer would execute them the same way, but MySQL optimizer is not ideal. Based on your evidence, it would seem that DISTINCT would go faster -- O(n) vs O(n*log n).
  • vava
    vava about 15 years
    So, "using filesort" is essentially bad thing?
  • SquareCog
    SquareCog about 15 years
    In this case it is, because you don't need to sort (you would if you needed the groups). MySQL sorts in order to place the same entries together, and then get groups by scanning the sorted file. You just need distincts, so you just have to hash your keys while doing a single table scan.
  • Quassnoi
    Quassnoi about 15 years
    DISTINCT will be faster only if you DON'T have an index (as it doesn't sort). When you do have an index and it's used, they're synonyms.
  • a_horse_with_no_name
    a_horse_with_no_name over 10 years
    Although this question is about MySQL it should be noted that the second query will work only in MySQL. Nearly every other DBMS will reject the second statement because it's an invalid use of the GROUP BY operator.
  • daniel.gindi
    daniel.gindi over 10 years
    Well, "nearly" is a problematic definition :-) It would be much more helpful if you state a specific DBMS that you have tested to see that it generates an error for this statement.
  • a_horse_with_no_name
    a_horse_with_no_name over 10 years
    Postgres, Oracle, Firebird, DB2, SQL Server for starters. MySQL :sqlfiddle.com/#!2/6897c/1 Postgres: sqlfiddle.com/#!12/6897c/1 Oracle: sqlfiddle.com/#!12/6897c/1 SQL Server: sqlfiddle.com/#!6/6897c/1
  • Ariel
    Ariel over 9 years
    You can add ORDER BY NULL to the GROUP BY to avoid the sort.
  • Ariel
    Ariel over 9 years
    Add ORDER BY NULL to the GROUP BY version and they will be the same.
  • rustyx
    rustyx over 9 years
    The definition of DISTINCT and GROUP BY differ in that DISTINCT doesn't have to sort the output, and GROUP BY by default does. However, in MySQL even a DISTINCT+ORDER BY might still be faster than a GROUP BY due to the extra hints for the optimizer as explained by SquareCog.
  • Pankaj Wanjari
    Pankaj Wanjari over 8 years
    DISTINCT is much faster with large amount data.
  • Lizardx
    Lizardx about 8 years
    I tested this, and found that on an indexed column, mysql, group by was about 6x slower than distinct with a fairly complicated query. Just adding this as a datapoint. About 100k rows. So test it and see for yourselves.
  • Aung Myo Linn
    Aung Myo Linn almost 8 years
    see MySql - Distinct vs Group By <<< it says GROUP BY is better
  • Thanh Trung
    Thanh Trung almost 5 years
    Still slower even with grouping by null
  • Quassnoi
    Quassnoi almost 5 years
    @ThanhTrung: what is slower than what?
  • Thanh Trung
    Thanh Trung almost 5 years
    @Quassnoi groupby slower than distinct even if avoiding sort
  • Nancy Goyal
    Nancy Goyal over 4 years
    Thank you Chris for your answer. This works well as required but I was wondering if this can be done with regular expression
  • Wiktor Stribiżew
    Wiktor Stribiżew over 4 years
    @NancyGoyal You can't use a single regex method call because the texts you want to appear inside a single string result are not adjoining in the input.
  • Admin
    Admin over 4 years
    is equal to SELECT profession FROM users GROUP BY profession
  • Matthew Lenz
    Matthew Lenz about 4 years
    Note: Order qualifiers on GROUP BY were deprecated in MySQL 8.
  • Bernardo Loureiro
    Bernardo Loureiro almost 4 years
    GROUP BY is also faster than DISTINCT in AWS Redshift, because GROUP BY uses a XN HashAggregate and DISTINCT uses a XN Unique. It is the same problem that older versions of Postgres have.
  • Marinos An
    Marinos An about 3 years
    And to confuse us :) , mysql allows using select distinct(a), b which means select distinct a, b, which means distinct on the pair.