Counting DISTINCT over multiple columns

Solution 1

If you are trying to improve performance, you could try creating a persisted computed column on either a hash or concatenated value of the two columns.

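Once it is persisted, provided the column is deterministic and you are using "sane" database settings (SQL Server requires certain SET options, such as ANSI_NULLS and QUOTED_IDENTIFIER, to be ON for indexes on computed columns), it can be indexed and/or statistics can be created on it.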

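I believe a distinct count of the computed column would be equivalent to your query, as long as the hash never collides (or, for a concatenated value, a separator keeps different pairs from producing the same string).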

Solution 2

Edit: Altered from the less-than-reliable checksum-only query. I've discovered a way to do this (in SQL Server 2005) that works pretty well for me, and I can use as many columns as I need (by adding them to the CHECKSUM() function). The REVERSE() function turns the ints into varchars to make the distinct more reliable.

SELECT COUNT(DISTINCT (CHECKSUM(DocumentId, DocumentSessionId)) + CHECKSUM(REVERSE(DocumentId), REVERSE(DocumentSessionId)))
FROM DocumentOutputItems
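
Because CHECKSUM() returns an int, different pairs can still produce the same value, so the count can come out too low. When both ids fit in 32 bits, one collision-free alternative is to pack them into a single 64-bit integer. The following is only a minimal sketch of that idea, not this answer's own method: it runs on SQLite through Python's sqlite3 module purely so that it can be executed anywhere, the sample rows are made up, and the equivalent arithmetic in SQL Server would have to be written differently.

```python
import sqlite3

# Hypothetical sample data: four rows, three distinct (DocumentId, DocumentSessionId) pairs.
rows = [(1, 10), (1, 10), (1, 11), (2, 10)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DocumentOutputItems (DocumentId INTEGER, DocumentSessionId INTEGER)")
conn.executemany("INSERT INTO DocumentOutputItems VALUES (?, ?)", rows)

# Reference answer: the subquery form from the question.
expected = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT DocumentId, DocumentSessionId FROM DocumentOutputItems)"
).fetchone()[0]

# Pack both 32-bit ids into one 64-bit key:
# high half = DocumentId, low half = DocumentSessionId.
packed = conn.execute(
    "SELECT COUNT(DISTINCT (DocumentId << 32) | (DocumentSessionId & 4294967295)) "
    "FROM DocumentOutputItems"
).fetchone()[0]

print(expected, packed)  # prints: 3 3
```

Unlike a hash, the packed key maps every pair of 32-bit values to a different 64-bit value, so the distinct count cannot be reduced by collisions.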

Solution 3

What is it about your existing query that you don't like? If you are concerned that DISTINCT across two columns does not return just the unique combinations, why not try it?

It certainly works as you might expect in Oracle.

SQL> select distinct deptno, job from emp
  2  order by deptno, job
  3  /

    DEPTNO JOB
---------- ---------
        10 CLERK
        10 MANAGER
        10 PRESIDENT
        20 ANALYST
        20 CLERK
        20 MANAGER
        30 CLERK
        30 MANAGER
        30 SALESMAN

9 rows selected.


SQL> select count(*) from (
  2  select distinct deptno, job from emp
  3  )
  4  /

  COUNT(*)
----------
         9

SQL>

edit

I went down a blind alley with analytics but the answer was depressingly obvious...

SQL> select count(distinct concat(deptno,job)) from emp
  2  /

COUNT(DISTINCTCONCAT(DEPTNO,JOB))
---------------------------------
                                9

SQL>

edit 2

Given the following data the concatenating solution provided above will miscount:

col1  col2
----  ----
A     AA
AA    A

So we need to include a separator...

select col1 + '*' + col2 from t23
/

Obviously the chosen separator must be a character, or set of characters, which can never appear in either column.
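
The miscount and the separator fix can be reproduced in a few lines. This is a sketch that uses SQLite through Python's sqlite3 module as a stand-in (so the concatenation operator is || rather than +), with the table t23 and the two rows from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t23 (col1 TEXT, col2 TEXT)")
# The two rows from the example: different pairs whose plain concatenation is identical.
conn.executemany("INSERT INTO t23 VALUES (?, ?)", [("A", "AA"), ("AA", "A")])


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# 'A' || 'AA' and 'AA' || 'A' both give 'AAA', so the two rows collapse into one.
plain = scalar("SELECT COUNT(DISTINCT col1 || col2) FROM t23")

# With a separator that appears in neither column: 'A*AA' vs 'AA*A'.
separated = scalar("SELECT COUNT(DISTINCT col1 || '*' || col2) FROM t23")

print(plain, separated)  # prints: 1 2
```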

Solution 4

To run as a single query, concatenate the columns, then get the distinct count of instances of the concatenated string.

SELECT count(DISTINCT concat(DocumentId, DocumentSessionId)) FROM DocumentOutputItems;

In MySQL you can do the same thing without the concatenation step as follows:

SELECT count(DISTINCT DocumentId, DocumentSessionId) FROM DocumentOutputItems;

This feature is mentioned in the MySQL documentation:

http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_count-distinct

Solution 5

How about something like:

select count(*)
from
  (select count(*) cnt
   from DocumentOutputItems
   group by DocumentId, DocumentSessionId) t1

This probably does the same thing as your existing query, but it avoids the DISTINCT.
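
Here is a quick check that the GROUP BY form returns the same scalar as the DISTINCT subquery from the question. It is a sketch on SQLite via Python's sqlite3 module with made-up rows, so it says nothing about SQL Server's query plans. It also includes a pair containing NULL to show where a concatenation-based count can differ; NULL handling varies between databases, so treat that part as SQLite behaviour only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DocumentOutputItems (DocumentId INTEGER, DocumentSessionId INTEGER)")
# Hypothetical rows: three distinct non-NULL pairs (one repeated) plus one pair with a NULL.
conn.executemany(
    "INSERT INTO DocumentOutputItems VALUES (?, ?)",
    [(1, 10), (1, 10), (1, 11), (2, 10), (3, None), (3, None)],
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


distinct_count = scalar(
    "SELECT COUNT(*) FROM (SELECT DISTINCT DocumentId, DocumentSessionId FROM DocumentOutputItems)"
)
group_count = scalar(
    "SELECT COUNT(*) FROM (SELECT COUNT(*) AS cnt FROM DocumentOutputItems "
    "GROUP BY DocumentId, DocumentSessionId) AS t1"
)
# In SQLite, || yields NULL when either operand is NULL, and COUNT(DISTINCT ...) skips NULLs.
concat_count = scalar(
    "SELECT COUNT(DISTINCT DocumentId || '|' || DocumentSessionId) FROM DocumentOutputItems"
)

print(distinct_count, group_count, concat_count)  # prints: 4 4 3
```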

Author: Novitzky

Updated on July 05, 2022

Comments

  • Novitzky
    Novitzky almost 2 years

    Is there a better way of doing a query like this:

    SELECT COUNT(*) 
    FROM (SELECT DISTINCT DocumentId, DocumentSessionId
          FROM DocumentOutputItems) AS internalQuery
    

    I need to count the number of distinct items from this table but the distinct is over two columns.

    My query works fine but I was wondering if I can get the final result using just one query (without using a sub-query)

    • Novitzky
      Novitzky over 14 years
      IordanTanev, Mark Brackett, RC - thanks for the replies, it was a nice try, but you need to check what you are doing before posting to SO. The queries you provided are not equivalent to my query. You can easily see that I always get a scalar result, but your queries return multiple rows.
    • Jeff
      Jeff about 8 years
      Just updated the question to include your clarifying comment from one of the answers
    • Anupam
      Anupam almost 4 years
      This is a good question. I was wondering as well if there was a simpler way to do this
  • Dave Costa
    Dave Costa over 14 years
    In order for this to give the final answer, you would have to wrap it in another SELECT COUNT(*) FROM ( ... ). Essentially this answer is just giving you another way to list the distinct values you want to count. It's no better than your original solution.
  • Novitzky
    Novitzky over 14 years
    Thanks Dave. I know you can use group by instead of distinct in my case. I was wondering if you can get the final result using just one query. I think it is impossible but I might be wrong.
  • KM.
    KM. over 14 years
    in SQL Server you get: Msg 102, Level 15, State 1, Line 1 Incorrect syntax near ','.
  • Novitzky
    Novitzky over 14 years
    This is what I was thinking of. I want to do a similar thing in MSSQL if possible.
  • KM.
    KM. over 14 years
    in my tests (using SET SHOWPLAN_ALL ON), it had the same execution plan and exact same TotalSubtreeCost
  • KM.
    KM. over 14 years
    @Kamil Nowicki, in SQL Server, you can only have one field in a COUNT(), in my answer I show that you can concatenate the two fields into one and try this approach. However, I'd just stick with the original since the query plans would end up the same.
  • Novitzky
    Novitzky over 14 years
    +1 from me. Thanks for your answer. My query works fine but I was wondering if I can get the final result using just one query (without using a subquery)
  • Novitzky
    Novitzky over 14 years
    +1 from me. Thanks but I will stick with my query as you suggested. Using "convert" can decrease performance even more.
  • Bernoulli IT
    Bernoulli IT over 11 years
    +1 Nice one, works perfect (when you have the right column types to perform a CheckSum on... ;)
  • Custodio
    Custodio over 11 years
    Please take a look at @JayTee's answer. It works like a charm: COUNT(DISTINCT CHECKSUM([Field1], [Field2]))
  • tumchaaditya
    tumchaaditya over 10 years
    Excellent suggestion! The more I read, the more I am realizing that SQL is less about knowing syntax and functions and more about applying pure logic.. I wish I had 2 upvotes!
  • Lukas Eder
    Lukas Eder over 10 years
    Depending on the complexity of the original query, solving this with GROUP BY may introduce a couple of additional challenges to the query transformation to achieve the desired output (e.g. when the original query already had GROUP BY or HAVING clauses...)
  • crokusek
    crokusek over 10 years
    With hashes like Checksum(), there is a small chance that the same hash will be returned for different inputs, so the count may be very slightly off. HashBytes() has an even smaller chance, but still not zero. If those two Ids were ints (32b), then a "lossless hash" could combine them into a bigint (64b) like (Id1 << 32) + Id2.
  • pvolders
    pvolders almost 10 years
    the chance is not so small even, especially when you start combining columns (which is what it was supposed to be meant for). I was curious about this approach and in a particular case the checksum ended up with a count 10% smaller. If you think of it a bit longer, Checksum just returns an int, so if you'd checksum a full bigint range you'll end up with a distinct count about 2 billion times smaller than there actually is. -1
  • Anthony Geoghegan
    Anthony Geoghegan almost 10 years
    The above query will return a different set of results than what the OP was looking for (the distinct combinations of DocumentId and DocumentSessionId). Alexander Kjäll already posted the correct answer if the OP was using MySQL and not MS SQL Server.
  • JayTee
    JayTee over 9 years
    Updated the query to include the use of "REVERSE" to remove the chance of duplicates
  • The Red Pea
    The Red Pea over 8 years
    Could we avoid CHECKSUM -- could we just concatenate the two values together? I suppose that risks treating these as the same thing: ('he', 'art') == ('hear', 't'). But I think that can be solved with a delimiter as @APC proposes (some value that doesn't appear in either column), so 'he|art' != 'hear|t'. Are there other problems with a simple "concatenation" approach?
  • JayTee
    JayTee about 8 years
    I think concatenation can work - the db still has to determine uniqueness
  • Avrajit Roy
    Avrajit Roy about 8 years
    Very good suggestion. It saved me from writing unnecessary code for this.
  • sstan
    sstan almost 8 years
    This was a SQL Server question, and both options you posted have already been mentioned in the following answers to this question: stackoverflow.com/a/1471444/4955425 and stackoverflow.com/a/1471713/4955425.
  • ijoseph
    ijoseph almost 6 years
    FWIW, this almost works in PostgreSQL; just need extra parentheses: SELECT COUNT(DISTINCT (DocumentId, DocumentSessionId)) FROM DocumentOutputItems;
  • Vytenis Bivainis
    Vytenis Bivainis over 5 years
    what databases support select count(distinct(a, b))? :D
  • karmakaze
    karmakaze over 5 years
    @VytenisBivainis I know PostgreSQL does--not sure since which version.
  • naviram
    naviram over 5 years
    This doesn't do what the question requires; it counts the distinct values separately for each column.
  • Anwar Shaikh
    Anwar Shaikh over 4 years
    It does not give you the count of distinct values in conjunction of two columns. At least not in MySQL 5.8.
  • Tab Alleman
    Tab Alleman over 4 years
    This question is tagged SQL Server, and this isn't SQL Server syntax
  • Aaron West
    Aaron West over 4 years
    Do not use CHECKSUM for this. It's simply left-shift-by-4 then xor, which is the worst hash I have seen (CRC-32 would be far superior; even CRC-16 might be better!) It is very easy to construct colliding values; for example, the following are both 0: select checksum('1234123412341234') select checksum('abcdabcdabcdabcd') The only valid use I can imagine for CHECKSUM is the recommended one; when you want a small index over large text values, and you add a comparison of those text values to the where clause, to eliminate collisions. HASHBytes may be adequate, though possibly slow.
  • jayqui
    jayqui about 4 years
    Would you please add an example or code sample to show more about what this means and how to do it?
  • Bort
    Bort over 3 years
    Be very careful with this method as it could lead to incorrect counts. The following example will return a count of 1.

    DocumentID | DocumentSessionID
    "A"        | "AB"
    "AA"       | "B"
  • Sreram
    Sreram over 3 years
    How is it different from creating a multi-column index on those columns? I'm sorry if this makes no sense. I'm new to SQL.
  • Tomty
    Tomty over 3 years
    Even in MySQL, this isn't entirely equivalent to the original query, because rows with NULLs won't be counted.
  • Tomty
    Tomty over 3 years
    As @Bort notes, the first option can lead to incorrect results, and would be better written using CONCAT_WS. The 2nd method also isn't guaranteed to produce the same results as the original query, in case any of the columns are nullable.
  • devloper152
    devloper152 over 2 years
    What does 1 mean in count(1)?
  • Kota Mori
    Kota Mori over 2 years
    How does this trick care about hash collisions? I think the distinct counts on hash values would be smaller than the truth due to the collisions.
  • Sergiy
    Sergiy over 2 years
    @VytenisBivainis MySQL supports that as well
  • StriplingWarrior
    StriplingWarrior about 2 years
    This question is not about Oracle. It's about SQL Server.
  • StriplingWarrior
    StriplingWarrior about 2 years
    @devloper152: It has no special meaning. For some reason count() always has to have an argument, so depending on people's taste they'll typically use count(*), count(1), or count(null).
  • StriplingWarrior
    StriplingWarrior about 2 years
    To be clear, || is a concatenation operator in some databases. This question is about SQL Server, where + would be the equivalent. Just like all the other answers on this question recommending concatenation, this suffers from the problem that combinations of different values ('a', 'bc' vs 'ab', 'c') can concatenate to the same value ('abc'), giving you an incorrect count.
  • karmakaze
    karmakaze about 2 years
    @Sergiy thanks, updated answer with link to non-standard syntax supported by MySQL
  • surj
    surj about 2 years
    Love these kinds of creative solutions, thanks for sharing.
  • AdamO
    AdamO about 2 years
    @AnwarShaikh I don't understand your comment. Do you mean to say it does not give you the count of distinct rows in the two columns "DocumentID" and "DocumentSessionID"?
  • Erick de Vathaire
    Erick de Vathaire almost 2 years
    I used it with a table that has only 169 rows and it was wrong: the values "2, 550" and "3, 550" give the same result. Don't use CHECKSUM. Using a loop to put the values 1 to 10 in 2 columns with a cross join, it should have 100 unique values but it shows 86.