Counting DISTINCT over multiple columns

sql sql-server performance tsql query-optimization

592,558

Solution 1

If you are trying to improve performance, you could try creating a persisted computed column on either a hash or concatenated value of the two columns.

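Once it is persisted, provided the column is deterministic and you are using "sane" database settings (the session SET options that SQL Server requires for indexes on computed columns, such as ANSI_NULLS and QUOTED_IDENTIFIER being ON), it can be indexed and / or statistics can be created on it.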

I believe a distinct count of the computed column would be equivalent to your query.
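A minimal sketch of the idea, using Python's sqlite3 module as a stand-in for SQL Server so it can be run anywhere. The column name PairKey, the index name IX_PairKey, the '|' separator and the sample rows are all assumptions made for illustration. The plain column filled by an UPDATE only imitates a persisted computed column; in SQL Server you would declare the column as a computed expression marked PERSISTED instead.

```python
import sqlite3

# In-memory database; SQLite stands in for SQL Server here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems ("
    " DocumentId INTEGER NOT NULL,"
    " DocumentSessionId INTEGER NOT NULL,"
    " PairKey TEXT)"  # hypothetical column imitating a persisted computed column
)

# Made-up sample rows: three distinct (DocumentId, DocumentSessionId) pairs.
rows = [(1, 10), (1, 10), (1, 11), (2, 10), (2, 10)]
conn.executemany(
    "INSERT INTO DocumentOutputItems (DocumentId, DocumentSessionId) VALUES (?, ?)",
    rows,
)

# Materialize the concatenated value once, with a separator so that
# different pairs cannot produce the same string.
conn.execute(
    "UPDATE DocumentOutputItems"
    " SET PairKey = CAST(DocumentId AS TEXT) || '|' || CAST(DocumentSessionId AS TEXT)"
)

# The stored value can now be indexed like any other column.
conn.execute("CREATE INDEX IX_PairKey ON DocumentOutputItems (PairKey)")

# Single query, no sub-query.
single_query_count = conn.execute(
    "SELECT COUNT(DISTINCT PairKey) FROM DocumentOutputItems"
).fetchone()[0]

# The original sub-query form, for comparison.
subquery_count = conn.execute(
    "SELECT COUNT(*) FROM ("
    " SELECT DISTINCT DocumentId, DocumentSessionId FROM DocumentOutputItems)"
).fetchone()[0]

print(single_query_count, subquery_count)
```

Both counts agree on this sample. If the stored value were a hash rather than a separated concatenation, the two counts could differ whenever two pairs collide.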

Solution 2

Edit: Altered from the less-than-reliable checksum-only query. I've discovered a way to do this (in SQL Server 2005) that works pretty well for me, and I can use as many columns as I need (by adding them to the CHECKSUM() function). The REVERSE() function turns the ints into varchars to make the distinct more reliable. Because CHECKSUM() returns an int hash, collisions remain possible and the count can come out too low.

SELECT COUNT(DISTINCT (CHECKSUM(DocumentId,DocumentSessionId)) + CHECKSUM(REVERSE(DocumentId),REVERSE(DocumentSessionId)) )
FROM DocumentOutputItems

Solution 3

What is it about your existing query that you don't like? If you are concerned that DISTINCT across two columns does not return just the unique permutations why not try it?

It certainly works as you might expect in Oracle.

SQL> select distinct deptno, job from emp
  2  order by deptno, job
  3  /

    DEPTNO JOB
---------- ---------
        10 CLERK
        10 MANAGER
        10 PRESIDENT
        20 ANALYST
        20 CLERK
        20 MANAGER
        30 CLERK
        30 MANAGER
        30 SALESMAN

9 rows selected.


SQL> select count(*) from (
  2  select distinct deptno, job from emp
  3  )
  4  /

  COUNT(*)
----------
         9

SQL>

edit

I went down a blind alley with analytics but the answer was depressingly obvious...

SQL> select count(distinct concat(deptno,job)) from emp
  2  /

COUNT(DISTINCTCONCAT(DEPTNO,JOB))
---------------------------------
                                9

SQL>

edit 2

Given the following data the concatenating solution provided above will miscount:

col1  col2
----  ----
A     AA
AA    A

So we need to include a separator...

select col1 + '*' + col2 from t23
/

Obviously the chosen separator must be a character, or set of characters, which can never appear in either column.
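To see the miscount and the fix side by side, here is a small sketch using Python's sqlite3 module (an assumption made so the example runs without a database server). SQLite concatenates with ||, whereas SQL Server would use + or CONCAT. The table t23 and its two rows are the ones shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t23 (col1 TEXT NOT NULL, col2 TEXT NOT NULL)")
conn.executemany("INSERT INTO t23 VALUES (?, ?)", [("A", "AA"), ("AA", "A")])

# Plain concatenation: both rows become 'AAA', so they look identical.
plain_count = conn.execute(
    "SELECT COUNT(DISTINCT col1 || col2) FROM t23"
).fetchone()[0]

# With a separator the rows become 'A*AA' and 'AA*A', which differ.
separated_count = conn.execute(
    "SELECT COUNT(DISTINCT col1 || '*' || col2) FROM t23"
).fetchone()[0]

print(plain_count, separated_count)  # prints: 1 2
```

The separator only helps as long as '*' never occurs in either column, which is exactly the caveat stated above.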

Solution 4

To run as a single query, concatenate the columns, then get the distinct count of instances of the concatenated string.

SELECT count(DISTINCT concat(DocumentId, DocumentSessionId)) FROM DocumentOutputItems;

In MySQL you can do the same thing without the concatenation step as follows:

SELECT count(DISTINCT DocumentId, DocumentSessionId) FROM DocumentOutputItems;

This feature is mentioned in the MySQL documentation:

http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_count-distinct

Solution 5

How about something like:

select count(*)
from
  (select count(*) cnt
   from DocumentOutputItems
   group by DocumentId, DocumentSessionId) t1

This probably does the same as what you are already doing, but it avoids the DISTINCT.
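A quick way to check that the GROUP BY form and the DISTINCT form agree is sketched below with Python's sqlite3 module. SQLite is only a stand-in for SQL Server here, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems ("
    " DocumentId INTEGER NOT NULL,"
    " DocumentSessionId INTEGER NOT NULL)"
)

# Made-up sample rows: four distinct pairs among six rows.
conn.executemany(
    "INSERT INTO DocumentOutputItems VALUES (?, ?)",
    [(1, 1), (1, 1), (1, 2), (2, 1), (2, 2), (2, 2)],
)

# GROUP BY form: one row per distinct pair, then count those rows.
group_by_count = conn.execute(
    "SELECT COUNT(*) FROM ("
    " SELECT COUNT(*) AS cnt"
    " FROM DocumentOutputItems"
    " GROUP BY DocumentId, DocumentSessionId) AS t1"
).fetchone()[0]

# DISTINCT form from the question.
distinct_count = conn.execute(
    "SELECT COUNT(*) FROM ("
    " SELECT DISTINCT DocumentId, DocumentSessionId"
    " FROM DocumentOutputItems) AS internalQuery"
).fetchone()[0]

print(group_by_count, distinct_count)  # prints: 4 4
```

Both forms still need the outer SELECT COUNT(*), so neither removes the sub-query.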

Novitzky

Updated on July 05, 2022

Comments

Novitzky almost 2 years
Is there a better way of doing a query like this:
```
SELECT COUNT(*) 
FROM (SELECT DISTINCT DocumentId, DocumentSessionId
      FROM DocumentOutputItems) AS internalQuery
```
I need to count the number of distinct items from this table but the distinct is over two columns.

My query works fine but I was wondering if I can get the final result using just one query (without using a sub-query)
- Novitzky over 14 years
  
  IordanTanev, Mark Brackett, RC - thanks for the replies, it was a nice try, but you need to check what you are doing before posting to SO. The queries you provided are not equivalent to my query. You can easily see I always have a scalar result but your query returns multiple rows.
- Jeff about 8 years
  
  Just updated the question to include your clarifying comment from one of the answers
- quetzalcoatl over 5 years
  
  FYI: community.oracle.com/ideas/18664
- Anupam almost 4 years
  
  This is a good question. I was wondering as well if there was a simpler way to do this
Dave Costa over 14 years

In order for this to give the final answer, you would have to wrap it in another SELECT COUNT(*) FROM ( ... ). Essentially this answer is just giving you another way to list the distinct values you want to count. It's no better than your original solution.
Novitzky over 14 years

Thanks Dave. I know you can use group by instead of distinct in my case. I was wondering if you can get the final result using just one query. I think it is impossible but I might be wrong.
KM. over 14 years

in SQL Server you get: Msg 102, Level 15, State 1, Line 1 Incorrect syntax near ','.
Novitzky over 14 years

This is what I was thinking of. I want to do a similar thing in MSSQL if possible.
KM. over 14 years

in my tests (using SET SHOWPLAN_ALL ON), it had the same execution plan and exact same TotalSubtreeCost
KM. over 14 years

@Kamil Nowicki, in SQL Server, you can only have one field in a COUNT(), in my answer I show that you can concatenate the two fields into one and try this approach. However, I'd just stick with the original since the query plans would end up the same.
Novitzky over 14 years

+1 from me. Thanks for your answer. My query works fine but I was wondering if I can get the final result using just one query (without using a subquery)
Novitzky over 14 years

+1 from me. Thanks but I will stick with my query as you suggested. Using "convert" can decrease performance even more.
Bernoulli IT over 11 years

+1 Nice one, works perfect (when you have the right column types to perform a CheckSum on... ;)
Custodio over 11 years

Please take a look at @JayTee's answer. It works like a charm. count(distinct CHECKSUM([Field1], [Field2]))
tumchaaditya over 10 years

Excellent suggestion! The more I read, the more I am realizing that SQL is less about knowing syntax and functions and more about applying pure logic.. I wish I had 2 upvotes!
Lukas Eder over 10 years

Depending on the complexity of the original query, solving this with GROUP BY may introduce a couple of additional challenges to the query transformation to achieve the desired output (e.g. when the original query already had GROUP BY or HAVING clauses...)
crokusek over 10 years

With hashes like Checksum(), there is a small chance that the same hash will be returned for different inputs, so the count may be very slightly off. HashBytes() has an even smaller chance, but still not zero. If those two Ids were ints (32b), then a "lossless hash" could combine them into a bigint (64b) like (Id1 << 32) + Id2.
pvolders almost 10 years

the chance is not so small even, especially when you start combining columns (which is what it was supposed to be meant for). I was curious about this approach and in a particular case the checksum ended up with a count 10% smaller. If you think of it a bit longer, Checksum just returns an int, so if you'd checksum a full bigint range you'll end up with a distinct count about 2 billion times smaller than there actually is. -1
Anthony Geoghegan almost 10 years

The above query will return a different set of results than what the OP was looking for (the distinct combinations of DocumentId and DocumentSessionId). Alexander Kjäll already posted the correct answer if the OP was using MySQL and not MS SQL Server.
JayTee over 9 years

Updated the query to include the use of "REVERSE" to remove the chance of duplicates
The Red Pea over 8 years

Could we avoid CHECKSUM -- could we just concatenate the two values together? I suppose that risks considering as the same thing: ('he', 'art') == ('hear', 't'). But I think that can be solved with a delimiter as @APC proposes (some value that doesn't appear in either column), so 'he|art' != 'hear|t'. Are there other problems with a simple "concatenation" approach?
JayTee about 8 years

I think concatenation can work - the db still has to determine uniqueness
Avrajit Roy about 8 years

Very good suggestion. It saved me from writing unnecessary code for this.
sstan almost 8 years

This was a SQL Server question, and both options you posted have already been mentioned in the following answers to this question: stackoverflow.com/a/1471444/4955425 and stackoverflow.com/a/1471713/4955425.
ijoseph almost 6 years

FWIW, this almost works in PostgreSQL; just need extra parentheses: SELECT COUNT(DISTINCT (DocumentId, DocumentSessionId)) FROM DocumentOutputItems;
Vytenis Bivainis over 5 years

what databases support select count(distinct(a, b))? :D
karmakaze over 5 years

@VytenisBivainis I know PostgreSQL does--not sure since which version.
naviram over 5 years

This doesn't do what the question requires; it counts the distinct values separately for each column.
Anwar Shaikh over 4 years

It does not give you the count of distinct values in conjunction of two columns. At least not in MySQL 5.8.
Tab Alleman over 4 years

This question is tagged SQL Server, and this isn't SQL Server syntax
Aaron West over 4 years

Do not use CHECKSUM for this. It's simply left-shift-by-4 then xor, which is the worst hash I have seen (CRC-32 would be far superior; even CRC-16 might be better!) It is very easy to construct colliding values; for example, the following are both 0: select checksum('1234123412341234') select checksum('abcdabcdabcdabcd') The only valid use I can imagine for CHECKSUM is the recommended one; when you want a small index over large text values, and you add a comparison of those text values to the where clause, to eliminate collisions. HASHBytes may be adequate, though possibly slow.
jayqui about 4 years

Would you please add an example or code sample to show more about what this means and how to do it?
Bort over 3 years

Be very careful with this method as it could lead to incorrect counts. The following example will return a count of 1.

DocumentID | DocumentSessionID
"A"        | "AB"
"AA"       | "B"
Sreram over 3 years

How is it different from creating a multi-column index on those columns? I'm sorry if this makes no sense. I'm new to SQL.
Tomty over 3 years

Even in MySQL, this isn't entirely equivalent to the original query, because rows with NULLs won't be counted.
Tomty over 3 years

As @Bort notes, the first option can lead to incorrect results, and would be better written using CONCAT_WS. The 2nd method also isn't guaranteed to produce the same results as the original query, in case any of the columns are nullable.
devloper152 over 2 years

What does 1 mean in count(1)?
Kota Mori over 2 years

How does this trick care about hash collisions? I think the distinct counts on hash values would be smaller than the truth due to the collisions.
Sergiy over 2 years

@VytenisBivainis MySQL supports that as well
StriplingWarrior about 2 years

This question is not about Oracle. It's about SQL Server.
StriplingWarrior about 2 years

@devloper152: It has no special meaning. For some reason count() always has to have an argument, so depending on people's taste they'll typically use count(*), count(1), or count(null).
StriplingWarrior about 2 years

To be clear, || is a concatenation operator in some databases. This question is about SQL Server, where + would be the equivalent. Just like all the other answers on this question recommending concatenation, this suffers from the problem that combinations of different values ('a', 'bc' vs 'ab', 'c') can concatenate to the same value ('abc'), giving you an incorrect count.
karmakaze about 2 years

@Sergiy thanks, updated answer with link to non-standard syntax supported by MySQL
surj about 2 years

Love these kinds of creative solutions, thanks for sharing.
AdamO about 2 years

@AnwarShaikh I don't understand your comment. Do you mean to say it does not give you the count of distinct rows in the two columns "DocumentID" and "DocumentSessionID"?
Erick de Vathaire almost 2 years

I used it with a table that has only 169 rows and it was wrong: the values "2, 550" and "3, 550" produce the same result. Don't use CHECKSUM. Using a loop to generate the values 1 to 10 in 2 columns with a cross join, it should give 100 unique values, but it shows 86.