Counting DISTINCT over multiple columns
Solution 1
If you are trying to improve performance, you could try creating a persisted computed column on either a hash or concatenated value of the two columns.
Once it is persisted, provided the column is deterministic and you are using "sane" database settings, it can be indexed and / or statistics can be created on it.
I believe a distinct count of the computed column would be equivalent to your query.
Solution 2
Edit: Altered from the less-than-reliable checksum-only query. I've discovered a way to do this (in SQL Server 2005) that works pretty well for me, and I can use as many columns as I need (by adding them to the CHECKSUM() function). The REVERSE() function turns the ints into varchars to make the distinct more reliable.
SELECT COUNT(DISTINCT (CHECKSUM(DocumentId,DocumentSessionId)) + CHECKSUM(REVERSE(DocumentId),REVERSE(DocumentSessionId)) )
FROM DocumentOutputItems
Solution 3
What is it about your existing query that you don't like? If you are concerned that DISTINCT across two columns does not return just the unique combinations, why not try it?
It certainly works as you might expect in Oracle.
SQL> select distinct deptno, job from emp
2 order by deptno, job
3 /
DEPTNO JOB
---------- ---------
10 CLERK
10 MANAGER
10 PRESIDENT
20 ANALYST
20 CLERK
20 MANAGER
30 CLERK
30 MANAGER
30 SALESMAN
9 rows selected.
SQL> select count(*) from (
2 select distinct deptno, job from emp
3 )
4 /
COUNT(*)
----------
9
SQL>
edit
I went down a blind alley with analytics but the answer was depressingly obvious...
SQL> select count(distinct concat(deptno,job)) from emp
2 /
COUNT(DISTINCTCONCAT(DEPTNO,JOB))
---------------------------------
9
SQL>
edit 2
Given the following data the concatenating solution provided above will miscount:
col1 col2
---- ----
A AA
AA A
So we need to include a separator...
select col1 + '*' + col2 from t23
/
Obviously the chosen separator must be a character, or set of characters, which can never appear in either column.
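The miscount and the separator fix can be reproduced with a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for Oracle or SQL Server, so concatenation is written || instead of + or CONCAT; the table name t23 and the two rows are taken from the example above.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t23 (col1 TEXT, col2 TEXT)")
conn.executemany("INSERT INTO t23 VALUES (?, ?)", [("A", "AA"), ("AA", "A")])


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Plain concatenation: both rows become 'AAA', so they collapse into one.
plain = scalar("SELECT COUNT(DISTINCT col1 || col2) FROM t23")

# With a separator: 'A*AA' and 'AA*A' stay different.
separated = scalar("SELECT COUNT(DISTINCT col1 || '*' || col2) FROM t23")

# Reference result from the subquery form used in the question.
reference = scalar("SELECT COUNT(*) FROM (SELECT DISTINCT col1, col2 FROM t23)")

print(plain, separated, reference)  # prints: 1 2 2
```

The separator only helps if '*' can never occur in either column; otherwise the same kind of collision comes back.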
Solution 4
To run as a single query, concatenate the columns, then get the distinct count of instances of the concatenated string.
SELECT count(DISTINCT concat(DocumentId, DocumentSessionId)) FROM DocumentOutputItems;
In MySQL you can do the same thing without the concatenation step as follows:
SELECT count(DISTINCT DocumentId, DocumentSessionId) FROM DocumentOutputItems;
This feature is mentioned in the MySQL documentation:
http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_count-distinct
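As noted in the comments, these single-query forms may not match the original subquery when a column is nullable. Here is a small sketch of that difference, assuming SQLite via Python's sqlite3 as a stand-in (its || operator yields NULL when either operand is NULL) and made-up sample rows; other engines may treat NULLs differently, so check yours.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems (DocumentId INTEGER, DocumentSessionId INTEGER)"
)
# Hypothetical sample data: one duplicate pair and one row with a NULL.
rows = [(1, 10), (1, 10), (2, 20), (3, None)]
conn.executemany("INSERT INTO DocumentOutputItems VALUES (?, ?)", rows)

# Original form: the pair (3, NULL) counts as one distinct combination.
subquery_count = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT DISTINCT DocumentId, DocumentSessionId FROM DocumentOutputItems)"
).fetchone()[0]

# Concatenated form: the NULL row concatenates to NULL, which
# COUNT(DISTINCT ...) skips.
concat_count = conn.execute(
    "SELECT COUNT(DISTINCT DocumentId || '|' || DocumentSessionId) "
    "FROM DocumentOutputItems"
).fetchone()[0]

print(subquery_count, concat_count)  # prints: 3 2
```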
Solution 5
How about something like:
select count(*) from (select count(*) cnt from DocumentOutputItems group by DocumentId, DocumentSessionId) t1
It probably does the same thing as your existing query, but it avoids the DISTINCT.
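A quick way to check that the GROUP BY form agrees with the DISTINCT subquery is to run both on the same data. This is only a sketch, assuming SQLite via Python's sqlite3 as a stand-in for SQL Server and made-up sample rows (including a repeated pair with a NULL).

```python
import sqlite3

# In-memory SQLite database used only as a stand-in engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems (DocumentId INTEGER, DocumentSessionId INTEGER)"
)
# Hypothetical sample data with duplicates and a repeated NULL pair.
rows = [(1, 10), (1, 10), (1, 11), (2, 10), (3, None), (3, None)]
conn.executemany("INSERT INTO DocumentOutputItems VALUES (?, ?)", rows)

# GROUP BY form from this answer: one inner row per distinct pair.
group_by_count = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT COUNT(*) cnt FROM DocumentOutputItems "
    "GROUP BY DocumentId, DocumentSessionId) t1"
).fetchone()[0]

# DISTINCT subquery form from the question.
distinct_count = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT DISTINCT DocumentId, DocumentSessionId FROM DocumentOutputItems) t2"
).fetchone()[0]

print(group_by_count, distinct_count)  # prints: 4 4
```

Both forms still need the outer SELECT COUNT(*), so this does not remove the subquery; it only swaps DISTINCT for GROUP BY.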
Novitzky
Updated on July 05, 2022
Comments
-
Novitzky almost 2 years
Is there a better way of doing a query like this:
SELECT COUNT(*) FROM (SELECT DISTINCT DocumentId, DocumentSessionId FROM DocumentOutputItems) AS internalQuery
I need to count the number of distinct items from this table but the distinct is over two columns.
My query works fine but I was wondering if I can get the final result using just one query (without using a sub-query)
-
Novitzky over 14 years
IordanTanev, Mark Brackett, RC - thanks for the replies. It was a nice try, but you need to check what you are doing before posting to SO. The queries you provided are not equivalent to my query. You can easily see that I always get a scalar result, but your queries return multiple rows.
-
Jeff about 8 years
Just updated the question to include your clarifying comment from one of the answers
-
quetzalcoatl over 5 years
-
Anupam almost 4 years
This is a good question. I was wondering as well if there was a simpler way to do this
-
-
Dave Costa over 14 years
In order for this to give the final answer, you would have to wrap it in another SELECT COUNT(*) FROM ( ... ). Essentially this answer is just giving you another way to list the distinct values you want to count. It's no better than your original solution.
-
Novitzky over 14 years
Thanks Dave. I know you can use group by instead of distinct in my case. I was wondering if you can get the final result using just one query. I think it is impossible but I might be wrong.
-
KM. over 14 years
In SQL Server you get: Msg 102, Level 15, State 1, Line 1 Incorrect syntax near ','.
-
Novitzky over 14 years
This is what I was thinking of. I want to do a similar thing in MSSQL if possible.
-
KM. over 14 years
In my tests (using SET SHOWPLAN_ALL ON), it had the same execution plan and exact same TotalSubtreeCost
-
KM. over 14 years
@Kamil Nowicki, in SQL Server you can only have one field in a COUNT(). In my answer I show that you can concatenate the two fields into one and try this approach. However, I'd just stick with the original since the query plans would end up the same.
-
Novitzky over 14 years
+1 from me. Thanks for your answer. My query works fine but I was wondering if I can get the final result using just one query (without using a subquery)
-
Novitzky over 14 years
+1 from me. Thanks but I will stick with my query as you suggested. Using "convert" can decrease performance even more.
-
Bernoulli IT over 11 years
+1 Nice one, works perfectly (when you have the right column types to perform a CheckSum on... ;)
-
Custodio over 11 years
Please take a look at @JayTee's answer. It works like a charm.
count ( distinct CHECKSUM ([Field1], [Field2]))
-
tumchaaditya over 10 years
Excellent suggestion! The more I read, the more I am realizing that SQL is less about knowing syntax and functions and more about applying pure logic.. I wish I had 2 upvotes!
-
Lukas Eder over 10 years
Depending on the complexity of the original query, solving this with GROUP BY may introduce a couple of additional challenges to the query transformation to achieve the desired output (e.g. when the original query already had GROUP BY or HAVING clauses...)
-
crokusek over 10 years
With hashes like Checksum(), there is a small chance that the same hash will be returned for different inputs, so the count may be very slightly off. With HashBytes() the chance is even smaller but still not zero. If those two Ids were ints (32b) then a "lossless hash" could combine them into a bigint (64b) like Id1 << 32 + Id2.
-
pvolders almost 10 years
The chance is not even that small, especially when you start combining columns (which is what it was supposed to be meant for). I was curious about this approach and in a particular case the checksum ended up with a count 10% smaller. If you think about it a bit longer, Checksum just returns an int, so if you'd checksum a full bigint range you'll end up with a distinct count about 2 billion times smaller than there actually is. -1
-
Anthony Geoghegan almost 10 years
The above query will return a different set of results than what the OP was looking for (the distinct combinations of DocumentId and DocumentSessionId). Alexander Kjäll already posted the correct answer if the OP was using MySQL and not MS SQL Server.
-
JayTee over 9 years
Updated the query to include the use of "REVERSE" to remove the chance of duplicates
-
The Red Pea over 8 years
Could we avoid CHECKSUM -- could we just concatenate the two values together? I suppose that risks considering as the same thing: ('he', 'art') == ('hear', 't'). But I think that can be solved with a delimiter as @APC proposes (some value that doesn't appear in either column), so 'he|art' != 'hear|t'. Are there other problems with a simple "concatenation" approach?
-
JayTee about 8 years
I think concatenation can work - the db still has to determine uniqueness
-
Avrajit Roy about 8 years
Very good suggestion. It saved me from writing unnecessary code for this.
-
sstan almost 8 years
This was a SQL Server question, and both options you posted have already been mentioned in the following answers to this question: stackoverflow.com/a/1471444/4955425 and stackoverflow.com/a/1471713/4955425.
-
ijoseph almost 6 years
FWIW, this almost works in PostgreSQL; just need extra parentheses:
SELECT COUNT(DISTINCT (DocumentId, DocumentSessionId)) FROM DocumentOutputItems;
-
Vytenis Bivainis over 5 years
What databases support select count(distinct(a, b))? :D
-
karmakaze over 5 years
@VytenisBivainis I know PostgreSQL does--not sure since which version.
-
naviram over 5 years
This doesn't do what the question requires; it counts the distinct values separately for each column
-
Anwar Shaikh over 4 years
It does not give you the count of distinct values in conjunction of two columns. At least not in MySQL 5.8.
-
Tab Alleman over 4 years
This question is tagged SQL Server, and this isn't SQL Server syntax
-
Aaron West over 4 years
Do not use CHECKSUM for this. It's simply left-shift-by-4 then xor, which is the worst hash I have seen (CRC-32 would be far superior; even CRC-16 might be better!) It is very easy to construct colliding values; for example, the following are both 0:
select checksum('1234123412341234')
select checksum('abcdabcdabcdabcd')
The only valid use I can imagine for CHECKSUM is the recommended one; when you want a small index over large text values, and you add a comparison of those text values to the where clause, to eliminate collisions. HASHBytes may be adequate, though possibly slow.
-
jayqui about 4 years
Would you please add an example or code sample to show more about what this means and how to do it?
-
Bort over 3 years
Be very careful with this method as it could lead to incorrect counts. The following example will return a count of 1.
DocumentID | DocumentSessionID
"A" | "AB"
"AA" | "B"
-
Sreram over 3 years
How is it different from creating a multi-column index on those columns? I'm sorry if this makes no sense. I'm new to SQL.
-
Tomty over 3 years
Even in MySQL, this isn't entirely equivalent to the original query, because rows with NULLs won't be counted.
-
Tomty over 3 years
As @Bort notes, the first option can lead to incorrect results, and would be better written using CONCAT_WS. The 2nd method also isn't guaranteed to produce the same results as the original query, in case any of the columns are nullable.
-
devloper152 over 2 years
What does 1 mean in count(1)?
-
Kota Mori over 2 years
How does this trick care about hash collisions? I think the distinct counts on hash values would be smaller than the truth due to the collisions.
-
Sergiy over 2 years
@VytenisBivainis MySQL supports that as well
-
StriplingWarrior about 2 years
This question is not about Oracle. It's about SQL Server.
-
StriplingWarrior about 2 years
@devloper152: It has no special meaning. For some reason count() always has to have an argument, so depending on people's taste they'll typically use count(*), count(1), or count(null).
-
StriplingWarrior about 2 years
To be clear, || is a concatenation operator in some databases. This question is about SQL Server, where + would be the equivalent. Just like all the other answers on this question recommending concatenation, this suffers from the problem that combinations of different values ('a', 'bc' vs 'ab', 'c') can concatenate to the same value ('abc'), giving you an incorrect count.
-
karmakaze about 2 years
@Sergiy thanks, updated answer with link to non-standard syntax supported by MySQL
-
surj about 2 years
Love these kinds of creative solutions, thanks for sharing.
-
AdamO about 2 years
@AnwarShaikh I don't understand your comment. Do you mean to say it does not give you the count of distinct rows in the two columns "DocumentID" and "DocumentSessionID"?
-
Erick de Vathaire almost 2 years
I used it with a table that has only 169 rows and it was wrong: the values "2, 550" and "3, 550" have the same result. Don't use CHECKSUM. Using a loop to put the values 1 to 10 in 2 columns with a cross join, it should have 100 unique values, but it shows 86.
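
crokusek's "lossless hash" idea from the comments can be sketched roughly as follows. This assumes both ids are non-negative values that fit in 32 bits, and it runs on SQLite via Python's sqlite3 as a stand-in because SQLite has a << operator; whether your engine offers a shift operator (or needs multiplication by 4294967296 instead) is something to check. The sample rows are made up, reusing the "2, 550" and "3, 550" pair mentioned above. The parentheses are explicit so the result does not depend on operator precedence.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentOutputItems (DocumentId INTEGER, DocumentSessionId INTEGER)"
)
# Hypothetical sample data: non-negative ids that fit in 32 bits.
rows = [(1, 2), (2, 1), (1, 2), (2, 550), (3, 550)]
conn.executemany("INSERT INTO DocumentOutputItems VALUES (?, ?)", rows)

# Pack the first id into the high 32 bits and the second into the low 32 bits.
# Distinct pairs always map to distinct 64-bit values under the stated assumption.
packed_count = conn.execute(
    "SELECT COUNT(DISTINCT ((DocumentId << 32) | DocumentSessionId)) "
    "FROM DocumentOutputItems"
).fetchone()[0]

# Reference result from the subquery form used in the question.
reference_count = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT DISTINCT DocumentId, DocumentSessionId FROM DocumentOutputItems)"
).fetchone()[0]

print(packed_count, reference_count)  # prints: 4 4
```

Unlike CHECKSUM, this cannot collide as long as the 32-bit, non-negative assumption holds, but it only works for two integer columns.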