GROUP BY and COUNT in PostgreSQL
Solution 1
I think you just need COUNT(DISTINCT post_id) FROM votes.
See "4.2.7. Aggregate Expressions" section in http://www.postgresql.org/docs/current/static/sql-expressions.html.
EDIT: Corrected my careless mistake per Erwin's comment.
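As a quick sanity check, here is a minimal sketch that loads the vote counts from the question's sample output into an in-memory SQLite database and runs the suggested query. SQLite (via Python's sqlite3 module) is only a stand-in so the example runs without a PostgreSQL server, and the two-column votes table is an assumption; the SQL statement itself is the same in PostgreSQL.

```python
import sqlite3

# Votes per post, taken from the sample output in the question:
# posts 6, 9 and 10 have one vote each; posts 1, 4 and 5 have three each.
VOTES_PER_POST = {6: 1, 4: 3, 5: 3, 1: 3, 9: 1, 10: 1}

conn = sqlite3.connect(":memory:")
# Assumed minimal layout of the votes table (only the columns needed here).
conn.execute("CREATE TABLE votes (vote_id INTEGER PRIMARY KEY, post_id INTEGER)")
conn.executemany(
    "INSERT INTO votes (post_id) VALUES (?)",
    [(post_id,) for post_id, n in VOTES_PER_POST.items() for _ in range(n)],
)

# Plain row count: one row per vote.
total_votes = conn.execute("SELECT COUNT(*) FROM votes").fetchone()[0]

# The query from this answer: one count per distinct post that has votes.
distinct_posts = conn.execute(
    "SELECT COUNT(DISTINCT post_id) FROM votes"
).fetchone()[0]

print(total_votes, distinct_posts)  # 12 6
```

The second number matches the 6 rows the original GROUP BY query returned.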
Solution 2
There is also EXISTS:
SELECT count(*) AS post_ct
FROM posts p
WHERE EXISTS (SELECT FROM votes v WHERE v.post_id = p.id);
In Postgres and with multiple entries on the n-side like you probably have, it's generally faster than count(DISTINCT post_id):
SELECT count(DISTINCT p.id) AS post_ct
FROM posts p
JOIN votes v ON v.post_id = p.id;
The more rows per post there are in votes, the bigger the difference in performance. Test with EXPLAIN ANALYZE.
count(DISTINCT post_id) has to read all rows, sort or hash them, and then only consider the first per identical set. EXISTS will only scan votes (or, preferably, an index on post_id) until the first match is found.
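To illustrate that the two forms count the same thing, here is a small sketch with made-up data (five posts, two of them without votes). It runs on an in-memory SQLite database as a stand-in for PostgreSQL, so it says nothing about performance; it only checks that the results agree. One deviation is labeled in the code: SQLite does not accept an empty select list, so the subquery uses SELECT 1 instead of the bare SELECT shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY);
    CREATE TABLE votes (
        vote_id INTEGER PRIMARY KEY,
        post_id INTEGER REFERENCES posts(id)
    );
    -- Hypothetical data: posts 1..5, posts 2 and 5 never receive a vote.
    INSERT INTO posts (id) VALUES (1), (2), (3), (4), (5);
    INSERT INTO votes (post_id) VALUES (1), (1), (1), (3), (4), (4);
""")

# EXISTS form: one row per post that has at least one vote.
# (SELECT 1 instead of an empty select list, because of SQLite.)
exists_ct = conn.execute("""
    SELECT count(*) AS post_ct
    FROM posts p
    WHERE EXISTS (SELECT 1 FROM votes v WHERE v.post_id = p.id)
""").fetchone()[0]

# Join form: the join multiplies rows per vote, DISTINCT collapses them again.
join_ct = conn.execute("""
    SELECT count(DISTINCT p.id) AS post_ct
    FROM posts p
    JOIN votes v ON v.post_id = p.id
""").fetchone()[0]

print(exists_ct, join_ct)  # 3 3
```

Posts 1, 3 and 4 have votes, so both queries report 3 even though votes holds six rows.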
If every post_id in votes is guaranteed to be present in the table posts (referential integrity enforced with a foreign key constraint), this short form is equivalent to the longer form:
SELECT count(DISTINCT post_id) AS post_ct
FROM votes;
It may actually be faster than the EXISTS query when there are no or only a few entries per post.
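The foreign key condition matters for correctness, not only for speed. The sketch below uses hypothetical data in an in-memory SQLite database (again just a runnable stand-in for PostgreSQL) and deliberately omits the foreign key, so that one vote can point at a post that does not exist. In that situation the short form and the long form disagree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY);
    -- Deliberately no foreign key, so an orphaned vote can exist.
    CREATE TABLE votes (vote_id INTEGER PRIMARY KEY, post_id INTEGER);
    INSERT INTO posts (id) VALUES (1), (2), (3);
    -- post_id 99 has no matching row in posts.
    INSERT INTO votes (post_id) VALUES (1), (1), (2), (99);
""")

# Short form: counts every distinct post_id in votes, including the orphan.
short_ct = conn.execute(
    "SELECT count(DISTINCT post_id) AS post_ct FROM votes"
).fetchone()[0]

# Long form: the join drops votes that have no matching post.
long_ct = conn.execute("""
    SELECT count(DISTINCT p.id) AS post_ct
    FROM posts p
    JOIN votes v ON v.post_id = p.id
""").fetchone()[0]

print(short_ct, long_ct)  # 3 2
```

With an enforced foreign key the orphaned row could not be inserted, and both forms would return the same number.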
The query you had works in simpler form, too:
SELECT count(*) AS post_ct
FROM (
SELECT FROM posts
JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) sub;
Benchmark
To verify my claims I ran a benchmark on my test server with limited resources. All in a separate schema:
Test setup
Fake a typical post / vote situation:
CREATE SCHEMA y;
SET search_path = y;
CREATE TABLE posts (
id int PRIMARY KEY
, post text
);
INSERT INTO posts
SELECT g, repeat(chr(g%100 + 32), (random()* 500)::int) -- random text
FROM generate_series(1,10000) g;
DELETE FROM posts WHERE random() > 0.9; -- create ~ 10 % dead tuples
CREATE TABLE votes (
vote_id serial PRIMARY KEY
, post_id int REFERENCES posts(id)
, up_down bool
);
INSERT INTO votes (post_id, up_down)
SELECT g.*
FROM (
SELECT ((random()* 21)^3)::int + 1111 AS post_id -- uneven distribution
, random()::int::bool AS up_down
FROM generate_series(1,70000)
) g
JOIN posts p ON p.id = g.post_id;
All of the following queries returned the same result (8093 of 9107 posts had votes).
I ran 4 tests with EXPLAIN ANALYZE and took the best of five on Postgres 9.1.4 with each of the three queries and appended the resulting total runtimes.
1. As is.
2. After ANALYZE posts; ANALYZE votes;
3. After CREATE INDEX foo on votes(post_id);
4. After VACUUM FULL ANALYZE posts; CLUSTER votes using foo;
count(*) ... WHERE EXISTS
- 253 ms
- 220 ms
- 85 ms -- winner (seq scan on posts, index scan on votes, nested loop)
- 85 ms
count(DISTINCT x) - long form with join
- 354 ms
- 358 ms
- 373 ms -- (index scan on posts, index scan on votes, merge join)
- 330 ms
count(DISTINCT x) - short form without join
- 164 ms
- 164 ms
- 164 ms -- (always seq scan)
- 142 ms
Best time for original query in question:
- 353 ms
For simplified version:
- 348 ms
@wildplasser's query with a CTE uses the same plan as the long form (index scan on posts, index scan on votes, merge join) plus a little overhead for the CTE. Best time:
- 366 ms
Index-only scans in the upcoming PostgreSQL 9.2 can improve the result for each of these queries, most of all for EXISTS.
Solution 3
Using OVER() and LIMIT 1:
SELECT COUNT(1) OVER()
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
LIMIT 1;
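One thing worth checking with this approach is the empty case. The window function is evaluated over the grouped rows, so when no post has any votes there is no row to return at all, whereas a subquery with count(*) would return 0. The sketch below shows this with hypothetical data; it uses an in-memory SQLite database as a stand-in for PostgreSQL and assumes SQLite 3.25 or later for window function support. The helper name voted_post_rows is made up for the illustration.

```python
import sqlite3

QUERY = """
    SELECT COUNT(1) OVER ()
    FROM posts
    INNER JOIN votes ON votes.post_id = posts.id
    GROUP BY posts.id
    LIMIT 1
"""

def voted_post_rows(vote_post_ids):
    """Build a tiny posts/votes database and return all rows of QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE votes (vote_id INTEGER PRIMARY KEY, post_id INTEGER)"
    )
    conn.executemany("INSERT INTO posts (id) VALUES (?)", [(i,) for i in (1, 2, 3)])
    conn.executemany(
        "INSERT INTO votes (post_id) VALUES (?)", [(p,) for p in vote_post_ids]
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows

# Two posts (1 and 3) have votes: a single row holding the group count.
with_votes = voted_post_rows([1, 1, 3])

# No votes at all: the result set is empty rather than a row containing 0.
without_votes = voted_post_rows([])

print(with_votes, without_votes)  # [(2,)] []
```

Application code that uses this form should therefore treat "no row" as a count of zero.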
Solution 4
WITH uniq AS (
SELECT DISTINCT posts.id as post_id
FROM posts
JOIN votes ON votes.post_id = posts.id
-- GROUP BY not needed anymore
-- GROUP BY posts.id
)
SELECT COUNT(*)
FROM uniq;
skinkelynet
Updated on July 09, 2022
Comments
-
skinkelynet almost 2 years
The query:
SELECT COUNT(*) as count_all, posts.id as post_id FROM posts INNER JOIN votes ON votes.post_id = posts.id GROUP BY posts.id;
Returns n records in Postgresql:
 count_all | post_id
-----------+---------
         1 |       6
         3 |       4
         3 |       5
         3 |       1
         1 |       9
         1 |      10
(6 rows)
I just want to retrieve the number of records returned: 6.
I used a subquery to achieve what I want, but this doesn't seem optimum:
SELECT COUNT(*) FROM ( SELECT COUNT(*) as count_all, posts.id as post_id FROM posts INNER JOIN votes ON votes.post_id = posts.id GROUP BY posts.id ) as x;
How would I get the number of records in this context right in PostgreSQL?
-
Samson almost 12 years
Why would you think it's not optimum?
-
skinkelynet almost 12 years
This would seem like an operation so common there would be an easier way.
-
rogerdpack over 2 years
does SELECT COUNT(*) from POSTS work in your case?
-
skinkelynet almost 12 years
PG::Error: ERROR: column "posts.id" must appear in the GROUP BY clause or be used in an aggregate function
-
Erwin Brandstetter almost 12 years
@skinkelynet: that's because the answer is subtly wrong - it has to be FROM votes. I added the correct form to my answer.
-
a_horse_with_no_name almost 12 years
What do you mean with "more portable"?
-
Erwin Brandstetter almost 12 years
@a_horse_with_no_name: "more portable" was nonsense, really. Removed that bit, thanks for pointing out. I was under the wrong impression that SQLite would not support DISTINCT in aggregate functions. Turns out, it does - just as all other major RDBMS. As compensation (and because I wanted to clarify that for myself) I elaborate on the performance angle with a benchmark.
-
wildplasser almost 12 years
If I read correctly, you missed my CTE-version. It should be equivalent to a subquery, though.
-
Erwin Brandstetter almost 12 years
@wildplasser: Sorry, recreated the scenario (not identical, but close as can be seen from the setup) and added the result for the CTE version. As expected, a CTE doesn't help performance here.
-
Steve Jorgensen over 3 years
@LostCrotchet It turns out you can do that in PostgreSQL. You need to put the list of fields in parentheses, so for example… SELECT COUNT(DISTINCT (firstname, lastname)) FROM people.
-
stefansundin over 2 years
This is what worked for my case since I wanted to filter out things with a HAVING SUM(..) > 5 clause that summed values across rows.