DISTINCT ON in an aggregate function in postgres
Solution 1
The most simple thing I discovered is to use DISTINCT
over jsonb
(not json!).
(jsonb_build_object
creates jsonb objects)
SELECT
JSON_AGG(
DISTINCT jsonb_build_object('tag_id', photo_tag.tag_id,
'name', tag.name)) AS tags
FROM photo
LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
GROUP BY photo.photo_id
Solution 2
Whenever you have a central table and want to left-join it to many rows in table A and also left-join it to many rows in table B, you get these problems of duplicating rows. It can especially throw off aggregrate functions like COUNT
and SUM
if you're not careful! So I think you need to construct your tags-for-each-photo and comments-for-each-photo separately, and then join them together:
WITH tags AS (
SELECT photo.photo_id, json_agg(row_to_json(tag.*)) AS tags
FROM photo
LEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_id
LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
GROUP BY photo.photo_id
),
comments AS (
SELECT photo.photo_id, json_agg(row_to_json(comment.*)) AS comments
FROM photo
LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
GROUP BY photo.photo_id
)
SELECT COALESCE(tags.photo_id, comments.photo_id) AS photo_id,
tags.tags,
comments.comments
FROM tags
FULL OUTER JOIN comments
ON tags.photo_id = comments.photo_id
EDIT: If you really want to join everything together without CTEs, this looks like it gives correct results:
SELECT photo.photo_id,
to_json(array_agg(DISTINCT tag.*)) AS tags,
to_json(array_agg(DISTINCT comment.*)) AS comments
FROM photo
LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
LEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_id
LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
GROUP BY photo.photo_id
Solution 3
The cheapest and simplest DISTINCT
operation is ... not to multiply rows in a "proxy cross join" in the first place. Aggregate first, then join. See:
Best for returning few selected rows
Assuming you actually don't want to retrieve the whole table, but just one or few selected photos at a time, with aggregated details. Then LATERAL
subqueries are fast and elegant:
SELECT *
FROM photo p
CROSS JOIN LATERAL (
SELECT json_agg(c) AS comments
FROM comment c
WHERE photo_id = p.photo_id
) c1
CROSS JOIN LATERAL (
SELECT json_agg(t) AS tags
FROM photo_tag pt
JOIN tag t USING (tag_id)
WHERE pt.photo_id = p.photo_id
) t
WHERE p.photo_id = 2; -- arbitrary selection
This returns whole rows from comment
and tag
, aggregated into JSON arrays separately. Rows are not multiplies like in your attempt, but they are only as "distinct" as they are in your base tables.
To additionally fold duplicates in the base data, see below.
Notes:
-
LATERAL
andjson_agg()
require Postgres 9.3 or later. -
json_agg(c)
is short forjson_agg(c.*)
. -
We do not need to
LEFT JOIN
because an aggregate function likejson_agg()
always returns a row.
Typically, you'd only want a subset of columns - at least excluding the redundant photo_id
:
SELECT *
FROM photo p
CROSS JOIN LATERAL (
SELECT json_agg(json_build_object('comment_id', comment_id
, 'comment', comment)) AS comments
FROM comment
WHERE photo_id = p.photo_id
) c
CROSS JOIN LATERAL (
SELECT json_agg(t) AS tags
FROM photo_tag pt
JOIN tag t USING (tag_id)
WHERE pt.photo_id = p.photo_id
) t
WHERE p.photo_id = 2;
json_build_object()
was introduced with Postgres 9.4. Used to be cumbersome in older versions because a ROW
constructor doesn't preserve column names. But there are generic workarounds:
Also allows to choose JSON key names freely, you don't have to stick to column names.
Best for returning the whole table
To return all rows, this is more efficient:
SELECT p.*
, COALESCE(c1.comments, '[]') AS comments
, COALESCE(t.tags, '[]') AS tags
FROM photo p
LEFT JOIN (
SELECT photo_id
, json_agg(json_build_object('comment_id', comment_id
, 'comment', comment)) AS comments
FROM comment c
GROUP BY 1
) c1 USING (photo_id)
LEFT JOIN LATERAL (
SELECT photo_id , json_agg(t) AS tags
FROM photo_tag pt
JOIN tag t USING (tag_id)
GROUP BY 1
) t USING (photo_id);
Once we retrieve enough rows, this gets cheaper than LATERAL
subqueries. Works for Postgres 9.3+.
Note the USING
clause in the join condition. This way we can conveniently use SELECT *
in the outer query without getting duplicate columns for photo_id
. I didn't use SELECT *
here because your deleted answer indicates you want empty JSON arrays instead of NULL for no tags / no comments.
Also remove existing duplicates in base tables
You can't just json_agg(DISTINCT json_build_object(...))
because there is no equality operator for the data type json
. See:
There are various better ways:
SELECT *
FROM photo p
CROSS JOIN LATERAL (
SELECT json_agg(to_json(c1.comment)) AS comments1
, json_agg(json_build_object('comment', c1.comment)) AS comments2
, json_agg(to_json(c1)) AS comments3
FROM (
SELECT DISTINCT c.comment -- folding dupes here
FROM comment c
WHERE c.photo_id = p.photo_id
-- ORDER BY comment -- any particular order?
) c1
) c2
CROSS JOIN LATERAL (
SELECT jsonb_agg(DISTINCT t) AS tags -- demonstrating jsonb_agg
FROM photo_tag pt
JOIN tag t USING (tag_id)
WHERE pt.photo_id = p.photo_id
) t
WHERE p.photo_id = 2;
Demonstrating 4 different techniques in comments1
, comments2
, comments3
(redundantly) and tags
.
db<>fiddle here
Old: sqlfiddle backpatched to Postgres 9.3; sqlfiddle for Postgres 9.6
Solution 4
As stated in comments, json_agg does not serialize a row as an object, but builds a JSON array of the values that you pass it. You'll need row_to_json
to turn your row into a JSON object, and then json_agg
to perform the aggregation to an array:
SELECT json_agg(DISTINCT row_to_json(comment)) as tags
FROM
photo
LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
GROUP BY photo.photo_id
Related videos on Youtube
Migwell
Updated on July 09, 2022Comments
-
Migwell almost 2 years
For my problem, we have a schema whereby one photo has many tags and also many comments. So if I have a query where I want all the comments and tags, it will multiply the rows together. So if one photo has 2 tags and 13 comments, I get 26 rows for that one photo:
SELECT tag.name, comment.comment_id FROM photo LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
That's fine for most things, but it means that if I
GROUP BY
and thenjson_agg(tag.*)
, I get 13 copies of the first tag, and 13 copies of the second tag.SELECT json_agg(tag.name) as tags FROM photo LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id GROUP BY photo.photo_id
Instead I want an array that is only 'suburban' and 'city', like this:
[ {"tag_id":1,"name":"suburban"}, {"tag_id":2,"name":"city"} ]
I could
json_agg(DISTINCT tag.name)
, but this will only make an array of tag names, when I want the entire row as json. I would like tojson_agg(DISTINCT ON(tag.name) tag.*)
, but that's not valid SQL apparently.How then can I simulate
DISTINCT ON
inside an aggregate function in Postgres?-
PinnyM about 9 years
json_agg(DISTINCT tag.name)
should work for you, have you tried this? -
JNevill about 9 yearsCan you just remove the join to
Comments
? You aren't selecting on anything in that table. If you need to keep it because you have something from it in yourWHERE
clause, perhaps then add everytag
field to yourGROUP BY
and don't useDISTINCT
-
Migwell about 9 years@PinnyM No because then json_agg just makes an array of tag names. I want an array of objects representing the whole row (my second query is just an example, in reality I want to
json_agg(tag.*)
-
Migwell about 9 years@JNevill No I can't, because I also want to select
json_agg(comment.*) AS comments
. -
PinnyM about 9 yearsAccording to the docs, json_agg "aggregates values as a JSON array". If all you want is to turn the row into json, use the
row_to_json
function. -
Migwell about 9 yearsNo, I want an array of rows. Which is what json_agg(comment.*) would do. I don't just want one json object.
-
PinnyM about 9 yearsCan you please post exactly what you'd like the expected output to look like? Still not clear to me...
-
Migwell about 9 yearsUpdated the question to have an expected output
-
Erwin Brandstetter about 8 yearsYou have presented your question clearly, but important information is still missing to determine the best query. 1st, always your version of Postgres. 2nd, retrieve the whole table, a buch of photos of just a single photo per query? Also, I doubt you really want
json_agg(comment.*)
that would includephoto_id
redundantly. Typically you'd want a subset without the redundant column, no? -
Joe Love almost 3 yearsjson_agg doesn't work with distinct. It says there is no equality operator in JSON
-
-
Migwell about 9 yearsActually json_agg does implicitly convert a row into JSON. Have a look at this example I just made pastebin.com/8jVSNstd. But the problem with your query is that because there is no JSON comparison operator, it can't find distinct rows because they're all JSON.
-
Migwell about 9 yearsIf I copy your exact code and run it, I get
ERROR: could not identify an equality operator for type json
, which is what I'm referring to. -
PinnyM about 9 years@Miguel Hmm, have you tried calling DISTINCT on the row, like this: json_agg(DISTINCT tag)? If that doesn't work, you'll likely need a subquery to accomplish this.
-
Migwell about 9 yearsThat doesn't work because the row is implicitly converted to json and you get the same error. If you know how to do this with a sub query I'd love to know because I can't think of anything.
-
victor n. almost 6 yearshaving json_build_object(distinct...)) would have been awesome.
-
Saurabh Chauhan over 5 yearsHello, @paul is there anyway to add DISTINCT in the following query:-
json_agg( json_build_object ( 'id', ca.id, 'name',ca.name, 'url',ca.url) ) as authors
-
Saurabh Chauhan over 5 yearsHello, @Erwin is there anyway to add DISTINCT in the following query:- json_agg( json_build_object ( 'id', ca.id, 'name',ca.name, 'url',ca.url) ) as authors
-
Paul A Jungwirth over 5 yearsI think you can do it, but it's a little tricky. It seems worth making a separate question!
-
Erwin Brandstetter over 5 years@SaurabhChauhan: I added a chapter addressing that. Ask a question with details if anything is still unclear.
-
mmey about 5 yearsBeen working with SQL for over twenty years, but there's always something new to learn... :o
-
kyw over 4 yearsNice. Though is there a downside to using
jsonb
overjson
? -
Eugene Kovalev over 4 years@kyw Please, have a look at this article. There are pros and cons of
jsonb
overjson
at the very beginning. As for me, the pros are really good :) -
Surya almost 4 years@EugeneKovalev Any doc or links that explain behaviour on DISTINCT in JSONB objects? DISTINCT seems to me calculating distinct on the entire JSONB object and not just any key in the JSONB object would like to know how it works though.
-
Mohammad Ashraful Islam over 2 yearsDISTINCT with jsonb_build_object() work like a champ. Loved it.
-
Alveona over 2 yearsthanks mate, much easier than the answers above