DISTINCT ON in an aggregate function in postgres

33,565

Solution 1

The most simple thing I discovered is to use DISTINCT over jsonb (not json!). (jsonb_build_object creates jsonb objects)

SELECT 
   JSON_AGG(
       DISTINCT jsonb_build_object('tag_id', photo_tag.tag_id, 
                                  'name', tag.name)) AS tags
FROM photo
    LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
    LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
    LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
GROUP BY photo.photo_id

Solution 2

Whenever you have a central table and want to left-join it to many rows in table A and also left-join it to many rows in table B, you get these problems of duplicating rows. It can especially throw off aggregrate functions like COUNT and SUM if you're not careful! So I think you need to construct your tags-for-each-photo and comments-for-each-photo separately, and then join them together:

WITH tags AS (
  SELECT  photo.photo_id, json_agg(row_to_json(tag.*)) AS tags
  FROM    photo
  LEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_id
  LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
  GROUP BY photo.photo_id
),
comments AS (
  SELECT  photo.photo_id, json_agg(row_to_json(comment.*)) AS comments
  FROM    photo
  LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
  GROUP BY photo.photo_id
)
SELECT  COALESCE(tags.photo_id, comments.photo_id) AS photo_id,
        tags.tags,
        comments.comments
FROM    tags
FULL OUTER JOIN comments
ON      tags.photo_id = comments.photo_id

EDIT: If you really want to join everything together without CTEs, this looks like it gives correct results:

SELECT  photo.photo_id,
        to_json(array_agg(DISTINCT tag.*)) AS tags,
        to_json(array_agg(DISTINCT comment.*)) AS comments
FROM    photo
LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
LEFT OUTER JOIN photo_tag on photo_tag.photo_id = photo.photo_id
LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
GROUP BY photo.photo_id

Solution 3

The cheapest and simplest DISTINCT operation is ... not to multiply rows in a "proxy cross join" in the first place. Aggregate first, then join. See:

Best for returning few selected rows

Assuming you actually don't want to retrieve the whole table, but just one or few selected photos at a time, with aggregated details. Then LATERAL subqueries are fast and elegant:

SELECT *
FROM   photo p
CROSS  JOIN LATERAL (
   SELECT json_agg(c) AS comments
   FROM   comment c
   WHERE  photo_id = p.photo_id
   ) c1
CROSS  JOIN LATERAL (
   SELECT json_agg(t) AS tags
   FROM   photo_tag pt
   JOIN   tag       t USING (tag_id)
   WHERE  pt.photo_id = p.photo_id
   ) t
WHERE  p.photo_id = 2;  -- arbitrary selection

This returns whole rows from comment and tag, aggregated into JSON arrays separately. Rows are not multiplies like in your attempt, but they are only as "distinct" as they are in your base tables.

To additionally fold duplicates in the base data, see below.

Notes:

  • LATERAL and json_agg() require Postgres 9.3 or later.

  • json_agg(c) is short for json_agg(c.*).

  • We do not need to LEFT JOIN because an aggregate function like json_agg() always returns a row.

Typically, you'd only want a subset of columns - at least excluding the redundant photo_id:

SELECT *
FROM   photo p
CROSS  JOIN LATERAL (
   SELECT json_agg(json_build_object('comment_id', comment_id
                                   , 'comment', comment)) AS comments
   FROM   comment
   WHERE  photo_id = p.photo_id
   ) c
CROSS  JOIN LATERAL (
   SELECT json_agg(t) AS tags
   FROM   photo_tag pt
   JOIN   tag       t USING (tag_id)
   WHERE  pt.photo_id = p.photo_id
   ) t
WHERE  p.photo_id = 2;

json_build_object() was introduced with Postgres 9.4. Used to be cumbersome in older versions because a ROW constructor doesn't preserve column names. But there are generic workarounds:

Also allows to choose JSON key names freely, you don't have to stick to column names.

Best for returning the whole table

To return all rows, this is more efficient:

SELECT p.*
     , COALESCE(c1.comments, '[]') AS comments
     , COALESCE(t.tags, '[]') AS tags
FROM   photo p
LEFT   JOIN (
   SELECT photo_id
        , json_agg(json_build_object('comment_id', comment_id
                                   , 'comment', comment)) AS comments
   FROM   comment c
   GROUP  BY 1
   ) c1 USING (photo_id)
LEFT  JOIN LATERAL (
   SELECT photo_id , json_agg(t) AS tags
   FROM   photo_tag pt
   JOIN   tag       t USING (tag_id)
   GROUP  BY 1
   ) t USING (photo_id);

Once we retrieve enough rows, this gets cheaper than LATERAL subqueries. Works for Postgres 9.3+.

Note the USING clause in the join condition. This way we can conveniently use SELECT * in the outer query without getting duplicate columns for photo_id. I didn't use SELECT * here because your deleted answer indicates you want empty JSON arrays instead of NULL for no tags / no comments.

Also remove existing duplicates in base tables

You can't just json_agg(DISTINCT json_build_object(...)) because there is no equality operator for the data type json. See:

There are various better ways:

SELECT *
FROM   photo p
CROSS  JOIN LATERAL (
   SELECT json_agg(to_json(c1.comment)) AS comments1
        , json_agg(json_build_object('comment', c1.comment)) AS comments2
        , json_agg(to_json(c1)) AS comments3
   FROM  (
      SELECT DISTINCT c.comment  -- folding dupes here
      FROM   comment c
      WHERE  c.photo_id = p.photo_id
   -- ORDER  BY comment --  any particular order?
      ) c1
   ) c2
CROSS  JOIN LATERAL (
   SELECT jsonb_agg(DISTINCT t) AS tags  -- demonstrating jsonb_agg
   FROM   photo_tag pt
   JOIN   tag       t USING (tag_id)
   WHERE  pt.photo_id = p.photo_id
   ) t
WHERE  p.photo_id = 2;

Demonstrating 4 different techniques in comments1, comments2, comments3 (redundantly) and tags.

db<>fiddle here
Old: sqlfiddle backpatched to Postgres 9.3; sqlfiddle for Postgres 9.6

Solution 4

As stated in comments, json_agg does not serialize a row as an object, but builds a JSON array of the values that you pass it. You'll need row_to_json to turn your row into a JSON object, and then json_agg to perform the aggregation to an array:

SELECT json_agg(DISTINCT row_to_json(comment)) as tags
FROM
    photo
    LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
    LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
    LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
GROUP BY photo.photo_id
Share:
33,565

Related videos on Youtube

Migwell
Author by

Migwell

Updated on July 09, 2022

Comments

  • Migwell
    Migwell almost 2 years

    For my problem, we have a schema whereby one photo has many tags and also many comments. So if I have a query where I want all the comments and tags, it will multiply the rows together. So if one photo has 2 tags and 13 comments, I get 26 rows for that one photo:

    SELECT
            tag.name, 
            comment.comment_id
    FROM
            photo
            LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
            LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
            LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
    

    enter image description here

    That's fine for most things, but it means that if I GROUP BY and then json_agg(tag.*), I get 13 copies of the first tag, and 13 copies of the second tag.

    SELECT json_agg(tag.name) as tags
    FROM
            photo
            LEFT OUTER JOIN comment ON comment.photo_id = photo.photo_id
            LEFT OUTER JOIN photo_tag ON photo_tag.photo_id = photo.photo_id
            LEFT OUTER JOIN tag ON photo_tag.tag_id = tag.tag_id
    GROUP BY photo.photo_id
    

    enter image description here

    Instead I want an array that is only 'suburban' and 'city', like this:

     [
          {"tag_id":1,"name":"suburban"}, 
          {"tag_id":2,"name":"city"}
     ]
    

    I could json_agg(DISTINCT tag.name), but this will only make an array of tag names, when I want the entire row as json. I would like to json_agg(DISTINCT ON(tag.name) tag.*), but that's not valid SQL apparently.

    How then can I simulate DISTINCT ON inside an aggregate function in Postgres?

    • PinnyM
      PinnyM about 9 years
      json_agg(DISTINCT tag.name) should work for you, have you tried this?
    • JNevill
      JNevill about 9 years
      Can you just remove the join to Comments? You aren't selecting on anything in that table. If you need to keep it because you have something from it in your WHERE clause, perhaps then add every tag field to your GROUP BY and don't use DISTINCT
    • Migwell
      Migwell about 9 years
      @PinnyM No because then json_agg just makes an array of tag names. I want an array of objects representing the whole row (my second query is just an example, in reality I want to json_agg(tag.*)
    • Migwell
      Migwell about 9 years
      @JNevill No I can't, because I also want to select json_agg(comment.*) AS comments.
    • PinnyM
      PinnyM about 9 years
      According to the docs, json_agg "aggregates values as a JSON array". If all you want is to turn the row into json, use the row_to_json function.
    • Migwell
      Migwell about 9 years
      No, I want an array of rows. Which is what json_agg(comment.*) would do. I don't just want one json object.
    • PinnyM
      PinnyM about 9 years
      Can you please post exactly what you'd like the expected output to look like? Still not clear to me...
    • Migwell
      Migwell about 9 years
      Updated the question to have an expected output
    • Erwin Brandstetter
      Erwin Brandstetter about 8 years
      You have presented your question clearly, but important information is still missing to determine the best query. 1st, always your version of Postgres. 2nd, retrieve the whole table, a buch of photos of just a single photo per query? Also, I doubt you really want json_agg(comment.*) that would include photo_id redundantly. Typically you'd want a subset without the redundant column, no?
    • Joe Love
      Joe Love almost 3 years
      json_agg doesn't work with distinct. It says there is no equality operator in JSON
  • Migwell
    Migwell about 9 years
    Actually json_agg does implicitly convert a row into JSON. Have a look at this example I just made pastebin.com/8jVSNstd. But the problem with your query is that because there is no JSON comparison operator, it can't find distinct rows because they're all JSON.
  • Migwell
    Migwell about 9 years
    If I copy your exact code and run it, I get ERROR: could not identify an equality operator for type json, which is what I'm referring to.
  • PinnyM
    PinnyM about 9 years
    @Miguel Hmm, have you tried calling DISTINCT on the row, like this: json_agg(DISTINCT tag)? If that doesn't work, you'll likely need a subquery to accomplish this.
  • Migwell
    Migwell about 9 years
    That doesn't work because the row is implicitly converted to json and you get the same error. If you know how to do this with a sub query I'd love to know because I can't think of anything.
  • victor n.
    victor n. almost 6 years
    having json_build_object(distinct...)) would have been awesome.
  • Saurabh Chauhan
    Saurabh Chauhan over 5 years
    Hello, @paul is there anyway to add DISTINCT in the following query:- json_agg( json_build_object ( 'id', ca.id, 'name',ca.name, 'url',ca.url) ) as authors
  • Saurabh Chauhan
    Saurabh Chauhan over 5 years
    Hello, @Erwin is there anyway to add DISTINCT in the following query:- json_agg( json_build_object ( 'id', ca.id, 'name',ca.name, 'url',ca.url) ) as authors
  • Paul A Jungwirth
    Paul A Jungwirth over 5 years
    I think you can do it, but it's a little tricky. It seems worth making a separate question!
  • Erwin Brandstetter
    Erwin Brandstetter over 5 years
    @SaurabhChauhan: I added a chapter addressing that. Ask a question with details if anything is still unclear.
  • mmey
    mmey about 5 years
    Been working with SQL for over twenty years, but there's always something new to learn... :o
  • kyw
    kyw over 4 years
    Nice. Though is there a downside to using jsonb over json?
  • Eugene Kovalev
    Eugene Kovalev over 4 years
    @kyw Please, have a look at this article. There are pros and cons of jsonb over json at the very beginning. As for me, the pros are really good :)
  • Surya
    Surya almost 4 years
    @EugeneKovalev Any doc or links that explain behaviour on DISTINCT in JSONB objects? DISTINCT seems to me calculating distinct on the entire JSONB object and not just any key in the JSONB object would like to know how it works though.
  • Mohammad Ashraful Islam
    Mohammad Ashraful Islam over 2 years
    DISTINCT with jsonb_build_object() work like a champ. Loved it.
  • Alveona
    Alveona over 2 years
    thanks mate, much easier than the answers above