Select DISTINCT on JPA

Solution 1

To answer your question, the JPQL query you wrote is just fine:

SELECT DISTINCT c.currencyName, c.alphabeticCode, c.numericCode
FROM Currency c 
WHERE c.alphabeticCode IN ('EUR','GBP','USD','JPY') 
ORDER BY c.currencyName

And it should translate to the SQL statement you are expecting:

select distinct currency_name, alphabetic_code, numeric_code 
from currency 
where ALPHABETIC_CODE IN ('USD','EUR','JPY','GBP') 
order by currency_name;

DISTINCT has two meanings in JPA, depending on whether the underlying JPQL or Criteria API query is a scalar query or an entity query.

Scalar queries

For scalar queries, which return a scalar projection, like the following query:

List<Integer> publicationYears = entityManager
.createQuery(
    "select distinct year(p.createdOn) " +
    "from Post p " +
    "order by year(p.createdOn)", Integer.class)
.getResultList();

LOGGER.info("Publication years: {}", publicationYears);

The DISTINCT keyword should be passed to the underlying SQL statement because we want the DB engine to filter duplicates prior to returning the result set:

SELECT DISTINCT
    extract(YEAR FROM p.created_on) AS col_0_0_
FROM
    post p
ORDER BY
    extract(YEAR FROM p.created_on)

-- Publication years: [2016, 2018]
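
As a rough in-memory analogy (a sketch only, not what the JPA provider executes), the scalar query above asks the database to project the year, drop duplicate values, and sort them. The class name and the sample dates are made up for illustration.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.stream.Collectors;

final class ScalarDistinctSketch {

    // Mirrors "select distinct year(p.createdOn) ... order by year(p.createdOn)",
    // but on a plain list instead of a database table.
    static List<Integer> distinctYears(List<LocalDate> createdOnValues) {
        return createdOnValues.stream()
            .map(LocalDate::getYear)   // the scalar projection
            .distinct()                // what SQL DISTINCT does to the projected values
            .sorted()                  // the ORDER BY
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical createdOn values: two posts from 2016, one from 2018.
        List<LocalDate> createdOn = List.of(
            LocalDate.of(2016, 8, 30),
            LocalDate.of(2018, 1, 5),
            LocalDate.of(2016, 11, 12));
        System.out.println("Publication years: " + distinctYears(createdOn));
    }
}
```

With a scalar projection there is nothing for the provider to de-duplicate afterwards, which is why the keyword has to reach the SQL statement.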

Entity queries

For entity queries, DISTINCT has a different meaning.

Without using DISTINCT, a query like the following one:

List<Post> posts = entityManager
.createQuery(
    "select p " +
    "from Post p " +
    "left join fetch p.comments " +
    "where p.title = :title", Post.class)
.setParameter(
    "title", 
    "High-Performance Java Persistence eBook has been released!"
)
.getResultList();

LOGGER.info(
    "Fetched the following Post entity identifiers: {}", 
    posts.stream().map(Post::getId).collect(Collectors.toList())
);

is going to JOIN the post and the post_comment tables like this:

SELECT p.id AS id1_0_0_,
       pc.id AS id1_1_1_,
       p.created_on AS created_2_0_0_,
       p.title AS title3_0_0_,
       pc.post_id AS post_id3_1_1_,
       pc.review AS review2_1_1_,
       pc.post_id AS post_id3_1_0__
FROM   post p
LEFT OUTER JOIN
       post_comment pc ON p.id=pc.post_id
WHERE
       p.title='High-Performance Java Persistence eBook has been released!'

-- Fetched the following Post entity identifiers: [1, 1]

But the parent post records are duplicated in the result set for each associated post_comment row. For this reason, the List of Post entities will contain duplicate Post entity references.
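
The duplication can be pictured with a small plain-Java sketch. It assumes, as a simplification, that the provider adds one list entry per SQL row and reuses a single Post instance per identifier. The Post and JoinedRow types and the review texts are hypothetical stand-ins, not the real entity mapping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class JoinFetchDuplicatesSketch {

    // Hypothetical stand-in for the Post entity.
    static final class Post {
        final long id;
        final List<String> comments = new ArrayList<>();
        Post(long id) { this.id = id; }
    }

    // Hypothetical stand-in for one row of the post/post_comment join.
    record JoinedRow(long postId, long commentId, String review) {}

    // One result entry per row; rows that share a post id reuse the same Post
    // instance, much like a persistence context keeps one instance per identifier.
    static List<Post> materialize(List<JoinedRow> rows) {
        Map<Long, Post> byId = new HashMap<>();
        List<Post> result = new ArrayList<>();
        for (JoinedRow row : rows) {
            Post post = byId.computeIfAbsent(row.postId(), id -> new Post(id));
            post.comments.add(row.review());
            result.add(post);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Post> posts = materialize(List.of(
            new JoinedRow(1L, 1L, "Excellent!"),
            new JoinedRow(1L, 2L, "Great!")));
        // Two entries, both pointing at the same Post object.
        System.out.println(posts.size() + " entries, same instance: "
            + (posts.get(0) == posts.get(1)));
    }
}
```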

To eliminate the duplicate Post entity references, we need to use DISTINCT:

List<Post> posts = entityManager
.createQuery(
    "select distinct p " +
    "from Post p " +
    "left join fetch p.comments " +
    "where p.title = :title", Post.class)
.setParameter(
    "title", 
    "High-Performance Java Persistence eBook has been released!"
)
.getResultList();
 
LOGGER.info(
    "Fetched the following Post entity identifiers: {}", 
    posts.stream().map(Post::getId).collect(Collectors.toList())
);

But then DISTINCT is also passed to the SQL query, and that's not desirable at all:

SELECT DISTINCT
       p.id AS id1_0_0_,
       pc.id AS id1_1_1_,
       p.created_on AS created_2_0_0_,
       p.title AS title3_0_0_,
       pc.post_id AS post_id3_1_1_,
       pc.review AS review2_1_1_,
       pc.post_id AS post_id3_1_0__
FROM   post p
LEFT OUTER JOIN
       post_comment pc ON p.id=pc.post_id
WHERE
       p.title='High-Performance Java Persistence eBook has been released!'
 
-- Fetched the following Post entity identifiers: [1]

Passing DISTINCT to the SQL query makes the database execute an extra Sort phase (the Sort and Unique nodes in the execution plan), which adds overhead without bringing any value: every parent-child combination is already unique, because each row includes the child's primary key column:

Unique  (cost=23.71..23.72 rows=1 width=1068) (actual time=0.131..0.132 rows=2 loops=1)
  ->  Sort  (cost=23.71..23.71 rows=1 width=1068) (actual time=0.131..0.131 rows=2 loops=1)
        Sort Key: p.id, pc.id, p.created_on, pc.post_id, pc.review
        Sort Method: quicksort  Memory: 25kB
        ->  Hash Right Join  (cost=11.76..23.70 rows=1 width=1068) (actual time=0.054..0.058 rows=2 loops=1)
              Hash Cond: (pc.post_id = p.id)
              ->  Seq Scan on post_comment pc  (cost=0.00..11.40 rows=140 width=532) (actual time=0.010..0.010 rows=2 loops=1)
              ->  Hash  (cost=11.75..11.75 rows=1 width=528) (actual time=0.027..0.027 rows=1 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 9kB
                    ->  Seq Scan on post p  (cost=0.00..11.75 rows=1 width=528) (actual time=0.017..0.018 rows=1 loops=1)
                          Filter: ((title)::text = 'High-Performance Java Persistence eBook has been released!'::text)
                          Rows Removed by Filter: 3
Planning time: 0.227 ms
Execution time: 0.179 ms
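
The claim that the SQL-level DISTINCT cannot remove any row here can be checked with a small sketch. It treats each joined row as a value (a hypothetical JoinedRow record) and applies a row-level distinct, which only collapses rows whose columns are all equal.

```java
import java.util.List;

final class DistinctOnJoinedRowsSketch {

    // Hypothetical stand-in for one row of the post/post_comment join.
    // Records compare by value, like rows under SQL DISTINCT.
    record JoinedRow(long postId, long commentId, String review) {}

    // Row-level DISTINCT over all selected columns.
    static long distinctRowCount(List<JoinedRow> rows) {
        return rows.stream().distinct().count();
    }

    // For contrast: DISTINCT over the parent id alone.
    static long distinctPostIdCount(List<JoinedRow> rows) {
        return rows.stream().map(JoinedRow::postId).distinct().count();
    }

    public static void main(String[] args) {
        List<JoinedRow> rows = List.of(
            new JoinedRow(1L, 1L, "Excellent!"),
            new JoinedRow(1L, 2L, "Excellent!"));
        System.out.println(distinctRowCount(rows));    // 2: commentId differs, nothing collapses
        System.out.println(distinctPostIdCount(rows)); // 1: only the parent id is compared
    }
}
```

Because the child primary key is part of every row, the database sorts and compares the rows only to find that they were already unique.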

Entity queries with HINT_PASS_DISTINCT_THROUGH

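To eliminate the Sort phase from the execution plan, we need to use the HINT_PASS_DISTINCT_THROUGH query hint. Note that this is a Hibernate-specific hint, not part of the JPA standard, so it does not apply to other providers such as EclipseLink: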

List<Post> posts = entityManager
.createQuery(
    "select distinct p " +
    "from Post p " +
    "left join fetch p.comments " +
    "where p.title = :title", Post.class)
.setParameter(
    "title", 
    "High-Performance Java Persistence eBook has been released!"
)
.setHint(QueryHints.HINT_PASS_DISTINCT_THROUGH, false)
.getResultList();
 
LOGGER.info(
    "Fetched the following Post entity identifiers: {}", 
    posts.stream().map(Post::getId).collect(Collectors.toList())
);

And now, the SQL query will not contain DISTINCT but Post entity reference duplicates are going to be removed:

SELECT
       p.id AS id1_0_0_,
       pc.id AS id1_1_1_,
       p.created_on AS created_2_0_0_,
       p.title AS title3_0_0_,
       pc.post_id AS post_id3_1_1_,
       pc.review AS review2_1_1_,
       pc.post_id AS post_id3_1_0__
FROM   post p
LEFT OUTER JOIN
       post_comment pc ON p.id=pc.post_id
WHERE
       p.title='High-Performance Java Persistence eBook has been released!'
 
-- Fetched the following Post entity identifiers: [1]

And the Execution Plan is going to confirm that we no longer have an extra Sort phase this time:

Hash Right Join  (cost=11.76..23.70 rows=1 width=1068) (actual time=0.066..0.069 rows=2 loops=1)
  Hash Cond: (pc.post_id = p.id)
  ->  Seq Scan on post_comment pc  (cost=0.00..11.40 rows=140 width=532) (actual time=0.011..0.011 rows=2 loops=1)
  ->  Hash  (cost=11.75..11.75 rows=1 width=528) (actual time=0.041..0.041 rows=1 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        ->  Seq Scan on post p  (cost=0.00..11.75 rows=1 width=528) (actual time=0.036..0.037 rows=1 loops=1)
              Filter: ((title)::text = 'High-Performance Java Persistence eBook has been released!'::text)
              Rows Removed by Filter: 3
Planning time: 1.184 ms
Execution time: 0.160 ms
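
When DISTINCT is not passed to SQL, the duplicate entity references have to be removed in memory after the rows are read. The sketch below shows the general idea, assuming de-duplication by object reference that keeps the original order. It is an illustration with made-up names, not Hibernate's actual implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

final class InMemoryDistinctSketch {

    // Keeps the first occurrence of each object reference and preserves
    // the order in which the references were encountered.
    static <T> List<T> distinctByReference(List<T> entities) {
        Set<T> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        List<T> unique = new ArrayList<>();
        for (T entity : entities) {
            if (seen.add(entity)) {   // add() returns false for a reference already seen
                unique.add(entity);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        Object post = new Object();   // stands in for the single Post instance
        List<Object> fromJoin = List.of(post, post);
        System.out.println(distinctByReference(fromJoin).size()); // 1
    }
}
```

This pass is linear in the number of rows already fetched, whereas the SQL-level DISTINCT made the database sort rows that were unique to begin with.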

Solution 2

The issue shows up when you retrieve the column list (c.currencyName, c.alphabeticCode, c.numericCode, c.minorUnit, c.id), because

  • DISTINCT is applied to all of the columns listed in the select clause together,

and I believe the "id" column is unique for every record in your table. Each row therefore counts as distinct, and you can still get duplicates in the other columns (c.currencyName, c.alphabeticCode, c.numericCode, c.minorUnit).

So in your case DISTINCT operates on the entire row, not on a specific column. If you only want the unique names, select only that column.

If you want uniqueness across more than one column while still returning an id, you can use GROUP BY instead. Every selected item must then either appear in the GROUP BY clause or be wrapped in an aggregate function, so this example groups by the three currency columns and returns the smallest id of each group:

SELECT c.currencyName, c.alphabeticCode, c.numericCode, MIN(c.id)
FROM Currency c 
WHERE c.alphabeticCode IN ('EUR','GBP','USD','JPY')
GROUP BY c.currencyName, c.alphabeticCode, c.numericCode
ORDER BY c.currencyName
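
To see why the id column changes the result, here is a small plain-Java sketch over hypothetical rows (one row per country, so a shared currency appears more than once). The record names and sample data are assumptions for illustration only. The last method shows the grouping idea of keeping a single id per currency.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class CurrencyDistinctSketch {

    // Hypothetical table rows: one per country.
    record CurrencyRow(long id, String country, String currencyName, String alphabeticCode) {}

    // Projections; records compare by value, like rows under DISTINCT.
    record NameAndCode(String currencyName, String alphabeticCode) {}
    record NameCodeAndId(String currencyName, String alphabeticCode, long id) {}

    // DISTINCT over (currencyName, alphabeticCode): countries sharing a currency collapse.
    static List<NameAndCode> distinctWithoutId(List<CurrencyRow> rows) {
        return rows.stream()
            .map(r -> new NameAndCode(r.currencyName(), r.alphabeticCode()))
            .distinct()
            .collect(Collectors.toList());
    }

    // DISTINCT over (currencyName, alphabeticCode, id): the unique id keeps every row apart.
    static List<NameCodeAndId> distinctWithId(List<CurrencyRow> rows) {
        return rows.stream()
            .map(r -> new NameCodeAndId(r.currencyName(), r.alphabeticCode(), r.id()))
            .distinct()
            .collect(Collectors.toList());
    }

    // GROUP BY (currencyName, alphabeticCode), keeping the smallest id of each group.
    static Map<NameAndCode, Long> minIdPerCurrency(List<CurrencyRow> rows) {
        Map<NameAndCode, Long> minIds = new LinkedHashMap<>();
        for (CurrencyRow r : rows) {
            minIds.merge(new NameAndCode(r.currencyName(), r.alphabeticCode()), r.id(), Long::min);
        }
        return minIds;
    }

    public static void main(String[] args) {
        List<CurrencyRow> rows = List.of(
            new CurrencyRow(1L, "Germany", "Euro", "EUR"),
            new CurrencyRow(2L, "France", "Euro", "EUR"),
            new CurrencyRow(3L, "United States", "US Dollar", "USD"));
        System.out.println(distinctWithoutId(rows).size()); // 2
        System.out.println(distinctWithId(rows).size());    // 3
        System.out.println(minIdPerCurrency(rows).size());  // 2
    }
}
```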
Author: Neuromante

Updated on January 12, 2021

Comments

  • Neuromante, over 3 years ago

    I have a table with the ISO 4217 values for currencies (with 6 columns: ID, Country, Currency_Name, Alphabetic_code, Numeric_Code, Minor_Unit).

    I need to get some of the data for the 4 most used currencies, and my "pure" SQL query goes like this:

    select distinct currency_name, alphabetic_code, numeric_code 
    from currency 
    where ALPHABETIC_CODE IN ('USD','EUR','JPY','GBP') 
    order by currency_name;
    

    Which returns a 4-row table with the data I need. So far, so good.

    Now I have to translate this to our JPA xml file, and the problems begin. The query I'm trying to get is like this:

    SELECT DISTINCT c.currencyName, c.alphabeticCode, c.numericCode
    FROM Currency c 
    WHERE c.alphabeticCode IN ('EUR','GBP','USD','JPY') 
    ORDER BY c.currencyName
    

    This returns a list with one row for each country that uses one of those currencies (as if there were no "DISTINCT" in the query). And I'm scratching my head as to why. So the questions would be:

    1) How can I make this query return what the "pure" SQL is giving me?

    2) Why is this query seemingly ignoring my "DISTINCT" clause? There's something I'm missing here, and I don't get what. What's going on, and what am I not getting?

    EDIT: Well, this is getting weirder. Somehow, that JPA query works as intended (Returning 4 rows). I've tried this (As I needed some more info):

    SELECT DISTINCT c.currencyName, c.alphabeticCode, c.numericCode, c.minorUnit, c.id
    FROM Currency c 
    WHERE c.alphabeticCode IN ('EUR','GBP','USD','JPY') 
    ORDER BY c.currencyName
    

    And it seems the ID is messing everything up, as removing it from the query goes back to returning the 4-row table. Adding parentheses does not help.

    By the way, we are using EclipseLink.

  • Neuromante, about 7 years ago
    Ok, now I'm getting it, and after running some more tests, I'm going to blame a typo in some of the queries, as the DISTINCT behaved differently on SQL and JPA. BTW, the GROUP BY clause did not work... I guess I'll leave the ID out for now.