JPA Select latest instance for each item

39,536

Solution 1

I think I've got it with this query.

select m from Meeting m 
    where m.meetingDate = 
        (select max(m1.meetingDate) 
            from Meeting m1 
            where m1.attendee = m.attendee )
    and not exists 
        (select m2 from Meeting m2 
            where m2.attendee = m.attendee 
            and m2.meetingDate > m.meetingDate)

Solution 2

In SQL the solution is very simple - join the table with a subquery, which gives you the most recent meeting for each attendee:

select * from Meeting ALL
join ( select max(meetingDate) as newest, attendee
from Meeting group by attendee ) LATEST
on ALL.meetingDate = LATEST.newest AND ALL.attendee = LATEST.attendee

This works, and works fast!

The problem with JPA is that it (or most implementations) won't allow a subquery for a join. After spending several hours trying what will compile in the first place, and then, how slow it is, I decided that I hate JPA. Solutions like the ones above - like EXISTS (SELECT .. ) or IN ( SELECT .. ) - take ages to execute, orders of magnitude slower than they should.

Having a solution that works meant that I just needed to access that solution from JPA. There are two magic words in SQL that help you do just that:

CREATE VIEW

and the life becomes so much simpler... Just define such entity and use it. Caution: it's read-only.

Of course, any JPA purists will look down on you when you do that, so if anyone has a pure JPA solution, please let us both know!

Solution 3

Well in SQL that would be quite simple I think, so I assume that can be mapped to JPA:

SELECT m.AttendeeId, MAX(m.MeetingDate) from Meeting m GROUP BY m.AttendeeId

Edit: If you need the messageId itself as well you can do that with a simple subquery that returns the messageId for a message where the other two values are equal. Just make sure you handle the case where there are several messageIds for the same Attendee and Date (eg pick the first result since they should all be equally good - although I'd doubt that such data even makes sense for meetings)

Solution 4

Plain SQL

As Bulba has said appropriate way is to join a subquery with group by.

JPA, JPQL

The problem is that you can't join a subquery.

Here is a workaround.

Lets see what you get in the subquery with group by. You get a list of pairs (attendee_id, max(meeting_date)). This pair is like a new unique id for row with max date you want to join on. Then note that each row in the table forms a pair (attendee_id, meeting_date). So every row has an id as a pair (attendee_id, meeting_date). Lets take a row if only it forms an id that belongs to list received in the subquery.

For simplicity lets represent this id-pair as a concatenation of attendee_id and meeting_date: concat(attendee_id, meeting_date).

Then the query in SQL(similarly for JPQL and JPA CriteriaBuilder) would be as follows:

SELECT * FROM meetings 
WHERE concat(attendee_id, meeting_date) IN
(SELECT concat(attendee_id, max(meeting_date)) FROM meetings GROUP BY attendee_id)

Note that there is only one subquery per query, not one subquery for each row like in some answers.

Afraid of comparing strings?

We have a special offer for you!

Lets encode that id-pair to number. It will be a sum of attendee_id and meeting_date but with modifications to ensure uniqueness of code. We can take number representation of date as Unix time. We will fix the value of max date that our code can capture because final code has max value limit (e.g. bigint(int8)<263). Lets take for convenience max date as 2149-06-07 03:00:00. It equals 5662310400 in seconds and 65536 in days. I will assume here that we need precision for date in days(so we ignore hours and below). To construct unique code we can interpret it as a number in a numerical system with base of 65536. The last symbol(number from 0 to 216-1) in or code in such numerical system is number of days. Other symbols will capture attendee_id. In such interpretation code looks like XXXX, where each X is in range [0,216-1] (to be more accurate, first X is in range [0,215-1] because of 1 bit for sign), first three X represents attendee_id and last X represents meeting_date. So the max value of attendee_id our code can capture is 247-1. The code can be computed as attendee_id*65536 + "date in days".

In postgresql it will be:

attendee_id*65536 + date_part('epoch', meeting_date)/(60*60*24)

Where date_part returns date in seconds which we convert to days by dividing on constant.

And final query to get the latest meetings for all attendees:

SELECT * FROM meetings
WHERE attendee_id*65536 + date_part('epoch', meeting_date)/(60*60*24)
IN (SELECT attendee_id*65536 + date_part('epoch', max(meeting_date))/(60*60*24) from meetings GROUP BY attendee_id);

Benchmarking

I have created a table with stucture as in the question and populated it with 100000 rows randomly selecting attendee_id from [1, 10000] and random date from range [1970-01-01, 2017-09-16]. I have benchmarked (with EXPLAIN ANALYZE) queries with the following techniques:

  1. Correlated subquery

    SELECT * FROM meetings m1 WHERE m1.meeting_date=
    (SELECT max(m2.meeting_date) FROM meetings m2 WHERE m2.attendee_id=m1.attendee_id);
    

    Execution time: 873260.878 ms

  2. Join subquery with group by

    SELECT * FROM meetings m
    JOIN (SELECT attendee_id, max(meeting_date) from meetings GROUP BY attendee_id) attendee_max_date
    ON attendee_max_date.attendee_id = m.attendee_id;</code>
    

    Execution time: 103.427 ms

  3. Use pair (attendee_id, date) as a key

    • Concat attendee_id and meeting_date as strings

      SELECT * FROM meetings WHERE concat(attendee_id, meeting_date) IN
      (SELECT concat(attendee_id, max(meeting_date)) from meetings GROUP BY attendee_id);
      

      Execution time: 207.720 ms

    • Encode attendee_id and meeting_date to a single number(code)

      SELECT * FROM meetings
      WHERE attendee_id*65536 + date_part('epoch',meeting_date)/(60*60*24)
      IN (SELECT attendee_id*65536 + date_part('epoch',max(meeting_date))/(60*60*24) from meetings GROUP BY attendee_id);
      

      Execution time: 127.595 ms

Here is a git with table scheme, table data (as csv), code for populating table, and queries.

Share:
39,536

Related videos on Youtube

digitaljoel
Author by

digitaljoel

Professional Java programmer since 2000 working with Spring, JPA, hibernate, maven, and other related technologies. Also an AKKI Kenpo black belt.

Updated on July 09, 2022

Comments

  • digitaljoel
    digitaljoel almost 2 years

    Let's say I have a Meeting entity. Each meeting has a single attendee and a meeting date. Within my meeting table I may have multiple meetings for each attendee, with different dates for each. I need a JPA query that will select only the latest meeting for all attendees. For instance, if my table looks like this

    Meeting ID | Attendee ID | Meeting Date
    1          | 1           |  6/1/2011
    2          | 2           |  6/1/2011
    3          | 1           |  6/6/2011
    4          | 3           |  6/6/2011
    

    My result should be

    Meeting ID | Attendee ID | Meeting Date
    2          | 2           |  6/1/2011
    3          | 1           |  6/6/2011
    4          | 3           |  6/6/2011
    

    Using JPA 2 against postgres. Meeting has 1-1 to attendee and a simple timestamp date. I suspect I'm going to need to do a group by and max(blah) and maybe a join to myself, but I'm not sure of the best way to approach this.

    Update: After spending the evening playing with this, I still do not have an acceptable JPQL solution to this. Here is what I have so far:

    select m from Meeting m 
    where m.meetingDate in 
        ( select max(meet.meetingDate) 
          from Meeting meet group by meet.attendee )
    

    I have various other conditions that are not relevant to this question, like filtering by attendee department and whatnot. The only reason this works is because we are tracking meeting date to the second (or finer) and the chance that there will be two meetings at exactly the same time is minimal. We are putting some java stuff around it to keep only hte last meeting for each attendee just in case we do get two at the same time, but that's a pretty crappy solution. It really shouldn't be too difficult to get it all in a query, but I have yet to figure it out.

    Update2: Adding sql tag because if I need to use sql to create a view and create a JPA object to map to the view I'm ok with that.

    • toto2
      toto2 almost 13 years
      You might want to add the sql tag.
    • digitaljoel
      digitaljoel almost 13 years
      it's not sql, it's jpql.
    • toto2
      toto2 almost 13 years
      Well, maybe people with SQL knowledge would be helpful too. It's the same nuts and bolts underneath. And there are probably many people watching the SQL tag.
    • digitaljoel
      digitaljoel almost 13 years
      Yeah, I can see your point. I didn't want to mislead people into thinking I was looking for sql since there's enough of a difference in the two that sql answers may not be as helpful.
  • digitaljoel
    digitaljoel almost 13 years
    that will get me the most recent meeting among all attendees. I need the most recent meeting for each attendee.
  • digitaljoel
    digitaljoel almost 13 years
    yeah, that's what I was thinking. In JPQL I'll be selecting the Meeting entity, which to me (not a JPQL GURU) says that the max(m.meetingDate) logic must be in the where clause or joined somehow. It's the "somehow" that is throwing me.
  • Sean Patrick Floyd
    Sean Patrick Floyd almost 13 years
    Looks exactly right (except that in JPQL, you'll probably query for m.attendee, not m.attendeeId)
  • toto2
    toto2 almost 13 years
    @digitaljoel You seem to think that you cannot have the MAX() right after SELECT. I'm not 100% sure, but I think it's valid.
  • Voo
    Voo almost 13 years
    I assumed that "Meeting" was the name of the table containing all the information seen in your post in the first table. MeetingID being its primary key. I don't see why that wouldn't work since JPA does have a group by statement (quick googling on my part) - which also means that MAX() better be allowed in the select part (otherwise group by is completely useless)
  • digitaljoel
    digitaljoel almost 13 years
    Yep, I could have Max right after the select, but I don't really want to select the max. I want to select the Meeting entity, which will contain the attendee and meeting date due to its associations. That's why I was looking for a clause that would simply filter out the other entries instead of selecting individual fields. Something like "select m from Meeting m group by m.attendee having max(m.meetingDate)" I'll have to give that a try when I get home to make sure it takes the max meeting date for each attendee, not the absolute max meeting date.
  • Voo
    Voo almost 13 years
    @digitaljoel Well that is a simple subquery as soon as you have the attendeeId and the meetingDate (just decide what you do if there are several MeetingIDs for the same other values; probably pick the first)
  • digitaljoel
    digitaljoel almost 13 years
    alright, I'll have a look at it tonight and see if I can make it work with this information. Thanks for the help.
  • Jarek Przygódzki
    Jarek Przygódzki over 8 years
    I understand semi-join, but what is anti-join for ?
  • digitaljoel
    digitaljoel over 8 years
    Great question. 5 years on and it sure looks redundant now doesn't it. Not sure I have that source handy anymore to figure out why I needed that.