Get top 1 row of each group

785,995

Solution 1

;WITH cte AS
(
   SELECT *,
         ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
   FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect two entries per day, this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead.
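As a minimal sketch of the DENSE_RANK variant, the batch below uses a hypothetical table variable with a deliberate tie on DateCreated. The sample rows and the column name rnk are assumptions made for illustration, and the multi-row VALUES syntax assumes SQL Server 2008 or later.

```sql
-- Hypothetical sample data: document 1 has two log rows with the same DateCreated.
DECLARE @DocumentStatusLogs TABLE
(
    ID          int      NOT NULL,
    DocumentID  int      NOT NULL,
    Status      char(2)  NOT NULL,
    DateCreated datetime NOT NULL
);

INSERT INTO @DocumentStatusLogs (ID, DocumentID, Status, DateCreated)
VALUES (1, 1, 'S1', '20110729'),
       (2, 1, 'S2', '20110802'),
       (3, 1, 'S3', '20110802'),  -- ties with ID 2 on DateCreated
       (4, 2, 'S1', '20110801');

-- The previous statement ends with a semicolon, so no leading ";" is needed before WITH.
WITH cte AS
(
    SELECT *,
           DENSE_RANK() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rnk
    FROM @DocumentStatusLogs
)
SELECT ID, DocumentID, Status, DateCreated
FROM cte
WHERE rnk = 1;  -- returns IDs 2, 3 and 4 (both tied rows for document 1 are kept)
```

With ROW_NUMBER() in place of DENSE_RANK(), only one of IDs 2 and 3 would come back, and which one is not defined unless the ORDER BY includes a tie-breaker.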

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or you could drop this status history table.

Solution 2

I just learned how to use cross apply. Here's how to use it in this scenario:

 select d.DocumentID, ds.Status, ds.DateCreated 
 from Documents as d 
 cross apply 
     (select top 1 Status, DateCreated
      from DocumentStatusLogs 
      where DocumentID = d.DocumentID
      order by DateCreated desc) as ds
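Note that CROSS APPLY drops any document that has no rows in the log table. If those documents should still appear, OUTER APPLY is the usual alternative. The sketch below uses hypothetical table variables and sample rows (multi-row VALUES assumes SQL Server 2008 or later). It also shows that the where clause inside the applied subquery is what ties each log row to its document.

```sql
-- Hypothetical sample data: document 3 has no status log rows.
DECLARE @Documents TABLE
(
    DocumentID int         NOT NULL PRIMARY KEY,
    Title      varchar(20) NOT NULL
);

DECLARE @DocumentStatusLogs TABLE
(
    ID          int      NOT NULL PRIMARY KEY,
    DocumentID  int      NOT NULL,
    Status      char(2)  NOT NULL,
    DateCreated datetime NOT NULL
);

INSERT INTO @Documents (DocumentID, Title)
VALUES (1, 'TitleA'), (2, 'TitleB'), (3, 'TitleC');

INSERT INTO @DocumentStatusLogs (ID, DocumentID, Status, DateCreated)
VALUES (1, 1, 'S1', '20110729'),
       (2, 1, 'S2', '20110730'),
       (3, 2, 'S1', '20110728');

SELECT d.DocumentID, ds.Status, ds.DateCreated
FROM @Documents AS d
OUTER APPLY
    (SELECT TOP 1 l.Status, l.DateCreated
     FROM @DocumentStatusLogs AS l
     WHERE l.DocumentID = d.DocumentID   -- correlation to the outer row
     ORDER BY l.DateCreated DESC) AS ds;
-- Document 3 is returned with NULL Status and DateCreated;
-- with CROSS APPLY it would be left out of the result.
```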

Solution 3

I know this is an old thread, but the TOP 1 WITH TIES solution is quite nice and might be helpful to some reading through the solutions.

select top 1 with ties
   DocumentID
  ,Status
  ,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)

The select top 1 with ties clause tells SQL Server to return the first row of the ordered result plus every row that ties with it on the order by expression. But how does that give one row per group? This is where order by row_number() over (partition by DocumentID order by DateCreated desc) comes in. The column or columns after partition by define how SQL Server groups the data. Within each group, the rows are numbered according to the order by columns, so the top row of every group gets the value 1. All of those rows tie for first place, so the top row in each group is returned by the query.

More about the TOP clause can be found here.
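To see why top 1 with ties ends up returning one row per group, it can help to look at the row_number() values directly. The sketch below loads the question's sample rows into a hypothetical table variable (multi-row VALUES assumes SQL Server 2008 or later) and lists the value that the order by expression produces for each row.

```sql
-- Sample rows from the question, held in a hypothetical table variable.
DECLARE @DocumentStatusLogs TABLE
(
    DocumentID  int      NOT NULL,
    Status      char(2)  NOT NULL,
    DateCreated datetime NOT NULL
);

INSERT INTO @DocumentStatusLogs (DocumentID, Status, DateCreated)
VALUES (1, 'S1', '20110729'),
       (1, 'S2', '20110730'),
       (1, 'S1', '20110802'),
       (2, 'S1', '20110728'),
       (2, 'S2', '20110730'),
       (2, 'S3', '20110801'),
       (3, 'S1', '20110802');

-- Expose the value that TOP 1 WITH TIES sorts on.
SELECT DocumentID, Status, DateCreated,
       ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM @DocumentStatusLogs
ORDER BY rn, DocumentID;
-- Three rows have rn = 1: (1, S1, 2011-08-02), (2, S3, 2011-08-01), (3, S1, 2011-08-02).
-- They all tie for first place, so TOP 1 WITH TIES returns exactly these three.
```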

Solution 4

I've done some timings over the various recommendations here, and the results really depend on the size of the table involved, but the most consistent solution is using CROSS APPLY. These tests were run against SQL Server 2008 R2, using a table with 6,500 records and another (identical schema) with 137 million records. The columns being queried are part of the primary key on the table, and the table width is very small (about 30 bytes). The times are reported by SQL Server from the actual execution plan.

Query                                   Time for 6,500 rows (ms)   Time for 137M rows (ms)

CROSS APPLY                                       17.9                      17.9
SELECT WHERE col = (SELECT MAX(COL)…)              6.6                     854.4
DENSE_RANK() OVER PARTITION                        6.6                     907.1

I think the really amazing thing was how consistent the time was for the CROSS APPLY regardless of the number of rows involved.
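Timings like these depend heavily on the indexes available. In the tests above the queried columns were part of the primary key. As a rough sketch of a comparable setup for the question's table, the batch below creates a temporary copy of the schema with a supporting index. The column types and the index name are assumptions, not something taken from the tests, and whether the optimizer actually uses the index should be checked in the execution plan.

```sql
-- Hypothetical schema based on the question's DocumentStatusLogs table.
CREATE TABLE #DocumentStatusLogs
(
    ID          int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    DocumentID  int      NOT NULL,
    Status      char(2)  NOT NULL,
    DateCreated datetime NOT NULL
);

-- Keyed on the grouping column, then the ordering column (newest first).
-- Status is included so the "latest status" queries can be answered from the index alone.
CREATE NONCLUSTERED INDEX IX_DocumentStatusLogs_DocumentID_DateCreated
    ON #DocumentStatusLogs (DocumentID, DateCreated DESC)
    INCLUDE (Status);

DROP TABLE #DocumentStatusLogs;
```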

Solution 5

If you're worried about performance, you can also do this with MAX():

SELECT *
FROM DocumentStatusLogs D
WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE DocumentID = D.DocumentID)

ROW_NUMBER() requires a sort of all the rows in your SELECT statement, whereas MAX() does not. This can speed up your query considerably, although the gain depends on the available indexes and on your data, so test both.
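One caveat with comparing on MAX(DateCreated): if two rows in a group share the latest DateCreated, both are returned. A possible way around this, assuming ID is unique, is to correlate on ID and break ties explicitly. The sketch below uses a hypothetical table variable and sample rows (multi-row VALUES assumes SQL Server 2008 or later).

```sql
-- Hypothetical sample data: document 1 has two rows with the same latest DateCreated.
DECLARE @DocumentStatusLogs TABLE
(
    ID          int      NOT NULL PRIMARY KEY,
    DocumentID  int      NOT NULL,
    Status      char(2)  NOT NULL,
    DateCreated datetime NOT NULL
);

INSERT INTO @DocumentStatusLogs (ID, DocumentID, Status, DateCreated)
VALUES (1, 1, 'S1', '20110729'),
       (2, 1, 'S2', '20110802'),
       (3, 1, 'S3', '20110802'),  -- ties with ID 2 on DateCreated
       (4, 2, 'S1', '20110801');

SELECT D.*
FROM @DocumentStatusLogs AS D
WHERE D.ID = (SELECT TOP 1 L.ID
              FROM @DocumentStatusLogs AS L
              WHERE L.DocumentID = D.DocumentID
              ORDER BY L.DateCreated DESC, L.ID DESC);  -- highest ID wins a tie
-- Returns IDs 3 and 4: exactly one row per DocumentID.
```

The same tie-breaker (ORDER BY DateCreated DESC, ID DESC) can be used inside ROW_NUMBER() or the CROSS APPLY subquery.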

Author: kazinix

Updated on February 13, 2022

Comments

  • kazinix
    kazinix over 2 years

    I have a table from which I want to get the latest entry for each group. Here's the table:

    DocumentStatusLogs Table

    |ID| DocumentID | Status | DateCreated |
    | 2| 1          | S1     | 7/29/2011   |
    | 3| 1          | S2     | 7/30/2011   |
    | 6| 1          | S1     | 8/02/2011   |
    | 1| 2          | S1     | 7/28/2011   |
    | 4| 2          | S2     | 7/30/2011   |
    | 5| 2          | S3     | 8/01/2011   |
    | 6| 3          | S1     | 8/02/2011   |
    

    The table will be grouped by DocumentID and sorted by DateCreated in descending order. For each DocumentID, I want to get the latest status.

    My preferred output:

    | DocumentID | Status | DateCreated |
    | 1          | S1     | 8/02/2011   |
    | 2          | S3     | 8/01/2011   |
    | 3          | S1     | 8/02/2011   |
    
    • Is there any aggregate function to get only the top from each group? See pseudo-code GetOnlyTheTop below:

      SELECT
        DocumentID,
        GetOnlyTheTop(Status),
        GetOnlyTheTop(DateCreated)
      FROM DocumentStatusLogs
      GROUP BY DocumentID
      ORDER BY DateCreated DESC
      
    • If such a function doesn't exist, is there any way I can achieve the output I want?

    • Or, in the first place, could this be caused by an unnormalized database? I'm thinking that, since what I'm looking for is just one row, maybe that status should also be stored in the parent table?

    Please see the parent table for more information:

    Current Documents Table

    | DocumentID | Title  | Content  | DateCreated |
    | 1          | TitleA | ...      | ...         |
    | 2          | TitleB | ...      | ...         |
    | 3          | TitleC | ...      | ...         |
    

    Should the parent table be like this so that I can easily access its status?

    | DocumentID | Title  | Content  | DateCreated | CurrentStatus |
    | 1          | TitleA | ...      | ...         | S1            |
    | 2          | TitleB | ...      | ...         | S3            |
    | 3          | TitleC | ...      | ...         | S1            |
    

    UPDATE: I just learned how to use "apply", which makes it easier to address such problems.

  • kazinix
    kazinix almost 13 years
    And... What is Partition By? With is new to me also :( I'm using mssql 2005 anyway.
  • ZygD
    ZygD almost 13 years
    @domanokz: Partition By resets the count. So in this case, it says to count per DocumentID
  • ZygD
    ZygD almost 13 years
    The clue was in the title: MSSQL. SQL Server does not have USING but the idea is OK.
  • kazinix
    kazinix almost 13 years
    Hm, I worry about the performance, I'll be querying millions of rows. Does SELECT * FROM (SELECT ...) affect the performance? Also, is ROW_NUMBER some kind of a subquery for each row?
  • ZygD
    ZygD almost 13 years
    @domanokz: no, it's not a subquery. If you have correct indexes then millions shouldn't be a problem. There are only 2 set based ways anyway: this and the aggregate (Ariel's solution). So try them both...
  • kazinix
    kazinix almost 13 years
    Would you mind looking at my question again? I've edited it and added the ID column to the DocumentStatusLogs table. I think it might help us to optimize the query. Thanks!
  • ZygD
    ZygD almost 13 years
    @domanokz: Just change ORDER BY DateCreated DESC to ORDER BY ID DESC
  • BitwiseMan
    BitwiseMan almost 12 years
    Doesn't this give you the date the document was created, not the date the status was created?
  • kazinix
    kazinix almost 12 years
    That actually makes no difference since the issue is still addressed.
  • Kristoffer L
    Kristoffer L over 10 years
    Cannot performance issues with ROW_NUMBER() be addressed with proper indexing? (I feel that should be done anyhow)
  • dbd
    dbd over 10 years
    This is great, I'm used to using a subquery for this task. I find this solution much more appealing.
  • theSpyCry
    theSpyCry over 9 years
    This is so simple and effective. Much more efficient than some subqueries. Thank you!
  • John Fairbanks
    John Fairbanks over 9 years
    I just posted the results of my timing tests against all of the proposed solutions and yours came out on top. Giving you an up vote :-)
  • NickG
    NickG almost 9 years
    @gbn The stupid moderators usually delete important keywords from titles, as they have done here. Making it very difficult to find the correct answers in search results or Google.
  • TamusJRoyce
    TamusJRoyce over 8 years
    +1 for huge speed improvement. This is much faster than a windowing function such as ROW_NUMBER(). It would be nice if SQL recognized ROW_NUMBER() = 1 like queries and optimized them into Applies. Note: I used OUTER APPLY as I needed results, even if they didn't exist in the apply.
  • TamusJRoyce
    TamusJRoyce over 8 years
    With datetime, you cannot guarantee two entries won't be added on the same date and time. Precision isn't high enough.
  • TamusJRoyce
    TamusJRoyce over 8 years
    Unfortunately MaxDate is not unique. It is possible to have two dates entered at the same exact time. So this can result in duplicates per group. You can, however, use an identity column or GUID. Identity Column would get you the latest one that's been entered (default identity calc being used, 1...x step 1).
  • SalientBrain
    SalientBrain over 8 years
    I didn't get an improvement, but the idea is interesting.
  • Martin Smith
    Martin Smith about 8 years
    @TamusJRoyce you can't extrapolate that just because it was faster once this is always the case. It depends. As described here: sqlmag.com/database-development/optimizing-top-n-group-queries
  • TamusJRoyce
    TamusJRoyce about 8 years
    My comment is about having multiple rows, and only desiring one of those multiple rows per group. Joins are for when you want one to many. Applies are for when you have one to many, but want to filter out all except a one to one. Scenario: For 100 members, give me each their best phone number (where each could have several numbers). This is where Apply excels. Less reads = less disk access = better performance. Given my experience is with poorly designed non-normalized databases.
  • TamusJRoyce
    TamusJRoyce about 8 years
    @MartinSmith From your article, "Some solutions work well only when the right indexes are available, but without those indexes the solutions perform badly." - Great point! The above scenario is when you are able to view execution plan and add indexes where needed. If you are not able to add indexes, you will need to do a case-by-case test (which you should probably do anyways).
  • Vladimir Baranov
    Vladimir Baranov over 7 years
    It all depends on the data distribution and available indexes. It was discussed at great lengths on dba.se.
  • rich s
    rich s about 7 years
    Well I kind of agree, but the author asked for the latest entry - which unless you include an auto-increment identity column means two items added at exactly the same time are equally 'the latest'
  • cibercitizen1
    cibercitizen1 about 7 years
    +1 for simplicity. @TamusJRoyce is right. What about? 'select * from DocumentStatusLog D where ID = (select ID from DocumentsStatusLog where D.DocumentID = DocumentID order by DateCreated DESC limit 1);'
  • Trevor Nestman
    Trevor Nestman almost 7 years
    This has to be black magic. This helped me to find the most recent entry and the first entry for each resource returned. Very useful.
  • MoonKnight
    MoonKnight over 6 years
    Just to point out that this "solution" can still give you multiple records if you have a tie on the max(DateCreated)
  • TamusJRoyce
    TamusJRoyce over 6 years
    Latest record will be one record. So yes. You need to consider the auto-increment identity column.
  • ufo
    ufo over 6 years
    This doesn't work in SQL Server 2008 R2. I think first_value was introduced in 2012!
  • Arun Prasad E S
    Arun Prasad E S over 6 years
    SELECT * FROM EventScheduleTbl D WHERE DatesPicked = (SELECT top 1 min(DatesPicked) FROM EventScheduleTbl WHERE EventIDf = D.EventIDf and DatesPicked>= convert(date,getdate()) )
  • Adam Wells
    Adam Wells almost 6 years
    Thank you! This is a very slick solution to this sort of problem! +1 Glad I found this answer, saved me about two hours of pain.
  • pim
    pim almost 6 years
    There are definitely cases where this will outperform row_number() even with proper indexing. I find it especially valuable in self-join scenarios. The thing to be cognizant of though, is that this method will often yield a higher number of both logical reads and scan counts, despite reporting a low subtree cost. You'll need to weigh the cost/benefits in your particular case to determine if it's actually better.
  • George Menoutis
    George Menoutis over 5 years
    This is the most elegant solution imo
  • Caltor
    Caltor over 5 years
    What is the starting semicolon for?
  • Andreas Reiff
    Andreas Reiff about 5 years
    @Caltor I left the ; when having above in a larger SQL statement and got the following error: 'Incorrect syntax near the keyword 'with'. If this statement is a common table expression, an xmlnamespaces clause or a change tracking context clause, the previous statement must be terminated with a semicolon.' A GO just before - or no other statement - works as well.
  • MattSlay
    MattSlay almost 5 years
    Very fast! I was using the Cross Apply solution offered by @dpp, but this one is waaaay faster.
  • Scott
    Scott over 4 years
    In my case, this approach was SLOWER than using ROW_NUMBER(), due to the introduction of a subquery. You should test different approaches to see what performs best for your data.
  • Helen Araya
    Helen Araya over 4 years
    @dpp your answer is not giving me one row per group. It is returning the whole group. Am I missing something?
  • Chris Umphlett
    Chris Umphlett over 4 years
    agreed - this best replicates what is very easy to do in other versions of SQL and other languages imo
  • Suraj Kumar
    Suraj Kumar over 4 years
    You should always describe how your SQL statement works and how it solves the OP's query.
  • Krishna Gupta
    Krishna Gupta about 4 years
    Super old. But super gold!
  • Extragorey
    Extragorey almost 4 years
    This works well when you already have a separate Documents table that gives one row per group, as desired in the output. But if you're only working with the one table (DocumentStatusLogs in this case), you'd first have to do some sort of DISTINCT operation on DocumentID (or ROW_NUMBER(), MAX(ID), etc.), losing all that gained performance.
  • Extragorey
    Extragorey almost 4 years
    For large numbers of columns (Status, DateCreated, etc.), does this do a separate partition/sort for each column, or does it get optimised into one?
  • Extragorey
    Extragorey almost 4 years
    This is just going to return everything in the table.
  • N8allan
    N8allan almost 4 years
    I agree that this is an elegant solution. In my particular query and on SQL Server 2019 this was twice as slow as the cross apply with top 1 solution, but measure for yourself.
  • mpn275
    mpn275 almost 4 years
    Wish I could upvote more than once. I have returned to this answer about 7.000 times already. There might come a day, when I take the time to understand this, so I wouldn't have to come back. But it is not this day.
  • mario ruiz
    mario ruiz over 3 years
    Thanks for the different solutions proposed. I went with the second one and it saved me today, man!
  • TK Bruin
    TK Bruin over 3 years
    Hmm, 'With Ties' might cause more rows to be returned than the value specified in expression (TOP 1). If the OP wants only 1, then you need to remove this phrase, right?
  • Josh Gilfillan
    Josh Gilfillan over 3 years
    @TKBruin that is why the order by row_number() is required. This allows the top record per partition to be retrieved.
  • user3341592
    user3341592 over 3 years
    Used in my context, the CTE solution is much quicker than the CROSS APPLY: a couple of seconds in the first case (less than 10) vs 56 mins in the second one. That's appreciable!
  • faheem khan
    faheem khan over 3 years
    Remove the Using (in MS SQL) and complete the Join code, then it would work.
  • Pedro Ludovico Bozzini
    Pedro Ludovico Bozzini over 3 years
    I have a 100M-row table where I needed to get both the first and the last record for each group. The first two approaches took several minutes to execute. Approach 3 took less than a second.
  • PedroC88
    PedroC88 about 3 years
    Is this t-sql? Using isn't supported like that...
  • Union find
    Union find about 3 years
    mysql 8 should support @PedroC88
  • PedroC88
    PedroC88 about 3 years
    Yeah I mention it because the OP specified sql-server
  • Union find
    Union find about 3 years
    @PedroC88 the question seems to have been changed so that it doesn't reference sql-server anymore. so this is an OK answer.
  • PedroC88
    PedroC88 about 3 years
    It’s on the tags
  • Union find
    Union find about 3 years
    @PedroC88 If you look at the other comments, it was originally a question specific to that platform but the question changed. Downvoting in this case misses the mark.
  • Turab
    Turab about 3 years
    This works very well. But keep in mind that if you need to preserve the select results even if APPLY returns empty, then you need to use OUTER APPLY rather than CROSS APPLY.
  • yuliansen
    yuliansen almost 3 years
    I just heard about cross apply, and the practical use of it is kind of confusing. Some people compare it with inner join. My question is: inner join states the connectors, or PK to FK, but cross apply doesn't. How do we group with cross apply? Does it automatically detect the same column name? This thread might be outdated but I really want to know. Thank you for the knowledge @dpp
  • Lonnie Best
    Lonnie Best almost 3 years
    I up-voted this answer for its compliance with the SQL standard: this approach also works in databases that are not SQL Server.
  • jarlh
    jarlh over 2 years
    Syntax error. And will not return the row having the latest timestamp (for each id.)
  • Sergey Nudnov
    Sergey Nudnov over 2 years
    @Extragorey, totally agree. I was having a query for most recent results for a device-command pairs from the 'results' table hanging intermittently - for indefinite time. And the main problem was a full scan of the 'results' table to obtain these pairs. So I just made a new table with device and command columns and unique primary key on both, and then applied dpp's solution. It worked like a charm!
  • Charlieface
    Charlieface over 2 years
    How is this different from @JoshGilfillan 's answer stackoverflow.com/a/48412942/14868997
  • Reversed Engineer
    Reversed Engineer over 2 years
    Thank you for this really comprehensive answer! Deserves many more votes, although it hasn't been around for as long as the others.
  • Marcos J.D Junior
    Marcos J.D Junior about 2 years
    The OP tagged MS-SQL, not MySQL
  • Matt
    Matt about 2 years
    I come back to having to do this every so often and still use this method.
  • Jürgen Zornig
    Jürgen Zornig about 2 years
    This deserves to be the best answer... its speed is absolutely comparable to using a CTE with a window function, but it's so much more maintainable... I have hundreds of satellite tables in my DataVault models and with this solution I don't have to retype the attribute projection again and again for each table to get a view on its most recent entry. Also, this solution is often faster than joining to the PIT table to get the latest entries. Truly a gamechanger for me
  • niico
    niico about 2 years
    This doesn't seem to work in sql server?! I get the error 'invalid column name rn'