Get records with highest/smallest <whatever> per group
So you want to get the row with the highest OrderField
per group? I'd do it this way:
SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
WHERE t2.GroupId IS NULL
ORDER BY t1.OrderField; // not needed! (note by Tomas)
(EDIT by Tomas: If there are more records with the same OrderField within the same group and you need exactly one of them, you may want to extend the condition:
SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
ON t1.GroupId = t2.GroupId
AND (t1.OrderField < t2.OrderField
OR (t1.OrderField = t2.OrderField AND t1.Id < t2.Id))
WHERE t2.GroupId IS NULL
end of edit.)
In other words, return the row t1
for which no other row t2
exists with the same GroupId
and a greater OrderField
. When t2.*
is NULL, it means the left outer join found no such match, and therefore t1
has the greatest value of OrderField
in the group.
No ranks, no subqueries. This should run fast and optimize access to t2 with "Using index" if you have a compound index on (GroupId, OrderField)
.
Regarding performance, see my answer to Retrieving the last record in each group. I tried a subquery method and the join method using the Stack Overflow data dump. The difference is remarkable: the join method ran 278 times faster in my test.
It's important that you have the right index to get the best results!
Regarding your method using the @Rank variable, it won't work as you've written it, because the values of @Rank won't reset to zero after the query has processed the first table. I'll show you an example.
I inserted some dummy data, with an extra field that is null except on the row we know is the greatest per group:
select * from `Table`;
+---------+------------+------+
| GroupId | OrderField | foo |
+---------+------------+------+
| 10 | 10 | NULL |
| 10 | 20 | NULL |
| 10 | 30 | foo |
| 20 | 40 | NULL |
| 20 | 50 | NULL |
| 20 | 60 | foo |
+---------+------------+------+
We can show that the rank increases to three for the first group and six for the second group, and the inner query returns these correctly:
select GroupId, max(Rank) AS MaxRank
from (
select GroupId, @Rank := @Rank + 1 AS Rank
from `Table`
order by OrderField) as t
group by GroupId
+---------+---------+
| GroupId | MaxRank |
+---------+---------+
| 10 | 3 |
| 20 | 6 |
+---------+---------+
Now run the query with no join condition, to force a Cartesian product of all rows, and we also fetch all columns:
select s.*, t.*
from (select GroupId, max(Rank) AS MaxRank
from (select GroupId, @Rank := @Rank + 1 AS Rank
from `Table`
order by OrderField
) as t
group by GroupId) as t
join (
select *, @Rank := @Rank + 1 AS Rank
from `Table`
order by OrderField
) as s
-- on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField;
+---------+---------+---------+------------+------+------+
| GroupId | MaxRank | GroupId | OrderField | foo | Rank |
+---------+---------+---------+------------+------+------+
| 10 | 3 | 10 | 10 | NULL | 7 |
| 20 | 6 | 10 | 10 | NULL | 7 |
| 10 | 3 | 10 | 20 | NULL | 8 |
| 20 | 6 | 10 | 20 | NULL | 8 |
| 20 | 6 | 10 | 30 | foo | 9 |
| 10 | 3 | 10 | 30 | foo | 9 |
| 10 | 3 | 20 | 40 | NULL | 10 |
| 20 | 6 | 20 | 40 | NULL | 10 |
| 10 | 3 | 20 | 50 | NULL | 11 |
| 20 | 6 | 20 | 50 | NULL | 11 |
| 20 | 6 | 20 | 60 | foo | 12 |
| 10 | 3 | 20 | 60 | foo | 12 |
+---------+---------+---------+------------+------+------+
We can see from the above that the max rank per group is correct, but then the @Rank continues to increase as it processes the second derived table, to 7 and on higher. So the ranks from the second derived table will never overlap with the ranks from the first derived table at all.
You'd have to add another derived table to force @Rank to reset to zero in between processing the two tables (and hope the optimizer doesn't change the order in which it evaluates tables, or else use STRAIGHT_JOIN to prevent that):
select s.*
from (select GroupId, max(Rank) AS MaxRank
from (select GroupId, @Rank := @Rank + 1 AS Rank
from `Table`
order by OrderField
) as t
group by GroupId) as t
join (select @Rank := 0) r -- RESET @Rank TO ZERO HERE
join (
select *, @Rank := @Rank + 1 AS Rank
from `Table`
order by OrderField
) as s
on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField;
+---------+------------+------+------+
| GroupId | OrderField | foo | Rank |
+---------+------------+------+------+
| 10 | 30 | foo | 3 |
| 20 | 60 | foo | 6 |
+---------+------------+------+------+
But the optimization of this query is terrible. It can't use any indexes, it creates two temporary tables, sorts them the hard way, and even uses a join buffer because it can't use an index when joining temp tables either. This is example output from EXPLAIN
:
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
| 1 | PRIMARY | <derived4> | system | NULL | NULL | NULL | NULL | 1 | Using temporary; Using filesort |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 2 | |
| 1 | PRIMARY | <derived5> | ALL | NULL | NULL | NULL | NULL | 6 | Using where; Using join buffer |
| 5 | DERIVED | Table | ALL | NULL | NULL | NULL | NULL | 6 | Using filesort |
| 4 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 6 | Using temporary; Using filesort |
| 3 | DERIVED | Table | ALL | NULL | NULL | NULL | NULL | 6 | Using filesort |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
Whereas my solution using the left outer join optimizes much better. It uses no temp table and even reports "Using index"
which means it can resolve the join using only the index, without touching the data.
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
| 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 6 | Using filesort |
| 1 | SIMPLE | t2 | ref | GroupId | GroupId | 5 | test.t1.GroupId | 1 | Using where; Using index |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
You'll probably read people making claims on their blogs that "joins make SQL slow," but that's nonsense. Poor optimization makes SQL slow.
Tomas
Updated on July 08, 2022Comments
-
Tomas almost 2 years
How to do that?
Former title of this question was "using rank (@Rank := @Rank + 1) in complex query with subqueries - will it work?" because I was looking for solution using ranks, but now I see that the solution posted by Bill is much much better.
Original question:
I'm trying to compose a query that would take last record from each group given some defined order:
SET @Rank=0; select s.* from (select GroupId, max(Rank) AS MaxRank from (select GroupId, @Rank := @Rank + 1 AS Rank from Table order by OrderField ) as t group by GroupId) as t join ( select *, @Rank := @Rank + 1 AS Rank from Table order by OrderField ) as s on t.GroupId = s.GroupId and t.MaxRank = s.Rank order by OrderField
Expression
@Rank := @Rank + 1
is normally used for rank, but for me it looks suspicious when used in 2 subqueries, but initialized only once. Will it work this way?And second, will it work with one subquery that is evaluated multiple times? Like subquery in where (or having) clause (another way how to write the above):
SET @Rank=0; select Table.*, @Rank := @Rank + 1 AS Rank from Table having Rank = (select max(Rank) AS MaxRank from (select GroupId, @Rank := @Rank + 1 AS Rank from Table as t0 order by OrderField ) as t where t.GroupId = table.GroupId ) order by OrderField
Thanks in advance!
-
Andriy M over 12 yearsThis may prove quite useful (for the OP too), but, sadly, answers neither of the two questions asked.
-
Tomas over 12 yearsThanks Bill, that's a good idea how to avoid the ranks, but ... wouldn't the join be slow? The join (without the where clause limitation) would be of much larger size than in my queries. Anyway, thanks for the idea! But I would be also interesting in the original question, i.e. if the ranks would work this way.
-
Tomas over 12 yearsThanks for excellent answer, Bill. However, what if I used
@Rank1
and@Rank2
, one for each subquery? Would that fix the problem? Would that be faster than your solution? -
Bill Karwin over 12 yearsUsing
@Rank1
and@Rank2
would make no difference. -
Tomas over 12 yearsIn which sense you mean it would make no difference? In working/non-working or in efficiency?
-
Bill Karwin over 12 yearsIt would probably make the query work, so you wouldn't need to add the extra subquery to reset the variable. But the query would still be very inefficient.
-
Tomas over 12 yearsBill, now I realized during debugging that your query doesn't work! If there are more records with the same
OrderField
within the same group, it returns all of them. I need exactly one of them (doesn't matter which one). This will never happen in the rank solution (it will work in this case). Maybe you should extend the join condition also to compare ids:left join .. on t1.GroupId = t2.GroupId AND (t1.OrderField < t2.OrderField OR (t1.OrderField = t2.OrderField AND t1.Id < t2.Id))
. This is the easiest fix that came on my mind. -
Bill Karwin over 12 yearsYes, you do have to have some way of resolving ties. I'll point out that in your ranking solution, it may return a single row per group, but it's arbitrary which row is returned. By resolving ties against
id
(or any column that is unique within the group), you can control which row is considered the "greatest". -
Tomas over 12 yearsBill, I can do the same in the ranking solution using
order by
clause :-) PS: please always use notifications like @Tomas otherwise I'm not notified of the comment, thanks. -
Bill Karwin over 12 years@Tomas: It still remains true that the ranking query has terrible optimization and is likely to run orders of magnitude slower than the indexed lookups once your table has a large number of rows.
-
Tomas about 12 yearsJust a note, just not to forget it: it is still question which solution would peform better - the join solution is in principle quadratic, while the rank solution is linear! The same applies to solutions of this similar question. That said, it is possible that the SQL optimizer will make the quadratic problem down to linear, while no more optimizations can be expected from the ranking solution.
-
Bill Karwin about 12 years@Tomas: The only definitive answer is to run tests on a given server with a representative dataset. The bottom-line performance depends not only on big-O analysis, but on server hardware, data types, number of rows, etc.
-
ownking almost 10 yearsThanks for that great solution. I was struggling long time with that problem. For the people who want to add filters for the other fields e.g. "foo" you need to add them to the join condition
... AND t1.foo = t2.foo
to later get the correct results forWHERE ... AND foo='bar'
-
scotru over 9 yearsIf there is a row with OrderField NULL here it seems to get returned (be treated as the greatest). What if OrderField NULL should be treated as the least?
-
Bill Karwin over 9 years@scotru,
ORDER BY COALESCE(OrderField, -1)
-
Martin over 7 yearsHow would you combine
Table
(where are multipleGroupId
) and someOtherTable
whereGroupId
is primary key (=OtherTable
contains one record for everyGroupId
)? -
Martin over 7 yearsDoes
SELECT * FROM OtherTable LEFT JOIN (...answered query...) AS aq USING (GroupId)
look good? -
Bill Karwin over 7 years@Martin, your question appears to be a simple question about joins. It has nothing to do with the question you're commenting on. You should post a new question.
-
Johnny Wong over 7 yearsFrom my test, if there are two order fields, this answer's direction leads to very poor performance. Subquery with max() approach give much better performance, like this:
select * from main_table join (select GroupId, OrderField1, max(OrderField2) as OrderField2 from main_table join (select GroupId, max(OrderField1) as OrderField1 from main_table group by GroupId) as sub1 using (GroupId, OrderField1) group by GroupId) as sub2 using (GroupId, OrderField1, OrderField2) group by GroupId
. For reference, not sure if just my case. Index involved: main_table(GroupId, OrderField2) -
Rick James almost 7 yearsThe first solution in this set has terrible performance -- O(N*N).
-
Bill Karwin almost 7 years@RickJames, check the EXPLAIN I provided. It's a table-scan, but the join is a
ref
. So it's O(N*logN). -
Rick James almost 7 years@BillKarwin - This may be version-dependent. I looked at the
SESSION STATUS LIKE 'Handler%'
values and got 30M for my 5K test table. -
Bill Karwin almost 7 years@RickJames, or it may depend on how many distinct groups there are. If there are many groups with few rows per group, it could be a better choice. But few groups with many rows would have to scan a lot of rows.
-
Rick James almost 7 years@BillKarwin - I need to ponder that. I was using 5K+ Canadian cities in 13 provinces ("group") based on population ("order").
-
Paul almost 7 years@BillKarwin Regarding
(t1.OrderField < t2.OrderField OR (t1.OrderField = t2.OrderField AND t1.Id < t2.Id))
, can't you do(t1.OrderField, t1.Id) < (t2.OrderField, t2.Id)
? Thanks!