MySQL is not using INDEX in subquery


Solution 1

Your query:

SELECT *
FROM people
LEFT JOIN (
  SELECT *
  FROM visits
  ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id;
  • First, it uses non-standard SQL syntax (items appear in the SELECT list that are not part of the GROUP BY clause, are not aggregate functions, and do not depend on the grouping items). This can give indeterminate (semi-random) results.

  • Second, (to avoid the indeterminate results) you have added an ORDER BY inside a subquery. Non-standard or not, nothing in the MySQL documentation says this should work as expected. So it may be working now, but it may stop working in the not so distant future, when you upgrade to MySQL version X (where the optimizer will be clever enough to understand that an ORDER BY inside a derived table is redundant and can be eliminated).

Try using this query:

SELECT 
    p.*, v.*
FROM 
    people AS p
  LEFT JOIN 
        ( SELECT 
              id_people
            , MAX(year) AS year
          FROM
              visits
          GROUP BY
              id_people
         ) AS vm
      JOIN
          visits AS v
        ON  v.id_people = vm.id_people
        AND v.year = vm.year 
    ON  v.id_people = p.id;

See the SQL-fiddle.

A compound index on (id_people, year) would help efficiency.
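As a rough check of the query and index above, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for MySQL. That substitution is an assumption: SQLite's optimizer differs, so this only checks the rows returned, not the speed. The sample rows, the note values, and the index name idx_visits_people_year are hypothetical. The nested join is restated as two chained LEFT JOINs, which should return the same rows.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE visits (
        id INTEGER PRIMARY KEY,
        id_people INTEGER,
        year INTEGER,
        note TEXT
    );
    -- Compound index suggested in the answer; the index name is made up.
    CREATE INDEX idx_visits_people_year ON visits (id_people, year);

    -- Hypothetical sample data: Cid has no visits at all.
    INSERT INTO people VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO visits VALUES
        (1, 1, 2010, 'a'),
        (2, 1, 2012, 'b'),
        (3, 2, 2011, 'c');
""")

# Latest year per person, joined back to visits to fetch the full row.
rows = conn.execute("""
    SELECT p.id, p.name, v.year, v.note
    FROM people AS p
    LEFT JOIN (
        SELECT id_people, MAX(year) AS year
        FROM visits
        GROUP BY id_people
    ) AS vm
        ON vm.id_people = p.id
    LEFT JOIN visits AS v
        ON v.id_people = vm.id_people AND v.year = vm.year
    ORDER BY p.id
""").fetchall()

for row in rows:
    print(row)
```

Every person appears once here, and the person without visits gets NULLs (None in Python) for the visit columns.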


A different approach, which works fine if you first limit the people to a sensible number (say 30) and then join to the visits table:

SELECT 
    p.*, v.*
FROM 
    ( SELECT *
      FROM people
      ORDER BY name
        LIMIT 30
    ) AS p
  LEFT JOIN 
    visits AS v
      ON  v.id_people = p.id
      AND v.year =
    ( SELECT 
          year
      FROM
          visits
      WHERE
          id_people = p.id
      ORDER BY
          year DESC
        LIMIT 1
     )  
ORDER BY name ;

Solution 2

Why do you have a subquery when all you need is a table name for joining?

It is also not obvious to me why your query has a GROUP BY clause in it. GROUP BY is ordinarily used with aggregate functions like MAX or COUNT, but you don't have those.

How about this? It may solve your problem.

    SELECT people.id, people.name, MAX(visits.year) year
      FROM people
      JOIN visits ON people.id = visits.id_people
  GROUP BY people.id, people.name

If you need to show the person, the most recent visit, and the note from the most recent visit, you're going to have to explicitly join the visits table again to the summary query (virtual table) like so.

SELECT a.id, a.name, a.year, v.note
  FROM (
         SELECT people.id, people.name, MAX(visits.year) year
          FROM people
          JOIN visits ON people.id = visits.id_people
      GROUP BY people.id, people.name
  )a
  JOIN visits v ON (a.id = v.id_people and a.year = v.year)

Go fiddle: http://www.sqlfiddle.com/#!2/d67fc/20/0

If you need to show something for people who have never had a visit, try replacing the JOINs in my statements with LEFT JOINs.

As someone else wrote, an ORDER BY clause in a subquery is not standard, and generates unpredictable results. In your case it baffled the optimizer.

Edit: GROUP BY is a big hammer. Don't use it unless you need it. And, don't use it unless you use an aggregate function in the query.

Notice that if you have more than one row in visits for a person and the most recent year, this query will generate multiple rows for that person, one for each visit in that year. If you want just one row per person, and you DON'T need the note for the visit, then the first query will do the trick. If you have more than one visit for a person in a year, and you only need the latest one, you have to identify which row IS the latest one. Usually it will be the one with the highest ID number, but only you know that for sure. I added another person to your fiddle with that situation. http://www.sqlfiddle.com/#!2/4f644/2/0
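To make the duplicate-row caveat concrete, here is a minimal sketch in Python with the standard-library sqlite3 module standing in for MySQL (an assumption; only the returned rows are of interest). The sample people, visits, and notes are hypothetical. Ann has two visits in her most recent year, so the join back to visits yields two rows for her.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE visits (
        id INTEGER PRIMARY KEY,
        id_people INTEGER,
        year INTEGER,
        note TEXT
    );
    -- Hypothetical data: Ann has two visits in 2012, her latest year.
    INSERT INTO people VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO visits VALUES
        (1, 1, 2012, 'first'),
        (2, 1, 2012, 'second'),
        (3, 2, 2011, 'only');
""")

# Summary query (latest year per person) joined back to visits for the note.
rows = conn.execute("""
    SELECT a.id, a.name, a.year, v.note
    FROM (
        SELECT people.id, people.name, MAX(visits.year) AS year
        FROM people
        JOIN visits ON people.id = visits.id_people
        GROUP BY people.id, people.name
    ) AS a
    JOIN visits AS v ON a.id = v.id_people AND a.year = v.year
    ORDER BY a.id, v.id
""").fetchall()

# Ann shows up twice because both of her 2012 visits match MAX(year).
for row in rows:
    print(row)
```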

This is complicated. But: if your visits.id numbers are automatically assigned and they are always in time order, you can simply report the highest visit id, and be guaranteed that you'll have the latest year. This will be a very efficient query.

SELECT p.id, p.name, v.year, v.note
  FROM (
         SELECT id_people, max(id) id
          FROM visits
      GROUP BY id_people
  )m
  JOIN people p ON (p.id = m.id_people)
  JOIN visits v ON (m.id = v.id)

Fiddle: http://www.sqlfiddle.com/#!2/4f644/1/0

But this is not the way your example is set up. So you need another way to disambiguate your latest visit, so that you get just one row per person. The only trick we have at our disposal is to use the largest id number among the visits in the latest year.

So, we need to get a list of the visits.id numbers that are the latest ones, by this definition, from your tables. This query does that, with a MAX(year) ... GROUP BY id_people query nested inside a MAX(id) ... GROUP BY id_people query.

  SELECT v.id_people,
         MAX(v.id) id
    FROM (
         SELECT id_people, 
                MAX(year) year
           FROM visits
          GROUP BY id_people
         )p
    JOIN visits v ON (p.id_people = v.id_people AND p.year = v.year)
   GROUP BY v.id_people

The overall query (http://www.sqlfiddle.com/#!2/c2da2/1/0) is this.

SELECT p.id, p.name, v.year, v.note
  FROM (
      SELECT v.id_people,
             MAX(v.id) id
        FROM (
             SELECT id_people, 
                    MAX(year) year
               FROM visits
              GROUP BY id_people
             )p
        JOIN visits v ON (     p.id_people = v.id_people 
                           AND p.year = v.year)
       GROUP BY v.id_people
      )m
   JOIN people p ON (m.id_people = p.id)
   JOIN visits v ON (m.id = v.id)
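The overall query can be tried out with a minimal sketch in Python, using the standard-library sqlite3 module as a stand-in for MySQL (an assumption; this checks results, not performance). The sample data is hypothetical: Ann has two visits in 2012, plus an older 2010 visit that was entered later and therefore has the highest id. The inner derived table is aliased ly here instead of p, purely to keep it distinct from the people alias.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE visits (
        id INTEGER PRIMARY KEY,
        id_people INTEGER,
        year INTEGER,
        note TEXT
    );
    -- Hypothetical data: visit 4 has the highest id but an older year.
    INSERT INTO people VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO visits VALUES
        (1, 1, 2012, 'first'),
        (2, 1, 2012, 'second'),
        (3, 2, 2011, 'only'),
        (4, 1, 2010, 'entered late');
""")

# Step 1 (ly): latest year per person.
# Step 2 (m):  highest visit id within that latest year.
# Step 3:      join back to people and visits for the displayed columns.
rows = conn.execute("""
    SELECT p.id, p.name, v.year, v.note
    FROM (
        SELECT v.id_people, MAX(v.id) AS id
        FROM (
            SELECT id_people, MAX(year) AS year
            FROM visits
            GROUP BY id_people
        ) AS ly
        JOIN visits AS v
            ON ly.id_people = v.id_people AND ly.year = v.year
        GROUP BY v.id_people
    ) AS m
    JOIN people AS p ON m.id_people = p.id
    JOIN visits AS v ON m.id = v.id
    ORDER BY p.id
""").fetchall()

for row in rows:
    print(row)
```

With this data each person appears exactly once, and Ann's row comes from the higher-id visit within 2012 rather than from the late-entered 2010 visit, which a plain MAX(id) would have picked.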

Disambiguation in SQL is a tricky business to learn, because it takes some time to wrap your head around the idea that there's no inherent order to rows in a DBMS.

Author by

meridius

Updated on June 05, 2022

Comments

  • meridius almost 2 years

    I have these tables and queries as defined in sqlfiddle.

    At first my problem was to group people while showing the LEFT JOINed visits row with the newest year. I solved that using a subquery.

    Now my problem is that the subquery is not using the INDEX defined on the visits table. That causes my query to run nearly indefinitely on tables with approx. 15,000 rows each.

    Here's the query. The goal is to list every person once, with his or her newest (by year) record in the visits table.

    Unfortunately, on large tables it gets really slow because it's not using the INDEX in the subquery.

    SELECT *
    FROM people
    LEFT JOIN (
      SELECT *
      FROM visits
      ORDER BY visits.year DESC
    ) AS visits
    ON people.id = visits.id_people
    GROUP BY people.id
    

    Does anyone know how to force MySQL to use the INDEX already defined on the visits table?