Why is LEFT JOIN slower than INNER JOIN?

11,123

Solution 1

With INNER JOIN, MySQL generally will start with the table with the smallest number of rows. In this case, it starts with table finished and does a look up for the corresponding record in saved using the index on saved.email.

For a LEFT JOIN, (excluding some optimizations) MySQL generally joins the records in order (starting with the left most table). In this case, MySQL starts with the table saved, then attempts to find each corresponding record in finished. Since there is no usable index on finished.email, it must do a full scan for each look up.

Edit

Now that you posted your schema, I can see that MySQL is ignoring the index (finished.email) when going from utf8 to latin1 character set. You've not posted the character sets and collations for each column, so I'm going by the default character set for the table. The collations must be compatible in order for MySQL to use the index.

MySQL can coerce (upgrade) a latin1 collation, which is very limited, up to a utf8 collation such as unicode_ci (so the first query can use the index on saved.email by upgrading latin1 collation to utf8), but the opposite is not true (the second query can't use the index on finished.email since it can't downgrade a utf8 collation down to latin1).

The solution is to change both email columns to a compatible collation, perhaps most easily by making them identical character sets and collations.

Solution 2

The LEFT JOIN query is slower than the INNER JOIN query because it's doing more work.

From the EXPLAIN output, it looks like MySQL is doing nested loop join. (There's nothing wrong with nested loops; I think that's the only join operation that MySQL uses in version 5.5 and earlier.)

For the INNER JOIN query, MySQL is using an efficient "ref" (index lookup) operation to locate the matching rows.

But for the LEFT JOIN query, it looks like MySQL is doing a full scan of the index to find the matching rows. So, with the nested loops join operation, MySQL is doing a full index scan scan for each row from the other table. So, that's on the order of tens of thousands of scans, and each of those scans is inspecting tens of thousands of rows.

Using the estimated row counts from the EXPLAIN output, that's going to require (40971*32168=) 1,317,955,128 string comparisons.

The INNER JOIN query avoids a lot of that work, so it's a lot faster. (It's avoiding all those string comparisons by using an index operation.

-- LEFT JOIN
id select table    type   key   key_len ref    rows  Extra
-- ------ -------- -----  ----- ------- ----  -----  ------------------------
1  SIMPLE saved    index  email     383 NULL  40971  Using index
1  SIMPLE finished index  email     258 NULL  32168  Using index

-- INNER JOIN 
id select table    type   key   key_len ref    rows  Extra
-- ------ -------- -----  ----- ------- ----  -----  ------------------------  
1  SIMPLE finished index  email     258 NULL  32168  Using index
1  SIMPLE saved    ref    email     383 func      1  Using where; Using index
                   ^^^^^                ^^^^  ^^^^^  ^^^^^^^^^^^^

NOTE: Markus Adams spied the difference in characterset in the email columns CREATE TABLE statements that were added to your question.

I believe that it's the difference in the characterset that's preventing MySQL from using an index for your query.


Q2: How do I make the LEFT JOIN query faster?

A: I don't believe it's going to be possible to get that specific query to run faster, without a schema change, such as changing the characterset of the two email columns to match.

The only affect that the "outer join" to the finished table looks like it is to produce "duplicate" rows whenever more than one matching row is found. I'm not understanding why the outer join is needed. Why not just get rid of it altogether, and just do:

SELECT saved.email FROM saved

Solution 3

I'm afraid more info will probably be needed.

However, inner joins eliminate any item that has a null foreign key (no match, if you will). This means that there are less rows to scan to associate.

For a left join however, any non-match needs to be given a blank row, so all of the rows are scanned regardless -- nothing can be eliminated.

This makes the data set larger and requires more resources to process. Also, when you write your select, don't do select * -- instead, explicitly state which columns you want.

Solution 4

The data types of saved.email and finished.email differ in two respects. First, they have different lengths. Second, finished.email can be NULL. So, your LEFT JOIN operation can't exploit the index on finished.email.

Can you change the definition of finished.email to this, so it matches the field you're joining it with?

`email` varchar(127) NOT NULL

If you do you'll probably get a speedup.

Share:
11,123
allenylzhou
Author by

allenylzhou

Computer science is not a field, it is a lifestyle.

Updated on June 14, 2022

Comments

  • allenylzhou
    allenylzhou almost 2 years

    I have two queries, the first one (inner join) is super fast, and the second one (left join) is super slow. How do I make the second query fast?

    EXPLAIN SELECT saved.email FROM saved INNER JOIN finished ON finished.email = saved.email;
    
    id  select_type table   type    possible_keys   key key_len ref rows    Extra
    1   SIMPLE  finished    index   NULL    email   258 NULL    32168   Using index
    1   SIMPLE  saved   ref email   email   383 func    1   Using where; Using index
    
    EXPLAIN SELECT saved.email FROM saved LEFT JOIN finished ON finished.email = saved.email;
    
    id  select_type table   type    possible_keys   key key_len ref rows    Extra
    1   SIMPLE  saved   index   NULL    email   383 NULL    40971   Using index
    1   SIMPLE  finishedindex   NULL    email   258 NULL    32168   Using index
    

    Edit: I have added table info for both tables down below.

    CREATE TABLE `saved` (
      `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
      `slug` varchar(255) DEFAULT NULL,
      `email` varchar(127) NOT NULL,
      [omitted fields include varchar, text, longtext, int],
      PRIMARY KEY (`id`),
      KEY `slug` (`slug`),
      KEY `email` (`email`)
    ) ENGINE=MyISAM AUTO_INCREMENT=56329 DEFAULT CHARSET=utf8;
    
    CREATE TABLE `finished` (
      `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
      `slug` varchar(255) DEFAULT NULL,
      `submitted` int(11) DEFAULT NULL,
      `status` int(1) DEFAULT '0',
      `name` varchar(255) DEFAULT NULL,
      `email` varchar(255) DEFAULT NULL,
      [omitted fields include varchar, text, longtext, int],
      PRIMARY KEY (`id`),
      KEY `assigned_user_id` (`assigned_user_id`),
      KEY `event_id` (`event_id`),
      KEY `slug` (`slug`),
      KEY `email` (`email`),
      KEY `city_id` (`city_id`),
      KEY `status` (`status`),
      KEY `recommend` (`recommend`),
      KEY `pending_user_id` (`pending_user_id`),
      KEY `submitted` (`submitted`)
    ) ENGINE=MyISAM AUTO_INCREMENT=33063 DEFAULT CHARSET=latin1;
    
    • Michael Berkowski
      Michael Berkowski over 9 years
      Please post the SHOW CREATE TABLE output for both tables.
    • radar
      radar over 9 years
      the output is same for both the queries as you are getting the column from saved table alone, what is the main intention?
    • crazy_in_love
      crazy_in_love over 9 years
      I think this belongs to dba.stackexchange.com.
    • zerkms
      zerkms over 9 years
      How big is saved table and what is the number of rows returned by both queries.
    • zerkms
      zerkms over 9 years
      @Marcus Adams: according to the explain output - there is such
    • allenylzhou
      allenylzhou over 9 years
      Added SHOW CREATE TABLE output for both tables
    • Deadooshka
      Deadooshka over 9 years
      try SELECT STRAIGHT_JOIN ... FROM finished LEFT JOIN saved ...
    • Marcus Adams
      Marcus Adams over 9 years
      See my updated answer for the actual problem.
    • Marcus Adams
      Marcus Adams over 9 years
      @zerkms, you're assuming things based on the name of an index.
    • zerkms
      zerkms over 9 years
      @Marcus Adams: uhm, not sure what you mean. 1. There are schemas available already 2. In the given query only email column is used for predicates and SELECT which automatically means that if any index is used - then email is its left most part. So, excuse me, but my "assumption" is based on facts.
  • Mike Brant
    Mike Brant over 9 years
    There are no row scans happening here at all. You are correct that more rows are returned in the LEFT JOIN, but the join is able to be calculated using index in either case as shown in the explain.
  • Marcus Adams
    Marcus Adams over 9 years
    @MikeBrant, Using index doesn't mean it used the index to satisfy the WHERE. Using where; Using index means it used the index to satisfy the WHERE.
  • allenylzhou
    allenylzhou over 9 years
    But there is an index on finished.email.
  • Mike Brant
    Mike Brant over 9 years
    @MarcusAdams There is no WHERE clause in the example. I was talking about the indexes being used to satisfy the join condition.
  • zerkms
    zerkms over 9 years
    @Mike Brant: actually that's not true. using index without using where means the index was read just to fetch the data (to avoid reading the actual data pages). So in the second EXPLAIN output it's basically an index full scan, not lookup. Hence it's slow. And there is WHERE in the first query (the ON node transformed into WHERE clause during optimization)
  • Mike Brant
    Mike Brant over 9 years
    @zerks Yes agreed that the indexes are scanned and read into memory. I was more making note that the answer refers to all rows being scanned. Seemed to be hinting at full table scan, which is not the case here.
  • zerkms
    zerkms over 9 years
    @Mike Brant: it is still a full index scan, which causes exactly the same effect: cartesian product of 2 sets. So from my point it's fair to say so (but it needs to be carefully explained indeed to avoid further confusion). However I completely agree the explanation of the roots of the issue is kind of vague and in the current form it does not explain anything :-)
  • zerkms
    zerkms over 9 years
    Aaaaand we have the winner :-) Nice explanation, +1
  • spencer7593
    spencer7593 over 9 years
    Good catch on the difference in the characterset.
  • spencer7593
    spencer7593 over 9 years
    I missed the difference in characterset (Markus spied it) that may probably explain why MySQL isn't using a "ref" join operation.
  • allenylzhou
    allenylzhou over 9 years
    I'm going to accept Adam's answer because the character set is the culprit but thank you for the detailed explanation of how LEFT JOINs differ from INNER JOINs.
  • Darwayne
    Darwayne over 7 years
    Character set point was a really great one. Made a query of mine go down from 1 hour to 87ms
  • Sammy Larbi
    Sammy Larbi over 5 years
    "The collations must be compatible in order for MySQL to use the index." -- Thanks, I hadn't considered looking at the collation as part of my problem. I converted the tables to use the same collation, and it took my multiple-minute query down to a second or so. Thanks!
  • Christian Saiki
    Christian Saiki over 4 years
    oh my god, you are life saver! Our query went from 8h to 1 minute because we were using different enccoding formats in our tables. Thanks a lot.