SQL query: inner join optimization between big tables


Solution 1

For one thing I wouldn't do the CONCAT in the query. Do it outside.

But really, your query runs slowly because you're retrieving millions of rows.
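
For example, here is a sketch of what that looks like: return the two columns raw and let the client application glue them together.

    -- Same join, but without the server-side concatenation; the
    -- client code would combine the two columns into one string.
    SELECT H.name, P.name
    FROM hosts AS H
    INNER JOIN urls AS U ON H.id = U.host
    INNER JOIN paths AS P ON U.path = P.id;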

Solution 2

Perhaps you should include a WHERE clause? Or do you really need ALL the data?
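
For instance, assuming (purely for illustration) that only hosts under one domain matter, a filter like this would shrink the result drastically:

    SELECT CONCAT(H.name, P.name)
    FROM hosts AS H
    INNER JOIN urls AS U ON H.id = U.host
    INNER JOIN paths AS P ON U.path = P.id
    WHERE H.name LIKE '%.example.com';  -- hypothetical filter; use
                                        -- whatever subset you need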

Solution 3

This looks to me like a case where over-zealous use of surrogate keys is slowing you down. If the tables were:

  • hosts:

    • name (VARCHAR 100) PRIMARY KEY
  • paths:

    • name (VARCHAR 100) PRIMARY KEY
  • urls:

    • host (VARCHAR 100) PRIMARY KEY <--- links to hosts.name
    • path (VARCHAR 100) PRIMARY KEY <--- links to paths.name

Then your query would require no joins at all:

SELECT CONCAT(U.host, U.path) FROM urls U;

True, table URLS would occupy more disk space - but does that matter?
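
A minimal DDL sketch of that natural-key schema (column sizes taken from the question; the composite primary key on urls is what makes the joins unnecessary):

    CREATE TABLE hosts (
        name VARCHAR(100) NOT NULL,
        PRIMARY KEY (name)
    );

    CREATE TABLE paths (
        name VARCHAR(100) NOT NULL,
        PRIMARY KEY (name)
    );

    CREATE TABLE urls (
        host VARCHAR(100) NOT NULL,  -- holds a hosts.name value
        path VARCHAR(100) NOT NULL,  -- holds a paths.name value
        PRIMARY KEY (host, path)     -- one composite primary key
    );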

EDIT: On second thoughts, what is the point of that PATHS table anyway? How often do different hosts share the same paths?

Why not:

  • hosts:

    • name (VARCHAR 100) PRIMARY KEY
  • urls:

    • host (VARCHAR 100) PRIMARY KEY <--- links to hosts.name
    • path (VARCHAR 100) PRIMARY KEY <--- no link to anywhere
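
The urls table alone would then look something like this sketch, and the no-join query above stays exactly the same:

    CREATE TABLE urls (
        host VARCHAR(100) NOT NULL,  -- still a hosts.name value
        path VARCHAR(100) NOT NULL,  -- stored inline, no paths table
        PRIMARY KEY (host, path)
    );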

EDIT2: Or if you really need the surrogate key for hosts:

  • hosts:

    • id integer PRIMARY KEY
    • name (VARCHAR 100)
  • urls:

    • host integer PRIMARY KEY <--- links to hosts.id
    • path (VARCHAR 100) PRIMARY KEY <--- no link to anywhere

    SELECT CONCAT(H.name, U.path) FROM urls U JOIN hosts H ON H.id = U.host;

Solution 4

Overall, the best advice is to trace and profile to see what is really taking up time. But here are my thoughts about specific things to look at.

(1) I would say that you want to ensure that indexes are NOT used in the execution of this query. Since you have no filtering conditions, it should be more efficient to full-scan all the tables and then join them together with a sort-merge or hash operation.
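
You can verify the plan with EXPLAIN, and if the optimizer insists on index lookups you can try MySQL's IGNORE INDEX hint. This is only a sketch; whether forcing full scans actually wins depends on the join methods your server version supports:

    -- 'ALL' in the type column of the EXPLAIN output means a full scan.
    EXPLAIN
    SELECT CONCAT(H.name, P.name)
    FROM hosts AS H
    INNER JOIN urls AS U ON H.id = U.host
    INNER JOIN paths AS P ON U.path = P.id;

    -- If index lookups appear anyway, hint them away per table:
    SELECT CONCAT(H.name, P.name)
    FROM hosts AS H
    INNER JOIN urls AS U IGNORE INDEX (PRIMARY) ON H.id = U.host
    INNER JOIN paths AS P IGNORE INDEX (PRIMARY) ON U.path = P.id;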

(2) The string concatenation is surely taking some time, but I don't understand why people are recommending removing it. You would presumably then need to do the concatenation in another piece of code, where it would still take about the same amount of time (unless MySQL's string concatenation is particularly slow for some reason).

(3) The data transfer from the server to the client is probably taking significant time, quite possibly more than the time the server needs to fetch the data. If you have tools to trace this sort of thing, use them. If you can increase the fetch array size in your client, experiment with different sizes (e.g. in JDBC, use Statement.setFetchSize()). This can be significant even if the client and server are on the same host.

Solution 5

You need to look at your server configuration. The default memory parameters for MySQL will cripple performance on a table that size. If you are using the defaults, you need to raise key_buffer_size and join_buffer_size (at a minimum) by at least a factor of 4, perhaps much more. Look in the documentation; there are other memory parameters you can tweak.
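
As a hedged illustration (the sizes below are placeholders, not recommendations; tune them against your available RAM, and put the final values in my.cnf to make them permanent):

    -- Inspect the current settings first:
    SHOW VARIABLES LIKE 'key_buffer_size';
    SHOW VARIABLES LIKE 'join_buffer_size';

    -- Raise them server-wide; 268435456 = 256 MB, 16777216 = 16 MB.
    SET GLOBAL key_buffer_size  = 268435456;
    SET GLOBAL join_buffer_size = 16777216;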

MySQL has a funny performance quirk: if your tables go over a certain size and your queries will return most of the data, performance goes into the toilet. Unfortunately, it has no way of telling you when that threshold is reached. It looks to me like you have reached it, though.

Author by

Nicolas

Project manager, former developer, still coding at times :)

Updated on June 16, 2022

Comments

  • Nicolas
    Nicolas almost 2 years

    I have the following 3 tables in a MySQL 4.x DB:

    • hosts: (300,000 records)
      • id (UNSIGNED INT) PRIMARY KEY
      • name (VARCHAR 100)
    • paths: (6,000,000 records)
      • id (UNSIGNED INT) PRIMARY KEY
      • name (VARCHAR 100)
    • urls: (7,000,000 records)
      • host (UNSIGNED INT) PRIMARY KEY <--- links to hosts.id
      • path (UNSIGNED INT) PRIMARY KEY <--- links to paths.id

    As you can see, the schema is really simple but the problem is the amount of data in these tables.

    Here is the query I'm running :

    SELECT CONCAT(H.name, P.name)
    FROM hosts AS H
    INNER JOIN urls as U ON H.id = U.host
    INNER JOIN paths AS P ON U.path = P.id;
    

    This query works perfectly fine, but takes 50 minutes to run. Does anyone have any idea about how I could speed up that query?

    Thanks in advance. Nicolas

  • Bikash Mahata
    Bikash Mahata about 15 years
    Yes, a "materialized view" would be recommendable, if he does not need the latest data all the time.
  • Dave Costa
    Dave Costa about 15 years
    Actually, if he is really looking to get all rows returned, indexes might not be helpful. Doing an index lookup for each value in the table is probably slower than full-scanning the tables and hashing or merging them together.
  • James Curran
    James Curran about 15 years
    I was just about to add an answer saying the same as the "On Second Thought" part.
  • Bikash Mahata
    Bikash Mahata about 15 years
    I see several hundred megabytes of data. If it all fits into memory, you are right. But a proper DBMS (and I guess even MySQL 4.x, as it is proper enough) will then ignore existing indexes by itself.
  • Manzabar
    Manzabar about 15 years
    MySQL only allows 1 primary key per table, but that key can be made up of multiple columns from the table. So in Nicolas's example, the urls table has a single primary key made up of host + path.
  • Tony Andrews
    Tony Andrews about 15 years
    Dems, I pity your clients if you insist on surrogate keys on EVERY table. Relational databases work just as well with natural keys - sometimes even better. "ARGH!!!" indeed!
  • Christian Nunciato
    Christian Nunciato about 15 years
    Sure, that makes sense -- I neglected to ask whether the keys are actually one composite one (which I don't think is necessarily made clear anyway). Mainly, though, I just wanted to point out the importance of those two columns being indexed explicitly somehow.
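
MySQL has no native materialized views, but the emulation hinted at in the comments above is a summary table that is rebuilt on a schedule. A minimal sketch (the table name url_strings is made up for illustration):

    -- Precompute the result once; rerun (e.g. from cron) to refresh.
    DROP TABLE IF EXISTS url_strings;
    CREATE TABLE url_strings
    SELECT CONCAT(H.name, P.name) AS url
    FROM hosts AS H
    INNER JOIN urls AS U ON H.id = U.host
    INNER JOIN paths AS P ON U.path = P.id;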