SQL: Inner joining two massive tables


Solution 1

For huge joins, sometimes explicitly choosing a loop join speeds things up:

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM 
INNER LOOP JOIN BioEntity 
    ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId

As always, posting your estimated execution plan could help us provide better answers.

EDIT: If both inputs are sorted (they should be, with the covering index), you can try a MERGE JOIN:

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM 
INNER JOIN BioEntity 
    ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
OPTION (MERGE JOIN)

Solution 2

I'm not an SQL tuning expert, but joining hundreds of millions of rows on a VARCHAR field doesn't sound like a good idea in any database system I know.

You could try adding an integer column to each table, computed as a hash of the NAME field. Joining on that hash first should narrow the possible matches down to a reasonable number before the engine has to look at the actual VARCHAR data.
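A minimal sketch of the idea, using SQL Server's built-in CHECKSUM for the hash (the column and index names here are hypothetical):

ALTER TABLE EGM ADD NameHash AS CHECKSUM(Name) PERSISTED;
ALTER TABLE BioEntity ADD NameHash AS CHECKSUM(Name) PERSISTED;

CREATE INDEX IX_EGM_NameHash ON EGM (NameHash, TypeId);
CREATE INDEX IX_BioEntity_NameHash ON BioEntity (NameHash, TypeId);

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM
INNER JOIN BioEntity
    ON EGM.NameHash = BioEntity.NameHash   -- cheap integer comparison first
    AND EGM.TypeId = BioEntity.TypeId
    AND EGM.Name = BioEntity.Name          -- verify, since different strings can share a hash

The final equality check on Name is still required because CHECKSUM can collide; the hash only prunes the candidates.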

Solution 3

First, 100M-row joins are not at all unreasonable or uncommon.

However, I suspect the cause of the poor performance you're seeing may be related to the INTO clause. With that, you are not only doing a join, you are also writing the results to a new table. Your observation about the log file growing so huge is basically confirmation of this.

One thing to try: remove the INTO and see how it performs. If the performance is reasonable, then to address the slow write you should make sure that your DB log file is on a separate physical volume from the data. If it isn't, the disk heads will thrash (lots of seeks) as they read the data and write the log, and your perf will collapse (possibly to as little as 1/40th to 1/60th of what it could be otherwise).
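One quick way to test this (a rough diagnostic, not a fix) is to replace the INTO with an aggregate, so nothing gets written and only the read side of the query is timed:

SELECT COUNT(*)
FROM EGM
INNER JOIN BioEntity
    ON EGM.Name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId

If that finishes in a reasonable time, the join itself is fine and the bottleneck is the write (and its logging).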

Solution 4

Maybe a bit off-topic, but: "I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds."

This behavior is characteristic of a cheap RAID5 array (or perhaps a single disk) while copying gigabytes of information, and your query mostly copies data.

More to the point: can't you partition your query into smaller blocks, e.g. names starting with A, B, etc., or IDs in specific ranges? This could substantially decrease transactional/locking overhead.
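A sketch of what such batching might look like (the alphabetical ranges are illustrative): the first batch creates the table via SELECT ... INTO, and subsequent batches have to INSERT into it.

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM
INNER JOIN BioEntity
    ON EGM.Name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
WHERE EGM.Name >= 'A' AND EGM.Name < 'B';

INSERT INTO AUX (Name, BioEntityId)
SELECT EGM.Name, BioEntity.BioEntityId
FROM EGM
INNER JOIN BioEntity
    ON EGM.Name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
WHERE EGM.Name >= 'B' AND EGM.Name < 'C';  -- and so on for the remaining ranges

Each batch commits on its own, so the transaction log stays small even in simple recovery mode.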

Solution 5

I'd try removing the LIKE operator, since you don't seem to be doing any wildcard matching.
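That is, the join condition becomes a plain equality test:

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM
INNER JOIN BioEntity
    ON EGM.Name = BioEntity.Name AND EGM.TypeId = BioEntity.TypeId

With =, the comparison is a straight string equality, which is cheaper than a pattern match and gives the optimizer more join strategies to choose from (hash and merge joins need equality predicates).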


Comments

  • João Pereira over 3 years

    I have two massive tables with about 100 million records each, and I'm afraid I need to perform an inner join between the two. Now, both tables are very simple; here's the description:

    BioEntity table:

    • BioEntityId (int)
    • Name (nvarchar 4000, although this is overkill)
    • TypeId (int)

    EGM table (an auxiliary table, in fact, resulting from bulk import operations):

    • EMGId (int)
    • PId (int)
    • Name (nvarchar 4000, although this is overkill)
    • TypeId (int)
    • LastModified (date)

    I need to get a matching Name in order to associate BioEntityId with the PId residing in the EGM table. Originally, I tried to do everything with a single inner join, but the query appeared to be taking way too long, and the log file of the database (in simple recovery mode) managed to chew up all the available disk space (just over 200 GB, when the database occupies 18 GB); the query would fail after waiting for two days, if I'm not mistaken. I managed to keep the log from growing (it's only 33 MB now), but the query has been running non-stop for 6 days and it doesn't look like it's going to stop anytime soon.

    I'm running it on a fairly decent computer (4 GB RAM, Core 2 Duo (E8400) 3 GHz, Windows Server 2008, SQL Server 2008) and I've noticed that the computer jams occasionally, every 30 seconds (give or take), for a couple of seconds. This makes it quite hard to use the machine for anything else, which is really getting on my nerves.

    Now, here's the query:

     SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
     FROM EGM INNER JOIN BioEntity 
     ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
    

    I had manually set up some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and did not end either, so I tried running the Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and instead creating statistics and two clustered indexes (one on each table, just containing the TypeId, which I find rather odd, or just plain dumb, but I gave it a go anyway).
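    For what it's worth, the covering indexes looked something along these lines (index names are illustrative; note that SQL Server 2008 caps index key values at 900 bytes, so the nvarchar(4000) Name column has to be cropped to fit):

     CREATE NONCLUSTERED INDEX IX_EGM_TypeId_Name
         ON EGM (TypeId, Name) INCLUDE (PId);
     CREATE NONCLUSTERED INDEX IX_BioEntity_TypeId_Name
         ON BioEntity (TypeId, Name) INCLUDE (BioEntityId);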

    It has been running for 6 days now and I'm still not sure what to do... Any ideas guys? How can I make this faster (or, at least, finite)?

    Update: Ok, I've canceled the query and rebooted the server to get the OS up and running again. I'm rerunning the workflow with your proposed changes, specifically cropping the nvarchar field to a much smaller size and swapping "like" for "=". This is going to take at least two hours, so I'll be posting further updates later on.

    Update 2 (1 PM GMT, 18/11/09): The estimated execution plan reveals a 67% cost for table scans, followed by a 33% hash match. Next comes 0% parallelism (isn't this strange? This is the first time I'm using the estimated execution plan, but this particular fact just raised my eyebrow), 0% hash match, another 0% parallelism, 0% top, 0% table insert and finally another 0% select into. As expected, the indexes are crap, so I'll be making manual indexes and discarding the crappy suggested ones.