Deleting millions of rows in MySQL


Solution 1

DELETE FROM `table`
WHERE (whatever criteria)
ORDER BY `id`
LIMIT 1000

Wash, rinse, repeat until zero rows affected. Maybe in a script that sleeps for a second or three between iterations.
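
If you would rather keep the loop inside MySQL itself, the same "repeat until zero rows affected" idea can be sketched as a stored procedure. This is only an illustration: it reuses the placeholder table name and criteria from the query above, and the procedure name is hypothetical.

DELIMITER $$
CREATE PROCEDURE purge_in_batches()        -- hypothetical name
BEGIN
  DECLARE affected INT DEFAULT 1;
  WHILE affected > 0 DO
    DELETE FROM `table`                    -- placeholder table and criteria from above
    WHERE (whatever criteria)
    ORDER BY `id`
    LIMIT 1000;
    SET affected = ROW_COUNT();            -- rows removed by this batch
    DO SLEEP(3);                           -- pause so other queries can get through
  END WHILE;
END$$
DELIMITER ;

CALL purge_in_batches();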

Solution 2

I had a use case of deleting 1M+ rows from a 25M+ row table in MySQL. I tried different approaches, such as the batch deletes described above.
The fastest way I found was to copy the required records into a new table:

  1. Create a temporary table that holds just the ids:

CREATE TABLE id_temp_table (temp_id INT);

  2. Insert the ids that should be removed:

INSERT INTO id_temp_table (temp_id) SELECT .....

  3. Create a new table, table_new.

  4. Insert all records from table into table_new, except the unwanted rows whose ids are in id_temp_table:

INSERT INTO table_new .... WHERE table_id NOT IN (SELECT DISTINCT(temp_id) FROM id_temp_table);

  5. Rename the tables.

The whole process took ~1 hour. In my use case, a simple delete of a batch of 100 records took 10 minutes.
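
Put together, the steps above look roughly like the sketch below; the table name `table`, its id column, and the selection criteria are placeholders. CREATE TABLE ... LIKE copies the structure and indexes of the original table, and RENAME TABLE swaps both tables in a single statement.

CREATE TABLE id_temp_table (temp_id INT);

INSERT INTO id_temp_table (temp_id)
SELECT id FROM `table` WHERE (criteria for rows to remove);

CREATE TABLE table_new LIKE `table`;

INSERT INTO table_new
SELECT * FROM `table`
WHERE id NOT IN (SELECT DISTINCT temp_id FROM id_temp_table);

RENAME TABLE `table` TO table_old, table_new TO `table`;

As noted in the comments below, replacing the NOT IN subquery with a LEFT JOIN ... IS NULL filter can make better use of the index.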

Solution 3

The following deletes 1,000,000 records, one at a time (1,000 passes, each issuing 1,000 single-row deletes).

 for i in `seq 1 1000`; do
     # -N suppresses the column-header row; awk turns each id into its own DELETE statement
     mysql -N -e "select id from table_name where (condition) order by id desc limit 1000" | awk '{ print "delete from table_name where id = " $1 ";" }' | mysql
 done

You could also group them together and do DELETE FROM table_name WHERE id IN (id1, id2, ..., idN); I'm sure that would work too without much difficulty.
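
Grouped, each batch would become a single statement of the form below; the ids shown are hypothetical stand-ins for the 1,000 values returned by the SELECT:

DELETE FROM table_name WHERE id IN (1001, 1002, 1003, 1004, 1005);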

Solution 4

I'd also recommend adding some constraints to your table to make sure that this doesn't happen to you again. A million rows, at 1000 per shot, will take 1000 repetitions of a script to complete. If the script runs once every 3.6 seconds you'll be done in an hour. No worries. Your clients are unlikely to notice.
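
For example, a unique key on whatever column (or combination of columns) defines a duplicate would make MySQL reject repeat rows at insert time; the table and column names here are purely hypothetical.

ALTER TABLE reports
  ADD UNIQUE KEY uniq_customer_day (customer_id, report_date);

Inserts that would violate the key can then be handled with INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE.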

Solution 5

I think the slowness is due to MySQL's "clustered index", where the actual records are stored within the primary key index, in the order of the primary key. This means access to a record via the primary key is extremely fast, requiring only one disk fetch, because the record on disk sits right where the index found the matching primary key.

In other databases without clustered indexes, the index itself does not hold the record but just an "offset" or "location" indicating where the record sits in the table file; a second fetch must then be made in that file to retrieve the actual data.

You can imagine that when deleting a record in a clustered index (like MySQL uses), all records above that record in the index (= table) must be moved downwards to avoid massive holes being created in the index (well, that is what I recall from a few years ago at least; version 8.x may have improved this).

Armed with knowledge of these 'under the hood' operations, what we discovered really sped up deletes in MySQL 5.x was to perform the deletes in reverse order. This produces the least amount of record movement, because you are deleting records from the end first, meaning subsequent deletes have fewer records to relocate. Logical, right?!
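
In practice that just means adding a descending ORDER BY to the batched delete, so each pass trims rows from the tail of the clustered index first (same placeholder table and criteria as above):

DELETE FROM `table`
WHERE (whatever criteria)
ORDER BY `id` DESC
LIMIT 1000;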

Author by

Steven Surowiec

I'm a PHP/MySQL Developer. Me on GitHub

Updated on July 08, 2022

Comments

  • Steven Surowiec
    Steven Surowiec almost 2 years

    I recently found and fixed a bug in a site I was working on that resulted in millions of duplicate rows of data in a table that will be quite large even without them (still in the millions). I can easily find these duplicate rows and can run a single delete query to kill them all. The problem is that trying to delete this many rows in one shot locks up the table for a long time, which I would like to avoid if possible. The only ways I can see to get rid of these rows, without taking down the site (by locking up the table) are:

    1. Write a script that will execute thousands of smaller delete queries in a loop. This will theoretically get around the locked table issue because other queries will be able to make it into the queue and run in between the deletes. But it will still spike the load on the database quite a bit and will take a long time to run.
    2. Rename the table and recreate the existing table (it'll now be empty). Then do my cleanup on the renamed table. Rename the new table, name the old one back and merge the new rows into the renamed table. This way takes considerably more steps, but should get the job done with minimal interruption. The only tricky part here is that the table in question is a reporting table, so once it's renamed out of the way and the empty one put in its place all historic reports go away until I put it back in place. Plus the merging process could be a bit of a pain because of the type of data being stored. Overall this is my likely choice right now.

    I was just wondering if anyone else has had this problem before and, if so, how you dealt with it without taking down the site and, hopefully, with minimal if any interruption to the users? If I go with number 2, or a different, similar, approach, I can schedule the stuff to run late at night and do the merge early the next morning and just let the users know ahead of time, so that's not a huge deal. I'm just looking to see if anyone has any ideas for a better, or easier, way to do the cleanup.

  • MarkR
    MarkR almost 15 years
    If you use DELETE with LIMIT, you should really use ORDER BY to make the query deterministic; not doing so would have strange effects (including breaking replication in some cases)
  • adam rowe
    adam rowe almost 8 years
    This will definitely delete rows. I'm pretty sure the OP wants to be selective though.
  • bishop
    bishop almost 8 years
    Note that one can't combine DELETE ... JOIN with ORDER BY or LIMIT.
  • user2693017
    user2693017 over 7 years
    This is the only solution that worked for me with a 100 GB table. A select with limit 1000 took just a few milliseconds, but the delete with the same query took an hour for just 1000 records, even though an SSD is in place. Deleting this way is still slow, but at least it's a thousand rows per second rather than per hour.
  • Diogo Medeiros
    Diogo Medeiros about 7 years
    I still have my doubts whether a pivot table isn't the best way, but I made a procedure anyway, just to keep my sanity: hastebin.com/nabejehure.pas
  • simplifiedDB
    simplifiedDB over 6 years
    Deleting 1M records in one go will kill your server.
  • tsauerwein
    tsauerwein over 5 years
    Here is a simple Python script which implements this approach: gist.github.com/tsauerwein/ffb159d1ab95d7fd91ef43b9609c471d
  • Buttle Butkus
    Buttle Butkus over 5 years
    I was able to delete 100,000 records at a time (DELETE FROM table WHERE id <= 100000, then 200000, etc). Each batch took between 30 seconds and 1 minute. But when I previously tried to delete 1,300,000 at once, the query ran for at least 30 minutes before failing with ERROR 2013 (HY000): Lost connection to MySQL server during query. I ran these queries in the MySQL client on the same virtual machine as the server, but maybe the connection timed out.
  • Softlion
    Softlion about 4 years
    for step 4 you can left join to use the index: insert into table_new ... select ... from table left join id_temp_table t on t.temp_id = table.id where t.temp_id is NULL;
  • David Mann
    David Mann over 3 years
    I really like this thinking! I love that it makes sense visually, like a toy a child could understand.
  • khue bui
    khue bui over 3 years
    Why do we have to sleep between iterations?
  • Ng Sek Long
    Ng Sek Long over 3 years
    Regarding "Maybe in a script that sleeps for a second or three": it is now possible to just use MySQL to inject the delay between queries, see this: DELETE FROM ... LIMIT 1000; SELECT SLEEP(5); DELETE FROM ... LIMIT 1000; SELECT SLEEP(5); and so on.
  • Muhammad Omer Aslam
    Muhammad Omer Aslam about 3 years
    is it safe to have a shell script that deletes rows in this manner?
  • Gani Simsek
    Gani Simsek almost 3 years
    This really made the difference for me. Deleting 10K rows in a table that had 5M rows took 5 minutes initially. Then I added ORDER BY id DESC LIMIT 10000 to the delete statement and it took only 1 second. Later I increased the size to 1M at a time. The whole process took 10 minutes.
  • Volksman
    Volksman almost 3 years
    @GaniSimsek I'm always happy to hear of cases where others have benefited from some of my "that's just so crazy it might just work" ideas :)
  • lkraav
    lkraav over 2 years
    Even better CREATE PROCEDURE solution here dba.stackexchange.com/questions/195941/…