Best way to delete millions of rows by ID


Solution 1

It all depends ...

  • This assumes no concurrent write access to the involved tables; otherwise you may have to lock tables exclusively, or this route may not be for you at all.

  • Drop all indexes (except possibly the ones needed for the delete itself).
    Recreate them afterwards. That's typically much faster than incremental updates to indexes.

  • Check if you have triggers that can safely be deleted / disabled temporarily.

  • Do foreign keys reference your table? Can they be dropped? Dropped temporarily?

  • Depending on your autovacuum settings it may help to run VACUUM ANALYZE before the operation.

  • Some of the points listed in the related manual chapter Populating a Database may also be of use, depending on your setup.

  • If you delete large portions of the table and the rest fits into RAM, the fastest and easiest way may be this:

BEGIN; -- typically faster and safer wrapped in a single transaction

SET LOCAL temp_buffers = '1000MB'; -- enough to hold the temp table

CREATE TEMP TABLE tmp AS
SELECT t.*
FROM   tbl t
LEFT   JOIN del_list d USING (id)
WHERE  d.id IS NULL;      -- copy surviving rows into temporary table
-- ORDER BY ?             -- optionally order favorably while being at it

TRUNCATE tbl;             -- empty table - truncate is very fast for big tables

INSERT INTO tbl
TABLE tmp;        -- insert back surviving rows.

COMMIT;

This way you don't have to recreate views, foreign keys or other depending objects. And you get a pristine (sorted) table without bloat.

Read about the temp_buffers setting in the manual. This method is fast as long as the table fits into memory, or at least most of it. The transaction wrapper defends against losing data if your server crashes in the middle of this operation.

Run VACUUM ANALYZE afterwards. Or (typically not necessary after going the TRUNCATE route) VACUUM FULL ANALYZE to bring the table to minimum size (takes an exclusive lock). For big tables, consider the alternatives CLUSTER / pg_repack or similar.
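For illustration, a minimal sketch of the CLUSTER route, assuming the table has a primary-key index named tbl_pkey (the index name is an assumption):

```sql
-- Rewrite the whole table in index order; takes an ACCESS EXCLUSIVE lock
CLUSTER tbl USING tbl_pkey;

-- Refresh planner statistics after the rewrite
ANALYZE tbl;
```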

For small tables, a simple DELETE instead of TRUNCATE is often faster:

DELETE FROM tbl t
USING  del_list d
WHERE  t.id = d.id;

Read the Notes section for TRUNCATE in the manual. In particular (as Pedro also pointed out in his comment):

TRUNCATE cannot be used on a table that has foreign-key references from other tables, unless all such tables are also truncated in the same command. [...]

And:

TRUNCATE will not fire any ON DELETE triggers that might exist for the tables.

Solution 2

I just hit this issue myself, and for me the fastest method by far was using a WITH query in combination with USING.

Basically, the WITH query builds a temporary result set holding the primary keys of the rows you want to delete from the target table.

WITH to_delete AS (
   SELECT item_id FROM other_table WHERE condition_x = true
)
DELETE FROM table 
USING to_delete 
WHERE table.item_id = to_delete.item_id 
  AND NOT to_delete.item_id IS NULL;

Of course the SELECT inside the WITH query can be as complex as any other select, with multiple joins etc. It just has to return one or more columns that identify the rows in the target table that need to be deleted.

NOTE: AND NOT to_delete.item_id IS NULL is most likely unnecessary, since the equality condition in the WHERE clause already excludes NULLs, but I didn't dare to try.

Other things to consider:

  1. Create indexes on the columns of other tables that reference this one via foreign key. In certain situations this can reduce a delete from hours to mere seconds.
  2. Defer constraint checks: it's not clear how much improvement, if any, this achieves, but reportedly it can increase performance. The downside is that if you have a foreign-key violation, you will only learn about it at the very last moment.
  3. DANGEROUS, but a big possible boost: disable constraint checks and triggers during the delete.
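A hedged sketch of points 2 and 3, assuming the constraints were declared DEFERRABLE and that a helper table del_list(item_id) holds the IDs (all names here are placeholders, not from the original answer):

```sql
BEGIN;

-- Point 2: postpone checking of DEFERRABLE constraints until COMMIT
SET CONSTRAINTS ALL DEFERRED;

-- Point 3 (DANGEROUS): disable all triggers on the table, including the
-- internal ones that enforce foreign keys; disabling internal constraint
-- triggers requires superuser privileges
ALTER TABLE tbl DISABLE TRIGGER ALL;

DELETE FROM tbl
USING  del_list d
WHERE  tbl.item_id = d.item_id;

ALTER TABLE tbl ENABLE TRIGGER ALL;

COMMIT;
```

Note that with triggers disabled, nothing cleans up or validates referencing rows; you take over that responsibility.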

Solution 3

We know the update/delete performance of PostgreSQL is not as strong as Oracle's. Deleting millions or tens of millions of rows is genuinely difficult and takes a long time.

However, we can still do this in production databases. Here is my idea:

First, create a log table with two columns: id and flag (id refers to the ID you want to delete; flag can be Y or NULL, with Y signifying that the record was successfully deleted).

Later, create a function that performs the delete in batches of 10,000 rows. You can see more details on my blog. Though it's in Chinese, you can still get the information you want from the SQL code there.

Make sure the id columns of both tables are indexed, as that makes it run faster.
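The batching idea can be sketched as a DO block; the table name tbl, log table del_log, and the exact loop shape are assumptions, not the author's actual code:

```sql
-- Delete in batches of 10,000, marking processed IDs in the log table.
-- Note: a DO block runs in a single transaction; to actually commit
-- between batches, drive the loop from a client or a procedure instead.
DO $$
DECLARE
    batch_ids bigint[];
BEGIN
    LOOP
        -- pick the next batch of not-yet-processed IDs
        SELECT array_agg(id) INTO batch_ids
        FROM  (SELECT id FROM del_log WHERE flag IS NULL LIMIT 10000) sub;

        EXIT WHEN batch_ids IS NULL;

        DELETE FROM tbl WHERE id = ANY (batch_ids);

        -- mark the batch as done
        UPDATE del_log SET flag = 'Y' WHERE id = ANY (batch_ids);
    END LOOP;
END
$$;
```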

Solution 4

You may try copying all the data in the table, except the IDs you want to delete, into a new table, then renaming and swapping the tables (provided you have enough resources to do it).

This is not expert advice.
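A sketch of the copy-and-swap approach described above, with assumed names (tbl, del_list, tbl_new, tbl_old):

```sql
BEGIN;

-- Copy every row whose id is not on the delete list into a new table
CREATE TABLE tbl_new AS
SELECT t.*
FROM   tbl t
WHERE  NOT EXISTS (SELECT 1 FROM del_list d WHERE d.id = t.id);

-- Swap the tables. Indexes, constraints, defaults and views on tbl do
-- NOT follow automatically; they must be recreated on the new table.
ALTER TABLE tbl     RENAME TO tbl_old;
ALTER TABLE tbl_new RENAME TO tbl;

COMMIT;

-- DROP TABLE tbl_old;   -- once the result has been verified
```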

Solution 5

Two possible answers:

  1. Your table may have many constraints or triggers attached to it. Each deleted record then incurs extra processor cycles and checks against other tables.

  2. You may need to put this statement inside a transaction.
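For point 2, wrapping the delete in an explicit transaction looks like this (tbl and del_list are placeholder names):

```sql
BEGIN;

DELETE FROM tbl
WHERE  id IN (SELECT id FROM del_list);

COMMIT;  -- or ROLLBACK; if the affected row count looks wrong
```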


Author: Anthony Greco

Updated on February 04, 2022

Comments

  • Anthony Greco
    Anthony Greco about 2 years

    I need to delete about 2 million rows from my PG database. I have a list of IDs that I need to delete. However, any way I try to do this is taking days.

    I tried putting them in a table and doing it in batches of 100. 4 days later, this is still running with only 297,268 rows deleted. (I had to select 100 IDs from an ID table, delete WHERE IN that list, then delete those 100 from the IDs table.)

    I tried:

    DELETE FROM tbl WHERE id IN (select * from ids)
    

    That's taking forever, too. It's hard to gauge how long, since I can't see its progress until it's done, but the query was still running after 2 days.

    Just kind of looking for the most effective way to delete from a table when I know the specific IDs to delete, and there are millions of IDs.

    • hansvb
      hansvb over 12 years
      How many rows will be left? An alternative would be to select the remaining rows into a work table, and then rename tables.
  • Anthony Greco
    Anthony Greco over 12 years
    I am really trying to avoid this, because then I will just have to redo the process on all of its foreign keys, but I very well may have to. Thanks
  • Anthony Greco
    Anthony Greco over 12 years
    1. I do have constraints (foreign keys) that are auto-deleted when a row in the table is deleted
  • Anthony Greco
    Anthony Greco over 12 years
    Unfortunately I do have a few foreign keys; however, I may do what you suggested by killing all keys / deleting / recreating. It is taking more time not doing this than it would to just do it. Thanks!
  • Erwin Brandstetter
    Erwin Brandstetter over 12 years
    @AnthonyGreco: You could delete the foreign keys and recreate afterwards. Of course you'll have to take care of references to deleted rows, too. And referential integrity is not guaranteed during this window.
  • Anthony Greco
    Anthony Greco over 12 years
    Surely was not what I wanted to do, but deleting the indexes made my deletes fly... Now I just gotta do this on all linked tables to delete linked rows, but hell, it beats all the time I spent trying to get it to work without
  • Anthony Greco
    Anthony Greco over 12 years
    Well, I was basically doing that kind of logic to do it in batches; however, it was taking far too long because of my indexes. I finally dropped all my indexes (something I did not want to do) and the rows got purged quick as hell. Now building all my indexes back up. Thanks though!
  • Anthony Greco
    Anthony Greco over 12 years
    Turned out no matter what I tried, it was the keys that were killing me. Even just 15 was taking a minute or so; that's why I only did 100. Once I killed the indexes, it flew. Thanks though!
  • Erwin Brandstetter
    Erwin Brandstetter over 12 years
    @AnthonyGreco: Cool! Don't forget to recreate those indexes afterwards that you still need.
  • Simon Lepkin
    Simon Lepkin about 8 years
    Is it safe to do such a huge DELETE in one transaction, if the other rows (the ones not tagged for delete) are accessed frequently?
  • Erwin Brandstetter
    Erwin Brandstetter about 8 years
    @SimonLepkin: Please ask a new question (providing all relevant details). Comments are not the place. You can always link to this one for context.
  • Simon Lepkin
    Simon Lepkin about 8 years
    @ErwinBrandstetter Done: stackoverflow.com/questions/35188911/…
  • nclu
    nclu over 6 years
    Depending on the number of rows to be kept and how tricky other foreign keys are, this can work. Can also copy good rows to temp. Truncate current table. Then copy back from temp.
  • Pencilcheck
    Pencilcheck about 6 years
    TRUNCATE helps! Thanks a lot!
  • Pedro Borges
    Pedro Borges about 6 years
    This is a great solution; I would just add that it ignores delete cascades, in case that's not obvious to someone.
  • Cássio
    Cássio almost 5 years
    Very, very clever. My tbl has 60 million records and del_list has 56 million records. Doing it this way took me less than 3 minutes. Doing it like the original question, I had to abort after 24 hours of running without finishing. It's a huge difference.
  • Mikko Rantalainen
    Mikko Rantalainen over 4 years
    Try explain (analyze,buffers,timing) ... and figure out which indexes you're missing.
  • Torge
    Torge almost 4 years
    You can even create multiple such tables that reference each other, as I had to do in one case where I wanted to delete all rows that were orphans and not referenced by any other table anymore. (WITH existing_items AS ( ... ), to_delete AS ( SELECT item_id FROM table LEFT JOIN existing_items e ON table.item_id = e.item_id WHERE e.item_id IS NULL ) DELETE FROM ...)