DELETE records which do not have a match in another table


Solution 1

I benchmarked four typical queries with different settings for {work_mem, effective_cache_size, random_page_cost}; these settings have the largest influence on the selected plan. I first did a "run-in" with my default settings to warm the cache. Note: the test set is small enough for all needed pages to be present in cache.

The test-set


/************************/
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;

CREATE TABLE one
        ( id SERIAL NOT NULL PRIMARY KEY
        , payload varchar
        );

CREATE TABLE two
        ( id SERIAL NOT NULL PRIMARY KEY
        , one_id INTEGER REFERENCES one
        , payload varchar
        );

INSERT INTO one (payload) SELECT 'Text_' || gs::text FROM generate_series(1,30000) gs;
INSERT INTO two (payload) SELECT 'Text_' || gs::text FROM generate_series(1,30000) gs;


UPDATE two t
SET one_id = o.id
FROM one o
WHERE o.id = t.id
AND random() < 0.1;

INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;
INSERT INTO two (one_id,payload) SELECT one_id,payload FROM two;

VACUUM ANALYZE one;
VACUUM ANALYZE two;
/***************/

The queries:

\echo NOT EXISTS()
EXPLAIN ANALYZE
DELETE FROM one o
WHERE NOT EXISTS ( SELECT * FROM two t
        WHERE t.one_id = o.id
        );

\echo NOT IN()
EXPLAIN ANALYZE 
DELETE FROM one o
WHERE o.id NOT IN ( SELECT one_id FROM two t)
        ;

\echo USING (subquery self LEFT JOIN two where NULL)
EXPLAIN ANALYZE
DELETE FROM one o
USING (
        SELECT o2.id
        FROM one o2
        LEFT JOIN two t ON t.one_id = o2.id
        WHERE t.one_id IS NULL
        ) sq
WHERE sq.id = o.id
        ;

\echo USING (subquery self WHERE NOT EXISTS(two))
EXPLAIN ANALYZE
DELETE FROM one o
USING (
        SELECT o2.id
        FROM one o2
        WHERE NOT EXISTS ( SELECT *
                FROM two t WHERE t.one_id = o2.id
                )
        ) sq
WHERE sq.id = o.id
        ;

The result (summarised)

                            NOT EXISTS()   NOT IN()     USING(LEFT JOIN NULL)   USING(NOT EXISTS)
1) rpc=4.0 csz=1M wmm=64       80.358      14389.026           77.620                 72.917
2) rpc=4.0 csz=1M wmm=64000    60.527         69.104           51.851                 51.004
3) rpc=1.5 csz=1M wmm=64       69.804      10758.480           80.402                 77.356
4) rpc=1.5 csz=1M wmm=64000    50.872         69.366           50.763                 53.339
5) rpc=4.0 csz=1G wmm=64       84.117       7625.792           69.790                 69.627
6) rpc=4.0 csz=1G wmm=64000    49.964         67.018           49.968                 49.380
7) rpc=1.5 csz=1G wmm=64       68.567       3650.008           70.283                 69.933
8) rpc=1.5 csz=1G wmm=64000    49.800         67.298           50.116                 50.345

legend (all times in ms):
rpc := "random_page_cost"
csz := "effective_cache_size"
wmm := "work_mem" (in kB)

As you can see, the NOT IN() variant is very sensitive to a shortage of work_mem. Agreed, a setting of 64 kB is very low, but this *more or less* corresponds to large data sets, which won't fit in hash tables either.
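For reference, a minimal sketch of how one configuration row (here row 3) can be reproduced per session; the values are the ones listed in the legend above:

SET random_page_cost = 1.5;        -- rpc
SET effective_cache_size = '1MB';  -- csz
SET work_mem = '64kB';             -- wmm
-- then run the four EXPLAIN ANALYZE queries above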

EXTRA: during the warm-in phase, the NOT EXISTS() query suffered from extreme FK-trigger contention. This appears to be the result of a conflict with the vacuum daemon, which is still active after the table setup:

PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1, 64-bit
NOT EXISTS()
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Delete on one o  (cost=6736.00..7623.94 rows=27962 width=12) (actual time=80.596..80.596 rows=0 loops=1)
   ->  Hash Anti Join  (cost=6736.00..7623.94 rows=27962 width=12) (actual time=49.174..61.327 rows=27050 loops=1)
         Hash Cond: (o.id = t.one_id)
         ->  Seq Scan on one o  (cost=0.00..463.00 rows=30000 width=10) (actual time=0.003..5.156 rows=30000 loops=1)
         ->  Hash  (cost=3736.00..3736.00 rows=240000 width=10) (actual time=49.121..49.121 rows=23600 loops=1)
               Buckets: 32768  Batches: 1  Memory Usage: 1015kB
               ->  Seq Scan on two t  (cost=0.00..3736.00 rows=240000 width=10) (actual time=0.006..33.790 rows=240000 loops=1)
 Trigger for constraint two_one_id_fkey: time=467720.117 calls=27050
 Total runtime: 467824.652 ms
(9 rows)

Solution 2

First off: your text says:

I would like to delete those orphan records from item_tbl.

But your code says:

DELETE FROM link_tbl lnk ...

Update: On rereading the question, I find it more likely that you want to delete orphaned rows in link_tbl. The row counts point in that direction. @Lucas's query would be correct in this case. But I am afraid NOT EXISTS is actually slower than NOT IN in this case.

To verify, I ran a test case that's remotely like your setup. I couldn't make it much bigger, or SQLfiddle would run into a timeout.

-> SQLfiddle.

NOT EXISTS would be faster for the reversed case. (I tested that, too.) EXISTS is better suited for testing the "many" side. And generally, there is more to gain with EXISTS than with NOT EXISTS - the latter has to check the whole table anyway. It's much harder to prove that something does not exist than that something exists. This universal truth also applies to databases.
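For illustration, a minimal sketch of that reversed case in the question's terms: deleting rows on the "one" side (item_tbl) that have no reference from link_tbl:

DELETE FROM item_tbl itm
WHERE NOT EXISTS ( SELECT 1 FROM link_tbl lnk
        WHERE lnk.item_id = itm.id
        );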

Divide and conquer

This operation is well suited to being split up. Especially if you have concurrent transactions (but even without them), I would consider splitting the DELETE into several slices, so that each transaction can COMMIT after a decent amount of time.

Something like:

DELETE FROM link_tbl l
WHERE  l.item_id <= 100000
AND    l.item_id NOT IN (SELECT i.id FROM item_tbl i);

Then l.item_id BETWEEN 100001 AND 200000, etc.

You cannot automate this with a single function: that would wrap everything into one transaction and defeat the purpose. So you'd have to script it from a client; a minimal sketch follows.
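For instance, with psql (this needs psql 9.6+ for \gexec; with autocommit on, each generated DELETE runs and commits as its own transaction; the slice bounds here are illustrative):

SELECT format(
   'DELETE FROM link_tbl l
    WHERE  l.item_id BETWEEN %s AND %s
    AND    l.item_id NOT IN (SELECT i.id FROM item_tbl i);'
   , lo, lo + 99999)
FROM   generate_series(0, 1700000, 100000) AS g(lo) \gexec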
Or you could use ...

dblink

This additional module lets you run separate transactions in any database, including the one it's running in. That can be done via a persistent connection, which should remove most of the connection overhead. For instructions on how to install it:
How to use (install) dblink in PostgreSQL?
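On PostgreSQL 9.1 or later, installation typically boils down to:

CREATE EXTENSION dblink;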

DO would do the job (PostgreSQL 9.0 or later). Running 100 DELETE commands, 50,000 item_ids at a time:

DO
$$
DECLARE
   _sql text;
BEGIN

PERFORM dblink_connect('port=5432 dbname=mydb');  -- your connection parameters

FOR i IN 0 .. 99   -- 100 slices of 50,000 ids each
LOOP
   _sql := format('
   DELETE FROM link_tbl l
   WHERE  l.item_id BETWEEN %s AND %s
   AND    l.item_id NOT IN (SELECT i.id FROM item_tbl i)'
   , 50000 * i               -- lower bound of the slice
   , 50000 * (i + 1) - 1);   -- upper bound; no overlap with the next slice

   PERFORM dblink_exec(_sql);  -- each DELETE runs in its own transaction
END LOOP;

PERFORM dblink_disconnect();

END
$$;

If the script gets interrupted: the statements run via dblink_exec end up in the DB log (given suitable log settings), so you can see what's been done already.

Solution 3

Perhaps this:

DELETE FROM link_tbl lnk
WHERE NOT EXISTS
  ( SELECT 1 FROM item_tbl item WHERE item.id = lnk.item_id );

When dealing with large numbers of records, it can be much more efficient to create a new table, INSERT INTO ... SELECT only the rows you want to keep, drop the original table, rename the new table, and then add your indexes back...
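A minimal sketch of that rebuild (the table and index names here are made up; it assumes exclusive access to link_tbl, and all indexes, constraints and privileges have to be recreated afterwards):

BEGIN;
CREATE TABLE link_tbl_new AS
SELECT l.*
FROM   link_tbl l
WHERE  EXISTS (SELECT 1 FROM item_tbl itm WHERE itm.id = l.item_id);
-- note: this also drops rows with a NULL item_id; adjust if those should stay
DROP TABLE link_tbl;
ALTER TABLE link_tbl_new RENAME TO link_tbl;
CREATE INDEX link_tbl_item_id_idx ON link_tbl (item_id);  -- hypothetical index
COMMIT;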


Comments

  • miloxe
    miloxe about 4 years

    There are two tables linked by an id:

    item_tbl (id)
    link_tbl (item_id)
    

    There are some records in link_tbl that don't have a matching row in item_tbl. A SELECT counting them would be:

    SELECT COUNT(*)
    FROM link_tbl lnk LEFT JOIN item_tbl itm ON lnk.item_id=itm.id
    WHERE itm.id IS NULL
    

    I would like to delete those orphan records (those without a match in the other table) from link_tbl, but the only way I could think of was:

    DELETE FROM link_tbl lnk
    WHERE lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm)
    

    There are:
    262,086,253 records in link_tbl,
    3,033,811 records in item_tbl, and
    16,844,347 orphan records in link_tbl.
    The server has 4 GB of RAM and an 8-core CPU.

    EXPLAIN DELETE FROM link_tbl lnk
    WHERE lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm)
    

    Returns:

    Delete on link lnk  (cost=0.00..11395249378057.98 rows=131045918 width=6)
    ->  Seq Scan on link lnk  (cost=0.00..11395249378057.98 rows=131045918 width=6)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Materialize  (cost=0.00..79298.10 rows=3063207 width=4)
                 ->  Seq Scan on item itm  (cost=0.00..52016.07 rows=3063207 width=4)
    

    The questions are:

    1. Is there any better way how to delete orphan records from link_tbl?
    2. How accurate is the explain above, or how long could it take to delete those records?

      • Edit: fixed according to Erwin Brandstetter comment.
      • Edit: PostgreSql version is 9.1
      • Edit: some parts of postgresql.config
        1. shared_buffers = 368MB
        2. temp_buffers = 32MB
        3. work_mem = 32MB
        4. maintenance_work_mem = 64MB
        5. max_stack_depth = 6MB
        6. fsync = off
        7. synchronous_commit = off
        8. full_page_writes = off
        9. wal_buffers = 16MB
        10. wal_writer_delay = 5000ms
        11. commit_delay = 10
        12. commit_siblings = 10
        13. effective_cache_size = 1600MB

    Resolution:

    Thank you all for your advice, it was very helpful. I finally used the delete advised by Erwin Brandstetter https://stackoverflow.com/a/15959896/1331340 but I tweaked it a little:

    DELETE FROM link_tbl lnk
    WHERE lnk.item_id BETWEEN 0 AND 10000
      AND lnk.item_id NOT IN (SELECT itm.id FROM item_tbl itm
                              WHERE itm.id BETWEEN 0 AND 10000)
    

    I compared results for NOT IN and NOT EXISTS and the output is below, although I used COUNT instead of DELETE, which should be equivalent for the sake of a relative comparison:

    EXPLAIN ANALYZE SELECT COUNT(*) 
    FROM link_tbl lnk
    WHERE lnk.item_id BETWEEN 0 AND 20000
      AND lnk.item_id NOT IN (SELECT itm.id
                              FROM item_tbl itm
                              WHERE itm.id BETWEEN 0 AND 20000);
    
    QUERY PLAN
    Aggregate  (cost=6002667.56..6002667.57 rows=1 width=0) (actual time=226817.086..226817.088 rows=1 loops=1)
    ->  Seq Scan on link_tbl lnk  (cost=1592.50..5747898.65 rows=101907564 width=0) (actual time=206.029..225289.570 rows=566625 loops=1)
         Filter: ((item_id >= 0) AND (item_id <= 20000) AND (NOT (hashed SubPlan 1)))
         SubPlan 1
           ->  Index Scan using item_tbl_pkey on item_tbl itm  (cost=0.00..1501.95 rows=36221 width=4) (actual time=0.056..99.266 rows=17560 loops=1)
                 Index Cond: ((id >= 0) AND (id <= 20000))
    Total runtime: 226817.211 ms
    
    
    EXPLAIN ANALYZE SELECT COUNT(*)
    FROM link_tbl lnk WHERE lnk.item_id>0 AND lnk.item_id<20000
      AND NOT EXISTS (SELECT 1 FROM item_tbl itm WHERE itm.id=lnk.item_id);
    
    QUERY PLAN
    Aggregate  (cost=8835772.00..8835772.01 rows=1 width=0)
       (actual time=1209235.133..1209235.135 rows=1 loops=1)
    ->  Hash Anti Join  (cost=102272.16..8835771.99 rows=1 width=0)
       (actual time=19315.170..1207900.612 rows=566534 loops=1)
         Hash Cond: (lnk.item_id = itm.id)
         ->  Seq Scan on link_tbl lnk  (cost=0.00..5091076.55 rows=203815128 width=4) (actual time=0.016..599147.604 rows=200301872 loops=1)
               Filter: ((item_id > 0) AND (item_id < 20000))
         ->  Hash  (cost=52016.07..52016.07 rows=3063207 width=4) (actual time=19313.976..19313.976 rows=3033811 loops=1)
               Buckets: 131072  Batches: 4  Memory Usage: 26672kB
               ->  Seq Scan on item_tbl itm  (cost=0.00..52016.07 rows=3063207 width=4) (actual time=0.013..9274.158 rows=3033811 loops=1)
    Total runtime: 1209260.228 ms
    

    NOT EXISTS was 5 times slower.

    The actual delete of the data didn't take as long as I feared; I was able to delete it in 5 batches (10000-20000, 20000-100000, 100000-200000, 200000-1000000 and 1000000-1755441). First I found the max item_id, so I only had to go through half of the id range.

    When I tried NOT IN or NOT EXISTS without the range (with a SELECT COUNT), it didn't even finish; I let it run overnight and it was still running in the morning.

    I think I was looking for DELETE with USING from wildplasser's answer https://stackoverflow.com/a/15988033/1331340 but it came too late.

    DELETE FROM one o
    USING (
        SELECT o2.id
        FROM one o2
        LEFT JOIN two t ON t.one_id = o2.id
        WHERE t.one_id IS NULL
        ) sq
    WHERE sq.id = o.id
        ;
    
  • wildplasser
    wildplasser about 11 years
    Although I favor the NOT EXISTS too, I am afraid that this would yield exactly the same plan.
  • Lucas
    Lucas about 11 years
    @wildplasser I can't speak to PostgreSQL, but in DB2 I got drastically increased performance using EXISTS... Also, here is an answer pertaining to SQL Server to the same effect: stackoverflow.com/a/2065403/516433
  • Lucas
    Lucas about 11 years
    @wildplasser, though checking out your profile, you appear much more qualified than I when it comes to postgresql :)
  • wildplasser
    wildplasser about 11 years
    I know. (Remember: I am a fan of NOT EXISTS, too!) NOT EXISTS is a primitive, and older plan generators used to have problems with the duplicate (and NULL) removal of IN(), often forcing an extra sort pass for the results of the subquery. The hash-thing solves it all, once the subquery can be detected, analysed and merged. BTW: I am not more qualified. I tend to go by my intuition. As if I were qualified ...
  • Erwin Brandstetter
    Erwin Brandstetter about 11 years
    @wildplasser: I ran some tests. The plans for NOT EXISTS and NOT IN were never the same. But NOT IN is actually faster for checking the "1" side of a 1:n relation. I wrote more in my updated answer.
  • miloxe
    miloxe about 11 years
    Thank you for your efforts, I used your delete but I added the range into the other select as well as without it it was too slow. You may find my comments in my edited question.
  • miloxe
    miloxe about 11 years
    Thanks for your answer, I compared NOT EXISTS ans NOT IN and NOT EXISTS was slower for my data set. You can find my results in my edited question.
  • miloxe
    miloxe about 11 years
    Thanks for you efforts, I think I was looking for DELETE with USING but in time when you posted your answer I had the data deleted already. See my results in my edited original question.
  • wildplasser
    wildplasser about 11 years
    You're welcome. This whole exercise is more or less intended for future readers. But it is inspired by my hidden agenda: IN (subquery) is inferior to NOT EXISTS (correlated subquery). As we speak, I am testing with 300K records, but the run-in is rather slow... BTW: I think that my test rig is less specific: yours will delete about 1/16 of the rows, mine zero. (The distribution/sparseness of the affected (foreign) keys can also be a factor.)
  • wildplasser
    wildplasser about 11 years
    Well, thanks, but: 1) it is about Oracle. 2) It is 13 years old. 3) It is not true. 4) Check the output of EXPLAIN (ANALYZE); NOT EXISTS() is mostly transformed into an anti-join, which will use indexes when appropriate, and a hashtable (given enough work_mem).
  • Erwin Brandstetter
    Erwin Brandstetter about 11 years
    Right, repeat the condition in the NOT IN subquery. I tried with NOT EXISTS first, where it wouldn't help. Didn't think to optimize NOT IN later.
  • miloxe
    miloxe about 11 years
    @wildplasser is 32MB of work_mem enough? Or maybe even better, if it's not too much to ask, maybe you could have a look at the postgres configuration in my post an make some suggestions?
  • wildplasser
    wildplasser about 11 years
    It is not about being enough, it is about seducing the optimiser into different behaviours. WRT configuration: I am missing random_page_cost. Lowering it often changes query plans dramatically, for example by favouring index scans.
  • miloxe
    miloxe about 11 years
    @wildplasser #random_page_cost = 4.0 I guess the default value 4.0 is used.
  • wildplasser
    wildplasser about 11 years
    Setting random_page_cost lower (~ 1.5) is a way to seduce the planner into using indexes. Please note that using indexes can cause more pages to be fetched; this depends on the distribution/spread of your keys and the fraction of rows that the planner thinks you intend to hit.
  • tsionyx
    tsionyx about 7 years
    I get a much faster result by using an intermediate table: CREATE TABLE _del_ids AS SELECT o2.id FROM one o2 LEFT JOIN two t ON t.one_id = o2.id WHERE t.one_id IS NULL; DELETE FROM one o USING _del_ids t WHERE t.id = o.id; DROP TABLE _del_ids;