How to delete duplicate entries?
Solution 1
For example you could:
CREATE TABLE tmp ...
INSERT INTO tmp SELECT DISTINCT * FROM t;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;
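Concretely, for the scenario in the question (the user_accounts table, deduplicating on every column), the sequence might look like the sketch below. The constraint name is a placeholder. Note that DROP TABLE discards the old table's indexes, constraints, and triggers, so they must be recreated on the renamed copy:

```sql
CREATE TABLE tmp AS SELECT DISTINCT * FROM user_accounts;
DROP TABLE user_accounts;              -- drops indexes, constraints, triggers too
ALTER TABLE tmp RENAME TO user_accounts;
-- recreate dropped objects, e.g. the unique constraint the question asks for:
ALTER TABLE user_accounts
    ADD CONSTRAINT user_accounts_email_key UNIQUE (email);
```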
Solution 2
Some of these approaches seem a little complicated, and I generally do this as:
Given a table "table", I want to make it unique on (field1, field2), keeping the row with the max field3:
DELETE FROM table USING table alias
WHERE table.field1 = alias.field1 AND table.field2 = alias.field2 AND
table.field3 < alias.field3;
For example, I have a table, user_accounts, and I want to add a unique constraint on email, but I have some duplicates. Say also that I want to keep the most recently created one (max id among duplicates).
DELETE FROM user_accounts USING user_accounts ua2
WHERE user_accounts.email = ua2.email AND user_accounts.id < ua2.id;
Note: USING is not standard SQL; it is a PostgreSQL extension (but a very useful one). The original question specifically mentions PostgreSQL, though.
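For portability, the same delete can be written with a correlated subquery instead of USING. This sketch assumes the user_accounts example above and keeps the row with the greatest id per email:

```sql
-- Delete every row for which a "newer" duplicate exists
DELETE FROM user_accounts ua1
WHERE EXISTS (
    SELECT 1
    FROM   user_accounts ua2
    WHERE  ua2.email = ua1.email
    AND    ua2.id    > ua1.id
);
```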
Solution 3
Instead of creating a new table, you can also re-insert unique rows into the same table after truncating it. Do it all in one transaction.
This approach is only useful where there are lots of rows to delete from all over the table. For just a few duplicates, use a plain DELETE.
You mentioned millions of rows. To make the operation fast you want to allocate enough temporary buffers for the session. The setting has to be adjusted before any temp buffer is used in your current session. Find out the size of your table:
SELECT pg_size_pretty(pg_relation_size('tbl'));
Set temp_buffers to at least a bit above that:
SET temp_buffers = 200MB; -- example value
BEGIN;
CREATE TEMP TABLE t_tmp AS -- retains temp for duration of session
SELECT DISTINCT * FROM tbl -- DISTINCT folds duplicates
ORDER BY id; -- optionally "cluster" data
TRUNCATE tbl;
INSERT INTO tbl
SELECT * FROM t_tmp; -- retains order (implementation detail)
COMMIT;
This method can be superior to creating a new table if dependent objects exist: views, indexes, foreign keys, or other objects referencing the table. TRUNCATE makes you begin with a clean slate anyway (new file in the background) and is much faster than DELETE FROM tbl with big tables (DELETE can actually be faster with small tables).
For big tables, it is regularly faster to drop indexes and foreign keys (FK), refill the table and recreate these objects. As far as FK constraints are concerned you have to be certain the new data is valid, of course, or you'll run into exceptions on trying to create the FK.
Note that TRUNCATE requires more aggressive locking than DELETE. This may be an issue for tables with heavy, concurrent load. But it's still less disruptive than dropping and replacing the table completely.
If TRUNCATE is not an option, or generally for small to medium tables, there is a similar technique with a data-modifying CTE (Postgres 9.1+):
WITH del AS (DELETE FROM tbl RETURNING *)
INSERT INTO tbl
SELECT DISTINCT * FROM del
ORDER BY id; -- optionally "cluster" data while being at it.
Slower for big tables, because TRUNCATE is faster there. But it may be faster (and simpler!) for small tables.
If you have no dependent objects at all, you might create a new table and delete the old one, but you hardly gain anything over this universal approach.
For very big tables that would not fit into available RAM, creating a new table will be considerably faster. You'll have to weigh this against possible trouble / overhead with dependent objects.
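One caveat worth noting: SELECT DISTINCT * only folds rows that are identical in every column. If rows are duplicates only on a key subset, DISTINCT ON can pick one survivor per key. A sketch of the same transaction with hypothetical key columns (col1, col2), keeping the largest id per group:

```sql
BEGIN;
CREATE TEMP TABLE t_tmp AS
SELECT DISTINCT ON (col1, col2) *   -- one survivor per (col1, col2)
FROM   tbl
ORDER  BY col1, col2, id DESC;      -- survivor = greatest id in each group
TRUNCATE tbl;
INSERT INTO tbl
SELECT * FROM t_tmp;
COMMIT;
```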
Solution 4
You can use oid or ctid, which are normally "invisible" system columns in the table:
DELETE FROM table
WHERE ctid NOT IN
(SELECT MAX(s.ctid)
FROM table s
GROUP BY s.column_has_be_distinct);
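Two caveats: oid only exists if the table was created WITH OIDS, and that option was removed entirely in PostgreSQL 12, so ctid is the safer choice today; also, NOT IN over a large subquery can be slow on big tables. A sketch of the same idea with ctid and a multi-column uniqueness key (hypothetical column names col1, col2):

```sql
DELETE FROM tbl
WHERE ctid NOT IN (
    SELECT max(s.ctid)        -- keep one physical row per group
    FROM   tbl s
    GROUP  BY s.col1, s.col2
);
```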
Solution 5
The PostgreSQL window function row_number() is handy for this problem.
DELETE FROM tablename
WHERE id IN (SELECT id
FROM (SELECT id,
row_number() over (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename) t
WHERE t.rnum > 1);
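If the table has no unique id column to order by, ctid can serve as the tiebreaker. This sketch (placeholder column names as above) deletes everything but the first physical row of each group:

```sql
DELETE FROM tablename
WHERE ctid IN (
    SELECT ctid
    FROM  (SELECT ctid,
                  row_number() OVER (PARTITION BY column1, column2, column3
                                     ORDER BY ctid) AS rnum
           FROM tablename) t
    WHERE t.rnum > 1
);
```

Once the duplicates are gone, the unique constraint from the question can be added, e.g. ALTER TABLE tablename ADD UNIQUE (column1, column2, column3).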
See Deleting duplicates.
gjrwebber
Updated on July 08, 2022
Comments
- gjrwebber, almost 2 years ago:
I have to add a unique constraint to an existing table. This is fine, except that the table already has millions of rows, and many of them violate the unique constraint I need to add.
What is the fastest approach to removing the offending rows? I have an SQL statement which finds the duplicates and deletes them, but it is taking forever to run. Is there another way to solve this problem? Maybe backing up the table, then restoring after the constraint is added?
- gjrwebber, over 14 years ago: That is what I am currently doing, but it is taking a very long time to run.
- gjrwebber, over 14 years ago: Can you make it distinct for a group of columns? Maybe "SELECT DISTINCT (t.a, t.b, t.c), * FROM t"?
- just somebody, over 14 years ago: DISTINCT ON (a, b, c): postgresql.org/docs/8.2/interactive/sql-select.html
- Randal Schwartz, about 14 years ago: Easier to type: CREATE TABLE tmp AS SELECT ...; Then you don't even need to figure out what the layout of tmp is. :)
- Eric Bowman - abstracto -, over 12 years ago: That second approach is very fast on postgres! Thanks.
- Erwin Brandstetter, over 12 years ago: This answer is actually not very good, for several reasons. @Randal named one. In most cases, especially if you have dependent objects like indexes, constraints, views etc., the superior approach is to use an actual TEMPORARY TABLE, TRUNCATE the original, and re-insert the data.
- just somebody, over 12 years ago: @ErwinBrandstetter: the question asked for the fastest approach. Mass import of data into a table with indexes and constraints is going to take ages. The PostgreSQL manual actually recommends dropping indexes and foreign keys: postgresql.org/docs/9.1/static/populate.html. I'd say your downvote is completely off the mark.
- Erwin Brandstetter, over 12 years ago: You are right about indexes. Dropping & recreating is much faster. But other dependent objects will break, or prevent dropping the table altogether, which the OP would only find out after having made the copy; so much for the "fastest approach". Still, you are right about the downvote. It is unfounded, because it is not a bad answer. It is just not that good. You could have added some pointers about indexes or dependent objects, or a link to the manual like you did in the comment, or any kind of explanation. I guess I got frustrated about how people vote. Removed the downvote.
- xlash, over 11 years ago: I used this approach too. However, it might be personal, but my temp table was deleted and not available after the truncate... Be careful to do those steps only if the temp table was created successfully and is available.
- Erwin Brandstetter, over 11 years ago: @xlash: You can check for its existence to make sure, and either use a different name for the temp table or reuse the one in existence... I added a bit to my answer.
- shreedhar, over 10 years ago: Wouldn't this fail if multiple rows in the table have the same value in column something?
- Erwin Brandstetter, over 10 years ago: For deleting in place, this should be considerably faster: DELETE FROM tbl t WHERE EXISTS (SELECT 1 FROM tbl t1 WHERE t1.dist_col = t.dist_col AND t1.ctid > t.ctid) -- or use any other column or set of columns for sorting to pick a survivor.
- Jordan Arseno, about 10 years ago: WARNING: Be careful, +1 to @xlash -- I had to re-import my data because the temporary table was non-existent after TRUNCATE. As Erwin said, be sure to make sure it exists before truncating your table. See @codebykat's answer.
- Erwin Brandstetter, about 10 years ago: @JordanArseno: I switched to a version without ON COMMIT DROP, so that people who miss the part where I wrote "in one transaction" don't lose data. And I added BEGIN / COMMIT to clarify "one transaction".
- Jordan Arseno, about 10 years ago: Thanks @ErwinBrandstetter
- John, about 10 years ago: @ErwinBrandstetter, is the query you provide supposed to use NOT EXISTS?
- Erwin Brandstetter, about 10 years ago: @John: It must be EXISTS here. Read it like this: "Delete all rows where any other row exists with the same value in dist_col but a bigger ctid". The only survivor per group of dupes will be the one with the biggest ctid.
- Kalanidhi, about 10 years ago: Your explanation is very smart, but you are missing one point: in CREATE TABLE, specify the oid; only then can you access the oid, otherwise an error message is displayed.
- Bhavik Ambani, about 10 years ago: @Kalanidhi Thanks for your comments regarding improvement of the answer; I will take this point into consideration.
- Fopa Léon Constantin, about 10 years ago: @Tim can you better explain what USING does in postgresql?
- Fopa Léon Constantin, about 10 years ago: @ErwinBrandstetter I think this solution is less efficient when there are not that many duplicates to remove from the original table, and it is worst when there are no duplicates at all. Can you provide some improvement, for example to avoid truncating when both t_tmp and the original table have the same number of rows (=> there were no duplicates)? Would DELETE be more suitable for those situations?
- Martin F, about 10 years ago: This really came from postgresql.org/message-id/…
- Shane, almost 10 years ago: This is by far the best answer. Even if you don't have a serial column in your table to use for the id comparison, it's worth it to temporarily add one to use this simple approach.
- Skippy le Grand Gourou, over 9 years ago: Easiest solution if you have only a few duplicated rows. Can be used with LIMIT if you know the number of duplicates.
- Skippy le Grand Gourou, over 9 years ago: I know it doesn't address the OP's issue, who has many duplicates in millions of rows, but it may be helpful anyway.
- Erwin Brandstetter, about 9 years ago: @FopaLéonConstantin: Yes, of course. The suggested procedure only makes sense for deleting large portions of a big table.
- Parker Selbert, about 9 years ago: The USING approach is vastly faster than max comparisons. Great answer.
- bradw2k, about 9 years ago: And using "ctid" instead of "id", this actually works for fully duplicate rows.
- bradw2k, about 9 years ago: This would have to be run once for each duplicate row. shekwi's answer need only be run once.
- sschober, about 9 years ago: @ErwinBrandstetter Your last example is missing an N in DISTINCT (one-character edits are not allowed, at least for me...).
- Erwin Brandstetter, about 9 years ago: @sschober: Thanks, fixed.
- Sergey Tsibel, about 9 years ago: The solution with USING took more than 3 hours on a table with 14 million records. This solution with temp_buffers took 13 minutes. Thanks.
- André C. Andersen, about 9 years ago: @FopaLéonConstantin Will flipping the less-than (<) operator to the greater-than (>) operator leave me with the minimum user_account.id?
- André C. Andersen, about 9 years ago: I just checked. The answer is yes, it will. Using less-than (<) leaves you with only the max id, while greater-than (>) leaves you with only the min id, deleting the rest.
- Rhys van der Waerden, almost 9 years ago: Could this approach cause cascading deletes on other tables with foreign key references to columns in t?
- Arlen Beiler, almost 9 years ago: The second approach is much faster if email is indexed. Like 100X faster.
- Nuno Aniceto, almost 9 years ago: Use "create table X as table Y;" to copy the table data from Y to X (new), then "truncate table X;" to remove the copied data. Makes it easy to abstract from table columns and details, but not so efficient.
- 11101101b, over 8 years ago: This approach also works for MySQL; you just have to restate the 2nd 'user_accounts' like this: DELETE FROM user_accounts USING user_accounts, user_accounts ua2 WHERE user_accounts.email = ua2.email AND user_accounts.id < ua2.id;
- sul4bh, over 8 years ago: You can use the system column 'ctid' if 'oid' gives you an error.
- Jan, over 8 years ago: Great solution. I had to do this for a table with a billion records. I added a WHERE to the inner SELECT to do it in chunks.
- alexkovelsky, about 8 years ago: You could also compare records, which is shorter to write: WHERE (table.field1, table.field2) = (alias.field1, alias.field2)
- alexkovelsky, about 8 years ago: @Shane one can use: WHERE table1.ctid < table2.ctid -- no need to add a serial column
- msciwoj, over 7 years ago: The only universal answer! Works without a self/cartesian JOIN. Worth adding, though, that it's essential to correctly specify the GROUP BY clause: this should be the 'uniqueness criteria' that is violated now, or the key by which you'd like to detect duplicates. If specified wrong, it won't work correctly.
- Tobias, about 6 years ago: I tested it, and it worked; I formatted it for readability. It looks quite sophisticated, but it could use some explanation. How would one change this example for his/her own use case?