Remove duplicate rows in MySQL

Solution 1

A really easy way to do this is to add a UNIQUE index on the 3 columns. When you write the ALTER statement, include the IGNORE keyword. Like so:

ALTER IGNORE TABLE jobs
ADD UNIQUE INDEX idx_name (site_id, title, company);

This will drop all the duplicate rows. As an added benefit, future INSERTs that are duplicates will error out. As always, you may want to take a backup before running something like this... Note that ALTER IGNORE TABLE was removed in MySQL 5.7.4 (see the comments below), so this only works on older versions.
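
For example, here's a minimal sketch (assuming the jobs table from the question) of how a later duplicate INSERT fails once the index exists:

INSERT INTO jobs (site_id, title, company)
VALUES (1, 'Developer', 'Acme');

-- Running the same INSERT a second time now fails with something like:
-- ERROR 1062 (23000): Duplicate entry '1-Developer-Acme' for key 'idx_name'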

Solution 2

If you don't want to alter the table, you can use the query below.

Since you have a column with unique IDs (e.g., an auto_increment column), you can use it to remove the duplicates:

DELETE `a`
FROM
    `jobs` AS `a`,
    `jobs` AS `b`
WHERE
    -- IMPORTANT: Ensures one version remains
    -- Change "ID" to your unique column's name
    `a`.`ID` < `b`.`ID`

    -- Any duplicates you want to check for
    AND (`a`.`title` = `b`.`title` OR `a`.`title` IS NULL AND `b`.`title` IS NULL)
    AND (`a`.`company` = `b`.`company` OR `a`.`company` IS NULL AND `b`.`company` IS NULL)
    AND (`a`.`site_id` = `b`.`site_id` OR `a`.`site_id` IS NULL AND `b`.`site_id` IS NULL);

In MySQL, you can simplify it even more with the NULL-safe equal operator (aka "spaceship operator"):

DELETE `a`
FROM
    `jobs` AS `a`,
    `jobs` AS `b`
WHERE
    -- IMPORTANT: Ensures one version remains
    -- Change "ID" to your unique column's name
    `a`.`ID` < `b`.`ID`

    -- Any duplicates you want to check for
    AND `a`.`title` <=> `b`.`title`
    AND `a`.`company` <=> `b`.`company`
    AND `a`.`site_id` <=> `b`.`site_id`;
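
Before running either DELETE, you can preview exactly which rows would be removed by swapping the DELETE for a SELECT (a sketch using the same conditions):

SELECT `a`.`ID`
FROM
    `jobs` AS `a`,
    `jobs` AS `b`
WHERE
    `a`.`ID` < `b`.`ID`
    AND `a`.`title` <=> `b`.`title`
    AND `a`.`company` <=> `b`.`company`
    AND `a`.`site_id` <=> `b`.`site_id`;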

Solution 3

MySQL has restrictions about referring to the table you are deleting from. You can work around that with a temporary table, like:

create temporary table tmpTable (id int);

insert  into tmpTable
        (id)
select  id
from    YourTable yt
where   exists
        (
        select  *
        from    YourTable yt2
        where   yt2.title = yt.title
                and yt2.company = yt.company
                and yt2.site_id = yt.site_id
                and yt2.id > yt.id
        );

delete  
from    YourTable
where   ID in (select id from tmpTable);

From Kostanos' suggestion in the comments:
The only slow query above is the DELETE, for cases where you have a very large database. This query could be faster:

DELETE FROM YourTable USING YourTable, tmpTable WHERE YourTable.id=tmpTable.id
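
As later comments point out, both the INSERT and the DELETE can be slow on large tables, and indexing tmpTable helps a lot. A sketch based on those suggestions:

-- either create the table with a primary key up front...
create temporary table tmpTable (id int, PRIMARY KEY (id));

-- ...or add an index to an existing tmpTable
create index tmpTable_id_index on tmpTable (id);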

Solution 4

Deleting duplicates on MySQL tables is a common issue that's generally the result of a missing constraint to avoid those duplicates beforehand. But this common issue usually comes with specific needs that require specific approaches. The approach should be different depending on, for example, the size of the data, the duplicated entry that should be kept (generally the first or the last one), whether there are indexes to be kept, or whether we want to perform any additional action on the duplicated data.

There are also some specificities in MySQL itself, such as not being able to reference the same table in a FROM clause when performing a table UPDATE (it'll raise MySQL error #1093). This limitation can be overcome by using an inner query with a temporary table (as suggested in some approaches above). But this inner query won't perform particularly well when dealing with big data sources.

However, a better approach exists to remove duplicates, one that's both efficient and reliable, and that can easily be adapted to different needs.

The general idea is to create a new temporary table, usually adding a unique constraint to avoid further duplicates, and to INSERT the data from your former table into the new one, while taking care of the duplicates. This approach relies on simple MySQL INSERT queries, creates a new constraint to avoid further duplicates, and skips the need for an inner query to search for duplicates and for a temporary table that would have to be kept in memory (thus fitting big data sources too).

This is how it can be achieved. Given we have a table employee, with the following columns:

employee (id, first_name, last_name, start_date, ssn)

To delete the rows with a duplicate ssn column, keeping only the first entry found, the following process can be used:

-- create a new tmp_employee table
CREATE TABLE tmp_employee LIKE employee;

-- add a unique constraint
ALTER TABLE tmp_employee ADD UNIQUE(ssn);

-- scan over the employee table to insert employee entries
INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id;

-- rename tables
RENAME TABLE employee TO backup_employee, tmp_employee TO employee;

Technical explanation

  • Line #1 creates a new tmp_employee table with exactly the same structure as the employee table
  • Line #2 adds a UNIQUE constraint to the new tmp_employee table to avoid any further duplicates
  • Line #3 scans over the original employee table by id, inserting new employee entries into the new tmp_employee table, while ignoring duplicated entries
  • Line #4 renames tables, so that the new employee table holds all the entries without the duplicates, and a backup copy of the former data is kept on the backup_employee table
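
Before dropping the backup, a quick sanity check (a minimal sketch) is to compare row counts between the two tables:

SELECT
    (SELECT COUNT(*) FROM backup_employee) AS rows_before,
    (SELECT COUNT(*) FROM employee) AS rows_after;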

Using this approach, 1.6M rows were reduced to 6k in less than 200 seconds.

Chetan, following this process, you could quickly and easily remove all your duplicates and create a UNIQUE constraint by running:

CREATE TABLE tmp_jobs LIKE jobs;

ALTER TABLE tmp_jobs ADD UNIQUE(site_id, title, company);

INSERT IGNORE INTO tmp_jobs SELECT * FROM jobs ORDER BY id;

RENAME TABLE jobs TO backup_jobs, tmp_jobs TO jobs;

Of course, this process can be further modified to adapt it for different needs when deleting duplicates. Some examples follow.

✔ Variation for keeping the last entry instead of the first one

Sometimes we need to keep the last duplicated entry instead of the first one.

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id DESC;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
  • On line #3, the ORDER BY id DESC clause gives the rows with the highest IDs priority over the rest

✔ Variation for performing some tasks on the duplicates, for example keeping a count on the duplicates found

Sometimes we need to perform some further processing on the duplicated entries that are found (such as keeping a count of the duplicates).

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;

INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn) SELECT * FROM employee ORDER BY id ON DUPLICATE KEY UPDATE n_duplicates=n_duplicates+1;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
  • On line #3, a new column n_duplicates is created
  • On line #4, the INSERT INTO ... ON DUPLICATE KEY UPDATE query is used to perform an additional update when a duplicate is found (in this case, increasing a counter). Note the explicit column list, which is needed because tmp_employee now has the extra n_duplicates column. The INSERT INTO ... ON DUPLICATE KEY UPDATE query can be used to perform different types of updates for the duplicates found.
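
After the rename, the duplicate counts can be inspected directly in the new employee table, for example (a minimal sketch):

-- list the SSNs that had duplicates, with their counts
SELECT ssn, n_duplicates FROM employee WHERE n_duplicates > 0;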

✔ Variation for regenerating the auto-incremental field id

Sometimes we use an auto-incremental field and, in order to keep the index as compact as possible, we can take advantage of the deletion of the duplicates to regenerate the auto-incremental field in the new temporary table.

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

INSERT IGNORE INTO tmp_employee (first_name, last_name, start_date, ssn) SELECT first_name, last_name, start_date, ssn FROM employee ORDER BY id;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;
  • On line #3, instead of selecting all the fields in the table, the id field is skipped (note the explicit column lists) so that the DB engine generates new ones automatically
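
A quick way to verify the regenerated sequence (a sketch; it assumes the ids were freshly generated starting at 1, so MAX(id) should equal the row count):

SELECT MAX(id) AS max_id, COUNT(*) AS row_count FROM employee;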

✔ Further variations

Many further modifications are possible, depending on the desired behavior. As an example, the following queries use a second temporary table to 1) keep the last entry instead of the first one; 2) increase a counter on the duplicates found; and 3) regenerate the auto-incremental field id while keeping the entry order as it was in the former data.

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;

INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn) SELECT * FROM employee ORDER BY id DESC ON DUPLICATE KEY UPDATE n_duplicates=n_duplicates+1;

CREATE TABLE tmp_employee2 LIKE tmp_employee;

INSERT INTO tmp_employee2 (first_name, last_name, start_date, ssn, n_duplicates) SELECT first_name, last_name, start_date, ssn, n_duplicates FROM tmp_employee ORDER BY id;

DROP TABLE tmp_employee;

RENAME TABLE employee TO backup_employee, tmp_employee2 TO employee;

Solution 5

If ALTER IGNORE doesn't work, as in my case, you can use the statements below:

CREATE TABLE your_table_deduped LIKE your_table;

INSERT INTO your_table_deduped
SELECT *
FROM your_table
GROUP BY index1_id,
         index2_id;

RENAME TABLE your_table TO your_table_with_dupes;

RENAME TABLE your_table_deduped TO your_table;

#OPTIONAL
ALTER TABLE `your_table` ADD UNIQUE `unique_index` (`index1_id`, `index2_id`);

#OPTIONAL
DROP TABLE your_table_with_dupes;
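
Note that on MySQL 5.7+ the default ONLY_FULL_GROUP_BY sql_mode rejects the SELECT * ... GROUP BY query above (see the error quoted in the comments). One workaround, sketched here, is to relax the mode for the current session before running the INSERT:

SET SESSION sql_mode = (SELECT REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', ''));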

Comments

  • Chetan
    Chetan about 2 years

    I have a table with the following fields:

    id (Unique)
    url (Unique)
    title
    company
    site_id
    

    Now, I need to remove rows having the same title, company, and site_id. One way to do it would be using the following SQL along with a script (PHP):

    SELECT title, site_id, location, id, count( * ) 
    FROM jobs
    GROUP BY site_id, company, title, location
    HAVING count( * ) >1
    

    After running this query, I can remove duplicates using a server-side script.

    But, I want to know if this can be done only using SQL query.

  • OMG Ponies
    OMG Ponies almost 14 years
    That won't work if there are more than two duplicates in a group.
  • Andomar
    Andomar almost 14 years
    Unfortunately, MySQL does not allow you to select from the table you are deleting from: ERROR 1093: You can't specify target table 'Table' for update in FROM clause
  • OMG Ponies
    OMG Ponies almost 14 years
    Interesting, but the assumptions the IGNORE clause makes for removing those duplicates are a concern that might not match needs. Do incorrect values being truncated to the closest acceptable match sound good to you?
  • Ezz Elkady
    Ezz Elkady almost 14 years
    In this particular case, that's definitely true. The collation of the title and company columns definitely matter. What, exactly, does incorrect values mean? I smell another question...
  • a coder
    a coder over 11 years
    @andomar, this works fine except when one of the fields in the where clause contains NULLs. Example: sqlfiddle.com/#!2/983f3/1
  • DarkMantis
    DarkMantis over 11 years
    Just for the record, if you're using InnoDB then you may have an issue with it; there is a known bug about using ALTER IGNORE TABLE with InnoDB databases.
  • Jordan Arseno
    Jordan Arseno over 11 years
    The aforementioned bug @DarkMantis referred to and its solution.
  • DarkMantis
    DarkMantis over 11 years
    @JordanArseno Yeah that's not really a solution as much as masking the problem.
  • Jordan Arseno
    Jordan Arseno over 11 years
    @DarkMantis I agree, but it's a temporary solution that worked for me, and will work for others... It is link-worthy, at least.
  • Cassio
    Cassio almost 11 years
    Is the Insert SQL an expensive one? I'm wondering because it times out in my MySQL database.
  • magdmartin
    magdmartin over 10 years
    Works great if you have an InnoDB setup with foreign key constraints.
  • Jeshurun
    Jeshurun over 10 years
    As Trinity would say, "That's a neat trick!". Just a note that this really messes up tables using columns in this one as a foreign key. I ended up having to dump the original contents into a temporary table, apply this to the original table, then clean up the referencing tables using the temporary tables (using other columns as composite keys) to match the correct id in the original table.
  • FireBear
    FireBear over 10 years
    As noted above, you cannot use this approach if your table type is InnoDB; in this case you can change the table type to MyISAM with the SQL query: ALTER TABLE table_name ENGINE = MYISAM
  • Kostanos
    Kostanos over 10 years
    The only slow query here is the DELETE one, in cases where you have a big database. This query could be faster: DELETE FROM YourTable USING YourTable, tmpTable WHERE YourTable.id=tmpTable.id
  • shock_one
    shock_one about 10 years
    For InnoDB tables execute the following query first: set session old_alter_table=1;
  • Basilevs
    Basilevs over 9 years
    @magdmartin, but won't foreign constraints prevent table deletion?
  • Desty
    Desty over 9 years
    It seems to work with InnoDB tables now; I was able to do this without problems (or so it seems) yesterday without changing the table engine. Perhaps the bug has since been fixed?
  • Mauvis Ledford
    Mauvis Ledford about 9 years
    IGNORE statement didn't work for me and this worked great on deduping 5 million records. Cheers.
  • Nassim
    Nassim about 9 years
    This solution is not working properly. I tried to make some duplicate records and it does something like (20 rows affected), but if you run it again it will show you (4 rows affected), and so on until you reach (0 rows affected), which is kinda suspicious. Here is what works best for me; it's almost the same, but it works in one run. I edited the solution.
  • timctran
    timctran almost 9 years
    Why did you take the union and not just SELECT * FROM jobs GROUP BY site_id, company, title, location?
  • Ray Baxter
    Ray Baxter over 8 years
    This is no longer supported in 5.7.4, dev.mysql.com/doc/refman/5.7/en/alter-table.html
  • Lawrence Dol
    Lawrence Dol over 8 years
    How is this different from @rehriff's answer, which he submitted 6 months earlier?
  • Lawrence Dol
    Lawrence Dol over 8 years
    @Nassim: You must be doing something different from this answer because it works perfectly for me (in MySQL).
  • Mostafa -T
    Mostafa -T over 8 years
    @LawrenceDol I guess it's a bit more readable, and also his answer was not the same at the time I answered; I think his answer got edited.
  • lepe
    lepe over 8 years
    To solve the "You can't specify target table 'Table' for update in FROM..." error, use: DELETE FROM Table WHERE Table.idTable IN ( SELECT MAX(idTable) FROM (SELECT * FROM Table) AS tmp GROUP BY field1, field2, field3 HAVING COUNT(*) > 1), which forces MySQL to create a temporary table. However, it is very slow on large datasets... in such cases, I would recommend Andomar's code, which is much faster.
  • Max
    Max about 8 years
    This is very bad form: database tasks should be done in the DB, where they are much, much faster, instead of constantly sending data back and forth between PHP and MySQL because you know one better than the other.
  • booshong
    booshong over 7 years
    For anyone that was confused like me, the NULL comparison terms are needed because NULL does not equal NULL in MySQL. If the relevant columns are guaranteed to not be NULL, you can leave these terms out.
  • uzor
    uzor about 7 years
    worked! but with ... $mm = "set session old_alter_table=1; "; $rmm = mysql_query($mm);
  • Sana
    Sana about 7 years
    Your 4th line should say TRUNCATE TABLE tableName and 5th line should say INSERT INTO tableName SELECT * FROM tempTableName;
  • Swagdaddymuffin
    Swagdaddymuffin over 6 years
    Error Code: 1055. Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'dub.id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
  • artemiuz
    artemiuz over 6 years
    you could disable "hard control" with sql_mode, see stackoverflow.com/questions/23921117/disable-only-full-group-by
  • iaforek
    iaforek about 6 years
    Worked really well with MySQL 5.6.37!
  • Robin31
    Robin31 about 6 years
    This worked really well in MySQL 5.7, where the accepted solution doesn't work anymore
  • ako
    ako almost 6 years
    Does not work for columns of type TEXT, as those columns cannot be used as UNIQUE keys.
  • wheelerswebservices
    wheelerswebservices almost 6 years
    This worked for me to remove 19,384 records in 131 seconds. The 'accepted' solution above is giving me an Invalid Syntax Error due to the version of MySQL I am using.
  • that-ben
    that-ben over 5 years
    Yes, the accepted answer is no longer valid since MySQL 5.7, so this should really be the accepted answer, as it's universal and does not require temporary table creation either.
  • ToolmakerSteve
    ToolmakerSteve about 5 years
    VERY SLOW if there are MANY copies of a given record (e.g. 100 to be reduced to 1), and many records with that condition. Recommend stackoverflow.com/a/4685232/199364 instead. IMHO, ALWAYS use the linked approach; it's an inherently faster technique.
  • Luc
    Luc about 5 years
    Doesn't this compare every row against every other row? In a table with a few million records, this would take approximately forever. Using this was much faster somehow: INSERT INTO newtable SELECT a.* FROM oldtable AS a GROUP BY a.firstcolumn, a.secondcolumn, etc. (of course not including the primary key in the "group by" part).
  • Vinny
    Vinny about 5 years
    Since it no longer works on MySQL 5.7.4 or newer, what's the alternative?
  • Jiezhi.G
    Jiezhi.G about 5 years
    @Kostanos Not just the DELETE, but also the INSERT into the temporary table took me a long time. So an index on the tmp table could help a lot: create index tmpTable_id_index on tmpTable (id), at least for me.
  • Dallas Clarke
    Dallas Clarke almost 5 years
    If your tables are large, it's worthwhile adding an index with: create temporary table tmpTable (id int, PRIMARY KEY (id));
  • SuB
    SuB over 4 years
    Hmm, it takes too long for me even though the number of records is not big!
  • Anna Gabrielyan
    Anna Gabrielyan almost 4 years
    Why is this not working on MySQL? It gives Unexpected token near t1
  • Rico Nguyen
    Rico Nguyen almost 4 years
    I used it many times and it always worked. Show me your actual script
  • flow2k
    flow2k over 3 years
    Interesting. Alternatively, for performance, we could create a temporary table with the subquery, but filtering for rn = 1. Then we drop the original and rename.
  • flow2k
    flow2k over 3 years
    @Vinny To me, the closest alternative is the approach using INSERT IGNORE INTO, as explained on tocker.ca/the-future-of-alter-ignore-table-syntax.html and stackoverflow.com/a/47392593/7154924.
  • MAbraham1
    MAbraham1 over 3 years
    Why t1.id < t2.id rather than t1.id <> t2.id? That will prevent human error if t1 and t2 are switched around.
  • miken32
    miken32 over 3 years
    There have also been a few answers with DELETE...JOIN syntax already, with both natural and inner joins. This doesn't seem to add anything new.
  • Jan Steinman
    Jan Steinman over 3 years
    @ToolmakerSteve, yes, this has n-squared performance in the degenerate case. HOWEVER, all you have to do is put indexes on all the columns you are comparing, and performance goes to log(n)! I started this on a table of ~59,000, with about 9,000 dups, and it was still running after 20 minutes! Then, I put an index on my sole compare column, and the time to run was nearly imperceptible! If you have no further use for the index, you can delete it immediately after the dup deletion.
  • ToolmakerSteve
    ToolmakerSteve over 3 years
    @JanSteinman - ahh, indexes! I didn't think about how those would help the performance. Thank you. (And I've never created indexes temporarily, so that is a useful thought to tuck away for future use.)
  • codemonkey
    codemonkey about 3 years
    This is a very ugly way of doing it and I don't get why it's upvoted this much.
  • Noé
    Noé over 2 years
    This is just amazing, thank you so much
  • Ambiwlans
    Ambiwlans over 2 years
    The issue with this is that you need the space to duplicate your data.
  • ino
    ino over 2 years
    Thank you @Cesar for sharing this solution. I got inspired by it and got out of my troubles with two messed-up tables with no indexes or keys. I just had to be careful to use the correct ORDER BY while importing the data, so as not to overwrite valuable info with duplicate records with no values. But yes, I fixed both tables!