Merge duplicate records into 1 records with the same table and table fields

11,427

Solution 1

You can do something like this in Postgres 9.1+:

WITH duplicates AS (
  SELECT desired_unique_key, count(*) AS count_of_same_key, min(st.id) AS keep_id, max(st.id) as delete_id
  FROM source_table st
  GROUP BY desired_unique_key
  HAVING count(*) > 1
),
 deleted_dupes AS (
  DELETE FROM source_table st
  WHERE st.id IN (SELECT(delete_id) FROM duplicates)
)
UPDATE source_table st
  SET field = WHATEVER
  FROM duplicates d
  WHERE st.id = d.keep_id

Solution 2

I'm doing something like this (postgres), I haven't tested it yet though.

SELECT dup.id AS dup_id, orig.id AS orig_id
INTO TEMP specialty_duplicates
FROM medical_specialty dup,
        (SELECT DISTINCT ON (name) * FROM medical_specialty ORDER BY name, id) orig
WHERE orig.name = dup.name AND dup.id <> orig.id;


UPDATE doctor_medical_specialty
SET medical_specialty=orig_id
FROM specialty_duplicates
WHERE medical_specialty = dup_id;

DELETE
FROM medical_specialty
WHERE id IN (SELECT dup_id FROM specialty_duplicates);

ALTER TABLE medical_specialty
    ADD UNIQUE (name);

The schema is that medical_specialty has id and name, and doctor_medical_specialty references it by id.

The benefit over a CTE (IIUC) is you can merge references in multiple referring tables.

I'm using a temporary table rather than a view so that both deleting and updating are consistent with the same snapshot in time

Share:
11,427
norlan V
Author by

norlan V

Updated on June 07, 2022

Comments

  • norlan V
    norlan V almost 2 years

    I have a database table that contains a list of demographic records, some of those participant might have multiple/duplicate records, e.g.

    NOTE:
    Gender:
    119 = Male
    118 = Female

    Race:
    255 = white
    253 = Asian

    UrbanRural:
    331 = Urban
    332 = Rural

    participantid, gender, race, urbanrural, moduletypeid, hibernateid, and more fields
    1, 119, 0, 331, 1, 1, .....
    1, 119, 255, 0, 2, 2, .....
    1, 0, 255, 331, 3, 3, .....
    1, 119, 253, 331, 0, 4, .....
    

    The output should be keep the first hibernateid and the duplicate records will be merge to the first hibernatedid record. If you can do this using function that will check the records if duplicate that would be great, after merged the records it delete the unused duplicate records. Your answer gives me a great idea to resolved this problem. Thanks

    Output should be:

    participantid, gender, race, urbanrural, moduletypeid, hibernateid, and more fields
    1, 119, 255, 331, 1, 1, .....
    


    Help me guys, Thanks