MERGing Duplicate Records in SQL Server

12,088

Solution 1

In order to get this to work in the MERGE statement, you'll need to be a little more specific about the rows. I've adjusted the query below:

MERGE [__EpiTest] AS Tgt 
USING (
    SELECT [ActivityRecordID], 
           [ActCstID], 
           [ResCstID], 
           SUM([TotOCst]) AS TotOCst, 
           SUM([TotCst]) AS TotCst ,
           VolAmt,
           ActCnt
    FROM [__EpiTest]  
    GROUP BY [ActivityRecordID], 
             [ActCstID], 
             [ResCstID], 
             VolAmt,
             ActCnt) AS Src 
ON (Tgt.[ActivityRecordID] = Src.[ActivityRecordID] AND 
     Tgt.[ActCstID] = Src.[ActCstID] AND 
     Tgt.[ResCstID] = Src.[ResCstID] AND
     Tgt.TotOCst = Src.TotOCst AND
     Tgt.TotCst = Src.TotCst
     ) 
WHEN NOT MATCHED BY TARGET THEN
    INSERT ( ActivityRecordID, ActCstID, ResCstID, TotOCst, TotCst, VolAmt, ActCnt )
    VALUES ( ActivityRecordID, ActCstID, ResCstID, TotOCst, TotCst, VolAmt, ActCnt )
WHEN NOT MATCHED BY SOURCE THEN 
    DELETE;
GO

Basically, I've changed it so that the source of the merge include the other columns that seem constant, so that they can be inserted later on. I've changed the MATCH condition to be basically "the row is exactly the same and will be unchanged" and so removed the WHEN MATCHED statement.

I then added an INSERT statement, so that for ones that have changed (that will have their rows deleted by the NOT MATCHED BY TARGET), a new row is inserted with these values.

In your same data, this returns a result set of:

ActivityRecordID    ActCstID    ResCstID    VolAmt  ActCnt  TotOCst TotCst
15652   DIM008  CPF005  30.455249786377 1   0   0.375024198767061
15652   DIM008  CSC004  30.455249786377 1   7.62176510799961    11.932578069479
15652   DIM008  REC001  30.455249786377 1   0.17902367836393    0.384881520159455
15652   OUT001  CPF002  15  0   0   16.9408193013078
15652   OUT001  CSC001  15  0   2.36971564207042    2.36971564207042
15652   OUT001  CSC004  15  0   12.3230666021278    12.3760690367354
15652   OUT001  REC001  15  0   0.377459387378349   3.0275278374102
15652   SUP001  CPF002  1   1   0   0.00108648359810756
15652   SUP001  CPF011  1   1   0   -1.89799880202357E-14
15652   SUP001  CPF020  1   1   0   1.31058251625567E-05
15652   SUP001  CSC002  1   1   0   10.2266625467779
15652   SUP001  REC002  1   1   0   1.10781732547441
15652   SUP001  CPF021  1   1   0   50.1884617025102
15652   SUP001  CSC001  1   1   3.9257538206902 3.9257538206902
15652   SUP001  CSC004  1   1   32.690344321601 32.7026638120092
15652   SUP001  REC001  1   1   0.508820773403952   -12.5409759131875

Solution 2

In the __EpiTest has duplicate values

SELECT * FROM @__EpiTest 
where ResCstID = 'REC001' and ActCstID ='SUP001'
order by ActivityRecordID ,  ActCstID , ResCstID;
ActivityRecordID ActCstID ResCstID VolAmt      ActCnt      TotOCst                                 TotCst
---------------- -------- -------- ----------- ----------- --------------------------------------- ---------------------------------------
15652            SUP001   REC001   1           1           0.2544103867                            -6.2704879566
15652            SUP001   REC001   1           1           0.2544103867                            -6.2704879566

To simple select the single values

  SELECT DISTINCT * FROM @__EpiTest 
    where ResCstID = 'REC001' and ActCstID ='SUP001'
    order by ActivityRecordID ,  ActCstID , ResCstID;
    ActivityRecordID ActCstID ResCstID VolAmt      ActCnt      TotOCst                                 TotCst
    ---------------- -------- -------- ----------- ----------- --------------------------------------- ---------------------------------------
    15652            SUP001   REC001   1           1           0.2544103867                            -6.2704879566

To delete

To delete duplicated values and maintain the frist

;WITH cte as (
SELECT Row_number() OVER (PARTITION BY ActivityRecordID ,  ActCstID , ResCstID ORDER BY (SELECT NULL)) Rn, * FROM @__EpiTest 
where ResCstID = 'REC001' and ActCstID ='SUP001'
)
Delete from cte where Rn > 1 

SELECT * FROM @__EpiTest 
where ResCstID = 'REC001' and ActCstID ='SUP001' 
ActivityRecordID ActCstID ResCstID VolAmt      ActCnt      TotOCst                                 TotCst
---------------- -------- -------- ----------- ----------- --------------------------------------- ---------------------------------------
15652            SUP001   REC001   1           1           0.2544103867                            -6.2704879566

Using the Merge - Be careful, merge statament has serious preformance problems

  ;WITH myResult as (
    SELECT Row_number() OVER (PARTITION BY ActivityRecordID ,  ActCstID , ResCstID ORDER BY (SELECT NULL)) Rn, * FROM @__EpiTest 
    )
    MERGE myResult AS Tgt 
    USING myResult AS Src 
    ON (Tgt.[ActivityRecordID] = Src.[ActivityRecordID] AND 
         Tgt.[ActCstID] = Src.[ActCstID] AND 
         Tgt.[ResCstID] = Src.[ResCstID] AND 
         Tgt.Rn = Src.Rn AND
         Src.Rn = 1)  
    WHEN MATCHED THEN 
        UPDATE 
        SET [TotOCst] = Src.[TotOCst], 
            [TotCst] = Src.[TotCst] 
    WHEN NOT MATCHED BY SOURCE THEN 
        DELETE;
     SELECT * FROM @__EpiTest 
    where ResCstID = 'REC001' and ActCstID ='SUP001' 
Rn                   ActivityRecordID ActCstID ResCstID VolAmt      ActCnt      TotOCst                                 TotCst
-------------------- ---------------- -------- -------- ----------- ----------- --------------------------------------- ---------------------------------------
1                    15652            SUP001   REC001   1           1           0.2544103867                            -6.2704879566
2                    15652            SUP001   REC001   1           1           0.2544103867                            -6.2704879566

(2 row(s) affected)

(20 row(s) affected)

ActivityRecordID ActCstID ResCstID VolAmt      ActCnt      TotOCst                                 TotCst
---------------- -------- -------- ----------- ----------- --------------------------------------- ---------------------------------------
15652            SUP001   REC001   1           1           0.2544103867                            -6.2704879566
Share:
12,088
MoonKnight
Author by

MoonKnight

Updated on June 21, 2022

Comments

  • MoonKnight
    MoonKnight about 2 years

    SQL Fiddle

    I have the following table

    CREATE TABLE __EpiTest
    (
        `ActivityRecordID` int, 
        `ActCstID` varchar(6), 
        `ResCstID` varchar(6), 
        `VolAmt` int, 
        `ActCnt` int, 
        `TotOCst` int, 
        `TotCst` int
    );
    
    INSERT INTO __EpiTest (`ActivityRecordID`, `ActCstID`, `ResCstID`, `VolAmt`, `ActCnt`, `TotOCst`, `TotCst`)
    VALUES (15652, 'DIM008', 'CPF005', 30.455249786377, 1, 0, 0.375024198767061),
           (15652, 'DIM008', 'CSC004', 30.455249786377, 1, 7.62176510799961, 11.932578069479),
           (15652, 'DIM008', 'REC001', 30.455249786377, 1, 0.17902367836393, 0.384881520159455),
           (15652, 'OUT001', 'CPF002', 15, 0, 0, 16.9408193013078),
           (15652, 'OUT001', 'CSC001', 15, 0, 2.36971564207042,  2.36971564207042),
           (15652, 'OUT001', 'CSC004', 15, 0, 12.3230666021278, 12.3760690367354),
           (15652, 'OUT001', 'REC001', 15, 0, 0.377459387378349, 3.0275278374102),
           (15652, 'SUP001', 'CPF002', 1, 1, 0, 0.00108648359810756),
           (15652, 'SUP001', 'CPF011', 1, 1, 0, -1.89799880202357E-14),
           (15652, 'SUP001', 'CPF020', 1, 1, 0, 1.31058251625567E-05),
           (15652, 'SUP001', 'CPF021', 1, 1, 0, 25.0942308512551),
           (15652, 'SUP001', 'CPF021', 1, 1, 0, 25.0942308512551),
           (15652, 'SUP001', 'CSC001', 1, 1, 1.9628769103451, 1.9628769103451),
           (15652, 'SUP001', 'CSC001', 1, 1, 1.9628769103451, 1.9628769103451),
           (15652, 'SUP001', 'CSC002', 1, 1, 0, 10.2266625467779),
           (15652, 'SUP001', 'CSC004', 1, 1, 16.3451721608005, 16.3513319060046),
           (15652, 'SUP001', 'CSC004', 1, 1, 16.3451721608005, 16.3513319060046),
           (15652, 'SUP001', 'REC001', 1, 1, 0.254410386701976, -6.27048795659376),
           (15652, 'SUP001', 'REC001', 1, 1, 0.254410386701976, -6.27048795659376),
           (15652, 'SUP001', 'REC002', 1, 1, 0, 1.10781732547441);
    

    Notice there are rows which have matching values for [ActivityRecordID], [ActCstID] and [ResCstID]. I want to merge these values and sum the values in [TotOCst] and [TotCst]. To do this I have attempted to use MERGE

    MERGE [__EpiTest] AS Tgt 
    USING (
        SELECT [ActivityRecordID], 
               [ActCstID], 
               [ResCstID], 
               SUM([TotOCst]) AS TotOCst, 
               SUM([TotCst]) AS TotCst 
        FROM [__EpiTest]  
        GROUP BY [ActivityRecordID], 
                 [ActCstID], 
                 [ResCstID]) AS Src 
    ON (Tgt.[ActivityRecordID] = Src.[ActivityRecordID] AND 
         Tgt.[ActCstID] = Src.[ActCstID] AND 
         Tgt.[ResCstID] = Src.[ResCstID])  
    WHEN MATCHED THEN 
        UPDATE 
        SET [TotOCst] = Src.[TotOCst], 
            [TotCst] = Src.[TotCst] 
    WHEN NOT MATCHED BY SOURCE THEN 
        DELETE;
    GO
    

    This matches and correctly updates the values in the [TotOCst] and [TotCst] for each duplicate, but it then leaves the duplicate rows in the table whereas I want all but one removed. How can I achieve this?

    Note, the target table is HUGE, so I would like to try and do this with a single operation by using a variation of the MERGE query above, or other alternative. Multiple queries will be too expensive for me to deal with...


    Illustration

    I get

    ...
    15652   SUP001  CPF021  1   1   0                   12.5471154256275
    15652   SUP001  CPF021  1   1   0                   12.5471154256275
    15652   SUP001  CSC001  1   1   0.98143845517255    0.98143845517255
    15652   SUP001  CSC001  1   1   0.98143845517255    0.98143845517255
    15652   SUP001  CSC002  1   1   0                   10.2266625467779
    15652   SUP001  CSC004  1   1   8.17258608040024    8.17566595300228
    15652   SUP001  CSC004  1   1   8.17258608040024    8.17566595300228
    15652   SUP001  REC001  1   1   0.127205193350988   -3.13524397829688
    15652   SUP001  REC001  1   1   0.127205193350988   -3.13524397829688
    ...
    

    But I want

    ...
    15652   SUP001  CPF021  1   1   0                   12.5471154256275
    15652   SUP001  CSC001  1   1   0.98143845517255    0.98143845517255 
    15652   SUP001  CSC002  1   1   0                   10.2266625467779
    15652   SUP001  CSC004  1   1   8.17258608040024    8.17566595300228
    15652   SUP001  REC001  1   1   0.127205193350988   -3.13524397829688
    ...
    
  • MoonKnight
    MoonKnight almost 7 years
    Thanks, but I would like to do this all with one operation. Merging and then deleting might be too expensive for me to deal with.
  • TZHX
    TZHX almost 7 years
    @lad2025 remove rows that need to be changed, yes.
  • Maurício Pontalti Neri
    Maurício Pontalti Neri almost 7 years
    No problem, I Added a merge version
  • Lukasz Szozda
    Lukasz Szozda almost 7 years
    @MaurícioPontaltiNeri Where is SUM in final version with ROW_NUMBER?
  • TZHX
    TZHX almost 7 years
    @MoonKnight I've changed the data type from int to float, to try and make it look more like yours but it was just a guess? What row do you think needs to be different?
  • Maurício Pontalti Neri
    Maurício Pontalti Neri almost 7 years
    In the expected result the values are not summarized, they are unique.
  • MoonKnight
    MoonKnight about 6 years
    I have just run into the performance problems with MERGE that you spoke of. Any ideas on how to optimize such a merge statement?