How to prevent duplicate records being inserted with SqlBulkCopy when there is no primary key

27,595

Solution 1

I'd upload the data into a staging table then deal with duplicates afterwards on copy to the final table.

For example, you can create a (non-unique) index on the staging table to deal with the "key"

Solution 2

Given that you're using SQL 2008, you have two options to solve the problem easily without needing to change your application much (if at all).

The first possible solution is create a second table like the first one but with a surrogate identity key and a uniqueness constraint added using the ignore_dup_key option which will do all the heavy lifting of eliminating the duplicates for you.

Here's an example you can run in SSMS to see what's happening:

if object_id( 'tempdb..#test1' ) is not null drop table #test1;
if object_id( 'tempdb..#test2' ) is not null drop table #test2;
go


-- example heap table with duplicate record

create table #test1
(
     col1 int
    ,col2 varchar(50)
    ,col3 char(3)
);
insert #test1( col1, col2, col3 )
values
     ( 250, 'Joe''s IT Consulting and Bait Shop', null )
    ,( 120, 'Mary''s Dry Cleaning and Taxidermy', 'ACK' )
    ,( 250, 'Joe''s IT Consulting and Bait Shop', null )    -- dup record
    ,( 666, 'The Honest Politician', 'LIE' )
    ,( 100, 'My Invisible Friend', 'WHO' )
;
go


-- secondary table for removing duplicates

create table #test2
(
     sk int not null identity primary key
    ,col1 int
    ,col2 varchar(50)
    ,col3 char(3)

    -- add a uniqueness constraint to filter dups
    ,constraint UQ_test2 unique ( col1, col2, col3 ) with ( ignore_dup_key = on )
);
go


-- insert all records from original table
-- this should generate a warning if duplicate records were ignored

insert #test2( col1, col2, col3 )
select col1, col2, col3
from #test1;
go

Alternatively, you can also remove the duplicates in-place without a second table, but the performance may be too slow for your needs. Here's the code for that example, also runnable in SSMS:

if object_id( 'tempdb..#test1' ) is not null drop table #test1;
go


-- example heap table with duplicate record

create table #test1
(
     col1 int
    ,col2 varchar(50)
    ,col3 char(3)
);
insert #test1( col1, col2, col3 )
values
     ( 250, 'Joe''s IT Consulting and Bait Shop', null )
    ,( 120, 'Mary''s Dry Cleaning and Taxidermy', 'ACK' )
    ,( 250, 'Joe''s IT Consulting and Bait Shop', null )    -- dup record
    ,( 666, 'The Honest Politician', 'LIE' )
    ,( 100, 'My Invisible Friend', 'WHO' )
;
go


-- add temporary PK and index

alter table #test1 add sk int not null identity constraint PK_test1 primary key clustered;
create index IX_test1 on #test1( col1, col2, col3 );
go


-- note: rebuilding the indexes may or may not provide a performance benefit

alter index PK_test1 on #test1 rebuild;
alter index IX_test1 on #test1 rebuild;
go


-- remove duplicates

with ranks as
(
    select
         sk
        ,ordinal = row_number() over 
         ( 
            -- put all the columns composing uniqueness into the partition
            partition by col1, col2, col3
            order by sk
         )
    from #test1
)
delete 
from ranks
where ordinal > 1;
go


-- remove added columns

drop index IX_test1 on #test1;
alter table #test1 drop constraint PK_test1;
alter table #test1 drop column sk;
go

Solution 3

Why not simply use, instead of a Primary Key, create an Index and set

Ignore Duplicate Keys: YES

This will prevent any duplicate key to fire an error, and it will not be created (as it exists already).

enter image description here

I use this method to insert around 120.000 rows per day and works flawlessly.

Solution 4

I would bulk copy into a temporary table and then push the data from that into the actual destination table. In this way, you can use SQL to check for and handle duplicates.

Solution 5

What is the data volume? You have 2 options that I can see:

1: filter it at source, by implementing your own IDataReader and using some hash over the data, and simply skipping any duplicates so that they never get passed into the TDS.

2: filter it in the DB; at the simplest level, I guess you could have multiple stages of import - the raw, unsanitised data - and then copy the DISTINCT data into your actual tables, perhaps using an intermediate table if you want to. You might want to use CHECKSUM for some of this, but it depends.

Share:
27,595
kscott
Author by

kscott

I program computers to do things for people.

Updated on July 09, 2022

Comments

  • kscott
    kscott almost 2 years

    I receive a daily XML file that contains thousands of records, each being a business transaction that I need to store in an internal database for use in reporting and billing. I was under the impression that each day's file contained only unique records, but have discovered that my definition of unique is not exactly the same as the provider's.

    The current application that imports this data is a C#.Net 3.5 console application, it does so using SqlBulkCopy into a MS SQL Server 2008 database table where the columns exactly match the structure of the XML records. Each record has just over 100 fields, and there is no natural key in the data, or rather the fields I can come up with making sense as a composite key end up also having to allow nulls. Currently the table has several indexes, but no primary key.

    Basically the entire row needs to be unique. If one field is different, it is valid enough to be inserted. I looked at creating an MD5 hash of the entire row, inserting that into the database and using a constraint to prevent SqlBulkCopy from inserting the row,but I don't see how to get the MD5 Hash into the BulkCopy operation and I'm not sure if the whole operation would fail and roll back if any one record failed, or if it would continue.

    The file contains a very large number of records, going row by row in the XML, querying the database for a record that matches all fields, and then deciding to insert is really the only way I can see being able to do this. I was just hoping not to have to rewrite the application entirely, and the bulk copy operation is so much faster.

    Does anyone know of a way to use SqlBulkCopy while preventing duplicate rows, without a primary key? Or any suggestion for a different way to do this?