How to remove duplicate rows from flat file using SSIS?

Solution 1

Use the Sort Component.

Simply choose which fields you wish to sort your loaded rows by, and in the bottom-left corner you'll see a check box to remove duplicates. That box removes any rows that are duplicates based on the sort criteria only, so in the example below the two rows would be considered duplicates if we sorted on the first field alone:

1 | sample A |
1 | sample B |
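
For reference, a rough T-SQL analogue of what that check box does, as a sketch only: it keeps one arbitrary row per distinct sort-key value. The table and column names here (StagingTable, KeyField) are hypothetical.

;WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY KeyField ORDER BY KeyField) AS rn
    FROM StagingTable
)
-- Keep one arbitrary row per distinct KeyField value
SELECT *
FROM ranked
WHERE rn = 1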

Solution 2

I would suggest using SSIS to copy the records to a temporary table, then creating a task that uses SELECT DISTINCT or RANK (depending on your situation) to select the duplicates, funnel them to a flat file, and delete them from the temporary table. The last step would be to copy the records from the temporary table into the destination table. A sketch of the idea is below.
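
As a minimal sketch of those steps, assuming a hypothetical staging table StagingTemp with columns Email, FirstName, and LastName:

-- Rows that occur more than once, to be funneled to the duplicates file
SELECT Email, FirstName, LastName
FROM StagingTemp
GROUP BY Email, FirstName, LastName
HAVING COUNT(*) > 1

-- Final step: copy the de-duplicated records into the destination table
INSERT INTO Destination (Email, FirstName, LastName)
SELECT DISTINCT Email, FirstName, LastName
FROM StagingTemp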

Determining duplicates is something SQL is good at, but a flat file is not well suited for it. In the case you proposed, the script container would load a row, compare it against 17 million records, load the next row, and repeat. The performance would not be all that great.

Solution 3

You can use lookup transformations for this. SSIS provides two Data Flow Transformations suited to it: Fuzzy Grouping and Fuzzy Lookup.

Solution 4

A bit of a dirty solution is to set your target table up with a composite key that spans all columns. This guarantees uniqueness. Then, on the Data Destination shape, configure the task to ignore errors. All duplicate inserts will fall off into oblivion.
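
A sketch of that setup, with hypothetical table and column names. Note that if you create the composite key as a unique constraint with IGNORE_DUP_KEY = ON, SQL Server silently discards the duplicate inserts itself, which may save you from having to ignore errors at the task level:

CREATE TABLE Destination (
    Email     varchar(100) NOT NULL,
    FirstName varchar(50)  NOT NULL,
    LastName  varchar(50)  NOT NULL,
    -- Composite key across all columns; duplicate inserts are dropped,
    -- not raised as errors, because of IGNORE_DUP_KEY
    CONSTRAINT UQ_Destination_AllColumns
        UNIQUE (Email, FirstName, LastName) WITH (IGNORE_DUP_KEY = ON)
)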

Solution 5

The strategy will usually depend on how many columns the staging table has. The more columns, the more complex the solution. The article you linked has some very good advice.

The only thing that I will add to what everybody else has said so far, is that columns with date and datetime values will give some of the solutions presented here fits.

One solution that I came up with is this:

SET NOCOUNT ON

DECLARE @email varchar(100)

SET @email = ''

SET @email = (SELECT MIN(email) FROM StagingTable WITH (NOLOCK) WHERE email > @email)

WHILE @email IS NOT NULL
BEGIN

    -- Insert exactly one row for the current email value
    INSERT INTO StagingTable2 (Email)
    VALUES (@email)

    -- Advance to the next distinct email value
    SET @email = (SELECT MIN(email) FROM StagingTable WITH (NOLOCK) WHERE email > @email)

END

This is a LOT faster for deduping than a CURSOR and will not peg the server's CPU. To use this, separate each column that comes from the text file into its own variable. Use a separate SELECT statement before and inside the loop, then include the variables in the INSERT statement. This has worked really well for me.
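
For example, with a hypothetical second column FirstName, the loop body becomes the following sketch (same assumptions as above):

DECLARE @firstname varchar(50)

WHILE @email IS NOT NULL
BEGIN

    -- Capture the other columns for the current email into variables
    SELECT @firstname = MIN(firstname)
    FROM StagingTable WITH (NOLOCK)
    WHERE email = @email

    -- Include the captured variables in the INSERT
    INSERT INTO StagingTable2 (Email, FirstName)
    VALUES (@email, @firstname)

    SET @email = (SELECT MIN(email) FROM StagingTable WITH (NOLOCK) WHERE email > @email)

END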

Updated on July 19, 2022

Comments

  • RyanKeeter (almost 2 years ago)

    Let me first say that being able to take 17 million records from a flat file, push them to a DB on a remote box, and have it take 7 minutes is amazing. SSIS truly is fantastic. But now that I have that data up there, how do I remove duplicates?

    Better yet, I want to take the flat file, remove the duplicates from it, and put the remaining rows back into another flat file.

    I am thinking about a:

    Data Flow Task

    • File source (with an associated file connection)
    • A for loop container
    • A script container that contains some logic to tell if another row exists

    Thank you; everyone on this site is incredibly knowledgeable.

    Update: I have found this link; it might help in answering this question.

  • RyanKeeter (over 15 years ago)
    Hector, you are going to be my savior when it comes to this SSIS endeavor! Thank you so much!
  • Hector Sosa Jr (over 15 years ago)
    One is glad to be of service. ;)