Bulk inserting: best way to go about it? + Help me fully understand what I found so far

The batch size is there to reduce the impact of network latency. It does not need to be more than a few thousand. Multiple statements are collected together and sent as a unit, so you take the latency hit of one network round trip every N statements rather than once per statement.
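
For example, reusing the question's BatchInsert setup with nothing changed but the batch size (a minimal sketch; the method name and the value 2,000 are illustrative, not a tuned recommendation):

    // Same pattern as the question's BatchInsert(), but each network round
    // trip now carries up to 2,000 statements instead of the whole table
    // (or a single row) per batch.
    private static void BatchInsertFewThousand(DataTable rows, string connectionString)
    {
        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlCommand command = new SqlCommand("sp_BatchInsert", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            command.UpdatedRowSource = UpdateRowSource.None;
            command.Parameters.Add("@Name", SqlDbType.VarChar, 50, rows.Columns[0].ColumnName);

            SqlDataAdapter adpt = new SqlDataAdapter();
            adpt.InsertCommand = command;
            adpt.UpdateBatchSize = 2000; // a few thousand is plenty

            connection.Open();
            adpt.Update(rows);
        }
    }

At 500,000 rows that is roughly 250 round trips instead of 500,000, and the memory needed per batch stays bounded on both client and server.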

Comments

  • chobo2, almost 2 years

    So I saw this post here and read it, and it seems like bulk copy might be the way to go:

    What’s the best way to bulk database inserts from c#?

    I still have some questions and want to know how things actually work.

    So I found 2 tutorials.

    http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241

    http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx

    The first one uses two ADO.NET 2.0 features: BulkInsert and BulkCopy. The second one uses LINQ to SQL and OpenXML.

    The second sort of appeals to me, as I am already using LINQ to SQL and prefer it over ADO.NET. However, as one person pointed out in the posts, it just works around the issue at the cost of performance (nothing wrong with that, in my opinion).

    First I will talk about the two ways in the first tutorial.

    I am using VS2010 Express (for testing the tutorials I used VS2008; I am not sure what .NET version, as I just loaded up their sample files and ran them), .NET 4.0, MVC 2.0, and SQL Server 2005.

    1. Is ADO.NET 2.0 the most current version?
    2. Based on the technology I am using, are there any updates to what I am going to show that would improve it somehow?
    3. Is there anything these tutorials left out that I should know about?

    BulkInsert

    I am using this table for all the examples.

    CREATE TABLE [dbo].[TBL_TEST_TEST]
    (
        ID INT IDENTITY(1,1) PRIMARY KEY,
        [NAME] [varchar](50) 
    )
    

    SP Code

    USE [Test]
    GO
    /****** Object:  StoredProcedure [dbo].[sp_BatchInsert]    Script Date: 05/19/2010 15:12:47 ******/
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    ALTER PROCEDURE [dbo].[sp_BatchInsert] (@Name VARCHAR(50) )
    AS
    BEGIN
                INSERT INTO TBL_TEST_TEST VALUES (@Name);
    END 
    

    C# Code

    /// <summary>
    /// Another ADO.NET 2.0 way that uses a stored procedure to do a bulk insert.
    /// Seems slower than the "BatchBulkCopy" way, and it crashes when you try to insert 500,000 records in one go.
    /// http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241
    /// </summary>
    private static void BatchInsert()
    {
        // Get the DataTable with Rows State as RowState.Added
        DataTable dtInsertRows = GetDataTable();
    
        SqlConnection connection = new SqlConnection(connectionString);
        SqlCommand command = new SqlCommand("sp_BatchInsert", connection);
        command.CommandType = CommandType.StoredProcedure;
        command.UpdatedRowSource = UpdateRowSource.None;
    
        // Set the Parameter with appropriate Source Column Name
        command.Parameters.Add("@Name", SqlDbType.VarChar, 50, dtInsertRows.Columns[0].ColumnName);
    
        SqlDataAdapter adpt = new SqlDataAdapter();
        adpt.InsertCommand = command;
        // Specify the number of records to be Inserted/Updated in one go. Default is 1.
        adpt.UpdateBatchSize = 1000;
    
        connection.Open();
        int recordsInserted = adpt.Update(dtInsertRows);
        connection.Close();
    }
    

    So the first thing is the batch size. Why would you set the batch size to anything but the number of records you are sending? I am sending 500,000 records, so I used a batch size of 500,000.

    Next, why does it crash when I do this? With a batch size of 1,000 it works just fine.

    System.Data.SqlClient.SqlException was unhandled
      Message="A transport-level error has occurred when sending the request to the server. (provider: Shared Memory Provider, error: 0 - No process is on the other end of the pipe.)"
      Source=".Net SqlClient Data Provider"
      ErrorCode=-2146232060
      Class=20
      LineNumber=0
      Number=233
      Server=""
      State=0
      StackTrace:
           at System.Data.Common.DbDataAdapter.UpdatedRowStatusErrors(RowUpdatedEventArgs rowUpdatedEvent, BatchCommandInfo[] batchCommands, Int32 commandCount)
           at System.Data.Common.DbDataAdapter.UpdatedRowStatus(RowUpdatedEventArgs rowUpdatedEvent, BatchCommandInfo[] batchCommands, Int32 commandCount)
           at System.Data.Common.DbDataAdapter.Update(DataRow[] dataRows, DataTableMapping tableMapping)
           at System.Data.Common.DbDataAdapter.UpdateFromDataTable(DataTable dataTable, DataTableMapping tableMapping)
           at System.Data.Common.DbDataAdapter.Update(DataTable dataTable)
           at TestIQueryable.Program.BatchInsert() in C:\Users\a\Downloads\TestIQueryable\TestIQueryable\TestIQueryable\Program.cs:line 124
           at TestIQueryable.Program.Main(String[] args) in C:\Users\a\Downloads\TestIQueryable\TestIQueryable\TestIQueryable\Program.cs:line 16
      InnerException: 
    

    Inserting 500,000 records with an insert batch size of 1,000 took 2 minutes and 54 seconds.

    Of course, this is no official time; I sat there with a stopwatch (I am sure there are better ways, but I was too lazy to look up what they were).

    So I find that kind of slow compared to all my other ones (except the LINQ to SQL insert one), and I am not really sure why.

    Next I looked at BulkCopy.

    /// <summary>
    /// An ADO.NET 2.0 way to mass-insert records. This seems to be the fastest.
    /// http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241
    /// </summary>
    private static void BatchBulkCopy()
    {
        // Get the DataTable 
        DataTable dtInsertRows = GetDataTable();
    
        using (SqlBulkCopy sbc = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.KeepIdentity))
        {
            sbc.DestinationTableName = "TBL_TEST_TEST";
    
            // Number of records to be processed in one go
            sbc.BatchSize = 500000;
    
            // Map the source column from the DataTable to the destination column in the SQL Server 2005 table
            // sbc.ColumnMappings.Add("ID", "ID");
            sbc.ColumnMappings.Add("NAME", "NAME");
    
            // Number of records after which client has to be notified about its status
            sbc.NotifyAfter = dtInsertRows.Rows.Count;
    
            // Event that gets fired when NotifyAfter number of records are processed.
            sbc.SqlRowsCopied += new SqlRowsCopiedEventHandler(sbc_SqlRowsCopied);
    
            // Finally write to server
            sbc.WriteToServer(dtInsertRows);
            sbc.Close();
        }
    
    }
    

    This one seemed to go really fast and did not even need an SP (can you use an SP with bulk copy? If you can, would it be better?).

    BulkCopy had no problem with a 500,000 batch size. So again, why make it smaller than the number of records you want to send?

    I found that with BulkCopy and a 500,000 batch size it took only 5 seconds to complete. I then tried a batch size of 1,000, and it took 8 seconds.

    So much faster than the BulkInsert one above.

    Now I tried the other tutorial.

    USE [Test]
    GO
    /****** Object:  StoredProcedure [dbo].[spTEST_InsertXMLTEST_TEST]    Script Date: 05/19/2010 15:39:03 ******/
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    ALTER PROCEDURE [dbo].[spTEST_InsertXMLTEST_TEST](@UpdatedProdData nText)
    AS 
     DECLARE @hDoc int   
    
     exec sp_xml_preparedocument @hDoc OUTPUT,@UpdatedProdData 
    
     INSERT INTO TBL_TEST_TEST(NAME)
     SELECT XMLProdTable.NAME
        FROM OPENXML(@hDoc, 'ArrayOfTBL_TEST_TEST/TBL_TEST_TEST', 2)   
           WITH (
                    ID Int,                 
                    NAME varchar(100)
                ) XMLProdTable
    
    EXEC sp_xml_removedocument @hDoc
    

    C# code.

    /// <summary>
    /// This is using LINQ to SQL to make the table objects.
    /// They are then serialized to an XML document and sent to a stored procedure
    /// that then does a bulk insert (I think with OpenXML).
    ///  http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx
    /// </summary>
    private static void LinqInsertXMLBatch()
    {
        using (TestDataContext db = new TestDataContext())
        {
            TBL_TEST_TEST[] testRecords = new TBL_TEST_TEST[500000];
            for (int count = 0; count < 500000; count++)
            {
                TBL_TEST_TEST testRecord = new TBL_TEST_TEST();
                testRecord.NAME = "Name : " + count;
                testRecords[count] = testRecord;
            }
    
            StringBuilder sBuilder = new StringBuilder();
            System.IO.StringWriter sWriter = new System.IO.StringWriter(sBuilder);
            XmlSerializer serializer = new XmlSerializer(typeof(TBL_TEST_TEST[]));
            serializer.Serialize(sWriter, testRecords);
            db.insertTestData(sBuilder.ToString());
        }
    }
    

    So I like this because I get to use objects, even though it is kind of redundant. I don't get how the SP works, though. Like, I don't get the whole thing. I don't know if OPENXML has some batch insert under the hood, but I do not even know how to take this example SP and change it to fit my tables since, like I said, I don't know what is going on.
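
    As far as I can tell (this is just my guess at the shape; I have not verified the exact layout, and XmlSerializer also adds namespace attributes I am leaving out), the serializer turns the TBL_TEST_TEST[] array into XML like the sample below. The OPENXML row pattern 'ArrayOfTBL_TEST_TEST/TBL_TEST_TEST' with flag 2 (element-centric mapping) then shreds each <TBL_TEST_TEST> element into a row with the columns from the WITH clause, and the INSERT ... SELECT writes them all in one set-based statement:

    <ArrayOfTBL_TEST_TEST>
      <TBL_TEST_TEST>
        <ID>0</ID>
        <NAME>Name : 0</NAME>
      </TBL_TEST_TEST>
      <TBL_TEST_TEST>
        <ID>0</ID>
        <NAME>Name : 1</NAME>
      </TBL_TEST_TEST>
    </ArrayOfTBL_TEST_TEST>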

    I also don't know what would happen if the object has more tables in it. Like, say I have a ProductName table that has a relationship to a Product table or something like that.

    In LINQ to SQL you can get the ProductName object and make changes to the Product table through that same object, so I am not sure how to take that into account. I am not sure if I would have to do separate inserts or what.
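
    The only approach I can picture for relationships with bulk copy is two passes: bulk copy the parents, read the generated IDs back, then bulk copy the children with the FK filled in. This is just a sketch of my guess, with made-up Product/ProductName tables, not something from the tutorials:

    // Hypothetical schema, only for illustration:
    //   Product(ID INT IDENTITY PRIMARY KEY, CODE VARCHAR(50))
    //   ProductName(ProductID INT REFERENCES Product(ID), NAME VARCHAR(50))
    private static void BulkCopyWithRelationship()
    {
        DataTable products = new DataTable();
        products.Columns.Add("CODE");
        products.Rows.Add("P-0001");

        using (SqlConnection conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // Pass 1: bulk copy the parent rows; the server assigns the identities.
            using (SqlBulkCopy sbc = new SqlBulkCopy(conn))
            {
                sbc.DestinationTableName = "Product";
                sbc.ColumnMappings.Add("CODE", "CODE");
                sbc.WriteToServer(products);
            }

            // Read the generated IDs back, keyed on the natural key (CODE).
            // Needs using System.Collections.Generic;
            Dictionary<string, int> idByCode = new Dictionary<string, int>();
            using (SqlCommand cmd = new SqlCommand("SELECT ID, CODE FROM Product", conn))
            using (SqlDataReader rdr = cmd.ExecuteReader())
            {
                while (rdr.Read())
                    idByCode[rdr.GetString(1)] = rdr.GetInt32(0);
            }

            // Pass 2: build the child rows with the FK filled in and bulk copy them.
            DataTable names = new DataTable();
            names.Columns.Add("ProductID", typeof(int));
            names.Columns.Add("NAME");
            names.Rows.Add(idByCode["P-0001"], "First product");

            using (SqlBulkCopy sbc = new SqlBulkCopy(conn))
            {
                sbc.DestinationTableName = "ProductName";
                sbc.ColumnMappings.Add("ProductID", "ProductID");
                sbc.ColumnMappings.Add("NAME", "NAME");
                sbc.WriteToServer(names);
            }
        }
    }

    I have no idea if that read-back step is the right way to scale this, which is part of why I am asking.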

    The time was pretty good: 500,000 records took 52 seconds.

    The last way, of course, was just using LINQ to SQL to do it all, and it was pretty bad.

    /// <summary>
    /// This is using LINQ to SQL to insert lots of records.
    /// This way is slow as it uses no mass insert.
    /// Only tried to insert 50,000 records as I did not want to sit around till it did 500,000 records.
    /// http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx
    /// </summary>
    private static void LinqInsertAll()
    {
        using (TestDataContext db = new TestDataContext())
        {
            db.CommandTimeout = 600;
            for (int count = 0; count < 50000; count++)
            {
                TBL_TEST_TEST testRecord = new TBL_TEST_TEST();
                testRecord.NAME = "Name : " + count;
                db.TBL_TEST_TESTs.InsertOnSubmit(testRecord);
            }
            db.SubmitChanges();
        }
    }
    

    I only did 50,000 records, and that took over a minute.

    So I have really narrowed it down to the LINQ to SQL bulk insert way or bulk copy. I am just not sure how to do either when you have relationships. And I am not sure how they both stand up when doing updates instead of inserts, as I have not gotten around to trying that yet.

    I don't think I will ever need to insert/update more than 50,000 records at one time, but at the same time I know I will have to do validation on records before inserting, and that will slow things down. That sort of makes LINQ to SQL nicer, since you get objects, especially if you are first parsing data from an XML file before you insert it into the database.

    Full C# code

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Xml.Serialization;
    using System.Data;
    using System.Data.SqlClient;
    
    namespace TestIQueryable
    {
        class Program
        {
            private static string connectionString = "";
            static void Main(string[] args)
            {
                BatchInsert();
                Console.WriteLine("done");
            }
    
            /// <summary>
            /// This is using LINQ to SQL to insert lots of records.
            /// This way is slow as it uses no mass insert.
            /// Only tried to insert 50,000 records as I did not want to sit around till it did 500,000 records.
            /// http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx
            /// </summary>
            private static void LinqInsertAll()
            {
                using (TestDataContext db = new TestDataContext())
                {
                    db.CommandTimeout = 600;
                    for (int count = 0; count < 50000; count++)
                    {
                        TBL_TEST_TEST testRecord = new TBL_TEST_TEST();
                        testRecord.NAME = "Name : " + count;
                        db.TBL_TEST_TESTs.InsertOnSubmit(testRecord);
                    }
                    db.SubmitChanges();
                }
            }
    
            /// <summary>
            /// This is using LINQ to SQL to make the table objects.
            /// They are then serialized to an XML document and sent to a stored procedure
            /// that then does a bulk insert (I think with OpenXML).
            ///  http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx
            /// </summary>
            private static void LinqInsertXMLBatch()
            {
                using (TestDataContext db = new TestDataContext())
                {
                    TBL_TEST_TEST[] testRecords = new TBL_TEST_TEST[500000];
                    for (int count = 0; count < 500000; count++)
                    {
                        TBL_TEST_TEST testRecord = new TBL_TEST_TEST();
                        testRecord.NAME = "Name : " + count;
                        testRecords[count] = testRecord;
                    }
    
                    StringBuilder sBuilder = new StringBuilder();
                    System.IO.StringWriter sWriter = new System.IO.StringWriter(sBuilder);
                    XmlSerializer serializer = new XmlSerializer(typeof(TBL_TEST_TEST[]));
                    serializer.Serialize(sWriter, testRecords);
                    db.insertTestData(sBuilder.ToString());
                }
            }
    
            /// <summary>
            /// An ADO.NET 2.0 way to mass-insert records. This seems to be the fastest.
            /// http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241
            /// </summary>
            private static void BatchBulkCopy()
            {
                // Get the DataTable 
                DataTable dtInsertRows = GetDataTable();
    
                using (SqlBulkCopy sbc = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.KeepIdentity))
                {
                    sbc.DestinationTableName = "TBL_TEST_TEST";
    
                    // Number of records to be processed in one go
                    sbc.BatchSize = 500000;
    
                    // Map the source column from the DataTable to the destination column in the SQL Server 2005 table
                    // sbc.ColumnMappings.Add("ID", "ID");
                    sbc.ColumnMappings.Add("NAME", "NAME");
    
                    // Number of records after which client has to be notified about its status
                    sbc.NotifyAfter = dtInsertRows.Rows.Count;
    
                    // Event that gets fired when NotifyAfter number of records are processed.
                    sbc.SqlRowsCopied += new SqlRowsCopiedEventHandler(sbc_SqlRowsCopied);
    
                    // Finally write to server
                    sbc.WriteToServer(dtInsertRows);
                    sbc.Close();
                }
    
            }
    
    
            /// <summary>
            /// Another ADO.NET 2.0 way that uses a stored procedure to do a bulk insert.
            /// Seems slower than the "BatchBulkCopy" way, and it crashes when you try to insert 500,000 records in one go.
            /// http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241
            /// </summary>
            private static void BatchInsert()
            {
                // Get the DataTable with Rows State as RowState.Added
                DataTable dtInsertRows = GetDataTable();
    
                SqlConnection connection = new SqlConnection(connectionString);
                SqlCommand command = new SqlCommand("sp_BatchInsert", connection);
                command.CommandType = CommandType.StoredProcedure;
                command.UpdatedRowSource = UpdateRowSource.None;
    
                // Set the Parameter with appropriate Source Column Name
                command.Parameters.Add("@Name", SqlDbType.VarChar, 50, dtInsertRows.Columns[0].ColumnName);
    
                SqlDataAdapter adpt = new SqlDataAdapter();
                adpt.InsertCommand = command;
                // Specify the number of records to be Inserted/Updated in one go. Default is 1.
                adpt.UpdateBatchSize = 500000;
    
                connection.Open();
                int recordsInserted = adpt.Update(dtInsertRows);
                connection.Close();
            }
    
    
    
            private static DataTable GetDataTable()
            {
                // You First need a DataTable and have all the insert values in it
                DataTable dtInsertRows = new DataTable();
                dtInsertRows.Columns.Add("NAME");
    
                for (int i = 0; i < 500000; i++)
                {
                    DataRow drInsertRow = dtInsertRows.NewRow();
                    string name = "Name : " + i;
                    drInsertRow["NAME"] = name;
                    dtInsertRows.Rows.Add(drInsertRow);
    
    
                }
                return dtInsertRows;
    
            }
    
    
            static void sbc_SqlRowsCopied(object sender, SqlRowsCopiedEventArgs e)
            {
                Console.WriteLine("Number of records affected : " + e.RowsCopied.ToString());
            }
    
    
        }
    }
    
  • chobo2, almost 14 years
    So what would be the perfect batch size in your mind, then? One where you don't impact network latency as much, but you're not doing tens of thousands of trips.
  • mdma, almost 14 years
    You want to avoid network latency, i.e., reduce the impact of latency as much as possible. Batching 1,000-10,000 statements is generally fine, but it depends upon how much latency you have in your network and how much memory you can afford to use for batching statements. E.g., if you have N statements and a latency of L ms per round trip, then the total latency is NL ms. If you batch in size B, the latency becomes NL/B ms, so you reduce the latency by a factor of the batch size. Work these variables to arrive at a total latency that is acceptable in your situation, e.g., a small percentage of query execution time.
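    To put numbers on it (purely illustrative figures, not measurements): with N = 500,000 statements and L = 1 ms of round-trip latency, unbatched you pay 500,000 × 1 ms = 500 s in latency alone; with B = 1,000 that drops to 500,000 × 1 ms / 1,000 = 500 ms.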
  • chobo2, almost 14 years
    It would be a one-off bulk copy for the most part (for inserting 10,000-50,000 records). Otherwise it depends on the user and what they do. They could do it every day or never, but those would be updates, and probably no more than 50 records in one batch send.
  • Chris, almost 14 years
    I used BCP yesterday to transmit almost a million records from a very wide table that has about 400 fields. So it's exactly what you need for your task if it's a one-off of 50k rows.
  • chobo2, almost 14 years
    How long did that take you? Well, I am also looking at the whole picture. I am taking an XML document and using serialization on it. If that works well, then I could just add the records to another object array and send it with the LINQ to SQL way.
  • Chris, almost 14 years
    It took about 5 minutes to extract 900k rows. So yep, BCP is really good for occasional extracts. As for your regular 50-record update, BCP is probably overkill.
  • Chris, almost 14 years
    And to insert those 900k rows into another database took maybe a bit longer, say 10 minutes? Not long enough for me to worry about measuring it.
  • Amit Naidu, almost 11 years
    Do you mean the batch size has no effect on the transaction log and commits?
  • mdma, almost 11 years
    Yes, exactly. It just controls how often the data is sent over the network.