Bulk inserting best way to about it? + Helping me understand fully what I found so far
The batch size is there to reduce the impact of network latency. It does not need to be more than a few thousand. Multiple statements are collected together and sent as a unit, so you get the hit of one network trip once every N statements, rather than once per statement.
Related videos on Youtube
chobo2
Updated on June 04, 2022Comments
-
chobo2 almost 2 years
So I saw this post here and read it and it seems like bulk copy might be the way to go.
What’s the best way to bulk database inserts from c#?
I still have some questions and want to know how things actually work.
So I found 2 tutorials.
http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241
http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx
First way uses 2 ado.net 2.0 features. BulkInsert and BulkCopy. the second one uses linq to sql and OpenXML.
This sort of appeals to me as I am using linq to sql already and prefer it over ado.net. However as one person pointed out in the posts what he just going around the issue at the cost of performance( nothing wrong with that in my opinion)
First I will talk about the 2 ways in the first tutorial
I am using VS2010 Express(for testing of the tutorials I used VS2008 and not sure what .net version I just loaded up there sample files and ran them), .net 4.0, MVC 2.0, SQl Server 2005
- Is ado.net 2.0 the most current version?
- Based on the technology I am using, is there some updates to what I am going to show that would improve it somehow?
- Is there any thing that these tutorial left out that I should know about?
BulkInsert
I am using this table for all the examples.
CREATE TABLE [dbo].[TBL_TEST_TEST] ( ID INT IDENTITY(1,1) PRIMARY KEY, [NAME] [varchar](50) )
SP Code
USE [Test] GO /****** Object: StoredProcedure [dbo].[sp_BatchInsert] Script Date: 05/19/2010 15:12:47 ******/ SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO ALTER PROCEDURE [dbo].[sp_BatchInsert] (@Name VARCHAR(50) ) AS BEGIN INSERT INTO TBL_TEST_TEST VALUES (@Name); END
C# Code
/// <summary> /// Another ado.net 2.0 way that uses a stored procedure to do a bulk insert. /// Seems slower then "BatchBulkCopy" way and it crashes when you try to insert 500,000 records in one go. /// http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241 /// </summary> private static void BatchInsert() { // Get the DataTable with Rows State as RowState.Added DataTable dtInsertRows = GetDataTable(); SqlConnection connection = new SqlConnection(connectionString); SqlCommand command = new SqlCommand("sp_BatchInsert", connection); command.CommandType = CommandType.StoredProcedure; command.UpdatedRowSource = UpdateRowSource.None; // Set the Parameter with appropriate Source Column Name command.Parameters.Add("@Name", SqlDbType.VarChar, 50, dtInsertRows.Columns[0].ColumnName); SqlDataAdapter adpt = new SqlDataAdapter(); adpt.InsertCommand = command; // Specify the number of records to be Inserted/Updated in one go. Default is 1. adpt.UpdateBatchSize = 1000; connection.Open(); int recordsInserted = adpt.Update(dtInsertRows); connection.Close(); }
So first thing is the batch size. Why would you set a batch size to anything but the number of records you are sending? Like I am sending 500,000 records so I did a Batch size of 500,000.
Next why does it crash when I do this? If I set it to 1000 for batch size it works just fine.
System.Data.SqlClient.SqlException was unhandled Message="A transport-level error has occurred when sending the request to the server. (provider: Shared Memory Provider, error: 0 - No process is on the other end of the pipe.)" Source=".Net SqlClient Data Provider" ErrorCode=-2146232060 Class=20 LineNumber=0 Number=233 Server="" State=0 StackTrace: at System.Data.Common.DbDataAdapter.UpdatedRowStatusErrors(RowUpdatedEventArgs rowUpdatedEvent, BatchCommandInfo[] batchCommands, Int32 commandCount) at System.Data.Common.DbDataAdapter.UpdatedRowStatus(RowUpdatedEventArgs rowUpdatedEvent, BatchCommandInfo[] batchCommands, Int32 commandCount) at System.Data.Common.DbDataAdapter.Update(DataRow[] dataRows, DataTableMapping tableMapping) at System.Data.Common.DbDataAdapter.UpdateFromDataTable(DataTable dataTable, DataTableMapping tableMapping) at System.Data.Common.DbDataAdapter.Update(DataTable dataTable) at TestIQueryable.Program.BatchInsert() in C:\Users\a\Downloads\TestIQueryable\TestIQueryable\TestIQueryable\Program.cs:line 124 at TestIQueryable.Program.Main(String[] args) in C:\Users\a\Downloads\TestIQueryable\TestIQueryable\TestIQueryable\Program.cs:line 16 InnerException:
Time it took to insert 500,000 records with insert batch size of 1000 took "2 mins and 54 seconds"
Of course this is no official time I sat there with a stop watch( I am sure there are better ways but was too lazy to look what they where)
So I find that kinda slow compared to all my other ones(expect the linq to sql insert one) and I am not really sure why.
Next I looked at bulkcopy
/// <summary> /// An ado.net 2.0 way to mass insert records. This seems to be the fastest. /// http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241 /// </summary> private static void BatchBulkCopy() { // Get the DataTable DataTable dtInsertRows = GetDataTable(); using (SqlBulkCopy sbc = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.KeepIdentity)) { sbc.DestinationTableName = "TBL_TEST_TEST"; // Number of records to be processed in one go sbc.BatchSize = 500000; // Map the Source Column from DataTabel to the Destination Columns in SQL Server 2005 Person Table // sbc.ColumnMappings.Add("ID", "ID"); sbc.ColumnMappings.Add("NAME", "NAME"); // Number of records after which client has to be notified about its status sbc.NotifyAfter = dtInsertRows.Rows.Count; // Event that gets fired when NotifyAfter number of records are processed. sbc.SqlRowsCopied += new SqlRowsCopiedEventHandler(sbc_SqlRowsCopied); // Finally write to server sbc.WriteToServer(dtInsertRows); sbc.Close(); } }
This one seemed to go really fast and did not even need a SP( can you use SP with bulk copy? If you can would it be better?)
BatchCopy had no problem with a 500,000 batch size.So again why make it smaller then the number of records you want to send?
I found that with BatchCopy and 500,000 batch size it took only 5 seconds to complete. I then tried with a batch size of 1,000 and it only took 8 seconds.
So much faster then the bulkinsert one above.
Now I tried the other tutorial.
USE [Test] GO /****** Object: StoredProcedure [dbo].[spTEST_InsertXMLTEST_TEST] Script Date: 05/19/2010 15:39:03 ******/ SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO ALTER PROCEDURE [dbo].[spTEST_InsertXMLTEST_TEST](@UpdatedProdData nText) AS DECLARE @hDoc int exec sp_xml_preparedocument @hDoc OUTPUT,@UpdatedProdData INSERT INTO TBL_TEST_TEST(NAME) SELECT XMLProdTable.NAME FROM OPENXML(@hDoc, 'ArrayOfTBL_TEST_TEST/TBL_TEST_TEST', 2) WITH ( ID Int, NAME varchar(100) ) XMLProdTable EXEC sp_xml_removedocument @hDoc
C# code.
/// <summary> /// This is using linq to sql to make the table objects. /// It is then serailzed to to an xml document and sent to a stored proedure /// that then does a bulk insert(I think with OpenXML) /// http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx /// </summary> private static void LinqInsertXMLBatch() { using (TestDataContext db = new TestDataContext()) { TBL_TEST_TEST[] testRecords = new TBL_TEST_TEST[500000]; for (int count = 0; count < 500000; count++) { TBL_TEST_TEST testRecord = new TBL_TEST_TEST(); testRecord.NAME = "Name : " + count; testRecords[count] = testRecord; } StringBuilder sBuilder = new StringBuilder(); System.IO.StringWriter sWriter = new System.IO.StringWriter(sBuilder); XmlSerializer serializer = new XmlSerializer(typeof(TBL_TEST_TEST[])); serializer.Serialize(sWriter, testRecords); db.insertTestData(sBuilder.ToString()); } }
So I like this because I get to use objects even though it is kinda redundant. I don't get how the SP works. Like I don't get the whole thing. I don't know if OPENXML has some batch insert under the hood but I do not even know how to take this example SP and change it to fit my tables since like I said I don't know what is going on.
I also don't know what would happen if the object you have more tables in it. Like say I have a ProductName table what has a relationship to a Product table or something like that.
In linq to sql you could get the product name object and make changes to the Product table in that same object. So I am not sure how to take that into account. I am not sure if I would have to do separate inserts or what.
The time was pretty good for 500,000 records it took 52 seconds
The last way of course was just using linq to do it all and it was pretty bad.
/// <summary> /// This is using linq to sql to to insert lots of records. /// This way is slow as it uses no mass insert. /// Only tried to insert 50,000 records as I did not want to sit around till it did 500,000 records. /// http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx /// </summary> private static void LinqInsertAll() { using (TestDataContext db = new TestDataContext()) { db.CommandTimeout = 600; for (int count = 0; count < 50000; count++) { TBL_TEST_TEST testRecord = new TBL_TEST_TEST(); testRecord.NAME = "Name : " + count; db.TBL_TEST_TESTs.InsertOnSubmit(testRecord); } db.SubmitChanges(); } }
I did only 50,000 records and that took over a minute to do.
So I really narrowed it done to the linq to sql bulk insert way or bulk copy. I am just not sure how to do it when you have relationship for either way. I am not sure how they both stand up when doing updates instead of inserts as I have not gotten around to try it yet.
I don't think I will ever need to insert/update more than 50,000 records at one type but at the same time I know I will have to do validation on records before inserting so that will slow it down and that sort of makes linq to sql nicer as your got objects especially if your first parsing data from a xml file before you insert into the database.
Full C# code
using System; using System.Collections.Generic; using System.Linq; using System.Text; using System.Xml.Serialization; using System.Data; using System.Data.SqlClient; namespace TestIQueryable { class Program { private static string connectionString = ""; static void Main(string[] args) { BatchInsert(); Console.WriteLine("done"); } /// <summary> /// This is using linq to sql to to insert lots of records. /// This way is slow as it uses no mass insert. /// Only tried to insert 50,000 records as I did not want to sit around till it did 500,000 records. /// http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx /// </summary> private static void LinqInsertAll() { using (TestDataContext db = new TestDataContext()) { db.CommandTimeout = 600; for (int count = 0; count < 50000; count++) { TBL_TEST_TEST testRecord = new TBL_TEST_TEST(); testRecord.NAME = "Name : " + count; db.TBL_TEST_TESTs.InsertOnSubmit(testRecord); } db.SubmitChanges(); } } /// <summary> /// This is using linq to sql to make the table objects. /// It is then serailzed to to an xml document and sent to a stored proedure /// that then does a bulk insert(I think with OpenXML) /// http://www.codeproject.com/KB/linq/BulkOperations_LinqToSQL.aspx /// </summary> private static void LinqInsertXMLBatch() { using (TestDataContext db = new TestDataContext()) { TBL_TEST_TEST[] testRecords = new TBL_TEST_TEST[500000]; for (int count = 0; count < 500000; count++) { TBL_TEST_TEST testRecord = new TBL_TEST_TEST(); testRecord.NAME = "Name : " + count; testRecords[count] = testRecord; } StringBuilder sBuilder = new StringBuilder(); System.IO.StringWriter sWriter = new System.IO.StringWriter(sBuilder); XmlSerializer serializer = new XmlSerializer(typeof(TBL_TEST_TEST[])); serializer.Serialize(sWriter, testRecords); db.insertTestData(sBuilder.ToString()); } } /// <summary> /// An ado.net 2.0 way to mass insert records. This seems to be the fastest. /// http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241 /// </summary> private static void BatchBulkCopy() { // Get the DataTable DataTable dtInsertRows = GetDataTable(); using (SqlBulkCopy sbc = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.KeepIdentity)) { sbc.DestinationTableName = "TBL_TEST_TEST"; // Number of records to be processed in one go sbc.BatchSize = 500000; // Map the Source Column from DataTabel to the Destination Columns in SQL Server 2005 Person Table // sbc.ColumnMappings.Add("ID", "ID"); sbc.ColumnMappings.Add("NAME", "NAME"); // Number of records after which client has to be notified about its status sbc.NotifyAfter = dtInsertRows.Rows.Count; // Event that gets fired when NotifyAfter number of records are processed. sbc.SqlRowsCopied += new SqlRowsCopiedEventHandler(sbc_SqlRowsCopied); // Finally write to server sbc.WriteToServer(dtInsertRows); sbc.Close(); } } /// <summary> /// Another ado.net 2.0 way that uses a stored procedure to do a bulk insert. /// Seems slower then "BatchBulkCopy" way and it crashes when you try to insert 500,000 records in one go. /// http://www.codeproject.com/KB/cs/MultipleInsertsIn1dbTrip.aspx#_Toc196622241 /// </summary> private static void BatchInsert() { // Get the DataTable with Rows State as RowState.Added DataTable dtInsertRows = GetDataTable(); SqlConnection connection = new SqlConnection(connectionString); SqlCommand command = new SqlCommand("sp_BatchInsert", connection); command.CommandType = CommandType.StoredProcedure; command.UpdatedRowSource = UpdateRowSource.None; // Set the Parameter with appropriate Source Column Name command.Parameters.Add("@Name", SqlDbType.VarChar, 50, dtInsertRows.Columns[0].ColumnName); SqlDataAdapter adpt = new SqlDataAdapter(); adpt.InsertCommand = command; // Specify the number of records to be Inserted/Updated in one go. Default is 1. adpt.UpdateBatchSize = 500000; connection.Open(); int recordsInserted = adpt.Update(dtInsertRows); connection.Close(); } private static DataTable GetDataTable() { // You First need a DataTable and have all the insert values in it DataTable dtInsertRows = new DataTable(); dtInsertRows.Columns.Add("NAME"); for (int i = 0; i < 500000; i++) { DataRow drInsertRow = dtInsertRows.NewRow(); string name = "Name : " + i; drInsertRow["NAME"] = name; dtInsertRows.Rows.Add(drInsertRow); } return dtInsertRows; } static void sbc_SqlRowsCopied(object sender, SqlRowsCopiedEventArgs e) { Console.WriteLine("Number of records affected : " + e.RowsCopied.ToString()); } } }
-
chobo2 almost 14 yearsSo what would be the prefect batch size in your mind then? Where you don't impact network latency as much but your not doing 10,000's of trips.
-
mdma almost 14 yearsYou want to try to avoid the network latency, i.e. reduce the impact of latency as much as possible. batching 1000-10000 is generally fine, but it depends upon how much latency you have in your network and how much memory you can afford to use for batching statements. E.g. if you have N statements, and a latency of Lms, then total latency is NL ms. If you batch in size B, then latency is NL/B, so you reduce the latency by the size of the batch. Work these variables to arrive at a total latency that is acceptable in your situation - e.g. a small percentage of query execution time.
-
chobo2 almost 14 yearsIt would be a one-off bulk copy for the most part(for inserting 10,000-50,000 records). Otherwise it depends on the user and what they do. They could do it every day or never but these would be updates and probably looking around no more than 50 in one batch send.
-
Chris almost 14 yearsI used BCP yesterday to transmit almost a million records from a very wide table that has about 400 fields. So it's exactly what you need for your task if its a one-off of 50k rows.
-
chobo2 almost 14 yearsHow long did that take you? Well I am also looking at the whole picture. Like I am taking a xml document and using serialization on it. If it is good then I could just add it to another object array and call send it with linq to sql way.
-
Chris almost 14 yearsTook about 5 minutes to extract 900k rows. So yep, BCP is really good for occasional extracts. As for your regular 50record update, BCP is probably overkill.
-
Chris almost 14 yearsAnd to insert those 900k rows into another database, maybe a bit longer, say 10 mins? Not long enough for me to worry about measuring it.
-
Amit Naidu almost 11 yearsDo you mean the batch size has no effect on the transaction log and commits?
-
mdma almost 11 yearsyes, exactly. It just controls how often the data is sent over the network.