Fastest way to insert 100,000+ records into DocumentDB

Solution 1

The fastest way to insert documents into Azure DocumentDB is available as a sample on GitHub: https://github.com/Azure/azure-documentdb-dotnet/tree/master/samples/documentdb-benchmark

The following tips will help you achieve the best throughput using the .NET SDK:

  • Initialize a singleton DocumentClient
  • Use Direct connectivity and TCP protocol (ConnectionMode.Direct and ConnectionProtocol.Tcp)
  • Use 100s of Tasks in parallel (depends on your hardware)
  • Increase the MaxConnectionLimit in the DocumentClient constructor to a high value, say 1000 connections
  • Turn server garbage collection on (gcServer enabled="true" under runtime in app.config)
  • Make sure your collection has the appropriate provisioned throughput (and a good partition key)
  • Running in the same Azure region will also help
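
Putting those tips together, a minimal sketch with the .NET SDK might look like the following. The endpoint, key, and collection link are placeholders, and the concurrency cap of 200 is an assumed starting point to tune against your hardware and provisioned RU/s:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;

class BulkInsertSketch
{
    // Singleton client with Direct/TCP and a high connection limit.
    private static readonly DocumentClient Client = new DocumentClient(
        new Uri("https://<account>.documents.azure.com:443/"), // placeholder
        "<key>",                                               // placeholder
        new ConnectionPolicy
        {
            ConnectionMode = ConnectionMode.Direct,
            ConnectionProtocol = Protocol.Tcp,
            MaxConnectionLimit = 1000
        });

    static async Task RunAsync()
    {
        string collectionLink = "dbs/<db>/colls/<coll>"; // placeholder

        // Cap the number of in-flight writes at a few hundred.
        var throttle = new SemaphoreSlim(200);
        var tasks = new List<Task>();

        for (int i = 0; i < 100000; i++)
        {
            await throttle.WaitAsync();
            int userId = i;
            tasks.Add(Task.Run(async () =>
            {
                try
                {
                    await Client.CreateDocumentAsync(collectionLink,
                        new { id = Guid.NewGuid().ToString(), userId });
                }
                finally
                {
                    throttle.Release();
                }
            }));
        }

        await Task.WhenAll(tasks);
    }
}
```

Server GC is configured separately, by adding <gcServer enabled="true"/> inside the <runtime> element of app.config.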

With 10,000 RU/s, you can insert 100,000 documents in about 50 seconds (approximately 5 request units per write).

With 100,000 RU/s, you can insert in about 5 seconds. You can make this as fast as you want by configuring throughput (and for very high numbers of inserts, spread inserts across multiple VMs/workers).

EDIT (7/12/19): You can now use the bulk executor library, described at https://docs.microsoft.com/en-us/azure/cosmos-db/bulk-executor-overview

Solution 2

The Cosmos DB team has just released a bulk import and update SDK. Unfortunately it is only available in .NET Framework 4.5.1, but it apparently does a lot of the heavy lifting for you and maximizes use of throughput. See:

https://docs.microsoft.com/en-us/azure/cosmos-db/bulk-executor-overview
https://docs.microsoft.com/en-us/azure/cosmos-db/sql-api-sdk-bulk-executor-dot-net
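
For reference, usage of the library looks roughly like this (a sketch based on the Microsoft.Azure.CosmosDB.BulkExecutor package; it assumes a DocumentClient, a DocumentCollection, and a collection of JSON documents already exist):

```csharp
// Sketch: bulk import with the BulkExecutor library.
// Assumes `client` (DocumentClient), `collection` (DocumentCollection),
// and `documents` (IEnumerable<string> of JSON docs) already exist.
IBulkExecutor bulkExecutor = new BulkExecutor(client, collection);
await bulkExecutor.InitializeAsync();

// Hand congestion control to the bulk executor by turning off SDK retries.
client.ConnectionPolicy.RetryOptions.MaxRetryWaitTimeInSeconds = 0;
client.ConnectionPolicy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 0;

BulkImportResponse response = await bulkExecutor.BulkImportAsync(
    documents: documents,
    enableUpsert: false,
    disableAutomaticIdGeneration: true);

Console.WriteLine("Imported {0} docs, {1} RUs consumed",
    response.NumberOfDocumentsImported,
    response.TotalRequestUnitsConsumed);
```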

Solution 3

Cosmos DB SDK has been updated to allow bulk insert: https://docs.microsoft.com/en-us/azure/cosmos-db/tutorial-sql-api-dotnet-bulk-import via the AllowBulkExecution option.
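
With the v3 .NET SDK, that looks roughly like this (database, container, and partition key values are placeholders):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Sketch: bulk mode in the v3 SDK. With AllowBulkExecution enabled,
// the SDK groups concurrent point operations into batches internally.
CosmosClient client = new CosmosClient("<endpoint>", "<key>",
    new CosmosClientOptions { AllowBulkExecution = true });

Container container = client.GetContainer("<database>", "<container>");

List<Task> tasks = new List<Task>();
for (int i = 0; i < 100000; i++)
{
    var item = new { id = Guid.NewGuid().ToString(), userId = i, status = "Pending" };
    tasks.Add(container.CreateItemAsync(item, new PartitionKey("<partition-value>")));
}
await Task.WhenAll(tasks);
```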


Author: Mark Clancy

Updated on November 07, 2020

Comments

  • Mark Clancy over 3 years

    As the title suggests, I need to insert 100,000+ records into a DocumentDb collection programmatically. The data will be used for creating reports later on. I am using the Azure Documents SDK and a stored procedure for bulk inserting documents (See question Azure documentdb bulk insert using stored procedure).

    The following console application shows how I'm inserting documents.

    InsertDocuments generates 500 test documents to pass to the stored procedure. The main function calls InsertDocuments 10 times, inserting 5,000 documents overall. Running this application results in 500 documents getting inserted every few seconds. If I increase the number of documents per call I start to get errors and lost documents.

    Can anyone recommend a faster way to insert documents?

    static void Main(string[] args)
    {
        Console.WriteLine("Starting...");
    
        MainAsync().Wait();
    }
    
    static async Task MainAsync()
    {
        int campaignId = 1001,
            count = 500;
    
        for (int i = 0; i < 10; i++)
        {
            await InsertDocuments(campaignId, (count * i) + 1, (count * i) + count);
        }
    }
    
    static async Task InsertDocuments(int campaignId, int startId, int endId)
    {
        using (DocumentClient client = new DocumentClient(new Uri(documentDbUrl), documentDbKey))
        {
            List<dynamic> items = new List<dynamic>();
    
            // Create x number of documents to insert
            for (int i = startId; i <= endId; i++)
            {
                var item = new
                {
                    id = Guid.NewGuid(),
                    campaignId = campaignId,
                    userId = i,
                    status = "Pending"
                };
    
                items.Add(item);
            }
    
            var task = client.ExecuteStoredProcedureAsync<dynamic>("/dbs/default/colls/campaignusers/sprocs/bulkImport", new RequestOptions()
            {
                PartitionKey = new PartitionKey(campaignId)
            },
            new
            {
                items = items
            });
    
            try
            {
                var result = await task;
    
                int insertCount = (int)result.Response;
    
                Console.WriteLine("{0} documents inserted...", insertCount);
            }
            catch (Exception e)
            {
                Console.WriteLine("Error: {0}", e.Message);
            }
        }
    }
    
  • Mark Clancy over 7 years
    Thanks for the answer. It's interesting that it doesn't use stored procedures to bulk insert documents. It relies heavily on the size of hardware to get fast results. I'll test this and post my results.
  • Aravind Krishna R. over 7 years
    You can use stored procedures as well, but that approach requires you to batch requests by partition key, which might not always be possible (if you have a key-value workload).
  • Aravind Krishna R. over 7 years
    If you'd like to discuss in more detail over email/phone, please email [email protected]
  • Ben Mayo over 6 years
    I'm trying to implement this solution - I've saved the stored procedure using the Azure web front end, and am trying to call it via bulkloadresult = client.ExecuteStoredProcedure('dbs/database/colls/upload/sprocs/bulkimport2', {"items": '[{"name":"John","age":30,"city":"New York","PartitionKey":"upload"}, {"name":"Ben","age":25,"city":"LA","PartitionKey":"upload"}]'}, {"partitionKey":"upload"}). I'm getting back {"code":"BadRequest","message":"Message: {\"Errors\":[\"Encountered exception while executing function. Exception = undefined\"]}\r\nActivityId: 66b573d4... Any ideas? Thanks
  • donald over 6 years
    I forgot to add the code that creates the collection with a partition key. The stored procedure works within a single partition. In my collection the partition key is "/city", so create the collection with partition key "/city" and then run the above code.
  • NathofGod over 6 years
    Any idea how to set the MaxConnectionLimit using the Node JS client?
  • bytedev over 6 years
    @NathofGod In .NET it is on the ConnectionPolicy class that is passed into the constructor when creating the DocumentClient. Is this not the case in Node JS?
  • moarra about 4 years
    Is there a way to define the number of RUs to be used programmatically?