MongoDB Bulk Insert where many documents already exist

10,363

Solution 1

My background is in Java where exceptions are time consuming and that's the main reason I'm asking - will the "continueOnError" option be time consuming???

The ContinueOnError flag for Bulk Inserts only affects the behaviour of the batch processing: rather than stopping processing on the first error encountered, the full batch will be processed.

In MongoDB 2.4 you will only get a single error for the batch, which will be the last error encountered. This means that if you care about catching every error, you would be better off doing individual inserts.

The main time saving for bulk insert vs single insert is fewer network round trips. Instead of sending a message to the MongoDB server per document inserted, drivers can break down bulk inserts into batches of up to the maxMessageSizeBytes accepted by the mongod server (currently 48MB).
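
To illustrate the round-trip point, here is a rough sketch of how a list of documents might be grouped into size-limited batches, one message per batch. It is not the driver's actual code: the function name splitIntoBatches is hypothetical, and the JSON string length is used as a crude stand-in for the real BSON size.

```javascript
// Sketch only: groups documents into batches that stay under maxBytes.
// ASSUMPTION: JSON length approximates document size; real drivers measure BSON.
function splitIntoBatches(docs, maxBytes) {
  const batches = [];
  let current = [];
  let currentBytes = 0;
  for (const doc of docs) {
    const size = Buffer.byteLength(JSON.stringify(doc));
    // Start a new batch when adding this document would exceed the limit.
    if (current.length > 0 && currentBytes + size > maxBytes) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(doc);
    currentBytes += size;
  }
  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}

// Each batch would cost one round trip instead of one per document.
const batches = splitIntoBatches([{ a: 1 }, { a: 2 }, { a: 3 }], 14);
console.log(batches.length + ' round trips instead of 3');
```

With around 100 small documents everything fits comfortably in a single batch, which is why the saving is limited to the round trips themselves.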

Are bulk inserts appropriate for this use case?

Given your use case of only 100s (or even 1000s) of documents to insert where 80% already exist, there may not be a huge benefit in using bulk inserts (especially if this process only happens every few days). Your small inserts will be combined in batches, but 80% of the documents don't actually need to be sent to the server.

I would still favour bulk insert with ContinueOnError over your approach of deletion and re-insertion, but bulk inserts may be an unnecessary early optimisation given the number of documents you are wrangling and the percentage that actually need to be inserted.

I would suggest doing a few runs with the different approaches to see what the actual impact is for your use case.

MongoDB 2.6

As a heads-up, the batch functionality is being significantly improved in the MongoDB 2.5 development series (which will culminate in the 2.6 production release). Planned features include support for bulk upserts and accumulating per-document errors rather than a single error per batch.

The new write commands will require driver changes to support, but may change some of the assumptions above. For example, with ContinueOnError using the new batch API you could end up getting a result back with the 80% of your batch IDs that are duplicate keys.
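
Once per-document errors are available, the duplicates can be separated from genuine failures. The sketch below assumes each error entry carries a numeric code and the index of the failing document in the submitted batch; the helper name splitWriteErrors is hypothetical, and the exact shape of the error object depends on the driver version.

```javascript
// Sketch only: separates expected duplicate key errors from real failures.
// ASSUMPTION: each entry in writeErrors looks like { index, code }.
const DUPLICATE_KEY = 11000; // MongoDB's duplicate key error code

function splitWriteErrors(docs, writeErrors) {
  const duplicates = [];
  const failures = [];
  for (const e of writeErrors) {
    const entry = { doc: docs[e.index], code: e.code };
    if (e.code === DUPLICATE_KEY) {
      duplicates.push(entry); // already in the collection, safe to ignore
    } else {
      failures.push(entry); // anything else deserves attention
    }
  }
  return { duplicates, failures };
}

// Hypothetical usage with a driver that reports per-document errors:
//   try { await collection.insertMany(docs, { ordered: false }); }
//   catch (err) {
//     const { failures } = splitWriteErrors(docs, [].concat(err.writeErrors || []));
//     if (failures.length > 0) throw err;
//   }
```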

For more details, see the parent issue SERVER-9038 in the MongoDB issue tracker.

Solution 2

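// Insert with continueOnError so one failed document does not stop the rest of the batch.
collection.insert(item, {continueOnError: true, safe: true}, function(err, result) {
    // 11000 is the duplicate key error code. It is expected here (the document
    // already exists), so it is ignored; any other error is rethrown.
    // The loose comparison tolerates the code arriving as a number or a string.
    if (err && err.code != 11000) {
        throw err;
    }

    db.close();
    callBack();
});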

Solution 3

For your case, I'd suggest you consider fetching a list of the existing document _ids, and then only sending the documents that aren't in that list already. While you could use update with upsert to update individually, there's little reason to do so. Unless the list of _ids is extremely long (tens of thousands), it would be more efficient to grab the list and do the comparison than do individual updates to the database for each document (with some large percentage apparently failing to update).

I wouldn't use the continueOnError and send all documents ... it's less efficient.
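
The fetch-and-compare approach suggested in this answer could look roughly like the sketch below. The function names selectMissing and insertOnlyNew are hypothetical, and the wrapper assumes a collection object with the promise-based find() and insertMany() of a recent Node.js driver. It also assumes _id values compare correctly once converted to strings.

```javascript
// Pure helper: keep only the documents whose _id is not already stored.
// ASSUMPTION: _id values can be compared after String() conversion.
function selectMissing(docs, existingIds) {
  const known = new Set(existingIds.map(String));
  return docs.filter((doc) => !known.has(String(doc._id)));
}

// Hypothetical wrapper: one query for the existing _ids, one bulk insert.
async function insertOnlyNew(collection, docs) {
  const ids = docs.map((doc) => doc._id);
  // Ask only for the _id field of documents that already exist.
  const existing = await collection
    .find({ _id: { $in: ids } }, { projection: { _id: 1 } })
    .toArray();
  const missing = selectMissing(docs, existing.map((doc) => doc._id));
  if (missing.length > 0) {
    await collection.insertMany(missing);
  }
  return missing.length; // number of documents actually sent
}
```

This costs two round trips regardless of batch size, and only the documents that are actually new travel to the server.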


Author: user949300

Currently a budding full-stack web developer using HTML5, CSS, JavaScript, node.js, a little mongodb, python and PHP. Before that a long time Java programmer in the scientific instruments / biotech field, mainly algorithms, desktop / "SE", threading, Swing.

Updated on September 01, 2022

Comments

  • user949300, over 1 year

    I have a largish (~100) array of smallish documents (maybe 10 fields each) to insert in MongoDB. But many of them (perhaps all, but typically 80% or so) will already exist in the DB. The documents represent upcoming events over the next few months, and I'm updating the database every couple of days. So most of the events are already in there.

    Anybody know (or want to guess) if it would be more efficient to:

    1. Do the bulk insert but with continueOnError = true, e.g.

    db.collection.insert(myArray, {continueOnError: true}, callback)

    2. Do individual inserts, checking first if the _id exists?

    3. First do a big remove (something like db.collection.remove({_id: {$in: [array of all the IDs in my new documents]}})), then a bulk insert?

    I'll probably do #1 as that is the simplest, and I don't think that 100 documents is all that large so it may not matter, but if there were 10,000 documents? I'm doing this in JavaScript with the node.js driver if that matters. My background is in Java where exceptions are time consuming and that's the main reason I'm asking - will the "continueOnError" option be time consuming???

    ADDED: I don't think "upsert" makes sense. That is for updating an individual document. In my case, the individual document, representing an upcoming event, is not changing. (well, maybe it is, that's another issue)

    What's happening is that a few new documents will be added.

    • xspydr, over 10 years
      Are you able to check to see if a document/object already has an id assigned without making a call to the server? Try to do as much within your application as possible without making calls to the db.
  • user949300, over 10 years
    I believe that upsert only works on one document at a time. Plus, in my case, I'm not updating existing documents, I'm adding new ones.
  • user949300, over 10 years
    This sounds interesting - I'll give it a try.
  • user949300, about 10 years
    For the time being, I decided that I did want to remove and re-insert, so that all documents are up to date. Luckily for me, the docs don't change often, so this only happens every couple of days. Also, the docs are organized by US state, and have a field for such, so instead of deleting based on a gazillion indices I can just pick the states for which there are new docs, e.g. delete({ state: 'CA' }). (Will add an index there too!). But thanks for the info and "head's up" and I'll give you the check for best answer.
  • neontapir, over 9 years
    For those new to Mongo, this answer might be more useful with some explanation of what this code is doing.
  • Parvez Khan, almost 3 years
    Is it possible to provide a list of documents in the update param?