Remove duplicate in MongoDB

14,715

Solution 1

Yes, dropDups is gone for good. But you can definitely achieve your goal with little bit effort.

You need to first find all duplicate rows and then remove all except first.

db.dups.aggregate([{$group:{_id:"$contact_id", dups:{$push:"$_id"}, count: {$sum: 1}}},
{$match:{count: {$gt: 1}}}
]).forEach(function(doc){
  doc.dups.shift();
  db.dups.remove({_id : {$in: doc.dups}});
});

As you see doc.dups.shift() will remove first _id from array and then remove all documents with remaining _ids in dups array.

script above will remove all duplicate documents.

Solution 2

this is a good pattern for mongod 3+ that also ensures that you will not run our of memory which can happen with really big collections. You can save this to a dedup.js file, customize it, and run it against your desired database with: mongo localhost:27017/YOURDB dedup.js

var duplicates = [];

db.runCommand(
  {aggregate: "YOURCOLLECTION",
    pipeline: [
      { $group: { _id: { DUPEFIELD: "$DUPEFIELD"}, dups: { "$addToSet": "$_id" }, count: { "$sum": 1 } }},
      { $match: { count: { "$gt": 1 }}}
    ],
    allowDiskUse: true }
)
.result
.forEach(function(doc) {
    doc.dups.shift();
    doc.dups.forEach(function(dupId){ duplicates.push(dupId); })
})
printjson(duplicates); //optional print the list of duplicates to be removed

db.YOURCOLLECTION.remove({_id:{$in:duplicates}});

Solution 3

We can also use an $out stage to remove duplicates from a collection by replacing the content of the collection with only one occurrence per duplicate.

For instance, to only keep one element per value of x:

// > db.collection.find()
//     { "x" : "a", "y" : 27 }
//     { "x" : "a", "y" : 4  }
//     { "x" : "b", "y" : 12 }
db.collection.aggregate(
  { $group: { _id: "$x", onlyOne: { $first: "$$ROOT" } } },
  { $replaceWith: "$onlyOne" }, // prior to 4.2: { $replaceRoot: { newRoot: "$onlyOne" } }
  { $out: "collection" }
)
// > db.collection.find()
//     { "x" : "a", "y" : 27 }
//     { "x" : "b", "y" : 12 }

This:

  • $groups documents by the field defining what a duplicate is (here x) and accumulates grouped documents by only keeping one (the $first found) and giving it the value $$ROOT, which is the document itself. At the end of this stage, we have something like:

    { "_id" : "a", "onlyOne" : { "x" : "a", "y" : 27 } }
    { "_id" : "b", "onlyOne" : { "x" : "b", "y" : 12 } }
    
  • $replaceWith all existing fields in the input document with the content of the onlyOne field we've created in the $group stage, in order to find the original format back. At the end of this stage, we have something like:

    { "x" : "a", "y" : 27 }
    { "x" : "b", "y" : 12 }
    

    $replaceWith is only available starting in Mongo 4.2. With prior versions, we can use $replaceRoot instead:

    { $replaceRoot: { newRoot: "$onlyOne" } }
    
  • $out inserts the result of the aggregation pipeline in the same collection. Note that $out conveniently replaces the content of the specified collection, making this solution possible.

Share:
14,715
Jhonathan
Author by

Jhonathan

Curte um bom rock, seriados, seu cachorro e Ovomaltine. Adora misturar tecnologias web, mania que pegou de quando era barman. Grande fã do manifesto ágil, trabalhou em equipes de pequeno e médio porte com várias metodologias adotadas desde cascata até o SCRUM. É especialista em ferramentas de otimização de sites e visibilidade na Web. Hoje trabalha como Analista Programador no Grupo UOL Educação e pós-graduando em Arquitetura de Software. Também é criador do codigosimples.net, que é uma forma de aprender ainda mais, compartilhando o pouco que sabe. http://codigosimples.net

Updated on June 05, 2022

Comments

  • Jhonathan
    Jhonathan about 2 years

    I have a collection with the field called "contact_id". In my collection I have duplicate registers with this key.

    How can I remove duplicates, resulting in just one register?

    I already tried:

    db.PersonDuplicate.ensureIndex({"contact_id": 1}, {unique: true, dropDups: true}) 
    

    But did not work, because the function dropDups is no longer available in MongoDB 3.x

    I'm using 3.2

  • Jhonathan
    Jhonathan over 8 years
    Hi. Partial worked. When I put in a small collection works fine. But when I execute in a big collection the databases "lock" and others query goes to timeout.
  • Saleem
    Saleem over 8 years
    Well, this is expected especially in very large collections. MongoDB aggregation pipeline is limited to 100MB RAM usage. See docs.mongodb.org/manual/aggregation. again your question didn't mentioned anything about your collection size. This solution gives you idea how to solve a problem and is not covering all edge cases. So you'll need to put some efforts too.
  • Jhonathan
    Jhonathan over 8 years
    Thanks for help. I'll search about this.
  • Vince Bowdren
    Vince Bowdren almost 8 years
    For large collections, you may need to use the Bulk Write Operations feature.