How to use distinct, sort, and limit with MongoDB

mongodb pymongo

Can you please clarify exactly what you would like to do? Do you want to return documents with unique "text" values with the highest "count" value?

For example, given the collection:

> db.text.find({}, {_id:0})
{ "text" : "here is text", "count" : 13, "somefield" : "value" }
{ "text" : "here is text", "count" : 12, "somefield" : "value" }
{ "text" : "here is text", "count" : 10, "somefield" : "value" }
{ "text" : "other text", "count" : 4, "somefield" : "value" }
{ "text" : "other text", "count" : 3, "somefield" : "value" }
{ "text" : "other text", "count" : 2, "somefield" : "value" }
>
(I have omitted _id values for brevity)

Would you like to return, for each unique "text" value, only the document with the highest "count" value? That is:

{ "text" : "here is text", "count" : 13, "somefield" : "value" }

and

{ "text" : "other text", "count" : 4, "somefield" : "value" }

One way to do this is with the $group operator and the $max accumulator in the new aggregation framework. The documentation on $group may be found here: http://docs.mongodb.org/manual/aggregation/

> db.text.aggregate({$group : {_id:"$text", "maxCount":{$max:"$count"}}})
{
    "result" : [
        {
            "_id" : "other text",
            "maxCount" : 4
        },
        {
            "_id" : "here is text",
            "maxCount" : 13
        }
    ],
    "ok" : 1
}

As you can see, the above returns only each distinct text value and its maximum count, not the original documents. If the original documents are desired, a follow-up query may then be run to find the documents matching each text and maxCount pair.
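Since the question is tagged pymongo, the follow-up query can be sketched in Python. This is only a minimal sketch, not part of the original answer: the helper name `build_followup_filter` is hypothetical, and it assumes the `$group` output has the `_id` / `maxCount` shape shown above.

```python
def build_followup_filter(groups):
    """Turn $group results into one $or filter matching the original documents.

    `groups` is assumed to be an iterable of dicts shaped like the
    aggregation output above: {"_id": <text>, "maxCount": <count>}.
    """
    clauses = [{"text": g["_id"], "count": g["maxCount"]} for g in groups]
    # $or needs a non-empty array, so return None when there is nothing to match
    return {"$or": clauses} if clauses else None


# The "result" array from the shell session above
groups = [
    {"_id": "other text", "maxCount": 4},
    {"_id": "here is text", "maxCount": 13},
]
followup_filter = build_followup_filter(groups)
print(followup_filter)
```

With a PyMongo collection handle, the resulting filter could be passed to `find()`. Note that if several documents share the same text and the same maximum count, all of them will match.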

As an alternative, you can first run the 'distinct' command to return an array of all the distinct values, and then run a query for each value with sort and limit to return the document with the highest value of 'count'. The sort() and limit() methods are explained in the "Cursor Methods" section of the "Advanced Queries" documentation: http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-CursorMethods

> var values = db.runCommand({distinct:"text", key:"text"}).values
> values
[ "here is text", "other text" ]
> for(v in values){var c = db.text.find({"text":values[v]}).sort({count:-1}).limit(1); c.forEach(printjson);}
{
    "_id" : ObjectId("4f7b50b2df77a5e0fd4ccbf1"),
    "text" : "here is text",
    "count" : 13,
    "somefield" : "value"
}
{
    "_id" : ObjectId("4f7b50b2df77a5e0fd4ccbf4"),
    "text" : "other text",
    "count" : 4,
    "somefield" : "value"
}
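A rough PyMongo equivalent of the shell loop above might look like the following. It is a sketch under assumptions: the function name `top_document_per_text` is hypothetical, and `collection` is assumed to be a PyMongo `Collection` (or any object with compatible `distinct()` and `find_one()` methods).

```python
def top_document_per_text(collection):
    """Return, for each distinct "text" value, the document with the highest "count".

    `collection` is assumed to be a pymongo Collection (or anything exposing
    the same distinct() and find_one() methods).
    """
    top_docs = []
    for value in collection.distinct("text"):
        # Sort descending on "count" and take the first match, which mirrors
        # find(...).sort({count: -1}).limit(1) in the shell.
        doc = collection.find_one({"text": value}, sort=[("count", -1)])
        if doc is not None:
            top_docs.append(doc)
    return top_docs
```

This issues one query per distinct value, so round trips grow with the number of unique texts. An index covering "text" and "count" would likely help each lookup, though that is worth measuring on real data.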

It is unclear if this is exactly what you are trying to do, but I hope that it will at least give you some ideas to get started. If I have misunderstood, please explain in more detail the exact operation that you would like to perform, and hopefully I or another member of the Community will be able to help you out. Thanks.

Author by

Kracekumar

Geek. http://about.me/kracekumar

Updated on June 04, 2022

Comments

Kracekumar almost 2 years

I have a document structure {'text': 'here is text', 'count' : 13, 'somefield': value}

The collection has a few thousand records, and the value of the text key may be repeated many times. I want to find each distinct text with its highest count value, and the whole document should be returned. I am able to sort them in descending order.

distinct returns the unique values in a list.

I want to use all three functions, and the document has to be returned. I am still learning and have not covered MapReduce yet.
Kracekumar about 12 years

"Would you like to return only the documents that contain unique text with the highest 'count' value?" Yes. I tried aggregate, but it seems PyMongo doesn't support aggregate. I tried db.command and it still failed; I will experiment. The second method seems straightforward, but I am concerned about complexity and round-trip time, since this runs across a few thousand to a few hundred thousand docs (at present 10k). Thanks for the 2 answers. Fetching the unique records will take constant time (increasing with the number of docs), and the descending sort will also take m * n lg , where m is the number of unique records.
Marc about 12 years

The aggregate command may be run in PyMongo like this: res = db.command({"aggregate":"text", "pipeline":[{"$group" : {"_id":"$text", "maxCount":{"$max":"$count"}}}]}) The PyMongo API documentation on commands is here: api.mongodb.org/python/current/api/pymongo/… It looks as though aggregate may not be necessary for your application, but it is still useful to know how to use the aggregation framework in PyMongo for future reference.
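For more recent setups, the same `$group` pipeline can be sketched as below. This is a hedged illustration, not the commenter's code: the helper names are hypothetical, and the empty `cursor` document is an assumption about the server version (newer servers reject the raw `aggregate` command without it).

```python
def max_count_pipeline():
    """Pipeline that groups by "text" and keeps the highest "count" per group."""
    return [{"$group": {"_id": "$text", "maxCount": {"$max": "$count"}}}]


def aggregate_command(collection_name):
    """Build the raw command document for db.command().

    "aggregate" must be the first key. The empty "cursor" document is an
    assumption about the server version, as noted above.
    """
    return {
        "aggregate": collection_name,
        "pipeline": max_count_pipeline(),
        "cursor": {},
    }


command = aggregate_command("text")
print(command)
```

With a database handle `db`, either `db.command(command)` or `db.text.aggregate(max_count_pipeline())` should run it on a PyMongo version that provides `Collection.aggregate()`. The latter returns a cursor over the grouped results rather than a document with a "result" array.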
Marc about 12 years

As a general piece of advice, it is often preferable to add a little bit of extra overhead on each insert and add an extra key to each document that can be queried, as opposed to doing a big calculation (such as a MapReduce or an Aggregation command) every time data is retrieved. As an alternative, maybe consider adding an "isMaxCountforText" key (or equivalent) to each document. Each time a document is added or updated, the "count" key can be checked against other documents containing the same text, and "isMaxCountforText" can be updated accordingly.
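The flag-on-write idea above could be sketched roughly as follows. The function name `insert_with_max_flag` is hypothetical, `collection` is assumed to be a PyMongo `Collection`, and only inserts are handled; updates to "count" would need a similar check.

```python
def insert_with_max_flag(collection, doc):
    """Insert `doc`, keeping "isMaxCountforText" true on one document per text.

    `collection` is assumed to be a pymongo Collection. The two writes are
    not atomic, so concurrent inserts for the same text could race.
    """
    current_max = collection.find_one(
        {"text": doc["text"], "isMaxCountforText": True}
    )
    is_new_max = current_max is None or doc["count"] > current_max["count"]
    if is_new_max and current_max is not None:
        # Demote the previous holder of the flag.
        collection.update_one(
            {"_id": current_max["_id"]},
            {"$set": {"isMaxCountforText": False}},
        )
    collection.insert_one({**doc, "isMaxCountforText": is_new_max})
    return is_new_max
```

Reads then reduce to a single query such as `collection.find({"isMaxCountforText": True})`. On a tie, this sketch leaves the flag on the earlier document.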