[SERVER-9272] Querying latest document based on a set of field Created: 08/Apr/13  Updated: 10/Dec/14  Resolved: 29/Apr/13

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 2.4.1
Fix Version/s: None

Type: New Feature Priority: Minor - P4
Reporter: Mervin San Andres Assignee: J Rassi
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-9366 Expose projection (aggregation) opera... Closed
duplicates SERVER-9507 Optimize $sort+$group+$first pipeline... Closed
Participants:

 Description   

It would be good if queries could filter for the most recent documents based on certain document fields. For example:

> db.logs.find()
{
  'id': ObjectId("50ad8d451d41c8fc58000003"),
  'name': 'Sample Log 1',
  'uploaded_at': ISODate("2013-03-14T01:00:00+01:00"),
  'case_id': '50ad8d451d41c8fc58000099',
  'tag_doc': {
    'group_x': ['TAG-1','TAG-2'],
    'group_y': ['XYZ']
  }
},
{
  'id': ObjectId("50ad8d451d41c8fc58000004"),
  'name': 'Sample Log 2',
  'uploaded_at': ISODate("2013-03-15T01:00:00+01:00"),
  'case_id': '50ad8d451d41c8fc58000099',
  'tag_doc': {
    'group_x': ['TAG-1'],
    'group_y': ['XYZ']
  }
}

Both documents contain the same case_id. The user could then retrieve the most recently uploaded document for each unique combination of fields, for example case_id and group_y. This is helpful when aggregating over a large collection containing multiple versions of documents.
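[Editor's note] The requested behaviour can be sketched in plain JavaScript (a simulation over the sample documents above, not a proposed server implementation; the helper name latestPerGroup is hypothetical):

```javascript
// For each unique (case_id, group_y) combination, keep only the most
// recently uploaded document. Sample data mirrors the two logs above.
const logs = [
  { name: 'Sample Log 1', uploaded_at: new Date('2013-03-14T01:00:00+01:00'),
    case_id: '50ad8d451d41c8fc58000099',
    tag_doc: { group_x: ['TAG-1', 'TAG-2'], group_y: ['XYZ'] } },
  { name: 'Sample Log 2', uploaded_at: new Date('2013-03-15T01:00:00+01:00'),
    case_id: '50ad8d451d41c8fc58000099',
    tag_doc: { group_x: ['TAG-1'], group_y: ['XYZ'] } },
];

function latestPerGroup(docs) {
  const latest = new Map();
  for (const doc of docs) {
    // Group key: the unique combination of case_id and group_y.
    const key = JSON.stringify([doc.case_id, doc.tag_doc.group_y]);
    const seen = latest.get(key);
    if (!seen || doc.uploaded_at > seen.uploaded_at) latest.set(key, doc);
  }
  return [...latest.values()];
}

console.log(latestPerGroup(logs).map(d => d.name)); // [ 'Sample Log 2' ]
```

Here both logs share the same (case_id, group_y) key, so only the later upload, Sample Log 2, survives.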



 Comments   
Comment by J Rassi [ 29/Apr/13 ]

You're suggesting:

  1. an optimization that allows for a $sort+$group+$first pipeline to avoid a full index scan. I'm distilling this request into a new ticket, SERVER-9507. Please add yourself as a watcher to that ticket, and add additional comments if necessary.
  2. new query syntax that can be used in a find() to access aggregation framework functionality. There is an existing ticket that covers this, SERVER-9366. Please comment there with your suggested syntax, or open a new ticket if your proposal is outside the scope of that ticket.

For your application, I would suggest one of the following options in the interim:

  • Test the performance of using $sort+$group+$first as suggested, and see if it meets your requirements. Note that $sort will use an index, if one is available, instead of performing an in-memory sort, so make sure your collection has the appropriate indexes.
  • Augment your application/schema to additionally maintain a separate collection (or collections) to store a mapping to use for this query. For example, you could create one document for each unique combination of (group_x, case_id), and store in this document a reference to the latest upload performed. See the documentation for more information on this data model.
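[Editor's note] The second option above can be sketched in plain JavaScript (a simulation only; in MongoDB the Map below would be a separate mapping collection updated with an upsert on each upload, and the helper name recordUpload is hypothetical):

```javascript
// Maintain the latest upload per (group_x tag, case_id) key.
// latestByKey stands in for the separate mapping collection.
const latestByKey = new Map();

function recordUpload(log) {
  for (const tag of log.tag_doc.group_x) {
    const key = `${tag}|${log.case_id}`;
    const seen = latestByKey.get(key);
    // Upsert semantics: keep only the most recent upload for this key.
    if (!seen || log.uploaded_at > seen.uploaded_at) latestByKey.set(key, log);
  }
}

recordUpload({ name: 'Sample Log 1',
               uploaded_at: new Date('2013-03-14T01:00:00+01:00'),
               case_id: '50ad8d451d41c8fc58000099',
               tag_doc: { group_x: ['TAG-1', 'TAG-2'] } });
recordUpload({ name: 'Sample Log 2',
               uploaded_at: new Date('2013-03-15T01:00:00+01:00'),
               case_id: '50ad8d451d41c8fc58000099',
               tag_doc: { group_x: ['TAG-1'] } });

// 'TAG-1' now maps to Sample Log 2; 'TAG-2' still maps to Sample Log 1.
```

Reading the latest document then becomes a point lookup on the mapping rather than a scan over every version in the main collection.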
Comment by Mervin San Andres [ 25/Apr/13 ]

Just following up: was my comment clear to you?

Comment by Mervin San Andres [ 16/Apr/13 ]

Better yet, I am hoping for something like

db.collection.aggregate({$sort:{t:-1}},{$group:{_id:{a:"$a",b:"$b"},t:{$max:"$t"}}})

but with the $max operator implemented as a query operator requiring field parameters. Maybe something like:

db.collection.find({},{$max:'uploaded_at',$by:{'group_x','case_id'}})

which can also be used in a $match pipeline.

Comment by Mervin San Andres [ 11/Apr/13 ]

Yes, I think you can view it as such. Performance is definitely the issue I am dealing with given the records that I have.

Comment by J Rassi [ 10/Apr/13 ]

Just checking in – does my above summary accurately reflect your request?

Comment by J Rassi [ 08/Apr/13 ]

Can you make sure I am understanding you correctly? I will attempt to rephrase your position:

The aggregation example given above satisfies your logical requirement for the feature request; however, it does not satisfy your performance requirement. Specifically, the example runs in time O(number of documents in the collection), assuming an index is available for the $sort, whereas you are interested in a mechanism that runs in time O(number of unique documents satisfying the constraint).

Comment by Mervin San Andres [ 08/Apr/13 ]

Actually, the point of the suggested feature is to prevent all of the 'redundant' documents from being loaded into the pipeline. Hopefully, the query passed to the $match stage would immediately limit the documents and speed up the grouping.

Comment by J Rassi [ 08/Apr/13 ]

I believe MongoDB's aggregation framework already supports the feature you're requesting.

See the following example query which, for each unique combination of fields a and b, returns the document with the most recent value of t.

db.collection.aggregate({$sort:{t:-1}},{$group:{_id:{a:"$a",b:"$b"},t:{$first:"$t"}}})

Is this what you're looking for?
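[Editor's note] What that pipeline computes can be sketched in plain JavaScript (a simulation of the $sort and $group/$first semantics, not the server's implementation):

```javascript
// Simulates {$sort:{t:-1}} followed by
// {$group:{_id:{a:"$a",b:"$b"}, t:{$first:"$t"}}}.
function sortGroupFirst(docs) {
  // $sort: order by t descending.
  const sorted = [...docs].sort((x, y) => y.t - x.t);
  // $group with $first: the first document seen per (a, b) key wins,
  // which after the descending sort is the one with the largest t.
  const groups = new Map();
  for (const doc of sorted) {
    const key = JSON.stringify([doc.a, doc.b]);
    if (!groups.has(key)) {
      groups.set(key, { _id: { a: doc.a, b: doc.b }, t: doc.t });
    }
  }
  return [...groups.values()];
}

const result = sortGroupFirst([
  { a: 1, b: 'x', t: 10 },
  { a: 1, b: 'x', t: 20 },
  { a: 2, b: 'y', t: 5 },
]);
console.log(result);
// → [ { _id: { a: 1, b: 'x' }, t: 20 }, { _id: { a: 2, b: 'y' }, t: 5 } ]
```

Note that the simulation still touches every input document, which is exactly the performance concern raised in the comments below.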

Generated at Thu Feb 08 03:19:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.