Core Server / SERVER-41891

$group on sorted cursor forces full collection scan

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 4.0.6
    • Component/s: Aggregation Framework
    • Labels: None
    • Operating System: ALL
    • Environment: mongodb-osx-x86_64-4.0.6

I have a collection with ~2 million documents. Each logical document is identified by a 'date' field (a business date, not the upload date). I implemented a historicization scheme by storing multiple versions with the same 'date' and, within a date, picking the one with the highest _id (auto-generated on insert, so the most recent).

      Collection unique index: [(date, -1), (_id, -1)]
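
      For context, a minimal PyMongo sketch of how such an index might be declared (connection string, database and collection names are my assumptions, not part of this report):

      from pymongo import MongoClient, DESCENDING

      # Hypothetical connection details and names.
      client = MongoClient("mongodb://localhost:27017")
      coll = client["mydb"]["history"]

      # Compound unique index: within each business date, versions sort
      # newest _id (i.e. latest insert) first.
      coll.create_index([("date", DESCENDING), ("_id", DESCENDING)], unique=True)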

      A typical aggregation pipeline where I fetch the latest version of any number of documents:

      [
          {"$match": {"date": DATE_MATCH}},
          {"$sort": SON([("date", -1), ("_id", -1)])},
          {"$group": {"_id": {"date": "$date"}, "doc": {"$first": "$$ROOT"}}},
          {"$replaceRoot": {"newRoot": "$doc"}},
      ]
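
      For completeness, running this pipeline with PyMongo would look roughly like this (DATE_MATCH is a placeholder value; coll is the collection handle from the sketch above):

      from datetime import datetime
      from bson.son import SON

      DATE_MATCH = datetime(2019, 5, 31)  # hypothetical business date

      pipeline = [
          {"$match": {"date": DATE_MATCH}},
          {"$sort": SON([("date", -1), ("_id", -1)])},
          {"$group": {"_id": {"date": "$date"}, "doc": {"$first": "$$ROOT"}}},
          {"$replaceRoot": {"newRoot": "$doc"}},
      ]
      latest_versions = list(coll.aggregate(pipeline))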

The above works very well - as long as only a handful of documents are matched by the initial $match step.

But then the query becomes "get the latest version (highest _id) of the newest 5 documents (at most 5 dates) older than DATE_MAX", without any constraint on the minimum date:

      [
          {"$match": {"date": {"$lt": DATE_MAX}}},
          {"$sort": SON([("date", -1), ("_id", -1)])},
          {"$group": {"_id": {"date": "$date"}, "doc": {"$first": "$$ROOT"}}},
          {"$replaceRoot": {"newRoot": "$doc"}},
          {"$sort": SON([("date", -1)])},
          {"$limit": 5},
      ]
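
      As an aside, with millions of matched documents the blocking $group can exceed the 100 MB per-stage memory limit; a hedged usage note (pipeline_with_group stands for the six-stage pipeline above):

      # allowDiskUse lets the blocking $group spill to disk instead of
      # erroring once its state exceeds the 100 MB per-stage limit.
      results = list(coll.aggregate(pipeline_with_group, allowDiskUse=True))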

      The above functionally does the job, but it is 100x slower than just

      [
          {"$match": {"date": {"$lt": DATE_MAX}}},
          {"$sort": SON([("date", -1), ("_id", -1)])},
          {"$limit": 5},
      ]

which, however, produces incorrect results, as it returns multiple versions of the same document.

      The problem is that after $match I have thousands, if not millions, of documents potentially returned by the cursor - literally everything from the 1980s onwards, which I do want to pick up if there isn't anything newer available.

One would assume that, just as $sort does, $group automatically notices that its input is already sorted exactly as it needs, grouping the results as they arrive and releasing RAM as soon as the aggregation key changes. This would mean that, as soon as 5 unique dates reach the $limit step, the whole pipeline could be cancelled.
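
      To illustrate the expected behavior, here is a rough client-side Python sketch (not server internals) of a sort-aware group using itertools.groupby on an already-sorted cursor:

      from itertools import groupby, islice

      def first_per_date(sorted_docs, limit):
          # Input must already be sorted by (date desc, _id desc),
          # as the pipeline's $sort guarantees.
          groups = groupby(sorted_docs, key=lambda doc: doc["date"])
          # islice stops consuming the cursor after `limit` distinct
          # dates, so later documents are never fetched.
          return [next(docs) for _, docs in islice(groups, limit)]

      Fed the sorted cursor from the pipeline above, this reads only as many documents as needed to produce 5 groups, which is exactly the early exit that $limit cannot currently trigger through $group.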

That is not the case - the timings clearly show that $group is scanning the whole collection. Note the second $sort: it is necessary because $group returns dates sorted in ascending order, even though the $sort step before it yields them in descending order.
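
      A simple way to reproduce the timing observation from the client (grouped_pipeline and ungrouped_pipeline stand for the six-stage and three-stage pipelines above; coll as before):

      import time

      def time_pipeline(pipeline):
          start = time.perf_counter()
          results = list(coll.aggregate(pipeline))
          return len(results), time.perf_counter() - start

      # The grouped pipeline reads every matched document despite the
      # trailing $limit, while the ungrouped one stops after 5 documents;
      # the wall-clock gap makes the full scan visible.
      print(time_pipeline(grouped_pipeline))
      print(time_pipeline(ungrouped_pipeline))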

       

            Assignee: Eric Sedor (eric.sedor@mongodb.com)
            Reporter: Guido Imperiale (gimperiale)
            Votes: 0
            Watchers: 5
