[SERVER-41891] $group on sorted cursor forces full collection scan Created: 24/Jun/19  Updated: 22/Sep/21  Resolved: 28/Jun/19

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 4.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Guido Imperiale Assignee: Eric Sedor
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-4507 aggregation: optimize $group to take... Backlog
duplicates SERVER-24799 $group aggregation command should mai... Closed
Operating System: ALL
Participants:

 Description   

mongodb-osx-x86_64-4.0.6

I have a collection with ~2 million documents. Each document is uniquely identified by a 'date' field (business date, not upload date). I implemented a historicization system by storing multiple documents with the same 'date' field and, within the same date, picking up the one with the highest _id (which is always auto-generated on insert).

Collection unique index: [(date, -1), (_id, -1)]

A typical aggregation pipeline where I fetch the latest version of any number of documents:

[
     {"$match": {"date": DATE_MATCH}},
     {"$sort": SON([("date", -1), ("_id", -1)])},
     {"$group": {"_id": {"date": "$date"}, "doc": {"$first": "$$ROOT"}}},
     {"$replaceRoot": {"newRoot": "$doc"}},
 ]

The above works very well - as long as only a handful of documents are filtered in by the initial $match step.

But then the query becomes "get the latest version (max _id) of the newest 5 documents (max 5 dates)" older than DATE_MAX, with no constraint on the minimum date:

[
     {"$match": {"date": {"$lt": DATE_MAX}}},
     {"$sort": SON([("date", -1), ("_id", -1)])},
     {"$group": {"_id": {"date": "$date"}, "doc": {"$first": "$$ROOT"}}},
     {"$replaceRoot": {"newRoot": "$doc"}},
     {"$sort": SON([("date", -1)])},
     {"$limit": 5},
 ]

The above functionally does the job, but it is 100x slower than just

[
     {"$match": {"date": {"$lt": DATE_MAX}}},
     {"$sort": SON([("date", -1), ("_id", -1)])},
     {"$limit": 5},
 ]

which, however, produces incorrect results, as it returns multiple versions of the same document.

The problem is that after $match I have thousands, if not millions, of documents potentially returned by the cursor - literally everything from the 1980s onwards, which I do want to pick up if there isn't anything newer available.

One would assume that, much like $sort can take advantage of existing order, $group would automatically notice that its input is already sorted exactly the way it needs, grouping documents as they arrive and releasing RAM as soon as the aggregation key changes. This would mean that as soon as 5 unique dates reached the $limit step, the whole pipeline could be cancelled.

That is not the case - the timings clearly show that $group is doing a full scan of the whole collection. Note the second $sort; that is necessary because $group is returning dates sorted in ascending order, even if the $sort step before it yields them in descending order.
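The streaming behaviour requested here can also be emulated client-side by collapsing the sorted cursor lazily in Python. The sketch below is an editorial illustration of that idea (the function name and sample documents are hypothetical; a real cursor from `collection.find(...).sort(...)` would take the place of the list):

```python
from itertools import groupby, islice
from operator import itemgetter

def latest_per_date(sorted_docs, limit):
    """Collapse a (date DESC, _id DESC)-sorted stream to the first
    (i.e. newest-_id) document per date, stopping after `limit` dates."""
    # groupby is lazy: it buffers at most one group at a time, so memory
    # stays bounded and iteration stops as soon as `limit` distinct
    # dates have been emitted.
    firsts = (next(group) for _, group in groupby(sorted_docs, key=itemgetter("date")))
    return list(islice(firsts, limit))

# Simulated sorted cursor output (date DESC, _id DESC):
docs = [
    {"date": "2019-06-24", "_id": 7},
    {"date": "2019-06-24", "_id": 3},
    {"date": "2019-06-23", "_id": 6},
    {"date": "2019-06-20", "_id": 5},
    {"date": "2019-06-20", "_id": 2},
    {"date": "2019-06-19", "_id": 4},
]

print(latest_per_date(docs, 3))
# keeps _id 7, 6 and 5: the newest version of each of the 3 newest dates
```

Because the generator never reads past the `limit`-th distinct date, the cursor (and hence the server-side scan) can be abandoned early, which is exactly the behaviour the ticket asks $group + $limit to provide natively.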

 



 Comments   
Comment by Eric Sedor [ 28/Jun/19 ]

gimperiale, SERVER-24799 covers a previous request for $group to respect input order. It is something we have decided not to do. So, I am closing this as a duplicate of SERVER-4507.

Comment by Eric Sedor [ 25/Jun/19 ]

Thanks gimperiale--I am going to investigate this point more before deciding to close this ticket as a duplicate:

"$group is returning dates sorted in ascending order, even if the $sort step before it yields them in descending order."

Thanks in advance for your patience.

Comment by Guido Imperiale [ 25/Jun/19 ]

Hi,
Yes, you are correct on both points

Comment by Eric Sedor [ 25/Jun/19 ]

Hi gimperiale,

The potential for optimizing $group for sorted input is tracked in SERVER-4507. I encourage you to watch that ticket for updates. You can also chime in on that ticket to describe your use-case and the importance of the improvement for your system.

Before closing this ticket though, I wanted to better understand your unique index on "[(date, -1), (id_, -1)]":

  • I just want to verify my assumption that you meant _id, to be sure. If I am assuming incorrectly then I may have additional guidance to provide.
  • Do I understand correctly that the uniqueness criterion spans both date and _id, so that the collection can support historic documents in which the same business date is stored in separate documents?
Comment by Guido Imperiale [ 25/Jun/19 ]

To clarify: when I wrote

 $group is doing a full scan of the whole collection

what I more precisely meant is that $group is ingesting all documents from the previous stage of the pipeline before it emits a single one.
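This blocking behaviour can be pictured with a toy model (an editorial sketch, not mongod's actual implementation; all names are hypothetical): a hash-based $group with a $first accumulator must exhaust its entire input before emitting anything, so no downstream $limit can terminate the scan early.

```python
def blocking_group_first(docs, key):
    """Toy model of a blocking $group/$first: every input document is
    consumed into a hash table before anything is emitted."""
    groups = {}
    for doc in docs:  # must exhaust the entire input first
        groups.setdefault(doc[key], doc)  # $first: keep the first doc seen per key
    return list(groups.values())  # only now can results flow downstream

docs = [
    {"date": "2019-06-24", "_id": 7},
    {"date": "2019-06-24", "_id": 3},
    {"date": "2019-06-23", "_id": 6},
]

print(blocking_group_first(docs, "date"))
# keeps _id 7 for 2019-06-24 and _id 6 for 2019-06-23
```

Even though the input here is already sorted on the grouping key, this model cannot exploit that fact: the loop runs to completion regardless, which matches the observed "ingest everything before emitting a single document" behaviour.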

Generated at Thu Feb 08 04:58:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.