[SERVER-41891] $group on sorted cursor forces full collection scan Created: 24/Jun/19 Updated: 22/Sep/21 Resolved: 28/Jun/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | 4.0.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Guido Imperiale | Assignee: | Eric Sedor |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
mongodb-osx-x86_64-4.0.6

I have a collection with ~2 million documents. Each document is logically identified by a 'date' field (a business date, not the upload date). I implemented a historicization system by storing multiple documents with the same 'date' field and, within the same date, picking the one with the highest _id (which is always auto-generated on insert).

Collection unique index: [(date, -1), (_id, -1)]

A typical aggregation pipeline, where I fetch the latest version of any number of documents:
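The pipeline itself did not survive the export. Based on the description (pick the highest _id within each date), a pymongo-style reconstruction might look like the sketch below; the stage contents and the `requested_dates` name are assumptions, not the reporter's exact code:

```python
# Hypothetical reconstruction of the reporter's pipeline (the original
# snippet was lost in the export). Documents arrive pre-sorted by the
# unique index [(date, -1), (_id, -1)], so $first picks the highest _id
# for each date.
requested_dates = ["2019-06-24", "2019-06-25"]  # illustrative values

pipeline = [
    {"$match": {"date": {"$in": requested_dates}}},
    {"$sort": {"date": -1, "_id": -1}},
    {"$group": {"_id": "$date", "doc": {"$first": "$$ROOT"}}},
]
```

Such a pipeline would be run with `db.collection.aggregate(pipeline)`; only the shape of the stages matters here.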
The above works very well - as long as only a handful of documents are let through by the initial $match step. But then the requirement became "get the latest version (max _id) of the newest 5 documents (max 5 dates) older than DATE_MAX", without any constraint on the minimum date:
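This pipeline was also lost in the export; from the surrounding text (a $match with only an upper date bound, a $group on date, a second $sort, and a $limit of 5) it plausibly looked like the following sketch, with `DATE_MAX` standing in for the real bound:

```python
DATE_MAX = "2019-06-25"  # illustrative upper bound

# Hypothetical reconstruction: latest version of the newest 5 dates
# strictly older than DATE_MAX, with no lower bound on the date.
pipeline = [
    {"$match": {"date": {"$lt": DATE_MAX}}},
    {"$sort": {"date": -1, "_id": -1}},
    {"$group": {"_id": "$date", "doc": {"$first": "$$ROOT"}}},
    # Second $sort: $group hands back ascending dates, so descending
    # order must be restored before the $limit is applied.
    {"$sort": {"_id": -1}},
    {"$limit": 5},
]
```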
The above does the job functionally, but it is 100x slower than just:
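The faster query is not quoted either; presumably it was the same $match/$sort/$limit with the $group removed. A pure-Python stand-in (sample data invented for illustration) shows why dropping the $group returns duplicates:

```python
DATE_MAX = 4  # illustrative

# Sample documents in the order the [(date, -1), (_id, -1)] index
# would yield them.
docs = [
    {"date": 3, "_id": 32}, {"date": 3, "_id": 31},
    {"date": 2, "_id": 21},
    {"date": 1, "_id": 12}, {"date": 1, "_id": 11},
]

# Stand-in for $match + $sort + $limit 5, with no $group in between:
fast_but_wrong = [d for d in docs if d["date"] < DATE_MAX][:5]
dates = [d["date"] for d in fast_but_wrong]
# 'dates' now holds repeated values: several versions of the same
# logical document (same date, different _id) survive the $limit.
```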
which, however, produces incorrect results, as it returns multiple versions of the same document.

The problem is that after $match I have thousands, if not millions, of documents potentially returned by the cursor - literally everything from the 1980s onwards, which I do want to pick up if there isn't anything newer available. One would assume that, just as $sort does, $group automatically notices when its input is already sorted exactly as it needs, grouping the results as they arrive and releasing RAM as soon as the aggregation key changes. This would mean that as soon as 5 unique dates reach the $limit step, the whole pipeline could be cancelled. That is not the case - the timings clearly show that $group is doing a full scan of the whole collection.

Note the second $sort: it is necessary because $group returns dates sorted in ascending order, even though the $sort step before it yields them in descending order.
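The streaming behaviour the description expects can be sketched in pure Python with `itertools.groupby` (illustrative only; this is not how the server implements $group). Because the input is already sorted, each date's documents are contiguous, so the first document per group is the latest version and iteration can stop as soon as the limit is reached:

```python
from itertools import groupby, islice

def latest_per_date(sorted_docs, limit):
    """Streaming group over input sorted by (date desc, _id desc): emit
    the first document of each contiguous date group, stop after `limit`
    unique dates, and never buffer more than the current group."""
    groups = groupby(sorted_docs, key=lambda d: d["date"])
    return [next(group) for _, group in islice(groups, limit)]

docs = [
    {"date": 3, "_id": 32}, {"date": 3, "_id": 31},
    {"date": 2, "_id": 21},
    {"date": 1, "_id": 12}, {"date": 1, "_id": 11},
]
newest_two = latest_per_date(docs, 2)
```

With a limit of 2, only documents up to the second distinct date need to be read; everything older can be skipped, which is exactly the early exit the reporter expected from $sort + $group + $limit.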
|
| Comments |
| Comment by Eric Sedor [ 28/Jun/19 ] |
|
gimperiale, |
| Comment by Eric Sedor [ 25/Jun/19 ] |
|
Thanks gimperiale - I am going to investigate this point further before deciding whether to close this ticket as a duplicate: "$group is returning dates sorted in ascending order, even if the $sort step before it yields them in descending order." Thanks in advance for your patience. |
| Comment by Guido Imperiale [ 25/Jun/19 ] |
|
Hi, |
| Comment by Eric Sedor [ 25/Jun/19 ] |
|
Hi gimperiale, The potential for optimizing $group over sorted input is tracked in SERVER-4507. I encourage you to watch that ticket for updates. You can also chime in on that ticket to describe your use case and the importance of the improvement for your system. Before closing this ticket, though, I wanted to better understand your unique index on [(date, -1), (_id, -1)]:
|
| Comment by Guido Imperiale [ 25/Jun/19 ] |
|
To clarify: when I wrote
what I meant, more precisely, is that $group ingests every document from the previous stage of the pipeline before it emits a single one.
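That blocking behaviour can be mimicked in a few lines of pure Python (a toy model, not the server's actual implementation): the entire input is drained before the first result is emitted, regardless of any downstream $limit:

```python
def blocking_group(sorted_docs):
    """Toy model of a blocking $group: consume the whole input first,
    keeping one entry per date, and only then start emitting."""
    latest = {}
    consumed = 0
    for d in sorted_docs:                 # the entire input is drained...
        consumed += 1
        latest.setdefault(d["date"], d)   # input is (date, _id) descending,
                                          # so the first doc per date wins
    return list(latest.values()), consumed  # ...before anything is emitted

docs = [
    {"date": 3, "_id": 32}, {"date": 3, "_id": 31},
    {"date": 2, "_id": 21},
    {"date": 1, "_id": 12}, {"date": 1, "_id": 11},
]
results, consumed = blocking_group(docs)
# Even if a later $limit only needs the newest 2 dates, all 5 input
# documents were consumed before the first group was emitted.
```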