[SERVER-70395] Slot-Based Engine too aggressively uses disk for $group and is slow

Created: 10/Oct/22  Updated: 29/Oct/23  Resolved: 31/Jan/23

| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 6.3.0-rc0, 6.0.5 |
| Type: | Bug |
| Priority: | Critical - P2 |
| Reporter: | Steve Micallef |
| Assignee: | David Storch |
| Resolution: | Fixed |
| Votes: | 2 |
| Labels: | pm2697-m3 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v6.2, v6.0 |
| Sprint: | QE 2023-02-06 |
| Participants: | |
| Case: | (copied to CRM) |
| Story Points: | 10 |

Description
Please refer to the discussion here: https://www.mongodb.com/community/forums/t/addtoset-slower-hanging-on-large-collections-since-mongodb-6/192395/3

In short, there appear to be two issues with the SBE:

1. The slot-based execution engine (SBE) spills $group to disk in cases where the classic engine does not.
2. When SBE does spill, the query runs dramatically slower than under the classic engine, to the point of appearing to hang.

The only solution was to force the use of the classic query engine, which doesn't seem like a desirable approach.
Comments

Comment by David Storch [ 15/Feb/23 ]

I've applied each of the five commits individually to the 6.0 branch. Each cherry-picked commit also contains the necessary merge conflict resolution.
Comment by Githook User [ 15/Feb/23 ]

Author: David Storch <david.storch@mongodb.com> (dstorch)
Message: The new algorithm spills the entire hash table to a […]
(cherry picked from commit 355aadd8ff9d5599da2983990f87c7ec300a972d)
Comment by Githook User [ 15/Feb/23 ]

Author: David Storch <david.storch@mongodb.com> (dstorch)
Message: […]
(cherry picked from commit efb394a3e7b3133148181e50f86edddedbc588d9)
Comment by Githook User [ 15/Feb/23 ]

Author: David Storch <david.storch@mongodb.com> (dstorch)
Message: […]
(cherry picked from commit ed83b1244ec4d89c909e5766893c06de091fabb6)
Comment by Githook User [ 15/Feb/23 ]

Author: David Storch <david.storch@mongodb.com> (dstorch)
Message: This patch handles $min, $max, $first, and $last. The […]
(cherry picked from commit e2d6db6b489a9f4514c7d66462d6f3b0d1835bd4)
Comment by Githook User [ 15/Feb/23 ]

Author: David Storch <david.storch@mongodb.com> (dstorch)
Message: […]
(cherry picked from commit 0d9b3912cee0897af35ec529a687d886f2b5a06d)
Comment by Githook User [ 31/Jan/23 ]

Author: David Storch <david.storch@mongodb.com> (dstorch)
Message: The new algorithm spills the entire hash table to a […]
Comment by Githook User [ 10/Jan/23 ]

Author: David Storch <david.storch@mongodb.com> (dstorch)
Message: […]
Comment by James Hippler (Inactive) [ 09/Jan/23 ]

Thanks david.storch@mongodb.com for the clarification. Unless christopher.harris@mongodb.com says otherwise, I'll continue preparing the release of 6.0.4-rc0 tomorrow as scheduled. Thanks again!
Comment by David Storch [ 06/Jan/23 ]

Hi james.hippler@mongodb.com! Answers to your questions below:

I don't think it's a strict 6.0.4 requirement. It's a high-priority and fairly severe issue, but unfortunately the technical solution to the issue is relatively complex. I've merged three commits so far towards fixing the issue in the master branch and have a fourth in code review which I need to return to now that I am back from my holiday break. There will be a fifth capstone commit which actually solves the issue. I haven't typed this fifth patch up yet, but I had a working version in my original prototype branch from when we were reviewing design alternatives. After the fifth commit lands, we need to decide on the strategy for backporting the fix and complete the backports. This is all to say that I still need more time to resolve the issue, and we probably shouldn't hold up 6.0.4-rc0 for it. Copying christopher.harris@mongodb.com (the query team product manager familiar with this issue) in case he has a different opinion.

As I described above, that is not possible.

Thanks and let me know if you have any other questions.
Comment by James Hippler (Inactive) [ 06/Jan/23 ]

We're scheduled to release 6.0.4-rc0 on Tuesday, January 10th. I'm going through any outstanding items in the 6.0 Release Board for status updates. I have a few questions regarding this SERVER ticket.

Considering that the issue was first reported in October 2022 and 6.0.3 was since released without this change, it is my assumption that this is not a release blocker (but please correct me if I'm wrong).
Comment by David Storch [ 16/Dec/22 ]

I wanted to provide a quick status update. This ticket is once again marked as "In Code Review", since I have a pull request available internally which implements support in the SBE stage builders for the last batch of accumulators ($sum, $avg, $stdDevPop, and $stdDevSamp). Once this is merged to the master branch, there will be a final commit which actually changes the HashAggStage to use the new spilling algorithm that we agreed upon and that I prototyped in November.
Comment by Githook User [ 08/Dec/22 ]

Author: David Storch <david.storch@mongodb.com> (dstorch)
Message: […]
Comment by David Storch [ 28/Nov/22 ]

Moving this back to in-progress now that the second commit has landed and I'm actively working on the third. There is still a bit more work to do in sbe_stage_builder_accumulator.cpp in preparation for the actual HashAggStage changes.
Comment by Githook User [ 28/Nov/22 ]

Author: David Storch <david.storch@mongodb.com> (dstorch)
Message: This patch handles $min, $max, $first, and $last. The […]
Comment by David Storch [ 28/Nov/22 ]

Hi cast.away.debugger@gmail.com! Thanks for the additional information, and our apologies that you are experiencing this issue. We are definitely aware of the severity of this issue and are currently working on implementing the fix.

As you can see in the comments above, I've committed the first part of the fix to the master branch. I have three more patches planned which together will constitute the fix. There is a decent amount of technical complexity because we need to re-implement the new slot-based execution (SBE) engine's spill-to-disk behavior for $group. We haven't discussed backport yet internally, but I believe that we will aim to provide a fix in the 6.2 and 6.0 branches as well as the master branch.

As mentioned in my comment above as well as your comment, the workaround is to disable SBE. This can be done on the 6.0 branch by setting the internalQueryForceClassicEngine parameter to true. On 6.2, the same thing is instead achieved by configuring internalQueryFrameworkControl="forceClassicEngine".

Best,
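For concreteness, a minimal pymongo sketch of applying this workaround at runtime (the connection string is an assumption; the parameter can equally be set on the command line or in a config file):

```python
from pymongo import MongoClient

# Assumes a mongod on localhost and privileges to run setParameter.
client = MongoClient("mongodb://localhost:27017")

# MongoDB 6.0: force the classic query engine.
client.admin.command({"setParameter": 1, "internalQueryForceClassicEngine": True})

# MongoDB 6.2+: the equivalent knob is internalQueryFrameworkControl.
# client.admin.command({"setParameter": 1, "internalQueryFrameworkControl": "forceClassicEngine"})
```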
Comment by Cast Away [ 23/Nov/22 ]

Just a user comment to recognize the severity of the issue. Using MongoDB server v6.1.0 with the latest `pymongo` driver, I ran the following pipeline against a collection that has a little over 3 million documents.

Using default settings, where I later discovered `internalQueryFrameworkControl` was set to `tryBonsai`, this query still hadn't finished after 52 minutes, at which point I stopped it. With `internalQueryFrameworkControl` set to `forceClassicEngine`, the query completed in about 40 seconds.
Comment by David Storch [ 21/Nov/22 ]

The commit above consists of some preliminary work. I still have at least two separate patches planned in order to complete the implementation.
Comment by Githook User [ 18/Nov/22 ]

Author: David Storch <david.storch@mongodb.com> (dstorch)
Message: […]
Comment by David Storch [ 17/Nov/22 ]

We've agreed on a path forward in the draft PR. Moving this back to in progress while I work on the final implementation.
Comment by David Storch [ 15/Nov/22 ]

Marking as "In Code Review" while we evaluate my draft PR.
Comment by David Storch [ 25/Oct/22 ]

I confirmed using perf record that the performance problem relates to the way that spilling to disk is implemented in SBE's hash aggregation operator. The Linux perf profile shows ~46% of samples in sbe::readFromRecordStore, for example, which is involved in the spilling implementation. This is responsible for fetching and deserializing a record from WiredTiger. 20% is in the MaterializedRow destructor and another 22% is in sbe::copyValue() (which I need to dig into further to see where this may be stemming from; it is potentially an unnecessary MaterializedRow copy). This indicates that mihai.andrei@mongodb.com's hypothesis appears to be correct. Quoting him:

> Each time we deal with a new record – and that record belongs to a group which is resident in the RecordStore and not the in-memory hash table – we retrieve the binary blob from the RecordStore, deserialize it into an SBE value, execute the bytecode for the agg function, reserialize the result and write it into the RecordStore.

This workload involves ArraySet values which grow to 100,000 elements. Therefore, we are repeatedly constructing and destroying very large SBE ArraySet objects. I actually don't think disk I/O is the real problem, since WiredTiger is likely to keep the data for the spill table in memory. We're just spending a ton of time doing serialization/deserialization and creating/destroying ArraySet.

It seems to me that we should strongly consider redesigning the way that SBE hash aggregation spills to disk. I'm going to kick off a discussion internally about alternatives.
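To make the cost concrete, here is a hypothetical Python sketch of the per-record round-trip described above (an illustration only, not the actual C++ implementation): each incoming record for a spilled group deserializes and reserializes the group's entire accumulated state.

```python
import pickle

# Hypothetical stand-in for the on-disk spill table (the RecordStore).
spill_table = {}  # group key -> serialized accumulator state

def process_spilled_record(key, value):
    # 1. Fetch the group's blob and deserialize its entire accumulated state.
    state = pickle.loads(spill_table[key]) if key in spill_table else set()
    # 2. Run the accumulator (an $addToSet-like update) on the one new value.
    state.add(value)
    # 3. Reserialize the entire state and write it back to the spill table.
    spill_table[key] = pickle.dumps(state)

# With ~100,000 values per group, every new record re-serializes an
# ever-growing set, so the total work per group grows quadratically even
# when no physical disk I/O occurs.
```

This is consistent with the eventual fix noted in the commit messages above, which spills the entire hash table rather than round-tripping one record's group state at a time.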
Comment by David Storch [ 24/Oct/22 ]

Repro script

Based on micallst@gmail.com's description of the issue, I wrote the following repro script:
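(The script itself is not preserved in this export; the following pymongo sketch is a hypothetical reconstruction matching the parameters described below — the collection name, field names, and batching are assumptions.)

```python
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017").test.sbe_group_repro
coll.drop()

# 2 million documents across 20 groups; each value is a 60-byte string.
batch = []
for i in range(2_000_000):
    batch.append({"key": i % 20, "value": str(i).zfill(60)})
    if len(batch) == 10_000:
        coll.insert_many(batch)
        batch = []

# 20 groups of ~100,000 strings each: ~120MB of aggregated data, just
# over the 100MB threshold at which $group starts to spill.
pipeline = [{"$group": {"_id": "$key", "values": {"$addToSet": "$value"}}}]
result = list(coll.aggregate(pipeline, allowDiskUse=True))
print([(doc["_id"], len(doc["values"])) for doc in result])
```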
This uses 2 million documents. The query creates 20 groups, each with an array of about 2,000,000 / 20 = 100,000 entries. Each element in the array is a 60-byte string, meaning that each resulting document is about 6,000,000 bytes or around 6MB. Since there are 20 groups, this means that the total aggregated data size is 120MB, just a little bit over the 100MB limit at which we start to spill.

If I start the server with internalQueryFrameworkControl="forceClassicEngine", the query finishes in around 6 seconds. If I start the server with no special parameters, the query uses SBE and takes at least several minutes. (It's still running on my workstation and I have no idea when it might finish.)

Question about difference in when spilling kicks in between classic and SBE

My repro causes the $group to spill for both classic and SBE. I did not reproduce the reported aspect of the behavior in which the query spills under SBE but not under the classic engine.

micallst@gmail.com, could you provide any more insight on this aspect of the problem or how I might reproduce it? The memory footprint calculations for SBE and classic may be a bit different. One hypothesis is that your query was very close to the 100MB threshold for spilling and a small difference in the size estimation put it over the edge when SBE was enabled. Without further information, however, I can't be certain that this was the case.

Workaround

Until this problem is resolved, the workaround for affected users is to disable SBE. On MongoDB 6.0, this can be done by setting internalQueryForceClassicEngine to true as a command line flag, in a config file, or using the setParameter command. Another (potentially inferior) option is to set allowDiskUse=false on any affected queries, or to set allowDiskUseByDefault to false. This will prevent spilling but will cause queries which exceed their memory allotment to fail.
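A minimal pymongo sketch of the per-query option (the collection and pipeline are assumptions carried over from the repro sketch above):

```python
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017").test.sbe_group_repro
pipeline = [{"$group": {"_id": "$key", "values": {"$addToSet": "$value"}}}]

# allowDiskUse=False prevents the $group from spilling at all; if the
# aggregated data exceeds the 100MB memory allotment, the query fails
# with an error instead of slowly spilling to disk.
result = list(coll.aggregate(pipeline, allowDiskUse=False))
```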