[SERVER-70395] Slot-Based Engine too aggressively uses disk for $group and is slow Created: 10/Oct/22  Updated: 29/Oct/23  Resolved: 31/Jan/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.3.0-rc0, 6.0.5

Type: Bug Priority: Critical - P2
Reporter: Steve Micallef Assignee: David Storch
Resolution: Fixed Votes: 2
Labels: pm2697-m3
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-71680 $operationMetrics fails to account fo... Closed
is depended on by SERVER-73311 Report group spill stats in serverSta... Closed
Related
related to SERVER-71684 $operationMetrics accounts for readin... Backlog
related to SERVER-76321 Buffer overrun while deserializing co... Closed
related to SERVER-74291 Investigate whether new SBE HashAggSt... Open
related to SERVER-74293 Improve spilling algorithm for HashLo... Open
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.2, v6.0
Sprint: QE 2023-02-06
Participants:
Case:
Story Points: 10

 Description   

Please refer to the discussion here: https://www.mongodb.com/community/forums/t/addtoset-slower-hanging-on-large-collections-since-mongodb-6/192395/3

In short, there appear to be two issues with the SBE:

  1. It too aggressively resorts to disk for $group processing. In my example above, disk isn't used with the classic engine on 6.0, or on 4.0.
  2. When resorting to disk, it is incredibly slow.

The only solution was to force the use of the classic query engine, which doesn't seem like a desirable approach.



 Comments   
Comment by David Storch [ 15/Feb/23 ]

I've applied each of the five commits individually to the 6.0 branch. Each cherry-picked commit also contains the necessary merge conflict resolution.

Comment by Githook User [ 15/Feb/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-70395 Change spilling for SBE HashAggStage to use a more efficient algorithm

The new algorithm spills the entire hash table to a
RecordStore whenever the memory budget is exceeded. Once all
the input is consumed, it switches to a streaming approach,
merging the partial aggregates recovered from disk.

(cherry picked from commit 355aadd8ff9d5599da2983990f87c7ec300a972d)
Branch: v6.0
https://github.com/mongodb/mongo/commit/eb9bbc53af8813ef7465cc8d1d6b63765576401a
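The spill-then-merge idea in the commit message above can be sketched in a few lines of Python. This is an illustrative analogue only — the function name, the tiny memory budget, and the use of a plain list as the "record store" are assumptions for the sketch, not MongoDB's actual implementation:

```python
# Minimal sketch: hash aggregation ($sum) that spills the *whole* table as a
# sorted run when the budget is exceeded, then streams a merge of all runs.
import heapq
from collections import defaultdict

def hash_agg_sum(records, budget=4):
    """Group-by-key sum with spilling. 'budget' is the max number of distinct
    keys held in memory (deliberately tiny for illustration)."""
    table = defaultdict(int)
    runs = []  # each run: a list of (key, partial_sum) pairs sorted by key

    for key, value in records:
        table[key] += value
        if len(table) > budget:
            # Spill the entire hash table as one sorted run, then start fresh.
            runs.append(sorted(table.items()))
            table.clear()

    if table:
        runs.append(sorted(table.items()))

    # Streaming merge phase: runs are key-sorted, so equal keys are adjacent
    # in the merged stream and their partial sums can be combined on the fly.
    result = {}
    for key, partial in heapq.merge(*runs):
        result[key] = result.get(key, 0) + partial
    return result
```

Note that each spilled group is written once per run and each run is read back exactly once during the merge, instead of being rewritten on every incoming record for that group.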

Comment by Githook User [ 15/Feb/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-70395 Combining partial aggs support for $sum, $avg, and std dev

(cherry picked from commit efb394a3e7b3133148181e50f86edddedbc588d9)
Branch: v6.0
https://github.com/mongodb/mongo/commit/cb24503913347f080d85480683e91ad7282a1697
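For these accumulators, "combining partial aggregates" means merging intermediate states rather than raw values. A hedged Python sketch of the standard parallel-variance style of combine (the `(count, mean, M2)` state layout here is illustrative, not SBE's actual representation):

```python
# Sketch: combinable partial states for $sum / $avg / $stdDevPop.
import math

def partial_state(values):
    """Build a (count, mean, M2) running state for one chunk of input
    (Welford's online algorithm)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

def combine(a, b):
    """Merge two (count, mean, M2) partials into one equivalent state."""
    n1, mean1, m2_1 = a
    n2, mean2, m2_2 = b
    n = n1 + n2
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    m2 = m2_1 + m2_2 + delta * delta * n1 * n2 / n
    return n, mean, m2

def finalize(state):
    """Turn the combined state into the user-visible accumulator results."""
    n, mean, m2 = state
    return {"sum": mean * n, "avg": mean, "stdDevPop": math.sqrt(m2 / n)}
```

The key property is that `combine(partial_state(xs), partial_state(ys))` equals `partial_state(xs + ys)`, which is exactly what the merge phase of the new spilling algorithm relies on.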

Comment by Githook User [ 15/Feb/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-70395 Combining partial aggs support for $addToSet, $push, and $mergeObjects

(cherry picked from commit ed83b1244ec4d89c909e5766893c06de091fabb6)
Branch: v6.0
https://github.com/mongodb/mongo/commit/30ddfe887b07bbde3ded94d0d909fc6c2f716001
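The combine step for these three accumulators is conceptually simpler than for the numeric ones. Illustrative Python analogues (the real logic is generated SBE expressions, not Python):

```python
# Sketch: merging partial aggregates for $addToSet, $push, and $mergeObjects.

def combine_add_to_set(a, b):
    """$addToSet partials are sets; combining them is set union."""
    return a | b

def combine_push(a, b):
    """$push partials are ordered lists; combining concatenates them
    in spill order."""
    return a + b

def combine_merge_objects(a, b):
    """$mergeObjects partials are documents; on a field conflict the
    later partial wins."""
    merged = dict(a)
    merged.update(b)
    return merged
```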

Comment by Githook User [ 15/Feb/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-70395 Make stage builders generate partial agg combining exprs (part 1)

This patch handles $min, $max, $first, and $last. The
remaining accumulators will be implemented as follow-up
work.

(cherry picked from commit e2d6db6b489a9f4514c7d66462d6f3b0d1835bd4)
Branch: v6.0
https://github.com/mongodb/mongo/commit/88e851c06ecfd3182627ec68c2d03ca4de294cab

Comment by Githook User [ 15/Feb/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-70395 De-stage-ify accumulator stage builder

(cherry picked from commit 0d9b3912cee0897af35ec529a687d886f2b5a06d)
Branch: v6.0
https://github.com/mongodb/mongo/commit/cf43ac9fd794b8751931e1494390079937362748

Comment by Githook User [ 31/Jan/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-70395 Change spilling for SBE HashAggStage to use a more efficient algorithm

The new algorithm spills the entire hash table to a
RecordStore whenever the memory budget is exceeded. Once all
the input is consumed, it switches to a streaming approach,
merging the partial aggregates recovered from disk.
Branch: master
https://github.com/mongodb/mongo/commit/355aadd8ff9d5599da2983990f87c7ec300a972d

Comment by Githook User [ 10/Jan/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-70395 Combining partial aggs support for $sum, $avg, and std dev
Branch: master
https://github.com/mongodb/mongo/commit/efb394a3e7b3133148181e50f86edddedbc588d9

Comment by James Hippler (Inactive) [ 09/Jan/23 ]

Thanks david.storch@mongodb.com for the clarification.  Unless christopher.harris@mongodb.com says otherwise, I'll continue preparing the release of 6.0.4-rc0 tomorrow as scheduled.

Thanks again!

Comment by David Storch [ 06/Jan/23 ]

Hi james.hippler@mongodb.com! Answers to your questions below:

Is this ticket a requirement for 6.0.4 or can we build the release candidate binaries without the inclusion of this SERVER ticket change?

I don't think it's a strict 6.0.4 requirement. It's a high priority and fairly severe issue, but unfortunately the technical solution to the issue is relatively complex. I've merged three commits so far towards fixing the issue in the master branch and have a fourth in code review which I need to return to now that I am back from my holiday break. There will be a fifth capstone commit which actually solves the issue. I haven't typed this fifth patch up yet, but I had a working version in my original prototype branch from when we were reviewing design alternatives. After the fifth commit lands, we need to decide on the strategy for backporting the fix and complete the backports. This is all to say that I still need more time to resolve the issue, and we probably shouldn't hold up 6.0.4-rc0 for it. Copying christopher.harris@mongodb.com (the query team product manager familiar with this issue) in case he has a different opinion.

If required, is there any possibility that these changes will be implemented on or before Monday, January 9th?

As I described above, that is not possible.

Thanks and let me know if you have any other questions.

Comment by James Hippler (Inactive) [ 06/Jan/23 ]

david.storch@mongodb.com 

We're scheduled to release 6.0.4-rc0 on Tuesday, January 10th.  I'm going through any outstanding items in the 6.0 Release Board for status updates.  I have a few questions regarding this SERVER Ticket.

  • Is this ticket a requirement for 6.0.4 or can we build the release candidate binaries without the inclusion of this SERVER ticket change?
  • If required, is there any possibility that these changes will be implemented on or before Monday, January 9th?

Considering that the issue was first reported in October 2022 and 6.0.3 was since released without this change, it is my assumption that this is not a release blocker (but please correct me if I'm wrong).

Comment by David Storch [ 16/Dec/22 ]

I wanted to provide a quick status update. This ticket is once again marked as "In Code Review", since I have a pull request available internally which implements support in the SBE stage builders for the last batch of accumulators ($sum, $avg, $stdDevPop, and $stdDevSamp). Once this is merged to the master branch, there will be a final commit which actually changes the HashAggStage to use the new spilling algorithm that we agreed upon and that I prototyped in November.

Comment by Githook User [ 08/Dec/22 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-70395 Combining partial aggs support for $addToSet, $push, and $mergeObjects
Branch: master
https://github.com/mongodb/mongo/commit/ed83b1244ec4d89c909e5766893c06de091fabb6

Comment by David Storch [ 28/Nov/22 ]

Moving this back to in-progress now that the second commit has landed and I'm actively working on the third. There is still a bit more work to do in sbe_stage_builder_accumulator.cpp in preparation for the actual HashAggStage changes.

Comment by Githook User [ 28/Nov/22 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-70395 Make stage builders generate partial agg combining exprs (part 1)

This patch handles $min, $max, $first, and $last. The
remaining accumulators will be implemented as follow-up
work.
Branch: master
https://github.com/mongodb/mongo/commit/e2d6db6b489a9f4514c7d66462d6f3b0d1835bd4

Comment by David Storch [ 28/Nov/22 ]

Hi cast.away.debugger@gmail.com! Thanks for the additional information, and our apologies that you are experiencing this issue. We are definitely aware of the severity of this issue and are currently working on implementing the fix. As you can see in the comments above, I've committed the first part of the fix to the master branch. I have three more patches planned which together will constitute the fix. There is a decent amount of technical complexity because we need to re-implement the new slot-based execution (SBE) engine's spill-to-disk behavior for $group. We haven't discussed backport yet internally, but I believe that we will aim to provide a fix in the 6.2 and 6.0 branches as well as the master branch.

As mentioned in my comment above as well as your comment, the workaround is to disable SBE. This can be done on the 6.0 branch by setting the internalQueryForceClassicEngine parameter to true. On 6.2, the same thing is instead achieved by configuring internalQueryFrameworkControl="forceClassicEngine".
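Since the original report came through pymongo, the two parameter settings above can also be applied at runtime from Python. A sketch against a hypothetical localhost deployment (requires a running server and appropriate privileges, so treat it as a config fragment):

```python
# Sketch: applying the SBE workaround via setParameter (hypothetical host).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# MongoDB 6.0: force the classic engine.
client.admin.command({"setParameter": 1,
                      "internalQueryForceClassicEngine": True})

# MongoDB 6.2: the equivalent knob.
client.admin.command({"setParameter": 1,
                      "internalQueryFrameworkControl": "forceClassicEngine"})
```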

Best,
Dave

Comment by Cast Away [ 23/Nov/22 ]

Just a user comment to recognize the severity of the issue.

Using MongoDB server v6.1.0 with the latest `pymongo` driver I used the following pipeline with a collection that has a little over 3 million documents.

import pprint

pipeline = [
    {
        "$group": {
            "_id": {
                "sourceIPv4Address": "$sourceIPv4Address",
                "destinationIPv4Address": "$destinationIPv4Address",
            },
            "count": {"$count": {}},
            # "silkAppLabels": {"$addToSet": "$silkAppLabel"},
            "minSrcPort": {"$min": "$sourceTransportPort"},
            "maxSrcPort": {"$max": "$sourceTransportPort"},
            "srcPorts": {"$addToSet": "$sourceTransportPort"},
            "minDstPort": {"$min": "$destinationTransportPort"},
            "maxDstPort": {"$max": "$destinationTransportPort"},
            "dstPorts": {"$addToSet": "$destinationTransportPort"}
        }
    },
    {
        "$set":{
            "srcPortUniquesCount":{"$size": "$srcPorts"},
            "srcPorts": "$$REMOVE",
            "dstPortUniquesCount":{"$size": "$dstPorts"},
            "dstPorts": "$$REMOVE"
        }
    },
    {"$sort": {"count": -1}},
    {"$limit": 15},
]
cursor = collection.aggregate(pipeline)
for doc in cursor:
    pprint.pp(doc)

Using default settings, where I later discovered `internalQueryFrameworkControl` was set to `tryBonsai`, this query didn't finish after 52 minutes, when I stopped it.

Using `internalQueryFrameworkControl` set to `forceClassicEngine`, the query completed in about 40 seconds.

Comment by David Storch [ 21/Nov/22 ]

The commit above consists of some preliminary work. I still have at least two separate patches planned in order to complete the implementation.

Comment by Githook User [ 18/Nov/22 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-70395 De-stage-ify accumulator stage builder
Branch: master
https://github.com/mongodb/mongo/commit/0d9b3912cee0897af35ec529a687d886f2b5a06d

Comment by David Storch [ 17/Nov/22 ]

We've agreed on a path forward in the draft PR. Moving this back to in progress while I work on the final implementation.

Comment by David Storch [ 15/Nov/22 ]

Marking as "In Code Review" while we evaluate my draft PR.

Comment by David Storch [ 25/Oct/22 ]

I confirmed using perf record that the performance problem relates to the way that spilling to disk is implemented in SBE's hash aggregation operator. The Linux perf profile shows ~46% of samples in sbe::readFromRecordStore, for example, which is involved in the spilling implementation and is responsible for fetching and deserializing a record from WiredTiger. Another 20% is in the MaterializedRow destructor, and 22% is in sbe::copyValue() (which I need to dig into further to see where this may be stemming from; it is potentially an unnecessary MaterializedRow copy).

This indicates that mihai.andrei@mongodb.com's hypothesis appears to be correct. Quoting him:

In SBE, however, we're spilling to a temporary record store, and on each spill, we read the previously spilled record from disk, update it in memory, then spill it back to the record store and update it. For reducing accumulators (think $sum, $avg, etc.), this is probably fine. For accumulators that never reduce like $push and $addToSet, however, this might result in a lot of unnecessary disk I/O given that we're reading from disk, appending some values to our set/array, and writing it back to disk.

Each time we deal with a new record – and that record belongs to a group which is resident in the RecordStore and not the in-memory hash table – we retrieve the binary blob from the RecordStore, deserialize it into an SBE value, execute the bytecode for the agg function, reserialize the result and write it into the RecordStore. This workload involves ArraySet values which grow to 100,000 elements. Therefore, we are repeatedly constructing and destroying very large SBE ArraySet objects. I actually don't think disk I/O is the real problem, since WiredTiger is likely to keep the data for the spill table in memory. We're just spending a ton of time doing serialization/deserialization and creating/destroying ArraySet.
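The read-modify-write pattern described above can be sketched in Python to make the cost visible. The `record_store` dict and `pickle` serialization here stand in for WiredTiger's RecordStore and SBE value (de)serialization — hypothetical stand-ins, not the real code path:

```python
# Sketch of the *old* per-record spill path for $addToSet: every incoming
# value for a spilled group re-reads, deserializes, updates, reserializes,
# and rewrites the group's entire accumulated set.
import pickle

def old_spill_add_to_set(record_store, group_key, value):
    blob = record_store.get(group_key, pickle.dumps(set()))
    acc = pickle.loads(blob)                     # deserialize the whole set
    acc.add(value)                               # run the agg function once
    record_store[group_key] = pickle.dumps(acc)  # reserialize everything

# For a group that grows to N elements, this performs N full round trips over
# an ever-growing set -- O(N^2) bytes copied -- consistent with a profile
# dominated by deserialization, value copies, and destructor work.
```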

It seems to me that we should strongly consider redesigning the way that SBE hash aggregation spills to disk. I'm going to kick off a discussion internally about alternatives.

Comment by David Storch [ 24/Oct/22 ]

Repro script

Based on micallst@gmail.com's description of the issue, I wrote the following repro script:

/**
 * Script designed to repro the performance problem with $group spilling
 * reported in SERVER-70395.
 */
(function() {
'use strict';
 
// Reporter says their repro involves 1.3 million documents.
const NUM_DOCS = 2000000;
 
// The number of unique values of the field we're grouping on.
const NUM_GROUPS = 20;
 
// The field we're aggregating with $addToSet will consist of random strings of
// this length. The reporter mentions that each value is about 60 bytes.
const SECOND_FIELD_LEN = 60;
 
Random.setRandomSeed();
 
const coll = db.server70395;
const pipeline = [{
    $group:
        {_id: {groupedField: '$firstField'}, list: {$addToSet: '$secondField'}}
}];
 
// Execute 'func' and return the elapsed time in milliseconds.
function time(func) {
    let start = new Date();
    func();
    let end = new Date();
    return end - start;
}
 
 
function prepareCollection() {
    coll.drop();
 
    function makeRandomString() {
        let chars = '0123456789';
        let str = '';
        for (let i = 0; i < SECOND_FIELD_LEN; ++i) {
            str += chars[Random.randInt(chars.length)];
        }
        return str;
    }
 
    const insertBatch = coll.initializeUnorderedBulkOp();
    for (let i = 0; i < NUM_DOCS; ++i) {
        insertBatch.insert(
            {firstField: Random.randInt(NUM_GROUPS), secondField: makeRandomString()});
    }
    assert.commandWorked(insertBatch.execute());
}
 
function runPipeline() {
    coll.aggregate(pipeline).itcount();
}
 
function runExplain(shouldPrint) {
    let explain = coll.explain('executionStats').aggregate(pipeline);
    if (shouldPrint) {
        printjson(explain);
    }
}
 
let prepTime = time(prepareCollection);
print('Preparing collection took (milliseconds): ' + prepTime);
 
let runPipelineTime = time(runPipeline);
print('Running pipeline took (milliseconds): ' + runPipelineTime);
 
let explainTime = time(() => runExplain(true));
print('explaining pipeline took (milliseconds): ' + explainTime);
}());

This uses 2 million documents. The query creates 20 groups, each with an array of about 2,000,000 / 20 = 100,000 entries. Each element in the array is a 60-byte string, meaning that each resulting document is about 6,000,000 bytes, or around 6MB. Since there are 20 groups, the total aggregated data size is 120MB, just a little over the 100MB limit at which we start to spill.
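The sizing arithmetic above can be checked directly (plain Python, constants copied from the repro script):

```python
# Verify the data-size estimate for the repro workload.
NUM_DOCS = 2_000_000
NUM_GROUPS = 20
SECOND_FIELD_LEN = 60  # bytes per array element

entries_per_group = NUM_DOCS // NUM_GROUPS               # 100,000 entries
bytes_per_group = entries_per_group * SECOND_FIELD_LEN   # ~6 MB per group
total_bytes = bytes_per_group * NUM_GROUPS               # ~120 MB aggregated

print(entries_per_group, bytes_per_group, total_bytes)
```

At ~120MB, the aggregated data lands just above the 100MB spill threshold, which is what makes this workload a reliable repro.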

If I start the server with internalQueryFrameworkControl="forceClassicEngine", the query finishes in around 6 seconds. If I start the server with no special parameters, the query uses SBE and takes at least several minutes. (It's still running on my workstation and I have no idea when it might finish.)

Question about difference in when spilling kicks in between classic and SBE

My repro causes the $group to spill for both classic and SBE. I did not reproduce this aspect of the reported behavior:

It too aggressively resorts to disk for $group processing. In my example above, disk isn't used with the classic engine on 6.0, or on 4.0.

micallst@gmail.com could you provide any more insight on this aspect of the problem or how I might reproduce it? The memory footprint calculations for SBE and classic may be a bit different. One hypothesis is that your query was very close to the 100MB threshold for spilling and a small difference in the size estimation put it over the edge when SBE was enabled. Without further information, however, I can't be certain that this was the case.

Workaround

Until this problem is resolved, the workaround for affected users is to disable SBE. On MongoDB 6.0, this can be done by setting internalQueryForceClassicEngine to true as a command line flag, in a config file, or using the setParameter command.

Another (potentially inferior) option is to set allowDiskUse=false on any affected queries, or to set allowDiskUseByDefault to false. This will prevent spilling but will cause queries which exceed their memory allotment to fail.

Generated at Thu Feb 08 06:16:04 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.