Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.2.25, 7.1.1, 7.2.0-rc0, 5.0.22, 7.0.3, 4.4.26, 6.0.12
Affects Version/s: 4.2.25, 7.0.1, 6.0.10, 5.0.21, 7.2.0-rc0, 7.1.0
Component/s: None
Labels:
- balancer-round-perf

Assigned Teams:

Catalog and Routing
Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:

v7.1, v7.0, v6.0, v5.0, v4.4, v4.2
Sprint:
Sharding EMEA 2023-10-16
Case:
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Bug description

During routing table refresh, we create an updated ChunkMap from an existing one (copy on write). It is important that during the creation of the new ChunkMap the existing one remain untouched and valid.

The current update algorithm is affected by a bug that can cause a vector of the original ChunkMap to be erased.

This happens in ChunkMap::_mergeAndCommitUpdatedChunkVector, where we std::move the chunkInfo pointers from the old vector to the new one.
At that point the old vector has not been copied, so it is still shared with other ChunkMap instances. To preserve its integrity, we should copy the pointers instead of moving them.
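
The difference between moving and copying the pointers can be sketched as follows. This is a minimal illustration, not the server's code: `ChunkInfo`, `ChunkVector`, `SharedChunkVector`, `mergeByMoving`, `mergeByCopying` and `countNullChunks` are hypothetical stand-ins, and the sketch assumes the chunk pointers are `std::shared_ptr` values held in a vector that several owners share.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; not the server's actual types.
struct ChunkInfo {
    std::string minKey;
};
using ChunkVector = std::vector<std::shared_ptr<ChunkInfo>>;
using SharedChunkVector = std::shared_ptr<ChunkVector>;

// Buggy pattern: the source vector is still shared with other owners,
// yet its elements are moved out, leaving null pointers behind.
ChunkVector mergeByMoving(const SharedChunkVector& oldVector) {
    ChunkVector merged;
    for (auto& chunk : *oldVector) {
        merged.push_back(std::move(chunk));  // the moved-from shared_ptr becomes null
    }
    return merged;
}

// Fixed pattern: copying only bumps the reference count of each chunk,
// so the shared source vector stays intact for its other owners.
ChunkVector mergeByCopying(const SharedChunkVector& oldVector) {
    ChunkVector merged;
    for (const auto& chunk : *oldVector) {
        merged.push_back(chunk);
    }
    return merged;
}

// Counts the entries that no longer point at a chunk.
std::size_t countNullChunks(const ChunkVector& chunks) {
    std::size_t nulls = 0;
    for (const auto& chunk : chunks) {
        if (!chunk) {
            ++nulls;
        }
    }
    return nulls;
}
```

In this sketch, a reader still holding the shared vector after `mergeByMoving` sees only null entries, while after `mergeByCopying` every entry is still valid.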

Conditions to trigger the bug

Several conditions need to apply in order to trigger this bug:

At least one merge chunk operation must have happened between one routing table refresh and the subsequent one.
The merge chunk operation needs to happen on the last ChunkVector of the ChunkMap (i.e. it needs to be toward the end of the RoutingTable).
The merge operation needs to reduce the size of the last ChunkVector to less than half of the configured max chunk vector size.

Additionally, for this bug to cause any harm, the original RoutingTable needs to be accessed after the refreshed one is constructed. This usually happens with long-lasting requests or with a very high frequency of quick requests.

Affected versions

[ 7.1.0-rc0, 7.2.0 ]
[ 7.0.1, 7.0.2]
[ 6.0.10, 6.0.11]
[ 5.0.21]
[ 4.4.25]

Remediations

Chunk merges are a prerequisite for hitting this bug, so the way to avoid triggering it is to stop all chunk merge activity and restart all the binaries in the cluster (both mongod and mongos).

Version >= `7.0`

Disable auto-merger:
Use the sh.disableAutoMerger() shell helper or directly update the "config.settings" collection:

// Disable the auto-merger by upserting its settings document.
db.getSiblingDB("config").settings.updateOne(
    {_id: 'automerge'},
    {$set: {enabled: false}},
    {upsert: true, writeConcern: {w: 'majority'}}
);

Stop defragmentations for all collections
Stop performing manual chunk merges.
Restart all binaries
- All mongod and mongos processes

Version `6.0`

Stop defragmentations for all collections
Stop performing manual chunk merges.
Restart all binaries
- All mongod and mongos processes

Version <= `5.0`

In these versions, the balancer does not perform any automatic chunk merges, thus the only users that can be affected and need to take the remediation steps are the ones that executed at least one manual chunk merge.

Stop performing manual chunk merges.
Restart all binaries
- All mongod and mongos processes

is caused by

SERVER-71627 Refreshed cached collection route info will severely block all client request when a cluster with 1 million chunks

Closed

Assignee: Tommaso Tocci
Reporter: Tommaso Tocci
Participants: Githook User, Tommaso Tocci
Votes: 0
Watchers: 13

Created: Oct 08 2023 02:59:25 PM UTC
Updated: Sep 19 2024 03:36:54 PM UTC
Resolved: Oct 09 2023 08:23:34 AM UTC
