Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.2.0, 5.0.5, 5.1.1
Affects Version/s: 5.0.2
Component/s: Sharding
Labels:
- LFR-BUG

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:
v5.1, v5.0
Sprint:
Sharding 2021-09-06, Sharding 2021-10-04, Sharding 2021-10-18, Sharding 2021-11-01, Sharding 2021-11-15
Linked BF Score:
151
Story Points:
1
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Background

If a stepdown occurs while a resharding operation is in progress, the opCtx performing the commit can be killed before the opCtx handling the command is. As a result, ShardsvrCommitReshardCollectionCommand, for instance, can reach its final uassert even though the Resharding(Recipient/Donor)Service was interrupted and could not finish committing.

The config server primary does not retry on the error returned by the shard. Instead, the error causes the config server primary to fassert in the ReshardingCoordinatorService.

Problem

Lock-free reads allow reads to run concurrently with any stepdown that is in progress.
As a result, database reads that relied on _alwaysInterruptAtStepDownOrUp to verify that no stepdown was in progress, such as the one in ShardsvrCommitReshardCollectionCommand, no longer work.
This is because even if _alwaysInterruptAtStepDownOrUp is set to true, a database read does not acquire the RSTL lock, so the read can complete before the opCtx is eventually interrupted.
Before lock-free reads, the database read would wait for the in-progress stepdown to complete, and hence would wait for the opCtx to be interrupted by that stepdown. We want to replicate this behavior in our fix.
The uassert currently returned does not cause the ReshardingCoordinatorService to retry the commit/abort command until completion. Instead, it leads to a fatal assertion.

Proposed Solution

Do a no-op write using doNoopWrite before performing the sanity check that verifies the state document has been deleted. This makes sure that the operation has not been interrupted before it asserts that no state documents are left.
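
The ordering can be illustrated with a small standalone C++ model. This is only a sketch under stated assumptions, not the server code: ToyOpCtx, ToyShard, toyNoopWrite, lockFreeStateDocExists, commitCheck, and Outcome are all hypothetical names invented for illustration. The model assumes that a write has to go through the RSTL and therefore observes an in-progress stepdown, while a lock-free read does not.

```cpp
#include <stdexcept>

// Toy stand-in for an operation context (NOT the server's OperationContext).
struct ToyOpCtx {
    bool killed = false;  // set once a stepdown has interrupted this operation

    void checkForInterrupt() const {
        if (killed) {
            throw std::runtime_error("interrupted by stepdown");
        }
    }
};

// Toy shard state (hypothetical).
struct ToyShard {
    bool stepdownInProgress = false;
    bool stateDocExists = false;  // true while the state document is not yet deleted
};

// Models a lock-free read: it skips the RSTL, so it never notices the stepdown.
bool lockFreeStateDocExists(const ToyShard& shard) {
    return shard.stateDocExists;
}

// Models a no-op write. Assumption of this sketch: the write goes through the
// RSTL, so an in-progress stepdown interrupts the operation here.
void toyNoopWrite(ToyOpCtx& opCtx, const ToyShard& shard) {
    if (shard.stepdownInProgress) {
        opCtx.killed = true;
    }
    opCtx.checkForInterrupt();
}

enum class Outcome { Ok, SanityCheckFailed, Interrupted };

// Runs the final sanity check, optionally preceded by the no-op write.
Outcome commitCheck(ToyOpCtx& opCtx, const ToyShard& shard, bool noopWriteFirst) {
    try {
        if (noopWriteFirst) {
            toyNoopWrite(opCtx, shard);
        }
        if (lockFreeStateDocExists(shard)) {
            return Outcome::SanityCheckFailed;  // models the final uassert firing
        }
        return Outcome::Ok;
    } catch (const std::runtime_error&) {
        return Outcome::Interrupted;  // the interruption surfaces before the check
    }
}
```

In this model, with a stepdown in progress and the state document still present, commitCheck(opCtx, shard, false) returns Outcome::SanityCheckFailed, whereas commitCheck(opCtx, shard, true) returns Outcome::Interrupted because the no-op write surfaces the interruption before the sanity check runs.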

is related to

SERVER-59800 Add a flag to the lock-free collection helpers to optionally take the RSTL intent lock

Closed

related to

SERVER-66353 Add documentation of concurrency rules for OperationContext::setAlwaysInterruptAtStepDownOrUp

Closed

Assignee: Brett Nawrocki
Reporter: Luis Osta (Inactive)
Participants: Brett Nawrocki, Dianna Hohensee, Githook User, Luis Osta
Votes: 0
Watchers: 4

Created: Sep 01 2021 04:47:12 PM UTC
Updated: Oct 29 2023 09:48:59 PM UTC
Resolved: Nov 10 2021 10:22:09 PM UTC
Confidence Status Last Update: 25/Oct/21 5:16 PM
