Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.0.3, 5.1.0-rc0
Affects Version/s: None
Component/s: Sharding
Labels:
- PM-234-M3
- PM-234-T-lifecycle

Backwards Compatibility:
Fully Compatible
Backport Requested:

v5.0
Sprint:
Sharding 2021-07-12, Sharding 2021-07-26, Sharding 2021-08-09
Story Points:
2
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

The _flushReshardingStateChanges command will stall the coordinator if another thread acquires the critical section after the command's initial check of whether the critical section is held, because onShardVersionMismatch() blocks until the critical section is released.
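The check-then-block window described above can be illustrated with a small toy model. This is only a sketch and not the server's code: CriticalSection, flush_state_changes, and the between_check_and_refresh hook are hypothetical names, the timeout stands in for what would be an indefinite wait, and skipping the refresh when the section is already held is an assumption about what the initial check does.

```python
import threading


class CriticalSection:
    """Toy stand-in for a collection critical section (hypothetical, not a MongoDB class)."""

    def __init__(self):
        self._released = threading.Event()
        self._released.set()  # the section starts out released

    def acquire(self):
        self._released.clear()

    def release(self):
        self._released.set()

    def is_held(self):
        return not self._released.is_set()

    def wait_for_release(self, timeout):
        # True if the section is (or becomes) released within `timeout` seconds.
        return self._released.wait(timeout)


def flush_state_changes(cs, between_check_and_refresh, timeout=0.2):
    """Models the flush command: an initial check, then a refresh that blocks while the section is held."""
    # Assumption: the command skips the refresh if the section is already held.
    if cs.is_held():
        return "skipped: critical section already held"

    # The window in which another thread (e.g. step-up recovery) can acquire the section.
    between_check_and_refresh()

    # Stand-in for onShardVersionMismatch(): waits for the section to be released.
    # The real wait has no timeout; one is used here so the sketch terminates.
    if cs.wait_for_release(timeout):
        return "refreshed"
    return "stalled: blocked on critical section"
```

Passing `cs.acquire` as the hook simulates the other thread winning the race between the check and the refresh; passing a no-op models the uncontended case.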

During a resharding operation, shards also rely on refreshShardVersion() being triggered after a new primary steps up, so that the DonorStateMachine and RecipientStateMachine instances learn of a change to the coordinator's state.

The following events can cause a resharding operation to stall indefinitely waiting for _flushReshardingStateChanges to complete:

1. A donor is killed before the coordinator transitions to kBlockingWrites.
2. The coordinator transitions to kBlockingWrites before a new primary steps up on the donor.
3. The coordinator tries to inform each DonorStateMachine, via the _flushReshardingStateChanges command, that it is safe to acquire the critical section.
4. The new primary on the donor steps up, and both recovery and the _flushReshardingStateChanges command try to refresh the DonorStateMachine.
5. The _flushReshardingStateChanges thread checks whether the critical section has been acquired, finds that it has not, and calls onShardVersionMismatch().
6. The recovery thread also triggers onShardVersionMismatch(), beats the _flushReshardingStateChanges thread, and refreshes the DonorStateMachine, which then acquires the critical section.
7. The _flushReshardingStateChanges thread reaches a second check of whether the critical section is engaged. It is (thanks to the recovery thread), so the thread blocks until the DonorStateMachine releases the critical section.
8. The DonorStateMachine can't release the critical section until the coordinator transitions to kCommitting/kAborting, and the coordinator cannot make it past _tellAllDonorsToRefresh until the _flushReshardingStateChanges command completes.
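The circular dependency in the final events can be written down as a wait-for graph. The following is an illustrative sketch only: the node labels paraphrase the description, and find_cycle is a hypothetical helper rather than anything in the server.

```python
def find_cycle(waits_for, start):
    """Follow single 'waits for' edges from `start`; return the cycle as a list, or None."""
    path = []
    seen = set()
    node = start
    while node is not None and node not in seen:
        seen.add(node)
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ended, so nothing is waiting forever
    # `node` was already visited: the cycle starts at its first occurrence.
    return path[path.index(node):] + [node]


# Each key is blocked until its value makes progress (labels paraphrase the ticket).
waits_for = {
    "_flushReshardingStateChanges thread": "DonorStateMachine releases critical section",
    "DonorStateMachine releases critical section": "coordinator reaches kCommitting/kAborting",
    "coordinator reaches kCommitting/kAborting": "coordinator passes _tellAllDonorsToRefresh",
    "coordinator passes _tellAllDonorsToRefresh": "_flushReshardingStateChanges thread",
}

cycle = find_cycle(waits_for, "_flushReshardingStateChanges thread")
print(" -> ".join(cycle))
```

Removing any one edge, for example by letting the flush command return without waiting on the critical section, leaves the chain acyclic in this model and find_cycle returns None.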

causes

SERVER-89893 Change executor used by _flushReshardingStateChange from arbitrary to fixed

Closed

is depended on by

SERVER-58343 Re-enable reshard_collection_failover_shutdown_basic.js

Closed

related to

SERVER-107952 Fix resharding hang when FlushReshardingStateChangeCmd fails

Closed

Assignee: Haley Connelly
Reporter: Haley Connelly
Participants: Githook User, Haley Connelly, Max Hirschhorn, Vivian Ge
Votes: 0
Watchers: 3

Created: Jun 24 2021 10:36:51 PM UTC
Updated: Jul 23 2025 03:29:55 PM UTC
Resolved: Jul 28 2021 05:07:45 PM UTC
Confidence Status Last Update: 09/Jul/21 8:05 PM
