The _flushReshardingStateChanges command will stall the coordinator if another thread acquires the critical section after the command's initial check of whether the critical section is held, because onShardVersionMismatch() blocks until the critical section is released.
During a resharding operation, shards also rely on refreshShardVersion() being triggered after a new primary has stepped up so that the DonorStateMachine and RecipientStateMachines learn of a change to the coordinator's state.
The following sequence of events can cause a resharding operation to stall indefinitely waiting for _flushReshardingStateChanges to complete:
- A donor is killed before the coordinator transitions to kBlockingWrites
- The coordinator transitions to kBlockingWrites before a new primary steps up on the donor
- The coordinator tries to inform each DonorStateMachine that it is safe to acquire the critical section via the _flushReshardingStateChanges command
- The new primary on the donor steps up, and both recovery and the _flushReshardingStateChanges command try to refresh the DonorStateMachine
- The _flushReshardingStateChanges thread checks whether the critical section has been acquired, finds that it has not, and calls onShardVersionMismatch()
- The recovery thread also triggers onShardVersionMismatch(), beats the _flushReshardingStateChanges thread, and refreshes the DonorStateMachine which then acquires the critical section
- The _flushReshardingStateChanges thread reaches a second check of whether the critical section is engaged. It is (thanks to the recovery thread), so the _flushReshardingStateChanges thread blocks until the DonorStateMachine releases the critical section
- The DonorStateMachine can't release the critical section until the coordinator transitions to kCommitting/kAborting, and the coordinator cannot make it past _tellAllDonorsToRefresh until the _flushReshardingStateChanges command completes.
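
The interleaving above can be replayed with a small toy model. This is only a sketch, not the server code: the function name flush_stalls and the step labels are hypothetical, and the model assumes that the command skips the refresh when its first check finds the critical section already held. It uses no real threads; a schedule is just the order in which the steps run.

```python
def flush_stalls(schedule):
    """Replay one interleaving and report whether the flush thread ends up blocked.

    Hypothetical step labels (not real server identifiers):
      "flush:first_check"  - the command's initial critical-section check
      "recovery:refresh"   - the recovery thread refreshes the donor, which
                             then acquires the critical section
      "flush:second_check" - the check inside onShardVersionMismatch()
    """
    critical_section_held = False
    passed_first_check = False

    for step in schedule:
        if step == "flush:first_check":
            if critical_section_held:
                # Assumption: the command does not call
                # onShardVersionMismatch() when the section is already held.
                return False
            passed_first_check = True
        elif step == "recovery:refresh":
            critical_section_held = True
        elif step == "flush:second_check":
            if not passed_first_check:
                raise ValueError("second check cannot run before the first")
            if critical_section_held:
                # Blocked until release. The release needs the coordinator to
                # advance, and the coordinator is waiting on this command.
                return True
            # No race: the flush thread's own refresh lets the donor acquire
            # the critical section, and the command returns.
            critical_section_held = True
        else:
            raise ValueError(f"unknown step: {step}")
    return False


# Recovery wins the race between the two checks: the command stalls.
racy = ["flush:first_check", "recovery:refresh", "flush:second_check"]
# The flush thread finishes both checks before recovery runs: no stall.
flush_first = ["flush:first_check", "flush:second_check", "recovery:refresh"]
# Recovery finishes before the command's first check: no stall.
recovery_first = ["recovery:refresh", "flush:first_check"]

for name, schedule in [("racy", racy),
                       ("flush_first", flush_first),
                       ("recovery_first", recovery_first)]:
    print(name, "stalls" if flush_stalls(schedule) else "completes")
```

Only the schedule in which the recovery refresh lands between the two checks stalls, which matches the window described above.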