  1. Core Server
  2. SERVER-58081

_flushReshardingStateChange from coordinator races with donor shard acquiring critical section, stalling the resharding operation

    • Fully Compatible
    • v5.0
    • Sharding 2021-07-12, Sharding 2021-07-26, Sharding 2021-08-09
    • 2

      The _flushReshardingStateChanges command will stall the coordinator if another thread acquires the critical section after the command's initial check of whether the critical section is held, because onShardVersionMismatch() blocks until the critical section is released.

      During a resharding operation, shards also rely on refreshShardVersion() being triggered after a new primary has stepped up, so that the DonorStateMachine and RecipientStateMachines learn of a change to the coordinator's state.

      The following sequence of events can cause a resharding operation to stall indefinitely waiting for _flushReshardingStateChanges to complete:

      • A donor is killed before the coordinator transitions to kBlockingWrites
      • The coordinator transitions to kBlockingWrites before a new primary steps up on the donor
      • The coordinator tries to inform each DonorStateMachine that it is safe to acquire the critical section via the _flushReshardingStateChanges command
      • The new primary on the donor steps up, and both recovery and the _flushReshardingStateChanges command try to refresh the DonorStateMachine
      • The _flushReshardingStateChanges thread checks whether the critical section has been acquired, finds that it has not, and calls onShardVersionMismatch()
      • The recovery thread also triggers onShardVersionMismatch(), beats the _flushReshardingStateChanges thread, and refreshes the DonorStateMachine, which then acquires the critical section
      • The _flushReshardingStateChanges thread reaches a second check of whether the critical section is engaged; it now is (thanks to the recovery thread), so the thread blocks until the DonorStateMachine releases the critical section
      • The DonorStateMachine can't release the critical section until the coordinator transitions to kCommitting/kAborting, and the coordinator cannot make it past _tellAllDonorsToRefresh until the _flushReshardingStateChanges command completes, so neither side can make progress.
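
      The interleaving above can be reproduced in a small toy model. This is only a sketch in Python, not server code: DonorModel, simulate, and every method name are hypothetical stand-ins, the early return when the first check finds the critical section already held is an assumption, and the wait uses a timeout so the script terminates instead of hanging the way the real operation would.

```python
import threading


class DonorModel:
    """Toy model of a donor shard. Not MongoDB code; all names are illustrative."""

    def __init__(self):
        self.lock = threading.Lock()
        self.critical_section_held = False
        # Would be set on kCommitting/kAborting; nothing sets it in this scenario.
        self.released = threading.Event()

    def is_critical_section_held(self):
        with self.lock:
            return self.critical_section_held

    def refresh_state_machine(self):
        # The donor learns of kBlockingWrites and acquires the critical section.
        with self.lock:
            self.critical_section_held = True

    def on_shard_version_mismatch(self, timeout):
        # Second check: block while the critical section is engaged.
        if self.is_critical_section_held():
            return self.released.wait(timeout)  # False means "still blocked"
        self.refresh_state_machine()
        return True


def simulate(timeout=0.1):
    donor = DonorModel()
    first_check_done = threading.Event()
    recovery_done = threading.Event()
    result = {}

    def flush_thread():
        # Initial check: the critical section has not been acquired yet.
        result["held_at_first_check"] = donor.is_critical_section_held()
        first_check_done.set()
        if result["held_at_first_check"]:
            return  # assumed behavior: nothing left to do
        recovery_done.wait()  # force the interleaving in which recovery wins
        result["flush_completed"] = donor.on_shard_version_mismatch(timeout)

    def recovery_thread():
        first_check_done.wait()
        donor.on_shard_version_mismatch(timeout)  # refreshes, acquires
        recovery_done.set()

    threads = [threading.Thread(target=flush_thread),
               threading.Thread(target=recovery_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print(simulate())
```

      In this model the flush thread passes its first check, loses the race, and then times out waiting for a release that nobody can signal, so the script prints {'held_at_first_check': False, 'flush_completed': False}. The timeout stands in for the indefinite stall described in the last step.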

            Assignee:
            haley.connelly@mongodb.com Haley Connelly
            Reporter:
            haley.connelly@mongodb.com Haley Connelly