  1. Core Server
  2. SERVER-55852

Shards first acquire LockManager locks before reacting to abortReshardCollection

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Sharding
    • Labels:
    • Sharding NYC
    • ALL
    • 2

      The abortReshardCollection command triggers a shard to refresh using the _flushReshardingStateChange command. The _flushReshardingStateChange command first acquires the database and collection locks to check whether the critical section is held and, if it wasn't held, acquires these locks again as part of onShardVersionMismatch(). These lock acquisitions can block if a strong lock request is already enqueued on the shard. However, writes being stalled by that strong lock may be the reason the user ran abortReshardCollection in the first place. Because abortReshardCollection waits for the strong lock request to be granted and then released, an end user would additionally need to run killOp on operations from internal (system) threads for the server to make forward progress, which undermines the utility of the abortReshardCollection command.
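
      To illustrate the blocking described above, here is a minimal sketch of a FIFO-fair lock queue for a single resource. It is a toy model and not MongoDB's LockManager: the class ToyLockQueue, the owner names, and the reduced compatibility table are all assumptions made for illustration. It shows that an intent lock request, although compatible with the lock currently held, still waits once a strong (X) request is queued ahead of it.

```python
from collections import deque

# Simplified compatibility table (assumption): intent and shared modes are
# compatible with each other as listed; X conflicts with every mode.
COMPATIBLE = {
    ("IS", "IS"), ("IS", "IX"), ("IX", "IS"), ("IX", "IX"),
    ("IS", "S"), ("S", "IS"), ("S", "S"),
}


class ToyLockQueue:
    """FIFO-fair lock for one resource. Hypothetical; not MongoDB's LockManager."""

    def __init__(self):
        self.granted = []       # list of (owner, mode) currently holding the lock
        self.pending = deque()  # FIFO queue of (owner, mode) waiting for the lock

    def _compatible_with_granted(self, mode):
        return all((mode, held) in COMPATIBLE for _, held in self.granted)

    def request(self, owner, mode):
        """Return True if granted immediately, False if the request was enqueued."""
        # Fairness: a new request never jumps ahead of one that is already waiting.
        if not self.pending and self._compatible_with_granted(mode):
            self.granted.append((owner, mode))
            return True
        self.pending.append((owner, mode))
        return False

    def release(self, owner):
        """Drop the owner's granted locks, then grant waiters in FIFO order."""
        self.granted = [(o, m) for o, m in self.granted if o != owner]
        while self.pending and self._compatible_with_granted(self.pending[0][1]):
            self.granted.append(self.pending.popleft())


def demo():
    lock = ToyLockQueue()
    events = [
        ("long_write", lock.request("long_write", "IX")),            # holds IX
        ("internal_thread", lock.request("internal_thread", "X")),   # strong lock waits
        ("flush_cmd", lock.request("flush_cmd", "IS")),              # queues behind X
    ]
    return events, lock
```

      In this model the "flush_cmd" request is only granted after "internal_thread" has both acquired and released its X lock, which mirrors why the user would otherwise have to kill the internal operation.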

      We should instead have an explicit {_shardsvrAbortReshardCollection: <reshardingUUID>} command that interacts with the DonorStateMachines and RecipientStateMachines directly. Note that the coordinator's decision is irreversible, so 'pushing' the decision out, as opposed to having the participant shards 'pull' it via a shard version refresh, is still safe in the presence of delayed messages.
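
      As a rough sketch of why a pushed abort tolerates delayed or duplicated messages, the toy participant below treats the abort as idempotent and keyed on the resharding UUID. The class ToyParticipant, the on_abort method, and the state names are hypothetical and are not the actual DonorStateMachine or RecipientStateMachine interfaces. The model takes no locks at all, which is the property the proposal is after.

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass
class ToyParticipant:
    """Hypothetical stand-in for a donor or recipient state machine."""

    resharding_uuid: UUID
    state: str = "running"
    log: list = field(default_factory=list)

    def on_abort(self, resharding_uuid):
        """Handle a pushed abort decision; return True if it applies to this operation."""
        if resharding_uuid != self.resharding_uuid:
            # A delayed message for some other resharding operation is ignored.
            self.log.append("ignored: different operation")
            return False
        if self.state == "aborted":
            # The decision is irreversible, so a duplicate delivery is a no-op.
            self.log.append("ignored: already aborted")
            return True
        self.state = "aborted"
        self.log.append("aborted")
        return True
```

      For example, delivering a stale abort for a different UUID leaves the participant running, while delivering the matching abort twice leaves it in the "aborted" state after the first delivery and changes nothing on the second.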

            backlog-server-sharding-nyc [DO NOT USE] Backlog - Sharding NYC
            max.hirschhorn@mongodb.com Max Hirschhorn