[SERVER-55852] Shards first acquire LockManager locks before reacting to abortReshardCollection Created: 07/Apr/21  Updated: 06/Dec/22  Resolved: 24/May/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: [DO NOT USE] Backlog - Sharding NYC
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-56638 Fix flushReshardingStateChanges criti... Closed
Related
is related to SERVER-53258 [Resharding] Reject writes in opObser... Closed
is related to SERVER-54474 Introduce the _flushReshardingStateCh... Closed
Assigned Teams:
Sharding NYC
Operating System: ALL
Participants:
Story Points: 2

 Description   

The abortReshardCollection command triggers a shard to refresh using the _flushReshardingStateChange command. The _flushReshardingStateChange command first acquires a database and collection lock to check whether the critical section is held, and acquires these locks again as part of onShardVersionMismatch() if the critical section wasn't held. These lock acquisitions can block if the shard has a strong lock request enqueued. However, writes being stalled by that strong lock may be the very reason the user ran abortReshardCollection in the first place. Because abortReshardCollection then waits for the strong lock request to be granted and released, an end-user would additionally need to run killOp on operations from internal (system) threads for the server to make forward progress, which undermines the utility of the abortReshardCollection command.
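
To make the queueing behaviour concrete, here is a minimal, standalone C++ sketch (a toy FairSharedLock, not the server's LockManager) of FIFO lock granting: once a strong (exclusive) request is enqueued, the refresh's later weak (shared/intent) request waits behind it until the blocking user operation finishes or is killed.

{code:cpp}
// Toy model only: FairSharedLock stands in for the LockManager's FIFO
// grant policy; it is not MongoDB code.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

class FairSharedLock {
public:
    void lockShared() {
        std::unique_lock<std::mutex> lk(_m);
        // New shared requests queue behind any pending exclusive request.
        _cv.wait(lk, [&] { return !_exclusivePending && !_exclusiveGranted; });
        ++_sharedHolders;
    }
    void unlockShared() {
        std::unique_lock<std::mutex> lk(_m);
        --_sharedHolders;
        _cv.notify_all();
    }
    void lockExclusive() {
        std::unique_lock<std::mutex> lk(_m);
        _exclusivePending = true;
        _cv.wait(lk, [&] { return _sharedHolders == 0 && !_exclusiveGranted; });
        _exclusivePending = false;
        _exclusiveGranted = true;
    }
    void unlockExclusive() {
        std::unique_lock<std::mutex> lk(_m);
        _exclusiveGranted = false;
        _cv.notify_all();
    }

private:
    std::mutex _m;
    std::condition_variable _cv;
    int _sharedHolders = 0;
    bool _exclusivePending = false;
    bool _exclusiveGranted = false;
};

int main() {
    FairSharedLock collLock;
    collLock.lockShared();  // a long-running user operation holds a weak lock

    // The critical section enqueues a strong (exclusive) lock request...
    std::thread strong([&] { collLock.lockExclusive(); collLock.unlockExclusive(); });

    // ...so the refresh triggered by abortReshardCollection queues behind it.
    std::thread refresh([&] { collLock.lockShared(); collLock.unlockShared(); });

    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    std::cout << "abort path is stuck until the user operation goes away\n";
    collLock.unlockShared();  // simulates killOp on the user operation
    strong.join();
    refresh.join();
}
{code}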

We should instead have an explicit {_shardsvrAbortReshardCollection: <reshardingUUID>} command that interacts with the DonorStateMachines and RecipientStateMachines directly. Note that the coordinator's decision is irreversible, so 'pushing' the decision out to the participant shards, as opposed to having them 'pull' it via a shard version refresh, is still safe in the presence of delayed messages.
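
A hedged, standalone sketch of the proposed command shape (the registry names and types here are illustrative stand-ins, not the real server API): the handler resolves whichever state machines exist locally for the resharding UUID and aborts them directly, without taking database or collection locks.

{code:cpp}
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct DonorStateMachine { void abort() { std::cout << "donor abort requested\n"; } };
struct RecipientStateMachine { void abort() { std::cout << "recipient abort requested\n"; } };

// Toy registries standing in for the shard's registered resharding machines.
std::map<std::string, std::shared_ptr<DonorStateMachine>> donorRegistry;
std::map<std::string, std::shared_ptr<RecipientStateMachine>> recipientRegistry;

void shardsvrAbortReshardCollection(const std::string& reshardingUUID) {
    // A shard may host a donor, a recipient, both, or neither (a delayed
    // message can arrive after the machines already finished); none of these
    // cases is an error, and no collection locks are taken.
    if (auto it = donorRegistry.find(reshardingUUID); it != donorRegistry.end())
        it->second->abort();
    if (auto it = recipientRegistry.find(reshardingUUID); it != recipientRegistry.end())
        it->second->abort();
}

int main() {
    donorRegistry["uuid-1"] = std::make_shared<DonorStateMachine>();
    shardsvrAbortReshardCollection("uuid-1");  // aborts the donor
    shardsvrAbortReshardCollection("uuid-2");  // no machines: a no-op, not an error
}
{code}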



 Comments   
Comment by Max Hirschhorn [ 24/May/21 ]

The new _shardsvrAbortReshardCollection command no longer acquires locks before canceling the abortToken for the DonorStateMachines and RecipientStateMachines.

Comment by Max Hirschhorn [ 05/May/21 ]

Cross-posting my comment from SERVER-56638 because the two issues can be addressed with the same code changes.

I think to address both SERVER-56638 and SERVER-55852 we should introduce an explicit _shardsvrAbortReshardCollection command that the resharding coordinator sends to abort the resharding operation (rather than the generic _flushReshardingStateChange). The {_shardsvrAbortReshardCollection: <reshardingUUID>} command would call new DonorStateMachine::abort() and RecipientStateMachine::abort() methods to cancel their _abortSource. In particular, the resharding coordinator actively aborting a resharding operation would not depend in any way on the ability to complete a shard version refresh.
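
A minimal sketch of what abort() could look like, assuming it only cancels a cancellation source that the machine's future chain observes (CancellationSource here is a toy stand-in, not the server's type): repeated calls are harmless and no locks are involved.

{code:cpp}
#include <atomic>
#include <iostream>

class CancellationSource {
public:
    void cancel() { _cancelled.store(true); }
    bool isCancelled() const { return _cancelled.load(); }
private:
    std::atomic<bool> _cancelled{false};
};

class DonorStateMachine {
public:
    // Safe to call any number of times; no locks are acquired.
    void abort() { _abortSource.cancel(); }
    bool abortRequested() const { return _abortSource.isCancelled(); }
private:
    CancellationSource _abortSource;
};

int main() {
    DonorStateMachine donor;
    donor.abort();
    donor.abort();  // duplicate delivery of the abort signal is a no-op
    std::cout << std::boolalpha << donor.abortRequested() << "\n";  // true
}
{code}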

  • The _shardsvrAbortReshardCollection command must check for both a DonorStateMachine and a RecipientStateMachine to call abort() on.
  • It isn't an error if _shardsvrAbortReshardCollection finds neither a DonorStateMachine nor a RecipientStateMachine. A delayed message can mean the donor and/or recipient has already exited by the time _shardsvrAbortReshardCollection is processed. Similarly, the abort() method must tolerate being called multiple times.
  • The _shardsvrAbortReshardCollection command must wait for the donor and recipient to have transitioned to kDone and for that local write to have become majority-committed (a sketch follows this list). (It doesn't need to wait for the donor/recipient to have updated the config.reshardingOperations collection on the config server, though.) The reasoning here is two-fold:
    1. Performing a write and waiting for it to be majority-committed after having called the abort() method(s) ensures the resharding coordinator has contacted a current primary.
    2. Step-up will continue to trigger a shard version refresh and may discover the resharding operation has been aborted. However, the race described in SERVER-56638 will continue to exist for that scenario and may lead to the shard version refresh becoming stuck. Requiring the resharding coordinator to wait for the donor/recipient to have persisted its acknowledgment of the abort signal guarantees that the resharding coordinator would be separately sending (and continually retrying) the _shardsvrAbortReshardCollection command in any case where the shard version refresh could get stuck.
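
Below is a hedged, standalone sketch of the waiting behaviour from the third bullet above (names such as doneMajorityCommitted are illustrative, not the real server API): the command replies OK only after the local machine has reached kDone and that write is treated as majority-committed, so a coordinator that keeps retrying the command never depends on a shard version refresh completing.

{code:cpp}
// Toy model only; the promise/future pair stands in for waiting on the
// donor/recipient state document write becoming majority-committed.
#include <future>
#include <iostream>

enum class State { kPreparing, kAborting, kDone };

struct RecipientStateMachine {
    State state = State::kPreparing;
    std::promise<void> doneMajorityCommitted;

    // Cancel outstanding work, persist the kDone state locally, and signal
    // once that write would be majority-committed.
    void abort() {
        state = State::kDone;
        doneMajorityCommitted.set_value();
    }
};

// Command handler: trigger the abort, then block until kDone is durable
// before replying OK to the resharding coordinator.
void shardsvrAbortReshardCollection(RecipientStateMachine& machine) {
    auto durable = machine.doneMajorityCommitted.get_future();
    machine.abort();
    durable.wait();
}

int main() {
    RecipientStateMachine recipient;
    // The coordinator would keep resending this command until it gets OK,
    // which also proves it reached a current primary (item 1 above).
    shardsvrAbortReshardCollection(recipient);
    std::cout << "shard acknowledged the abort at kDone\n";
}
{code}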

https://jira.mongodb.org/browse/SERVER-56638?focusedCommentId=3756789&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-3756789
