[SERVER-58081]  _flushReshardingStateChange from coordinator races with donor shard acquiring critical section, stalling the resharding operation Created: 24/Jun/21  Updated: 29/Oct/23  Resolved: 28/Jul/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.0.3, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Haley Connelly Assignee: Haley Connelly
Resolution: Fixed Votes: 0
Labels: PM-234-M3, PM-234-T-lifecycle
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-58343 Re-enable reshard_collection_failover... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v5.0
Sprint: Sharding 2021-07-12, Sharding 2021-07-26, Sharding 2021-08-09
Participants:
Story Points: 2

 Description   

The _flushReshardingStateChanges command will stall the coordinator if the critical section is acquired by another thread after the command's initial check for whether the critical section is held, since onShardVersionMismatch() blocks until the critical section is released.

During a resharding operation, shards also rely on refreshShardVersion() being triggered after a new primary has stepped up so that the DonorStateMachine and RecipientStateMachine can learn of a change to the coordinator's state.

The following events can cause a resharding operation to stall indefinitely waiting for _flushReshardingStateChanges to complete:

  • A donor is killed before the coordinator transitions to kBlockingWrites
  • The coordinator transitions to kBlockingWrites before a new primary steps up on the donor
  • The coordinator tries to inform each DonorStateMachine that it is safe to acquire the critical section via the _flushReshardingStateChanges cmd
  • The new primary on the donor steps up, and both the recovery thread and the _flushReshardingStateChanges cmd try to refresh the DonorStateMachine
  • The _flushReshardingStateChanges thread checks whether the critical section has been acquired; it hasn't yet, so the thread calls onShardVersionMismatch()
  • The recovery thread also triggers onShardVersionMismatch(), beats the _flushReshardingStateChanges thread, and refreshes the DonorStateMachine, which then acquires the critical section
  • The _flushReshardingStateChanges thread reaches a second check for whether the critical section is engaged; it is (thanks to the recovery thread), so the thread blocks until the DonorStateMachine releases the critical section
  • The DonorStateMachine can't release the critical section until the coordinator transitions to kCommitting/kAborting, and the coordinator cannot make it past _tellAllDonorsToRefresh until the _flushReshardingStateChanges command completes, so neither side can make progress (a self-contained sketch of this interleaving follows the list).
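
The stall can be reproduced in miniature outside the server. The following self-contained C++ sketch only models the interleaving described above; the mutex, flag, and thread names are illustrative stand-ins, not the actual server code, and the timed wait stands in for a wait that in the server has no timeout.

    #include <chrono>
    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <thread>

    // Illustrative stand-ins for the donor's critical section state.
    std::mutex m;
    std::condition_variable cv;
    bool criticalSectionHeld = false;

    int main() {
        using namespace std::chrono_literals;

        // _flushReshardingStateChanges thread: the initial check sees the
        // critical section as not yet held.
        bool heldAtFirstCheck;
        {
            std::lock_guard<std::mutex> lk(m);
            heldAtFirstCheck = criticalSectionHeld;
        }

        // Recovery thread wins the race: it refreshes the DonorStateMachine,
        // which then acquires the critical section.
        std::thread recovery([] {
            std::lock_guard<std::mutex> lk(m);
            criticalSectionHeld = true;
        });
        recovery.join();

        if (!heldAtFirstCheck) {
            // Stand-in for onShardVersionMismatch(): wait for the critical
            // section to be released. Nothing in this model releases it,
            // because the donor is waiting on the coordinator and the
            // coordinator is waiting on this command, so the wait times out
            // instead of hanging forever.
            std::unique_lock<std::mutex> lk(m);
            if (!cv.wait_for(lk, 2s, [] { return !criticalSectionHeld; })) {
                std::cout << "stalled: critical section never released\n";
            }
        }
        return 0;
    }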


 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 19/Aug/21 ]

Author:

{'name': 'Haley Connelly', 'email': 'haley.connelly@mongodb.com', 'username': 'haleyConnelly'}

Message: SERVER-58081 Make _flushReshardingStateChange return instead of blocking if the critical section is held

(cherry picked from commit 2ca1f733d619809d1e712860fc0070f0cc8d81f5)
Branch: v5.0
https://github.com/mongodb/mongo/commit/98937ba21a64a127d6238d641fb676bdef797cf4

Comment by Githook User [ 28/Jul/21 ]

Author:

{'name': 'Haley Connelly', 'email': 'haley.connelly@mongodb.com', 'username': 'haleyConnelly'}

Message: SERVER-58081 Make _flushReshardingStateChange return instead of blocking if the critical section is held
Branch: master
https://github.com/mongodb/mongo/commit/2ca1f733d619809d1e712860fc0070f0cc8d81f5
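
The commit title above suggests the shape of the fix: if the critical section is already held, the command returns rather than performing a refresh that would block on its release. The snippet below is only a minimal, self-contained sketch of that behavior under that assumption; it is not the actual patch, and all names in it are illustrative.

    #include <iostream>

    // Illustrative state and helpers; not the actual patch.
    bool criticalSectionHeld = true;

    void blockingRefresh() {
        // Pre-fix behavior: the refresh would wait here until the critical
        // section is released.
    }

    void flushReshardingStateChange() {
        if (criticalSectionHeld) {
            // The donor is already acting on the state change, so return
            // instead of blocking on the critical section being released.
            std::cout << "critical section held: returning without blocking\n";
            return;
        }
        blockingRefresh();
    }

    int main() {
        flushReshardingStateChange();
        return 0;
    }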

Comment by Max Hirschhorn [ 07/Jul/21 ]

Checking whether the critical section is currently held is race prone. As outlined by the sequence of events in the ticket's description, it is possible for the thread executing DonorStateMachine::run() to be about to acquire the critical section. I feel like there are two possible approaches, but only one of them is considered viable:

  • (a) Make it possible for the RecoverRefreshThread to refresh the shard version while the critical section is held. This would make it safe for a shard to receive the _flushReshardingStateChange command twice even after processing its effects once before. However, having the RecoverRefreshThread wait on the critical section being released is intentional, to avoid mongos exhausting its StaleConfig exception retries before a chunk migration commits. We would need to change commands to themselves wait on the critical section being released, instead of waiting indirectly through the shard version refresh not having completed yet.
  • (b) Change the _flushReshardingStateChange command so it asynchronously schedules a shard version refresh and doesn't require the resharding coordinator to wait for the shard version refresh to complete.

My proposal would be to implement option (b) by changing the _flushReshardingStateChange command to the following (a rough, self-contained sketch follows the list):

  1. Call onShardVersionMismatch() in a task scheduled on an arbitrary executor pool. The arbitrary executor pool is used intentionally to avoid exhausting the threads available in the fixed executor. Note that the thread in the arbitrary executor pool will still block until the critical section is released.
  2. Wait for the donor and/or recipient state documents to have been inserted locally. This would be done by exposing a new SharedSemiFuture<void> on DonorStateMachine and RecipientStateMachine. These futures would already be fulfilled when the DonorStateMachine and RecipientStateMachine are recovered on step-up.
  3. Insert a no-op oplog entry. This ensures in combination with waiting for majority write concern that the resharding coordinator cannot have run the _flushReshardingStateChange command on a stale primary.
  4. Wait for majority write concern. Note that this happens automatically because of the w:majority write concern that the ReshardingCoordinator already attaches to the _flushReshardingStateChange command.
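
A rough, self-contained C++ sketch of steps 1 and 2 (the detached thread, promise, and names are stand-ins, not server code): the command leaves the potentially blocking refresh running in the background and only waits on the "state document inserted" future before acknowledging.

    #include <chrono>
    #include <future>
    #include <iostream>
    #include <thread>

    int main() {
        using namespace std::chrono_literals;

        // Step 2's future: fulfilled once the donor/recipient state document
        // has been inserted locally (modeled here with a promise).
        std::promise<void> stateDocInserted;
        std::shared_future<void> inserted = stateDocInserted.get_future().share();

        // Step 1: schedule the shard version refresh on a separate thread. It
        // may still block until the critical section is released (modeled as
        // a sleep), but the command no longer waits on it.
        std::thread refresh([] { std::this_thread::sleep_for(10s); });
        refresh.detach();

        // The donor's state machine recovers locally and fulfils the future.
        stateDocInserted.set_value();

        // The command waits only on the state document, then acknowledges.
        inserted.wait();
        std::cout << "ack to coordinator; refresh continues in the background\n";

        // Steps 3 and 4 (no-op oplog entry, majority write concern) are
        // omitted from this sketch.
        return 0;
    }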

Note that the CatalogCacheLoader::waitForCollectionFlush() line, which was copied from _flushRoutingTableCacheUpdatesWithWriteConcern, isn't necessary for the _flushReshardingStateChange command. The only dependency on the config.cache.chunks collection having been written locally is for the temporary resharding collection on the donor shards, and that is already handled by DonorStateMachine itself.
