[SERVER-58081] _flushReshardingStateChange from coordinator races with donor shard acquiring critical section, stalling the resharding operation Created: 24/Jun/21 Updated: 29/Oct/23 Resolved: 28/Jul/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 5.0.3, 5.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Haley Connelly | Assignee: | Haley Connelly |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | PM-234-M3, PM-234-T-lifecycle |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v5.0 |
| Sprint: | Sharding 2021-07-12, Sharding 2021-07-26, Sharding 2021-08-09 |
| Participants: | |
| Story Points: | 2 |
| Description |
|
The _flushReshardingStateChange command will stall the coordinator if another thread acquires the critical section after the command's initial check of whether the critical section is held, because onShardVersionMismatch() blocks until the critical section is released. During a resharding operation, shards also rely on refreshShardVersion() being triggered after a new primary has stepped up so that the DonorStateMachine and RecipientStateMachine learn of a change to the coordinator's state. The following events can cause a resharding operation to stall indefinitely waiting for _flushReshardingStateChange to complete:
|
| Comments |
| Comment by Vivian Ge (Inactive) [ 06/Oct/21 ] |
|
Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you! |
| Comment by Githook User [ 19/Aug/21 ] |
|
Author: {'name': 'Haley Connelly', 'email': 'haley.connelly@mongodb.com', 'username': 'haleyConnelly'} Message: (cherry picked from commit 2ca1f733d619809d1e712860fc0070f0cc8d81f5) |
| Comment by Githook User [ 28/Jul/21 ] |
|
Author: {'name': 'Haley Connelly', 'email': 'haley.connelly@mongodb.com', 'username': 'haleyConnelly'}Message: |
| Comment by Max Hirschhorn [ 07/Jul/21 ] |
|
Checking whether the critical section is currently held is race-prone. As outlined in the sequence of events in the ticket's description, the thread executing DonorStateMachine::run() may be just about to acquire the critical section when the check runs. I feel like there are two possible approaches, but only one of them is viable:
My proposal would be to implement option (b) by changing the _flushReshardingStateChange command to the following:
Note that the CatalogCacheLoader::waitForCollectionFlush() line, which was copied from _flushRoutingTableCacheUpdatesWithWriteConcern, isn't necessary for the _flushReshardingStateChange command. The only dependency on the config.cache.chunks collection having been written locally is for the temporary resharding collection on the donor shards, and DonorStateMachine already handles that itself. |
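The check-then-act race discussed in this comment can be illustrated with a small deterministic model. This is only a sketch and not the server's code: `ToyCriticalSection`, `FlushOutcome`, and `flushWithUpFrontCheck` are hypothetical names, the interleaving is injected through a boolean parameter rather than produced by real threads, and the behavior when the check finds the critical section held (skipping the refresh) is an assumption made for illustration.

```cpp
#include <cassert>

// Toy stand-in for a collection's critical section (hypothetical; not the
// server's class).
struct ToyCriticalSection {
    bool held = false;
};

enum class FlushOutcome {
    kSkippedRefresh,  // the up-front check saw the critical section held (assumed behavior)
    kRefreshed,       // the refresh ran without having to wait
    kBlocked          // the refresh would wait until the critical section is released
};

// Models a flush that checks first and refreshes second.
// `donorAcquiresAfterCheck` injects the interleaving in which the donor thread
// acquires the critical section between the two steps.
FlushOutcome flushWithUpFrontCheck(ToyCriticalSection& cs, bool donorAcquiresAfterCheck) {
    // Step 1: the check. Its answer is only valid at this instant.
    if (cs.held) {
        return FlushOutcome::kSkippedRefresh;
    }

    // Window between check and act: another thread may run here.
    if (donorAcquiresAfterCheck) {
        cs.held = true;
    }

    // Step 2: the act. A refresh that waits on the critical section now
    // stalls even though the check in step 1 passed.
    return cs.held ? FlushOutcome::kBlocked : FlushOutcome::kRefreshed;
}
```

In this model, the up-front check only narrows the window; it does not close it. Any interleaving in which the acquisition lands between step 1 and step 2 still produces `kBlocked`, which corresponds to the stall described in the ticket.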