[SERVER-67457] Resharding operation aborted in the midst of contacting participants may stall on config server primary indefinitely
Created: 22/Jun/22 | Updated: 29/Oct/23 | Resolved: 06/Jul/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 5.0.0, 6.0.0-rc10 |
| Fix Version/s: | 6.0.1, 5.0.10, 6.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Max Hirschhorn | Assignee: | Abdul Qadeer |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-nyc-subteam1 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||
| Backport Requested: | v6.0, v5.0 |
| Sprint: | Sharding 2022-06-27, Sharding 2022-07-11 | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Linked BF Score: | 5 | ||||||||||||||||||||
| Story Points: | 3 | ||||||||||||||||||||
| Description |
|
After the resharding coordinator has transitioned into the "preparing-to-donate" state, it is required to establish the DonorStateMachines and RecipientStateMachines on the participant shards before proceeding with the remainder of the resharding operation. This synchronization ensures every participant shard is aware of the resharding operation and will react accordingly to a subsequent _flushReshardingStateChange, _shardsvrCommitReshardCollection, or _shardsvrAbortReshardCollection command. The logic prior to _tellAllParticipantsReshardingStarted() is flawed because it is possible for the resharding coordinator to
Shards receive the _flushRoutingTableCacheUpdatesWithWriteConcern command and observe a state of config.collections for the source collection and temporary resharding collection entries prior to the replica set transaction from step (1). In particular, the recipient shards would not observe the config.collections entry for the temporary resharding collection at all and would treat the namespace as unsharded. The shards therefore skip constructing the DonorStateMachine and RecipientStateMachine objects but respond ok:1 to the resharding coordinator as if they had. The resharding coordinator continues to wait for the participant shards to update their state within the config.reshardingOperations document to "done" and signal they've finished their cleanup for the resharding operation. However, because the participant shards never constructed the DonorStateMachine and RecipientStateMachine objects, they will also never perform that update on the config.reshardingOperations document. This leads the resharding coordinator to wait indefinitely on this future. Manual intervention on the config.reshardingOperations document would be required to unblock the resharding coordinator. No other sharding DDL commands can run on the source collection in the meantime.
|
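The stall described above can be illustrated with a small toy model. This is only a sketch in Python, not MongoDB server code: `ToyShard`, `refresh`, `on_abort`, and `run_abort` are hypothetical names invented for the illustration, and the `visible_entry` argument stands in for whatever config.collections state the shard happens to observe during its refresh.

```python
from dataclasses import dataclass


@dataclass
class ToyShard:
    """Hypothetical stand-in for a participant shard (not MongoDB code)."""
    name: str
    has_state_machine: bool = False
    reported_done: bool = False

    def refresh(self, visible_entry):
        # The shard only builds its state machine if it can see the
        # resharding metadata, but it acknowledges the command either way.
        if visible_entry is not None:
            self.has_state_machine = True
        return {"ok": 1}

    def on_abort(self):
        # Only a shard that built a state machine ever reports "done".
        if self.has_state_machine:
            self.reported_done = True


def run_abort(shards, visible_entry):
    """Contact every shard, abort, and report whether the coordinator unblocks."""
    acks = [shard.refresh(visible_entry) for shard in shards]
    for shard in shards:
        shard.on_abort()
    coordinator_unblocked = all(shard.reported_done for shard in shards)
    return acks, coordinator_unblocked


# Shards observe a state from before the coordinator's write: every ack is
# ok:1, yet no shard will ever report "done".
stale = run_abort([ToyShard("shard0"), ToyShard("shard1")], visible_entry=None)

# Shards observe the coordinator's write: the abort completes normally.
fresh = run_abort(
    [ToyShard("shard0"), ToyShard("shard1")],
    visible_entry={"state": "preparing-to-donate"},
)
```

In the `stale` case the acknowledgements are indistinguishable from the `fresh` case, which is why the coordinator in this model has no signal that it is waiting on an update that will never arrive.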
| Comments |
| Comment by Githook User [ 07/Jul/22 ] |
|
Author: Abdul Qadeer <abdul.qadeer@mongodb.com> (zorro786) Message: |
| Comment by Githook User [ 06/Jul/22 ] |
|
Author: Abdul Qadeer <abdul.qadeer@mongodb.com> (zorro786) Message: |
| Comment by Max Hirschhorn [ 22/Jun/22 ] |
|
A solution here would be to move the _waitForMajority() into the _tellAllParticipantsReshardingStarted() logic so it is part of the onCompletion() and also to use the stepdown token rather than the abort token for the wait. |
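One possible reading of this proposal, sketched as a toy model in Python rather than actual server code: if the majority wait is gated on the abort token, an abort cancels the wait and participants may be contacted before the coordinator's write is majority-committed; gating the wait on the stepdown token means only a stepdown interrupts it. `CancelToken` and `tell_participants` are hypothetical names, and the assumption that an abort cancels the wait is an interpretation of the comment, not something the ticket states outright.

```python
from dataclasses import dataclass


@dataclass
class CancelToken:
    """Hypothetical cancellation flag; not the server's token type."""
    cancelled: bool = False


def tell_participants(wait_token, stepdown_token):
    """Model the majority wait running inside the contact step.

    wait_token decides whether the majority wait is interrupted;
    stepdown_token decides whether this node still acts as coordinator.
    """
    if stepdown_token.cancelled:
        # A stepped-down primary stops; a new primary would resume the work.
        return "not contacted (stepped down)"
    majority_committed = not wait_token.cancelled
    if majority_committed:
        return "contacted after majority commit"
    return "contacted before majority commit"


# Scenario from the ticket title: the operation is aborted, with no stepdown.
abort_token = CancelToken(cancelled=True)
stepdown_token = CancelToken(cancelled=False)

# Wait gated on the abort token: the wait is cut short by the abort.
old = tell_participants(wait_token=abort_token, stepdown_token=stepdown_token)

# Wait gated on the stepdown token: the abort no longer interrupts the wait.
new = tell_participants(wait_token=stepdown_token, stepdown_token=stepdown_token)
```

Under these assumptions, `old` ends with participants contacted before the majority commit, while `new` contacts them only after it.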