- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
- ALL
- 3
- TBD
During a reshardCollection operation, if a recipient shard steps down while processing the ShardsvrReshardRecipientCloneCommand, before it has persisted the recipient state document and waited for that write to be majority-committed, the command can hang indefinitely.
The hang occurs because the command waits on a future, _transitionedToCreateCollection, that is not fulfilled with an error during the recipient service's mandatory cleanup on stepdown. The command's operation context provides a cancellation token that is intended to allow interruption during shard stepdown. However, the method used to hook up this cancellation, setAlwaysInterruptAtStepDownOrUp_UNSAFE, is unsafe because it does not properly synchronize with the RSTL (replication state transition lock), so the command may not be interrupted as expected if it misses the state transition. As a result, the resharding operation makes no forward progress.
I think one way to resume progress would be to fail over the config server primary, which would cause the command to be retried.
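To illustrate the hang described above, here is a minimal sketch in Python rather than the server's C++. Every name in it (RecipientSketch, StepDownError, wait_for_transition, and the two cleanup methods) is hypothetical and only models the shape of the problem: a waiter blocked on a future stays blocked unless cleanup fulfills that future with an error. It is not the actual resharding recipient code, and it does not model the RSTL race.

```python
from concurrent.futures import Future, TimeoutError as FutureTimeout


class StepDownError(Exception):
    """Hypothetical stand-in for an 'interrupted due to stepdown' error."""


class RecipientSketch:
    """Toy model of a recipient state machine (not MongoDB's real class)."""

    def __init__(self):
        # Plays the role of _transitionedToCreateCollection.
        self.transitioned_to_create_collection = Future()

    def cleanup_without_failing_future(self):
        # Models the reported behavior: cleanup on stepdown leaves the
        # future pending, so anything waiting on it never wakes up.
        pass

    def cleanup_failing_future(self):
        # Models one possible remedy (an assumption, not the confirmed fix):
        # fulfill the pending future with an error during cleanup.
        future = self.transitioned_to_create_collection
        if not future.done():
            future.set_exception(StepDownError("stepped down before transition"))


def wait_for_transition(recipient, timeout):
    """Wait like the clone command would, with a timeout so the demo ends."""
    try:
        recipient.transitioned_to_create_collection.result(timeout=timeout)
        return "ok"
    except FutureTimeout:
        return "still waiting"
    except StepDownError:
        return "interrupted"


hung = RecipientSketch()
hung.cleanup_without_failing_future()
print(wait_for_transition(hung, timeout=0.05))

fixed = RecipientSketch()
fixed.cleanup_failing_future()
print(wait_for_transition(fixed, timeout=0.05))
```

The timeout exists only so the demonstration terminates. Per the description above, the real command has no such bound, which is why it depends on either the future being failed or the operation context being interrupted.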
- is related to
- SERVER-104258 Resharding Can Hang If Recipient Fails During ShardsvrReshardRecipientClone
- Backlog