- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
Some of the older resharding commands call setAlwaysInterruptAtStepDownOrUp_UNSAFE() without first taking the RSTL. That leaves a race: a stepdown can begin before the operation is marked killable, so the command may keep running after the node is no longer primary.
The newer, safer pattern avoids this by taking the RSTL first, which makes the command's behavior easier to reason about. Update the following resharding commands to take the RSTL:
- ShardsvrReshardRecipientInitializeCommand
- ShardsvrReshardDonorInitializeCommand
- ShardsvrReshardRecipientCloneCommand
Audit the remaining resharding commands for the same issue so that they all behave consistently.
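To make the race concrete, here is a minimal toy model in standard C++. It is only a sketch under stated assumptions: `ToyNode`, `ToyOp`, `unsafeRegisterAndCheck`, `unsafeMarkKillable` and `safeBegin` are hypothetical names, and a `std::shared_mutex` stands in for the RSTL. None of this is the server's actual API. The old pattern is split into two halves so that a stepdown can be interleaved deterministically between them.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Toy stand-ins only; these are NOT the real server types.
struct ToyOp {
    bool killable = false;  // analogous to "interrupt me at stepdown"
    bool killed = false;
};

struct ToyNode {
    std::shared_mutex rstl;  // stand-in for the RSTL
    std::mutex opsMutex;     // protects the ops list
    bool isPrimary = true;
    std::vector<ToyOp*> ops;

    // Stepdown takes the lock exclusively, flips state, and kills
    // every op that is already marked killable.
    void stepDown() {
        std::unique_lock<std::shared_mutex> lk(rstl);
        std::lock_guard<std::mutex> opsLk(opsMutex);
        isPrimary = false;
        for (ToyOp* op : ops) {
            assert(op != nullptr);
            if (op->killable) op->killed = true;
        }
    }
};

// Old pattern, first half: register and check primary WITHOUT the lock.
// Only driven single-threaded here, to replay one bad interleaving.
bool unsafeRegisterAndCheck(ToyNode& node, ToyOp& op) {
    std::lock_guard<std::mutex> opsLk(node.opsMutex);
    node.ops.push_back(&op);
    return node.isPrimary;
}

// Old pattern, second half: mark killable some time later.
void unsafeMarkKillable(ToyOp& op) {
    op.killable = true;
}

// Newer pattern: hold the lock in shared mode across the primary check
// and the killable marking, so a stepdown cannot slip in between.
bool safeBegin(ToyNode& node, ToyOp& op) {
    std::shared_lock<std::shared_mutex> lk(node.rstl);
    if (!node.isPrimary) return false;  // would surface as a not-primary error
    std::lock_guard<std::mutex> opsLk(node.opsMutex);
    op.killable = true;
    node.ops.push_back(&op);
    return true;
}
```

Calling `unsafeRegisterAndCheck`, then `stepDown`, then `unsafeMarkKillable` leaves an op that is never killed on a node that is no longer primary. With `safeBegin`, a stepdown either runs first (and the command is rejected) or runs afterwards (and the op is killed).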
Holding the RSTL also lets us treat missing state as a genuine invalid-timing error rather than something that might just be a stepdown race. Consider throwing an error (likely IllegalOperation) in that case, since we would never expect to see notPrimary errors on the path that holds the RSTL.
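The error classification could look roughly like the following sketch. Everything here is an assumption made for illustration: `ToyErrorCode`, `ToyError`, `ToyNodeView`, `ToyReshardingState` and `requireReshardingState` are hypothetical names, not server types, and the real error code is still undecided.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical error codes mirroring the two cases discussed above.
enum class ToyErrorCode { NotPrimary, IllegalOperation };

struct ToyError : std::runtime_error {
    ToyErrorCode code;
    ToyError(ToyErrorCode c, const std::string& msg)
        : std::runtime_error(msg), code(c) {}
};

// Hypothetical per-operation resharding state.
struct ToyReshardingState {
    std::string phase;
};

// What the command can observe about the node while it holds the lock.
struct ToyNodeView {
    bool rstlHeld;
    bool isPrimary;
};

// Precondition: the caller holds the (toy) RSTL, so isPrimary cannot
// change underneath us while this function runs.
ToyReshardingState requireReshardingState(
    const ToyNodeView& node,
    const std::optional<ToyReshardingState>& state) {
    assert(node.rstlHeld);
    if (!node.isPrimary) {
        // Decided once, up front, under the lock.
        throw ToyError(ToyErrorCode::NotPrimary, "node is not primary");
    }
    if (!state) {
        // We are provably primary here, so missing state cannot be a
        // stepdown race; the command arrived at an invalid time.
        throw ToyError(ToyErrorCode::IllegalOperation,
                       "resharding state not found");
    }
    return *state;
}
```

The point of the ordering is that the not-primary case is ruled out before the state lookup, so a missing state has only one possible explanation left.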
More details in this thread.