- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
- Cluster Scalability Priorities
- 2
The ReshardingCoordinator treats PrimaryOnlyService's interrupt() call as a no-op. PrimaryOnlyService calls this method to interrupt its instances when the node is stepping down or shutting down, as well as when an instance is released.
The resharding coordinator exposes a number of futures publicly (e.g. the completionFuture) which are only fulfilled if run() is called. However, there is no guarantee that run() is ever called, for example, if a stepdown were to occur here.
In theory, a waiter on the coordinator's completion future could hang or receive a broken promise error if run() was never called.
The two existing callers of the completion future are:
- ConfigsvrReshardCollection
  - This properly synchronizes with the RSTL, so if run() was never called because of a stepdown, this command would be interrupted anyway.
- ConfigsvrAbortReshardCollection
  - This does not properly synchronize with the RSTL, so it is feasible that it dodges the RSTL killOp thread and continues running as a secondary. If that race occurs and run() is also never called, the command would likely hang until that node steps up again. At that point it is either killed properly by the RSTL killOp thread or fails with a broken promise error after the instance is cleaned up as part of PrimaryOnlyService's step-up logic.
The above uses the completion future as an example, but similar issues could exist for all other futures the coordinator exposes. It is likely that all promises should be set with an error in interrupt(), similar to what is being done in the ShardingCoordinator.
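A minimal sketch of the proposed direction, using only the C++ standard library rather than the server's own future and promise types. Everything here is a hypothetical stand-in: `CoordinatorSketch`, `describeOutcome`, and the single completion promise are illustrative names, not the real ReshardingCoordinator or PrimaryOnlyService API. The idea it shows is that interrupt() fails any promise that run() has not already fulfilled, so a waiter gets a definite error instead of hanging.

```cpp
#include <exception>
#include <future>
#include <mutex>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a coordinator instance (assumption, not server code).
class CoordinatorSketch {
public:
    // Publicly exposed future, analogous to the completion future in the ticket.
    std::shared_future<void> getCompletionFuture() const {
        return _completionFuture;
    }

    // Normal path: the promise is fulfilled only if run() is reached.
    void run() {
        _fulfill(nullptr);
    }

    // Instead of being a no-op, fail the promise if it is still unset.
    void interrupt(const std::string& reason) {
        _fulfill(std::make_exception_ptr(std::runtime_error(reason)));
    }

private:
    // Sets the promise at most once; later calls are ignored.
    void _fulfill(std::exception_ptr error) {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_done) {
            return;
        }
        _done = true;
        if (error) {
            _completionPromise.set_exception(error);
        } else {
            _completionPromise.set_value();
        }
    }

    std::mutex _mutex;
    bool _done = false;
    std::promise<void> _completionPromise;
    // Declared after the promise so the promise is constructed first.
    std::shared_future<void> _completionFuture =
        _completionPromise.get_future().share();
};

// Hypothetical waiter: reports how the future resolved.
std::string describeOutcome(std::shared_future<void> future) {
    try {
        future.get();
        return "ok";
    } catch (const std::exception& e) {
        return std::string("error: ") + e.what();
    }
}
```

In this sketch, a waiter on an instance that was interrupted before run() observes the interrupt reason as an error, while an interrupt that arrives after run() has completed leaves the successful result untouched. A real change would need to cover every promise the coordinator exposes, not only the completion one.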