Type: Bug
Resolution: Fixed
Priority: Major - P3
Affects Version/s: None
Component/s: None
Cluster Scalability
Fully Compatible
ALL
ClusterScalability Jul21-Aug3
3
TBD
If FlushReshardingStateChangeCmd fails because of a write concern timeout and therefore never refreshes the resharding state, it logs the failure but does not return a status indicating that it failed.
The resharding coordinator then cannot retry the command, and resharding hangs, because the resharding participants cannot make progress if this command is called during cloning, or the recipients will never be established.
See this patch build for a reproducer and logs that show this failure mode.
The easiest fix is to have this command return a status instead of void.
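A minimal sketch of why returning a status matters. It assumes a simplified, hypothetical `Status` struct and illustrative function names (`flushReshardingStateChangeVoid`, `flushReshardingStateChange`, `flushWithRetry`); none of these are MongoDB's actual types or functions, and the real fix may differ.

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <string>

// Hypothetical, simplified stand-in for a status type (not mongo::Status).
struct Status {
    bool ok;
    std::string reason;
    static Status OK() { return {true, ""}; }
};

// Before the fix (sketch): the failure is logged and then swallowed,
// so the caller cannot tell that the resharding state was never refreshed.
void flushReshardingStateChangeVoid(const std::function<Status()>& refresh) {
    Status s = refresh();
    if (!s.ok) {
        std::cerr << "flush failed: " << s.reason << '\n';
    }
}

// After the fix (sketch): the same failure is logged and returned to the caller.
Status flushReshardingStateChange(const std::function<Status()>& refresh) {
    Status s = refresh();
    if (!s.ok) {
        std::cerr << "flush failed: " << s.reason << '\n';
    }
    return s;
}

// Hypothetical coordinator-side retry loop, possible only because the
// command now reports failure. Returns the attempt that succeeded, or -1.
int flushWithRetry(const std::function<Status()>& refresh, int maxAttempts) {
    assert(maxAttempts > 0);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (flushReshardingStateChange(refresh).ok) {
            return attempt;
        }
    }
    return -1;
}
```

For example, a refresh callback that fails twice with a write concern timeout and then succeeds makes `flushWithRetry` return 3. With only the void variant, the coordinator would have no signal to retry at all.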
is related to
- SERVER-58081 _flushReshardingStateChange from coordinator races with donor shard acquiring critical section, stalling the resharding operation (Closed)

related to
- SERVER-104317 Update WithAutomaticRetry to retry on WCEs (In Code Review)