-
Type: Bug
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: 8.3.0-rc0
-
Component/s: None
-
None
-
Cluster Scalability
-
Fully Compatible
-
ALL
-
ClusterScalability Jan5-Jan19
-
200
The resharding coordinator can get stuck in an infinite retry loop if an abort is requested right after it transitions to the committing state.
When the coordinator reaches the committing state, it calls tellAllRecipientsToRefresh to send _flushReshardingStateChange commands to the recipient shards. This function passes the abort token to the async RPC layer. If an abort command (e.g. abortUnshardCollection) arrives while this is in progress, the abort token is cancelled. The async RPC layer detects the cancellation and cancels the pending _flushReshardingStateChange command, which returns a CallbackCanceled error. Because CallbackCanceled is classified as a retriable error and the retry loop is governed by the stepdown token (which is still valid), the coordinator retries the operation. Each retry immediately fails with CallbackCanceled because the abort token remains cancelled, so the loop never terminates.
Note that resharding cannot be aborted once it is past the committing state; in this case, however, it also never continues to completion. This bug was introduced in SERVER-114005.
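The token mismatch described above can be reproduced in a small simulation. This is only a sketch of the failure mode, not the server code: CancellationToken, send_flush_command, retry_until_success, and the max_attempts cap are all hypothetical names introduced here, and the cap exists only so the example terminates.

```python
class CancellationToken:
    """Hypothetical stand-in for a cancellation token."""

    def __init__(self):
        self._canceled = False

    def cancel(self):
        self._canceled = True

    def is_canceled(self):
        return self._canceled


class CallbackCanceled(Exception):
    """Stand-in for the retriable CallbackCanceled error."""


def send_flush_command(abort_token):
    # Models the async RPC layer: the command fails at once
    # if the token it was handed has already been cancelled.
    if abort_token.is_canceled():
        raise CallbackCanceled("abort token cancelled")
    return "ok"


def retry_until_success(operation, stepdown_token, max_attempts):
    # Models the retry loop: only the stepdown token can stop it.
    # max_attempts is an assumption added so this sketch terminates.
    attempts = 0
    while not stepdown_token.is_canceled() and attempts < max_attempts:
        attempts += 1
        try:
            return operation(), attempts
        except CallbackCanceled:
            continue  # classified as retriable, so try again
    return None, attempts


abort_token = CancellationToken()
stepdown_token = CancellationToken()

# The abort arrives right after the coordinator enters the committing state.
abort_token.cancel()

result, attempts = retry_until_success(
    lambda: send_flush_command(abort_token), stepdown_token, max_attempts=1000
)
```

Every attempt fails in the same way because the operation checks one token while the loop checks another, so without the artificial cap the loop would exit only on stepdown.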
is related to: SERVER-114005 Resharding critical section timeout should cancel remaining steps on coordinator (Closed)