Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.3.0-rc0
Affects Version/s: 8.3.0-rc0
Component/s: None
Labels: None
Assigned Teams: Cluster Scalability
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: ClusterScalability Jan5-Jan19
Linked BF Score: 200
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

The resharding coordinator can get stuck in an infinite retry loop if an abort is requested right after it transitions to the committing state. When the coordinator reaches the committing state, it calls tellAllRecipientsToRefresh to send _flushReshardingStateChange commands to the recipient shards. This function passes the abort token to the async RPC layer. If an abort command (e.g. abortUnshardCollection) arrives while these commands are in flight, the abort token is cancelled. The async RPC layer detects the cancellation, cancels the pending _flushReshardingStateChange command, and returns a CallbackCanceled error. CallbackCanceled is classified as a retriable error, and the retry loop is governed by the stepdown token, which is still valid, so the coordinator retries the operation. Because the abort token remains cancelled, every retry fails immediately with CallbackCanceled, and the loop never terminates.
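The interaction between the two tokens can be illustrated with a small simulation. This is only a sketch of the failure mode described above, not the server's implementation: the `CancellationToken` class, the function bodies, and the `max_attempts` bound are all hypothetical stand-ins, and the bound exists solely so the demo terminates.

```python
class CancellationToken:
    """Toy stand-in for a cancellation token (hypothetical, not the server's type)."""

    def __init__(self):
        self._canceled = False

    def cancel(self):
        self._canceled = True

    def is_canceled(self):
        return self._canceled


class CallbackCanceled(Exception):
    """Models the error returned when a pending RPC is cancelled."""


# In this sketch, CallbackCanceled is treated as retriable, as in the ticket.
RETRIABLE_ERRORS = (CallbackCanceled,)


def send_flush_command(abort_token):
    """Models the async RPC layer: fails at once when the abort token is cancelled."""
    if abort_token.is_canceled():
        raise CallbackCanceled("flush command cancelled")
    return "ok"


def tell_all_recipients_to_refresh(abort_token, stepdown_token, max_attempts):
    """Retry loop gated only by the stepdown token.

    max_attempts is an assumption added so the demo stops; the scenario in
    the ticket has no such bound.
    """
    attempts = 0
    while not stepdown_token.is_canceled() and attempts < max_attempts:
        attempts += 1
        try:
            send_flush_command(abort_token)
            return attempts, "succeeded"
        except RETRIABLE_ERRORS:
            # Retriable error: loop again. The abort token is never rechecked here.
            continue
    return attempts, "still retrying"


abort_token = CancellationToken()
stepdown_token = CancellationToken()
abort_token.cancel()  # abort request arrives after entering the committing state

result = tell_all_recipients_to_refresh(abort_token, stepdown_token, max_attempts=1000)
print(result)
```

With the abort token cancelled and the stepdown token still valid, every attempt fails and the loop exits only because of the artificial attempt bound. With neither token cancelled, the same function succeeds on its first attempt.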

Note that resharding cannot be aborted once it is past the committing state; in this case, however, it also does not continue to completion. This bug was introduced in SERVER-114005.

Is related to: SERVER-114005 Resharding critical section timeout should cancel remaining steps on coordinator (Closed)

Assignee: Abdul Qadeer
Reporter: Abdul Qadeer
Participants: Abdul Qadeer, Githook User
Votes: 0
Watchers: 2

Created: Dec 26 2025 06:58:54 PM UTC
Updated: Jan 14 2026 09:04:24 PM UTC
Resolved: Jan 07 2026 09:29:50 PM UTC
