Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 9.0.0-rc0
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Cluster Scalability
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: ClusterScalability 13Apr-27Apr
Linked BF Score: 200
Story Points: 2
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

After the resharding commit is persisted, the coordinator runs a retryable block that performs the post-commit work: sending change stream notifications, telling participants to commit, cleaning up temporary chunk metadata, waiting for all participants to finish, and finally removing the coordinator document from config.reshardingOperations.
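To make the retry semantics concrete, here is a minimal Python sketch of a retryable block that replays every step from the beginning whenever a transient error occurs. It is an illustration only, not the server's implementation: `run_with_retries`, `make_step`, the step names, and the `failures` table are all hypothetical, and the single injected failure is an assumption chosen to show the replay.

```python
class TransientError(Exception):
    """Stand-in for errors such as InterruptedDueToReplStateChange."""


def run_with_retries(steps, max_attempts=5):
    """Run every step in order; on a transient error, start over from step 0."""
    for attempt in range(1, max_attempts + 1):
        try:
            for step in steps:
                step()
            return attempt
        except TransientError:
            continue  # the whole sequence is replayed, not just the failed step
    raise RuntimeError("retries exhausted")


log = []  # records every step that ran to completion, across all attempts
failures = {"wait_for_participants": 1}  # hypothetical: fail this step once


def make_step(name):
    """Build a step that fails while its failure budget is non-zero."""
    def step():
        if failures.get(name, 0) > 0:
            failures[name] -= 1
            raise TransientError(name)
        log.append(name)
    return step


# Hypothetical names mirroring the post-commit work described above.
STEP_NAMES = [
    "send_change_stream_notifications",
    "tell_participants_to_commit",
    "clean_up_temporary_chunk_metadata",
    "wait_for_participants",
    "remove_coordinator_document",
]

attempts = run_with_retries([make_step(n) for n in STEP_NAMES])
print(attempts, log)
```

In this sketch the first three steps run twice, which is harmless only as long as every step is safe to repeat.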

If a transient error like InterruptedDueToReplStateChange occurs after the coordinator document has been removed but before the retryable block fully completes, the retry mechanism restarts the entire sequence from the beginning. Several of the earlier steps in the sequence acquire a new retryable writes session, which persists session state by writing to the coordinator document for OSI replay protection. Since that document no longer exists, the write matches 0 documents instead of 1, producing a non-transient error (Location10323900) that propagates to the fatal assertion handler and crashes the server.
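The failure mode can be reproduced in miniature. The Python sketch below is a simplified model under stated assumptions, not server code: a dict stands in for config.reshardingOperations, and `persist_session_state`, `remove_coordinator_document`, the `txnNumber` field, and the error classes are hypothetical names. The delete is assumed to be applied before the interruption is raised.

```python
class TransientError(Exception):
    """Stand-in for InterruptedDueToReplStateChange."""


class NonTransientError(Exception):
    """Stand-in for the non-retryable error (Location10323900 in the ticket)."""


# Models config.reshardingOperations with one committed operation.
coordinator_docs = {"reshard-1": {"state": "committed"}}
interrupt_once = [True]  # hypothetical switch: interrupt the first removal


def persist_session_state(op_id):
    """Models an early step that writes session state to the coordinator doc."""
    matched = 1 if op_id in coordinator_docs else 0
    if matched != 1:
        raise NonTransientError(f"expected to match 1 document, matched {matched}")
    doc = coordinator_docs[op_id]
    doc["txnNumber"] = doc.get("txnNumber", 0) + 1  # hypothetical field


def remove_coordinator_document(op_id):
    """Final step: the delete is applied, then the interruption surfaces."""
    coordinator_docs.pop(op_id, None)
    if interrupt_once[0]:
        interrupt_once[0] = False
        raise TransientError("interrupted after the delete was applied")


def post_commit_block(op_id):
    persist_session_state(op_id)  # an early step that acquires a session
    remove_coordinator_document(op_id)


def run(op_id):
    """Retry the whole block on transient errors only."""
    while True:
        try:
            post_commit_block(op_id)
            return "done"
        except TransientError:
            continue


try:
    outcome = run("reshard-1")
except NonTransientError as exc:
    outcome = f"fatal: {exc}"

print(outcome)
```

The second attempt fails in its first step because the document it expects to update is already gone, which mirrors the sequence described above.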

This was introduced by SERVER-120993.

is related to: SERVER-120993 Stamp OSI on resharding coordinator participant commands (Closed)

Assignee: Abdul Qadeer
Reporter: Abdul Qadeer
Participants: Abdul Qadeer, Githook User, TPM Jira Automations Bot
Votes: 0
Watchers: 4

Created: Apr 21 2026 07:25:28 PM UTC
Updated: Jun 09 2026 06:15:44 PM UTC
Resolved: Apr 22 2026 10:23:57 PM UTC
