- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
- Fully Compatible
- ALL
- ClusterScalability 13Apr-27Apr
- 200
- 2
After the resharding commit is persisted, the coordinator runs a retryable block that performs the post-commit work: sending change stream notifications, telling participants to commit, cleaning up temporary chunk metadata, waiting for all participants to finish, and finally removing the coordinator document from config.reshardingOperations.
If a transient error such as InterruptedDueToReplStateChange occurs after the coordinator document has been removed but before the retryable block fully completes, the retry mechanism restarts the entire sequence from the beginning. Several of the earlier steps in the sequence acquire a new retryable writes session, and each acquisition persists session state by writing to the coordinator document for OSI replay protection. Because that document no longer exists, the write matches 0 documents instead of 1. This produces a non-transient error (Location10323900) that propagates to the fatal assertion handler and crashes the server.
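The failure sequence can be illustrated with a small, self-contained simulation. This is only a sketch and not the server code: `ToyCoordinator`, `acquire_session`, `post_commit_work`, `run_with_retry`, and both exception classes are hypothetical names, and a plain dict stands in for config.reshardingOperations. It assumes a single injected transient error that fires right after the document is removed.

```python
class TransientError(Exception):
    """Hypothetical stand-in for a retryable error such as InterruptedDueToReplStateChange."""


class NonTransientError(Exception):
    """Hypothetical stand-in for the non-transient Location10323900 failure."""


class ToyCoordinator:
    def __init__(self):
        # A dict standing in for config.reshardingOperations (assumption).
        self.collection = {"op": {"txnNumber": 0}}
        # Inject exactly one transient error after the document is removed.
        self.inject_transient_error = True

    def acquire_session(self):
        # Persisting session state is modeled as an update that must match
        # exactly one coordinator document.
        matched = 1 if "op" in self.collection else 0
        if matched != 1:
            raise NonTransientError(f"matched {matched} documents, expected 1")
        self.collection["op"]["txnNumber"] += 1

    def post_commit_work(self):
        # An early step acquires a new session and writes to the document.
        self.acquire_session()
        # Notifications, participant commit, and cleanup are elided here.
        # The final step removes the coordinator document.
        self.collection.pop("op")
        if self.inject_transient_error:
            self.inject_transient_error = False
            raise TransientError("interrupted after document removal")


def run_with_retry(block):
    # Restart the whole block on transient errors; anything else propagates.
    while True:
        try:
            return block()
        except TransientError:
            continue


def simulate():
    coordinator = ToyCoordinator()
    try:
        run_with_retry(coordinator.post_commit_work)
    except NonTransientError as exc:
        return f"fatal: {exc}"
    return "completed"
```

In this model the first attempt removes the document and then hits the injected transient error. The retry restarts from the first step, whose write now matches 0 documents, so `simulate()` returns `"fatal: matched 0 documents, expected 1"` instead of `"completed"`.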
This was introduced by SERVER-120993.
- is related to: SERVER-120993 Stamp OSI on resharding coordinator participant commands (Closed)