Resharding coordinator can crash if retried after removing state document


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • None
    • Cluster Scalability
    • Fully Compatible
    • ALL
    • ClusterScalability 13Apr-27Apr
    • 200
    • 2

      After the resharding commit is persisted, the coordinator runs a retryable block that performs the post-commit work: sending change stream notifications, telling participants to commit, cleaning up temporary chunk metadata, waiting for all participants to finish, and finally removing the coordinator document from config.reshardingOperations.
       
      If a transient error such as InterruptedDueToReplStateChange occurs after the coordinator document has been removed but before the retryable block fully completes, the retry mechanism restarts the entire sequence from the beginning. Several of the earlier steps in the sequence acquire a new retryable writes session, which persists session state by writing to the coordinator document for OSI replay protection. Because that document no longer exists, the write matches 0 documents instead of 1, producing a non-transient error (Location10323900) that propagates to the fatal assertion handler and crashes the server.
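The failure sequence can be illustrated with a small, self-contained simulation. This is only a sketch of the control flow described above, not the server's actual code: every name in it (FakeConfigCollection, acquire_retryable_session, post_commit_work, run_with_retry, and the error classes) is hypothetical, and the in-memory dictionary merely stands in for config.reshardingOperations.

```python
class TransientError(Exception):
    """Stand-in for a retryable error such as InterruptedDueToReplStateChange."""


class NonTransientError(Exception):
    """Stand-in for the non-retryable error (Location10323900 in the report)."""


class FakeConfigCollection:
    """Toy in-memory stand-in for config.reshardingOperations (hypothetical)."""

    def __init__(self):
        self.docs = {}

    def update_one(self, op_id, fields):
        """Update the document if it exists; return the number of matches."""
        doc = self.docs.get(op_id)
        if doc is None:
            return 0
        doc.update(fields)
        return 1

    def delete_one(self, op_id):
        """Delete the document if it exists; return the number deleted."""
        return 1 if self.docs.pop(op_id, None) is not None else 0


def acquire_retryable_session(coll, op_id, txn_number):
    """Persist session state on the coordinator document (assumed behavior)."""
    matched = coll.update_one(op_id, {"txnNumber": txn_number})
    if matched != 1:
        raise NonTransientError(
            f"expected to match 1 document, matched {matched}")


def post_commit_work(coll, op_id, attempt, log):
    """Run the whole post-commit sequence from the beginning."""
    # An early step acquires a new session, which writes to the document.
    acquire_retryable_session(coll, op_id, txn_number=attempt)
    log.append("notify change streams")
    log.append("tell participants to commit")
    log.append("clean up temporary chunk metadata")
    log.append("wait for participants")
    coll.delete_one(op_id)
    log.append("remove coordinator document")
    if attempt == 1:
        # Simulate a transient error after removal, before the block completes.
        raise TransientError("simulated replica set state change")


def run_with_retry(coll, op_id, log):
    """Retry the entire block on transient errors; let other errors escape."""
    attempt = 0
    while True:
        attempt += 1
        try:
            post_commit_work(coll, op_id, attempt, log)
            return
        except TransientError:
            log.append("retry")


def demo():
    """Return the step log and the fatal error message, if any."""
    coll = FakeConfigCollection()
    coll.docs["op1"] = {"state": "committing"}
    log = []
    try:
        run_with_retry(coll, "op1", log)
    except NonTransientError as exc:
        return log, str(exc)
    return log, None
```

In this sketch the first attempt completes every step, including the removal, and then hits the simulated transient error. The second attempt fails at its first write because the document is gone, and that error is not retryable, so it escapes the retry loop. In the real server the equivalent error reaches the fatal assertion handler.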
       
      This was introduced by SERVER-120993.
       

            Assignee:
            Abdul Qadeer
            Reporter:
            Abdul Qadeer
            Votes:
            0
            Watchers:
            4

              Created:
              Updated:
              Resolved: