Resharding coordinator can fail to commit due to stale session


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • Fully Compatible
    • ALL
    • 200

      A resharding operation can fail to commit because an old txnNumber is reused. The bug was exposed after the addition of OSI replay protection (SPM-4126).

      The following sequence of conditions triggers it:

      1. Resharding is in the commit phase, at the point just after the commit notification for change streams has been sent.
      2. An addShard operation runs concurrently. During addShard, _shardsvrDrainOngoingDDLOperations is sent, which triggers killSessionsAbortUnpreparedTransactions.
        This aborts the coordinator's writeDecisionPersistedState transaction with InterruptedDueToAddShard.
      3. _commitAndFinishReshardOperation retries, resending the change stream notifications and advancing the session txnNumber.
      4. When the commit retry succeeds, the stale updatedCoordinatorDoc is installed in memory with the old session/txnNumber instead of being updated with the higher txnNumber.
      5. A second retry of _generateCommitNotificationForChangeStreams (the first one already succeeded) then reads the old txnNumber from the coordinator doc's session (_getSession).
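
The steps above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server code: `ReplayProtectedSession`, `CoordinatorDoc`, `send_notification`, and `simulate` are hypothetical names, the starting txnNumber of 5 is arbitrary, and the replay check is reduced to "reject any txnNumber that is not strictly newer". The `carry_forward_txn_number` flag shows one possible correction, installing the document with the higher txnNumber as step 4 describes, and is not necessarily how the ticket was fixed.

```python
from dataclasses import dataclass, replace


class TxnNumberTooOld(Exception):
    """Raised by the toy replay check when a txnNumber is not newer."""


class ReplayProtectedSession:
    """Hypothetical session that rejects any txnNumber it has already seen."""

    def __init__(self, last_txn_number: int) -> None:
        self.last_txn_number = last_txn_number

    def run(self, txn_number: int) -> None:
        if txn_number <= self.last_txn_number:
            raise TxnNumberTooOld(
                f"txnNumber {txn_number} <= {self.last_txn_number}"
            )
        self.last_txn_number = txn_number


@dataclass(frozen=True)
class CoordinatorDoc:
    """Toy coordinator document; only the session txnNumber is modeled."""

    txn_number: int


def send_notification(session: ReplayProtectedSession,
                      doc: CoordinatorDoc) -> CoordinatorDoc:
    """Run with the txnNumber following the one stored in the doc."""
    next_txn = doc.txn_number + 1
    session.run(next_txn)
    return replace(doc, txn_number=next_txn)


def simulate(carry_forward_txn_number: bool) -> bool:
    """Return True if the final notification is accepted by the session."""
    session = ReplayProtectedSession(last_txn_number=5)
    in_memory = CoordinatorDoc(txn_number=5)

    # Step 1: the doc to persist is captured, then the notification is sent.
    updated_doc = in_memory
    in_memory = send_notification(session, in_memory)  # uses txnNumber 6

    # Step 2: the commit transaction is aborted by the concurrent addShard
    # (not modeled here beyond forcing the retry below).

    # Step 3: the retry resends the notification and advances the txnNumber.
    in_memory = send_notification(session, in_memory)  # uses txnNumber 7

    # Step 4: the commit succeeds and a doc is installed in memory.
    if carry_forward_txn_number:
        # Assumed correction: keep the highest txnNumber used so far.
        in_memory = replace(updated_doc, txn_number=in_memory.txn_number)
    else:
        # Buggy path: the stale doc still carries txnNumber 5.
        in_memory = updated_doc

    # Step 5: a further notification reads the txnNumber from the doc.
    try:
        send_notification(session, in_memory)
    except TxnNumberTooOld:
        return False
    return True
```

In this model, `simulate(False)` follows the buggy path: the stale document makes the next notification reuse txnNumber 6 after the session has already seen 7, so the toy replay check rejects it. `simulate(True)` carries the higher txnNumber forward and the notification is accepted.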

            Assignee:
            Abdul Qadeer
            Reporter:
            Abdul Qadeer