  1. Core Server
  2. SERVER-49019

refineShardKey can hit getLastError invariant during stepdown

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.4
    • Sprint:
      Sharding 2020-09-07
    • Linked BF Score:
      10

      Description

      The invariant in getErrorLabels can be hit through the following sequence:

      1. refineShardKey has finished all of its updates and is about to run commitTransaction.
      2. A stepDown occurs, and the stepDown thread enqueues a request for the RSTL in MODE_X.
      3. commitTransaction tries to acquire the RSTL inside beginOrContinue and throws a LockTimeout error.
      4. The commitTransaction thread calls waitForWriteConcern inside the catch block and encounters a NotMaster error.
      5. The server tries to build the response object and hits the invariant in getErrorLabels, because LockTimeout is a transient transaction error and the NotMaster writeConcern error is retryable.
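
The conflict in step 5 can be illustrated with a small model. This is only a sketch, not the server's code: the helper names, the `InvariantFailure` exception, and the two one-element error sets are assumptions made for illustration, and the rule that a response may not qualify for both labels is inferred from the description above.

```python
# Simplified model of step 5; this is not the server's implementation.
TRANSIENT_TXN_ERRORS = {"LockTimeout"}   # assumed subset, taken from step 5
RETRYABLE_WRITE_ERRORS = {"NotMaster"}   # assumed subset, taken from step 5


class InvariantFailure(Exception):
    """Stands in for the server aborting on a failed invariant."""


def get_error_labels(command_error, write_concern_error=None):
    """Hypothetical helper: compute the error labels for one response."""
    is_transient = command_error in TRANSIENT_TXN_ERRORS
    is_retryable = (
        command_error in RETRYABLE_WRITE_ERRORS
        or write_concern_error in RETRYABLE_WRITE_ERRORS
    )
    # Assumed rule: a response must not qualify for both labels at once.
    if is_transient and is_retryable:
        raise InvariantFailure("transient and retryable labels both apply")
    labels = []
    if is_transient:
        labels.append("TransientTransactionError")
    if is_retryable:
        labels.append("RetryableWriteError")
    return labels


def hits_invariant(command_error, write_concern_error=None):
    """Return True if building the labels would trip the modeled invariant."""
    try:
        get_error_labels(command_error, write_concern_error)
    except InvariantFailure:
        return True
    return False
```

In this model, `hits_invariant("LockTimeout", "NotMaster")` corresponds to steps 3 through 5: the command error comes from the failed RSTL acquisition and the write concern error from waitForWriteConcern in the catch block. Either error on its own yields a single label and no failure.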

      Note: this issue does not affect a normal transaction. In a normal commitTransaction, the opCtx is fresh and the lock timeout is not set until the TxnResource is unstashed, which happens after beginOrContinue. The refineShardKey command, however, uses the same AlternativeClientRegion opCtx for the commit as for all the previous writes in the transaction, so the opCtx still has the 5ms lock timeout set by those writes.
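
The difference between the two paths can be sketched as a toy model. Everything here is hypothetical (the `OpCtx` class, the function names, and the string outcomes); it only models the ordering described in the note, where the lock timeout is set when transaction resources are unstashed, after beginOrContinue. What the real server does when a lock request has no timeout is outside the scope of this sketch.

```python
LOCK_TIMEOUT_MS = 5  # the 5ms timeout mentioned in the note


class OpCtx:
    """Hypothetical stand-in for an operation context."""

    def __init__(self):
        self.max_lock_timeout_ms = None  # a fresh opCtx has no lock timeout


def unstash_txn_resources(op_ctx):
    # Modeled side effect: unstashing sets the lock timeout on the opCtx.
    op_ctx.max_lock_timeout_ms = LOCK_TIMEOUT_MS


def begin_or_continue(op_ctx, rstl_x_enqueued):
    """Model the RSTL acquisition: 'acquired', 'LockTimeout', or 'waits'."""
    if not rstl_x_enqueued:
        return "acquired"
    if op_ctx.max_lock_timeout_ms is not None:
        return "LockTimeout"  # a conflicting MODE_X request plus a timeout
    return "waits"  # no timeout set, so no LockTimeout in this model


def run_txn_command(op_ctx, rstl_x_enqueued=False):
    outcome = begin_or_continue(op_ctx, rstl_x_enqueued)
    if outcome == "acquired":
        unstash_txn_resources(op_ctx)  # happens after beginOrContinue
    return outcome


def normal_commit(rstl_x_enqueued):
    run_txn_command(OpCtx())  # an earlier write on its own opCtx
    return run_txn_command(OpCtx(), rstl_x_enqueued)  # commit on a fresh opCtx


def refine_shard_key_commit(rstl_x_enqueued):
    shared = OpCtx()  # one opCtx for the writes and the commit
    run_txn_command(shared)  # an earlier write leaves the timeout set
    return run_txn_command(shared, rstl_x_enqueued)
```

With a stepDown pending, `normal_commit(True)` does not produce LockTimeout in this model, while `refine_shard_key_commit(True)` does, because the shared opCtx still carries the timeout from the earlier write.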

            People

            Assignee:
            janna.golden Janna Golden
            Reporter:
            renctan Randolph Tan