  1. Core Server
  2. SERVER-41261

Use the oplog entry after the common point to calculate rollbackTimeLimitSecs

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2.0-rc3, 4.0.13, 4.3.1
    • Component/s: None
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Backport Requested:
      v4.2, v4.0
    • Sprint:
      Repl 2019-07-01, Repl 2019-07-15
      Description

      In Atlas you can pause a cluster, effectively shutting the nodes down for a period of time.

      Assume a cluster is paused for more than 24 hours and that all nodes are caught up, having committed all writes. When the nodes are restarted at the same time, we see two nodes run and two branches of history form. Eventually, one node goes into rollback and hits an fassert because the common point is more than 24 hours old, even though only 1 or 2 very recent oplog entries are being rolled back. In this case the common point is from over 24 hours ago, while the oplog entry immediately after the common point is less than 5 minutes old.

      While we believe SERVER-40336 fixes the problem of two nodes running at the same time, it still makes sense to change this calculation in case a true network partition occurs after unpausing. Resolving this manually is a headache.
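
      The difference between the two ways of measuring can be illustrated with a small sketch. This is not the server's code: the function names, the list-of-timestamps oplog model, and the 24-hour default are assumptions made for illustration, based only on the scenario described above.

```python
from datetime import datetime, timedelta

# Assumed limit for illustration: the 24 hours mentioned in the description.
ROLLBACK_TIME_LIMIT_SECS = 24 * 60 * 60


def exceeds_limit_from_common_point(oplog_times, common_point_index,
                                    limit_secs=ROLLBACK_TIME_LIMIT_SECS):
    """Measure from the common point itself to the newest local entry."""
    span = oplog_times[-1] - oplog_times[common_point_index]
    return span.total_seconds() > limit_secs


def exceeds_limit_from_first_entry_after(oplog_times, common_point_index,
                                         limit_secs=ROLLBACK_TIME_LIMIT_SECS):
    """Measure from the first entry after the common point to the newest entry."""
    first_after = common_point_index + 1
    if first_after >= len(oplog_times):
        return False  # nothing past the common point, so nothing to roll back
    span = oplog_times[-1] - oplog_times[first_after]
    return span.total_seconds() > limit_secs


# Hypothetical oplog of the node that will roll back (wall-clock times only).
now = datetime(2019, 6, 1, 12, 0, 0)
oplog_times = [
    now - timedelta(hours=30),   # common point, written before the pause
    now - timedelta(minutes=4),  # first divergent entry after unpausing
    now - timedelta(minutes=3),  # newest entry on this node
]

old_check = exceeds_limit_from_common_point(oplog_times, 0)
new_check = exceeds_limit_from_first_entry_after(oplog_times, 0)
print(old_check, new_check)  # True False
```

      With these hypothetical timestamps, measuring from the common point reports nearly 30 hours and would trip the limit, while measuring from the entry after the common point reports 1 minute, which matches the amount of history actually being rolled back.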

              People

              Assignee:
              Jason Chan (jason.chan)
              Reporter:
              Alyson Cabral (alyson.cabral, inactive)