Core Server / SERVER-41261

Use the oplog entry after the common point to calculate rollbackTimeLimitSecs

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 4.2.0-rc3, 4.0.13, 4.3.1
    • Backwards Compatibility: Fully Compatible
    • Backport Requested: v4.2, v4.0
    • Sprint: Repl 2019-07-01, Repl 2019-07-15

    Description

      In Atlas you can pause a cluster, effectively shutting the nodes down for a period of time.

      Assume a cluster is paused for more than 24 hours and that all nodes are current, having committed all writes. When the nodes are restarted at the same time, we see two nodes run and two branches of history form. Eventually, one node goes into rollback and hits an fassert because the common point is more than 24 hours old, even though only 1 or 2 very recent oplog entries are being rolled back. In this case, the common point is from over 24 hours ago, whereas the oplog entry immediately after the common point is from less than 5 minutes ago.

      While we believe we are fixing the problem of two nodes running at the same time via SERVER-40336, it still makes sense to change this calculation in case true network partitions occur after unpausing. Resolving this manually is a headache.
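
      To illustrate the difference, here is a minimal sketch of the two ways the limit could be evaluated. It is not the server's implementation: the function name, the timestamps, and the assumption that the check compares the newest local oplog entry against a reference entry are all hypothetical. Only the 24-hour limit and the rough ages of the entries come from the description above.

```python
from datetime import datetime, timedelta

# The 24-hour limit mentioned in the description, expressed in seconds.
ROLLBACK_TIME_LIMIT_SECS = 24 * 60 * 60


def exceeds_limit(newest_entry_time, reference_time,
                  limit_secs=ROLLBACK_TIME_LIMIT_SECS):
    """Hypothetical check: is the span between the reference entry and the
    newest local oplog entry longer than the rollback time limit?"""
    return (newest_entry_time - reference_time).total_seconds() > limit_secs


# Made-up timestamps that mirror the scenario: the cluster was paused for
# more than 24 hours, then diverged only minutes before the rollback.
now = datetime(2019, 6, 1, 12, 0, 0)
common_point = now - timedelta(hours=30)              # written before the pause
first_after_common_point = now - timedelta(minutes=4)  # written after unpausing
newest_local_entry = now - timedelta(minutes=1)

# Measuring from the common point counts the whole pause as rollback time.
check_from_common_point = exceeds_limit(newest_local_entry, common_point)

# Measuring from the entry after the common point counts only the writes
# that would actually be rolled back.
check_from_next_entry = exceeds_limit(newest_local_entry,
                                      first_after_common_point)

print(check_from_common_point, check_from_next_entry)
```

      Under these assumed timestamps, the first check reports that the limit is exceeded and the second does not, which matches the behavior change this ticket asks for.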

          People

            Assignee: Jason Chan (jason.chan@mongodb.com)
            Reporter: Alyson Cabral (alyson.cabral@mongodb.com, Inactive)