Core Server / SERVER-34279

Crash after upgrade before stable checkpoint can cause replication recovery to skip oplog entries

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Sprint:
      Repl 2018-04-23

      Description

      Consider the following scenario:
      1. Cleanly shut down a node running a 3.6 binary. Its appliedThrough value will be null.
      2. Bring the node up with a 4.0 binary. Replication recovery will do nothing since we are consistent at the top of the oplog; there is no appliedThrough or recoveryTimestamp.
      3. The node starts taking writes from ts=T1 to ts=T2, as a primary or a secondary. These writes are recorded in the oplog, but only the oplog writes are journaled; the corresponding data writes are not. If the node is a secondary, the appliedThrough may move forward, but those appliedThrough updates are not journaled either.
      4. Now, before we take a stable checkpoint, the node crashes.
      5. Restart the node with the 4.0 binary. The node starts up with the same data as at step 2 (reflecting a consistent point at T1), but its oplog also contains the entries through T2 from step 3.
      6. There is no recoveryTimestamp and the appliedThrough is null, so we assume we are consistent at the top of the oplog, T2, when in reality we are consistent only at T1. As a result, the oplog entries from T1 to T2 are never replayed.
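
The scenario above can be sketched as a toy model. This is only an illustration under stated assumptions, not MongoDB's recovery code: `ToyNode`, its fields, and `run_scenario` are hypothetical names, timestamps are plain integers, and the startup rule is reduced to the single decision described in steps 2 and 6.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

OplogEntry = Tuple[int, str, int]  # (timestamp, key, value)


@dataclass
class ToyNode:
    """Hypothetical, heavily simplified model of a node's state."""
    # Durable state: survives a crash.
    checkpoint_data: Dict[str, int] = field(default_factory=dict)
    oplog: List[OplogEntry] = field(default_factory=list)  # journaled
    applied_through: Optional[int] = None     # durable value only
    recovery_timestamp: Optional[int] = None  # set only by a stable checkpoint
    # Volatile state: lost on a crash.
    data: Dict[str, int] = field(default_factory=dict)

    def start(self) -> List[int]:
        """Start up (or restart after a crash); return the timestamps replayed."""
        self.data = dict(self.checkpoint_data)  # only checkpointed data survives
        if self.recovery_timestamp is None and self.applied_through is None:
            # Steps 2 and 6: assume we are consistent at the top of the oplog.
            return []
        # Simplification: replay everything after the known consistent point.
        if self.recovery_timestamp is not None:
            start_ts = self.recovery_timestamp
        else:
            start_ts = self.applied_through
        replayed = []
        for ts, key, value in self.oplog:
            if ts > start_ts:
                self.data[key] = value
                replayed.append(ts)
        return replayed

    def write(self, ts: int, key: str, value: int) -> None:
        """Step 3: the oplog entry is journaled, the data change is not."""
        self.oplog.append((ts, key, value))
        self.data[key] = value


def run_scenario():
    t1, t2 = 1, 3
    # Steps 1-2: clean shutdown on 3.6, then startup on 4.0, consistent at T1.
    node = ToyNode(checkpoint_data={"x": 1}, oplog=[(t1, "x", 1)])
    node.start()
    # Step 3: writes after T1 up to T2, with no stable checkpoint taken.
    node.write(2, "y", 2)
    node.write(t2, "x", 3)
    # Steps 4-5: crash, then restart.
    replayed = node.start()
    expected = {"x": 3, "y": 2}  # the data the oplog says we should have
    return replayed, node.data, expected
```

In this model, `run_scenario()` replays nothing and leaves the data at its T1 state even though the oplog extends to T2. If the durable `recovery_timestamp` were T1 instead of null, the same `start()` call would replay the entries after T1.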

              People

              Assignee:
              daniel.gottlieb Daniel Gottlieb
              Reporter:
              judah.schvimer Judah Schvimer
