Core Server / SERVER-46758

setFCV can be interrupted before an FCV change is majority committed, and the FCV can then roll back without running the setFCV server logic

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.4, v4.2, v4.0
    • Sprint:
      Repl 2020-04-06, Repl 2020-04-20, Repl 2020-05-04
    • Linked BF Score:
      30

      Description

      I believe this bug goes all the way back to the beginning of the setFCV framework, so the fix will need to be backported.

      A setFCV command changes the FCV value twice: first to put the FCV into upgrading / downgrading, then to put it into fully upgraded / fully downgraded. For each of these FCV writes, we wait for majority confirmation before proceeding.
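The two-write sequence can be sketched as a toy model. This is not the server implementation: `set_fcv`, `write_fcv`, and `wait_for_majority` are hypothetical names chosen for illustration, and the state strings simply mirror the wording of this ticket.

```python
def set_fcv(target, write_fcv, wait_for_majority):
    """Toy model of the two FCV writes made by one setFCV command."""
    # Hypothetical mapping from the requested end state to its transitional state.
    transitional = {
        "fully upgraded": "upgrading",
        "fully downgraded": "downgrading",
    }[target]
    for value in (transitional, target):
        write_fcv(value)       # local write of the new FCV value
        wait_for_majority()    # in the real server this wait can be interrupted


# Record the order of operations for an upgrade.
events = []
set_fcv(
    "fully upgraded",
    write_fcv=lambda value: events.append("write " + value),
    wait_for_majority=lambda: events.append("wait for majority"),
)
```

Each write is followed by its own majority wait, so there are two separate windows in which the command can be interrupted.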

      However, setFCV can be interrupted while waiting for majority write concern (by InterruptedDueToReplStateChange, for example), and the FCV value can then be rolled back a step. This manifested in test failures where the in-memory FCV value did not match the persisted FCV value: the persisted value had been rolled back, but the in-memory value was left unchanged by rollback. Recover to a stable timestamp wipes out writes back to the checkpoint and then plays writes forward from the oplog up to the desired point, so the FCV value change never even goes through the OpObserver.
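A minimal sketch of how the two values diverge, assuming a toy node where the in-memory FCV is only ever updated by an observer hook on normal writes. `ToyNode` and its methods are hypothetical and stand in for the real storage and OpObserver machinery; oplog replay is omitted because the interrupted FCV write lies past the point that rollback recovers to.

```python
class ToyNode:
    """Toy replica set member tracking persisted and in-memory FCV."""

    def __init__(self, fcv):
        self.stable_checkpoint = fcv  # persisted FCV as of the stable timestamp
        self.persisted = fcv          # FCV value on disk
        self.in_memory = fcv          # cached FCV value

    def write_fcv(self, fcv):
        # A normal write persists the value and notifies the observer.
        self.persisted = fcv
        self._on_fcv_write(fcv)

    def _on_fcv_write(self, fcv):
        # Stand-in for the OpObserver keeping the in-memory value in sync.
        self.in_memory = fcv

    def recover_to_stable_timestamp(self):
        # Storage is reset to the checkpoint; the observer is never invoked.
        self.persisted = self.stable_checkpoint


node = ToyNode("fully downgraded")
node.write_fcv("upgrading")         # setFCV is interrupted before majority commit
node.recover_to_stable_timestamp()  # rollback discards the uncommitted write
mismatch = node.persisted != node.in_memory
```

After the rollback, `node.persisted` is back at "fully downgraded" while `node.in_memory` still reads "upgrading", which is the mismatch the failing tests detected.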

      I think it’s okay if rollback moves the FCV from fully upgraded/downgraded to upgrading/downgrading, because the user can simply rerun setFCV in the right direction and the logic is idempotent. This scenario is the same as the server failing at any point in setFCV and setFCV being retried, which we know works.

      However, rolling back from upgrading/downgrading to fully downgraded/upgraded requires running the setFCV logic to make sure the rest of the server settings match the new FCV. I also believe we must finish an in-progress upgrading/downgrading before we can move to downgrading/upgrading. Config servers will be their own special problem because their setFCV logic involves setting the FCV on the shard servers either first or last (I forget which).
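The distinction between the two rollback cases can be written down as a small classifier. This only encodes the reasoning in this description, not server behavior; the function name and state strings are hypothetical.

```python
TRANSITIONAL = {"upgrading", "downgrading"}
FULL = {"fully upgraded", "fully downgraded"}


def rollback_needs_set_fcv_logic(before, after):
    """Return True if rolling the FCV back from `before` to `after`
    would require running the setFCV server logic, per this ticket."""
    if before in FULL and after in TRANSITIONAL:
        # Safe: the user can rerun setFCV, which is idempotent.
        return False
    if before in TRANSITIONAL and after in FULL:
        # Problematic: server settings may no longer match the FCV.
        return True
    raise ValueError(f"not a single-step FCV rollback: {before!r} -> {after!r}")


safe_case = rollback_needs_set_fcv_logic("fully upgraded", "upgrading")
problem_case = rollback_needs_set_fcv_logic("upgrading", "fully downgraded")
```

Here `safe_case` is `False` and `problem_case` is `True`; any pair that is not a single-step rollback raises `ValueError` rather than guessing.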


              People

              Assignee:
              jason.chan Jason Chan
              Reporter:
              dianna.hohensee Dianna Hohensee
