Core Server / SERVER-40954

Error message for UnrecoverableRollbackError in FCV 3.6 should recommend downgrading to 3.6


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 4.0.9
    • Fix Version/s: 4.0.13
    • Component/s: Replication
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Steps To Reproduce:

      I reproduced it with a PSA set, but we have an ongoing case where it was encountered in a PSSSS configuration.

      1. Set up a PSA replica set with v3.6 binaries and FCV 3.6.
      2. Upgrade to v4.0.9; do not upgrade to FCV 4.0.
      3. Suspend one of the data-bearing nodes with SIGSTOP.
      4. Insert some data into the primary.
      5. Suspend the primary.
      6. Resume the node suspended in step 3 (SIGCONT); it should become the primary.
      7. Insert some data into the new primary (both insert steps are probably unnecessary, since having no-op entries is enough to diverge the timeline).
      8. Shut down and restart the ex-primary. It will crash with:


      2019-05-02T17:51:48.865+1000 F ROLLBACK [rsBackgroundSync] RecoverToStableTimestamp failed.  :: caused by :: UnrecoverableRollbackError: No stable timestamp available to recover to. Initial data timestamp: Timestamp(1556783333, 1), Stable timestamp: Timestamp(0, 0)
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync] Rollback summary:
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      start time: 2019-05-02T17:51:48.837+1000
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      end time: 2019-05-02T17:51:48.857+1000
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      sync source: nubuntu:27017
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      rollback data file directory: /data/rs36/rollback/test.t
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      rollback id: 2
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      last optime on branch of history rolled back: { ts: Timestamp(1556783333, 1), t: 4 }
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      common point optime: { ts: Timestamp(1556783273, 1), t: 4 }
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      last wall clock time on the branch of history rolled back: 2019-05-02T17:48:53.577+1000
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      common point wall clock time: 2019-05-02T17:47:53.575+1000
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      difference in wall clock times: 60 second(s)
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      shard identity document rolled back: false
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      config server config version document rolled back: false
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      affected sessions: none
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      affected namespaces: 
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]              test.t
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      counts of interesting commands rolled back: 
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]              insert: 6
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]              create: 1
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      total number of entries rolled back (including no-ops): 11
      2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync] transition to SECONDARY
      2019-05-02T17:51:48.865+1000 I REPL     [rsBackgroundSync] transition to SECONDARY from ROLLBACK
      2019-05-02T17:51:48.866+1000 I REPL     [rsBackgroundSync] Resetting sync source to empty, which was nubuntu:27017
      2019-05-02T17:51:48.866+1000 F REPL     [rsBackgroundSync] Rollback failed with unrecoverable error: UnrecoverableRollbackError: Recover to stable timestamp failed.
      2019-05-02T17:51:48.866+1000 F -        [rsBackgroundSync] Fatal assertion 50666 UnrecoverableRollbackError: Recover to stable timestamp failed. at src/mongo/db/repl/bgsync.cpp 705
      2019-05-02T17:51:48.866+1000 F -        [rsBackgroundSync] 
       
      ***aborting after fassert() failure
      
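      The suspend/resume mechanic in steps 3 and 6 can be sketched in Python (a stand-in `sleep` process replaces a real mongod here, but the same signals applied to a mongod pid are what the repro uses; POSIX only):

      ```python
      import os
      import signal
      import subprocess

      # Stand-in for a data-bearing mongod node.
      child = subprocess.Popen(["sleep", "30"])

      # Step 3: suspend the node.
      os.kill(child.pid, signal.SIGSTOP)
      pid, status = os.waitpid(child.pid, os.WUNTRACED)
      assert os.WIFSTOPPED(status)  # the process is now stopped

      # Step 6: resume the node.
      os.kill(child.pid, signal.SIGCONT)

      # Clean up the stand-in process.
      child.terminate()
      child.wait()
      ```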

    • Sprint:
      Repl 2019-06-17, Repl 2019-07-29, Repl 2019-08-12, Repl 2019-08-26, Repl 2019-09-09
    • Case:

      Description

      In a replica set with all nodes on the v4.0 binary and in FCV 3.6, a clean shutdown causes a node to set its recovery timestamp to 0. If this happens to a node whose oplog has diverged (i.e. it needs to enter rollback), the node cannot complete the rollback, since it has no stable timestamp to roll back to, which recover-to-timestamp requires. Furthermore, to take a new stable checkpoint it would have to commit a new majority write, which it should not be able to do until it completes the rollback. It also should not be able to upgrade to FCV 4.0 until it completes the rollback and replicates new oplog entries from the primary. The recommended path forward is to downgrade the binary to 3.6 and restart to complete the rollback. We will update the error message with this recommendation.
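      A minimal Python sketch of the check this ticket changes (the actual server code is C++, and the names below are hypothetical): when no stable timestamp is available during rollback in FCV 3.6, the fatal error should tell the operator to downgrade the binary to 3.6 and restart.

      ```python
      # Timestamp(0, 0): no stable checkpoint has been taken yet.
      NULL_TS = (0, 0)

      class UnrecoverableRollbackError(Exception):
          pass

      def recover_to_stable_timestamp(stable_ts, fcv):
          """Fail with an actionable message when no stable timestamp exists."""
          if stable_ts == NULL_TS:
              msg = "No stable timestamp available to recover to."
              if fcv == "3.6":
                  msg += (" To complete the rollback, downgrade the binary"
                          " to 3.6 and restart the node.")
              raise UnrecoverableRollbackError(msg)
          return stable_ts

      try:
          recover_to_stable_timestamp(NULL_TS, "3.6")
      except UnrecoverableRollbackError as e:
          error_message = str(e)
      ```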

      Original description:

      In a replica set with all nodes on the v4.0 binary and in FCV 3.6, a clean shutdown causes a node to set its recovery timestamp to 0. If this happens to a node whose oplog has diverged (i.e. it needs to enter rollback), the node cannot complete the rollback, since it has no stable timestamp to roll back to, which recover-to-timestamp requires. Furthermore, to take a new stable checkpoint it would have to commit a new majority write, which it should not be able to do until it completes the rollback. It also should not be able to upgrade to FCV 4.0 until it completes the rollback and replicates new oplog entries from the primary. If we encounter this situation in FCV 3.6, falling back on the rollbackViaRefetch algorithm may be the appropriate solution. Another alternative may be to always use rollbackViaRefetch whenever FCV is 3.6.
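      The alternative discussed above amounts to selecting the rollback algorithm based on FCV and the stable timestamp. A hypothetical sketch (rollbackViaRefetch, the pre-4.0 algorithm, needs no stable checkpoint, so it remains usable when a clean FCV 3.6 shutdown has left the recovery timestamp at zero):

      ```python
      def choose_rollback_algorithm(fcv, stable_ts):
          """Pick a rollback algorithm a node could actually complete."""
          if fcv == "3.6" and stable_ts == (0, 0):
              # No stable checkpoint exists; refetch-based rollback still works.
              return "rollbackViaRefetch"
          # The 4.0 default: recover to the stable timestamp, then apply forward.
          return "recoverToStableTimestamp"
      ```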

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              vesselina.ratcheva Vesselina Ratcheva
              Reporter:
              dmitry.ryabtsev Dmitry Ryabtsev
              Participants:
              Votes:
              1
              Watchers:
              20

                Dates

                Created:
                Updated:
                Resolved: