[SERVER-40954] Error message for UnrecoverableRollbackError in FCV 3.6 should recommend downgrading to 3.6 Created: 02/May/19  Updated: 29/Oct/23  Resolved: 05/Sep/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.0.9
Fix Version/s: 4.0.13

Type: Bug Priority: Major - P3
Reporter: Dmitry Ryabtsev Assignee: Vesselina Ratcheva (Inactive)
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File upgrade_repl.js    
Issue Links:
Documented
is documented by DOCS-13014 Investigate changes in SERVER-40954: ... Closed
Related
related to SERVER-41931 Remove forceRollbackViaRefetch flag Backlog
related to SERVER-32590 RTT 3.6<->4.0 upgrade/downgrade. Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

I reproduced it with a PSA replica set, but we have an ongoing case where it was encountered in a PSSSS configuration.

1. Set up a PSA replica set with v3.6 binaries and FCV 3.6
2. Upgrade the binaries to v4.0.9; do not upgrade to FCV 4.0
3. Suspend one of the data-bearing nodes with SIGSTOP
4. Insert some data into the primary
5. Suspend the primary
6. Resume the node suspended in step 3 (SIGCONT); it should become the primary
7. Insert some data into the new primary (both insert steps are probably unnecessary, since the no-op entries alone are enough to diverge the timelines)
8. Shut down and restart the ex-primary. It will crash with the fatal assertion below (a scripted sketch of these steps follows the log):


2019-05-02T17:51:48.865+1000 F ROLLBACK [rsBackgroundSync] RecoverToStableTimestamp failed.  :: caused by :: UnrecoverableRollbackError: No stable timestamp available to recover to. Initial data timestamp: Timestamp(1556783333, 1), Stable timestamp: Timestamp(0, 0)
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync] Rollback summary:
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      start time: 2019-05-02T17:51:48.837+1000
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      end time: 2019-05-02T17:51:48.857+1000
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      sync source: nubuntu:27017
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      rollback data file directory: /data/rs36/rollback/test.t
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      rollback id: 2
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      last optime on branch of history rolled back: { ts: Timestamp(1556783333, 1), t: 4 }
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      common point optime: { ts: Timestamp(1556783273, 1), t: 4 }
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      last wall clock time on the branch of history rolled back: 2019-05-02T17:48:53.577+1000
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      common point wall clock time: 2019-05-02T17:47:53.575+1000
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      difference in wall clock times: 60 second(s)
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      shard identity document rolled back: false
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      config server config version document rolled back: false
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      affected sessions: none
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      affected namespaces: 
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]              test.t
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      counts of interesting commands rolled back: 
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]              insert: 6
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]              create: 1
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync]      total number of entries rolled back (including no-ops): 11
2019-05-02T17:51:48.865+1000 I ROLLBACK [rsBackgroundSync] transition to SECONDARY
2019-05-02T17:51:48.865+1000 I REPL     [rsBackgroundSync] transition to SECONDARY from ROLLBACK
2019-05-02T17:51:48.866+1000 I REPL     [rsBackgroundSync] Resetting sync source to empty, which was nubuntu:27017
2019-05-02T17:51:48.866+1000 F REPL     [rsBackgroundSync] Rollback failed with unrecoverable error: UnrecoverableRollbackError: Recover to stable timestamp failed.
2019-05-02T17:51:48.866+1000 F -        [rsBackgroundSync] Fatal assertion 50666 UnrecoverableRollbackError: Recover to stable timestamp failed. at src/mongo/db/repl/bgsync.cpp 705
2019-05-02T17:51:48.866+1000 F -        [rsBackgroundSync] 
 
***aborting after fassert() failure
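
The steps above can be scripted. The following is a hypothetical repro harness, not the attached upgrade_repl.js: the binary paths, ports, dbpaths, and sleep durations are assumptions, and it assumes pymongo >= 3.12 on a Unix host (for SIGSTOP/SIGCONT).

#!/usr/bin/env python3
# Hypothetical repro harness for the steps above (NOT the attached
# upgrade_repl.js). Binary paths, ports, dbpaths, and sleeps are assumptions;
# requires pymongo >= 3.12 and a Unix host for SIGSTOP/SIGCONT.
import os
import signal
import subprocess
import time

from pymongo import MongoClient
from pymongo.errors import PyMongoError

BIN_36 = "/opt/mongodb-3.6/bin/mongod"          # assumed install locations
BIN_40 = "/opt/mongodb-4.0/bin/mongod"
NODES = [
    {"port": 27017, "dbpath": "/data/rs0"},     # data-bearing node A
    {"port": 27018, "dbpath": "/data/rs1"},     # data-bearing node B
    {"port": 27019, "dbpath": "/data/arb"},     # arbiter
]


def start(binary, node):
    # Launch one mongod in the foreground so Popen.pid is the mongod pid.
    return subprocess.Popen([binary, "--replSet", "rs",
                             "--port", str(node["port"]),
                             "--dbpath", node["dbpath"]])


def conn(node):
    # Direct connection to a single member.
    return MongoClient("localhost", node["port"], directConnection=True)


def clean_shutdown(node, proc):
    # Ask mongod to shut down cleanly; the connection drops when it complies.
    try:
        conn(node).admin.command("shutdown")
    except PyMongoError:
        pass
    proc.wait()


# 1. Set up a PSA set on 3.6 binaries; FCV stays at 3.6 throughout.
procs = [start(BIN_36, n) for n in NODES]
time.sleep(10)
conn(NODES[0]).admin.command("replSetInitiate", {
    "_id": "rs",
    "members": [
        {"_id": 0, "host": "localhost:27017", "priority": 2},  # bias election
        {"_id": 1, "host": "localhost:27018"},
        {"_id": 2, "host": "localhost:27019", "arbiterOnly": True},
    ],
})
time.sleep(30)                                  # wait for node A to be elected

# 2. Upgrade the binaries to 4.0 without bumping FCV.
for node, proc in zip(NODES, procs):
    clean_shutdown(node, proc)
procs = [start(BIN_40, n) for n in NODES]
time.sleep(30)

# 3. Suspend data-bearing node B, then 4. write to the primary (node A).
os.kill(procs[1].pid, signal.SIGSTOP)
conn(NODES[0]).test.t.insert_many([{"x": i} for i in range(5)])

# 5. Suspend the primary, 6. resume node B so it wins the next election,
# 7. write to the new primary so the two branches of history diverge.
os.kill(procs[0].pid, signal.SIGSTOP)
os.kill(procs[1].pid, signal.SIGCONT)
time.sleep(30)
conn(NODES[1]).test.t.insert_many([{"y": i} for i in range(5)])

# 8. Cleanly shut down and restart the ex-primary. SIGTERM stays pending while
# the process is stopped and triggers a clean shutdown once it is resumed; on
# restart the node must roll back, finds no stable timestamp, and fasserts.
os.kill(procs[0].pid, signal.SIGTERM)
os.kill(procs[0].pid, signal.SIGCONT)
procs[0].wait()
start(BIN_40, NODES[0]).wait()
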

Sprint: Repl 2019-06-17, Repl 2019-07-29, Repl 2019-08-12, Repl 2019-08-26, Repl 2019-09-09
Participants:
Case:

 Description   

In a replica set with all nodes on the v4.0 binary version and FCV=3.6, a clean shutdown will cause a node to set its recovery timestamp to 0. If this happens on a node whose oplog has diverged (i.e. it needs to enter rollback), the node won't be able to complete the rollback, since it does not have a stable timestamp to roll back to, which recover-to-timestamp requires. Furthermore, in order to take a new stable checkpoint, it would have to commit a new majority write, which it shouldn't be able to do until it completes the rollback. It also shouldn't be able to upgrade to FCV=4.0 until it completes the rollback and replicates new oplog entries from the primary. The recommended path forward is to downgrade the binary to 3.6 and restart so the rollback can complete. We will update the error message with this recommendation.

Original description:

In a replica set with all nodes on the v4.0 binary version and FCV=3.6, a clean shutdown will cause a node to set its recovery timestamp to 0. If this happens on a node whose oplog has diverged (i.e. it needs to enter rollback), the node won't be able to complete the rollback, since it does not have a stable timestamp to roll back to, which recover-to-timestamp requires. Furthermore, in order to take a new stable checkpoint, it would have to commit a new majority write, which it shouldn't be able to do until it completes the rollback. It also shouldn't be able to upgrade to FCV=4.0 until it completes the rollback and replicates new oplog entries from the primary. If FCV=3.6 and we encounter this situation, falling back on the rollback-via-refetch algorithm may be the appropriate solution. Another alternative may be to always use rollbackViaRefetch whenever FCV=3.6.
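
As an aside for operators, the minimal pymongo sketch below (the host, port, and pymongo >= 3.12 are assumptions, not details from this ticket) shows one way to confirm a node's FCV and replica set member state before attempting the FCV upgrade described above:

# Minimal sketch: check FCV and member state on a single node.
# localhost:27017 and pymongo >= 3.12 are assumptions.
from pymongo import MongoClient

client = MongoClient("localhost", 27017, directConnection=True)

# featureCompatibilityVersion is exposed through getParameter on the admin DB.
fcv = client.admin.command(
    {"getParameter": 1, "featureCompatibilityVersion": 1}
)["featureCompatibilityVersion"]["version"]

# myState is 1 for PRIMARY, 2 for SECONDARY, 9 for ROLLBACK.
state = client.admin.command("replSetGetStatus")["myState"]
print("FCV:", fcv, "member state:", state)

# Only once every member has completed rollback and rejoined should FCV be raised:
#   client.admin.command({"setFeatureCompatibilityVersion": "4.0"})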



 Comments   
Comment by Githook User [ 05/Sep/19 ]

Author:

{'username': 'vessy-mongodb', 'email': 'vesselina.ratcheva@mongodb.com', 'name': 'Vesselina Ratcheva'}

Message: SERVER-40954 Add recommendation to downgrade binary version on unrecoverable rollbacks
Branch: v4.0
https://github.com/mongodb/mongo/commit/09cfdda802a4a41c19e3eaa8b2f6667272eab1a8

Comment by Judah Schvimer [ 01/Jul/19 ]

The recommended fix when hitting this error in FCV 3.6 is to downgrade the binary version to 3.6, complete the rollback with rollback via refetch, and then retry the upgrade. We will improve the error message emitted when there is "No stable timestamp available to recover to" in FCV 3.6 so that it recommends downgrading the binary version.
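
Purely as an illustration of that recommendation (the binary paths, port, and dbpath below are assumptions, not part of the actual fix), the recovery amounts to restarting the crashed node on a 3.6 binary so rollback via refetch can complete, then retrying the binary upgrade:

# Hypothetical sketch of the recommended recovery path; paths, port, and dbpath
# are assumptions. The node is assumed to have already fasserted and exited.
import subprocess

BIN_36 = "/opt/mongodb-3.6/bin/mongod"
BIN_40 = "/opt/mongodb-4.0/bin/mongod"
ARGS = ["--replSet", "rs", "--port", "27017", "--dbpath", "/data/rs0"]

# Restart on the 3.6 binary: the node completes rollback via refetch and
# rejoins the set as a SECONDARY.
subprocess.Popen([BIN_36] + ARGS)

# After confirming the node has caught up (e.g. rs.status() in the shell),
# shut it down cleanly and retry the upgrade on the 4.0 binary:
#   subprocess.Popen([BIN_40] + ARGS)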
