[SERVER-47401] Rollback cannot complete while setting FCV in 3.6 Created: 07/Apr/20  Updated: 27/Oct/23  Resolved: 08/Apr/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.6.17
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Kevin Arhelger Assignee: Tess Avitabile (Inactive)
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File logs.tar.b2     File test.sh    
Issue Links:
Related
Operating System: ALL
Steps To Reproduce:

Reproducer script test.sh included in attachments.

Sprint: Repl 2020-04-20
Participants:

 Description   
  1. Primary losses connectivity with replica set
  2. setFeatureCompatibilityVersion: "3.6" is run on primary
  3. setFeatureCompatibilityVersion: "3.6" is run on remainder of replica set
  4. Old primary will rollback, remove the UUID on admin.system.versions
  5. Old primary will fetch FCV = 3.6 from the remaining members.
  6. Rollback loop continues forever

2020-04-07T16:28:11.279-0500 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1586294837, 2896), t: 2 }. source's GTE: { t
s: Timestamp(1586294864, 1), t: 3 } hashes: (-5373381131897261603/8604888661770171088)
2020-04-07T16:28:11.279-0500 I REPL     [rsBackgroundSync] Replication commit point: { ts: Timestamp(0, 0), t: -1 }
2020-04-07T16:28:11.279-0500 I REPL     [rsBackgroundSync] Rollback using the 'rollbackViaRefetchNoUUID' method because UUID support is not feature compatible with featureCompatibilityVersi
on 3.4
2020-04-07T16:28:11.279-0500 I REPL     [rsBackgroundSync] transition to ROLLBACK from SECONDARY
2020-04-07T16:28:11.279-0500 I NETWORK  [rsBackgroundSync] Skip closing connection for connection # 9
2020-04-07T16:28:11.279-0500 I NETWORK  [rsBackgroundSync] Skip closing connection for connection # 8
2020-04-07T16:28:11.279-0500 I ROLLBACK [rsBackgroundSync] Starting rollback. Sync source: localhost:27019
2020-04-07T16:28:11.282-0500 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to localhost:27019, took 3ms (2 connections now open to localhost:27019)
2020-04-07T16:28:11.283-0500 I ROLLBACK [rsBackgroundSync] Finding the Common Point
2020-04-07T16:28:11.286-0500 I ROLLBACK [rsBackgroundSync] our last optime:   Timestamp(1586294837, 2896)
2020-04-07T16:28:11.286-0500 I ROLLBACK [rsBackgroundSync] their last optime: Timestamp(1586294878, 101)
2020-04-07T16:28:11.286-0500 I ROLLBACK [rsBackgroundSync] diff in end of log times: -41 seconds
2020-04-07T16:28:11.392-0500 I ROLLBACK [rsBackgroundSync] Rollback common point is { ts: Timestamp(1586294827, 2326), t: 2 }
2020-04-07T16:28:11.392-0500 I ROLLBACK [rsBackgroundSync] Starting refetching documents
2020-04-07T16:28:14.834-0500 I ROLLBACK [rsBackgroundSync] Finished refetching documents. Total size of documents refetched: 2
2020-04-07T16:28:14.834-0500 I ROLLBACK [rsBackgroundSync] Checking the RollbackID and updating the MinValid if necessary
2020-04-07T16:28:14.834-0500 I ROLLBACK [rsBackgroundSync] Setting minvalid to { ts: Timestamp(1586294894, 1), t: 3 }
2020-04-07T16:28:14.835-0500 I ROLLBACK [rsBackgroundSync] Dropping collections to roll back create operations
2020-04-07T16:28:14.835-0500 I ROLLBACK [rsBackgroundSync] Deleting and updating documents to roll back insert, update and remove operations
2020-04-07T16:28:14.837-0500 I ROLLBACK [rsBackgroundSync] Exception in rollback ns:admin.system.version { _id: "featureCompatibilityVersion" } IllegalOperation: Tried to complete upgrade, but admin.system.version did not have a UUID. ndeletes:0



 Comments   
Comment by Kevin Arhelger [ 08/Apr/20 ]

For future reference, the rollback will complete if "setFeatureCompatibilityVersion" is reverted to 3.4 on the remaining members.

After the rollback completes the node will fassert, but it can now be resynced after the rollback data is saved.

2020-04-08T12:36:02.121-0500 F REPL     [repl writer worker 12] writer worker caught exception: OplogOperationUnsupported: Must be in primary or secondary state to downgrade feature compatibility version document: { _id: "featureCompatibilityVersion", version: "3.4", targetVersion: "3.4" } on: { ts: Timestamp(1586367356, 1), t: 12, h: -2576021385126439168, v: 2, op: "u", ns: "admin.system.version", ui: UUID("dcd69483-1385-4297-b148-35c3d8c877b5"), o2: { _id: "featureCompatibilityVersion" }, wall: new Date(1586367356426), o: { _id: "featureCompatibilityVersion", version: "3.4", targetVersion: "3.4" } }
2020-04-08T12:36:02.121-0500 F -        [repl writer worker 12] Fatal assertion 16359 OplogOperationUnsupported: Must be in primary or secondary state to downgrade feature compatibility version document: { _id: "featureCompatibilityVersion", version: "3.4", targetVersion: "3.4" } at src/mongo/db/repl/sync_tail.cpp 1266

Comment by Tess Avitabile (Inactive) [ 08/Apr/20 ]

kevin.arhelger, I apologize that there is no solution for this issue in rollback-via-refetch. This unrecoverable rollback scenario is fixed by the recover-to-timestamp algorithm in 4.0.

Comment by Judah Schvimer [ 08/Apr/20 ]

Yes. I think this is works as designed, unfortunately.

Comment by Tess Avitabile (Inactive) [ 08/Apr/20 ]

judah.schvimer, is this works as designed, as described inĀ SERVER-31189? From that ticket, it looks like we intentionally fassert because there is no way to become consistent.

Generated at Thu Feb 08 05:14:05 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.