[SERVER-71156] Fatal Assertion 40507 UnrecoverableRollbackError and Mongo Crash with 4.0.27 WT Created: 07/Nov/22  Updated: 15/Nov/22  Resolved: 15/Nov/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: venkataramans rama Assignee: Yuan Fang
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File Mongo-logs_11_8_2022.tar    
Issue Links:
Related
related to SERVER-28068 Do not go into rollback due to fallin... Closed
related to SERVER-33878 Update OplogFetcher to go into SyncSo... Closed
related to SERVER-23392 Increase Replication Rollback (Data) ... Closed
Operating System: ALL
Participants:

 Description   

Problem Description

With MongoDB 4.0 on the WiredTiger storage engine, we are frequently experiencing crashes due to UnrecoverableRollbackError.

 

Setup Details

 

Mongo Server Version: v4.0.27

git version: d47b151b55f286546e7c7c98888ae0577856ca20

OpenSSL version: OpenSSL 1.1.1k  FIPS 25 Mar 2021

distmod: rhel80

Mongod Options (excerpt from the mongod startup log; truncated in places in the original paste):

options: {
  net: { bindIpAll: true, ipv6: true, port: 27717 },
  operationProfiling: { slowOpThresholdMs: 500 },
  processManagement: { ...7.pid" },
  replication: { enableMajorityReadConcern: false, oplogSizeMB: 5120, replSet: "set01f" },
  security: { keyFile: "/root/.dbkey" },
  storage: { dbPath: "/var/data/sessions.1/f", engine...cheSizeGB: 22.0 },
  systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb-27717.log", quiet: true }
}

Replica set with 4 data-bearing members and 1 arbiter

 

RepSet Settings

"protocolVersion" : NumberLong(1),

"writeConcernMajorityJournalDefault" : false,

"settings" : {

                "chainingAllowed" : true,

                "heartbeatIntervalMillis" : 2000,

                "heartbeatTimeoutSecs" : 1,

                "electionTimeoutMillis" : 10000,

                "catchUpTimeoutMillis" : -1,

                "catchUpTakeoverDelayMillis" : 30000,

                "getLastErrorModes" :

{                   }

,

                "getLastErrorDefaults" :

{                         "w" : 1,                         "wtimeout" : 0                 }

,
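For reference, the settings above are the "settings" sub-document of the replica set configuration; they can be read back from the mongo shell on any member (illustrative only):

    // Print the full replica set configuration document.
    rs.conf()

    // Or just the tunables listed above.
    rs.conf().settings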

 

 

 

Test Case During Which the Crash Was Reported

 

We recently migrated from MongoDB 3.6.17 on the MMAPv1 storage engine to MongoDB 4.0.27 on the WiredTiger storage engine. The 4.0.27 upgrade is required in order for us to migrate to version 4.2. This is the first time we have run our application on the WiredTiger storage engine. We configured a WiredTiger cache size of 22 GB and an oplog size of 5 GB (we have been using this oplog size for more than 7 years).
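For reference, the effective oplog size and the time window it currently covers can be checked from the mongo shell on a data-bearing member; this is only an illustrative sketch:

    // Prints configured oplog size, used space, and the time span covered
    // ("log length start to end"), i.e. the replication window.
    rs.printReplicationInfo()

    // The same data is available programmatically via the shell helper:
    var info = db.getReplicationInfo();
    print("oplog size (MB):    " + info.logSizeMB);
    print("oplog window (hrs): " + info.timeDiffHours);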

 

The test case deletes records at 700 TPS per replica set. While this test is running, we see a few replica set members going into the RECOVERING state. When we analyzed the logs, we saw:

 

[replication-3928] Restarting oplog query due to error: CappedPositionLost: error in fetcher batch callback :: caused by :: CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(7161504155264090918). Last fetched optime (with hash): { ts: Timestamp(1667417622, 805), t: 104 }[5387250820607207456]. Restarts remaining: 1

2022-11-02T19:48:13.531+0000 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1667417622, 805), t: 104 }. source's GTE: { ... (5387250820607207456/5740888219355370057)

2022-11-02T19:49:31.633+0000 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: need to rollback, but unable to determine common point between local and remote oplog: NoMatchingDocument: reached beginning of remote oplog: {them: nd2bwa3psm11vb:27717, theirTime: Timestamp(1667417690, 868)}

2022-11-02T19:49:31.633+0000 F -        [rsBackgroundSync] Fatal Assertion 40507 at src/mongo/db/repl/rs_rollback.cpp 1567

2022-11-02T19:49:31.634+0000 F -        [rsBackgroundSync] ***aborting after fassert() failure

2022-11-02T19:49:34.853+0000 I CONTROL  [main] ***** SERVER RESTARTED *****

 

When the server restarts, it recovers from an unstable checkpoint, fails to sync up with any member because it is too stale, and finally goes into maintenance mode.

 

2022-11-02T19:49:41.746+0000 I REPL     [initandlisten] Recovering from an unstable checkpoint (top of oplog: { ts: Timestamp(1667417622, 805), t: 104 }, appliedThrough: { ts: Timestamp(166...

2022-11-02T19:49:41.746+0000 I REPL     [initandlisten] Starting recovery oplog application at the appliedThrough: { ts: Timestamp(1667417622, 805), t: 104 }, through the top of the oplog:

2022-11-02T19:49:41.746+0000 I REPL     [initandlisten] No oplog entries to apply for recovery. Start point is at the top of the oplog.

 

 

Server logs and diagnostic.data are attached. nd2bwa3psm12vb and nd2bwa3psm11va are the affected replica set members, but logs for all 4 replica members are attached for reference.

 



 Comments   
Comment by Yuan Fang [ 14/Nov/22 ]

Hi veramasu@hcl.com ,

Thank you for your report. Looking at the mongod.log, it appears that the secondary node started "rollback due to OplogStartMissing", then the error below suggests the node fell behind the oplog window and the rollback failed because the common point could not be determined:

2022-11-02T19:40:10.769+0000 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: need to rollback, but unable to determine common point between local and remote oplog: NoMatchingDocument: reached beginning of remote oplog: {them: nd2bwa3psm11va:27717, theirTime: Timestamp(1667417174, 675)}

Additionally, the message below also indicates that node nd2bwa3psm11vb:27717 has fallen off the oplog:

2022-11-02T19:40:29.986+0000 I REPL [replication-0] We are too stale to use nd2bwa3psm11va:27717 as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1667417084, 143) is before their earliest timestamp: Timestamp(1667417202, 127) for 1min until: 2022-11-02T19:41:29.986+0000

A replica set member becomes “stale” when its replication process falls so far behind that the primary overwrites oplog entries the member has not yet replicated. You can find more information here. You may also want to check this comment in a thread that discusses "mitigating the risk of falling off the oplog".
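As a rough illustration of this condition, you can compare the lagging member's newest oplog entry with its sync source's oldest one; the sketch below is not a supported procedure, the host name is taken from the log excerpt above, and authentication is omitted for brevity:

    // On the lagging member: the newest oplog entry it holds.
    var ourNewest = db.getSiblingDB("local").oplog.rs
        .find({}, { ts: 1 }).sort({ $natural: -1 }).limit(1).next().ts;

    // On the sync source: the oldest entry still retained in its oplog.
    var src = new Mongo("nd2bwa3psm11va:27717").getDB("local");
    var theirOldest = src.oplog.rs
        .find({}, { ts: 1 }).sort({ $natural: 1 }).limit(1).next().ts;

    print("our newest:   " + tojson(ourNewest));
    print("their oldest: " + tojson(theirOldest));
    // If our newest entry is older than the source's oldest entry, this
    // member has fallen off the oplog and requires a full resync.
    print("too stale: " + (ourNewest.t < theirOldest.t));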

With all that being said, to remediate this you will need to completely resynchronize the member by performing an initial sync. You may also consider increasing the size of the oplog on both members if the incident recurs after upgrading to v4.2 (v4.0 is no longer supported).
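As an illustration of the oplog-size suggestion (the 16 GB figure below is only an example, not a recommendation), a WiredTiger member's oplog can be resized online from the mongo shell:

    // Run against each data-bearing member to grow; size is in megabytes.
    db.adminCommand({ replSetResizeOplog: 1, size: 16384 })

    // Confirm the new size and the replication window it now provides.
    rs.printReplicationInfo()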

For issues you may encounter, we'd like to encourage you to start by asking our community for help by posting on the MongoDB Developer Community Forums. If the discussion there leads you to suspect a bug in the MongoDB server, then we'd want to investigate it as a possible bug here in the SERVER project.

Regards,
Yuan
