Core Server / SERVER-71156

Fatal Assertion 40507 UnrecoverableRollbackError and Mongo Crash with 4.0.27 WT

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Operating System: ALL

Problem Description

With the MongoDB 4.0 WiredTiger storage engine, we are frequently experiencing an UnrecoverableRollbackError crash.

       

      Setup Details

       

      Mongo Server Version: v4.0.27

      git version: d47b151b55f286546e7c7c98888ae0577856ca20

      OpenSSL version: OpenSSL 1.1.1k  FIPS 25 Mar 2021

      distmod: rhel80

Mongod options:

options: { net: { bindIpAll: true, ipv6: true, port: 27717 },
  operationProfiling: { slowOpThresholdMs: 500 },
  processManagement: { fo7.pid" },
  replication: { enableMajorityReadConcern: false, oplogSizeMB: 5120, replSet: "set01f" },
  security: { keyFile: "/root/.dbkey" },
  storage: { dbPath: "/var/data/sessions.1/f", enginecheSizeGB: 22.0 } } },
  systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb-27717.log", quiet: true } }
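As a sanity check, the full effective startup options (including the sections that are truncated in the capture above) can be dumped from a running member with a standard admin command; this one-liner is a suggestion and not part of the original report:

db.adminCommand({ getCmdLineOpts: 1 }).parsed   // full parsed startup options (processManagement, storage, etc.)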

Replica set with 4 data-bearing members and 1 arbiter.

       

      RepSet Settings

      "protocolVersion" : NumberLong(1),

      "writeConcernMajorityJournalDefault" : false,

      "settings" : {

                      "chainingAllowed" : true,

                      "heartbeatIntervalMillis" : 2000,

                      "heartbeatTimeoutSecs" : 1,

                      "electionTimeoutMillis" : 10000,

                      "catchUpTimeoutMillis" : -1,

                      "catchUpTakeoverDelayMillis" : 30000,

                      "getLastErrorModes" :

      {                   }

      ,

                      "getLastErrorDefaults" :

      {                         "w" : 1,                         "wtimeout" : 0                 }

      ,
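For reference, the fields above can be read back from a running member with standard shell helpers; this snippet is a sketch and not taken from the report:

rs.conf().writeConcernMajorityJournalDefault   // false in this deployment
rs.conf().settings                             // heartbeat, election, and catch-up settings shown above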

       

       

       

Test case during which the crash was reported

       

We recently migrated from 3.6.17 with the MMAPv1 storage engine to MongoDB 4.0.27 with the WiredTiger storage engine. The 4.0.27 upgrade is required in order for us to migrate to version 4.2. This is the first time we have run our application on the WiredTiger storage engine. We configured a WiredTiger cache size of 22 GB and an oplog of 5 GB (we have been using this oplog size for more than 7 years).
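Given the 5 GB oplog, the replication window it actually covers can be checked with a standard shell helper; this is a suggested check, not something taken from the report:

rs.printReplicationInfo()   // configured oplog size, used size, and the time span between first and last oplog entries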

       

The test case deletes records at roughly 700 TPS per replica set (a rough sketch of such a workload is given after the log excerpt below). While this test is running, we see a few replica set members going into RECOVERING state. When we analyzed the logs, we saw the following:

       

[replication-3928] Restarting oplog query due to error: CappedPositionLost: error in fetcher batch callback :: caused by :: CollectionScan died due tast seen record id: RecordId(7161504155264090918). Last fetched optime (with hash): { ts: Timestamp(1667417622, 805), t: 104 }[5387250820607207456]. Restarts remaining: 1
2022-11-02T19:48:13.531+0000 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1667417622, 805), t: 104 }. source's GTE: { (5387250820607207456/5740888219355370057)
2022-11-02T19:49:31.633+0000 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: need to rollback, but unable to determine cochingDocument: reached beginning of remote oplog: {them: nd2bwa3psm11vb:27717, theirTime: Timestamp(1667417690, 868)}
2022-11-02T19:49:31.633+0000 F -        [rsBackgroundSync] Fatal Assertion 40507 at src/mongo/db/repl/rs_rollback.cpp 1567
2022-11-02T19:49:31.634+0000 F -        [rsBackgroundSync] \n\n***aborting after fassert() failure\n\n
2022-11-02T19:49:34.853+0000 I CONTROL  [main] ***** SERVER RESTARTED *****
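The exact test harness is not attached to the report; the following mongo shell sketch only approximates the delete workload described above. The database name, collection name, and delete filter are assumptions, and the loop is crudely rate-limited to roughly 700 deletes per second:

// Hypothetical approximation of the delete test; db/collection/filter are assumed.
const coll = db.getSiblingDB("sessions").records;
const targetTps = 700;
for (let s = 0; s < 60; s++) {                      // run for about 60 seconds
    const start = new Date();
    for (let i = 0; i < targetTps; i++) {
        coll.deleteOne({ expired: true });          // assumed delete filter
    }
    const elapsed = new Date() - start;
    if (elapsed < 1000) sleep(1000 - elapsed);      // mongo shell sleep() takes milliseconds
}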

       

When the server restarts, it recovers from an unstable checkpoint, fails to sync up with any of the members because it is too stale, and finally goes into maintenance mode.

       

2022-11-02T19:49:41.746+0000 I REPL     [initandlisten] Recovering from an unstable checkpoint (top of oplog: { ts: Timestamp(1667417622, 805), t: 104 }, appliedThrough: { ts: Timestamp(166
2022-11-02T19:49:41.746+0000 I REPL     [initandlisten] Starting recovery oplog application at the appliedThrough: { ts: Timestamp(1667417622, 805), t: 104 }, through the top of the oplog:
2022-11-02T19:49:41.746+0000 I REPL     [initandlisten] No oplog entries to apply for recovery. Start point is at the top of the oplog.
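After the restart, each member's reported state (for example, the affected nodes showing RECOVERING) can be confirmed with a standard shell helper; this line is a suggestion, not from the report:

rs.status().members.forEach(function (m) { print(m.name, m.stateStr); })   // e.g. PRIMARY, SECONDARY, RECOVERING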

       

       

Server logs and diagnostic.data are attached. nd2bwa3psm12vb and nd2bwa3psm11va are the affected replica set members, but logs from all four replica members are attached for reference.

       

Assignee: Yuan Fang (yuan.fang@mongodb.com)
Reporter: venkataramans rama (veramasu@hcl.com)
Votes: 1
Watchers: 5
