[SERVER-26271] UnrecoverableRollbackError after election Created: 23/Sep/16  Updated: 30/Sep/16  Resolved: 27/Sep/16

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.2.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Maziyar Panahi Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive mongo-replica1.zip     Zip Archive mongo-replica2.zip    
Issue Links:
Duplicate
duplicates SERVER-7200 use oplog as op buffer on secondaries Closed
Related
related to SERVER-25848 Enforce single sync source during rol... Closed
Operating System: ALL
Participants:

 Description   

Hello,

Last night the primary node stepped down for a reason I don't know, and a secondary became the primary. The former primary immediately displayed this error:

2016-09-23T04:10:09.012+0200 I NETWORK  [conn4831] end connection 10.0.0.10:40780 (3 connections now open)
2016-09-23T04:10:09.349+0200 I REPL     [ReplicationExecutor] syncing from: mongodb-replica2:27017
2016-09-23T04:10:09.353+0200 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to mongodb-replica2:27017
2016-09-23T04:10:09.355+0200 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 238, timestamp: Sep 23 04:10:05:b7). source's GTE: (term: 239, timestamp: Sep 23 04:10:06:1) hashes: (298476498444083566/5675657933368738701)
2016-09-23T04:10:09.356+0200 I -        [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 239, timestamp: Sep 23 04:10:06:1) > our last optime: (term: 238, timestamp: Sep 23 04:10:05:b7)
2016-09-23T04:10:09.356+0200 I -        [rsBackgroundSync]
 
***aborting after fassert() failure

This morning I tried to start mongod on the former primary, but the same error appeared after a few seconds.

I had this problem before, but that was due to an ungraceful shutdown and a bad RAID array. In this replica set, each node is on a RAID 10 SSD array with plenty of memory and CPU. Journaling is enabled on both mongod and the file system. And again, there was no ungraceful shutdown; it just happened during the election, even though the primary had the highest priority.

Also, I had just finished a 2-day mongod --repair on the primary before this incident, and I rsynced the disk image to the other secondaries, so they were all identical with no corruption.

I attached logs, diagnostics and the WiredTiger files for the former primary and the new primary. The incident happened at "2016-09-23T04:10".

Please let me know if you need anything else.

rs0:PRIMARY> rs.conf()
{
        "_id" : "rs0",
        "version" : 468974,
        "protocolVersion" : NumberLong(1),
        "members" : [
                {
                        "_id" : 4,
                        "host" : "mongodb-replica1:27017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 3,
                        "tags" : {
 
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 5,
                        "host" : "mongodb-replica2:27017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {
 
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 6,
                        "host" : "mongodb-replica3:27017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {
 
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                }
        ],
        "settings" : {
                "chainingAllowed" : true,
                "heartbeatIntervalMillis" : 2000,
                "heartbeatTimeoutSecs" : 10,
                "electionTimeoutMillis" : 10000,
                "getLastErrorModes" : {
 
                },
                "getLastErrorDefaults" : {
                        "w" : 1,
                        "wtimeout" : 0
                }
        }
}

Log from new primary:

2016-09-23T04:10:05.189+0200 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2016-09-23T04:10:05.189+0200 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2016-09-23T04:10:05.189+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-replica1:27017
2016-09-23T04:10:05.189+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-replica3:27017
2016-09-23T04:10:05.601+0200 I REPL     [ReplicationExecutor] could not find member to sync from
2016-09-23T04:10:05.601+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-replica1:27017
2016-09-23T04:10:05.601+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-replica3:27017



 Comments   
Comment by Maziyar Panahi [ 30/Sep/16 ]

Hi Thomas,

Thanks for your reply, I was guessing the same thing. I disabled chained replication for the time being and also resynced the affected node from a good source via rsync.

All 3 nodes have now been working properly with no issues for one week.

I will keep track of SERVER-25848 and wait for version 3.4.

Many thanks Thomas and have a great day,

Cheers,
Maziyar

Comment by Kelsey Schubert [ 27/Sep/16 ]

Hi Maziyar,

Thank you for the very detailed report and for including the relevant logs. I have examined the logs and determined that this issue will be resolved in MongoDB 3.4 by SERVER-7200. Additionally, a backport of SERVER-25848 to MongoDB 3.2 is planned, which includes fixes that resolve this issue. Please feel free to vote for SERVER-25848 and watch it for updates.

In the interim, I would recommend disabling chained replication, as we have observed that this mitigates the issue. To resolve this fatal assertion, please perform a clean resync on the affected secondary.
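
For reference, chained replication can be disabled with a replica set reconfiguration run from the current primary. A minimal sketch in the mongo shell, assuming the member configuration shown in the rs.conf() output above:

// Run on the current primary: fetch the config, turn off chaining, and reconfigure.
cfg = rs.conf()
cfg.settings.chainingAllowed = false
rs.reconfig(cfg)

With chainingAllowed set to false, secondaries sync directly from the primary rather than from another secondary.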

Kind regards,
Thomas
