[SERVER-33446] PowerPC rollback failure Created: 22/Feb/18  Updated: 02/Apr/18  Resolved: 10/Mar/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.7.2
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Kevin Albertson Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File arbiter.log     Text File primary.log     Text File secondary.log     Text File server_status_after_tests.txt    
Participants:

 Description   

Tests for the C driver had a failure on PowerPC which looks like a mongod failure. I haven't yet been able to reproduce it. Looking at the logs, the secondary was unable to roll back.

The replica set is initiated with the following config:

{
    _id: "repl0",
    version: 1,
    protocolVersion: 1,
    members: [{
        _id: 0,
        host: "localhost:27017",
        arbiterOnly: false,
        buildIndexes: true,
        hidden: false,
        priority: 1.0,
        tags: {
            ordinal: "one",
            dc: "ny"
        },
        slaveDelay: 0,
        votes: 1
    }, {
        _id: 1,
        host: "localhost:27018",
        arbiterOnly: false,
        buildIndexes: true,
        hidden: false,
        priority: 1.0,
        tags: {
            ordinal: "two",
            dc: "pa"
        },
        slaveDelay: 0,
        votes: 1
    }, {
        _id: 2,
        host: "localhost:27019",
        arbiterOnly: true,
        buildIndexes: true,
        hidden: false,
        priority: 0.0,
        tags: {},
        slaveDelay: 0,
        votes: 1
    }],
    settings: {
        chainingAllowed: true,
        heartbeatIntervalMillis: 2000,
        heartbeatTimeoutSecs: 10,
        electionTimeoutMillis: 10000,
        catchUpTimeoutMillis: -1,
        catchUpTakeoverDelayMillis: 30000,
        getLastErrorModes: {},
        getLastErrorDefaults: {
            w: 1,
            wtimeout: 0
        },
        replicaSetId: ObjectId('5a8d8aeb5e061defabdebc4d')
    }
}
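
For context, a config like the one above would typically be applied from the mongo shell; the snippet below is only an illustrative sketch (the variable name cfg is mine, not from the logs):

// Hypothetical sketch: apply the config shown above via the mongo shell.
cfg = { _id: "repl0", version: 1, protocolVersion: 1,
        members: [ /* members as above */ ],
        settings: { /* settings as above */ } }
rs.initiate(cfg)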

The logs show the nodes transitioning to the following roles:
localhost:27017 - primary
localhost:27018 - secondary
localhost:27019 - arbiter
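
As a quick sanity check (not something taken from the original logs), the member states can be confirmed from the mongo shell with rs.status():

// Print each member's host and state; the expected output here would be
// PRIMARY for :27017, SECONDARY for :27018, and ARBITER for :27019.
rs.status().members.forEach(function (m) { print(m.name + " " + m.stateStr); })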

The secondary later fasserts with a failure:

2018-02-21T15:12:43.034+0000 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: need to rollback, but unable to determine common point between local and remote oplog: NoMatchingDocument: RS100 reached beginning of remote oplog [1]

It looks like rollback starts at this line:

2018-02-21T15:08:44.330+0000 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1519225716, 2), t: 1 }. source's GTE: { ts: Timestamp(1519225716, 3), t: 1 } hashes: (9214982500984603846/-255950376535661437)

It isn't clear to me that this is a bug, but it also seems unlikely that the C driver tests generate so much data that the needed entries roll off the oplog. Can someone confirm whether this is a server bug, or help explain what is going on here?
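
One way to check whether entries have rolled off (a suggestion of mine, not something done in this ticket) is to compare the oplog window on the sync source with each member's last applied optime:

// On the sync source (the primary here): prints the configured oplog size,
// used size, and the first/last oplog event times.
rs.printReplicationInfo()
// On any member: each member's last applied optime from replSetGetStatus.
db.adminCommand({ replSetGetStatus: 1 }).members.forEach(function (m) {
    print(m.name + " " + tojson(m.optime));
})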



 Comments   
Comment by Kelsey Schubert [ 06/Mar/18 ]

I think this is the most likely explanation. If you wanted to continue to look into it, I'd suggest reviewing the diagnostic.data after trying to reproduce the issue and looking at the oplog stats as well as the repl lag metrics.
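
For example (a hedged sketch of the kind of checks meant above; the exact commands are mine, not from this ticket):

// Replication lag of each secondary relative to the primary (3.x shell helper name).
rs.printSlaveReplicationInfo()
// Oplog collection stats, including the configured cap (maxSize) and current size.
db.getSiblingDB("local").oplog.rs.stats()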

Let me know if you'd like help continuing to investigate or if you're comfortable closing this ticket.

Comment by Kevin Albertson [ 22/Feb/18 ]

Looking at our logs, each node has the oplog size set to 100MB. To get at least a rough idea of how much data is generated, I ran the tests on a clean replica set and checked the server status afterward (attached).

...
"document" : {
    "deleted" : NumberLong(800453),
    "inserted" : NumberLong(1551189),
    "returned" : NumberLong(2367735),
    "updated" : NumberLong(281)
},
...

So it looks like the tests are applying more operations than I suspected. Perhaps this, combined with the small oplog, is causing the oplog to roll over?
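
(Rough arithmetic, assuming an average oplog entry size on the order of 100 bytes, which is a guess rather than a measured figure: 1,551,189 inserts + 800,453 deletes + 281 updates is roughly 2.35 million oplog entries, i.e. on the order of a few hundred megabytes of oplog data. That is well beyond what a 100MB oplog can hold, so the oplog wrapping during the run looks plausible.)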

Comment by William Schultz (Inactive) [ 22/Feb/18 ]

As a rough start, do you have any idea of how much data was generated during the span of that test? Also, what were the oplog sizes of each node, and how much data was in their oplogs at the end of the test? The rollback failure you see may be due to some other issue, but eliminating the oplog rollover case is a good start.
