Details
Question
Status: Closed
Major - P3
Resolution: Incomplete
3.7.2
None
None
Description
Tests for the C driver had a failure on PowerPC which looks like a mongod failure. I haven't yet been able to reproduce it. The logs show a secondary that is unable to roll back.
The replica set is initiated with the following config:
{
    _id: "repl0",
    version: 1,
    protocolVersion: 1,
    members: [{
        _id: 0,
        host: "localhost:27017",
        arbiterOnly: false,
        buildIndexes: true,
        hidden: false,
        priority: 1.0,
        tags: {
            ordinal: "one",
            dc: "ny"
        },
        slaveDelay: 0,
        votes: 1
    }, {
        _id: 1,
        host: "localhost:27018",
        arbiterOnly: false,
        buildIndexes: true,
        hidden: false,
        priority: 1.0,
        tags: {
            ordinal: "two",
            dc: "pa"
        },
        slaveDelay: 0,
        votes: 1
    }, {
        _id: 2,
        host: "localhost:27019",
        arbiterOnly: true,
        buildIndexes: true,
        hidden: false,
        priority: 0.0,
        tags: {},
        slaveDelay: 0,
        votes: 1
    }],
    settings: {
        chainingAllowed: true,
        heartbeatIntervalMillis: 2000,
        heartbeatTimeoutSecs: 10,
        electionTimeoutMillis: 10000,
        catchUpTimeoutMillis: -1,
        catchUpTakeoverDelayMillis: 30000,
        getLastErrorModes: {},
        getLastErrorDefaults: {
            w: 1,
            wtimeout: 0
        },
        replicaSetId: ObjectId('5a8d8aeb5e061defabdebc4d')
    }
}
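To make the shape of this topology explicit, here is a minimal sketch in plain JavaScript (runnable under Node, not the mongo shell). The `members` array is trimmed to the fields used here, and `summarizeTopology` is a hypothetical helper written for illustration, not a server or driver API. It only counts what the config above already states: two data-bearing members, one arbiter, three votes.

```javascript
// Trimmed copy of the members array from the config above
// (only the fields this sketch reads).
const replMembers = [
  { _id: 0, host: "localhost:27017", arbiterOnly: false, votes: 1 },
  { _id: 1, host: "localhost:27018", arbiterOnly: false, votes: 1 },
  { _id: 2, host: "localhost:27019", arbiterOnly: true, votes: 1 },
];

// Hypothetical helper: split members into data-bearing nodes and arbiters,
// and compute how many votes make a strict majority.
function summarizeTopology(members) {
  const totalVotes = members.reduce((sum, m) => sum + m.votes, 0);
  return {
    dataBearing: members.filter((m) => !m.arbiterOnly).map((m) => m.host),
    arbiters: members.filter((m) => m.arbiterOnly).map((m) => m.host),
    votingMajority: Math.floor(totalVotes / 2) + 1,
  };
}

const topologySummary = summarizeTopology(replMembers);
console.log(topologySummary);
```

Since an arbiter holds no data, only the two data-bearing members have an oplog, so the member on 27018 presumably has just the member on 27017 to sync from.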
The logs show the members transitioning to the following roles:
localhost:27017 - primary
localhost:27018 - secondary
localhost:27019 - arbiter
The secondary later fasserts with this failure:
2018-02-21T15:12:43.034+0000 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: need to rollback, but unable to determine common point between local and remote oplog: NoMatchingDocument: RS100 reached beginning of remote oplog [1]
The rollback appears to start at this line:
2018-02-21T15:08:44.330+0000 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1519225716, 2), t: 1 }. source's GTE: { ts: Timestamp(1519225716, 3), t: 1 } hashes: (9214982500984603846/-255950376535661437)
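To read the two optimes in that message side by side, here is a small sketch in plain JavaScript (Node). `parseOpTimes` and `compareOpTimes` are hypothetical helpers written for this illustration. Ordering by term first and then by timestamp is an assumption about how optimes are compared under protocolVersion 1; it is not taken from the logs.

```javascript
// Message text copied from the log line above.
const rollbackLine =
  "Starting rollback due to OplogStartMissing: Our last op time fetched: " +
  "{ ts: Timestamp(1519225716, 2), t: 1 }. source's GTE: " +
  "{ ts: Timestamp(1519225716, 3), t: 1 } " +
  "hashes: (9214982500984603846/-255950376535661437)";

// Hypothetical helper: extract every "{ ts: Timestamp(s, i), t: n }" optime
// from the line, in order of appearance.
function parseOpTimes(line) {
  const pattern = /\{ ts: Timestamp\((\d+), (\d+)\), t: (-?\d+) \}/g;
  return [...line.matchAll(pattern)].map((m) => ({
    secs: Number(m[1]),
    inc: Number(m[2]),
    term: Number(m[3]),
  }));
}

// Assumed ordering: term first, then seconds, then increment.
// Returns a negative number when a sorts before b.
function compareOpTimes(a, b) {
  return a.term - b.term || a.secs - b.secs || a.inc - b.inc;
}

const [lastFetched, sourceGte] = parseOpTimes(rollbackLine);
console.log("last fetched:", lastFetched);
console.log("source's GTE:", sourceGte);
console.log("last fetched sorts before source's GTE:",
  compareOpTimes(lastFetched, sourceGte) < 0);
```

The two optimes share the same term and the same second, and differ only by one in the increment: the secondary last fetched (1519225716, 2), while the first entry the source returned was (1519225716, 3).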
It isn't clear to me that this is a bug, but it also seems unlikely the C driver tests are generating so much data that the secondary's oplog rolls off. Can someone confirm that this is a server bug, or help explain what is going on here?