Resolution: Incomplete
Major - P3
Affects Version/s: 3.7.2
Component/s: Replication
Tests for the C driver had a failure on PowerPC which looks like a mongod failure. I haven't yet been able to reproduce it. According to the logs, a secondary was unable to roll back.
The replica set is initiated with the following config:
{ _id: "repl0", version: 1, protocolVersion: 1, members: [{ _id: 0, host: "localhost:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: { ordinal: "one", dc: "ny" }, slaveDelay: 0, votes: 1 }, { _id: 1, host: "localhost:27018", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: { ordinal: "two", dc: "pa" }, slaveDelay: 0, votes: 1 }, { _id: 2, host: "localhost:27019", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 0.0, tags: {}, slaveDelay: 0, votes: 1 }], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: -1, catchUpTakeoverDelayMillis: 30000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('5a8d8aeb5e061defabdebc4d') } }
The logs show the members transitioning to the following roles:
localhost:27017 - primary
localhost:27018 - secondary
localhost:27019 - arbiter
The secondary later fasserts with this failure:
2018-02-21T15:12:43.034+0000 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: need to rollback, but unable to determine common point between local and remote oplog: NoMatchingDocument: RS100 reached beginning of remote oplog [1]
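For readers unfamiliar with the message above, here is a simplified Python model of a common-point search between two oplogs. It is a sketch under assumptions, not the server's code: `find_common_point` and the `(ts, term)` tuple layout are hypothetical, and the real rollback logic checks more than this (the log lines here also mention hashes).

```python
from typing import List, Tuple

# Hypothetical, simplified oplog entry: ((seconds, increment), term).
Entry = Tuple[Tuple[int, int], int]


def find_common_point(local: List[Entry], remote: List[Entry]) -> Entry:
    """Walk both oplogs (ordered oldest to newest) backward until an
    identical entry is found. Raise LookupError if either oplog is
    exhausted before a match is found."""
    i, j = len(local) - 1, len(remote) - 1
    while i >= 0 and j >= 0:
        if local[i] == remote[j]:
            return local[i]
        local_ts, remote_ts = local[i][0], remote[j][0]
        if local_ts > remote_ts:
            i -= 1  # local entry is newer, step back locally
        elif local_ts < remote_ts:
            j -= 1  # remote entry is newer, step back remotely
        else:
            # Same timestamp but different term: diverged, step back on both.
            i -= 1
            j -= 1
    if j < 0:
        raise LookupError("reached beginning of remote oplog")
    raise LookupError("reached beginning of local oplog")
```

In this model, the "reached beginning of remote oplog" case arises when the oldest entry still present on the sync source is newer than everything the two nodes share.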
The secondary appears to start the rollback on this line:
2018-02-21T15:08:44.330+0000 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1519225716, 2), t: 1 }. source's GTE: { ts: Timestamp(1519225716, 3), t: 1 } hashes: (9214982500984603846/-255950376535661437)
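My reading of that line, sketched in Python: the first entry returned by the sync source (its "GTE") is expected to match the last op time this node fetched, and a mismatch triggers rollback. The names `OpTime` and `needs_rollback` are hypothetical, and this models only the op time comparison, not the hash check that the log line also reports.

```python
from typing import NamedTuple, Optional, Tuple


class OpTime(NamedTuple):
    ts: Tuple[int, int]  # (seconds, increment), as in Timestamp(secs, inc)
    term: int


def needs_rollback(last_fetched: OpTime, source_gte: Optional[OpTime]) -> bool:
    """Return True when the source's first entry at or after our last
    fetched op time is missing or is not that same op time."""
    return source_gte is None or source_gte != last_fetched


# Values taken from the log line above.
ours = OpTime(ts=(1519225716, 2), term=1)
source = OpTime(ts=(1519225716, 3), term=1)
print(needs_rollback(ours, source))
```

With the values from the log, the source's first entry is one increment past ours, so the check in this sketch reports that a rollback is needed.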
It isn't clear to me that this is a bug, but it also seems unlikely the C driver tests are generating so much data that the secondary's oplog rolls off. Can someone confirm that this is a server bug, or help explain what is going on here?