Core Server / SERVER-25026

Secondary aborts immediately after losing a primary election, and will not restart

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 3.2.8
    • Fix Version/s: None
    • Component/s: Replication
    • Labels: None
    • Operating System: ALL
    • Steps To Reproduce:

      Three-member replica set: m1 (priority 3), m2 (priority 2), m3 (priority 1). While operating normally, from m1:

      rs.stepDown(120); use admin; db.shutdownServer();

      Wait ~2 minutes, then restart m1.
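
      The steps above can be sketched as a shell sequence. The hostnames, port, and config path below are hypothetical placeholders, not from the ticket; adjust them to your deployment.

      ```shell
      # Hypothetical repro sketch for a 3-member replica set m1/m2/m3
      # with priorities 3/2/1.

      # From m1 (the priority-3 primary): step down for 120s, then shut down.
      mongo --host m1.example.net:27017 --eval 'rs.stepDown(120)'
      mongo --host m1.example.net:27017 admin --eval 'db.shutdownServer()'

      # Wait ~2 minutes so the new primary accepts writes in a new term.
      sleep 120

      # Restart m1; per this report, the losing node may abort with fassert 18750.
      mongod --config /etc/mongod.conf
      ```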

      Description

      After rebooting a system, the node that had been primary lost an election to the node that is now primary. During that process, and in the seconds that followed, the node logged an error, got stuck in "RECOVERING", and then crashed with a fatal assertion.

      DAY:15:38.702-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 21
      DAY:15:38.703-0500 I REPL     [replExecDBWorker-1] transition to SECONDARY
      DAY:15:38.710-0500 I NETWORK  [conn244] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [5.5.5.12:58244]
      DAY:15:39.248-0500 I REPL     [ReplicationExecutor] Member mongo01.domain.us:27017 is now in state PRIMARY
      DAY:15:39.694-0500 I REPL     [ReplicationExecutor] syncing from: mongo01.domain.us:27017
      DAY:15:39.696-0500 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to mongo01.domain.us:27017
      DAY:15:39.697-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongo01.domain.us:27017
      DAY:15:39.698-0500 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 20, timestamp: Jul 12 22:15:38:1ce). source's GTE: (term: 21, timestamp: Jul 12 22:15:38:1ce) hashes: (7825714044400008130/-1471538608383373916)
      DAY:15:39.698-0500 I REPL     [rsBackgroundSync] beginning rollback
      DAY:15:39.698-0500 I REPL     [rsBackgroundSync] rollback 0
      DAY:15:39.698-0500 I REPL     [ReplicationExecutor] transition to ROLLBACK
      DAY:15:39.699-0500 I REPL     [rsBackgroundSync] rollback 1
      DAY:15:39.700-0500 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
      DAY:15:39.700-0500 I REPL     [rsBackgroundSync] rollback our last optime:   Jul 12 22:15:38:1ce
      DAY:15:39.700-0500 I REPL     [rsBackgroundSync] rollback their last optime: Jul 12 22:15:39:167
      DAY:15:39.700-0500 I REPL     [rsBackgroundSync] rollback diff in end of log times: -1 seconds
      DAY:15:39.707-0500 I QUERY    [conn545] assertion 13436 cannot read from oplog collection while in rollback ns:local.oplog.rs query:{}
      DAY:15:39.707-0500 I QUERY    [conn545]  ntoskip:0 ntoreturn:1
      DAY:15:39.809-0500 I QUERY    [conn547] assertion 13436 cannot read from oplog collection while in rollback ns:local.oplog.rs query:{}
      DAY:15:39.809-0500 I QUERY    [conn547]  ntoskip:0 ntoreturn:1
      DAY:15:39.907-0500 I REPL     [rsBackgroundSync] rollback 3 fixup
      DAY:15:39.908-0500 I REPL     [rsBackgroundSync] rollback 3.5
      DAY:15:39.908-0500 I REPL     [rsBackgroundSync] rollback 4 n:2
      DAY:15:39.908-0500 I REPL     [rsBackgroundSync] minvalid=(term: 21, timestamp: Jul 12 22:15:39:1df)
      DAY:15:39.908-0500 I REPL     [rsBackgroundSync] rollback 4.6
      DAY:15:39.908-0500 I REPL     [rsBackgroundSync] rollback 4.7
      DAY:15:39.909-0500 I REPL     [rsBackgroundSync] rollback 5 d:1 u:2
      DAY:15:39.909-0500 I REPL     [rsBackgroundSync] rollback 6
      DAY:15:39.910-0500 I REPL     [rsBackgroundSync] rollback done
      DAY:15:39.910-0500 I REPL     [rsBackgroundSync] rollback finished
      DAY:15:39.910-0500 I NETWORK  [SyncSourceFeedback] Socket say send() errno:9 Bad file descriptor 5.5.6.11:27017
      DAY:15:39.910-0500 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 5.5.6.11:27017
      DAY:15:39.910-0500 I REPL     [SyncSourceFeedback] updateUpstream failed: Location9001: socket exception [SEND_ERROR] for 5.5.6.11:27017, will retry
      DAY:15:39.910-0500 I QUERY    [conn548] assertion 13436 cannot read from oplog collection while in rollback ns:local.oplog.rs query:{}
      DAY:15:39.910-0500 I QUERY    [conn548]  ntoskip:0 ntoreturn:1
      DAY:15:39.920-0500 I REPL     [ReplicationExecutor] transition to RECOVERING
      DAY:15:39.920-0500 I REPL     [ReplicationExecutor] syncing from: mongo03.domain.us:27017
      DAY:15:39.921-0500 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to mongo03.domain.us:27017
      DAY:15:39.922-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongo03.domain.us:27017
      DAY:16:09.931-0500 I REPL     [ReplicationExecutor] re-evaluating sync source because our current sync source's most recent OpTime is (term: 20, timestamp: Jul 12 22:15:38:1cc) which is more than 30s behind member mongo01.domain.us:27017 whose most recent OpTime is (term: 21, timestamp: Jul 12 22:16:09:a8)
      DAY:16:09.932-0500 I REPL     [ReplicationExecutor] syncing from: mongo01.domain.us:27017
      DAY:16:09.933-0500 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to mongo01.domain.us:27017
      DAY:16:09.934-0500 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 20, timestamp: Jul 12 22:15:38:1cc). source's GTE: (term: 21, timestamp: Jul 12 22:15:38:1cd) hashes: (1087207337565601823/-6762428419615138830)
      DAY:16:09.935-0500 I -        [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 21, timestamp: Jul 12 22:15:39:1df) > our last optime: (term: 20, timestamp: Jul 12 22:15:38:1cc)
      DAY:16:09.935-0500 I -        [rsBackgroundSync]
       
      ***aborting after fassert() failure
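
      The fatal assertion above enforces a consistency invariant: a node may only begin rollback if its minValid point is at or before its last applied optime; otherwise it never finished applying ops from an earlier, interrupted rollback and its data is in an unknown state. A minimal Python sketch of that check (not MongoDB's actual source; optimes are modeled as (term, timestamp-counter) pairs, with the counters taken from the log above):

      ```python
      # Sketch of the invariant behind "Fatal assertion 18750
      # UnrecoverableRollbackError". An OpTime orders by (term, timestamp).
      from functools import total_ordering

      @total_ordering
      class OpTime:
          def __init__(self, term, ts):
              self.term, self.ts = term, ts
          def __eq__(self, other):
              return (self.term, self.ts) == (other.term, other.ts)
          def __lt__(self, other):
              return (self.term, self.ts) < (other.term, other.ts)

      def can_attempt_rollback(minvalid, last_optime):
          """Rollback is only safe when the node has applied everything up
          to minValid; if minvalid > last optime, the node is mid-recovery
          and must refuse (the server fasserts instead of proceeding)."""
          return minvalid <= last_optime

      # Values from the log: minvalid (term 21, ts ...:1df) is ahead of the
      # last applied optime (term 20, ts ...:1cc), so the check fails.
      minvalid = OpTime(term=21, ts=0x1df)
      last_optime = OpTime(term=20, ts=0x1cc)
      print(can_attempt_rollback(minvalid, last_optime))  # False -> fassert 18750
      ```

      Because minValid is persisted on disk, the same comparison fails again after restart, which is why the second log excerpt below shows an identical abort.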
      
      

      Followed by this on restart:

      DAY::21:36.465-0500 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/mongodb/data/wt/diagnostic.data'
      DAY::21:36.465-0500 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
      DAY::21:36.467-0500 I NETWORK  [initandlisten] waiting for connections on port 27017
      DAY::21:36.470-0500 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "OnspringCavern01", version: 8, protocolVersion: 1, members: [ { _id: 0, host: "mongo01.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "mongo02.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "mongo03.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 1000, heartbeatTimeoutSecs: 5, electionTimeoutMillis: 6000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('57857435bb2a73b423c7eb4c') } }
      DAY::21:36.470-0500 I REPL     [ReplicationExecutor] This node is mongo02.domain.us:27017 in the config
      DAY::21:36.470-0500 I REPL     [ReplicationExecutor] transition to STARTUP2
      DAY::21:36.470-0500 I REPL     [ReplicationExecutor] Starting replication applier threads
      DAY::21:36.470-0500 I REPL     [ReplicationExecutor] transition to RECOVERING
      DAY::21:36.471-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo01.domain.us:27017
      DAY::21:36.472-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo03.domain.us:27017
      DAY::21:36.476-0500 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "OnspringCavern01", version: 9, protocolVersion: 1, members: [ { _id: 0, host: "mongo01.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "mongo02.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "mongo03.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: false, heartbeatIntervalMillis: 1000, heartbeatTimeoutSecs: 5, electionTimeoutMillis: 6000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('57857435bb2a73b423c7eb4c') } }
      DAY::21:36.476-0500 I REPL     [ReplicationExecutor] This node is mongo02.domain.us:27017 in the config
      DAY::21:36.477-0500 I REPL     [ReplicationExecutor] Member mongo03.domain.us:27017 is now in state SECONDARY
      DAY::21:36.477-0500 I REPL     [ReplicationExecutor] Member mongo01.domain.us:27017 is now in state PRIMARY
      DAY::21:37.002-0500 I FTDC     [ftdc] Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost. OK
      DAY::21:37.471-0500 I REPL     [ReplicationExecutor] syncing from primary: mongo01.domain.us:27017
      DAY::21:37.475-0500 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to mongo01.domain.us:27017
      DAY::21:37.476-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongo01.domain.us:27017
      DAY::21:37.477-0500 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 20, timestamp: Jul 12 22:15:38:1cc). source's GTE: (term: 21, timestamp: Jul 12 22:15:38:1cd) hashes: (1087207337565601823/-6762428419615138830)
      DAY::21:37.477-0500 I -        [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 21, timestamp: Jul 12 22:15:39:1df) > our last optime: (term: 20, timestamp: Jul 12 22:15:38:1cc)
      DAY::21:37.477-0500 I -        [rsBackgroundSync]
       
      ***aborting after fassert() failure
      

        Attachments

        1. apiBasketsCheckoutEU-cgn-mongodb01.log.gz
          16 kB
        2. apiBasketsCheckoutEU-cgn-mongodb02.log.gz
          1.42 MB
        3. apiBasketsCheckoutEU-dfw-mongodb01.log.gz
          1.34 MB
        4. apiBasketsCheckoutEU-dfw-mongodb02.log.gz
          1.34 MB
        5. apiBasketsCheckoutEU-lej-mongodb-arb01.log.gz
          648 kB
        6. cgn.mongodb01.abcEU.crash-2016-09-15.log.gz
          613 kB
        7. cgn.mongodb02.abcEU.crash-2016-09-15.log.gz
          613 kB
        8. dfw.mongodb01.abcEU.crash-2016-09-15.log.gz
          13 kB
        9. dfw.mongodb02.abcEU.crash-2016-09-15.log.gz
          582 kB
        10. flexiApiCGNArbiter-lej-mongodb-arb01.log.gz
          679 kB
        11. flexiApiCGNShardSrv-cgn-mongodb01.log.gz
          2.44 MB
        12. flexiApiCGNShardSrv-cgn-mongodb02.log.gz
          50 kB
        13. flexiApiCGNShardSrv-dfw-mongodb01.log.gz
          1.50 MB
        14. flexiApiCGNShardSrv-dfw-mongodb02.log.gz
          1.50 MB
        15. flexi-cgn.mongodb01.flexiEU.crash-2016-09-15.log.gz
          1019 kB
        16. flexi-cgn.mongodb02.flexiEU.crash-2016-09-15.log.gz
          1023 kB
        17. flexi-dfw.mongodb01.flexiEU.crash-2016-09-15.log.gz
          919 kB
        18. flexi-dfw.mongodb02.flexiEU.crash-2016-09-15.log.gz
          72 kB
        19. flexi-lej.mongodb-arb01flexiEU.crash-2016-09-15.log.gz
          391 kB
        20. lej.mongodb-arb01.abcEU.crash-2016-09-15.log.gz
          290 kB
        21. mongodb.log.2016-10-12.to.2016-10-14.arbiter01.gz
          5 kB
        22. mongodb.log.2016-10-12.to.2016-10-14.mongodb01.cgn.gz
          101 kB
        23. mongodb.log.2016-10-12.to.2016-10-14.mongodb01.dfw.gz
          152 kB
        24. mongodb.log.2016-10-12.to.2016-10-14.mongodb02.cgn.gz
          101 kB
        25. mongodb.log.2016-10-12.to.2016-10-14.mongodb02.dfw.gz
          144 kB

      Issue Links

      Activity

      People

      • Votes: 0
      • Watchers: 14

      Dates

      • Created:
      • Updated:
      • Resolved: