Loading...

Type: Bug
Resolution: Incomplete
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.2.3
Component/s: Replication
Labels:
None

Operating System:
ALL
Steps To Reproduce:

Hide

In a simple two member replset with a single arbiter:
As a temp collection is being written to the PRIMARY, force a shutdown of the SECONDARY.

Show
In a simple two member replset with a single arbiter: As a temp collection is being written to the PRIMARY, force a shutdown of the SECONDARY.
Confidence Status:
None
Work Order:
0

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

When a server came down with a hard stop, it recovered and did a rollback and started to replicate the PRIMARY opslog. As it was replicating the log, it a writer worker exception error that caused a Fatal Assertion. Even after wiping the secondary, the replset replication hit the same error continually. The only solution was to nuke the entire replicaset and arbiters from orbit and rebuild.

PRIMARY:
Metadata: { acquireCount:

{ w: 1 }

}, oplog: { acquireCount:

{ W: 1 }

} } protocol:op_query 155ms
2017-03-25T16:29:36.208+0000 I NETWORK [conn64] end connection 10.196.36.193:20019 (49 connections now open)
2017-03-25T16:29:36.209+0000 I NETWORK [initandlisten] connection accepted from 10.196.36.193:20029 #69 (50 connections now open)
2017-03-25T16:29:36.211+0000 I NETWORK [conn69] end connection 10.196.36.193:20029 (49 connections now open)
2017-03-25T16:29:36.407+0000 I NETWORK [conn55] end connection 10.196.36.193:20003 (48 connections now open)
2017-03-25T16:29:36.408+0000 I NETWORK [conn62] end connection 10.196.36.193:20015 (48 connections now open)
2017-03-25T16:29:36.917+0000 I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to server1.site:33000
2017-03-25T16:29:36.918+0000 I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2017-03-25T16:29:44.935+0000 I REPL [ReplicationExecutor] Error in heartbeat request to server1.site:33000; HostUnreachable: Connection refused
....
2017-03-25T16:29:45.659+0000 I COMMAND [conn13] command admin.$cmd command: isMaster

{ ismaster: 1 }

keyUpdates:0 writeConflicts:0 ...
SECONDARY
2017-03-25T16:29:28.219+0000 I REPL 2017-03-25T16:29:28.219+0000 I REPL 2017-03-25T16:29:28.219+0000 I REPL 2017-03-25T16:29:28.221+0000 I REPL 2017-03-25T16:29:28.224+0000 I ASIO 2017-03-25T16:29:28.224+0000 I ASIO 2017-03-25T16:29:28.225+0000 I REPL 2017-03-25T16:29:28.225+0000 I REPL 2017-03-25T16:29:28.227+0000 I REPL ...
2017-03-25T16:29:34.223+0000 I REPL 2017-03-25T16:29:34.227+0000 I REPL 2017-03-25T16:29:34.231+0000 I ASIO 2017-03-25T16:29:34.238+0000 I REPL 2017-03-25T16:29:34.238+0000 I REPL 2017-03-25T16:29:34.238+0000 I REPL 2017-03-25T16:29:34.238+0000 I REPL 2017-03-25T16:29:34.239+0000 I NETWORK 2017-03-25T16:29:34.239+0000 I REPL ..
2017-03-25T16:29:34.241+0000 I REPL 2017-03-25T16:29:34.242+0000 I REPL 2017-03-25T16:29:34.242+0000 I REPL 2017-03-25T16:29:34.242+0000 I REPL 2017-03-25T16:29:34.747+0000 I REPL 2017-03-25T16:29:34.787+0000 I REPL 2017-03-25T16:29:34.891+0000 I NETWORK 2017-03-25T16:29:34.915+0000 I REPL 2017-03-25T16:29:34.915+0000 I REPL 2017-03-25T16:29:34.916+0000 I REPL 2017-03-25T16:29:34.945+0000 I INDEX numYields:0 reslen:397 locks:{} protocol:op_query 521ms
[ReplicationExecutor] This node is server1.site:33000 in the config
[ReplicationExecutor] transition to STARTUP2
[ReplicationExecutor] Starting replication applier threads
[ReplicationExecutor] transition to RECOVERING
[NetworkInterfaceASIO-Replication-0] Successfully connected to arbiter.site:33000
[NetworkInterfaceASIO-Replication-0] Successfully connected to server2.site:33000
[ReplicationExecutor] Member arbiter.site:33000 is now in state ARBITER
[ReplicationExecutor] Member server2.site:33000 is now in state PRIMARY
[ReplicationExecutor] transition to SECONDARY
[ReplicationExecutor] syncing from: server2.site:33000
[SyncSourceFeedback] setting syncSourceFeedback to server2.site:33000
[NetworkInterfaceASIO-BGSync-0] Successfully connected to server2.site:33000
[rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 4, timestamp: Mar 25 16:26:13:21). source's GTE: (term: 5, timestamp: Mar 25 16:29:25:1) hashes: (-7718240682381005025/-2391617886558021771)
[rsBackgroundSync] beginning rollback
[rsBackgroundSync] rollback 0
[ReplicationExecutor] transition to ROLLBACK
[conn4] end connection 10.196.36.203:20550 (5 connections now open)
[rsBackgroundSync] rollback 1
[rsBackgroundSync] rollback 2 FindCommonPoint
[rsBackgroundSync] rollback our last optime: Mar 25 16:26:13:21
[rsBackgroundSync] rollback their last optime: Mar 25 16:29:34:22
[rsBackgroundSync] rollback diff in end of log times: -201 seconds
[rsBackgroundSync] rollback 3 fixup
[rsBackgroundSync] rollback 3.5
[initandlisten] connection accepted from 10.196.36.195:4166 #12 (1 connection now open)
[rsBackgroundSync] rollback 4 n:1
[rsBackgroundSync] minvalid=(term: 5, timestamp: Mar 25 16:29:34:33)
[rsBackgroundSync] rollback 4.1.1 coll resync ConfigData
[rsBackgroundSync] build index on: ConfigData properties: { v: 1, key:

{ _id: 1 }

, name: "id", ns: "ConfigData" }
2017-03-25T16:29:34.945+0000 I INDEX [rsBackgroundSync] building index using bulk method
2017-03-25T16:29:34.946+0000 I INDEX [rsBackgroundSync] build index done. scanned 409 total records. 0 sec
....
2017-03-25T16:29:36.155+0000 I INDEX [rsBackgroundSync] build index done. scanned 769 total records. 0 secs
2017-03-25T16:29:36.155+0000 I REPL [rsBackgroundSync] rollback 4.2
2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] minvalid=(term: 5, timestamp: Mar 25 16:29:36:1)
2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] rollback 4.3
2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] rollback 4.6
2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] rollback drop: Data1
2017-03-25T16:29:36.157+0000 I REPL [rsBackgroundSync] rollback drop: Data2
2017-03-25T16:29:36.165+0000 I REPL [rsBackgroundSync] rollback drop: Data3
2017-03-25T16:29:36.179+0000 I REPL [rsBackgroundSync] rollback 4.7
2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback 5 d:4 u:71
2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback 6
2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback done
2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback finished
2017-03-25T16:29:36.197+0000 I NETWORK [SyncSourceFeedback] Socket say send() errno:9 Bad file descriptor 10.196.36.195:33000
2017-03-25T16:29:36.207+0000 I REPL [ReplicationExecutor] transition to RECOVERING
2017-03-25T16:29:36.208+0000 I REPL [ReplicationExecutor] syncing from: server2.site:33000
017-03-25T16:29:36.356+0000 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 10.196.36.195:33000
2017-03-25T16:29:36.356+0000 I REPL [SyncSourceFeedback] updateUpstream failed: Location9001: socket exception [SEND_ERROR] for 10.196.36.195:33000, will retry
2017-03-25T16:29:36.365+0000 F REPL [repl writer worker 2] writer worker caught exception: :: caused by :: 26 Failed to apply insert due to missing collection:

{ ts: Timestamp 1490459371000|82, t: 5, h: 7141407996441188723, v: 2, op: "i", <collection data here> }

2017-03-25T16:29:36.365+0000 I - [repl writer worker 2] Fatal Assertion 16360
2017-03-25T16:29:36.365+0000 I - [repl writer worker 2]

Details

Description

Attachments

Activity

People

Dates