-
Type:
Bug
-
Resolution: Incomplete
-
Priority:
Major - P3
-
None
-
Affects Version/s: 3.2.3
-
Component/s: Replication
-
None
-
ALL
-
-
None
-
0
-
None
-
None
-
None
-
None
-
None
-
None
When a server came down with a hard stop, it recovered and did a rollback and started to replicate the PRIMARY opslog. As it was replicating the log, it a writer worker exception error that caused a Fatal Assertion. Even after wiping the secondary, the replset replication hit the same error continually. The only solution was to nuke the entire replicaset and arbiters from orbit and rebuild.
PRIMARY:
Metadata: { acquireCount:
}, oplog: { acquireCount:
{ W: 1 } } } protocol:op_query 155ms
2017-03-25T16:29:36.208+0000 I NETWORK [conn64] end connection 10.196.36.193:20019 (49 connections now open)
2017-03-25T16:29:36.209+0000 I NETWORK [initandlisten] connection accepted from 10.196.36.193:20029 #69 (50 connections now open)
2017-03-25T16:29:36.211+0000 I NETWORK [conn69] end connection 10.196.36.193:20029 (49 connections now open)
2017-03-25T16:29:36.407+0000 I NETWORK [conn55] end connection 10.196.36.193:20003 (48 connections now open)
2017-03-25T16:29:36.408+0000 I NETWORK [conn62] end connection 10.196.36.193:20015 (48 connections now open)
2017-03-25T16:29:36.917+0000 I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to server1.site:33000
2017-03-25T16:29:36.918+0000 I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2017-03-25T16:29:44.935+0000 I REPL [ReplicationExecutor] Error in heartbeat request to server1.site:33000; HostUnreachable: Connection refused
....
2017-03-25T16:29:45.659+0000 I COMMAND [conn13] command admin.$cmd command: isMaster
keyUpdates:0 writeConflicts:0 numYields:0 reslen:397 locks:{} protocol:op_query 521ms
...
SECONDARY
2017-03-25T16:29:28.219+0000 I REPL [ReplicationExecutor] This node is server1.site:33000 in the config
2017-03-25T16:29:28.219+0000 I REPL [ReplicationExecutor] transition to STARTUP2
2017-03-25T16:29:28.219+0000 I REPL [ReplicationExecutor] Starting replication applier threads
2017-03-25T16:29:28.221+0000 I REPL [ReplicationExecutor] transition to RECOVERING
2017-03-25T16:29:28.224+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to arbiter.site:33000
2017-03-25T16:29:28.224+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to server2.site:33000
2017-03-25T16:29:28.225+0000 I REPL [ReplicationExecutor] Member arbiter.site:33000 is now in state ARBITER
2017-03-25T16:29:28.225+0000 I REPL [ReplicationExecutor] Member server2.site:33000 is now in state PRIMARY
2017-03-25T16:29:28.227+0000 I REPL [ReplicationExecutor] transition to SECONDARY
...
2017-03-25T16:29:34.223+0000 I REPL [ReplicationExecutor] syncing from: server2.site:33000
2017-03-25T16:29:34.227+0000 I REPL [SyncSourceFeedback] setting syncSourceFeedback to server2.site:33000
2017-03-25T16:29:34.231+0000 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to server2.site:33000
2017-03-25T16:29:34.238+0000 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 4, timestamp: Mar 25 16:26:13:21). source's GTE: (term: 5, timestamp: Mar 25 16:29:25:1) hashes: (-7718240682381005025/-2391617886558021771)
2017-03-25T16:29:34.238+0000 I REPL [rsBackgroundSync] beginning rollback
2017-03-25T16:29:34.238+0000 I REPL [rsBackgroundSync] rollback 0
2017-03-25T16:29:34.238+0000 I REPL [ReplicationExecutor] transition to ROLLBACK
2017-03-25T16:29:34.239+0000 I NETWORK [conn4] end connection 10.196.36.203:20550 (5 connections now open)
2017-03-25T16:29:34.239+0000 I REPL [rsBackgroundSync] rollback 1
..
2017-03-25T16:29:34.241+0000 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint
2017-03-25T16:29:34.242+0000 I REPL [rsBackgroundSync] rollback our last optime: Mar 25 16:26:13:21
2017-03-25T16:29:34.242+0000 I REPL [rsBackgroundSync] rollback their last optime: Mar 25 16:29:34:22
2017-03-25T16:29:34.242+0000 I REPL [rsBackgroundSync] rollback diff in end of log times: -201 seconds
2017-03-25T16:29:34.747+0000 I REPL [rsBackgroundSync] rollback 3 fixup
2017-03-25T16:29:34.787+0000 I REPL [rsBackgroundSync] rollback 3.5
2017-03-25T16:29:34.891+0000 I NETWORK [initandlisten] connection accepted from 10.196.36.195:4166 #12 (1 connection now open)
2017-03-25T16:29:34.915+0000 I REPL [rsBackgroundSync] rollback 4 n:1
2017-03-25T16:29:34.915+0000 I REPL [rsBackgroundSync] minvalid=(term: 5, timestamp: Mar 25 16:29:34:33)
2017-03-25T16:29:34.916+0000 I REPL [rsBackgroundSync] rollback 4.1.1 coll resync ConfigData
2017-03-25T16:29:34.945+0000 I INDEX [rsBackgroundSync] build index on: ConfigData properties: { v: 1, key:
, name: "id", ns: "ConfigData" }
2017-03-25T16:29:34.945+0000 I INDEX [rsBackgroundSync] building index using bulk method
2017-03-25T16:29:34.946+0000 I INDEX [rsBackgroundSync] build index done. scanned 409 total records. 0 sec
....
2017-03-25T16:29:36.155+0000 I INDEX [rsBackgroundSync] build index done. scanned 769 total records. 0 secs
2017-03-25T16:29:36.155+0000 I REPL [rsBackgroundSync] rollback 4.2
2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] minvalid=(term: 5, timestamp: Mar 25 16:29:36:1)
2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] rollback 4.3
2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] rollback 4.6
2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] rollback drop: Data1
2017-03-25T16:29:36.157+0000 I REPL [rsBackgroundSync] rollback drop: Data2
2017-03-25T16:29:36.165+0000 I REPL [rsBackgroundSync] rollback drop: Data3
2017-03-25T16:29:36.179+0000 I REPL [rsBackgroundSync] rollback 4.7
2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback 5 d:4 u:71
2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback 6
2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback done
2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback finished
2017-03-25T16:29:36.197+0000 I NETWORK [SyncSourceFeedback] Socket say send() errno:9 Bad file descriptor 10.196.36.195:33000
2017-03-25T16:29:36.207+0000 I REPL [ReplicationExecutor] transition to RECOVERING
2017-03-25T16:29:36.208+0000 I REPL [ReplicationExecutor] syncing from: server2.site:33000
017-03-25T16:29:36.356+0000 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 10.196.36.195:33000
2017-03-25T16:29:36.356+0000 I REPL [SyncSourceFeedback] updateUpstream failed: Location9001: socket exception [SEND_ERROR] for 10.196.36.195:33000, will retry
2017-03-25T16:29:36.365+0000 F REPL [repl writer worker 2] writer worker caught exception: :: caused by :: 26 Failed to apply insert due to missing collection:
2017-03-25T16:29:36.365+0000 I - [repl writer worker 2] Fatal Assertion 16360
2017-03-25T16:29:36.365+0000 I - [repl writer worker 2]