[SERVER-27547] After adjusting the system time, MongoDB nodes remain in RECOVERING Created: 30/Dec/16  Updated: 30/Jan/17  Resolved: 30/Jan/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: eveninger Assignee: Spencer Brody (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File screenshot-1.png    
Issue Links:
Duplicate
duplicates SERVER-25145 During rollback (or w/minvalid invali... Closed
Related
related to SERVER-27403 Consider term and rbid when validatin... Closed
Operating System: ALL
Steps To Reproduce:

0. Replica set with 3 nodes.
1. Shut down all 3 nodes.
2. Adjust the system time from 2017-03-14 back to 2016-12-17 (with date -s).
3. Start all 3 nodes.
4. One node becomes PRIMARY; the other two stay in RECOVERING.

Participants:

 Description   

grep -v "[initandlisten]|[conn" log

--------------------------------node 1-------------------------------------
2017-03-14T14:41:57.607+0800 I STORAGE [signalProcessingThread] shutdown: removing fs lock...
2017-03-14T14:41:57.608+0800 I CONTROL [signalProcessingThread] dbexit: rc: 0
2016-12-17T12:50:25.846+0800 I CONTROL [main] ***** SERVER RESTARTED *****
2016-12-17T12:50:26.921+0800 I NETWORK [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-12-17T12:50:26.965+0800 I REPL [ReplicationExecutor] New replica set config in use: { _id: "fsp", version: 1, protocolVersion: 1, members: [ { _id: 0, host: "172.28.1.140:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "172.28.1.141:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "172.28.1.142:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('58c552b12d4be615ca28e5b1') } }
2016-12-17T12:50:26.965+0800 I REPL [ReplicationExecutor] This node is 172.28.1.140:27017 in the config
2016-12-17T12:50:26.965+0800 I REPL [ReplicationExecutor] transition to STARTUP2
2016-12-17T12:50:26.965+0800 I REPL [ReplicationExecutor] Starting replication applier threads
2016-12-17T12:50:26.965+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.141:27017
2016-12-17T12:50:26.965+0800 I REPL [ReplicationExecutor] transition to RECOVERING
2016-12-17T12:50:26.965+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.142:27017
2016-12-17T12:50:26.966+0800 I REPL [ReplicationExecutor] transition to SECONDARY
2016-12-17T12:50:27.021+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 172.28.1.141:27017
2016-12-17T12:50:27.021+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 172.28.1.142:27017
2016-12-17T12:50:27.021+0800 I REPL [ReplicationExecutor] Member 172.28.1.141:27017 is now in state SECONDARY
2016-12-17T12:50:27.022+0800 I REPL [ReplicationExecutor] Member 172.28.1.142:27017 is now in state PRIMARY
2016-12-17T12:51:52.989+0800 I REPL [ReplicationExecutor] syncing from: 172.28.1.142:27017
2016-12-17T12:51:53.013+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Connecting to 172.28.1.142:27017
2016-12-17T12:51:53.013+0800 I REPL [SyncSourceFeedback] setting syncSourceFeedback to 172.28.1.142:27017
2016-12-17T12:51:53.045+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to 172.28.1.142:27017
2016-12-17T12:51:53.048+0800 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 11, timestamp: Mar 14 14:47:57:3c9d). source's GTE: (term: 12, timestamp: Mar 14 14:47:57:3c9d) hashes: (-1985836079535698320/4988593413792691282)
2016-12-17T12:51:53.050+0800 I REPL [rsBackgroundSync] beginning rollback
2016-12-17T12:51:53.050+0800 I REPL [rsBackgroundSync] rollback 0
2016-12-17T12:51:53.050+0800 I REPL [ReplicationExecutor] transition to ROLLBACK
2016-12-17T12:51:53.052+0800 I REPL [rsBackgroundSync] rollback 1
2016-12-17T12:51:53.079+0800 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint
2016-12-17T12:51:53.080+0800 I REPL [rsBackgroundSync] rollback our last optime: Mar 14 14:47:57:3c9d
2016-12-17T12:51:53.080+0800 I REPL [rsBackgroundSync] rollback their last optime: Mar 14 14:47:57:3d65
2016-12-17T12:51:53.080+0800 I REPL [rsBackgroundSync] rollback diff in end of log times: 0 seconds
2016-12-17T12:51:53.343+0800 I REPL [rsBackgroundSync] rollback 3 fixup
2016-12-17T12:51:53.362+0800 I REPL [rsBackgroundSync] rollback 3.5
2016-12-17T12:51:53.363+0800 I REPL [rsBackgroundSync] rollback 4 n:2
2016-12-17T12:51:53.363+0800 I REPL [rsBackgroundSync] minvalid=(term: 12, timestamp: Mar 14 14:47:57:3d70)
2016-12-17T12:51:53.364+0800 I REPL [rsBackgroundSync] rollback 4.6
2016-12-17T12:51:53.364+0800 I REPL [rsBackgroundSync] rollback 4.7
2016-12-17T12:51:53.375+0800 I REPL [rsBackgroundSync] rollback 5 d:12 u:11
2016-12-17T12:51:53.376+0800 I REPL [rsBackgroundSync] rollback 6
2016-12-17T12:51:53.376+0800 I REPL [rsBackgroundSync] rollback done
2016-12-17T12:51:53.376+0800 I REPL [rsBackgroundSync] rollback finished
2016-12-17T12:51:53.376+0800 I NETWORK [SyncSourceFeedback] Socket say send() errno:9 Bad file descriptor 172.28.1.142:27017
2016-12-17T12:51:53.376+0800 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 172.28.1.142:27017
2016-12-17T12:51:53.376+0800 I REPL [SyncSourceFeedback] updateUpstream failed: Location9001: socket exception [SEND_ERROR] for 172.28.1.142:27017, will retry
2016-12-17T12:51:53.387+0800 I REPL [ReplicationExecutor] transition to RECOVERING
2016-12-17T12:51:53.388+0800 I REPL [ReplicationExecutor] syncing from: 172.28.1.141:27017
2016-12-17T12:51:53.416+0800 I REPL [SyncSourceFeedback] setting syncSourceFeedback to 172.28.1.141:27017
2016-12-17T12:51:53.416+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Connecting to 172.28.1.141:27017
2016-12-17T12:51:53.446+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to 172.28.1.141:27017
2016-12-17T12:51:57.039+0800 I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to 172.28.1.141:27017
2016-12-17T12:51:57.039+0800 I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-12-17T12:51:57.040+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.141:27017
2016-12-17T12:51:57.076+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 172.28.1.141:27017
2016-12-17T12:51:57.077+0800 I REPL [ReplicationExecutor] Member 172.28.1.141:27017 is now in state RECOVERING
2016-12-17T15:13:45.705+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Connecting to 172.28.1.141:27017

--------------------------------node 2-------------------------------------
2017-03-14T14:41:50.897+0800 I CONTROL [signalProcessingThread] now exiting
2017-03-14T14:41:50.897+0800 I NETWORK [signalProcessingThread] shutdown: going to close listening sockets...
2017-03-14T14:41:50.897+0800 I NETWORK [signalProcessingThread] closing listening socket: 7
2017-03-14T14:41:50.897+0800 I NETWORK [signalProcessingThread] shutdown: going to flush diaglog...
2017-03-14T14:41:50.897+0800 I NETWORK [signalProcessingThread] shutdown: going to close sockets...
2017-03-14T14:41:50.897+0800 I STORAGE [signalProcessingThread] WiredTigerKVEngine shutting down
2017-03-14T14:41:50.913+0800 I STORAGE [signalProcessingThread] shutdown: removing fs lock...
2017-03-14T14:41:50.913+0800 I CONTROL [signalProcessingThread] dbexit: rc: 0
2016-12-17T12:50:12.767+0800 I CONTROL [main] ***** SERVER RESTARTED *****
2016-12-17T12:50:13.573+0800 I NETWORK [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-12-17T12:50:16.580+0800 W NETWORK [ReplicationExecutor] Failed to connect to 172.28.1.140:27017, reason: errno:113 No route to host
2016-12-17T12:50:16.599+0800 I REPL [ReplicationExecutor] New replica set config in use: { _id: "fsp", version: 1, protocolVersion: 1, members: [ { _id: 0, host: "172.28.1.140:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "172.28.1.141:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "172.28.1.142:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('58c552b12d4be615ca28e5b1') } }
2016-12-17T12:50:16.599+0800 I REPL [ReplicationExecutor] This node is 172.28.1.141:27017 in the config
2016-12-17T12:50:16.599+0800 I REPL [ReplicationExecutor] transition to STARTUP2
2016-12-17T12:50:16.599+0800 I REPL [ReplicationExecutor] Starting replication applier threads
2016-12-17T12:50:16.600+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.140:27017
2016-12-17T12:50:16.600+0800 I REPL [ReplicationExecutor] transition to RECOVERING
2016-12-17T12:50:16.600+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.142:27017
2016-12-17T12:50:16.601+0800 I COMMAND [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after connections: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 2580, after storageEngine: 2580, after tcmalloc: 2580, after wiredTiger: 2590, at end: 2590 }
2016-12-17T12:50:16.602+0800 I REPL [ReplicationExecutor] transition to SECONDARY
2016-12-17T12:50:16.994+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 172.28.1.142:27017
2016-12-17T12:50:16.998+0800 I REPL [ReplicationExecutor] Member 172.28.1.142:27017 is now in state SECONDARY
2016-12-17T12:50:19.586+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to 172.28.1.140:27017 - HostUnreachable: No route to host
2016-12-17T12:50:19.587+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 172.28.1.140:27017; HostUnreachable: No route to host
2016-12-17T12:50:19.587+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.140:27017
2016-12-17T12:50:22.592+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to 172.28.1.140:27017 - HostUnreachable: No route to host
2016-12-17T12:50:22.593+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 172.28.1.140:27017; HostUnreachable: No route to host
2016-12-17T12:50:22.593+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.140:27017
2016-12-17T12:50:25.599+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to 172.28.1.140:27017 - HostUnreachable: No route to host
2016-12-17T12:50:25.599+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 172.28.1.140:27017; HostUnreachable: No route to host
2016-12-17T12:50:30.599+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.140:27017
2016-12-17T12:50:30.631+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 172.28.1.140:27017
2016-12-17T12:50:30.632+0800 I REPL [ReplicationExecutor] Member 172.28.1.140:27017 is now in state SECONDARY
2016-12-17T12:50:31.614+0800 I REPL [ReplicationExecutor] syncing from: 172.28.1.140:27017
2016-12-17T12:50:31.648+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Connecting to 172.28.1.140:27017
2016-12-17T12:50:31.648+0800 I REPL [SyncSourceFeedback] setting syncSourceFeedback to 172.28.1.140:27017
2016-12-17T12:50:31.677+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to 172.28.1.140:27017
2016-12-17T12:50:32.001+0800 I REPL [ReplicationExecutor] Member 172.28.1.142:27017 is now in state PRIMARY
2016-12-17T12:50:36.684+0800 I REPL [ReplicationExecutor] could not find member to sync from
2016-12-17T12:51:52.704+0800 I REPL [ReplicationExecutor] syncing from: 172.28.1.142:27017
2016-12-17T12:51:52.729+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Connecting to 172.28.1.142:27017
2016-12-17T12:51:52.730+0800 I REPL [SyncSourceFeedback] setting syncSourceFeedback to 172.28.1.142:27017
2016-12-17T12:51:52.752+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to 172.28.1.142:27017
2016-12-17T12:51:52.754+0800 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 11, timestamp: Mar 14 14:47:57:3c9d). source's GTE: (term: 12, timestamp: Mar 14 14:47:57:3c9d) hashes: (-1985836079535698320/4988593413792691282)
2016-12-17T12:51:52.754+0800 I REPL [rsBackgroundSync] beginning rollback
2016-12-17T12:51:52.754+0800 I REPL [rsBackgroundSync] rollback 0
2016-12-17T12:51:52.755+0800 I REPL [ReplicationExecutor] transition to ROLLBACK
2016-12-17T12:51:52.757+0800 I REPL [rsBackgroundSync] rollback 1
2016-12-17T12:51:52.784+0800 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint
2016-12-17T12:51:52.785+0800 I REPL [rsBackgroundSync] rollback our last optime: Mar 14 14:47:57:3c9d
2016-12-17T12:51:52.785+0800 I REPL [rsBackgroundSync] rollback their last optime: Mar 14 14:47:57:3d5e
2016-12-17T12:51:52.785+0800 I REPL [rsBackgroundSync] rollback diff in end of log times: 0 seconds
2016-12-17T12:51:53.111+0800 I REPL [rsBackgroundSync] rollback 3 fixup
2016-12-17T12:51:53.124+0800 I REPL [rsBackgroundSync] rollback 3.5
2016-12-17T12:51:53.125+0800 I REPL [rsBackgroundSync] rollback 4 n:2
2016-12-17T12:51:53.125+0800 I REPL [rsBackgroundSync] minvalid=(term: 12, timestamp: Mar 14 14:47:57:3d68)
2016-12-17T12:51:53.125+0800 I REPL [rsBackgroundSync] rollback 4.6
2016-12-17T12:51:53.125+0800 I REPL [rsBackgroundSync] rollback 4.7
2016-12-17T12:51:53.129+0800 I REPL [rsBackgroundSync] rollback 5 d:12 u:11
2016-12-17T12:51:53.129+0800 I REPL [rsBackgroundSync] rollback 6
2016-12-17T12:51:53.129+0800 I REPL [rsBackgroundSync] rollback done
2016-12-17T12:51:53.129+0800 I NETWORK [SyncSourceFeedback] Socket say send() errno:9 Bad file descriptor 172.28.1.142:27017
2016-12-17T12:51:53.129+0800 I REPL [rsBackgroundSync] rollback finished
2016-12-17T12:51:53.129+0800 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 172.28.1.142:27017
2016-12-17T12:51:53.130+0800 I REPL [SyncSourceFeedback] updateUpstream failed: Location9001: socket exception [SEND_ERROR] for 172.28.1.142:27017, will retry
2016-12-17T12:51:53.139+0800 I REPL [ReplicationExecutor] transition to RECOVERING
2016-12-17T12:51:53.139+0800 I REPL [ReplicationExecutor] syncing from: 172.28.1.140:27017
2016-12-17T12:51:53.164+0800 E REPL [rsBackgroundSync] sync producer problem: 13436 cannot read from oplog collection while in rollback
2016-12-17T12:51:53.265+0800 I REPL [ReplicationExecutor] syncing from: 172.28.1.140:27017
2016-12-17T12:51:53.286+0800 E REPL [rsBackgroundSync] sync producer problem: 13436 cannot read from oplog collection while in rollback
2016-12-17T12:51:53.386+0800 I REPL [ReplicationExecutor] syncing from: 172.28.1.140:27017
2016-12-17T12:51:53.414+0800 I REPL [SyncSourceFeedback] setting syncSourceFeedback to 172.28.1.140:27017
2016-12-17T12:51:53.414+0800 I ASIO [rsBackgroundSync] dropping unhealthy pooled connection to 172.28.1.140:27017
2016-12-17T12:51:53.414+0800 I ASIO [rsBackgroundSync] after drop, pool was empty, going to spawn some connections
2016-12-17T12:51:53.414+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Connecting to 172.28.1.140:27017
2016-12-17T12:51:53.448+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to 172.28.1.140:27017
2016-12-17T12:51:56.702+0800 I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to 172.28.1.140:27017
2016-12-17T12:51:56.702+0800 I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-12-17T12:51:56.703+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.140:27017
2016-12-17T12:51:56.723+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 172.28.1.140:27017
2016-12-17T12:51:56.723+0800 I REPL [ReplicationExecutor] Member 172.28.1.140:27017 is now in state RECOVERING
2016-12-17T13:38:09.036+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Connecting to 172.28.1.140:27017
2016-12-17T13:38:09.059+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to 172.28.1.140:27017
2016-12-17T13:42:19.087+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Connecting to 172.28.1.140:27017
2016-12-17T13:42:19.108+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to 172.28.1.140:27017
2016-12-17T16:52:06.754+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Connecting to 172.28.1.140:27017
2016-12-17T16:52:06.781+0800 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to 172.28.1.140:27017
2016-12-17T19:38:02.677+0800 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update: network error while attempting to run command 'replSetUpdatePosition' on host '172.28.1.140:27017'
2016-12-17T19:38:02.677+0800 I REPL [SyncSourceFeedback] updateUpstream failed: HostUnreachable: network error while attempting to run command 'replSetUpdatePosition' on host '172.28.1.140:27017' , will retry
2016-12-17T19:38:02.919+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 172.28.1.140:27017; HostUnreachable: End of file
2016-12-17T19:38:02.919+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.140:27017

--------------------------------node 3-------------------------------------
2017-03-14T14:41:50.894+0800 W EXECUTOR [rsBackgroundSync] killCursors command task failed: CallbackCanceled: Callback canceled
2017-03-14T14:41:50.895+0800 I CONTROL [signalProcessingThread] now exiting
2017-03-14T14:41:50.895+0800 I NETWORK [signalProcessingThread] shutdown: going to close listening sockets...
2017-03-14T14:41:50.895+0800 I NETWORK [signalProcessingThread] closing listening socket: 7
2017-03-14T14:41:50.895+0800 I NETWORK [signalProcessingThread] shutdown: going to flush diaglog...
2017-03-14T14:41:50.895+0800 I NETWORK [signalProcessingThread] shutdown: going to close sockets...
2017-03-14T14:41:50.895+0800 I STORAGE [signalProcessingThread] WiredTigerKVEngine shutting down
2017-03-14T14:41:51.034+0800 I STORAGE [signalProcessingThread] shutdown: removing fs lock...
2017-03-14T14:41:51.034+0800 I CONTROL [signalProcessingThread] dbexit: rc: 0
2016-12-17T12:50:13.156+0800 I CONTROL [main] ***** SERVER RESTARTED *****
2016-12-17T12:50:13.945+0800 I NETWORK [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-12-17T12:50:16.951+0800 W NETWORK [ReplicationExecutor] Failed to connect to 172.28.1.140:27017, reason: errno:113 No route to host
2016-12-17T12:50:16.974+0800 I REPL [ReplicationExecutor] New replica set config in use: { _id: "fsp", version: 1, protocolVersion: 1, members: [ { _id: 0, host: "172.28.1.140:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "172.28.1.141:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "172.28.1.142:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('58c552b12d4be615ca28e5b1') } }
2016-12-17T12:50:16.974+0800 I REPL [ReplicationExecutor] This node is 172.28.1.142:27017 in the config
2016-12-17T12:50:16.974+0800 I REPL [ReplicationExecutor] transition to STARTUP2
2016-12-17T12:50:16.974+0800 I REPL [ReplicationExecutor] Starting replication applier threads
2016-12-17T12:50:16.974+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.140:27017
2016-12-17T12:50:16.975+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.141:27017
2016-12-17T12:50:16.975+0800 I REPL [ReplicationExecutor] transition to RECOVERING
2016-12-17T12:50:16.976+0800 I REPL [ReplicationExecutor] transition to SECONDARY
2016-12-17T12:50:16.977+0800 I COMMAND [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after connections: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 2958, after storageEngine: 2958, after tcmalloc: 2958, after wiredTiger: 2958, at end: 2958 }
2016-12-17T12:50:17.003+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 172.28.1.141:27017
2016-12-17T12:50:17.003+0800 I REPL [ReplicationExecutor] Member 172.28.1.141:27017 is now in state SECONDARY
2016-12-17T12:50:19.957+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to 172.28.1.140:27017 - HostUnreachable: No route to host
2016-12-17T12:50:19.957+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 172.28.1.140:27017; HostUnreachable: No route to host
2016-12-17T12:50:19.957+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.140:27017
2016-12-17T12:50:22.963+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to 172.28.1.140:27017 - HostUnreachable: No route to host
2016-12-17T12:50:22.963+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 172.28.1.140:27017; HostUnreachable: No route to host
2016-12-17T12:50:22.963+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.140:27017
2016-12-17T12:50:25.969+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to 172.28.1.140:27017 - HostUnreachable: No route to host
2016-12-17T12:50:25.970+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 172.28.1.140:27017; HostUnreachable: No route to host
2016-12-17T12:50:27.019+0800 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2016-12-17T12:50:27.019+0800 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
2016-12-17T12:50:27.019+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.140:27017
2016-12-17T12:50:27.020+0800 I REPL [ReplicationExecutor] dry election run succeeded, running for election
2016-12-17T12:50:27.021+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.140:27017
2016-12-17T12:50:27.021+0800 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 12
2016-12-17T12:50:27.021+0800 I REPL [ReplicationExecutor] transition to PRIMARY
2016-12-17T12:50:27.021+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.140:27017
2016-12-17T12:50:27.070+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 172.28.1.140:27017
2016-12-17T12:50:27.070+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 172.28.1.140:27017
2016-12-17T12:50:27.071+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 172.28.1.140:27017
2016-12-17T12:50:27.075+0800 I REPL [ReplicationExecutor] Member 172.28.1.140:27017 is now in state SECONDARY
2016-12-17T12:50:27.980+0800 I REPL [rsSync] transition to primary complete; database writes are now permitted
2016-12-17T12:51:53.053+0800 I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to 172.28.1.141:27017
2016-12-17T12:51:53.053+0800 I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-12-17T12:51:53.053+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.141:27017
2016-12-17T12:51:53.081+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 172.28.1.141:27017
2016-12-17T12:51:53.082+0800 I REPL [ReplicationExecutor] Member 172.28.1.141:27017 is now in state ROLLBACK
2016-12-17T12:51:53.114+0800 I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to 172.28.1.140:27017
2016-12-17T12:51:53.114+0800 I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-12-17T12:51:53.114+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.140:27017
2016-12-17T12:51:53.138+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 172.28.1.140:27017
2016-12-17T12:51:53.142+0800 I REPL [ReplicationExecutor] Member 172.28.1.140:27017 is now in state ROLLBACK
2016-12-17T12:51:55.083+0800 I REPL [ReplicationExecutor] Member 172.28.1.141:27017 is now in state RECOVERING
2016-12-17T12:51:55.143+0800 I REPL [ReplicationExecutor] Member 172.28.1.140:27017 is now in state RECOVERING
2016-12-17T19:38:02.677+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 172.28.1.140:27017; HostUnreachable: End of file
2016-12-17T19:38:02.678+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 172.28.1.140:27017
2016-12-17T19:38:02.679+0800 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to 172.28.1.140:27017 - HostUnreachable: End of file
2016-12-17T19:38:02.679+0800 I ASIO [NetworkInterfaceASIO-Replication-0] failed to close stream: Transport endpoint is not connected
2016-12-17T19:38:02.679+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 172.28.1.140:27017; HostUnreachable: End of file



 Comments   
Comment by Kelsey Schubert [ 30/Jan/17 ]

Hi eveninger,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If you would like us to take a look at the logs, please attach them and we will reopen the ticket.

Regards,
Thomas

Comment by Spencer Brody (Inactive) [ 05/Jan/17 ]

Hi eveninger,
There are 3 components to the OpTime (the unique id of each oplog entry, which must be increasing to preserve proper oplog ordering). The first is the term, which is incremented on every election attempt. There can only be 1 primary in a given term, so oplog entries in higher terms are guaranteed to have come from primaries that were elected later.

After the term comes the Timestamp, which in turn has 2 components: 'ts' and 'i'. When an oplog entry is generated, the 'ts' field is initialized to either the current time in seconds (from reading the system clock) or the 'ts' value of the last entry in that node's oplog, whichever is higher. This should prevent setting the clock backwards from generating OpTimes that go backwards. Finally, the 'i' field starts at 0, but if the 'ts' of the last oplog entry is the same as the 'ts' of the entry being generated (either because multiple writes happened in the same second, or because the time was set backwards at some point and the system time hasn't caught up to where the clock was when it was ahead), then 'i' becomes the previous entry's 'i' plus 1.
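
A minimal sketch of that assignment logic, purely as an illustrative model (the OpTime/next_optime names, the epoch values, and the one-entry-at-a-time API are assumptions for the example, not the server's actual C++ implementation):

from dataclasses import dataclass
from typing import Optional
import time

@dataclass(frozen=True, order=True)
class OpTime:
    term: int  # incremented on every election attempt
    ts: int    # wall-clock seconds assigned to the entry
    i: int     # tie-breaker within the same second

def next_optime(last: OpTime, term: int, now: Optional[int] = None) -> OpTime:
    """Assign the OpTime of a new oplog entry (illustrative model only)."""
    if now is None:
        now = int(time.time())
    # 'ts' is the greater of the system clock and the previous entry's 'ts',
    # so stepping the clock backwards cannot make OpTimes go backwards.
    ts = max(now, last.ts)
    # 'i' restarts at 0 in a new second and increments when 'ts' repeats
    # (many writes in one second, or the clock has not yet caught up).
    i = last.i + 1 if ts == last.ts else 0
    return OpTime(term, ts, i)

# Roughly the situation in the logs: the last entry was written around
# "Mar 14 14:47:57" in term 11, then the clock was stepped back to Dec 17
# and a new primary was elected in term 12 (epoch values approximate).
last = OpTime(term=11, ts=1489474077, i=0x3c9d)
nxt = next_optime(last, term=12, now=1481950253)  # clock now reads Dec 17
assert nxt > last  # the new entry still sorts after the old one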

As to why the rollback happened, that is hard to say without seeing fuller logs including the time period leading up to where the nodes were shut down and the clock was adjusted. If you attach full logs covering the same time range you posted above as well as 10 minutes leading up to where the nodes were restarted, I would be happy to take a look and see if I can understand why the rollback happened in the first place.

Comment by eveninger [ 31/Dec/16 ]

Thank you, Spencer. I still have some doubts.
Who triggered the rollback? I mean, if nobody had adjusted the clock, why would node 1 and node 2 go into rollback at the same time?

In my opinion, the oplog timestamps look like this:
node 1:  1-2-3-(adjustment)-5
primary: 1-2-3-(adjustment)-4-6
So node 1 must roll back to 1-2-3, then sync 4 and 6 from the primary to reach a consistent state.
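
A minimal sketch of that roll-back-to-common-point idea, using the same numbers (the list-based oplogs and the common_point helper are illustrative assumptions, not MongoDB's actual implementation):

def common_point(local_oplog, source_oplog):
    """Length of the prefix on which the two oplogs agree (illustrative only)."""
    n = 0
    for a, b in zip(local_oplog, source_oplog):
        if a != b:
            break
        n += 1
    return n

local  = [1, 2, 3, 5]       # node 1: wrote 5 after the adjustment
source = [1, 2, 3, 4, 6]    # primary: wrote 4 and 6 instead

cp = common_point(local, source)   # 3, i.e. the shared prefix 1-2-3
undone = local[cp:]                # [5] is rolled back
local = local[:cp] + source[cp:]   # then 4 and 6 are synced from the primary
assert local == source             # consistent again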

How does the oplog maintain continuity: by timestamp, or by position (an AUTO_INCREMENT sequence)?
If it uses the timestamp, then when I adjust the clock from 2017-03-14 back to 2016-12-17 and insert some new entries, the oplog becomes discontinuous.

Comment by Spencer Brody (Inactive) [ 30/Dec/16 ]

Hi eveninger,
I've read through the logs you posted, and from the logs I don't think the issue you're hitting is related to the clock adjustment, though there is definitely a real bug in play here. From the logs you can see that both node 1 and node 2 go into rollback, rolling back against the primary. The problem comes when the nodes transition from ROLLBACK into RECOVERING, where they look for oplog entries to apply to bring them back into a consistent state: node 1 winds up trying to sync ops from node 2 at the same time as node 2 is trying to sync ops from node 1, so neither is ever able to catch up. I believe this was due to a narrow race condition resulting from the combination of SERVER-25145 and SERVER-27403. I'm sorry that you encountered this bug! SERVER-25145 has been fixed in 3.2.11 and would be sufficient to prevent this from occurring, so I recommend you upgrade to avoid hitting it in the future. Additional work on SERVER-27403 is ongoing, and we hope to have it ready in time for 3.2.12; it will provide additional protections around sync source selection during rollbacks.
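
A toy model of why that circular sync-source choice can never converge (field names and numbers are entirely hypothetical, not how the server tracks this internally):

# Both nodes finished rollback and must reach minValid (optime 100)
# before leaving RECOVERING, but each picked the other as its sync source.
nodes = {
    "node1": {"applied": 90, "minvalid": 100, "sync_source": "node2"},
    "node2": {"applied": 90, "minvalid": 100, "sync_source": "node1"},
    "primary": {"applied": 100, "minvalid": 100, "sync_source": None},
}

def catch_up_round(nodes):
    progressed = False
    for name, n in nodes.items():
        src = n["sync_source"]
        if src is None:
            continue
        # A node can only copy oplog entries its sync source already has.
        available = nodes[src]["applied"]
        if available > n["applied"]:
            n["applied"] = available
            progressed = True
    return progressed

# node1 and node2 only offer each other optime 90, so neither ever
# reaches minValid 100 and both stay in RECOVERING indefinitely.
while catch_up_round(nodes):
    pass
assert nodes["node1"]["applied"] < nodes["node1"]["minvalid"]
assert nodes["node2"]["applied"] < nodes["node2"]["minvalid"]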

Cheers,
-Spencer
