-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Major - P3
-
None
-
Affects Version/s: 3.0.6
-
Component/s: Networking, Replication
-
None
-
Replication
We have a 3 machines mongodb setup : one primary and 2 secondaries.
One of the secondaries crashed without explanation.
Setup:
primary : ip 1.2.3.4
secondary-1 : the machine that crashed
secondary-2 : ip 5.6.7.8
On all machines : mongod version v3.0.6-rc0
Log:
2015-09-29T15:56:42.683+0200 I NETWORK [ReplExecNetThread-3242] Socket recv() timeout 1.2.3.4:27017 2015-09-29T15:56:42.683+0200 I NETWORK [ReplExecNetThread-3242] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [1.2.3.4:27017] 2015-09-29T15:56:42.683+0200 I NETWORK [ReplExecNetThread-3242] DBClientCursor::init call() failed 2015-09-29T15:56:42.683+0200 I REPL [ReplicationExecutor] Error in heartbeat request to primary:27017; Location10276 DBClientBase::findN: transport error: primary:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D6F686177504275334D5231345A474F554351326350474267525A4632507A5454) } 2015-09-29T15:56:42.684+0200 I REPL [ReplicationExecutor] Standing for election 2015-09-29T15:56:42.739+0200 I REPL [ReplicationExecutor] not electing self, secondary-2:27017 would veto with 'secondary-1:27017 is trying to elect itself but primary:27017 is already primary and more up-to-date' 2015-09-29T15:56:42.739+0200 I REPL [ReplicationExecutor] not electing self, we are not freshest 2015-09-29T15:56:43.709+0200 I NETWORK [conn62818] end connection 5.6.7.8:57459 (10 connections now open) 2015-09-29T15:56:43.721+0200 I NETWORK [initandlisten] connection accepted from 5.6.7.8:57462 #62820 (11 connections now open) 2015-09-29T15:56:43.774+0200 I ACCESS [conn62820] Successfully authenticated as principal __system on local 2015-09-29T15:56:44.246+0200 I REPL [ReplicationExecutor] replSetElect voting yea for secondary-2:27017 (5) 2015-09-29T15:56:44.635+0200 I REPL [ReplicationExecutor] Member secondary-2:27017 is now in state PRIMARY 2015-09-29T15:56:54.687+0200 I NETWORK [ReplExecNetThread-3258] Socket recv() timeout 1.2.3.4:27017 2015-09-29T15:56:54.687+0200 I NETWORK [ReplExecNetThread-3258] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [1.2.3.4:27017] 2015-09-29T15:56:54.687+0200 I NETWORK [ReplExecNetThread-3258] DBClientCursor::init call() failed 2015-09-29T15:56:54.687+0200 I REPL [ReplicationExecutor] Error in heartbeat request to primary:27017; Location10276 DBClientBase::findN: transport error: primary:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D664D4C49456F584D7050524A3231325A665A6A32557A4B674977307041537978) } 2015-09-29T15:56:55.079+0200 E REPL [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: primary:27017 2015-09-29T15:56:55.079+0200 I - [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for primary:27017 (1.2.3.4) failed) in destructor (kill) 2015-09-29T15:56:55.080+0200 I REPL [ReplicationExecutor] syncing from: secondary-2:27017 2015-09-29T15:56:55.159+0200 I REPL [SyncSourceFeedback] replset setting syncSourceFeedback to secondary-2:27017 2015-09-29T15:56:55.169+0200 I REPL [rsBackgroundSync] replSet our last op time fetched: Sep 29 15:56:49:2 2015-09-29T15:56:55.169+0200 I REPL [rsBackgroundSync] replset source's GTE: Sep 29 15:56:52:1 2015-09-29T15:56:55.175+0200 I REPL [rsBackgroundSync] beginning rollback 2015-09-29T15:56:55.175+0200 I REPL [rsBackgroundSync] rollback 0 2015-09-29T15:56:55.175+0200 I REPL [ReplicationExecutor] transition to ROLLBACK 2015-09-29T15:56:55.175+0200 I REPL [rsBackgroundSync] rollback 1 2015-09-29T15:56:55.175+0200 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint 2015-09-29T15:56:55.199+0200 I NETWORK [rsBackgroundSync] DBClientCursor::init call() failed 2015-09-29T15:56:55.208+0200 E REPL [rsBackgroundSync] remote oplog empty or unreadable 2015-09-29T15:56:55.208+0200 I NETWORK [SyncSourceFeedback] DBClientCursor::init call() failed 2015-09-29T15:56:55.208+0200 I - [rsBackgroundSync] Fatal Assertion 18752 2015-09-29T15:56:55.208+0200 I - [rsBackgroundSync] ***aborting after fassert() failure
According to SERVER-20711, this is a normal shut down of the replica because of network errors.
However, temporary network errors are common in datacenters.
Moreover, in some cases, mongod handle some timeouts by waiting a bit :
2015-10-01T11:04:05.691+0200 I NETWORK [rsBackgroundSync] Socket recv() timeout 1.2.3.4:27017 2015-10-01T11:04:05.691+0200 I NETWORK [rsBackgroundSync] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [94.23.207.119:27017] 2015-10-01T11:04:05.691+0200 I - [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for primary:27017 (1.2.3.4) failed) in destructor (kill) 2015-10-01T11:04:05.691+0200 W REPL [rsBackgroundSync] rollback 2 exception 10278 dbclient error communicating with server: primary:27017; sleeping 1 min