Loading...

XML

Word

Printable

JSON

Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.0.6
Component/s: Networking, Replication
Labels:
None

Assigned Teams:

Replication
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

We have a 3 machines mongodb setup : one primary and 2 secondaries.
One of the secondaries crashed without explanation.

Setup:
primary : ip 1.2.3.4
secondary-1 : the machine that crashed
secondary-2 : ip 5.6.7.8

On all machines : mongod version v3.0.6-rc0

Log:

2015-09-29T15:56:42.683+0200 I NETWORK  [ReplExecNetThread-3242] Socket recv() timeout  1.2.3.4:27017
2015-09-29T15:56:42.683+0200 I NETWORK  [ReplExecNetThread-3242] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [1.2.3.4:27017] 
2015-09-29T15:56:42.683+0200 I NETWORK  [ReplExecNetThread-3242] DBClientCursor::init call() failed
2015-09-29T15:56:42.683+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to primary:27017; Location10276 DBClientBase::findN: transport error: primary:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D6F686177504275334D5231345A474F554351326350474267525A4632507A5454) }
2015-09-29T15:56:42.684+0200 I REPL     [ReplicationExecutor] Standing for election
2015-09-29T15:56:42.739+0200 I REPL     [ReplicationExecutor] not electing self, secondary-2:27017 would veto with 'secondary-1:27017 is trying to elect itself but primary:27017 is already primary and more up-to-date'
2015-09-29T15:56:42.739+0200 I REPL     [ReplicationExecutor] not electing self, we are not freshest
2015-09-29T15:56:43.709+0200 I NETWORK  [conn62818] end connection 5.6.7.8:57459 (10 connections now open) 
2015-09-29T15:56:43.721+0200 I NETWORK  [initandlisten] connection accepted from 5.6.7.8:57462 #62820 (11 connections now open) 
2015-09-29T15:56:43.774+0200 I ACCESS   [conn62820] Successfully authenticated as principal __system on local
2015-09-29T15:56:44.246+0200 I REPL     [ReplicationExecutor] replSetElect voting yea for secondary-2:27017 (5)
2015-09-29T15:56:44.635+0200 I REPL     [ReplicationExecutor] Member secondary-2:27017 is now in state PRIMARY
2015-09-29T15:56:54.687+0200 I NETWORK  [ReplExecNetThread-3258] Socket recv() timeout  1.2.3.4:27017
2015-09-29T15:56:54.687+0200 I NETWORK  [ReplExecNetThread-3258] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [1.2.3.4:27017] 
2015-09-29T15:56:54.687+0200 I NETWORK  [ReplExecNetThread-3258] DBClientCursor::init call() failed
2015-09-29T15:56:54.687+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to primary:27017; Location10276 DBClientBase::findN: transport error: primary:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D664D4C49456F584D7050524A3231325A665A6A32557A4B674977307041537978) }
2015-09-29T15:56:55.079+0200 E REPL     [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: primary:27017
2015-09-29T15:56:55.079+0200 I -        [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for primary:27017 (1.2.3.4) failed) in destructor (kill)
2015-09-29T15:56:55.080+0200 I REPL     [ReplicationExecutor] syncing from: secondary-2:27017
2015-09-29T15:56:55.159+0200 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to secondary-2:27017
2015-09-29T15:56:55.169+0200 I REPL     [rsBackgroundSync] replSet our last op time fetched: Sep 29 15:56:49:2
2015-09-29T15:56:55.169+0200 I REPL     [rsBackgroundSync] replset source's GTE: Sep 29 15:56:52:1
2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] beginning rollback
2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] rollback 0
2015-09-29T15:56:55.175+0200 I REPL     [ReplicationExecutor] transition to ROLLBACK
2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] rollback 1
2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2015-09-29T15:56:55.199+0200 I NETWORK  [rsBackgroundSync] DBClientCursor::init call() failed
2015-09-29T15:56:55.208+0200 E REPL     [rsBackgroundSync] remote oplog empty or unreadable
2015-09-29T15:56:55.208+0200 I NETWORK  [SyncSourceFeedback] DBClientCursor::init call() failed
2015-09-29T15:56:55.208+0200 I -        [rsBackgroundSync] Fatal Assertion 18752
2015-09-29T15:56:55.208+0200 I -        [rsBackgroundSync] 

***aborting after fassert() failure

According to ~~SERVER-20711~~, this is a normal shut down of the replica because of network errors.
However, temporary network errors are common in datacenters.
Moreover, in some cases, mongod handle some timeouts by waiting a bit :

2015-10-01T11:04:05.691+0200 I NETWORK [rsBackgroundSync] Socket recv() timeout 1.2.3.4:27017
2015-10-01T11:04:05.691+0200 I NETWORK [rsBackgroundSync] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [94.23.207.119:27017]
2015-10-01T11:04:05.691+0200 I - [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for primary:27017 (1.2.3.4) failed) in destructor (kill)
2015-10-01T11:04:05.691+0200 W REPL [rsBackgroundSync] rollback 2 exception 10278 dbclient error communicating with server: primary:27017; sleeping 1 min

Assignee:: [DO NOT USE] Backlog - Replication Team
Reporter:: samuel charron
Participants:: [DO NOT USE] Backlog - Replication Team, Ramon Fernandez Marina, samuel charron
Votes:: 1 Vote for this issue
Watchers:: 12 Start watching this issue

Created:: Oct 02 2015 03:26:58 PM UTC
Updated:: Dec 06 2022 04:43:13 AM UTC

Details

Description

Attachments

Forms

Activity

People

Dates