[SERVER-20739] Make secondaries more resilient to transient network problems during failover/rollback Created: 02/Oct/15  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Networking, Replication
Affects Version/s: 3.0.6
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: samuel charron Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Replication
Participants:

 Description   

We have a three-machine MongoDB setup: one primary and two secondaries.
One of the secondaries crashed without explanation.

Setup:
primary: IP 1.2.3.4
secondary-1: the machine that crashed
secondary-2: IP 5.6.7.8

All machines run mongod v3.0.6-rc0 (an illustrative sketch of a comparable replica set configuration is shown below).
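
For context, a minimal sketch of how a three-member replica set with this topology could be initiated from pymongo. The replica set name "rs0", the combination of hostnames, and the helper code are illustrative assumptions, not taken from the reporter's environment:

import pymongo

# Hypothetical config matching the topology above; names are illustrative.
config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "1.2.3.4:27017"},        # primary
        {"_id": 1, "host": "secondary-1:27017"},    # the member that crashed
        {"_id": 2, "host": "5.6.7.8:27017"},        # secondary-2
    ],
}

# replSetInitiate is run against one member; the others join from the config.
client = pymongo.MongoClient("1.2.3.4", 27017)
client.admin.command("replSetInitiate", config)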

Log:

2015-09-29T15:56:42.683+0200 I NETWORK  [ReplExecNetThread-3242] Socket recv() timeout  1.2.3.4:27017
2015-09-29T15:56:42.683+0200 I NETWORK  [ReplExecNetThread-3242] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [1.2.3.4:27017] 
2015-09-29T15:56:42.683+0200 I NETWORK  [ReplExecNetThread-3242] DBClientCursor::init call() failed
2015-09-29T15:56:42.683+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to primary:27017; Location10276 DBClientBase::findN: transport error: primary:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D6F686177504275334D5231345A474F554351326350474267525A4632507A5454) }
2015-09-29T15:56:42.684+0200 I REPL     [ReplicationExecutor] Standing for election
2015-09-29T15:56:42.739+0200 I REPL     [ReplicationExecutor] not electing self, secondary-2:27017 would veto with 'secondary-1:27017 is trying to elect itself but primary:27017 is already primary and more up-to-date'
2015-09-29T15:56:42.739+0200 I REPL     [ReplicationExecutor] not electing self, we are not freshest
2015-09-29T15:56:43.709+0200 I NETWORK  [conn62818] end connection 5.6.7.8:57459 (10 connections now open) 
2015-09-29T15:56:43.721+0200 I NETWORK  [initandlisten] connection accepted from 5.6.7.8:57462 #62820 (11 connections now open) 
2015-09-29T15:56:43.774+0200 I ACCESS   [conn62820] Successfully authenticated as principal __system on local
2015-09-29T15:56:44.246+0200 I REPL     [ReplicationExecutor] replSetElect voting yea for secondary-2:27017 (5)
2015-09-29T15:56:44.635+0200 I REPL     [ReplicationExecutor] Member secondary-2:27017 is now in state PRIMARY
2015-09-29T15:56:54.687+0200 I NETWORK  [ReplExecNetThread-3258] Socket recv() timeout  1.2.3.4:27017
2015-09-29T15:56:54.687+0200 I NETWORK  [ReplExecNetThread-3258] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [1.2.3.4:27017] 
2015-09-29T15:56:54.687+0200 I NETWORK  [ReplExecNetThread-3258] DBClientCursor::init call() failed
2015-09-29T15:56:54.687+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to primary:27017; Location10276 DBClientBase::findN: transport error: primary:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D664D4C49456F584D7050524A3231325A665A6A32557A4B674977307041537978) }
2015-09-29T15:56:55.079+0200 E REPL     [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: primary:27017
2015-09-29T15:56:55.079+0200 I -        [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for primary:27017 (1.2.3.4) failed) in destructor (kill)
2015-09-29T15:56:55.080+0200 I REPL     [ReplicationExecutor] syncing from: secondary-2:27017
2015-09-29T15:56:55.159+0200 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to secondary-2:27017
2015-09-29T15:56:55.169+0200 I REPL     [rsBackgroundSync] replSet our last op time fetched: Sep 29 15:56:49:2
2015-09-29T15:56:55.169+0200 I REPL     [rsBackgroundSync] replset source's GTE: Sep 29 15:56:52:1
2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] beginning rollback
2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] rollback 0
2015-09-29T15:56:55.175+0200 I REPL     [ReplicationExecutor] transition to ROLLBACK
2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] rollback 1
2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2015-09-29T15:56:55.199+0200 I NETWORK  [rsBackgroundSync] DBClientCursor::init call() failed
2015-09-29T15:56:55.208+0200 E REPL     [rsBackgroundSync] remote oplog empty or unreadable
2015-09-29T15:56:55.208+0200 I NETWORK  [SyncSourceFeedback] DBClientCursor::init call() failed
2015-09-29T15:56:55.208+0200 I -        [rsBackgroundSync] Fatal Assertion 18752
2015-09-29T15:56:55.208+0200 I -        [rsBackgroundSync] 
 
***aborting after fassert() failure

According to SERVER-20711, this is considered a normal shutdown of the replica set member in response to network errors during rollback.
However, temporary network errors are common in datacenters.
Moreover, in some cases mongod already handles such timeouts by sleeping and retrying, as in this log excerpt:

2015-10-01T11:04:05.691+0200 I NETWORK [rsBackgroundSync] Socket recv() timeout 1.2.3.4:27017
2015-10-01T11:04:05.691+0200 I NETWORK [rsBackgroundSync] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [94.23.207.119:27017]
2015-10-01T11:04:05.691+0200 I - [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for primary:27017 (1.2.3.4) failed) in destructor (kill)
2015-10-01T11:04:05.691+0200 W REPL [rsBackgroundSync] rollback 2 exception 10278 dbclient error communicating with server: primary:27017; sleeping 1 min
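
What this ticket asks for is that the sync/rollback path tolerate such transient failures rather than fasserting on the first one. Below is a minimal, hypothetical sketch of the requested retry-with-backoff behavior, written as a client-side Python/pymongo helper purely for illustration (it is not the server's actual C++ implementation, and the function name and parameters are assumptions): attempt the remote oplog read a few times, sleep between attempts, and only surface the error after repeated failures.

import time
from pymongo import MongoClient
from pymongo.errors import AutoReconnect, NetworkTimeout

def read_remote_oplog_head(host, retries=3, backoff_secs=10):
    # Hypothetical helper: fetch the newest oplog entry from a sync source,
    # retrying on transient network errors instead of giving up immediately.
    for attempt in range(1, retries + 1):
        try:
            client = MongoClient(host, serverSelectionTimeoutMS=5000)
            oplog = client.local["oplog.rs"]
            return oplog.find().sort("$natural", -1).limit(1).next()
        except (AutoReconnect, NetworkTimeout):
            if attempt == retries:
                raise  # persistent failure: only now report the source as unreachable
            time.sleep(backoff_secs * attempt)  # back off before retrying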



 Comments   
Comment by Ramon Fernandez Marina [ 02/Oct/15 ]

Thanks for your report, samuel.charron. I'm repurposing this ticket as an improvement request to make secondary nodes more resilient to transient network problems.

Regards,
Ramón.
