Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-20739

Make secondaries more resilient to transient network problems during failover/rollback

    • Type: Icon: Improvement Improvement
    • Resolution: Unresolved
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 3.0.6
    • Component/s: Networking, Replication
    • Labels:
      None
    • Replication

      We have a 3 machines mongodb setup : one primary and 2 secondaries.
      One of the secondaries crashed without explanation.

      Setup:
      primary : ip 1.2.3.4
      secondary-1 : the machine that crashed
      secondary-2 : ip 5.6.7.8

      On all machines : mongod version v3.0.6-rc0

      Log:

      2015-09-29T15:56:42.683+0200 I NETWORK  [ReplExecNetThread-3242] Socket recv() timeout  1.2.3.4:27017
      2015-09-29T15:56:42.683+0200 I NETWORK  [ReplExecNetThread-3242] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [1.2.3.4:27017] 
      2015-09-29T15:56:42.683+0200 I NETWORK  [ReplExecNetThread-3242] DBClientCursor::init call() failed
      2015-09-29T15:56:42.683+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to primary:27017; Location10276 DBClientBase::findN: transport error: primary:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D6F686177504275334D5231345A474F554351326350474267525A4632507A5454) }
      2015-09-29T15:56:42.684+0200 I REPL     [ReplicationExecutor] Standing for election
      2015-09-29T15:56:42.739+0200 I REPL     [ReplicationExecutor] not electing self, secondary-2:27017 would veto with 'secondary-1:27017 is trying to elect itself but primary:27017 is already primary and more up-to-date'
      2015-09-29T15:56:42.739+0200 I REPL     [ReplicationExecutor] not electing self, we are not freshest
      2015-09-29T15:56:43.709+0200 I NETWORK  [conn62818] end connection 5.6.7.8:57459 (10 connections now open) 
      2015-09-29T15:56:43.721+0200 I NETWORK  [initandlisten] connection accepted from 5.6.7.8:57462 #62820 (11 connections now open) 
      2015-09-29T15:56:43.774+0200 I ACCESS   [conn62820] Successfully authenticated as principal __system on local
      2015-09-29T15:56:44.246+0200 I REPL     [ReplicationExecutor] replSetElect voting yea for secondary-2:27017 (5)
      2015-09-29T15:56:44.635+0200 I REPL     [ReplicationExecutor] Member secondary-2:27017 is now in state PRIMARY
      2015-09-29T15:56:54.687+0200 I NETWORK  [ReplExecNetThread-3258] Socket recv() timeout  1.2.3.4:27017
      2015-09-29T15:56:54.687+0200 I NETWORK  [ReplExecNetThread-3258] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [1.2.3.4:27017] 
      2015-09-29T15:56:54.687+0200 I NETWORK  [ReplExecNetThread-3258] DBClientCursor::init call() failed
      2015-09-29T15:56:54.687+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to primary:27017; Location10276 DBClientBase::findN: transport error: primary:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D664D4C49456F584D7050524A3231325A665A6A32557A4B674977307041537978) }
      2015-09-29T15:56:55.079+0200 E REPL     [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: primary:27017
      2015-09-29T15:56:55.079+0200 I -        [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for primary:27017 (1.2.3.4) failed) in destructor (kill)
      2015-09-29T15:56:55.080+0200 I REPL     [ReplicationExecutor] syncing from: secondary-2:27017
      2015-09-29T15:56:55.159+0200 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to secondary-2:27017
      2015-09-29T15:56:55.169+0200 I REPL     [rsBackgroundSync] replSet our last op time fetched: Sep 29 15:56:49:2
      2015-09-29T15:56:55.169+0200 I REPL     [rsBackgroundSync] replset source's GTE: Sep 29 15:56:52:1
      2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] beginning rollback
      2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] rollback 0
      2015-09-29T15:56:55.175+0200 I REPL     [ReplicationExecutor] transition to ROLLBACK
      2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] rollback 1
      2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
      2015-09-29T15:56:55.199+0200 I NETWORK  [rsBackgroundSync] DBClientCursor::init call() failed
      2015-09-29T15:56:55.208+0200 E REPL     [rsBackgroundSync] remote oplog empty or unreadable
      2015-09-29T15:56:55.208+0200 I NETWORK  [SyncSourceFeedback] DBClientCursor::init call() failed
      2015-09-29T15:56:55.208+0200 I -        [rsBackgroundSync] Fatal Assertion 18752
      2015-09-29T15:56:55.208+0200 I -        [rsBackgroundSync] 
      
      ***aborting after fassert() failure
      

      According to SERVER-20711, this is a normal shut down of the replica because of network errors.
      However, temporary network errors are common in datacenters.
      Moreover, in some cases, mongod handle some timeouts by waiting a bit :

      2015-10-01T11:04:05.691+0200 I NETWORK [rsBackgroundSync] Socket recv() timeout 1.2.3.4:27017
      2015-10-01T11:04:05.691+0200 I NETWORK [rsBackgroundSync] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [94.23.207.119:27017]
      2015-10-01T11:04:05.691+0200 I - [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for primary:27017 (1.2.3.4) failed) in destructor (kill)
      2015-10-01T11:04:05.691+0200 W REPL [rsBackgroundSync] rollback 2 exception 10278 dbclient error communicating with server: primary:27017; sleeping 1 min
      

            Assignee:
            backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter:
            samuel.charron samuel charron
            Votes:
            1 Vote for this issue
            Watchers:
            12 Start watching this issue

              Created:
              Updated: