  1. Core Server
  2. SERVER-42059

Mongo replication sync halfway SocketException: remote: 192.168.168.122:27017 error: 9001 socket exception [RECV_ERROR] server [192.168.168.122:27017]

    • Type: Question
    • Resolution: Done
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Replication
    • Labels:
      None
    • Server Triage

      I have one mongo v2.4.10 server with 1.7TB of data, and I am trying to migrate and upgrade it to a mongo v3.0.15 server.

       

      I've set up a new mongo v3.0.15 instance and configured replication so that the v3.0.15 instance is a secondary syncing from the v2.4.10 primary.

       

      The secondary was in STARTUP2 and the sync was almost finished, judging by the storage growth on the new machine running mongo v3.0.15.

       

      However, there were some socket exceptions which caused my machines to resync again from the start. Is there anything I can configure or set up to prevent the error from happening again? I don't want to spend another 7 days only to fail to sync 1.7TB again.

       

      Below are some logs from my mongo:

      Primary mongo (v2.4.10):

      ```
      Wed Jul 3 10:03:59.196 [conn21] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [101.0.0.182:32829]
      ```

      Secondary mongo (v3.0.15):
      ```

      ...
      2019-07-03T09:54:29.169+0800 I NETWORK [ReplExecNetThread-0] Socket recv() timeout 192.168.168.122:27017
      2019-07-03T09:54:29.169+0800 I NETWORK [ReplExecNetThread-0] SocketException: remote: 192.168.168.122:27017 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.168.122:27017]
      2019-07-03T09:54:29.169+0800 I NETWORK [ReplExecNetThread-0] DBClientCursor::init call() failed
      2019-07-03T09:54:29.169+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.168.122:27017; Location10276 DBClientBase::findN: transport error: 192.168.168.122:27017 ns: admin.$cmd query: { replSetHeartbeat: "ArchiverReplica", pv: 1, v: 1, from: "x.x.x.x:27017", fromId: 1, checkEmpty: false }

      2019-07-03T09:54:29.170+0800 W NETWORK [ReplExecNetThread-0] Failed to connect to 192.168.168.122:27017 after 1 milliseconds, giving up.
      2019-07-03T09:54:29.170+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.168.122:27017; Location18915 Failed attempt to connect to 192.168.168.122:27017; couldn't connect to server 192.168.168.122:27017 (192.168.168.122), connection attempt failed
      ...
      2019-07-03T10:07:41.452+0800 W NETWORK [ReplExecNetThread-0] Failed to connect to 192.168.168.122:27017 after 4995 milliseconds, giving up.
      2019-07-03T10:07:41.452+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.168.122:27017; Location18915 Failed attempt to connect to 192.168.168.122:27017; couldn't connect to server 192.168.168.122:27017 (192.168.168.122), connection attempt failed
      2019-07-03T10:07:43.602+0800 I REPL [ReplicationExecutor] Member 192.168.168.122:27017 is now in state PRIMARY
      ...
      2019-07-03T10:08:03.845+0800 I NETWORK [rsSync] Socket recv() errno:104 Connection reset by peer 192.168.168.122:27017
      2019-07-03T10:08:03.845+0800 I NETWORK [rsSync] SocketException: remote: 192.168.168.122:27017 error: 9001 socket exception [RECV_ERROR] server [192.168.168.122:27017]
      2019-07-03T10:08:03.853+0800 I NETWORK [rsSync] trying reconnect to 192.168.168.122:27017 (192.168.168.122) failed
      2019-07-03T10:08:03.928+0800 I NETWORK [rsSync] reconnect 192.168.168.122:27017 (192.168.168.122) ok
      2019-07-03T10:08:03.939+0800 E REPL [rsSync] 16465 recv failed while exhausting cursor
      2019-07-03T10:08:03.939+0800 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
      2019-07-03T10:08:08.939+0800 I REPL [rsSync] initial sync pending
      2019-07-03T10:08:08.958+0800 I REPL [ReplicationExecutor] syncing from: 192.168.168.122:27017
      2019-07-03T10:08:09.204+0800 I REPL [rsSync] initial sync drop all databases
      2019-07-03T10:08:09.205+0800 I STORAGE [rsSync] dropAllDatabasesExceptLocal 3
      2019-07-03T10:08:09.221+0800 I JOURNAL [rsSync] journalCleanup...
      2019-07-03T10:08:09.221+0800 I JOURNAL [rsSync] removeJournalFiles
      2019-07-03T10:08:09.895+0800 I JOURNAL [rsSync] journalCleanup...
      2019-07-03T10:08:09.895+0800 I JOURNAL [rsSync] removeJournalFiles
      ...
      resync from the beginning ...

      ```
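For anyone triaging a similar log, here is a minimal Python sketch that pulls the relevant events out of a secondary's log: the `rsSync` socket exceptions and the "initial sync attempt failed, N attempts remaining" lines. It assumes the 3.0-style line layout shown in the excerpt above (timestamp, severity, component, `[thread]`, message). The names `scan_secondary_log` and `SyncEvent` are made up for illustration and are not part of any MongoDB tool.

```python
import re
from typing import Iterable, List, NamedTuple

# Assumed layout, taken from the excerpt above:
#   <timestamp> <severity letter> <component> [<thread>] <message>
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<sev>[A-Z])\s+(?P<comp>\S+)\s+\[(?P<thread>[^\]]+)\]\s+(?P<msg>.*)$"
)
ATTEMPT_FAILED = re.compile(r"initial sync attempt failed, (\d+) attempts remaining")


class SyncEvent(NamedTuple):
    """Hypothetical record type used only by this sketch."""
    timestamp: str
    kind: str    # "socket_error" or "attempt_failed"
    detail: str  # full message, or the number of attempts remaining


def scan_secondary_log(lines: Iterable[str]) -> List[SyncEvent]:
    """Collect rsSync socket exceptions and failed initial sync attempts."""
    events: List[SyncEvent] = []
    for raw in lines:
        match = LOG_LINE.match(raw.strip())
        if match is None:
            continue  # skip blank lines, "..." elisions, and free-form notes
        msg = match.group("msg")
        failed = ATTEMPT_FAILED.search(msg)
        if failed:
            events.append(SyncEvent(match.group("ts"), "attempt_failed", failed.group(1)))
        elif match.group("thread") == "rsSync" and "SocketException" in msg:
            events.append(SyncEvent(match.group("ts"), "socket_error", msg))
    return events


# Three lines copied from the secondary's log above.
sample = [
    "2019-07-03T10:08:03.845+0800 I NETWORK [rsSync] SocketException: remote: "
    "192.168.168.122:27017 error: 9001 socket exception [RECV_ERROR] server "
    "[192.168.168.122:27017]",
    "2019-07-03T10:08:03.939+0800 E REPL [rsSync] 16465 recv failed while exhausting cursor",
    "2019-07-03T10:08:03.939+0800 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining",
]

events = scan_secondary_log(sample)
for event in events:
    print(event.timestamp, event.kind, event.detail)
```

Heartbeat errors from `ReplExecNetThread-0` are deliberately ignored here, since in the excerpt it is the `rsSync` failure that is immediately followed by "initial sync attempt failed".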

            Assignee:
            backlog-server-triage [HELP ONLY] Backlog - Triage Team
            Reporter:
            aaron.tai Aaron Tai Wei Han
            Votes:
            0
            Watchers:
            3
