[SERVER-42059] Mongo replication sync halfway SocketException: remote: 192.168.168.122:27017 error: 9001 socket exception [RECV_ERROR] server [192.168.168.122:27017]
Created: 03/Jul/19  Updated: 06/Dec/22  Resolved: 03/Jul/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Aaron Tai Wei Han Assignee: Backlog - Triage Team
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Server Triage

 Description   

I have one MongoDB v2.4.10 server with 1.7 TB of data, and I am trying to migrate it to a new server and upgrade it to MongoDB v3.0.15.

I set up a new MongoDB v3.0.15 server and configured it as a secondary in the same replica set, so that it syncs from the v2.4.10 primary.

The secondary was in STARTUP2 state, and the initial sync was almost finished, judging by the growth of disk usage on the new machine running MongoDB v3.0.15.

However, some socket exceptions caused the sync between my two machines to restart from the beginning. Is there anything I can configure or set up to prevent this error from happening again? I don't want to spend another 7 days syncing 1.7 TB only for it to fail again.

Below are some logs from my MongoDB servers:

Primary (MongoDB v2.4.10):

```
Wed Jul 3 10:03:59.196 [conn21] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [101.0.0.182:32829]
```

Secondary (MongoDB v3.0.15):
```

...
2019-07-03T09:54:29.169+0800 I NETWORK [ReplExecNetThread-0] Socket recv() timeout 192.168.168.122:27017
2019-07-03T09:54:29.169+0800 I NETWORK [ReplExecNetThread-0] SocketException: remote: 192.168.168.122:27017 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.168.122:27017]
2019-07-03T09:54:29.169+0800 I NETWORK [ReplExecNetThread-0] DBClientCursor::init call() failed
2019-07-03T09:54:29.169+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.168.122:27017; Location10276 DBClientBase::findN: transport error: 192.168.168.122:27017 ns: admin.$cmd query: { replSetHeartbeat: "ArchiverReplica", pv: 1, v: 1, from: "x.x.x.x:27017", fromId: 1, checkEmpty: false }

2019-07-03T09:54:29.170+0800 W NETWORK [ReplExecNetThread-0] Failed to connect to 192.168.168.122:27017 after 1 milliseconds, giving up.
2019-07-03T09:54:29.170+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.168.122:27017; Location18915 Failed attempt to connect to 192.168.168.122:27017; couldn't connect to server 192.168.168.122:27017 (192.168.168.122), connection attempt failed
...
2019-07-03T10:07:41.452+0800 W NETWORK [ReplExecNetThread-0] Failed to connect to 192.168.168.122:27017 after 4995 milliseconds, giving up.
2019-07-03T10:07:41.452+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.168.122:27017; Location18915 Failed attempt to connect to 192.168.168.122:27017; couldn't connect to server 192.168.168.122:27017 (192.168.168.122), connection attempt failed
2019-07-03T10:07:43.602+0800 I REPL [ReplicationExecutor] Member 192.168.168.122:27017 is now in state PRIMARY
...
2019-07-03T10:08:03.845+0800 I NETWORK [rsSync] Socket recv() errno:104 Connection reset by peer 192.168.168.122:27017
2019-07-03T10:08:03.845+0800 I NETWORK [rsSync] SocketException: remote: 192.168.168.122:27017 error: 9001 socket exception [RECV_ERROR] server [192.168.168.122:27017]
2019-07-03T10:08:03.853+0800 I NETWORK [rsSync] trying reconnect to 192.168.168.122:27017 (192.168.168.122) failed
2019-07-03T10:08:03.928+0800 I NETWORK [rsSync] reconnect 192.168.168.122:27017 (192.168.168.122) ok
2019-07-03T10:08:03.939+0800 E REPL [rsSync] 16465 recv failed while exhausting cursor
2019-07-03T10:08:03.939+0800 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
2019-07-03T10:08:08.939+0800 I REPL [rsSync] initial sync pending
2019-07-03T10:08:08.958+0800 I REPL [ReplicationExecutor] syncing from: 192.168.168.122:27017
2019-07-03T10:08:09.204+0800 I REPL [rsSync] initial sync drop all databases
2019-07-03T10:08:09.205+0800 I STORAGE [rsSync] dropAllDatabasesExceptLocal 3
2019-07-03T10:08:09.221+0800 I JOURNAL [rsSync] journalCleanup...
2019-07-03T10:08:09.221+0800 I JOURNAL [rsSync] removeJournalFiles
2019-07-03T10:08:09.895+0800 I JOURNAL [rsSync] journalCleanup...
2019-07-03T10:08:09.895+0800 I JOURNAL [rsSync] removeJournalFiles
...
resync from the beginning .......

```
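The secondary log above follows a recognizable sequence: socket exceptions (`RECV_TIMEOUT`, `RECV_ERROR`), then `initial sync attempt failed, 9 attempts remaining`, then `initial sync drop all databases`, which is where the data copied so far is discarded. The sketch below is a hypothetical helper (not part of MongoDB) that scans log lines for those three messages so that restarts can be counted quickly in a long log. It assumes the exact message wording shown in the v3.0.15 log quoted above; other server versions may phrase these messages differently, so treat the patterns as assumptions.

```python
import re
from dataclasses import dataclass, field

# Assumption: these patterns match only the message wording seen in the
# MongoDB 3.0.15 secondary log quoted above.
SOCKET_RE = re.compile(r"socket exception \[(\w+)\]")
FAILED_RE = re.compile(r"initial sync attempt failed, (\d+) attempts remaining")
DROP_RE = re.compile(r"initial sync drop all databases")


@dataclass
class SyncSummary:
    socket_errors: list = field(default_factory=list)       # e.g. "RECV_ERROR"
    attempts_remaining: list = field(default_factory=list)  # retry counters logged
    restarts: int = 0  # times the secondary dropped its databases and began again


def summarize_initial_sync(lines):
    """Hypothetical helper: collect initial-sync failure signals from log lines."""
    summary = SyncSummary()
    for line in lines:
        match = SOCKET_RE.search(line)
        if match:
            summary.socket_errors.append(match.group(1))
        match = FAILED_RE.search(line)
        if match:
            summary.attempts_remaining.append(int(match.group(1)))
        if DROP_RE.search(line):
            summary.restarts += 1
    return summary


if __name__ == "__main__":
    # Four lines copied from the secondary log above.
    sample = [
        "2019-07-03T09:54:29.169+0800 I NETWORK [ReplExecNetThread-0] SocketException: remote: 192.168.168.122:27017 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.168.122:27017]",
        "2019-07-03T10:08:03.845+0800 I NETWORK [rsSync] SocketException: remote: 192.168.168.122:27017 error: 9001 socket exception [RECV_ERROR] server [192.168.168.122:27017]",
        "2019-07-03T10:08:03.939+0800 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining",
        "2019-07-03T10:08:09.204+0800 I REPL [rsSync] initial sync drop all databases",
    ]
    print(summarize_initial_sync(sample))
    # SyncSummary(socket_errors=['RECV_TIMEOUT', 'RECV_ERROR'], attempts_remaining=[9], restarts=1)
```

To use it on a full log, pass an open file object (`summarize_initial_sync(open(path))`), since iterating a file yields its lines. The helper only reports what was logged; it does not explain why the connection to the primary dropped.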



 Comments   
Comment by Eric Sedor [ 03/Jul/19 ]

Unfortunately we aren't able to assist with this process here.

The SERVER project is for bugs and feature suggestions for active versions of the MongoDB server, and our support for MongoDB 3.0 ended in February 2018.

For assistance troubleshooting in this case, I encourage you to ask our community by posting on the mongodb-user group or on Stack Overflow with the mongodb tag.

Generated at Thu Feb 08 04:59:26 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.