[SERVER-30789] Unable to Initial Sync Big Database on non-Linux System due to mongod TCP Keepalive Constraint
Created: 23/Aug/17
Updated: 09/Oct/17
Resolved: 13/Sep/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: WenniZ
Assignee: Ramon Fernandez Marina
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

We are running an initial sync of a big database (~70Gb) from replica set member A to replica set member B. Both members run a mongod instance on Windows. The Windows TCP keepalive interval is set to 30 minutes.

From this documentation page:
https://docs.mongodb.com/v3.2/faq/diagnostics/
we understand that MongoDB sets its own TCP timeout interval to 10 minutes on Windows.

With these settings we are not able to complete the initial sync, because:

1. We need more than 10 minutes to build the index after the big database is copied from A to B.
2. It looks like MongoDB cannot ACK TCP requests while it is building the index.
3. Consequently, once the index build finishes, instance B receives a TCP connection timeout error and has to start the whole initial sync over.
4. So we are now stuck on this big database.

Please suggest.

Log containing this error:

2017-08-23T07:59:00.164+0800 I STORAGE  [rsSync] 14456650 objects cloned so far from collection DB.COL
2017-08-23T07:59:04.007+0800 I STORAGE  [rsSync] clone DB.COL 14457727
2017-08-23T07:59:53.315+0800 I INDEX    [rsSync] build index on: stratus.position properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "DB.COL" }
2017-08-23T07:59:53.316+0800 I INDEX    [rsSync]         building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2017-08-23T07:59:56.019+0800 I -        [rsSync]   Index Build: 84400/14467992 0%
2017-08-23T07:59:59.000+0800 I -        [rsSync]   Index Build: 195300/14467992 1%
2017-08-23T08:00:02.002+0800 I -        [rsSync]   Index Build: 271400/14467992 1%
......
2017-08-23T08:10:21.002+0800 I -        [rsSync]   Index Build: 14266400/14467992 98%
2017-08-23T08:10:24.000+0800 I -        [rsSync]   Index Build: 14325000/14467992 99%
2017-08-23T08:10:27.002+0800 I -        [rsSync]   Index Build: 14408000/14467992 99%
2017-08-23T08:10:38.095+0800 I INDEX    [rsSync] build index done.  scanned 14467992 total records. 644 secs
2017-08-23T08:10:38.104+0800 I REPL     [rsSync] initial sync data copy, starting syncup
2017-08-23T08:10:38.106+0800 I REPL     [rsSync] oplog sync 1 of 3
2017-08-23T08:10:38.109+0800 I NETWORK  [rsSync] Socket  send() errno:10054 An existing connection was forcibly closed by the remote host. IP:port
2017-08-23T08:10:38.114+0800 I REPL     [rsSync] connection lost to hostname:port; is your tcp keepalive interval set appropriately?
2017-08-23T08:10:38.136+0800 E REPL     [rsSync] 9001 socket exception [FAILED_STATE] server [hostname:port(IP) failed]
2017-08-23T08:10:38.136+0800 E REPL     [rsSync] initial sync attempt failed, 8 attempts remaining
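
The timing in the log above can be checked against the 10-minute limit with a short script. This is only a sketch: the 600-second limit is the figure quoted in the description, not a value read from MongoDB, and the helper names (`parse_ts`, `seconds_between`) are made up for illustration.

```python
from datetime import datetime

# Assumption: the 10-minute limit quoted in the description, in seconds.
ASSUMED_KEEPALIVE_LIMIT_SECS = 600

# The two log lines that bracket the index build, copied from the log above.
BUILD_START = '2017-08-23T07:59:53.315+0800 I INDEX    [rsSync] build index on: stratus.position properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "DB.COL" }'
BUILD_DONE = '2017-08-23T08:10:38.095+0800 I INDEX    [rsSync] build index done.  scanned 14467992 total records. 644 secs'


def parse_ts(log_line):
    """Parse the leading timestamp of a mongod log line."""
    return datetime.strptime(log_line.split()[0], "%Y-%m-%dT%H:%M:%S.%f%z")


def seconds_between(start_line, end_line):
    """Return the elapsed seconds between two log lines."""
    return (parse_ts(end_line) - parse_ts(start_line)).total_seconds()


elapsed = seconds_between(BUILD_START, BUILD_DONE)
exceeds_limit = elapsed > ASSUMED_KEEPALIVE_LIMIT_SECS

print(f"index build took {elapsed:.2f}s; exceeds assumed limit: {exceeds_limit}")
```

The computed duration is about 644.8 seconds, which agrees with the "644 secs" reported by mongod and is longer than the assumed 600-second limit.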



 Comments   
Comment by Ramon Fernandez Marina [ 13/Sep/17 ]

Apologies for the delay in the response. To the best of my understanding:

  • In our docs we're using the terms "keepalive", "keepalive time" and "keepalive period" interchangeably
  • This setting is the maximum window of time that the system will let a connection be idle without sending a packet
  • If this window is set too big, MongoDB will ignore that setting and make it smaller to avoid issues like the one you describe
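
As an illustration of what this setting controls at the socket level, here is a minimal Python sketch that turns on keepalive for a single socket. It is not how mongod does it; the function name `enable_keepalive` and the 300-second and 10-second values are assumptions chosen for the example.

```python
import socket


def enable_keepalive(sock, idle_secs=300, interval_secs=10):
    """Enable TCP keepalive on sock with an idle time and probe interval.

    idle_secs and interval_secs are illustrative values, not MongoDB defaults.
    """
    # Turn keepalive probing on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    if hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: (on/off, idle time in ms, probe interval in ms).
        sock.ioctl(
            socket.SIO_KEEPALIVE_VALS,
            (1, idle_secs * 1000, interval_secs * 1000),
        )
    else:
        # Linux and similar systems expose per-socket options in seconds.
        if hasattr(socket, "TCP_KEEPIDLE"):
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_secs)
        if hasattr(socket, "TCP_KEEPINTVL"):
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_secs)
    return sock


demo_sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = demo_sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
demo_sock.close()
print("keepalive enabled:", keepalive_on)
```

With keepalive enabled, the operating system sends a probe once the connection has been idle for the configured time, so an otherwise silent connection is not treated as dead by the peer or by intermediate devices.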

I'd recommend you check your values of KeepAliveTime and KeepAliveInterval to be sure, and lower KeepAliveTime if it's too large. That said, my guess is that the clocks on the nodes you're trying to initial sync are diverging enough to cause the problem you're seeing, since we haven't had other users report socket timeouts due to large index builds after initial syncs.
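
One way to check these values is to read them from the Windows registry, where they are stored in milliseconds under the Tcpip\Parameters key. The sketch below is a hedged example and not an official tool: the 10-minute comparison limit is the figure quoted in the description, and the function names are hypothetical. On non-Windows systems it only reports that there is nothing to check.

```python
import sys

# Standard location of the Windows TCP/IP parameters (under HKEY_LOCAL_MACHINE).
TCPIP_PARAMS_KEY = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
# Windows uses 2 hours when KeepAliveTime is not set in the registry.
WINDOWS_DEFAULT_KEEPALIVE_TIME_MS = 7_200_000
# Assumption: the 10-minute limit quoted in the description, in milliseconds.
ASSUMED_LIMIT_MS = 10 * 60 * 1000


def read_tcpip_param_ms(name):
    """Return a Tcpip\\Parameters value in ms, or None if unset or not on Windows."""
    if sys.platform != "win32":
        return None
    import winreg

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMS_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, name)
            return int(value)
    except FileNotFoundError:
        return None


def keepalive_time_too_large(keepalive_time_ms, limit_ms=ASSUMED_LIMIT_MS):
    """Return True if the keepalive time is above the limit."""
    return keepalive_time_ms > limit_ms


if __name__ == "__main__":
    if sys.platform != "win32":
        print("Not running on Windows; nothing to check.")
    else:
        configured = read_tcpip_param_ms("KeepAliveTime")
        effective = WINDOWS_DEFAULT_KEEPALIVE_TIME_MS if configured is None else configured
        print("KeepAliveTime (ms):", effective)
        print("KeepAliveInterval (ms):", read_tcpip_param_ms("KeepAliveInterval"))
        print("above assumed limit:", keepalive_time_too_large(effective))
```

For the 30-minute setting mentioned in the description, `keepalive_time_too_large(30 * 60 * 1000)` returns `True`.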

Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. A question like this involving more discussion would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources.

Regards,
Ramón.

Comment by WenniZ [ 27/Aug/17 ]

@ramon.fernandez
Please let me know if you need additional info.

Comment by WenniZ [ 23/Aug/17 ]

what version of MongoDB are you using?
3.2.12-7-gacfa77f

you mention Windows, can you please elaborate on the exact version?
Windows Server 2012 R2

is 70Gb the data size or the size on disk?
It's the size on hard disk.

Comment by Ramon Fernandez Marina [ 23/Aug/17 ]

wekurtz,

  • what version of MongoDB are you using?
  • you mention Windows, can you please elaborate on the exact version?
  • is 70Gb the data size or the size on disk?

We haven't heard of other users running into this issue so we would like to have a better understanding of what's happening and try to reproduce it on our end.

Thanks,
Ramón.
