[SERVER-30728] Low Azure socket timeout may cause initial sync failure Created: 18/Aug/17  Updated: 27/Oct/17  Resolved: 14/Sep/17

Status: Closed
Project: Core Server
Component/s: Admin
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: WenniZ Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

Hi Team,

We are running a MongoDB instance on an Azure VM with default settings. We have noticed that Azure tends to close a socket connection once it has been idle for several minutes. When we try to initial sync another replica set member from the MongoDB instance on the Azure VM, the sync always fails: the connection is dropped whenever there is no network traffic for several minutes (e.g., while the member that is starting up is building an index), and initial sync then starts all over again.

A sample log snippet:

[building index here...]
2017-08-17T19:06:30.632+0800 I NETWORK [rsSync] Socket recv() errno:10053 An established connection was aborted by the software in your host machine. [***.***.***.***:*****]
2017-08-17T19:06:30.632+0800 I NETWORK [rsSync] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_ERROR] server [***.***.***.***:*****]
2017-08-17T19:06:30.632+0800 I NETWORK [rsSync] DBClientCursor::init call() failed
2017-08-17T19:06:30.640+0800 E REPL [rsSync] 13386 socket error for mapping query
2017-08-17T19:06:30.640+0800 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
2017-08-17T19:06:35.641+0800 I REPL [rsSync] initial sync pending
2017-08-17T19:06:35.643+0800 I REPL [ReplicationExecutor] syncing from: <HOSTNAME>:*****
2017-08-17T19:06:36.454+0800 I REPL [rsSync] initial sync drop all databases
2017-08-17T19:06:36.454+0800 I STORAGE [rsSync] dropAllDatabasesExceptLocal 14
2017-08-17T19:06:43.928+0800 I REPL [rsSync] initial sync clone all databases

For MongoDB clients, this can be resolved by setting MaxConnectionIdleTime, but there seems to be no way to configure the same for replica set members, so Azure users (unless they tweak OS settings) will find it hard to sync data from an Azure VM to another replica set member.
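
The OS-level alternative mentioned above is presumably TCP keepalive tuning: if keepalive probes are sent more often than the idle timeout, the connection never looks idle to Azure. The following is only a socket-level sketch of that idea in Python, not MongoDB code. The helper name enable_keepalive, the 120-second keepalive idle time, and the probe interval and count are assumptions made for illustration; the 240-second figure is the 4-minute default reported in the comments on this ticket.

```python
import socket

AZURE_IDLE_TIMEOUT_S = 4 * 60  # 4-minute default idle timeout reported in this ticket
KEEPALIVE_IDLE_S = 120         # assumed value, chosen to sit well below the timeout


def enable_keepalive(sock, idle_s=KEEPALIVE_IDLE_S, interval_s=10, probes=5):
    """Turn on TCP keepalive and, where the platform exposes it, shorten the idle time."""
    if idle_s >= AZURE_IDLE_TIMEOUT_S:
        raise ValueError("keepalive idle time must be below the idle timeout")
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning options are platform specific; set only those that exist.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
s.close()
```

A server process such as mongod does not expose these per-socket options to the user, so in practice the equivalent change would be made through the operating system's system-wide keepalive settings.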

Can we have an option to either specify a maximum connection idle time for replica set members, or make initial sync not fail completely on a single connection failure?
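
To illustrate the second request, here is a purely hypothetical Python sketch of the difference between restarting everything and retrying only the operation that failed. It does not reflect how mongod implements initial sync; FlakyConnection, clone_with_retry, and the batch names are invented for this example.

```python
class FlakyConnection:
    """Hypothetical sync source whose idle connection is dropped on one call."""

    def __init__(self, drop_on_call):
        self.calls = 0
        self.drop_on_call = drop_on_call

    def fetch(self, batch_id):
        self.calls += 1
        if self.calls == self.drop_on_call:
            raise ConnectionAbortedError("connection dropped after idle period")
        return f"batch-{batch_id}"


def clone_with_retry(conn, batch_ids, max_retries=3):
    """Fetch every batch, retrying only the batch whose fetch failed."""
    cloned = []
    for batch_id in batch_ids:
        for attempt in range(max_retries + 1):
            try:
                cloned.append(conn.fetch(batch_id))
                break
            except ConnectionError:
                # Give up only after the retry budget for this batch is spent.
                if attempt == max_retries:
                    raise
    return cloned


conn = FlakyConnection(drop_on_call=2)
result = clone_with_retry(conn, [1, 2, 3])
```

In this sketch the single dropped connection costs one extra fetch, and the batches that were already cloned are kept rather than discarded.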



 Comments   
Comment by Ramon Fernandez Marina [ 14/Sep/17 ]

Thanks for the update wekurtz, and glad to hear you've found a solution. I've adjusted the issue summary to make it easier for others to find and I'm going to close it.

Regards,
Ramón.

Comment by WenniZ [ 25/Aug/17 ]

Team - per the solution above, I'm fine with closing this issue.

Comment by WenniZ [ 21/Aug/17 ]

This appears to be related to the Azure TCP idle timeout setting, which is only 4 minutes by default.

A workaround is to increase the Azure idle timeout to 30 minutes using Azure PowerShell:

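# Sign in to the Azure account
Add-AzureRmAccount
# Fetch the public IP address (assumes exactly one is returned; otherwise filter with -Name and -ResourceGroupName)
$p = Get-AzureRmPublicIpAddress
# Raise the idle timeout from the 4-minute default to 30 minutes
$p.IdleTimeoutInMinutes = 30
# Apply the updated setting
Set-AzureRmPublicIpAddress -PublicIpAddress $p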

By doing so I've eliminated disconnections for my database.

Comment by WenniZ [ 18/Aug/17 ]

Hi Ramón,

Thank you for the prompt reply. I'm on 3.2 currently.
Let me upgrade to 3.4.7 to see if I could replicate this error.

Comment by Ramon Fernandez Marina [ 18/Aug/17 ]

wekurtz, what version of MongoDB are you running? It seems Azure may have some settings that account for the behavior you're seeing, and it would be useful for us to know whether the most recent production release (3.4.7) exhibits the behavior you describe.

Thanks,
Ramón.

Generated at Thu Feb 08 04:24:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.