[CSHARP-821] Driver fails to reconnect with the server Created: 12/Sep/13  Updated: 04/Apr/15  Resolved: 04/Apr/15

Status: Closed
Project: C# Driver
Component/s: None
Affects Version/s: 1.8.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Khalid Salomão Assignee: Unassigned
Resolution: Done Votes: 3
Labels: connection, driver
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows .net 4 and .net 4.5 with 1.8.2 driver
Mongodb 2.4.6


Backwards Compatibility: Fully Compatible

 Description   

C# Driver fails to reconnect to the server in some situations when the MongoDB server is slow to respond and the connection times out, or when it momentarily fails to connect due to some temporary cluster unavailability.

We have a MongoDB 2.4.6 cluster running on a Windows server on AWS with a normal (slow) EBS disk. In this scenario, we have noted several connectivity problems where the driver fails to reconnect after a brief (60 seconds or less) unavailability.

Note: this issue is related to the driver failing to reconnect to the MongoDB server after some kind of connectivity problem.

The error message in most cases is "Server instance {0} is no longer connected.".

Exploring the driver code, I noted some possible issues:

On "MongoServerInstance.cs", on the method "MongoServerInstance.AcquireConnection" lines 382-394:
Whenever a ping fails ("MongoServerInstance.Ping", line 667 - 690), the connection is set to Disconnected by "SetState(MongoServerState.Disconnected);" (line 687).

So subsequent "AcquireConnection" calls fail with the message "Server instance {0} is no longer connected." instead of trying to reconnect. This means a brief connection problem can stop all future operations...
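
For reference, the check being replaced is, roughly, a test of the instance state. The following is a simplified sketch from memory, not the verbatim 1.8.2 source:

internal MongoConnection AcquireConnection()
{
    lock (_serverInstanceLock)
    {
        // Presumed current behavior: any state other than Connected throws,
        // even though the instance may heal on the next ping a few seconds later.
        if (_state != MongoServerState.Connected)
        {
            var message = string.Format("Server instance {0} is no longer connected.", _address);
            throw new InvalidOperationException(message);
        }
    }

    return _connectionPool.AcquireConnection();
}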

My suggestion is to change one line of code (line 386) in "MongoServerInstance.AcquireConnection" (lines 382-394), so that the method reads:

internal MongoConnection AcquireConnection()
{
    lock (_serverInstanceLock)
    {
        if (_permanentlyDisconnected)
        {
            var message = string.Format("Server instance {0} is no longer connected.", _address);
            throw new InvalidOperationException(message);
        }
    }

    return _connectionPool.AcquireConnection();
}

Server setup:
To illustrate the current MongoDB cluster scenario: due to the low performance of this setup, we sometimes face three kinds of connectivity problems:
1. sometimes we have a temporary (60 seconds or less) unavailability due to database file allocation (NTFS on a normal AWS EBS volume takes a long time to allocate a new file...)
2. sometimes there is a connectivity problem due to some application being located outside the US while the MongoDB cluster is in the US (something to be expected)...
3. once or twice the cluster had to re-elect the primary member...



 Comments   
Comment by William Holroyd [ 11/Jun/14 ]

It's funny that Amazon was mentioned here. We observed this same issue, but the root cause ended up being a configuration issue. We found that between our Windows app server and our RedHat Mongo server there were a considerable number of TCP retransmits on larger requests before the connection was simply reset due to no response. Smaller property/id requests were fine, but when I wanted to store a blob, it all fell apart.

Our network sniff analysis from both machines showed that our Windows server had a low re-transmit timeout given the latency we were experiencing, and it was expiring before the Mongo server was able to respond. We also noticed in some traces that the ECN/CWR flag had been set by Amazon's infrastructure, meaning Amazon's network had some sort of congestion going on, since those flags are set by routers as long as both hosts support it. By default, Windows Server 2012 has a minimum RTO of 10ms for datacenter-sourced traffic. We worked around the issue by using the custom template and adjusting the RTO to 100 and the ICW to 10, at which point the connection stopped failing, but we still saw some intermittent re-transmits. When we set the RTO to 300, the re-transmits completely disappeared.

As much as we'd like to have sub-10ms latency, in a virtualized public cloud environment there isn't any real guarantee that latency is going to be perfect when the hosting company can and will migrate machines without notice to different datacenters or between availability zones, depending on how you've configured things.

While I'm not calling this the end-all resolution, it got our test environments working again. You can play around with the following command and values...

netsh int tcp set supplemental custom 300 10 dctcp enabled 20

On another note, I don't know if those types of issues would bubble up through whatever the driver uses to explicitly retry certain failures, or even give that level of detail, since that's an OS TCP stack concern.

Comment by John Smith [ 18/Mar/14 ]

Sounds familiar, our problem seems to be the Ping method as well.
Please see https://jira.mongodb.org/browse/CSHARP-817

Comment by Moshe Shperling [ 20/Jan/14 ]

Hi,
I am having the same problem/exception.
Here are some details. Our mongo is running on Windows Server 2008. It is a replica set that consists of a single instance. The exception pops up when I try to pull a group (of even 20) products from a remote client. If I pull it from the same machine where MongoDB is sitting, it works well. Also, if I pull the items from a mongo client like Robomongo or MongoVue, it works fine.

Here is my flow:

  • connect to mongo;
  • pull column headers, which are going to be used as fields for a multiple-fields collection. Headers is a collection of its own.
  • pull a bulk of items from the products collection <- here the exception fires.

here is some code:
MongoClientSettings mcSettings = new MongoClientSettings();

mcSettings.ConnectTimeout = new TimeSpan(0, 3, 0);
mcSettings.SocketTimeout = new TimeSpan(0, 3, 0);

var creds = MongoCredential.CreateMongoCRCredential("inventory", this.userName, this.password);
mcSettings.Credentials = new[] { creds };

MongoServerAddress server = new MongoServerAddress(this.host, this.port);
mcSettings.Server = server;

MongoClient mc = new MongoClient(mcSettings);

this.mongo = mc.GetServer();
this.mongo.Connect();
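
The pull that fires the exception is roughly the following (a sketch, not the exact production code; "products" is the collection named above):

var db = this.mongo.GetDatabase("inventory");
var products = db.GetCollection<BsonDocument>("products");

// This is the call that throws "Server instance {0} is no longer connected."
// (ToList() requires a "using System.Linq;" directive).
var items = products.FindAll().SetLimit(20).ToList();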

thanks in advance for any help

Comment by Khalid Salomão [ 12/Sep/13 ]

1. This situation is difficult to replicate.
the message "Server instance

{0}

is no longer connected." was one situation that I encountered and the application was unable to leave this state.

I have encountered another situation where a web service was stuck with the error "Unable to connect in the specified time frame of '00:00:00.0200000'.". But the connect timeout was set to 10 seconds, and it was throwing this message for every operation for several hours. After a manual refresh of IIS the web service started working again, and it has been without incident for two weeks.

Note that the above web service had been working fine for several weeks, normally recovering from connection errors. This issue happened only once in that time frame.

So the two situations above are kind of rare, but when we have several applications using MongoDB in different scenarios it starts to become a really big issue!

2. No, only collection Find, Insert and Save operations

3. DNS names. The replica set is one primary (with higher priority, a large instance), a secondary (medium) and an arbiter (small). They are in the same region but in different availability zones. Primary Preferred is used as the read preference and the list of addresses is used to connect to the replica set.
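
For context, the client settings are along these lines (a sketch; the host names are placeholders, not our real addresses):

var settings = new MongoClientSettings
{
    Servers = new[]
    {
        new MongoServerAddress("rs-member-1.example.com"),
        new MongoServerAddress("rs-member-2.example.com"),
        new MongoServerAddress("rs-member-3.example.com")
    },
    ConnectionMode = ConnectionMode.ReplicaSet,
    ReadPreference = ReadPreference.PrimaryPreferred
};
var client = new MongoClient(settings);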

Comment by Craig Wilson [ 12/Sep/13 ]

There are certainly many things that could be done, and all your suggestions are ways of fixing the symptom. However, this situation should never happen. When one of the servers gets into a Disconnected state, it should already be pinging every 10 seconds in an attempt to "heal" itself. When it finally connects, it will get put back into the connected server list and be eligible for use again.

Ultimately, I'm not concerned about the times when it eventually reconnects. That is how it's supposed to work. I'm concerned about the times when it stays permanently disconnected. I have a couple of questions:

  1. When you say you end up in this situation permanently, are you receiving the "Server instance {0} is no longer connected." message every time? I'm wondering how you are getting into a state where the server has been chosen and never gets relinquished.

  2. Are you doing your work inside a request (db.RequestStart())?
  3. You mentioned Amazon. Are you using Elastic IPs? Are you using DNS names? What does your replica set configuration look like?
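
(For clarity, "inside a request" means pinning all operations to a single connection via RequestStart. A generic sketch, not code from your application:)

using (server.RequestStart(database))
{
    // Everything in this block is guaranteed to use the same underlying
    // connection until the request is disposed.
    var collection = database.GetCollection<BsonDocument>("test");
    collection.Insert(new BsonDocument("x", 1));
    var doc = collection.FindOne();
}
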
Comment by Khalid Salomão [ 12/Sep/13 ]

Hi Craig,

Thanks for your fast response!

Comments to your questions:

Yes, sometimes the connection is not re-established and stays that way (the current fix is to restart the application)!

We have experienced that kind of situation in different applications that run for a long period of time. Before looking into the code, I tried some drastic fixes like calling "MongoServer.Disconnect" whenever an error happened, but it led to more instability...

Yes, we try to handle the errors whenever possible.
For example: some applications log the errors and put the object in a queue for retrying later. But since the connection is not re-established, errors keep happening...
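
The retry pattern itself is nothing fancy; roughly the following (a simplified sketch, not our actual code; the queue is a placeholder):

// Requires: using System; using System.Collections.Generic;
//           using MongoDB.Bson; using MongoDB.Driver;
static void SaveWithRetryQueue(MongoCollection<BsonDocument> collection,
                               BsonDocument document,
                               Queue<BsonDocument> retryQueue)
{
    try
    {
        collection.Save(document);
    }
    catch (InvalidOperationException ex)
    {
        // "Server instance {0} is no longer connected." lands here. The document
        // is parked for a later retry, but if the driver never re-establishes the
        // connection, every retry fails the same way.
        Console.WriteLine("Save failed, queuing for retry: " + ex.Message);
        retryQueue.Enqueue(document);
    }
}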

About the possible fix:

I understand the problem with the change of the primary; indeed it would also keep the driver from reconnecting!

What about putting the driver in a state where it must try to connect from the start?

By using "DiscoveringMongoServerProxy.Discover" and recreating the server proxy...

This would be a slow reconnect, but at least the driver would be able to heal itself.
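
From the application side, the blunt equivalent would be to force a full reconnect when the stuck state is detected (a sketch; I am assuming "MongoServer.Reconnect" closes the existing connections and reconnects):

// Blunt application-level workaround: when operations keep failing with
// "no longer connected", drop everything and let the driver rediscover the
// cluster. Slow, but it gives the driver a chance to heal itself.
server.Reconnect();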

What do you think?

Regards,
Khalid

Comment by Craig Wilson [ 12/Sep/13 ]

Are you saying the server never comes back? We test these scenarios (and more) regularly and have not seen this behavior. Given that you are expecting errors, I assume you are handling them and retrying as necessary/possible?

Explanation below:

Once a server is put into the Disconnected state, it should not be eligible for AcquireConnection to be called on it. There is a race condition (not one we can prevent) where a server has been chosen and gets disconnected before AcquireConnection is called on it. That is the only time you should be seeing this error. In addition, this isn't a permanent state. Heartbeats happen every 10 seconds to reconnect a server.

Your solution above is simply to attempt to reconnect. This actually isn't a great solution because it ignores certain things. If you are talking to a primary and receive this error, it's likely that this particular server is no longer the primary and, as such, waiting to get reconnected and then issuing a write would simply fail on the server anyway. There are other scenarios as well.
