[CSHARP-821] Driver fails to reconnect with the server Created: 12/Sep/13 Updated: 04/Apr/15 Resolved: 04/Apr/15 |
|
| Status: | Closed |
| Project: | C# Driver |
| Component/s: | None |
| Affects Version/s: | 1.8.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Khalid Salomão | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 3 |
| Labels: | connection, driver | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Windows .net 4 and .net 4.5 with 1.8.2 driver |
||
| Backwards Compatibility: | Fully Compatible |
| Description |
|
C# Driver fails to reconnect to the server in some situations when the MongoDb server is slow to respond and the connection times out, or when it momentarily fails to connect due to temporary cluster unavailability. We have a MongoDb 2.4.6 cluster running on a Windows server on AWS with a normal (slow) EBS disk. In this scenario, we have noted several connectivity problems where the driver fails to reconnect after a brief (60 seconds or less) unavailability.

Note: this issue is about the driver failing to reconnect to the MongoDb server after some kind of connectivity problem. The error message in most cases is "Server instance {0} is no longer connected.".

Exploring the driver code, I noted some possible issues. In "MongoServerInstance.cs", in the method "MongoServerInstance.AcquireConnection" (lines 382-394): whenever a ping fails ("MongoServerInstance.Ping", lines 667-690), the instance is set to Disconnected by "SetState(MongoServerState.Disconnected);" (line 687). Subsequent "AcquireConnection" calls then fail with the message "Server instance {0} is no longer connected." instead of trying to reconnect. So a brief connection problem can stop all future operations...

My suggestion is to change one line of code (line 386) in "MongoServerInstance.AcquireConnection" (lines 382-394) so that, instead of throwing "Server instance {0} is no longer connected." (formatted with _address), the method falls through to "return _connectionPool.AcquireConnection();" and gives the connection pool a chance to reconnect.

Server setup: |
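Until the driver heals itself, one application-side workaround for errors like the one described is to wrap operations in a small retry helper that backs off long enough for the driver's background ping to restore the connection. The sketch below is illustrative only: the TransientRetry name is a hypothetical helper, not driver API, and the generic Exception catch keeps the sketch self-contained (real code would catch the driver's MongoConnectionException).

```csharp
using System;
using System.Threading;

// Hypothetical application-side helper (not part of the driver): retry a
// delegate a few times, sleeping between attempts so the driver's
// background ping (every ~10 seconds in 1.8.x) can reconnect the server.
public static class TransientRetry
{
    public static T Execute<T>(Func<T> operation, int maxAttempts, TimeSpan delay)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception)
            {
                if (attempt >= maxAttempts) throw; // give up after the last attempt
                Thread.Sleep(delay);               // wait before retrying
            }
        }
    }
}
```

In practice the catch clause should be narrowed to transient connectivity exceptions so genuine application errors are not retried.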
| Comments |
| Comment by William Holroyd [ 11/Jun/14 ] |
|
It's funny that Amazon was mentioned here. We observed this same issue, but the root cause ended up being a configuration issue. We found that between our Windows app server and our RedHat Mongo server there were a considerable number of TCP retransmits on larger requests before the connection was simply reset due to no response. Smaller property/id requests were fine, but when I wanted to store a blob, it all fell apart.

Our network sniff analysis from both machines showed that our Windows server had a low retransmission timeout given the latency we were experiencing, and it was expiring before the Mongo server was able to respond. We also noticed in some traces that the ECN/CWR flag had been set by Amazon's infrastructure, meaning Amazon's network had some sort of congestion going on, since those flags are set by routers as long as both hosts support it.

By default, Windows Server 2012 has a minimum RTO of 10ms for datacenter-sourced traffic. We worked around the issue by using the custom template and adjusting the RTO to 100 and the ICW to 10, at which point the connection stopped failing, but we still saw some intermittent retransmits. If we set the RTO to 300, the retransmits completely disappeared.

As much as we'd like to have sub-10ms latency, in a virtualized public cloud environment there isn't any real guarantee that latency is going to be perfect when the hosting company can and will migrate machines without notice to different datacenters or between availability zones, depending on how you've configured things. While I'm not calling this the end-all resolution, it got our test environments working again. You can play around with the following command and values:

netsh int tcp set supplemental custom 300 10 dctcp enabled 20

On another note, I don't know if these types of issues would bubble up through whatever the driver is using to explicitly retry certain failures, or even give that level of detail, since that's an OS TCP stack concern. |
| Comment by John Smith [ 18/Mar/14 ] |
|
Sounds familiar; our problem seems to be the Ping method as well. |
| Comment by Moshe Shperling [ 20/Jan/14 ] |
|
Hi, here is my flow:
Here is some code:

    var mcSettings = new MongoClientSettings();
    mcSettings.ConnectTimeout = new TimeSpan(0, 3, 0);
    var creds = MongoCredential.CreateMongoCRCredential("inventory", this.userName, this.password);
    mcSettings.Credentials = new[] { creds };
    mcSettings.Server = new MongoServerAddress(this.host, this.port);
    var mc = new MongoClient(mcSettings);
    this.mongo = mc.GetServer();

Thanks in advance for any help |
| Comment by Khalid Salomão [ 12/Sep/13 ] |
|
1. This situation is difficult to replicate. "Server instance {0} is no longer connected." was one situation that I encountered where the application was unable to leave this state. I have encountered another situation where a web service was stuck with the error "Unable to connect in the specified time frame of '00:00:00.0200000'.". But the connection timeout was set to 10 secs, and it threw this message for every operation for several hours. After a manual refresh of IIS, the web service started working again and has been without incident for two weeks. Note that this web service had been working fine for several weeks, normally recovering from connection errors; this issue happened only once in that time frame. So the two situations above are kind of rare, but when we have several applications using MongoDb in different scenarios it starts to be a really big issue!

2. No, only collection Find, Insert and Save operations.

3. DNS names. The replica set is one Primary (with higher priority) (large), a secondary (medium) and an arbiter (small). They are in the same region but in different availability zones. PrimaryPreferred is used as the read preference, and the list of addresses is used to connect to the replica set. |
| Comment by Craig Wilson [ 12/Sep/13 ] |
|
There are certainly many things that could be done, and all your suggestions are ways of fixing the symptom. However, this situation should never happen. When one of the servers gets into a Disconnected state, it should already be pinging every 10 seconds in an attempt to "heal" itself. When it finally connects, it will get put back into the connected server list and become eligible for selection again. Ultimately, I'm not concerned about the times when it eventually reconnects; this is how it's supposed to work. I'm concerned about the times when it stays permanently disconnected. I have a couple of questions:
|
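The ping-based self-healing described above can be sketched as a tiny state machine. The names (ServerMonitor, Heartbeat, ServerState) are assumptions for illustration, not the driver's actual internals:

```csharp
using System;

// Illustrative model of the driver's self-healing: a background heartbeat
// pings a disconnected instance and promotes it back to Connected as soon
// as a ping succeeds. Names are hypothetical, not driver internals.
public enum ServerState { Connected, Disconnected }

public class ServerMonitor
{
    private readonly Func<bool> _ping; // true when the server answers the ping
    public ServerState State { get; private set; }

    public ServerMonitor(Func<bool> ping)
    {
        _ping = ping;
        State = ServerState.Disconnected;
    }

    // Invoked by a background timer (every 10 seconds in the real driver).
    public void Heartbeat()
    {
        State = _ping() ? ServerState.Connected : ServerState.Disconnected;
    }
}
```

A monitor left Disconnected by a failed ping flips back to Connected on the first successful heartbeat, which is why a transient outage should not require an application restart.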
| Comment by Khalid Salomão [ 12/Sep/13 ] |
|
Hi Craig, thanks for your fast response! Comments on your questions:

Yes, sometimes the connection is not re-established and stays that way (the current fix is to restart the application)! We have experienced that kind of situation in different applications that run for long periods of time. Before looking into the code, I tried some drastic fixes like calling "MongoServer.Disconnect" whenever an error happened, but it led to more instability...

Yes, we try to handle the errors whenever possible.

About the possible fix: I understand the problem with the change of the Primary; indeed it would also keep the driver from reconnecting! What about putting the driver in a state where it must try to connect from the start, by using "DiscoveringMongoServerProxy.Discover" and recreating the server proxy? This would be a slow reconnect, but at least the driver would be able to heal itself. What do you think?

Regards, |
| Comment by Craig Wilson [ 12/Sep/13 ] |
|
Are you saying the server never comes back? We test these scenarios (and more) regularly and have not seen this behavior. Given that you are expecting errors, I assume you are handling them and retrying as necessary/possible?

Explanation below: once a server is put into the Disconnected state, it should not be eligible for AcquireConnection to be called on it. There is a race condition (not one we can prevent) where a server has been chosen and gets disconnected before AcquireConnection is called on it. That is the only time you should be seeing this error. In addition, this isn't a permanent state: heartbeats happen every 10 seconds to reconnect a server.

Your solution above is simply to attempt to reconnect. This actually isn't a great solution because it ignores certain things. If you are talking to a primary and receive this error, it's likely that this particular instance is no longer the primary and, as such, waiting to get reconnected and then issuing a write will simply fail on the server anyway. There are other scenarios as well. |
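The point about the primary change can be illustrated with a minimal selection sketch: after a failover, writes must go to whichever member is currently primary, not to the address the client last used. The Member and ReplicaSet types below are hypothetical illustrations, not driver API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch: reconnecting to the old address is not enough after a
// failover; the write target must be re-selected by current primary status.
public class Member
{
    public string Address { get; set; }
    public bool IsPrimary { get; set; }
    public bool IsReachable { get; set; }
}

public static class ReplicaSet
{
    public static Member SelectWriteTarget(IEnumerable<Member> members)
    {
        var primary = members.FirstOrDefault(m => m.IsReachable && m.IsPrimary);
        if (primary == null)
            throw new InvalidOperationException("No primary is currently available.");
        return primary;
    }
}
```

Blindly reconnecting to the first member and issuing a write would fail server-side once that member has stepped down, which is why the driver re-discovers the primary rather than merely re-opening the old connection.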