[CSHARP-340] Unable to connect when Primary switches Created: 15/Oct/11 Updated: 07/Apr/23 Resolved: 15/Mar/12 |
|
| Status: | Closed |
| Project: | C# Driver |
| Component/s: | None |
| Affects Version/s: | 1.2 |
| Fix Version/s: | 1.4 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Stuart Johnson | Assignee: | Robert Stam |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Client app: Linux, Mono 2.10.5. Connection string: mongodb://mongo2.xxxxxx.com,mongo1.xxxxx.com,mongo0.xxxxx.com |
| Attachments: |
| Issue Links: |
| Description |
|
The Primary server on my replica set became slow due to a runaway process on the server. Eventually it failed altogether, and a new Primary was elected. First of all I got this:

MongoDB.Driver.MongoConnectionException: Unable to connect to the primary member of the replica set: Timeout waiting for a MongoConnection. ---> System.TimeoutException: Timeout waiting for a MongoConnection.

And then this, even after the server problem was fixed and the Primary was switched back to the original server:

System.TimeoutException: Timeout waiting for a MongoConnection.

I had to restart my client applications. |
| Comments |
| Comment by Robert Stam [ 15/Mar/12 ] |
|
Marking as fixed since this is really just a special case of |
| Comment by Robert Stam [ 15/Mar/12 ] |
|
Even though this ticket was originally presented in the context of the primary switching, based on the exception received this is actually just a special case of |
| Comment by Robert Stam [ 20/Feb/12 ] |
|
This issue ("Timeout waiting for a MongoConnection") was occasionally seen in 1.2 and should no longer occur in 1.3 or 1.3.1. We will be closing this issue soon unless we hear reports of this still occurring in 1.3.1. Note that it can sometimes be perfectly normal to get this exception. This could happen if your application has many more threads than the size of the connection pool, in which case it is possible for threads to get this exception under heavy load simply because the connection pool is not big enough. |
| Comment by Robert Stam [ 10/Feb/12 ] |
|
What version of the server were you running? Could you have encountered this issue: https://jira.mongodb.org/browse/SERVER-4405? The driver expects the primary server to close all open sockets when it steps down from being the primary. |
| Comment by chris rozacki [ 09/Feb/12 ] |
|
I just observed a similar issue. My replicas swap frequently. I was doing reads and writes at the same time, and today I was not able to continue reading due to: 'QueryFailure flag was not master and slaveok=false (response was { "$err" : "not master and slaveok=false", "code" : 13435 })'. However, I can see from my logs that the updates were fine. Driver 1.3.0.4309 |
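As background for readers, that error is what the server returns when a read reaches a member that is not the primary and secondary reads have not been enabled. A minimal sketch of enabling them follows; slaveOk=true is the assumed 1.x-era connection string option, the hosts and names are placeholders, and whether secondary reads are appropriate at all is a separate application decision.

```csharp
// Sketch: opt in to secondary reads so a query that lands on a secondary does
// not fail with "not master and slaveok=false". slaveOk=true is an assumed
// 1.x-era option; hosts, database, and collection names are placeholders.
using System;
using MongoDB.Driver;

class SlaveOkExample
{
    static void Main()
    {
        var server = MongoServer.Create(
            "mongodb://mongo0,mongo1,mongo2/?connect=replicaset&slaveOk=true");
        var count = server.GetDatabase("test").GetCollection("items").Count();
        Console.WriteLine(count);
    }
}
```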
| Comment by Robert Stam [ 16/Dec/11 ] |
|
Good point. I will repeat the test killing the mongod process instead of using stepDown. |
| Comment by Robert Stam [ 16/Nov/11 ] |
|
I closed it because I was unable to reproduce it. Can you try to reproduce with the latest source code for the driver from github? This also may have been a duplicate of |
| Comment by kuku [ 16/Nov/11 ] |
|
Why close? I also have this problem. My driver is 1.3 |
| Comment by Robert Stam [ 07/Nov/11 ] |
|
I have not been able to reproduce this. This might be a duplicate of another issue. Most of the comments on this JIRA refer to older versions of the C# driver. As far as I can tell no one is experiencing this with v1.3. If no one reports that they are seeing this with 1.3 (or, even better, with master), I will close this issue soon. |
| Comment by kuku [ 07/Nov/11 ] |
|
Looking forward to the update as early as possible, thank you |
| Comment by Robert Stam [ 01/Nov/11 ] |
|
We are still trying to reproduce this. It will be fixed in v1.4. You can watch this issue for updates. |
| Comment by kuku [ 01/Nov/11 ] |
|
I am using v1.3. Will this bug be fixed in a v1.4 update? |
| Comment by Robert Stam [ 31/Oct/11 ] |
|
This bug was reported against v1.2 of the C# driver. Since then v1.3 has been released. Are you using the latest version? |
| Comment by kuku [ 31/Oct/11 ] |
|
I have also run into this problem. How can it be solved? |
| Comment by Stuart Johnson [ 20/Oct/11 ] |
|
I'm using 1.2.0.4274 as released on the 14th of September. Last time I looked (v1.0 I think), the driver was not using asynchronous socket methods. In my experience with C# sockets, they can block indefinitely under some circumstances. Is that a possibility? I learnt my lesson when I did a lot of work with GPRS networks, which were forever giving me blocking headaches. The only way to make it 100% reliable was to switch to .NET's async model. |
| Comment by Robert Stam [ 20/Oct/11 ] |
|
The packets going back and forth to the one secondary are repeated Connect attempts (sets of three messages: isMaster, buildInfo and ping). There should have been connection attempts to all three members of the replica set. Probably the connection pools to the other two members are jammed.

You confirmed that you never call Disconnect, but looking more closely at the C# driver source code there are a few cases where the driver itself calls Disconnect as part of error recovery, so it could be that Disconnect is being called after all, just not by you. If you have the ability to set a breakpoint or add a logging/tracing statement to Disconnect we could find out if it's ever being called. If it is, then that would explain the "Timeout waiting for a MongoConnection" exception.

Also, just wanted to confirm that you are using only the 1.2 driver, not the latest from github, right? |
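For illustration, the temporary trace suggested above could look like the following if the driver is rebuilt from source. This is a hypothetical fragment, not the driver's actual code: the method body shown is a placeholder, and only the logging line is the point.

```csharp
// Hypothetical: a temporary trace at the top of the driver's
// MongoServer.Disconnect while debugging from source. The rest of the
// original method body is elided; only the WriteLine is the suggestion.
public void Disconnect()
{
    Console.Error.WriteLine("Disconnect called at {0:o}\n{1}",
        DateTime.UtcNow, Environment.StackTrace);
    // ... original Disconnect logic continues here ...
}
```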
| Comment by Stuart Johnson [ 20/Oct/11 ] |
|
Robert, I have run Wireshark. I tried a read operation on the failed client. A DNS lookup was performed on all three Mongo servers, and then packets start going back and forth with one of the Secondaries. I will e-mail you the Wireshark file. |
| Comment by Robert Stam [ 20/Oct/11 ] |
|
I might be able to simulate a sluggish server by eval'ing a JavaScript function that takes a long time to return, and maybe also by setting a really short client-side timeout. Your last paragraph is also really helpful. It hints at a possible failure mode. |
| Comment by Stuart Johnson [ 20/Oct/11 ] |
|
Hi Robert, I am unable to reproduce this too. I downed the Primary, and everything switched over as it should have. That includes your test app, and all the long-running server apps that failed last time (which have not been touched since the restart after the previous incident).

The only difference I can see is that the Primary originally died a slow death, which is hard if not impossible to test. It left the replica set in limbo for several minutes. I just tried fooling around with the firewall to intentionally block access to the primary, but it all came back up again when I put everything back the way it should be. The only significant difference I can see between my code and yours is that I call Connect() with a timeout when the app starts for the first time.

I just noticed while writing this that I left my laptop on, and it's still running client software originally affected by last week's failure. If I select an operation that reads from a Mongo collection, it immediately returns "Unable to connect to the primary member of the replica set: Timeout waiting for a MongoConnection.". It's like it's not even trying, yet I know if I restart this app, it will work. I'll see if I can get Wireshark on there, to see if it's doing anything. |
| Comment by Stuart Johnson [ 17/Oct/11 ] |
|
OK, thanks for trying. I will try your code out as soon as I can. I need to wait until it will cause the least disruption. |
| Comment by Robert Stam [ 17/Oct/11 ] |
|
I've repeated the test but this time instead of using stepDown I actually killed the mongod process of the current primary. The output is now different, but still looks normal. I'm attaching the new output as the file ConsoleWindowOutput-2.txt. In particular, note the appearance of the new status: ConnectedToSubset. This means that some of the members are offline (because I killed the mongod process), but there are enough members online for the replica set to be operational. When I bring the killed mongod process back online the status changes back to Connected. Let me know if you figure out a way I can reproduce the exception you are seeing. |
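For readers following along, the status Robert describes can also be observed from client code. The following is a minimal sketch only, assuming the 1.x MongoServer.State property reports values such as Connected, ConnectedToSubset, and Disconnected as described above; the hosts are placeholders.

```csharp
// Sketch: poll the driver's view of the replica set during a failover.
// MongoServer.State is assumed to expose values like Connected and
// ConnectedToSubset, per the comment above; hosts are placeholders.
using System;
using System.Threading;
using MongoDB.Driver;

class StateWatcher
{
    static void Main()
    {
        var server = MongoServer.Create("mongodb://mongo0,mongo1,mongo2/?connect=replicaset");
        server.Connect();
        while (true)
        {
            Console.WriteLine("{0:T}  state = {1}", DateTime.Now, server.State);
            Thread.Sleep(1000);
        }
    }
}
```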
| Comment by Stuart Johnson [ 17/Oct/11 ] |
|
Thanks Robert, I will try this as soon as I can. Doing a stepDown() may not faithfully reproduce the problem, as the Primary failed and was then shut down, preventing socket connections to that server (one of the servers in the connection URL). |
| Comment by Robert Stam [ 17/Oct/11 ] |
|
I am unable to reproduce this. I'm attaching my test program in case you have time to test it in your environment. The test program creates 10 threads that call the Count command once a second. If successful it writes the value of the count and the address of the current primary to the console. If it fails it writes the exception message to the console. The output is also attached as the ConsoleWindowOutput.txt file.

I did notice that sometimes elections take a long time, but that's not a driver issue. During the time that the driver was reporting that it couldn't connect to the primary, repeated use of the rs.status() command in a mongo shell confirmed that there was no primary at that moment.

I tried the test program against both the 1.2 driver and the master branch from git with identical results. My replica set configuration was two normal members and one arbiter. I would cause the primary to switch by running rs.stepDown() against the current primary. |
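The attached program itself is not reproduced in this export. The following is a minimal sketch of an equivalent harness based only on the description above, assuming the 1.x API (MongoServer.Create, MongoCollection.Count, and a Primary property on MongoServer); the hosts, database, and collection names are placeholders.

```csharp
// Sketch of a comparable repro harness: 10 threads call Count once per second
// and report either the count plus the current primary's address, or the
// exception message. Hosts, database, and collection names are placeholders.
using System;
using System.Threading;
using MongoDB.Driver;

class FailoverCountTest
{
    static void Main()
    {
        var server = MongoServer.Create("mongodb://mongo0,mongo1,mongo2/?connect=replicaset");
        var collection = server.GetDatabase("test").GetCollection("items");

        for (int i = 0; i < 10; i++)
        {
            new Thread(() =>
            {
                while (true)
                {
                    try
                    {
                        var count = collection.Count();
                        var primary = server.Primary;   // assumed property; null while no primary is elected
                        Console.WriteLine("count={0} primary={1}",
                            count, primary == null ? "(none)" : primary.Address.ToString());
                    }
                    catch (Exception ex)
                    {
                        Console.WriteLine(ex.Message);
                    }
                    Thread.Sleep(1000);
                }
            }) { IsBackground = true }.Start();
        }
        Console.ReadLine();   // press Enter to stop the test
    }
}
```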
| Comment by Stuart Johnson [ 16/Oct/11 ] |
|
There are two different C# applications connecting to the server. Both use the same C# driver, but have different uses. One writes 1 to 10 inserts a second, with the occasional query (this is a server app using Mono), and the other is a desktop Windows app with random reads and writes when a user requests them. There are 17 connections to MongoDB overall on average.

The Primary became very slow: it was still accepting socket connections, but operations were timing out. After a few minutes a new Primary was elected. I manually took the original Primary server out of service for maintenance, and restarted all C# clients so that they would find the new Primary. Once I fixed the Mongo server problem, I fired it back up and had it re-elected as the Primary. I then had to restart all C# apps again. Hope that helps. |
| Comment by Robert Stam [ 16/Oct/11 ] |
|
I will try to reproduce. Let me know if there is any additional information that could help me reproduce this (number of client threads, activity level, operation mix, etc...) Also, did it take just one election for this to happen? |
| Comment by Stuart Johnson [ 16/Oct/11 ] |
|
No, I never call Disconnect or Reconnect. On previous incarnations of the driver, I would get errors, but it would eventually resolve itself. This time I had to restart the client apps. |
| Comment by Robert Stam [ 16/Oct/11 ] |
|
The exception you are getting looks a lot like https://jira.mongodb.org/browse/CSHARP-333. Are you calling Disconnect or Reconnect in your exception handler? If so, just removing the calls to Disconnect or Reconnect should resolve the problem. |
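To make the suggested pattern concrete, the sketch below retries a failed operation without calling Disconnect or Reconnect, leaving connection recovery to the driver. The helper name, retry count, and backoff are illustrative, not taken from this ticket or CSHARP-333.

```csharp
// Sketch: retry after a connection error WITHOUT calling server.Disconnect()
// or server.Reconnect(); the driver is left to re-establish its own
// connections. Retry count and backoff are illustrative.
using System;
using System.Threading;
using MongoDB.Driver;

static class RetryHelper
{
    public static T WithRetry<T>(Func<T> operation, int attempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (MongoConnectionException)
            {
                if (attempt >= attempts) throw;
                Thread.Sleep(TimeSpan.FromSeconds(attempt));   // simple backoff
            }
        }
    }
}

// Usage, with 'collection' obtained elsewhere:
// var count = RetryHelper.WithRetry(() => collection.Count());
```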