[CSHARP-340] Unable to connect when Primary switches Created: 15/Oct/11  Updated: 07/Apr/23  Resolved: 15/Mar/12

Status: Closed
Project: C# Driver
Component/s: None
Affects Version/s: 1.2
Fix Version/s: 1.4

Type: Bug Priority: Critical - P2
Reporter: Stuart Johnson Assignee: Robert Stam
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Client app: Linux, Mono 2.10.5
MongoDB 2.0.0, three-server replica set.

Connection string: mongodb://mongo2.xxxxxx.com,mongo1.xxxxx.com,mongo0.xxxxx.com


Attachments: Text File ConsoleWindowOutput-2.txt     Text File ConsoleWindowOutput.txt     Zip Archive TestReplicaSetFailover.zip    
Issue Links:
Related
related to CSHARP-406 Deadlock and TimeoutException when ac... Closed

 Description   

The Primary server in my replica set became slow due to a runaway process on the server. Eventually it failed altogether, and a new Primary was elected.
My client applications failed to switch to the new Primary.

First of all I got this:

MongoDB.Driver.MongoConnectionException: Unable to connect to the primary member of the replica set: Timeout waiting for a MongoConnection.. ---> System.TimeoutException: Timeout waiting for a MongoConnection.
at MongoDB.Driver.Internal.MongoConnectionPool.AcquireConnection (MongoDB.Driver.MongoDatabase database) [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoServerInstance.Connect (Boolean slaveOk) [0x00000] in <filename unknown>:0
— End of inner exception stack trace —
at MongoDB.Driver.Internal.ReplicaSetConnector.Connect (TimeSpan timeout, ConnectWaitFor waitFor) [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoServer.Connect (TimeSpan timeout, ConnectWaitFor waitFor) [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoServer.Connect (ConnectWaitFor waitFor) [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoServer.ChooseServerInstance (Boolean slaveOk) [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoServer.AcquireConnection (MongoDB.Driver.MongoDatabase database, Boolean slaveOk) [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoCollection.InsertBatch (System.Type nominalType, IEnumerable documents, MongoDB.Driver.MongoInsertOptions options) [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoCollection.Insert (System.Type nominalType, System.Object document, MongoDB.Driver.MongoInsertOptions options) [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoCollection.Insert (System.Type nominalType, System.Object document, MongoDB.Driver.SafeMode safeMode) [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoCollection.Insert (System.Type nominalType, System.Object document) [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoCollection.Insert[BsonDocument] (MongoDB.Bson.BsonDocument document) [0x00000] in <filename unknown>:0
at MsgExchange.Messages.svr_Accounts.Account.UpdateGPS (MsgExchange.Messages.GPS.TrackerMessage trackerMessage) [0x00000] in <filename unknown>:0

And then this, even after the server problem was fixed, and the Primary was switched back to the original server:

System.TimeoutException: Timeout waiting for a MongoConnection.
at MongoDB.Driver.Internal.MongoConnectionPool.AcquireConnection (MongoDB.Driver.MongoDatabase database) [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoServerInstance.AcquireConnection (MongoDB.Driver.MongoDatabase database) [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoServer.AcquireConnection (MongoDB.Driver.MongoDatabase database, Boolean slaveOk) [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoCursorEnumerator`1[MongoDB.Bson.BsonDocument].AcquireConnection () [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoCursorEnumerator`1[MongoDB.Bson.BsonDocument].GetFirst () [0x00000] in <filename unknown>:0
at MongoDB.Driver.MongoCursorEnumerator`1[MongoDB.Bson.BsonDocument].MoveNext () [0x00000] in <filename unknown>:0
at MsgServer.Collections.Accounts.clsStoreForwardUtils.SendStoreForward (MsgExchange.Messages.svr_Accounts.Account account) [0x00000] in <filename unknown>:0

I had to restart my client applications.



 Comments   
Comment by Robert Stam [ 15/Mar/12 ]

Marking as fixed since this is really just a special case of CSHARP-406, which is now fixed.

Comment by Robert Stam [ 15/Mar/12 ]

Even though this ticket was originally presented in the context of the primary switching, based on the exception received this is actually just a special case of CSHARP-406.

Comment by Robert Stam [ 20/Feb/12 ]

This issue ("Timeout waiting for a MongoConnection") was occasionally seen in 1.2 and should no longer occur in 1.3 or 1.3.1.

We will be closing this issue soon unless we hear reports of this still occurring in 1.3.1.

Note that it can sometimes be perfectly normal to get this exception. This could happen if your application has many more threads than the size of the connection pool, in which case it is possible for threads to get this exception under heavy load simply because the connection pool is not big enough.
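For reference, both knobs involved here — the pool size and how long a thread waits for a free connection — can be set through the connection string. A hedged sketch (the hostnames are placeholders, and the exact option names should be checked against the MongoUrlBuilder documentation for your driver version):

```
mongodb://mongo0.example.com,mongo1.example.com,mongo2.example.com/?maxPoolSize=200;waitQueueTimeoutMS=5000
```

A larger maxPoolSize (or a longer wait-queue timeout) gives heavily threaded applications more headroom before this exception fires under load.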

Comment by Robert Stam [ 10/Feb/12 ]

What version of the server were you running? Could you have encountered this issue:

https://jira.mongodb.org/browse/SERVER-4405

The driver expects the primary server to close all open sockets when it steps down from being the primary.

Comment by chris rozacki [ 09/Feb/12 ]

I just observed a similar issue. My replicas swap frequently. I was doing reads and writes at the same time, and today I was unable to continue reading due to:

QueryFailure flag was not master and slaveok=false (response was { "$err" : "not master and slaveok=false", "code" : 13435 })

However, I can see from my logs that updates were fine.

Driver 1.3.0.4309

Comment by Robert Stam [ 16/Dec/11 ]

Good point. I will repeat the test killing the mongod process instead of
using stepDown.

Comment by Robert Stam [ 16/Nov/11 ]

I closed it because I was unable to reproduce it. Can you try to reproduce with the latest source code for the driver from github? This also may have been a duplicate of CSHARP-345, so that might be why I can't reproduce it.

Comment by kuku [ 16/Nov/11 ]

Why was this closed? I also have this problem. My driver is 1.3.

Comment by Robert Stam [ 07/Nov/11 ]

I have not been able to reproduce this. This might be a duplicate of CSHARP-345...

Most of the comments on this JIRA refer to older versions of the C# driver. As far as I can tell no one is experiencing this with v1.3. If no one reports that they are seeing this with 1.3 (or even better with master) I will close this issue soon.

Comment by kuku [ 07/Nov/11 ]

Looking forward to update as early as possible, thank you

Comment by Robert Stam [ 01/Nov/11 ]

We are still trying to reproduce this. It will be fixed in v1.4. You can watch this issue for updates.

Comment by kuku [ 01/Nov/11 ]

I am using v1.3; will this bug be fixed in the v1.4 update?

Comment by Robert Stam [ 31/Oct/11 ]

This bug was reported against v1.2 of the C# driver. Since then v1.3 has been released. Are you using the latest version?

Comment by kuku [ 31/Oct/11 ]

I have also run into this problem; how can it be solved?

Comment by Stuart Johnson [ 20/Oct/11 ]

I'm using 1.2.0.4274 as released on the 14th September.

Last time I looked (v1.0, I think), the driver was not using asynchronous socket methods. In my experience with C# sockets, they can block indefinitely under some circumstances. Is that a possibility? I learnt my lesson when I did a lot of work with GPRS networks, which were forever giving me blocking headaches. The only way to make it 100% reliable was to switch to .NET's async model.

Comment by Robert Stam [ 20/Oct/11 ]

The packets going back and forth to the one secondary are repeated Connect attempts (sets of three messages: isMaster, buildInfo and ping). There should have been connection attempts to all three members of the replica set. Probably the connection pools to the other two members are jammed.

You confirmed that you never call Disconnect, but looking more closely at the C# driver source code there are a few cases where the driver itself calls Disconnect as part of error recovery, so it could be that Disconnect is being called after all, just not by you. If you have the ability to set a breakpoint or add a logging/tracing statement to Disconnect we could find out if it's ever being called. If it is, then that would explain the "Timeout waiting for a MongoConnection" exception.

Also, just wanted to confirm that you are using only the 1.2 driver, not the latest from github, right?

Comment by Stuart Johnson [ 20/Oct/11 ]

Robert, I have run Wireshark. I tried a read operation on the failed client. A DNS lookup was performed on all three Mongo servers, and then packets start being sent back and forth with one of the Secondaries. I will e-mail you the Wireshark file.

Comment by Robert Stam [ 20/Oct/11 ]

I might be able to simulate a sluggish server by evaling a javascript function that takes a long time to return, and maybe also by setting a really short client side timeout.

Your last paragraph is also really helpful. It hints at a possible failure mode.

Comment by Stuart Johnson [ 20/Oct/11 ]

Hi Robert, I am unable to reproduce this either. I downed the Primary, and everything switched over as it should have. That includes your test app and all the long-running server apps that failed last time (which have not been touched since the restart after the previous incident).

The only difference I can see is that the Primary originally died a slow death, which is hard if not impossible to test. It left the replica set in limbo for several minutes. I just tried fooling around with the firewall to intentionally block access to the Primary, but it all came back up again when I put everything back the way it should be.

The only significant difference I can see between my code and yours is that I call Connect() with a timeout when the app starts for the first time:
var server = MongoServer.Create(connectionString);
server.Connect(TimeSpan.FromSeconds(5));
Don't know if that is significant.

I just noticed while writing this that I left my laptop on, and it's still running client software originally affected by last week's failure. If I select an operation that reads from a Mongo collection, it immediately returns "Unable to connect to the primary member of the replica set: Timeout waiting for a MongoConnection.". It's like it's not even trying, yet I know that if I restart this app, it will work. I'll see if I can get Wireshark on there, to see if it's doing anything.

Comment by Stuart Johnson [ 17/Oct/11 ]

ok, thanks for trying. I will try your code out as soon as I can. I need to wait until it will create the least disruption.

Comment by Robert Stam [ 17/Oct/11 ]

I've repeated the test but this time instead of using stepDown I actually killed the mongod process of the current primary. The output is now different, but still looks normal. I'm attaching the new output as the file ConsoleWindowOutput-2.txt.

In particular, note the appearance of the new status: ConnectedToSubset. This means that some of the members are offline (because I killed the mongod process), but there are enough members online for the replica set to be operational. When I bring the killed mongod process back online the status changes back to Connected.

Let me know if you figure out a way I can reproduce the exception you are seeing.

Comment by Stuart Johnson [ 17/Oct/11 ]

Thanks Robert I will try this as soon as I can.

Doing a stepDown() may not faithfully reproduce the problem, as the Primary failed and was then shut down, preventing socket connections to that server (one of the servers in the connection URL).

Comment by Robert Stam [ 17/Oct/11 ]

I am unable to reproduce this. I'm attaching my test program in case you have time to test it in your environment. The test program creates 10 threads that call the Count command once a second. If successful it writes the value of the count and the address of the current primary to the console. If it fails it writes the exception message to the console.

The output is also attached as the ConsoleWindowOutput.txt file. I did notice that sometimes elections take a long time, but that's not a driver issue. During the time that the driver was reporting that it couldn't connect to the primary, repeated use of the rs.status() command in a mongo shell confirmed that there was no primary at that moment.

I tried the test program against both the 1.2 driver and the master branch from git with identical results.

My replica set configuration was two normal members and one arbiter. I would cause the primary to switch by running rs.stepDown() against the current primary.
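The test setup described above might be sketched roughly as follows, assuming the 1.2-era C# driver API. The hostnames, database name, and collection name are placeholders, not taken from the ticket:

```csharp
// Hedged sketch of a failover test harness: ten threads, each calling
// Count once a second and reporting the result or the exception message.
using System;
using System.Threading;
using MongoDB.Driver;

class FailoverTest
{
    static void Main()
    {
        var server = MongoServer.Create(
            "mongodb://mongo0.example.com,mongo1.example.com,mongo2.example.com");
        var collection = server.GetDatabase("test").GetCollection("counts");

        for (int i = 0; i < 10; i++) // ten worker threads, as in the attachment
        {
            new Thread(() =>
            {
                while (true)
                {
                    try
                    {
                        var primary = server.Primary; // may be null mid-election
                        Console.WriteLine("count={0} primary={1}",
                            collection.Count(),
                            primary == null ? "none" : primary.Address.ToString());
                    }
                    catch (Exception ex)
                    {
                        Console.WriteLine(ex.Message);
                    }
                    Thread.Sleep(1000); // one Count call per second per thread
                }
            }).Start();
        }
    }
}
```

Running a harness like this while stepping down (or killing) the primary surfaces how long the driver takes to re-discover the new primary.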

Comment by Stuart Johnson [ 16/Oct/11 ]

There are two different C# applications connecting to the server. Both use the same C# driver, but have different uses. One writes 1 to 10 inserts a second, with the occasional query (this is a server app using Mono), and the other is a desktop Windows app, with random reads and writes when a user requests it. There are 17 connections to MongoDB overall on average.

The Primary became very slow, accepting socket connections, but with operations timing out. After a few minutes a new Primary was elected. I manually took the original Primary server out of service for maintenance, and restarted all C# clients so that they would find the new Primary. Once I fixed the Mongo server problem, I fired it back up and had it re-elected as the Primary. I then had to restart all C# apps again.

Hope that helps.

Comment by Robert Stam [ 16/Oct/11 ]

I will try to reproduce. Let me know if there is any additional information that could help me reproduce this (number of client threads, activity level, operation mix, etc...) Also, did it take just one election for this to happen?

Comment by Stuart Johnson [ 16/Oct/11 ]

No, I never call Disconnect or Reconnect. With previous incarnations of the driver I would get errors, but it would eventually resolve itself. This time, I had to restart the client apps.

Comment by Robert Stam [ 16/Oct/11 ]

The exception you are getting looks a lot like:

https://jira.mongodb.org/browse/CSHARP-333

Are you calling Disconnect or Reconnect in your exception handler? If so, just removing the calls to Disconnect or Reconnect should resolve the problem (CSHARP-333 is fixed in the upcoming 1.3 release of the C# driver).
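Concretely, the guidance above amounts to an exception handler that retries without tearing the connection down. A sketch (the Insert call stands in for whatever operation failed, and the retry policy is illustrative only):

```csharp
// Sketch: recover from a transient connection error without calling
// Disconnect/Reconnect, which interacts badly with the driver (CSHARP-333).
try
{
    collection.Insert(document);
}
catch (MongoConnectionException)
{
    // Do NOT call server.Disconnect() or server.Reconnect() here;
    // let the driver re-discover the primary on its own, then retry once.
    Thread.Sleep(500);
    collection.Insert(document);
}
```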

Generated at Wed Feb 07 21:36:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.