[CSHARP-654] Client does not recover automatically from certain failures Created: 25/Dec/12  Updated: 20/Mar/14  Resolved: 15/Mar/13

Status: Closed
Project: C# Driver
Component/s: None
Affects Version/s: 1.6.1
Fix Version/s: 1.8

Type: Bug Priority: Major - P3
Reporter: Kalin Gyokov Assignee: Craig Wilson
Resolution: Done Votes: 1
Labels: driver
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

The client is a Windows Service using .NET 4.0 and running on Windows Server 2003 R2


Attachments: File Ping.aspx     Text File mongo01.txt     Text File mongo02.txt     Text File mongo03.txt     Text File primary.txt    
Issue Links:
Duplicate
is duplicated by CSHARP-754 Driver could not reconnect to the ser... Closed
Related
is related to CSHARP-678 An existing connection was forcibly c... Closed

 Description   

We have a mongodb replica set with 1 master and 2 slaves. All slaves can be become masters.

At one point the client, a Windows Service using .NET 4.0, stopped working. When I checked the logs, there was a large number of these exceptions:

MongoDB.Driver.MongoConnectionException: Unable to connect to a member of the replica set matching the read preference Primary\r\n at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ThrowConnectionException(ReadPreference readPreference) in c:\\projects\\mongo-csharp-driver\\Driver\\Internal
MultipleInstanceMongoServerProxy.cs:line 509\r\n at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.Connect(TimeSpan timeout, ReadPreference readPreference) in c:\\projects\\mongo-csharp-driver\\Driver\\Internal
MultipleInstanceMongoServerProxy.cs:line 224\r\n at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ChooseServerInstance(ReadPreference readPreference) in c:\\projects\\mongo-csharp-driver\\Driver\\Internal
MultipleInstanceMongoServerProxy.cs:line 168\r\n at MongoDB.Driver.Internal.DiscoveringMongoServerProxy.ChooseServerInstance(ReadPreference readPreference) in c:\\projects\\mongo-csharp-driver\\Driver\\Internal
DiscoveringMongoServerProxy.cs:line 126\r\n at MongoDB.Driver.MongoServer.AcquireConnection(MongoDatabase database, ReadPreference readPreference) in c:\\projects\\mongo-csharp-driver\\Driver\\Core
MongoServer.cs:line 1051\r\n at MongoDB.Driver.MongoCursorEnumerator`1.AcquireConnection() in c:\\projects\\mongo-csharp-driver\\Driver\\Core
MongoCursorEnumerator.cs:line 299\r\n at MongoDB.Driver.MongoCursorEnumerator`1.GetFirst() in c:\\projects\\mongo-csharp-driver\\Driver\\Core
MongoCursorEnumerator.cs:line 312\r\n at MongoDB.Driver.MongoCursorEnumerator`1.MoveNext() in c:\\projects\\mongo-csharp-driver\\Driver\\Core
MongoCursorEnumerator.cs:line 230\r\n at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)\r\n at MongoDB.Driver.MongoCollection.FindOneAs[TDocument](IMongoQuery query) in c:\\projects\\mongo-csharp-driver\\Driver\\Core
MongoCollection.cs:line 570\r\n at MongoDB.Driver.MongoCollection.RunCommandAs(Type commandResultType, IMongoCommand command) in c:\\projects\\mongo-csharp-driver\\Driver\\Core
MongoCollection.cs:line 1658\r\n at MongoDB.Driver.MongoCollection.RunCommandAs(Type commandResultType, IMongoCommand command) in c:\\projects\\mongo-csharp-driver\\Driver\\Core
MongoCollection.cs:line 1680\r\n at MongoDB.Driver.MongoCollection.RunCommandAs[TCommandResult](IMongoCommand command) in c:\\projects\\mongo-csharp-driver\\Driver\\Core
MongoCollection.cs:line 1650\r\n at MongoDB.Driver.MongoCollection.FindAndModify(IMongoQuery query, IMongoSortBy sortBy, IMongoUpdate update, IMongoFields fields, Boolean returnNew, Boolean upsert) in c:\\projects\\mongo-csharp-driver\\Driver\\Core
MongoCollection.cs:line 461\r\n at MongoDB.Driver.MongoCollection.FindAndModify(IMongoQuery query, IMongoSortBy sortBy, IMongoUpdate update, Boolean returnNew, Boolean upsert) in c:\\projects\\mongo-csharp-driver\\Driver\\Core
MongoCollection.cs:line 440\r\n at MongoDB.Driver.MongoCollection.FindAndModify(IMongoQuery query, IMongoSortBy sortBy, IMongoUpdate update, Boolean returnNew) in c:\\projects\\mongo-csharp-driver\\Driver\\Core
MongoCollection.cs:line 421\r\n at MongoDB.Driver.MongoCollection.FindAndModify(IMongoQuery query, IMongoSortBy sortBy, IMongoUpdate update) in c:\\projects\\mongo-csharp-driver\\Driver\\Core
MongoCollection.cs:line 404\r\n
...
(Note that the actual read preference of the collection is ReadPreferenceMode.Secondary, but that has not caused any problems with FindAndModify until now)

We were not able to see anything wrong with the replica set itself, so we simply restarted the client. The exceptions immediately went away and the client continued to work perfectly fine after that.

I cannot determine what caused the client to begin throwing those exceptions. It might have been some temporary network failure. The real problem is that the driver was not able to recover from that failure by itself. I would expect the driver to be more robust and to be able to resume normal operation without the need for a restart.

Is there a code workaround for this issue? Would it help if we set the Read preference mode to SecodaryPreferred or to Primary?



 Comments   
Comment by auto [ 15/Mar/13 ]

Author:

{u'date': u'2013-03-15T17:16:47Z', u'name': u'Craig Wilson', u'email': u'craiggwilson@gmail.com'}

Message: CSHARP-654: fixed bug for when network reliability causes us to miss a reconnection by one of the members.
Branch: master
https://github.com/mongodb/mongo-csharp-driver/commit/2f316c7eec9382bdf823e9197a5c284777a6cbc1

Comment by Stefano Ricciardi [ 15/Jan/13 ]

Craig,

removing the disconnect from the code resolved the issue for us both for "primary only" and "replica set" mode.

Thank you for helping root causing our problem and keep up with the good work!

Comment by Kalin Gyokov [ 14/Jan/13 ]

So far we have only seen this once. We will keep monitoring the service in case it happens again. Maybe we'll be able to reproduce it in our dev environment.

Comment by Craig Wilson [ 14/Jan/13 ]

I agree that these are different issues and as such I will leave this opened.

I have been unable to produce your issue Kalin. I've performed the steps you have just described a number of times and the primary always shows back up. Was this an isolated occurence or is this happening frequently?

Comment by Kalin Gyokov [ 14/Jan/13 ]

It appears that there may be two separate issues. Stefano says that "under a somewhat increased load [the driver] cannot connect to the server for some of the concurrent requests".
In my case, the driver completely fails to come back up until the client is restarted. It also appears that Stefano is using Connect and Disconnect. In our code, we simply instantiate a MongoCollection once when the client Windows Service is started. We then call FindAndMofiy at a rate of 3-10 times per second. Here is pretty much how we instantiate our collection:

var ourConnStr = "mongodb://MONGO01,MONGO02,MONGO03/OurDatabase";
var ourDb = MongoDatabase.Create( ourConnStr );
var ourColl = ourDb.GetCollection( "OurCollection" );
Then we just call ourColl.FindAndModify(...) on that instance a few times per second. There is nothing more to it.

Here is a summary of what might have happened, based on the logs. You are probably already aware of this, but I just wanted to summarize it:
1. The connection between all replicas failed.
2. For a few minutes, there was no primary since each replica was isolated from the rest.
3. When the network came back up, the driver was able to immediately connect to the set. However, it could not find a primary because one had not yet been elected and therefore the driver failed.
4. The replica set re-elected the original primary (MONGO01). However, the driver was unable to detect that and continued to fail as if there was no primary yet.

Comment by Craig Wilson [ 14/Jan/13 ]

Yes. A replica set connection creates N connection pools where N is the number of members in your replica set. Hence, setting up and tearing these down is much more expensive.

Comment by Stefano Ricciardi [ 14/Jan/13 ]

Ok, I'll schedule this change ASAP.

Do you think this might explain the difference in performance between the "replica-set mode" and the "primary only mode"?

Comment by Craig Wilson [ 14/Jan/13 ]

Ok, the config looks fine. I was concerned fsync was turned on or replicasToWrite was high.

I'd suggest changing the code to get rid of the disconnect calls. Effectively, all this means is to get rid of the finally blocks. After that is done, please report back and let us know whether that corrected the problem.

Comment by Stefano Ricciardi [ 14/Jan/13 ]

We pretty much used that code as is. We might have to rewrite part of this code then.

The relevant sections from the web.config should be as follow:
<connectionStrings>
<add name="MongoSessionServices" connectionString="mongodb://454575-app1:27017,454576-app2:27017,406734-fdproc3:27017/?replicaSet=ump_replSet" />
</connectionStrings>

[...]
<sessionState mode="Custom" customProvider="MongoSessionStateProvider">
<providers>
<add name="MongoSessionStateProvider" type="MongoSessionStateStore.MongoSessionStateStore" connectionStringName="MongoSessionServices" writeExceptionsToEventLog="false" fsync="false" replicasToWrite="0" />
</providers>
</sessionState>

Comment by Craig Wilson [ 14/Jan/13 ]

Have you modified this session store code at all? With the assumption that you have not...

The problem I'm seeing is the call to conn.Disconnect(). You can search in the code here (https://github.com/AdaTheDev/MongoDB-ASP.NET-Session-State-Store/blob/master/MongoSessionStateStore/MongoSessionStateStore.cs#L286) for Disconnect and see 6 places it is used.

You can see my blog post on what Disconnect is and why it is generally a bad idea to invoke here: http://craiggwilson.wordpress.com/2012/09/23/disconnecting-with-the-mongodb-driver/

Basically, everytime a new request comes in, a new connection is created and begins to be used. However, concurrently with that, another request could have killed off all connections. Hence, you are reaping any benefits of a connection pool that can scale to handle the load.

Let me know if you have changed this code at all? Also, what does your web.config look like?

Comment by Stefano Ricciardi [ 14/Jan/13 ]

Our session provider is based on the following code:
https://github.com/AdaTheDev/MongoDB-ASP.NET-Session-State-Store

I am attaching you a sample page that clients are loading (Ping.aspx).

Clients run in the cloud (Amazon). They are based on an internal load testing tool (based on node.js) which unfortunately I cannot share

Nothing fancy anyway, they just issue HTTP requests to the aspx page and sleep for a random interval between 1 and 3 seconds. Each thread issues 100K requests before stopping. For my tests I had run 175 concurrent threads.

Hope this helps.

Comment by Craig Wilson [ 14/Jan/13 ]

It may be the same issue. Can I ask you for your sample program, both the web page as well as the client?. I just spun up 300 threads and, once again, was unable to make it fail...

Comment by Stefano Ricciardi [ 14/Jan/13 ]

Some data from my tests:

  • 175 threads, each calling a sample ASP page 100K times using a session handled by Mongo

With connection string pointing to replica set:
7356 errors due to "Unable to connect"

With connection string pointing to primary only:
226 errors due to "Unable to connect"

There seems to be something about connecting to the replica set that make errors soar.

Comment by Stefano Ricciardi [ 14/Jan/13 ]

Craig,

my problem is not that the driver does not pick up the primary again, but that under a somewhat increased load it cannot connect to the server for some of the concurrent requests.

I am trying now with a connection string pointing directly to the primary server to see whether this makes any difference.

As I mentioned in my first comment, this might not be the same issue as the original poster.

Comment by Craig Wilson [ 11/Jan/13 ]

So, I've spent most of day trying to reproduce this. While I have found some interesting behavior I'm going to fix, the primary always comes back. The unusual behavior is the amount of time it takes to for this to happen, but as I said, the driver always picks up the primary again, the errors stop, and normal function resumes. Also, using a SecondaryPreferred ReadPreference shows me that a read query almost never fails, so this is only a problem for writes.

I'm going to keep trying this weekend, but if either of you are able to get a sample program up that shows the Primary never coming back, it be immensely helpful.

Comment by Stefano Ricciardi [ 11/Jan/13 ]

Craig,

driver version is 1.7.0.4714
mongo version is 2.2.0

Connection string: mongodb://454575-app1:27017,454576-app2:27017,406734-fdproc3:27017/?replicaSet=ump_replSet

the scenario ~150 client threads which sleep randomly for 1 to 3 seconds and then issues a request to a web server which uses Mongo to store session data.
After running for a few seconds some of the requests (not all of them) fail with the following stack trace:

[MongoConnectionException: Unable to connect to a member of the replica set matching the read preference Primary]
MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ThrowConnectionException(ReadPreference readPreference) in C:\build\mongo-csharp-driver\Driver\Internal\MultipleInstanceMongoServerProxy.cs:509
MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.Connect(TimeSpan timeout, ReadPreference readPreference) in C:\build\mongo-csharp-driver\Driver\Internal\MultipleInstanceMongoServerProxy.cs:224
MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ChooseServerInstance(ReadPreference readPreference) in C:\build\mongo-csharp-driver\Driver\Internal\MultipleInstanceMongoServerProxy.cs:168
MongoDB.Driver.MongoServer.AcquireConnection(MongoDatabase database, ReadPreference readPreference) in C:\build\mongo-csharp-driver\Driver\Core\MongoServer.cs:1057
MongoDB.Driver.MongoCollection.Update(IMongoQuery query, IMongoUpdate update, MongoUpdateOptions options) in C:\build\mongo-csharp-driver\Driver\Core\MongoCollection.cs:1549
MongoDB.Driver.MongoCollection.Update(IMongoQuery query, IMongoUpdate update, WriteConcern writeConcern) in C:\build\mongo-csharp-driver\Driver\Core\MongoCollection.cs:1575
MongoSessionStateStore.MongoSessionStateStore.GetSessionStoreItem(Boolean lockRecord, HttpContext context, String id, Boolean& locked, TimeSpan& lockAge, Object& lockId, SessionStateActions& actionFlags) +3537
MongoSessionStateStore.MongoSessionStateStore.GetItemExclusive(HttpContext context, String id, Boolean& locked, TimeSpan& lockAge, Object& lockId, SessionStateActions& actionFlags) +258
System.Web.SessionState.SessionStateModule.GetSessionStateItem() +115
System.Web.SessionState.SessionStateModule.BeginAcquireState(Object source, EventArgs e, AsyncCallback cb, Object extraData) +768
System.Web.AsyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() +96
System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously) +184

See attachment "primary.txt" above.

Comment by Craig Wilson [ 11/Jan/13 ]

Stefano, do you have your logs? Also, could you answer the questions in my previous comment? Also, could you post your connection string (changing anything sensitive).

Comment by Stefano Ricciardi [ 11/Jan/13 ]

Upvoted since I am experiencing similar issues under heavy load. Not 100% sure it's the same issue, but I want to keep this in my radar.

Comment by Craig Wilson [ 25/Dec/12 ]

The driver should have responded correctly. We test scenarios like this and our tests are passing. There, of course, might be some scenarios we failed to test and you happened to hit one of them. Changing your read preference should not affect this at all. All writes go to primary, regardless of your read preference. FindAndModify qualifies as a write even though there is a query component.

1) What driver version are you using?
2) What version of mongodb are you running?
3) Can you attach the server logs for all your replica set members from right before the initial exception was thrown to a little bit after a new primary was reelected?

Thanks for reporting...

Generated at Wed Feb 07 21:37:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.