[CSHARP-654] Client does not recover automatically from certain failures Created: 25/Dec/12 Updated: 20/Mar/14 Resolved: 15/Mar/13 |
|
| Status: | Closed |
| Project: | C# Driver |
| Component/s: | None |
| Affects Version/s: | 1.6.1 |
| Fix Version/s: | 1.8 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kalin Gyokov | Assignee: | Craig Wilson |
| Resolution: | Done | Votes: | 1 |
| Labels: | driver | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | The client is a Windows Service using .NET 4.0, running on Windows Server 2003 R2 |
| Attachments: |
|
| Issue Links: |
|
| Description |
|
We have a MongoDB replica set with 1 master and 2 slaves. All slaves can become masters. At one point the client, a Windows Service using .NET 4.0, stopped working. When I checked the logs, there was a large number of these exceptions:

MongoDB.Driver.MongoConnectionException: Unable to connect to a member of the replica set matching the read preference Primary\r\n at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ThrowConnectionException(ReadPreference readPreference) in c:\\projects\\mongo-csharp-driver\\Driver\\Internal

We were not able to see anything wrong with the replica set itself, so we simply restarted the client. The exceptions immediately went away and the client continued to work perfectly fine after that. I cannot determine what caused the client to begin throwing those exceptions; it might have been a temporary network failure. The real problem is that the driver was not able to recover from that failure by itself. I would expect the driver to be more robust and able to resume normal operation without the need for a restart. Is there a code workaround for this issue? Would it help if we set the read preference mode to SecondaryPreferred or to Primary? |
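For context, a read preference can be set through the connection string; a minimal sketch, assuming the 1.7-era C# driver API (host and database names here are placeholders, not from this ticket):

```csharp
// Sketch only: selecting a read preference via the connection string.
// The readPreference option name follows the standard MongoDB URI format.
var url = "mongodb://host1:27017,host2:27017,host3:27017/" +
          "?replicaSet=rs0&readPreference=secondaryPreferred";
var client = new MongoClient(url);         // MongoClient exists from driver 1.7
var server = client.GetServer();
var db = server.GetDatabase("MyDatabase"); // reads may now go to secondaries;
                                           // writes still always go to the primary
```

Note, per Craig's reply at the bottom of this thread, that a read preference only affects reads: writes (including FindAndModify) always target the primary.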
| Comments |
| Comment by auto [ 15/Mar/13 ] |
|
Author: Craig Wilson (craiggwilson@gmail.com), 2013-03-15T17:16:47Z. Message: |
| Comment by Stefano Ricciardi [ 15/Jan/13 ] |
|
Craig, removing the disconnect from the code resolved the issue for us in both "primary only" and "replica set" mode. Thank you for helping root-cause our problem, and keep up the good work! |
| Comment by Kalin Gyokov [ 14/Jan/13 ] |
|
So far we have only seen this once. We will keep monitoring the service in case it happens again. Maybe we'll be able to reproduce it in our dev environment. |
| Comment by Craig Wilson [ 14/Jan/13 ] |
|
I agree that these are different issues and as such I will leave this open. I have been unable to reproduce your issue, Kalin. I've performed the steps you have just described a number of times and the primary always shows back up. Was this an isolated occurrence, or is this happening frequently? |
| Comment by Kalin Gyokov [ 14/Jan/13 ] |
|
It appears that there may be two separate issues. Stefano says that "under a somewhat increased load [the driver] cannot connect to the server for some of the concurrent requests". In our case the connection string is: var ourConnStr = "mongodb://MONGO01,MONGO02,MONGO03/OurDatabase"; Here is a summary of what might have happened, based on the logs. You are probably already aware of this, but I just wanted to summarize it: |
| Comment by Craig Wilson [ 14/Jan/13 ] |
|
Yes. A replica set connection creates N connection pools, where N is the number of members in your replica set. Hence, setting these pools up and tearing them down is much more expensive. |
| Comment by Stefano Ricciardi [ 14/Jan/13 ] |
|
Ok, I'll schedule this change ASAP. Do you think this might explain the difference in performance between the "replica-set mode" and the "primary only mode"? |
| Comment by Craig Wilson [ 14/Jan/13 ] |
|
Ok, the config looks fine. I was concerned fsync was turned on or replicasToWrite was high. I'd suggest changing the code to get rid of the disconnect calls. Effectively, all this means is to get rid of the finally blocks. After that is done, please report back and let us know whether that corrected the problem. |
| Comment by Stefano Ricciardi [ 14/Jan/13 ] |
|
We pretty much used that code as is. We might have to rewrite part of this code then. The relevant sections from the web.config should be as follows: [...] |
| Comment by Craig Wilson [ 14/Jan/13 ] |
|
Have you modified this session store code at all? With the assumption that you have not... The problem I'm seeing is the call to conn.Disconnect(). You can search the code here (https://github.com/AdaTheDev/MongoDB-ASP.NET-Session-State-Store/blob/master/MongoSessionStateStore/MongoSessionStateStore.cs#L286) for Disconnect and see the 6 places it is used. You can see my blog post on what Disconnect is and why it is generally a bad idea to invoke it here: http://craiggwilson.wordpress.com/2012/09/23/disconnecting-with-the-mongodb-driver/ Basically, every time a new request comes in, a new connection is created and begins to be used. However, concurrently with that, another request could have killed off all connections. Hence, you are not reaping the benefits of a connection pool that can scale to handle the load. Let me know if you have changed this code at all. Also, what does your web.config look like? |
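The pattern Craig describes can be sketched as follows (simplified from the linked session-store code; variable, database, and collection names are illustrative, not taken from the actual file):

```csharp
// BEFORE (problematic): the session store calls Disconnect in a finally block.
MongoServer conn = MongoServer.Create(connectionString);
try
{
    var sessions = conn.GetDatabase("SessionState").GetCollection("Sessions");
    // ... read/write the session document ...
}
finally
{
    conn.Disconnect(); // tears down the WHOLE connection pool,
                       // racing with every other in-flight request
}

// AFTER (the suggested fix): drop the finally block entirely and let the
// driver's connection pool manage connection lifetime on its own.
MongoServer server = MongoServer.Create(connectionString);
var docs = server.GetDatabase("SessionState").GetCollection("Sessions");
// ... read/write the session document; no Disconnect call ...
```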
| Comment by Stefano Ricciardi [ 14/Jan/13 ] |
|
Our session provider is based on the following code: I am attaching a sample page that clients are loading (Ping.aspx). Clients run in the cloud (Amazon). They are based on an internal load testing tool (built on node.js) which unfortunately I cannot share. Nothing fancy anyway; they just issue HTTP requests to the aspx page and sleep for a random interval between 1 and 3 seconds. Each thread issues 100K requests before stopping. For my tests I ran 175 concurrent threads. Hope this helps. |
| Comment by Craig Wilson [ 14/Jan/13 ] |
|
It may be the same issue. Can I ask you for your sample program, both the web page as well as the client? I just spun up 300 threads and, once again, was unable to make it fail... |
| Comment by Stefano Ricciardi [ 14/Jan/13 ] |
|
Some data from my tests:
With connection string pointing to the replica set:
With connection string pointing to the primary only:
There seems to be something about connecting to the replica set that makes errors soar. |
| Comment by Stefano Ricciardi [ 14/Jan/13 ] |
|
Craig, my problem is not that the driver does not pick up the primary again, but that under a somewhat increased load it cannot connect to the server for some of the concurrent requests. I am trying now with a connection string pointing directly to the primary server to see whether this makes any difference. As I mentioned in my first comment, this might not be the same issue as the original poster. |
| Comment by Craig Wilson [ 11/Jan/13 ] |
|
So, I've spent most of the day trying to reproduce this. While I have found some interesting behavior that I'm going to fix, the primary always comes back. The unusual behavior is the amount of time it takes for this to happen, but as I said, the driver always picks up the primary again, the errors stop, and normal function resumes. Also, using a SecondaryPreferred ReadPreference shows me that a read query almost never fails, so this is only a problem for writes. I'm going to keep trying this weekend, but if either of you are able to put together a sample program that shows the primary never coming back, it would be immensely helpful. |
| Comment by Stefano Ricciardi [ 11/Jan/13 ] |
|
Craig, the driver version is 1.7.0.4714. Connection string: mongodb://454575-app1:27017,454576-app2:27017,406734-fdproc3:27017/?replicaSet=ump_replSet The scenario: ~150 client threads, each sleeping randomly for 1 to 3 seconds and then issuing a request to a web server which uses Mongo to store session data. [MongoConnectionException: Unable to connect to a member of the replica set matching the read preference Primary] See attachment "primary.txt" above. |
| Comment by Craig Wilson [ 11/Jan/13 ] |
|
Stefano, do you have your logs? Could you also answer the questions in my previous comment? And could you post your connection string (with anything sensitive changed)? |
| Comment by Stefano Ricciardi [ 11/Jan/13 ] |
|
Upvoted, since I am experiencing similar issues under heavy load. Not 100% sure it's the same issue, but I want to keep this on my radar. |
| Comment by Craig Wilson [ 25/Dec/12 ] |
|
The driver should have responded correctly. We test scenarios like this and our tests are passing. There might, of course, be some scenarios we failed to test, and you happened to hit one of them. Changing your read preference should not affect this at all: all writes go to the primary, regardless of your read preference. FindAndModify qualifies as a write even though there is a query component. What driver version are you using? Thanks for reporting... |
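Craig's point about FindAndModify can be illustrated with a short sketch (1.x builders API; the collection variable, field names, and jobId are placeholders):

```csharp
// FindAndModify has a query component but is executed as a write,
// so it always runs on the primary regardless of read preference.
var result = collection.FindAndModify(
    Query.EQ("_id", jobId),           // the query part does NOT make it a read
    SortBy.Null,
    Update.Set("state", "claimed"));  // the update part makes it a write
```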