[CSHARP-183] Driver does not handle nodes which are down gracefully Created: 24/Mar/11 Updated: 02/Apr/15 Resolved: 12/Sep/11 |
|
| Status: | Closed |
| Project: | C# Driver |
| Component/s: | None |
| Affects Version/s: | 0.11 |
| Fix Version/s: | 1.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | J W Lee | Assignee: | Robert Stam |
| Resolution: | Done | Votes: | 4 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
The v1.0.0.4098 version of the driver in git does not handle nodes which are down: it throws a Mongo exception instead of trying other nodes which are up. Steps to reproduce: I'm not sure how this should be handled, though. Are there guidelines for drivers on handling downed nodes? It would seem bizarre if Mongo itself is resistant to node failures but the drivers aren't. |
| Comments |
| Comment by Robert Stam [ 12/Sep/11 ] |
|
This has been worked on and tested with v1.2. Note that when a replica set member goes offline (or there is a change of primary) the application might still receive a small number of exceptions, but everything should return to normal after that. |
| Comment by Aristarkh Zagorodnikov [ 31/Mar/11 ] |
|
Checked it again, |
| Comment by Aristarkh Zagorodnikov [ 31/Mar/11 ] |
|
Last time I checked, a server going down while reading from a cursor led to an IOException; I will recheck. |
| Comment by J W Lee [ 31/Mar/11 ] |
|
@Aristarkh Zagorodnikov There is a MongoConnectionException in the latest driver which I am using to figure out if I should wait and retry the operation or not, so |
| Comment by Aristarkh Zagorodnikov [ 30/Mar/11 ] |
|
While I agree with the idea that retry policy should be controlled by the client application, I think that some help from the driver would be nice. I think that the driver might lend a hand for #1 and #2, probably in a separate namespace or even a separate assembly, providing request wrappers as extension methods or utility functions accepting delegates (anonymous methods along with extension methods aren't in .NET 2.0 IIRC, and last time I checked the driver targets 2.0, so this might be a problem) that have reasonable defaults for interval and timeout values (i.e. a 15 ms default interval and a timeout matching the global one) and that would allow writing some code like. When I used BerkeleyDB, I created a set of wrappers to retry a set of operations in the optional context of a transaction to assist in deadlock resolution (BerkeleyDB has explicit deadlock handling, so retrying operations is done entirely on the client); it helped a lot. I guess that having something similar would make writing robust .NET applications much easier, especially since MongoDB clusters are so quick to recover from a failure as a whole. We did some testing in a Web environment: with a proper retry mechanism, users might only notice a slight increase in response time for a few seconds, instead of facing the "oompf, error 500" page for the same time. |
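A minimal sketch of the kind of delegate-accepting retry utility described in the comment above. It is not part of the driver: `RetryHelper`, `Retry`, and the `isTransient` predicate are hypothetical names chosen for illustration, and `IOException` stands in for whatever exception the driver actually raises (the thread mentions `IOException` and `MongoConnectionException`) so that the example has no driver dependency. It also uses lambdas, which need a newer compiler than the .NET 2.0 target discussed in the comment.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading;

// Hypothetical helper for illustration; not an API of the MongoDB C# driver.
public static class RetryHelper
{
    // Runs 'operation' until it succeeds, sleeping 'interval' between attempts.
    // Rethrows immediately if 'isTransient' rejects the exception, or once
    // 'timeout' has elapsed since the first attempt.
    public static T Retry<T>(Func<T> operation, Predicate<Exception> isTransient,
                             TimeSpan interval, TimeSpan timeout)
    {
        Stopwatch clock = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                return operation();
            }
            catch (Exception ex)
            {
                if (!isTransient(ex) || clock.Elapsed >= timeout)
                {
                    throw; // preserve the original exception and stack trace
                }
            }
            Thread.Sleep(interval);
        }
    }
}

public static class Demo
{
    public static void Main()
    {
        int attempts = 0;
        // Simulated operation: fails twice (as if no primary were available),
        // then succeeds on the third attempt.
        string result = RetryHelper.Retry<string>(
            () =>
            {
                attempts++;
                if (attempts < 3) throw new IOException("simulated: node down");
                return "ok";
            },
            ex => ex is IOException,          // assumption: which errors are retryable
            TimeSpan.FromMilliseconds(15),    // interval suggested in the comment
            TimeSpan.FromSeconds(5));         // stand-in for the global timeout
        Console.WriteLine(result + " after " + attempts + " attempts");
    }
}
```

Keeping the retry decision in a caller-supplied predicate leaves the policy with the client application, as the comment suggests, while the helper only supplies the loop, the interval, and the deadline.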
| Comment by Aristarkh Zagorodnikov [ 30/Mar/11 ] |
|
|
| Comment by J W Lee [ 29/Mar/11 ] |
|
That makes sense, I'll modify the code to keep retrying then. On a related note, it seems that the SecondaryConnectionPools property will throw an exception if empty. I'm using this property to check whether there are nodes that are down (by comparing the SecondaryConnectionPools servers with the replica set servers) and then doing a server.reconnect() to recover all the nodes in the connection pool. Or is there a better way? In MongoServer.cs: public IList<MongoConnectionPool> SecondaryConnectionPools { } |
| Comment by Robert Stam [ 24/Mar/11 ] |
|
I will try to reproduce this, but one comment I can go ahead and make is that the driver does NOT guarantee that you won't get any exceptions. What it DOES guarantee is that if you keep retrying the operation it will eventually succeed. When the primary goes down, there is a period of time while the election is taking place during which there is no primary. During that period no operations will succeed. The first retry after a new primary has been elected will succeed. So if you are saying that even after repeated retries the operation continues to fail indefinitely, then it would be a bug. But if what you are saying is that you saw an exception, that would not be a bug. That's what you should be seeing. Are you catching the exception and retrying? |