[CSHARP-183] Driver does not handle nodes which are down gracefully Created: 24/Mar/11  Updated: 02/Apr/15  Resolved: 12/Sep/11

Status: Closed
Project: C# Driver
Component/s: None
Affects Version/s: 0.11
Fix Version/s: 1.2

Type: Bug Priority: Major - P3
Reporter: J W Lee Assignee: Robert Stam
Resolution: Done Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

The v1.0.0.4098 version of the driver in git does not handle nodes which are down: it throws a Mongo exception instead of trying other nodes which are up.
It also does not check the state of each node while populating the seed list, and does not try other nodes when a node goes down after the Mongo connection is created.

Steps to reproduce:
1. Set up a replica set of more than one node.
2. Turn off a single node.
3. Create a Mongo connection with a number of nodes in the connection string.
4. Try to insert or query a document in any collection.
5. A MongoConnectionException is thrown.

I'm not sure how this should be handled though; are there guidelines for drivers on handling downed nodes? It would seem bizarre if Mongo itself is resilient to node failures but the drivers aren't.



 Comments   
Comment by Robert Stam [ 12/Sep/11 ]

This has been worked on and tested with v1.2. Note that when a replica set member goes offline (or there is a change of primary) the application might still receive a small number of exceptions, but everything should return to normal after that.

Comment by Aristarkh Zagorodnikov [ 31/Mar/11 ]

Checked it again and filed CSHARP-187, along with the IOExceptions (reported in CSHARP-153).

Comment by Aristarkh Zagorodnikov [ 31/Mar/11 ]

Last time I checked, a server going down while reading from a cursor led to an IOException; I'll recheck it.

Comment by J W Lee [ 31/Mar/11 ]

@Aristarkh Zagorodnikov

There is a MongoConnectionException in the latest driver which I am using to figure out whether I should wait and retry the operation or not, so CSHARP-153 should be considered fixed, I believe.

Comment by Aristarkh Zagorodnikov [ 30/Mar/11 ]

While I agree with the idea that the retry policy should be controlled by the client application, I think some help from the driver would be nice.
I believe that in terms of retryability, operations can be broken into three broad classes:
1. reads – apart from creating extra load on the database, little can go wrong with an extra read unless the volume is too large
2. "absolute" writes (set X to Y where Z, delete where Z) – again, only the extra load might cause problems
3. "relative" writes (set X = X + Y where Z) – here, retrying the operation might (logically) damage the database contents
While operations of kinds #1 and #2 are easily auto-retried (an interval and a retry limit should be imposed), kind #3 is tricky: it can only either be rewritten as #2 (using some kind of versioning logic built into the query) or be retried at the client level (the client re-reads the current state and retries the global operation that originated the request to the database).
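The rewrite-as-#2 idea can be sketched with a version field: the relative increment becomes an absolute, version-guarded write, so a retry either applies exactly once or is rejected cleanly. This is only a sketch, not driver code: the `Counter` type and field names are invented, and an in-memory dictionary stands in for a collection.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical document type: the Version field makes each update conditional,
// so a retried update either applies once or becomes a no-op.
class Counter
{
    public int Value;
    public int Version;
}

static class VersionedUpdate
{
    // A relative write "Value = Value + delta" rewritten as an absolute,
    // version-guarded write: safe to retry, because a stale Version simply
    // makes the update fail (return false) instead of applying twice.
    public static bool TryIncrement(IDictionary<string, Counter> collection,
                                    string id, int delta, int expectedVersion)
    {
        Counter doc;
        if (!collection.TryGetValue(id, out doc) || doc.Version != expectedVersion)
        {
            return false; // someone updated first; caller re-reads and retries
        }
        doc.Value += delta;
        doc.Version++;
        return true;
    }
}
```

Against a real server the same guard would be a conditional update querying on both the id and the version field, with the caller checking how many documents were affected.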

I think the driver might lend a hand for #1 and #2, probably in a separate namespace or even a separate assembly, providing request wrappers as extension methods or utility functions accepting delegates (anonymous methods and extension methods aren't in .NET 2.0 IIRC, and last time I checked the driver targets 2.0, so this might be a problem), with reasonable defaults for the interval and timeout values (e.g. a 15 ms default interval, with the timeout matching the global one) that would allow writing concise retry code.
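A utility of the kind proposed might look like the following sketch. It is hypothetical, not part of the driver; `Func<T>` requires .NET 3.5, so a driver targeting 2.0 would need an explicit delegate type instead, as noted above.

```csharp
using System;
using System.Threading;

// Hypothetical retry utility for "safe" (#1/#2) operations: wraps an
// operation delegate and retries transient failures with a fixed interval
// and an attempt cap.
static class RetryHelper
{
    // Retries the operation until it succeeds or maxAttempts is reached;
    // the final failure is rethrown to the caller.
    public static T Retry<T>(Func<T> operation, int maxAttempts, int intervalMs)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception)
            {
                if (attempt >= maxAttempts) throw;
                Thread.Sleep(intervalMs); // back off before the next attempt
            }
        }
    }
}
```

An idempotent read could then be wrapped as `RetryHelper.Retry(() => collection.Count(), 5, 15)`, where `collection.Count()` stands for any driver call that is safe to repeat.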

When I used BerkeleyDB, I created a set of wrappers to retry a set of operations, optionally in the context of a transaction, to assist in deadlock resolution (BerkeleyDB has explicit deadlock handling, so retrying is done entirely on the client) – it helped a lot. I guess having something similar would make writing robust .NET applications much easier, especially since MongoDB clusters are so quick to recover from a failure as a whole. We did some testing in a Web environment – with a proper retry mechanism, users might only notice a slight increase in response time for a few seconds, instead of facing an "oompf, error 500" page for the same time.

Comment by Aristarkh Zagorodnikov [ 30/Mar/11 ]

CSHARP-153 is related to this, since retrying on every exception might be a bad policy, and it appears that currently cursor operations can fail with arbitrary exceptions, including IOException, SocketException, etc.

Comment by J W Lee [ 29/Mar/11 ]

That makes sense, I'll modify the code to keep retrying then. On a related note, it seems the SecondaryConnectionPools property will throw an exception if empty. I'm using this property to check whether there are nodes that are down (by comparing the SecondaryConnectionPools servers with the replica set servers) and calling server.Reconnect() to recover all the nodes in the connection pool – or is there a better way?

In MongoServer.cs:

    public IList<MongoConnectionPool> SecondaryConnectionPools {
        get { return secondaryConnectionPools.AsReadOnly(); }
    }

Comment by Robert Stam [ 24/Mar/11 ]

I will try to reproduce this, but one comment I can go ahead and make is that the driver does NOT guarantee that you won't get any exceptions. What it DOES guarantee is that if you keep retrying the operation, it will eventually succeed. When the primary goes down, there is a period of time while the election is taking place during which there is no primary. During that period no operations will succeed. The first retry after a new primary has been elected will succeed.

So if you are saying that even after repeated retries the operation continues to fail indefinitely, then that would be a bug. But if you are saying that you saw an exception, that would not be a bug – that's what you should be seeing.
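That distinction – transient failures during an election versus an operation that fails indefinitely – can be sketched as a retry loop with a deadline. Everything here is an assumption for illustration: `TransientConnectionException` stands in for MongoConnectionException so the sketch runs without a server, and the deadline value would depend on your failover window.

```csharp
using System;
using System.Threading;

// Stand-in for MongoConnectionException, so this sketch is self-contained.
class TransientConnectionException : Exception { }

static class FailoverRetry
{
    // Keeps retrying an operation through a failover window. Returns true as
    // soon as an attempt succeeds; returns false if failures persist past the
    // deadline, which would indicate a real outage (or a driver bug) rather
    // than an election in progress.
    public static bool TryWithDeadline(Action operation, TimeSpan deadline,
                                       int intervalMs)
    {
        DateTime giveUpAt = DateTime.UtcNow + deadline;
        while (true)
        {
            try
            {
                operation();
                return true;
            }
            catch (TransientConnectionException)
            {
                if (DateTime.UtcNow >= giveUpAt)
                {
                    return false; // failing indefinitely – investigate
                }
                Thread.Sleep(intervalMs); // election may still be in progress
            }
        }
    }
}
```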

Are you catching the exception and retrying?

Generated at Wed Feb 07 21:36:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.