[JAVA-677] Improve driver robustness on node restart Created: 25/Oct/12 Updated: 21/Sep/16 Resolved: 21/Sep/16 |
|
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | Connection Management |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Vincent Sevel | Assignee: | Unassigned |
| Resolution: | Won't Fix | Votes: | 2 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
A client using the Java driver encounters exceptions when reading from a server that is restarted gracefully, even though another node is up and available for reads. While running the test, I get exceptions on the client side from time to time. So far I have identified these two:
|
| Comments |
| Comment by Jeffrey Yemin [ 21/Sep/16 ] |
|
Much of the work for this has been completed. In particular:
What is not done: the driver does not retry on socket exceptions, server selection failures, or "not master" errors (which occur when the driver writes to a secondary that it still believes is the primary). Adding retry support to the driver will be complicated, and if we do it, it will be an effort that we undertake in all drivers and should therefore have a ticket in the DRIVERS project. |
| Comment by Jeffrey Yemin [ 23/Aug/13 ] |
|
At this point, we've decided not to have retry logic in the 3.0 driver. If we do add it at a later point, we'll make the logic configurable instead of having 3 retries hardcoded (and applied only to OP_QUERY, but not to query-like commands such as count, etc.). |
| Comment by Chris Lewis [ 13/Mar/13 ] |
|
This scenario also occurs during a failover. Assuming for a moment that secondary reads are not appropriate for every application, what options are left for recovery? The driver hands off the responsibility to the caller through a generic MongoException. Philosophical differences on how errors should be handled aside, this leaves only one choice for primary-bound applications: test the exception message for the string "not talking to master and retries used up". If one were to rely on that, a possible course of action would be to wait (i.e. block) until the situation is resolved. As unfortunate as that would be, one can see why applications would need to do this: to survive a stepdown in the middle of a client request. So, if the responsibility for blocking is going to be passed up the stack, it seems fair to expect a more specific, reliable exception type, so that those needing such behavior need not resort to absurd and unreliable string tests. |
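To make the contrast in the comment above concrete, here is a minimal sketch of the two options: classifying the failure by its message text versus catching a dedicated exception type and blocking until the primary is back. Every name in it (`DriverException`, `NotPrimaryException`, `awaitPrimary`) is a hypothetical stand-in invented for illustration; none of it is driver API, and only the message string is taken from the comment.

```java
import java.util.function.Supplier;

// Hypothetical sketch: these classes are stand-ins, not part of the Java driver.
final class NotPrimaryHandling {

    /** Stand-in for the driver's generic exception type. */
    static class DriverException extends RuntimeException {
        DriverException(String message) {
            super(message);
        }
    }

    /** Hypothetical specific subtype signalling that no primary was reachable. */
    static final class NotPrimaryException extends DriverException {
        NotPrimaryException() {
            super("not talking to master and retries used up");
        }
    }

    /** Fragile approach: classify the failure by inspecting its message text. */
    static boolean isNotPrimaryByMessage(DriverException e) {
        String message = e.getMessage();
        return message != null && message.contains("not talking to master");
    }

    /**
     * Sturdier approach: retry only when the specific type is thrown, sleeping
     * between attempts. Gives up and rethrows after maxAttempts tries.
     */
    static <T> T awaitPrimary(Supplier<T> operation, int maxAttempts, long waitMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.get();
            } catch (NotPrimaryException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException interrupted) {
                    // Preserve the interrupt flag and surface the original failure.
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated stepdown: the first two calls fail, the third succeeds.
        String value = awaitPrimary(() -> {
            if (++calls[0] < 3) {
                throw new NotPrimaryException();
            }
            return "ok";
        }, 5, 10L);
        System.out.println(value + " after " + calls[0] + " calls");
    }
}
```

With a dedicated type, the caller's `catch` clause keeps working if the message wording changes, which is the reliability the comment asks for.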
| Comment by Jeffrey Yemin [ 25/Oct/12 ] |
|
The way to resolve this issue is to ensure that DBTCPConnector.call (as of 2.9.2, DBTCPConnector.innerCall) always uses a different server for each of the three retry attempts. With the primaryPreferred read preference, the driver currently tries the same server (the one that was primary on the first try) for both retries, instead of switching to a secondary. For ReadPreference.primary(), I do not think it's a good idea to block while waiting for a new primary to be elected, as the election could take long enough that threads on the client back up. And in some cases, a new primary may never be elected (in the absence of a quorum). But for any of the other preferences, we should do this. |
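The "different server for each attempt" idea can be sketched roughly as follows. This is not the DBTCPConnector code. The class, the exception, and the use of plain strings for server addresses are all assumptions made for illustration, and the candidate list is assumed to arrive already ordered by preference (primary first for primaryPreferred).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the Java driver.
final class DistinctServerRetry {
    // Mirrors the three attempts mentioned in the comment above.
    static final int MAX_ATTEMPTS = 3;

    /** Stand-in for a socket-level failure against a single server. */
    static final class ServerUnavailableException extends RuntimeException {
        ServerUnavailableException(String server) {
            super("unavailable: " + server);
        }
    }

    /**
     * Runs the operation against at most MAX_ATTEMPTS servers, never reusing a
     * server that has already been tried. Rethrows the last failure if every
     * attempt fails.
     */
    static <T> T call(List<String> candidates, Function<String, T> operation) {
        Set<String> tried = new HashSet<>();
        RuntimeException last = null;
        for (String server : candidates) {
            if (tried.size() >= MAX_ATTEMPTS) {
                break;
            }
            if (!tried.add(server)) {
                continue; // already attempted this server; move to the next one
            }
            try {
                return operation.apply(server);
            } catch (ServerUnavailableException e) {
                last = e;
            }
        }
        throw last != null ? last : new IllegalStateException("no candidate servers");
    }

    public static void main(String[] args) {
        List<String> attempted = new ArrayList<>();
        // Simulated restart: the old primary is down, the secondary answers.
        String result = call(Arrays.asList("primary:27017", "secondary:27017"), server -> {
            attempted.add(server);
            if (server.startsWith("primary")) {
                throw new ServerUnavailableException(server);
            }
            return "read from " + server;
        });
        System.out.println(result + " after trying " + attempted);
    }
}
```

For ReadPreference.primary() the candidate list would hold only the primary, so this loop would fail after one attempt rather than block, which matches the reservation about waiting for an election.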