[JAVA-677] Improve driver robustness on node restart Created: 25/Oct/12  Updated: 21/Sep/16  Resolved: 21/Sep/16

Status: Closed
Project: Java Driver
Component/s: Connection Management
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Vincent Sevel Assignee: Unassigned
Resolution: Won't Fix Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File ReproducerPrimaryPreferredSystemTest.java     Text File ReproducerPrimaryPreferredSystemTest.log    
Issue Links:
Related

 Description   

A client using the Java driver encounters exceptions when reading from a server that is restarted gracefully, even though another node is up and available for reads.
I use a topology of 2 nodes, with one set as a fixed primary for my tests. On the client side I use read mode primaryPreferred. During the test, I constantly query a collection while repeatedly restarting the primary using the Windows service. The expectation is that a read never fails because: 1) the primary is gracefully restarted and 2) the secondary is always up and running.

While running the test, I get exceptions on the client side from time to time. So far I have identified these 2:

java.lang.NullPointerException
    at com.mongodb.DBTCPConnector$MyPort.error(DBTCPConnector.java:470)
    at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:281)
    at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:290)
    at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:275)
    at com.mongodb.DBCollection.findOne(DBCollection.java:727)
    at com.mongodb.DBCollection.findOne(DBCollection.java:669)
    at com.lodh.arte.test.robustmongo.ReproducerPrimaryPreferredSystemTest.readSecondary(ReproducerPrimaryPreferredSystemTest.java:87)

com.mongodb.MongoException: not talking to master and retries used up
    at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:304)
    at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:306)
    at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:306)
    at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:290)
    at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:275)
    at com.mongodb.DBCollection.findOne(DBCollection.java:727)
    at com.mongodb.DBCollection.findOne(DBCollection.java:669)
    at com.lodh.arte.test.robustmongo.ReproducerPrimaryPreferredSystemTest.readSecondary(ReproducerPrimaryPreferredSystemTest.java:87)



 Comments   
Comment by Jeffrey Yemin [ 21/Sep/16 ]

Much of the work for this has been completed. In particular:

  • No more NullPointerException
  • No more "not talking to master" until a configurable server selection timeout has been hit

What's not done: the driver does not retry on socket exceptions, on server selection failures, or on "not master" errors (which occur when the driver writes to a secondary that it still thinks is a primary).

Adding retry support to the driver will be complicated, and if we do it, it will be an effort that we undertake in all drivers and should therefore have a ticket in the DRIVERS project.

Comment by Jeffrey Yemin [ 23/Aug/13 ]

At this point, we've decided not to have retry logic in the 3.0 driver. If we do add it at a later point, we'll make the logic configurable instead of having 3 retries hardcoded (and applied only to OP_QUERY, not to query-like commands such as count).

Comment by Chris Lewis [ 13/Mar/13 ]

This scenario also occurs during a failover. Assuming for a moment that secondary reads are not appropriate for every application, what options are left for recovery? The driver hands off the responsibility to the caller through a generic MongoException. Philosophical differences on how errors should be handled aside, this leaves only one choice for primary-bound applications: test the string of the exception message for "not talking to master and retries used up". If one were to rely on that, then a possible course of action would be to wait (i.e. block) for the situation to be resolved. As unfortunate as that would be, one can see why applications would need to do this: to survive a stepdown in the middle of a client request.

So, if the responsibility of blocking is going to be passed up the stack, then it seems fair to expect a more specific, reliable exception type so that those needing such behavior need not resort to absurd and unreliable string tests.
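
To make the workaround described above concrete, here is a minimal sketch of a caller-side wrapper that tests the exception message and waits before trying again. It is an illustration under assumptions, not driver code: the class PrimaryRetry, the method callWithWait, and its parameters are hypothetical names, and the sketch catches RuntimeException so that it compiles without the driver on the classpath (in an application the caught type would be com.mongodb.MongoException, as shown in the stack trace). Only the message text comes from this ticket.

```java
import java.util.function.Supplier;

public class PrimaryRetry {
    // Message text copied from the stack trace in this ticket.
    static final String NOT_MASTER_MSG = "not talking to master and retries used up";

    /**
     * Hypothetical helper: runs op; if it fails with the "not master" message,
     * sleeps for waitMillis and tries again, up to maxAttempts attempts in total.
     * Any other failure is rethrown immediately.
     */
    static <T> T callWithWait(Supplier<T> op, int maxAttempts, long waitMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                String msg = e.getMessage();
                // The fragile part: the only signal available is the message string.
                if (msg == null || !msg.contains(NOT_MASTER_MSG)) {
                    throw e;
                }
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(waitMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw e;
                    }
                }
            }
        }
        throw last; // all attempts failed with the "not master" message
    }
}
```

A call such as PrimaryRetry.callWithWait(() -> collection.findOne(), 5, 200L) would block the calling thread for up to roughly 800 ms in the worst case. The string comparison is exactly the brittle step that a dedicated exception type would replace.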

Comment by Jeffrey Yemin [ 25/Oct/12 ]

The way to resolve this issue is to ensure that DBTCPConnector.call (as of 2.9.2, DBTCPConnector.innerCall) always uses a different server for each of the three retry attempts. In the case of primaryPreferred read preference, the driver currently tries the same server (the one that was primary for the first try) for both of the retries, instead of switching to a secondary.

For ReadPreference.primary(), I do not think it's a good idea to block while waiting for the new primary to be elected, as this election could take long enough that threads on the client will be backed up. And in some cases, a new primary may not be elected (in the absence of a quorum). But for any of the other preferences, we should do this.
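
The idea of using a different server for each attempt can be sketched as follows. This is not the code of DBTCPConnector; the class DistinctServerRetry, the method callOnDistinctServers, and the representation of servers as plain strings are all assumptions made for illustration. The candidate list is assumed to be ordered by preference (for primaryPreferred, the primary first, then the secondaries).

```java
import java.util.List;
import java.util.function.Function;

public class DistinctServerRetry {
    /**
     * Hypothetical helper: tries op against each candidate server in order,
     * never reusing a server, for at most maxAttempts attempts.
     * Returns the first successful result, or rethrows the last failure.
     */
    static <T> T callOnDistinctServers(List<String> candidates,
                                       int maxAttempts,
                                       Function<String, T> op) {
        RuntimeException last = null;
        // Each index is visited once, so no server is tried twice.
        int attempts = Math.min(maxAttempts, candidates.size());
        for (int i = 0; i < attempts; i++) {
            String server = candidates.get(i);
            try {
                return op.apply(server);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next server
            }
        }
        if (last == null) {
            throw new IllegalStateException("no attempt was made");
        }
        throw last;
    }
}
```

With two candidates and three allowed attempts, the loop stops after two tries instead of hitting the restarting primary again. That matches the 2-node topology in the description, where the second attempt would go to the secondary.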
