[JAVA-2285] Unable to reconnect to a replica set after a failover due to stale electionId Created: 22/Aug/16  Updated: 13/Sep/16  Resolved: 13/Sep/16

Status: Closed
Project: Java Driver
Component/s: None
Affects Version/s: 3.2.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dude Dou Assignee: Jeffrey Yemin
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MongoDB 3.0.2 replica set (three nodes), Java driver 3.2.0, CentOS 7.0


Attachments: Text File idps.log    

 Description   

The Java driver tries to connect to the replica set indefinitely when the replica set is restarted after the driver client has started.
The driver's log:

2016-08-22 07:55:25,238 [cluster-ClusterId{value='57b680b7c9e77c00067b4846', description='null'}-10.1.245.5:37017] INFO  org.mongodb.driver.cluster(71) - Monitor thread successfully connected to server with description ServerDescription{address=10.1.245.5:37017, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 3]}, minWireVersion=0, maxWireVersion=3, electionId=57baa776aa4cc97ba6377d74, maxDocumentSize=16777216, roundTripTimeNanos=370494, setName='rs0', canonicalAddress=10.1.245.5:37017, hosts=[10.1.245.6:37017, 10.1.245.5:37017], passives=[], arbiters=[10.1.245.7:37017], primary='10.1.245.5:37017', tagSet=TagSet{[]}}
2016-08-22 07:55:25,238 [cluster-ClusterId{value='57b680b7c9e77c00067b4846', description='null'}-10.1.245.5:37017] INFO  org.mongodb.driver.cluster(71) - Invalidating potential primary 10.1.245.5:37017 whose election id 57baa776aa4cc97ba6377d74 is less than the max election id seen so far 57baa83768f8cf45a0d87054



 Comments   
Comment by Jeffrey Yemin [ 13/Sep/16 ]

As we haven't heard back from you in quite some time, I'm closing this issue but will re-open if you come back with further information.

Comment by Jeffrey Yemin [ 23/Aug/16 ]

Hi Dude,

Thanks for the reproduction steps. So far I have not been able to reproduce the behavior you describe, but I can see from the logs you provided what looks like a real problem.

In order to help get to the root cause, can you re-execute the reproduction scenario and attach new client logs as well as server logs for all three replica set members for the time period of the test?

Regards,
Jeff

Comment by Dude Dou [ 23/Aug/16 ]

We reproduced this scenario with the following steps:
1. Stop the replica set nodes (one primary, one secondary and one arbiter) in any order. The client is OK.
2. Start the arbiter first, then the secondary, and finally the primary.
3. Repeat steps 1 and 2. The client now loops forever trying to reconnect.

The logs are attached.

Comment by Jeffrey Yemin [ 22/Aug/16 ]

Hi Dude,

Thanks for the report. From the client logs you provided, this looks to be related to a feature in the driver for detecting stale primaries. In your case, the driver sees 10.1.245.5:37017 as a stale primary because its reported electionId is less than an electionId detected earlier.
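The comparison described above can be illustrated with a small standalone sketch. It is a hypothetical model, not the driver's actual code: it assumes electionIds arrive as 24-character hex ObjectId strings, orders them as unsigned big-endian byte values (for equal-length lowercase hex this is the same as String.compareTo), and invalidates any primary whose id is below the maximum seen so far. The class and method names are illustrative only.

```java
/**
 * Hypothetical sketch of stale-primary detection by electionId.
 * Not the MongoDB Java driver's implementation.
 */
final class StalePrimaryCheck {
    // Highest electionId seen so far (24-char lowercase hex); null until the first primary.
    private String maxElectionIdSeen;

    /**
     * Returns true if a primary reporting this electionId should be invalidated,
     * i.e. its id is less than the maximum seen so far. Otherwise records the id
     * as the new maximum and returns false.
     */
    boolean isStalePrimary(String electionIdHex) {
        String id = electionIdHex.toLowerCase();
        if (!id.matches("[0-9a-f]{24}")) {
            throw new IllegalArgumentException("not a 24-char hex ObjectId: " + electionIdHex);
        }
        // Equal-length lowercase hex compares lexicographically in the same order
        // as the underlying bytes compared as unsigned big-endian values.
        if (maxElectionIdSeen != null && id.compareTo(maxElectionIdSeen) < 0) {
            return true;
        }
        maxElectionIdSeen = id;
        return false;
    }

    public static void main(String[] args) {
        StalePrimaryCheck check = new StalePrimaryCheck();
        // Larger id, detected earlier (the "max election id seen so far" in the log).
        System.out.println(check.isStalePrimary("57baa83768f8cf45a0d87054"));
        // Smaller id reported by 10.1.245.5:37017 afterwards.
        System.out.println(check.isStalePrimary("57baa776aa4cc97ba6377d74"));
    }
}
```

Because the recorded maximum never decreases in this sketch, the smaller id is rejected on every later check as well, which is consistent with the reconnect loop described in this report.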

In order to determine if there is an actual bug in either the driver or the server, please:

  • Provide the full client logs for the org.mongodb.driver.cluster logging component
  • Elaborate on what you mean by a 'replica set restart'. What exactly was restarted and in what order? Is there evidence of a primary failover occurring during the restarts?
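For the first item, one way to capture full output from the org.mongodb.driver.cluster component (the logger name visible in the log lines above) is sketched below. It assumes SLF4J is not on the application's classpath, so the 3.x driver falls back to java.util.logging, and that the driver's debug-level messages map to JUL's FINE level. If SLF4J is present, the equivalent setting belongs in that logging backend's configuration instead. The class and method names are hypothetical.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: route org.mongodb.driver.cluster output at FINE to the console. */
final class ClusterLogging {
    // Keep a strong reference: JUL holds loggers weakly, so an unreferenced
    // logger and its configuration could be garbage-collected.
    private static final Logger CLUSTER_LOGGER = Logger.getLogger("org.mongodb.driver.cluster");
    private static boolean configured = false;

    static synchronized Logger enableDebugLogging() {
        if (!configured) {                       // avoid attaching duplicate handlers
            ConsoleHandler handler = new ConsoleHandler();
            handler.setLevel(Level.FINE);        // ConsoleHandler defaults to INFO
            CLUSTER_LOGGER.setLevel(Level.FINE);
            CLUSTER_LOGGER.addHandler(handler);
            CLUSTER_LOGGER.setUseParentHandlers(false); // no duplicate lines via the root handler
            configured = true;
        }
        return CLUSTER_LOGGER;
    }

    public static void main(String[] args) {
        enableDebugLogging();
        // Create the MongoClient after this call so its monitor threads log at FINE.
        CLUSTER_LOGGER.fine("org.mongodb.driver.cluster logging enabled at FINE");
    }
}
```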

Thanks,
Jeff
