[SERVER-27551] QuorumChecker should retry requests that fail
Created: 30/Dec/16  Updated: 02/Jul/20  Resolved: 24/Jan/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.4.3, 3.5.2

Type: Improvement
Priority: Major - P3
Reporter: Judah Schvimer
Assignee: Judah Schvimer
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-49305 Remove reconfig retries in our tests Closed
Backwards Compatibility: Fully Compatible
Backport Requested: v3.4
Sprint: Repl 2017-01-23, Repl 2017-02-13
Participants:
Linked BF Score: 0

 Description   

The QuorumChecker sends out heartbeat requests to check for quorum on initiate and reconfig commands. If a request fails, especially due to something like an ExceededTimeout, we should retry the request.
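As an illustration of the requested behavior, here is a minimal sketch in Python. It is not the server's QuorumChecker (which is C++ and more involved); it is a simplified, sequential model. The names `check_quorum`, `send_heartbeat`, and `max_attempts` are hypothetical, and the sketch assumes a timed-out heartbeat surfaces as Python's built-in `TimeoutError` and that every member counts equally toward the majority.

```python
from typing import Callable, Iterable


def check_quorum(
    members: Iterable[str],
    send_heartbeat: Callable[[str], None],
    max_attempts: int = 3,
) -> bool:
    """Return True if a strict majority of members answer a heartbeat.

    send_heartbeat is a hypothetical callable that raises TimeoutError
    when the request to a member times out. A failed request is retried
    until the member has been tried max_attempts times in total.
    """
    member_list = list(members)
    responded = 0
    for member in member_list:
        for _ in range(max_attempts):
            try:
                send_heartbeat(member)
            except TimeoutError:
                # The request failed; try this member again instead of
                # counting it as unreachable right away.
                continue
            responded += 1
            break
    # A quorum here means more than half of the listed members responded.
    return responded > len(member_list) // 2
```

With this shape, a member whose first heartbeat times out but whose second succeeds still counts toward the quorum, which is the case the ticket is concerned with.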



 Comments   
Comment by Githook User [ 06/Feb/17 ]

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>

Message: SERVER-27551 added retries to replSetInitiate in ReplSetTest

(cherry picked from commit f312d2b232df9b18bbd6a85169162dc61e5316f1)
Branch: v3.4
https://github.com/mongodb/mongo/commit/0e7a8e4d38e52e451c35c55b2414591a7b25ad19

Comment by Githook User [ 06/Feb/17 ]

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>

Message: SERVER-27551 added retries to replSetInitiate call in python test fixture

(cherry picked from commit 9710251a203ed703055a8435058183c2ddaa4222)
Branch: v3.4
https://github.com/mongodb/mongo/commit/352ee027ec0a823f18209b54f14f9e856c577268

Comment by Githook User [ 24/Jan/17 ]

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>

Message: SERVER-27551 added retries to replSetInitiate in ReplSetTest
Branch: master
https://github.com/mongodb/mongo/commit/f312d2b232df9b18bbd6a85169162dc61e5316f1

Comment by Githook User [ 24/Jan/17 ]

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>

Message: SERVER-27551 added retries to replSetInitiate call in python test fixture
Branch: master
https://github.com/mongodb/mongo/commit/9710251a203ed703055a8435058183c2ddaa4222

Comment by Judah Schvimer [ 18/Jan/17 ]

After discussing with max.hirschhorn, we will retry the initiate call 3 times in both the python test fixtures and ReplSetTest.

At first glance, this issue seems to occur mostly on OSX and Windows 2008 Debug machines, which tend to be slower. The Python fixtures also start up multiple servers at once, which may create more load on those machines and slow down the heartbeats. This suggests that it is likely a slow-machine problem and that covering it up with a retry is not a bad idea. replSetInitiate is also not something users do very often, so any lost test coverage there is okay if it reduces CI noise.
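A rough sketch of what such a retry could look like on the Python side follows. It is not the code from the linked commits. The names `initiate_with_retries`, `run_initiate`, `num_attempts`, and `delay_secs` are hypothetical, "3 times" is read here as at most three attempts in total, and `run_initiate` stands in for whatever callable actually issues replSetInitiate and raises on failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def initiate_with_retries(
    run_initiate: Callable[[], T],
    num_attempts: int = 3,
    delay_secs: float = 1.0,
) -> T:
    """Call run_initiate() until it succeeds or num_attempts is reached.

    run_initiate is a hypothetical zero-argument callable that sends the
    replSetInitiate command and raises an exception if the command fails.
    """
    if num_attempts < 1:
        raise ValueError("num_attempts must be at least 1")
    for attempt in range(1, num_attempts + 1):
        try:
            return run_initiate()
        except Exception:
            # On the final attempt, surface the failure to the caller.
            if attempt == num_attempts:
                raise
            # Give a slow machine a moment before trying again.
            time.sleep(delay_secs)
    raise AssertionError("unreachable")
```

This version retries on any exception. Restricting the retry to a specific error such as ExceededTimeout, as suggested elsewhere in this ticket, would mean inspecting the exception before deciding whether to loop.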

Comment by Eric Milkie [ 06/Jan/17 ]

Hmm - the Python fixtures might fail more often because of when they run within a test. That is, they may tend to run sooner after the heavy I/O load from setting up the testing infrastructure than a given ReplSetTest does. Just a theory there.

Comment by Judah Schvimer [ 05/Jan/17 ]

For an inexplicable reason, this is more of a problem for the python fixtures than ReplSetTest, but that could be the better place to do it quickly.

Comment by Eric Milkie [ 03/Jan/17 ]

To address build failures, we could consider simply retrying the initiate command again if it fails due to ExceededTimeout, in ReplSetTest.

Comment by Eric Milkie [ 31/Dec/16 ]

Why would we retry the request for any other errors besides ExceededTimeout?
Also, I don't see what benefit retrying will have. I would expect that if the first request timed out, the second attempt would also likely time out.
On initiate, if a quorum cannot respond without timing out, that doesn't bode well for keeping a primary up with a majority of voting nodes either.

Generated at Thu Feb 08 04:15:29 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.