[SERVER-27551] QuorumChecker should retry requests that fail Created: 30/Dec/16 Updated: 02/Jul/20 Resolved: 24/Jan/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 3.4.3, 3.5.2 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Judah Schvimer |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v3.4 |
| Sprint: | Repl 2017-01-23, Repl 2017-02-13 |
| Participants: | |
| Linked BF Score: | 0 |
| Description |
|
The QuorumChecker sends out heartbeat requests to check for quorum on initiate and reconfig commands. If a request fails, especially due to something like an ExceededTimeout, we should retry the request. |
| Comments |
| Comment by Githook User [ 06/Feb/17 ] |
|
Author: Judah Schvimer (judahschvimer) <judah@mongodb.com> Message: (cherry picked from commit f312d2b232df9b18bbd6a85169162dc61e5316f1) |
| Comment by Githook User [ 06/Feb/17 ] |
|
Author: Judah Schvimer (judahschvimer) <judah@mongodb.com> Message: (cherry picked from commit 9710251a203ed703055a8435058183c2ddaa4222) |
| Comment by Githook User [ 24/Jan/17 ] |
|
Author: Judah Schvimer (judahschvimer) <judah@mongodb.com> Message:
| Comment by Githook User [ 24/Jan/17 ] |
|
Author: Judah Schvimer (judahschvimer) <judah@mongodb.com> Message:
| Comment by Judah Schvimer [ 18/Jan/17 ] |
|
After discussing with max.hirschhorn, we will retry the initiate call 3 times in both the Python test fixtures and ReplSetTest. At first glance, this issue seems to occur mostly on OSX and Windows 2008 Debug machines, which tend to be slower. The Python fixtures also start up multiple servers at once, which may put more load on those machines and slow down the heartbeats. This suggests that it is likely a slow-machine problem and that covering it up with a retry is not a bad idea. replSetInitiate is also not something users do very often, so any lost test coverage there is acceptable if it reduces CI noise. |
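A minimal sketch of the retry described in this comment, not the change that was actually committed. It assumes "3 times" means three attempts in total. The names `initiate_with_retry`, `ExceededTimeout` (a stand-in exception class), `attempts`, `delay_secs`, and `retry_on` are hypothetical and do not come from the test fixtures or ReplSetTest.

```python
import time


class ExceededTimeout(Exception):
    """Hypothetical stand-in for a command that failed with ExceededTimeout."""


def initiate_with_retry(initiate, attempts=3, delay_secs=0.0,
                        retry_on=(ExceededTimeout,)):
    """Call initiate() until it succeeds or the attempts run out.

    `initiate` is any zero-argument callable. In a real fixture it would
    send replSetInitiate to a node; that wiring is assumed, not shown.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return initiate()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            time.sleep(delay_secs)  # optional pause before the next attempt
```

Any exception type not listed in `retry_on` propagates immediately, so whether to retry only on ExceededTimeout or on a wider set of failures stays a choice for the caller.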
| Comment by Eric Milkie [ 06/Jan/17 ] |
|
Hmm - the Python fixtures might fail more often because of when they run within a test. That is, they may tend to run right after the heavy I/O load from setting up the testing infrastructure, whereas a given ReplSetTest runs later. Just a theory. |
| Comment by Judah Schvimer [ 05/Jan/17 ] |
|
For some unexplained reason, this is more of a problem for the Python fixtures than for ReplSetTest, but that could be the better place to do it quickly. |
| Comment by Eric Milkie [ 03/Jan/17 ] |
|
To address build failures, we could consider simply retrying the initiate command in ReplSetTest if it fails due to ExceededTimeout. |
| Comment by Eric Milkie [ 31/Dec/16 ] |
|
Why would we retry the request for any other errors besides ExceededTimeout? |