[SERVER-31461] Resmoke stepdown hook should deal with NotMaster errors Created: 09/Oct/17 Updated: 30/Nov/17 Resolved: 30/Nov/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Ian Boros | Assignee: | Max Hirschhorn |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Sprint: | TIG 2017-10-23, TIG 2017-11-13 |
| Description |
|
Right now, the stepdown thread's main loop does not appear to wait for a new primary to be elected before sending another replSetStepDown command. As a result, it can send a replSetStepDown command to a server that is not the primary and receive a NotMaster error in response. Here's an example of a patch build where this happens (search for "not primary"):

I think the StepdownThread should ignore these NotMaster errors, just as it ignores "connection failure" errors. Another solution would be for the thread to wait until a primary is elected before stepping a node down. |
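The first proposal can be sketched roughly as follows. This is only an illustration under stated assumptions, not resmoke's actual code: `try_step_down`, `send_step_down`, and the two exception classes are hypothetical stand-ins for the hook's real logic and the driver's real exception types. The only MongoDB-specific detail used is the server's numeric code for NotMaster, 10107.

```python
import logging

# MongoDB's numeric error code for NotMaster.
NOT_MASTER_CODE = 10107


class ConnectionFailure(Exception):
    """Hypothetical stand-in for a driver's connection-failure exception."""


class OperationFailure(Exception):
    """Hypothetical stand-in for a driver's command-failure exception."""

    def __init__(self, message, code=None):
        super().__init__(message)
        self.code = code


def try_step_down(send_step_down, logger):
    """Send one stepdown request and tolerate the two expected failures.

    `send_step_down` is a hypothetical zero-argument callable that issues
    the replSetStepDown command. Returns False only when the target turned
    out not to be the primary; any other command failure is re-raised.
    """
    try:
        send_step_down()
    except ConnectionFailure:
        # Already ignored by the hook today, per the description above.
        logger.info("Connection failure during stepdown; ignoring.")
        return True
    except OperationFailure as err:
        if err.code != NOT_MASTER_CODE:
            raise
        logger.info("Target was not primary; skipping this round: %s", err)
        return False
    return True
```

Matching on the error code rather than the message text is a design choice in this sketch; the real hook might need to inspect the message instead, depending on what the driver surfaces.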
| Comments |
| Comment by Max Hirschhorn [ 30/Nov/17 ] |
|
We're planning to switch to using ReplicaSetFixture.get_primary() instead of relying on PyMongo to do server selection as part of the changes for |
| Comment by Ian Boros [ 13/Oct/17 ] |
|
Per max.hirschhorn's comment here, I'll change the error handling code to handle several other types of exceptions. |
| Comment by Max Hirschhorn [ 09/Oct/17 ] |
|
The StepdownThread attempts to leverage PyMongo's server discovery in order to send the "replSetStepDown" command to the current primary. It doesn't appear that the primary of the CSRS decided to step down on its own as we were about to send the "replSetStepDown" command, so I don't have an explanation for why PyMongo raised an OperationFailure exception.
I think the lack of log messages from the StepdownThread is going to hurt our ability to diagnose this issue, so we may want to address that first.
|
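The alternative raised in the description (wait for a primary before stepping down) and the logging concern in the earliest comment could be combined along these lines. This is a minimal sketch under assumptions: `wait_for_primary` and its `get_primary` argument are hypothetical, and `get_primary` is assumed to return `None` while no primary is elected. It is not meant to mirror the signature or behavior of `ReplicaSetFixture.get_primary()`.

```python
import logging
import time


def wait_for_primary(get_primary, logger, timeout_secs=30.0, interval_secs=0.1):
    """Poll until get_primary() returns a node, logging each attempt.

    `get_primary` is a hypothetical callable returning the current primary,
    or None while no primary is elected. Raises TimeoutError if no primary
    appears before the deadline.
    """
    deadline = time.monotonic() + timeout_secs
    attempt = 0
    while True:
        attempt += 1
        primary = get_primary()
        if primary is not None:
            logger.info("Found primary %s after %d attempt(s).", primary, attempt)
            return primary
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "No primary elected within %.1f seconds" % timeout_secs)
        logger.info("No primary yet (attempt %d); retrying.", attempt)
        time.sleep(interval_secs)
```

Logging every attempt keeps a record of how long each election took, which is the kind of information the comment above notes is missing when diagnosing a failure.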