[SERVER-31461] Resmoke stepdown hook should deal with NotMaster errors Created: 09/Oct/17  Updated: 30/Nov/17  Resolved: 30/Nov/17

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Ian Boros Assignee: Max Hirschhorn
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-30979 Run the fuzzer with CSRS primary step... Closed
Duplicate
is duplicated by SERVER-31532 Change error message to start with "n... Closed
Related
Sprint: TIG 2017-10-23, TIG 2017-11-13
Participants:

 Description   

Right now it seems like the stepdown thread's main loop doesn't wait for a new primary to be elected before sending another replSetStepDown command. This means that it's possible to send a replSetStepDown command to a server that's not a primary, and thus to receive a NotMaster error. Here's an example of a patch build where this happens (search for "not primary"):

https://evergreen.mongodb.com/task_log_raw/mongodb_mongo_master_windows_64_2k8_ssl_jstestfuzz_concurrent_sharded_continuous_stepdown_patch_9e72a50f1ede62ab9f5899cf8f10dd93ca0c45d1_59d7e4d5e3c3312e74002b4d_17_10_06_20_18_01/0?type=T&text=true

I think the StepDownThread should deal with these NotMaster errors and ignore them, just as it does with "connection failure" errors.

Another solution would be for the thread to wait until a primary is elected before stepping a node down.
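The first option could look something like the following sketch. This is illustrative only, not the actual resmoke hook code: `continuous_stepdown`, `NotMasterError`, and the injected callables are hypothetical stand-ins (the real code would be catching PyMongo's `OperationFailure` with a "not master" message).

```python
# Hypothetical sketch: in the stepdown thread's main loop, treat
# "not master" command failures the same way connection failures are
# already treated -- ignore them and retry on the next iteration.

import time

class NotMasterError(Exception):
    """Stand-in for an OperationFailure whose message is 'not master'."""

def continuous_stepdown(send_step_down, stop_check, interval_secs=0.0):
    """Repeatedly ask the current primary to step down.

    send_step_down: callable that issues replSetStepDown; may raise
        NotMasterError if server selection targeted a stale primary.
    stop_check: callable returning True when the thread should exit.
    Returns (successful_stepdowns, ignored_not_master_errors).
    """
    stepdowns = 0
    ignored = 0
    while not stop_check():
        try:
            send_step_down()
            stepdowns += 1
        except NotMasterError:
            # The node we targeted is no longer (or not yet) primary;
            # ignore and retry, just as with a connection failure.
            ignored += 1
        time.sleep(interval_secs)
    return stepdowns, ignored
```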



 Comments   
Comment by Max Hirschhorn [ 30/Nov/17 ]

We're planning to switch to using ReplicaSetFixture.get_primary() instead of relying on PyMongo to do server selection as part of the changes for SERVER-30979, so the work described by this ticket is no longer necessary.
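The `get_primary()` approach could be sketched as below. This is an assumption about the shape of that API, not the actual `ReplicaSetFixture` implementation: the `ismaster()` method on the node objects and the `NoPrimaryError` type are hypothetical, but the idea is the same — poll the fixture's own nodes until one reports itself primary, instead of relying on the driver's server selection.

```python
# Hedged sketch: find the primary by asking the fixture's nodes
# directly, retrying until one reports primary or a timeout elapses.

import time

class NoPrimaryError(Exception):
    """Raised when no node reports itself primary within the timeout."""

def get_primary(nodes, timeout_secs=30, poll_interval=0.1):
    """Return the first node whose ismaster() call reports primary."""
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        for node in nodes:
            if node.ismaster():
                return node
        time.sleep(poll_interval)
    raise NoPrimaryError(
        "no primary elected within %.1fs" % timeout_secs)
```

Stepping down the node returned here (rather than whatever node the driver selected) avoids the race in which the previous primary is targeted after it has already stepped down.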

Comment by Ian Boros [ 13/Oct/17 ]

Per max.hirschhorn's comment here, I'll change the error-handling code to handle several other types of exceptions.

Comment by Max Hirschhorn [ 09/Oct/17 ]

The StepdownThread attempts to leverage PyMongo's server discovery in order to send the "replSetStepDown" command to the current primary. It doesn't appear that the primary of the CSRS stepped down on its own just as we were about to send the "replSetStepDown" command, so I don't have an explanation for why PyMongo raised an OperationFailure exception.

[ShardedClusterFixture:job0:configsvr:node1] 2017-10-06T22:00:59.976+0000 I REPL     [replexec-28] transition to PRIMARY
[ShardedClusterFixture:job0:configsvr:node0] 2017-10-06T22:01:01.364+0000 I REPL     [replexec-39] transition to SECONDARY
[ShardedClusterFixture:job0:configsvr:node0] 2017-10-06T22:01:02.297+0000 I COMMAND  [conn712] Attempting to step down in response to replSetStepDown command
[ShardedClusterFixture:job0:configsvr:node0] 2017-10-06T22:01:02.297+0000 I COMMAND  [conn712] command admin.$cmd command: replSetStepDown { replSetStepDown: 10, force: true, $db: "admin" } numYields:0 reslen:299 locks:{} protocol:op_query 40ms
[ShardedClusterFixture:job0:configsvr:node1] 2017-10-06T22:01:03.321+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted

I think the lack of log messages from the StepdownThread is going to hurt our ability to diagnose this issue, so we may want to address that first.

No handlers could be found for logger "ContinuousStepdown:job0"
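That warning is Python's standard-library logging complaining that the "ContinuousStepdown:job0" logger has no handler attached anywhere up its hierarchy, so its messages are dropped. A minimal fix (a sketch, not the actual resmoke logging configuration) is to attach a handler explicitly:

```python
# Attach a stream handler so messages from the stepdown hook's logger
# actually reach the test output instead of being silently dropped.

import logging

logger = logging.getLogger("ContinuousStepdown:job0")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("[%(name)s] %(asctime)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("stepping down the current primary")
```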

Generated at Thu Feb 08 04:27:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.