[SERVER-40252] Signaling 1-node replica set to shut down now takes an extra 10 seconds
Created: 21/Mar/19  Updated: 08/Jan/24  Resolved: 15/Apr/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor - P4
Reporter: Max Hirschhorn
Assignee: Backlog - Service Architecture
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-40335 Don't wait for election handoff in Re... Closed
Problem/Incident
is caused by SERVER-38994 Step down on SIGTERM Closed
Related
is related to SERVER-40335 Don't wait for election handoff in Re... Closed
Assigned Teams:
Service Arch
Operating System: ALL
Sprint: Service Arch 2019-03-25
Participants:

 Description   

It seems like attempting to run ReplicationCoordinator::stepDown() is unnecessary when the replica set configuration is known to only contain one node electable as primary. The extra time it takes to shut down the replica set is mildly annoying for certain aspects of my local development workflow.

2019-03-21T03:09:22.009-0400 I  CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2019-03-21T03:09:22.009-0400 I  REPL     [RstlKillOpthread] Starting to kill user operations
2019-03-21T03:09:22.009-0400 I  REPL     [RstlKillOpthread] Stopped killing user operations
2019-03-21T03:09:32.020-0400 I  REPL     [RstlKillOpthread] Starting to kill user operations
2019-03-21T03:09:32.020-0400 I  REPL     [RstlKillOpthread] Stopped killing user operations
2019-03-21T03:09:32.020-0400 I  STORAGE  [signalProcessingThread] Failed to stepDown in non-command initiated shutdown path ExceededTimeLimit: No electable secondaries caught up as of 2019-03-21T03:09:32.020-0400. Please use the replSetStepDown command with the argument {force: true} to force node to step down.
2019-03-21T03:09:32.020-0400 I  NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...



 Comments   
Comment by Mira Carey [ 15/Apr/19 ]

Closing this out after the change made in SERVER-40335.

I think that satisfies the intent of this ticket

Comment by Vesselina Ratcheva (Inactive) [ 25/Mar/19 ]

I think the fix Jason pointed out in the topology coordinator is the way to go implementation-wise (it can also be made in isSafeToStepDown), provided we come to a consensus about user-facing behavior. In the same spirit as the proposal in SERVER-40335, I would also propose adding a parameter to gate that new behavior directly in the topology coordinator instead.

Comment by Mira Carey [ 25/Mar/19 ]

After some reflection (and conversation with max.hirschhorn), I'm going to move this to "features we're not sure of", for now.

If we don't want to tackle allowing shutdown in more configurations, we should probably just make the timeout configurable (and make it 0 for most tests). I've opened SERVER-40335 to explore that avenue.
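
A minimal sketch of the configurable-timeout idea, not the actual server code: the struct, field, and function names below are hypothetical and the real change is tracked in SERVER-40335. The 10-second default mirrors the delay visible in the log in the description. With a wait of 0 the condition is checked once and shutdown proceeds immediately.

```cpp
#include <chrono>
#include <functional>
#include <thread>

using Milliseconds = std::chrono::milliseconds;

// Hypothetical option block; the field name is an assumption for illustration.
struct ShutdownOptions {
    // How long a signaled shutdown waits for a safe step down.
    Milliseconds stepDownWait{10000};
};

// Polls `isSafeToStepDown` until it returns true or the configured wait
// elapses. Returns true if a safe step down became possible in time.
// The predicate is always evaluated at least once, even with a zero wait.
bool waitForSafeStepDown(const ShutdownOptions& opts,
                         const std::function<bool()>& isSafeToStepDown,
                         Milliseconds pollInterval = Milliseconds(10)) {
    const auto deadline = std::chrono::steady_clock::now() + opts.stepDownWait;
    for (;;) {
        if (isSafeToStepDown()) {
            return true;
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // Give up and let shutdown continue.
        }
        std::this_thread::sleep_for(pollInterval);
    }
}
```

Tests that never need an election handoff would set the wait to 0, while the default keeps the existing behavior for everyone else.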

Comment by Andy Schwerin [ 22/Mar/19 ]

Absolutely. My point is we shouldn't fix this regression by trading it for another user-facing behavior change without considering it.

Comment by Danny Hatcher (Inactive) [ 22/Mar/19 ]

If SERVER-38994 is what caused this, there is an argument to be made that as-is it's a client-facing regression (albeit a small one).

Comment by Andy Schwerin [ 22/Mar/19 ]

I am reluctant to change the user-facing behavior of the stepDown and shutDown commands in this instance to make our tests run faster. I made a conscious decision to require the user to force shutdown whenever there is no other electable node. At the very least, we should let product weigh in. We might also have to update the documentation.

Comment by Max Hirschhorn [ 21/Mar/19 ]

FWIW, I filed this ticket because of my use of 1-node replica sets locally, but I think the change should apply to any replica set where electableCount == 1. Stepping down a replica set with a single voting node may still be useful for testing purposes, i.e. to have the primary actually transition to state SECONDARY, but to just skip the election handoff part.

Comment by Mira Carey [ 21/Mar/19 ]

I think the fix here is to make repl coordinator stepDown, or topology coordinator attemptStepDown, return quickly if the configured set has 1 node.

That would fix the slowness on sigterm, and make the shutdown command do something sane for 1 node repl sets.

At a glance, I'd probably change https://github.com/mongodb/mongo/blob/2a4d8ed5bb64af081b887f17dabf298831866b1d/src/mongo/db/repl/topology_coordinator.cpp#L2237

bool TopologyCoordinator::_canCompleteStepDownAttempt(Date_t now, Date_t waitUntil, bool force) {
    const bool forceNow = force && (now >= waitUntil);
    if (forceNow) {
        return true;
    }
 
    return isSafeToStepDown();
}

so that there is an additional check for single node sets
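
As a rough illustration of that additional check, here is a standalone sketch. It is a simplified stand-in, not the real TopologyCoordinator: the class name, the member-count field, and isSingleNodeSet() are hypothetical, and Date_t is modeled with std::chrono only so the example compiles on its own.

```cpp
#include <chrono>

// Stand-in for mongo::Date_t so the sketch is self-contained.
using Date_t = std::chrono::system_clock::time_point;

// Simplified model of the topology coordinator (hypothetical class).
class TopologyCoordinatorSketch {
public:
    TopologyCoordinatorSketch(int numMembers, bool safeToStepDown)
        : _numMembers(numMembers), _safeToStepDown(safeToStepDown) {}

    bool canCompleteStepDownAttempt(Date_t now, Date_t waitUntil, bool force) const {
        const bool forceNow = force && (now >= waitUntil);
        if (forceNow) {
            return true;
        }

        // Hypothetical extra check: in a single-node set no secondary can
        // ever catch up, so waiting for one cannot succeed.
        if (isSingleNodeSet()) {
            return true;
        }

        return isSafeToStepDown();
    }

private:
    bool isSingleNodeSet() const {
        return _numMembers == 1;
    }

    // In the real server this inspects secondaries' progress; here it is a
    // fixed value supplied by the caller.
    bool isSafeToStepDown() const {
        return _safeToStepDown;
    }

    int _numMembers;
    bool _safeToStepDown;
};
```

Note that this changes what an unforced step down does on a 1-node set, so it is a user-facing behavior change as well as a speedup.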

Comment by Judah Schvimer [ 21/Mar/19 ]

This feels pretty costly in terms of evergreen time spent.

CC mira.carey@mongodb.com for any thoughts.

Comment by Max Hirschhorn [ 21/Mar/19 ]

I would vote for changing the replSetStepDown command because you also cannot use the shutdown command without force=true to shut down a 1-node replica set.
