[SERVER-38994] Step down on SIGTERM Created: 14/Jan/19  Updated: 29/Oct/23  Resolved: 15/Mar/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.6.15, 4.0.8, 4.1.10

Type: Bug Priority: Major - P3
Reporter: Alyson Cabral (Inactive) Assignee: Mira Carey
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Documented
is documented by DOCS-12561 Docs for SERVER-38994: Step down on S... Closed
Duplicate
is duplicated by SERVER-23293 Allow more graceful shutdown from sig... Closed
Problem/Incident
causes SERVER-40252 Signaling 1-node replica set to shut ... Closed
Related
related to SERVER-39424 Test that DDL operations can't succee... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.0, v3.6
Sprint: Service Arch 2019-03-11, Service Arch 2019-03-25
Participants:
Linked BF Score: 50

 Description   

Kubernetes uses SIGTERM to spin down containers, and our SysV init and systemd unit files use SIGTERM to shut down the mongodb service. Using the election handoff path of stepdown, instead of waiting for the full electionTimeoutMillis, would greatly reduce election time and prevent potential data loss due to replica set rollback.
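
A minimal, self-contained sketch of the idea, using an illustrative stand-in for the replication machinery (ReplCoordStub and every name below are hypothetical and are not the server's actual signal-handling or replication code):

    #include <atomic>
    #include <chrono>
    #include <csignal>
    #include <iostream>
    #include <thread>

    // Illustrative stand-in for the replication machinery; not MongoDB's API.
    struct ReplCoordStub {
        bool isReplSetMember() const { return true; }
        // Pretend to hand the primary role to a caught-up secondary.
        void stepDownForHandoff(std::chrono::seconds waitTime) {
            std::cout << "election handoff: waiting up to " << waitTime.count()
                      << "s for a caught-up secondary\n";
        }
    };

    std::atomic<bool> shutdownRequested{false};

    extern "C" void onSigterm(int) {
        // Only set a flag in the handler; do the real work on a normal thread.
        shutdownRequested.store(true);
    }

    int main() {
        std::signal(SIGTERM, onSigterm);
        ReplCoordStub replCoord;

        // Stand-in for the server's main loop.
        while (!shutdownRequested.load()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }

        // On SIGTERM, take the same path as an explicit shutdown command:
        // hand off primaryship rather than letting the replica set wait out
        // electionTimeoutMillis and run a failover election.
        if (replCoord.isReplSetMember()) {
            replCoord.stepDownForHandoff(std::chrono::seconds(10));
        }
        std::cout << "clean shutdown\n";
        return 0;
    }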



 Comments   
Comment by Githook User [ 27/Sep/19 ]

Author:

Rahul Sundararaman <rahul.sundararaman@mongodb.com>

Message: SERVER-38994 step down on SIGTERM

Check to see if we've entered shutdown from a shutdown command. If not,
and if the replication machinery is up, attempt a shutdown in the style
of a default shutdown command.

(cherry picked from commit 9f5b13ee93e7eaeafa97ebd1d2d24c66b93cc974)
Branch: v3.6
https://github.com/mongodb/mongo/commit/34217d3b595b172180603e48e17421a330e04a81

Comment by David Bartley [ 13/Aug/19 ]

Would it be possible to backport this to 3.6 as well?

Comment by Githook User [ 25/Mar/19 ]

Author:

Jason Carey <jcarey@argv.me> (hanumantmk)

Message: SERVER-38994 step down on SIGTERM

Check to see if we've entered shutdown from a shutdown command. If not,
and if the replication machinery is up, attempt a shutdown in the style
of a default shutdown command.

(cherry picked from commit 9f5b13ee93e7eaeafa97ebd1d2d24c66b93cc974)
Branch: v4.0
https://github.com/mongodb/mongo/commit/6b5def2ef0f4c798c67a04f390e81aa9f3bb9415

Comment by Githook User [ 15/Mar/19 ]

Author:

Jason Carey <jcarey@argv.me> (hanumantmk)

Message: SERVER-38994 step down on SIGTERM

Check to see if we've entered shutdown from a shutdown command. If not,
and if the replication machinery is up, attempt a shutdown in the style
of a default shutdown command.
Branch: master
https://github.com/mongodb/mongo/commit/9f5b13ee93e7eaeafa97ebd1d2d24c66b93cc974
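
A hedged sketch of the logic this commit message describes; the names below (shutdownStartedByCommand, stepDownLikeShutdownCommand, and so on) are illustrative and are not the identifiers the commit actually uses:

    #include <iostream>

    namespace sketch {

    bool shutdownStartedByCommand = false;  // set when a shutdown command drives the shutdown
    bool replicationMachineryUp = true;     // replication subsystem is initialized and running

    // Stand-in for the stepdown a default shutdown command performs.
    void stepDownLikeShutdownCommand() {
        std::cout << "attempting election handoff before exit\n";
    }

    // Invoked from the signal-driven (SIGTERM) shutdown path.
    void onTerminationSignal() {
        if (shutdownStartedByCommand) {
            // A shutdown command already owns the stepdown; nothing extra to do.
            return;
        }
        if (replicationMachineryUp) {
            // Mirror the default shutdown command's behavior.
            stepDownLikeShutdownCommand();
        }
    }

    }  // namespace sketch

    int main() {
        sketch::onTerminationSignal();
        return 0;
    }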

Comment by Eric Milkie [ 15/Jan/19 ]

Unless I'm misunderstanding something, in your first sentence you say "no data is lost", yet in your second sentence you note that this is in fact not true today with an "orderly shutdown" via SIGTERM: if a majority of the secondaries have not yet caught up, then data is lost ("rolled back", which is a misnomer IMO – committed data is lost). That's awful behavior, IMO.

There is some confusion here; I never said "no data is lost", I said "no committed data is lost", and that's an important distinction. Committed data is never lost; we define "committed" as data that cannot be rolled back.

Indeed any situation that results in data being rolled back is unfortunate, but such situations only affect users doing writes with write concern less than "w:majority". Such applications are already in danger of having writes roll back at any moment if there is an unplanned election in the replica set.

I do believe the shutdown command does a stepdown; we recently fixed some of the default timeout parameters that were used for this code path.

I'm in favor of proceeding with this ticket; I just want to understand what the proposed solution will be, in light of the behavior we have today.

Comment by Mira Carey [ 15/Jan/19 ]

schwerin, I think that's not correct. See: https://github.com/mongodb/mongo/blob/b489828d0c176e90e47724f6771610227b29f117/src/mongo/db/commands/shutdown_d.cpp#L65-L68

        try {
            repl::ReplicationCoordinator::get(opCtx)->stepDown(
                opCtx, force, Seconds(timeoutSecs), Seconds(120));
        } catch (const DBException& e) {

Regarding having a longer period where a cluster fails to accept writes: it might be a reasonable compromise to set a timeout and use the force flag when stepping down. That way shutdown doesn't take materially longer than it does today, but we still get a clean handoff when the cluster has little lag.
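
As an illustration of that compromise (a variation on the snippet quoted above; the values here are examples only, not what the eventual fix used):

        try {
            // force=true: if no secondary catches up within the wait period,
            // step down anyway, so shutdown is not delayed much beyond today's
            // behavior, while a caught-up secondary still gets a clean handoff
            // whenever replication lag is small.
            repl::ReplicationCoordinator::get(opCtx)->stepDown(
                opCtx, /*force=*/true, Seconds(10), Seconds(120));
        } catch (const DBException&) {
            // Fall back to a plain shutdown if the stepdown attempt fails.
        }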

Comment by Andy Schwerin [ 15/Jan/19 ]

Note to potential implementers: the shutdown command path does not naturally do a stepdown, today.

Comment by Matt Lord (Inactive) [ 15/Jan/19 ]

AFAIUI, if you don't do an election handoff, then it's handled as a failure by the replica set – an election is called after a majority of the remaining members have had no communication with the old primary for the default 10-second window (electionTimeoutMillis). How can you have a "normal shutdown" that is at the same time handled as a failure by the overall system?

Unless I'm misunderstanding something, in your first sentence you say "no data is lost", yet in your second sentence you note that this is in fact not necessarily true today with an "orderly shutdown" via SIGTERM: if a majority of the secondaries have not yet caught up, then locally-but-not-majority-committed data is lost ("rolled back", which is a bit of a misnomer IMO). Side note: this is the default behavior today, as w:majority is not the default, and it's bad default behavior, IMO. I would prefer we always choose data safety over performance by default and let production engineers and DBAs explicitly choose when to prefer performance over safety.

Imagine a not-too-uncommon case where a user believes that machine failures are very rare events for them, doesn't wish to take the constant write-latency penalty of w:majority, and is OK with the trade-off of potentially losing the last N writes when the rare failure occurs. I suspect they would be very (unpleasantly) surprised to learn that in practice many "planned maintenance" operations are, in effect, treated as machine failures too.

Comment by Eric Milkie [ 15/Jan/19 ]

I would argue that the normal shutdown does indeed do an election handoff; it's just not as fast as it could be. The replication subsystem does not handle the shutdown as a failure; everything operates as designed: no committed data is lost and no data is corrupted.

Arguably, some users may prefer the current behavior: introducing a stepdown as part of SIGTERM handling can potentially expand the window of time during which a cluster will not accept writes, since the primary will now wait for a majority of secondaries to catch up during stepdown in an attempt to protect as many writes as possible from being rolled back.

Comment by Matt Lord (Inactive) [ 15/Jan/19 ]

milkie, I would argue that if the normal shutdown – and SIGTERM is a normal/standard shutdown method – doesn't do an election handoff, and the replication subsystem thus treats the shutdown as a failure, then it's not clean. It's clean as far as the single-node storage subsystem is concerned; we know that. But for a distributed database, I would argue that the collective replication and storage system is just as important.

What am I missing? 

Comment by Eric Milkie [ 15/Jan/19 ]

I'm not sure why this is a bug. Upon receipt of any of SIGHUP, SIGINT, SIGTERM, or SIGXCPU, the server shuts down cleanly. (In contrast, the server shuts down uncleanly on receipt of SIGKILL.)
Making election handoff more efficient in the face of terminal signals is a worthy goal, but it is an Improvement.
Also, we should consider Windows here, since it already has a different exit path for this type of thing, and it would be weird if Windows didn't do election handoff while Linux did.

Comment by Matt Lord (Inactive) [ 14/Jan/19 ]

I'd prefer we treat this as a bug (changing Type for now) and backport it to at least 4.0, since `kill [-15] <PID>` is generally considered the standard way to terminate a process properly: 1) it's generally what SysV init scripts do, and in fact it's what ours does, and 2) it's what systemd does without an explicit ExecStop option, which we don't specify. So we really should take effectively the same steps we take for the shutdown command. Having different shutdown paths is a breeding ground for subtle bugs and unexpected behaviors – with the noted issue for our own Kubernetes operator being a critical example.
