[SERVER-38994] Step down on SIGTERM Created: 14/Jan/19 Updated: 29/Oct/23 Resolved: 15/Mar/19
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.15, 4.0.8, 4.1.10 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Alyson Cabral (Inactive) | Assignee: | Mira Carey |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.0, v3.6 |
| Sprint: | Service Arch 2019-03-11, Service Arch 2019-03-25 |
| Participants: | |
| Linked BF Score: | 50 |
| Description |
Kubernetes uses SIGTERM to spin down containers, and our sysv init and systemd unit files use SIGTERM to shut down the mongodb service. Using the election handoff path of stepdown, instead of waiting out the full electionTimeoutMillis, would greatly reduce election time and prevent potential data loss due to replica set rollback.
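Until such a fix lands, an orchestrator or init script can approximate this behavior from the outside by asking the primary to step down before delivering SIGTERM. A minimal sketch with pymongo; the helper name, URI, and timeout values are illustrative assumptions, not part of this ticket:

```python
# Sketch of a pre-SIGTERM handoff an init script or operator could perform.
# Assumptions: pymongo is installed, the node listens on localhost:27017,
# and `pid` is the mongod process id.
import os
import signal

from pymongo import MongoClient
from pymongo.errors import ConnectionFailure, OperationFailure


def stepdown_then_sigterm(pid: int) -> None:
    client = MongoClient("mongodb://localhost:27017/?directConnection=true")
    try:
        # Ask the primary to hand off: remain SECONDARY for 60s, waiting up
        # to 10s for a secondary to catch up before stepping down.
        client.admin.command("replSetStepDown", 60, secondaryCatchUpPeriodSecs=10)
    except ConnectionFailure:
        pass  # older servers drop connections on a successful stepdown
    except OperationFailure:
        pass  # e.g. the node is already a secondary; just shut it down
    finally:
        client.close()
    os.kill(pid, signal.SIGTERM)  # the same signal sysv init/systemd/Kubernetes send
```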
| Comments |
| Comment by Githook User [ 27/Sep/19 ] |
Author: Rahul Sundararaman <rahul.sundararaman@mongodb.com>
Message: Check to see if we've entered shutdown from a shutdown command. If not, (cherry picked from commit 9f5b13ee93e7eaeafa97ebd1d2d24c66b93cc974)
| Comment by David Bartley [ 13/Aug/19 ] |
Would it be possible to backport this to 3.6 as well?
| Comment by Githook User [ 25/Mar/19 ] |
Author: Jason Carey <jcarey@argv.me> (hanumantmk)
Message: Check to see if we've entered shutdown from a shutdown command. If not, (cherry picked from commit 9f5b13ee93e7eaeafa97ebd1d2d24c66b93cc974)
| Comment by Githook User [ 15/Mar/19 ] |
Author: Jason Carey <jcarey@argv.me> (hanumantmk)
Message: Check to see if we've entered shutdown from a shutdown command. If not,
| Comment by Eric Milkie [ 15/Jan/19 ] |
There is some confusion here; I never said "no data is lost", I said "no committed data is lost", and the distinction matters. Committed data is never lost; we define "committed" as data that cannot be rolled back. Any situation that results in data being rolled back is unfortunate, but such situations only affect users doing writes with a write concern less than w:majority, and those applications are already in danger of having writes rolled back at any moment if there is an unplanned election in the replica set.

I do believe the shutdown command does a stepdown; we recently fixed some of the default timeout parameters used on that code path. I'm in favor of proceeding with this ticket; I just want to understand what the proposed solution will be, in light of the behaviors we have today.
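To make the committed-versus-uncommitted distinction concrete, here is a minimal pymongo sketch; the collection and field names are invented for illustration:

```python
# Only writes acknowledged at w:"majority" are "committed" in the sense
# that they cannot be rolled back.
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client.test

# Default write concern (w:1): acknowledged by the primary alone. If the
# primary is lost before a majority replicates the write, the write can be
# rolled back after the next election.
db.events.insert_one({"kind": "unsafe-by-default"})

# w:"majority": insert_one does not return until a majority of the replica
# set has the write, so it can never be rolled back ("committed").
majority_events = db.get_collection(
    "events", write_concern=WriteConcern(w="majority", wtimeout=5000)
)
majority_events.insert_one({"kind": "committed"})
```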
| Comment by Mira Carey [ 15/Jan/19 ] |
schwerin, I think that's not correct. See: https://github.com/mongodb/mongo/blob/b489828d0c176e90e47724f6771610227b29f117/src/mongo/db/commands/shutdown_d.cpp#L65-L68

Regarding having a longer period where the cluster fails to accept writes, a reasonable compromise might be to attempt the stepdown with a timeout and fall back to the force flag. That way shutdown takes no materially longer than it does today, but we get a clean handoff whenever the cluster has little lag.
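A sketch of that timeout-plus-force fallback, expressed as client-side logic; the helper and the timeout values are assumptions for illustration, not the eventual server-side patch:

```python
# Try a clean handoff with a bounded catch-up window; if no secondary
# catches up in time, force the stepdown so shutdown still hands off
# quickly instead of waiting out electionTimeoutMillis.
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure, OperationFailure


def stepdown_with_fallback(client: MongoClient, catchup_secs: int = 10) -> None:
    try:
        # Clean path: wait up to catchup_secs for a secondary to catch up.
        client.admin.command(
            "replSetStepDown", 60, secondaryCatchUpPeriodSecs=catchup_secs
        )
    except ConnectionFailure:
        return  # older servers drop connections on success; treat as done
    except OperationFailure:
        # No secondary caught up within the window; force the handoff.
        try:
            client.admin.command("replSetStepDown", 60, force=True)
        except ConnectionFailure:
            pass
```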
| Comment by Andy Schwerin [ 15/Jan/19 ] |
Note to potential implementers: the shutdown command path does not naturally do a stepdown today.
| Comment by Matt Lord (Inactive) [ 15/Jan/19 ] |
AFAIUI, if you don't do an election handoff, then the shutdown is handled as a failure by the replica set – an election is called after a majority of the remaining members have had no communication with the old primary for the default 10 second window (electionTimeoutMillis). How can you have a "normal shutdown" that is at the same time handled as a failure by the overall system? Unless I'm misunderstanding something, your first sentence says "no data is lost", yet your second sentence notes that this is not necessarily true today with an "orderly shutdown" via SIGTERM: if a majority of the secondaries have not yet caught up, then locally-but-not-majority-committed data is lost ("rolled back", which is a bit of a misnomer IMO).

Side note: this is the default behavior today, since w:majority is not the default, and it's bad default behavior IMO. I would prefer we always choose data safety over performance by default and let production engineers and DBAs explicitly choose when to prefer performance over safety. Imagine a not-too-uncommon case where users believe that machine failures are very rare events for them, don't wish to take the constant write latency penalty of w:majority, and are OK with the trade-off of potentially losing the last N writes when a rare failure occurs. I suspect they would be very (unpleasantly) surprised to learn that in practice many "planned maintenance" operations are in effect treated as machine failures too.
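For reference, the 10 second window mentioned above is the replica set's settings.electionTimeoutMillis. A hedged pymongo sketch of inspecting (and, if one chose to, tuning) it; the URI and the 5000 ms value are illustrative assumptions:

```python
# Read and adjust electionTimeoutMillis; replSetReconfig must be run
# against the current primary.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

config = client.admin.command("replSetGetConfig")["config"]
print(config["settings"]["electionTimeoutMillis"])  # 10000 by default

# Lowering it makes failover after an unplanned primary loss faster, at the
# cost of more spurious elections on flaky networks. A reconfig must bump
# the config version.
config["settings"]["electionTimeoutMillis"] = 5000
config["version"] += 1
client.admin.command("replSetReconfig", config)
```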
| Comment by Eric Milkie [ 15/Jan/19 ] |
I would argue that the normal shutdown does indeed do an election handoff; it's just not as fast as it could be. The replication subsystem does not handle the shutdown as a failure; everything operates as designed: no committed data is lost and no data is corrupted. Arguably, some users may prefer the current behavior, since introducing a stepdown as part of SIGTERM handling can potentially expand the window of time in which the cluster will not accept writes: the primary will now wait for a majority of secondaries to catch up during stepdown, in an attempt to protect as many writes as possible from being rolled back.
| Comment by Matt Lord (Inactive) [ 15/Jan/19 ] |
milkie, I would argue that if the normal shutdown – and SIGTERM is a normal/standard shutdown method – doesn't do an election handoff, and the replication subsystem thus treats the shutdown as a failure, then it's not clean. It's clean as far as the single-node storage subsystem is concerned, we know that. But for a distributed database, I would argue the collective replication-and-storage system is just as important. What am I missing?
| Comment by Eric Milkie [ 15/Jan/19 ] |
I'm not sure why this is a bug. Upon receipt of any of SIGHUP, SIGINT, SIGTERM, or SIGXCPU, the server shuts down cleanly. (In contrast, the server shuts down uncleanly on receipt of SIGKILL.)
| Comment by Matt Lord (Inactive) [ 14/Jan/19 ] |
I'd prefer we treat this as a bug (changing Type for now) and backport it to at least 4.0, as a `kill [-15] <PID>` is generally considered the standard way to terminate a process properly: 1) it's what SysV init scripts generally do, and in fact it's what ours does; 2) it's what systemd does without an explicit ExecStop option, which we don't specify. So we really should effectively take the same steps we do for the shutdown command. Having different shutdown paths is a breeding ground for subtle bugs and unexpected behaviors – with the noted issue for our own Kubernetes operator being a critical example.