[SERVER-3695] Running shutdown command on Primary while all secondaries are fsync locked and not caught up says that no secondaries were within 10 seconds of primary's optime, even if they were
Created: 26/Aug/11  Updated: 16/Jan/20  Resolved: 02/Dec/14

Status: Closed
Project: Core Server
Component/s: Replication, Usability
Affects Version/s: 2.0.0-rc0
Fix Version/s: None

Type: Improvement
Priority: Minor - P4
Reporter: Spencer Brody (Inactive)
Assignee: Spencer Brody (Inactive)
Resolution: Done
Votes: 0
Labels: sync
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-43988 shutdown ({force:false}) should refus... Closed
is related to SERVER-16377 Require 'force' argument to stepdown ... Closed
Participants:

 Description   

If you run the shutdownServer command on a primary, it won't step down unless there is a secondary that is fully caught up. The error message it reports, however, is "shutdownServer failed: no secondaries within 10 seconds of my optime", which implies that a secondary within 10 seconds of the primary would be enough. Either the error message should be updated or the behavior should be changed to match the message.



 Comments   
Comment by Spencer Brody (Inactive) [ 02/Dec/14 ]

It has been determined that this is the desired behavior - any stepdown/shutdown that leaves your system without a usable PRIMARY should require 'force' to run.

Comment by Spencer Brody (Inactive) [ 01/Dec/14 ]

New message from 2.8-rc2-pre with 3 node set with 2 secondaries, both fsync-locked and with a pending write waiting for replication: "shutdownServer failed: No electable secondaries caught up as of 2014-12-01T17:57:51.915-0500"

With a single node set however this seems to be a problem again. Digging in now.

Comment by Eric Milkie [ 02/Sep/14 ]

Parking with Spencer to look at reproducing after the 2.7 refactoring is complete and we've switched off the Legacy coordinator.
I note that the erroneous error message text quoted in this bug report is actually only included in the Legacy Replication Coordinator and not in the new one, which is a bit concerning.

Comment by Spencer Brody (Inactive) [ 26/Aug/11 ]

I think this actually only happens if the secondaries are fsync locked. If they're just behind, then it will work if they're within 10 seconds of the primary. If the secondaries are all locked, however, then even if they're only a fraction of a second behind, the shutdown will immediately fail. Probably the error message should just be updated for this edge case.
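
To make the edge case concrete, here is a toy model of the behavior as described in this ticket. It is only a sketch, not server code: the names `Secondary`, `shutdown_outcome`, and `LAG_WINDOW_SECS` are hypothetical, and the `force` short-circuit is an assumption based on the idea that a forced shutdown skips the check.

```python
from dataclasses import dataclass

# Window named in the error message quoted in this ticket.
LAG_WINDOW_SECS = 10

FAILURE = "shutdownServer failed: no secondaries within 10 seconds of my optime"


@dataclass
class Secondary:
    """Hypothetical stand-in for a replica set member's state."""
    lag_secs: float            # how far behind the primary's optime it is
    fsync_locked: bool = False


def shutdown_outcome(secondaries, force=False):
    """Toy model of the observed behavior; not the server's implementation."""
    if force:
        # Assumption: a forced shutdown skips the check entirely.
        return "ok"
    for s in secondaries:
        if s.lag_secs == 0:
            # A fully caught-up secondary is enough, locked or not.
            return "ok"
        if not s.fsync_locked and s.lag_secs <= LAG_WINDOW_SECS:
            # An unlocked secondary inside the window can still catch up.
            return "ok"
    # A locked secondary that is even slightly behind cannot catch up,
    # yet the reported message still talks about the 10 second window.
    return FAILURE


if __name__ == "__main__":
    print(shutdown_outcome([Secondary(lag_secs=5.0)]))
    print(shutdown_outcome([Secondary(lag_secs=0.5, fsync_locked=True)]))
```

In this model the second call fails even though the secondary is well inside the 10 second window, which is the mismatch between message and behavior that this ticket is about.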
