[SERVER-5921] Lightweight method to mark a replica as unhealthy Created: 24/May/12  Updated: 06/Dec/22  Resolved: 14/Jun/18

Status: Closed
Project: Core Server
Component/s: Admin, Replication
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Jon Hoffman Assignee: Backlog - Replication Team
Resolution: Won't Fix Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-14576 mongod automatic shutdown on stdin close Backlog
is related to SERVER-14983 Ability to immediately mark the node ... Open
Assigned Teams:
Replication
Participants:

 Description   

At Foursquare we implemented a way to mark a member of a replica set as unhealthy. Here's how it works:

The mongod monitors for the presence of a kill file. If the file is present, the mongod makes itself ineligible to be primary of its replica set and steps down if it is already primary. It also reports its kill-file status via serverStatus so the mongoS can refuse to send queries to a killed secondary.

Some more details:

  • mongod returns an additional healthStatus object from serverStatus. That object contains an "ok" boolean, a descriptive "msg" string, and a boolean named "killfile" indicating whether a kill file is present.
  • The mongoS's existing replica set polling thread now polls the mongod's serverStatus instead of isMaster. If healthStatus.ok is false or serverStatus times out N times in a row, the mongoS stops sending requests to that secondary. If the host is a primary, nothing happens at the mongoS level.
  • Every second, a KillFileWatcher thread in the mongod checks for the presence of a kill file. If the file is present, three things happen:
    1. The mongod's serverStatus returns a healthStatus block with ok=false; msg=a message indicating the presence of the kill file and its contents, if any; and killfile=true.
    2. The mongod marks itself as not electable by forcing ReplSetImpl::iAmPotentiallyHot to return false.
    3. If the mongod is a primary in a replica set, it will issue a stepdown(60) so someone else can take over.
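
The watcher steps above can be sketched roughly as follows. This is a minimal Python illustration, not the code from our patch (which is C++ inside mongod): the Node class, check_once, watch, and the exact msg wording are hypothetical names and choices made up for the example, and Node.step_down merely records the call instead of running a real stepdown.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class HealthStatus:
    """Mirrors the healthStatus block described above: ok, msg, killfile."""
    ok: bool
    msg: str
    killfile: bool


def read_health_status(kill_file_path: str) -> HealthStatus:
    """Build a health status from the kill file's presence and contents."""
    try:
        with open(kill_file_path, "r", encoding="utf-8", errors="replace") as f:
            contents = f.read().strip()
    except FileNotFoundError:
        return HealthStatus(ok=True, msg="no kill file present", killfile=False)
    msg = f"kill file present at {kill_file_path}"
    if contents:
        msg += f": {contents}"
    return HealthStatus(ok=False, msg=msg, killfile=True)


@dataclass
class Node:
    """Hypothetical stand-in for the mongod's replication state."""
    is_primary: bool = False
    electable: bool = True
    health: HealthStatus = field(
        default_factory=lambda: HealthStatus(True, "no kill file present", False)
    )
    stepdowns: list = field(default_factory=list)

    def step_down(self, seconds: int) -> None:
        # Stand-in for stepdown(60): record the request and give up primary.
        self.stepdowns.append(seconds)
        self.is_primary = False


def check_once(node: Node, kill_file_path: str) -> HealthStatus:
    """One pass of the watcher: apply the three effects listed above."""
    status = read_health_status(kill_file_path)
    node.health = status                    # 1. what serverStatus would report
    node.electable = not status.killfile    # 2. analogous to iAmPotentiallyHot == false
    if status.killfile and node.is_primary:
        node.step_down(60)                  # 3. let another member take over
    return status


def watch(node: Node, kill_file_path: str, stop: threading.Event,
          interval: float = 1.0) -> None:
    """Run check_once every `interval` seconds until `stop` is set."""
    while not stop.wait(interval):
        check_once(node, kill_file_path)
```

In this sketch, removing the kill file makes the next pass report ok=true and restores electability; that recovery behaviour is an assumption of the example.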

Additional details:

  • The health status is returned from mongod to mongoS via db.adminCommand("connPoolStats"), and there is now a flag that adjusts the polling frequency of this command on the mongoS.
  • Since secondary querying is affected, we'd have to hack the driver to make this work for non-sharded clusters.
  • There is the case where all replicas report a failing health status. When this happens, the primary steps down and no primary is elected. We assume this is a catastrophic case that deserves attention, so we're living with it.
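
The mongoS-side rule described above (stop routing to a secondary when healthStatus.ok is false or the poll times out N times in a row, and leave primaries alone) might look something like the sketch below. HostState, record_poll, and max_timeouts are hypothetical names for illustration, and the recovery behaviour (a host becomes routable again on its next healthy poll) is an assumption, not something the description specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostState:
    """Per-host bookkeeping kept by the (hypothetical) polling thread."""
    is_primary: bool = False
    last_ok: bool = True            # last healthStatus.ok actually received
    consecutive_timeouts: int = 0
    routable: bool = True


def record_poll(state: HostState, health_ok: Optional[bool],
                max_timeouts: int = 3) -> bool:
    """Update one host after a poll.

    health_ok is the healthStatus.ok value from the reply, or None if the
    poll timed out. Returns whether queries may still be sent to the host.
    """
    if health_ok is None:
        # Timeout: keep the last known health, count the failure.
        state.consecutive_timeouts += 1
    else:
        state.consecutive_timeouts = 0
        state.last_ok = health_ok

    unhealthy = (not state.last_ok) or state.consecutive_timeouts >= max_timeouts
    # Nothing happens at this level if the host is a primary.
    state.routable = state.is_primary or not unhealthy
    return state.routable
```

Here max_timeouts plays the role of N. Keeping the last known health across timeouts means a host that reported ok=false stays excluded until it answers with ok=true again.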

We think it would be generally useful to build this functionality into the mainline code. Our customizations are here:

https://github.com/foursquare/mongo/commit/6ff5bd021d98f25406b74b1cb89d276d0b403ce2



 Comments   
Comment by Spencer Brody (Inactive) [ 14/Jun/18 ]

You can already step down a node and tell it not to re-run for primary. Touching a file in the file system isn't noticeably easier than just running the stepdown command. This seems like extra complexity that doesn't belong in the server.
