Type: New Feature
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: Needs Further Definition
Affects Version/s: 2.6.4
Component/s: Replication, Usability
Labels: None
Assigned Teams: Replication
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

A common failure scenario in production systems is that an underprovisioned mongod node becomes overloaded by a sudden load peak, causing service timeouts at the application level. In this case a plain retry strategy adds more load to the already overloaded node, frequently causing massive service degradation and prolonged outages.

The ideal, yet very complicated, way to handle the situation gracefully is for drivers or mongoS to detect that the node is overloaded and route user traffic away from it (there is a ticket for that, I think). The node can then process its backlog and recover without disrupting the service further.

A simpler solution is to defer the judgement to the operator and give them the ability to immediately mark the node as unable to service user queries.

Functionally, it should be a trigger that meets the following requirements:

1. It must not block on anything, and must take effect immediately upon triggering
2. In case of error, it must be reported
3. If activated, mongoS and drivers must not send new user queries to that node
4. serverStatus and/or rs.status() must report the state of the trigger
5. The trigger should not require a new connection to be established
6. A special override should be supported to allow execution on Primary servers, causing them to step down
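
The requirements above can be modelled in a few lines of Python. This is only an illustrative sketch of the intended semantics, not server code: `Node`, `mark_unavailable`, `force_primary`, and `eligible_nodes` are hypothetical names invented here. A `threading.Event` stands in for a flag that can be flipped without waiting on any other work.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a replica set member."""
    name: str
    is_primary: bool = False
    # Event.set() and Event.is_set() do not wait on any other work,
    # so flipping the flag takes effect immediately.
    unavailable: threading.Event = field(default_factory=threading.Event)

    def mark_unavailable(self, force_primary: bool = False) -> None:
        """Activate the trigger; errors are raised, never swallowed."""
        if self.is_primary:
            if not force_primary:
                raise RuntimeError(
                    f"{self.name} is primary; pass force_primary=True"
                )
            # Models the override: the primary steps down first.
            self.is_primary = False
        self.unavailable.set()

    def clear(self) -> None:
        """Deactivate the trigger."""
        self.unavailable.clear()

    def status(self) -> dict:
        """Models the trigger state a status command would report."""
        return {
            "name": self.name,
            "primary": self.is_primary,
            "unavailable": self.unavailable.is_set(),
        }


def eligible_nodes(nodes):
    """Models routing: skip every node whose trigger is active."""
    return [n for n in nodes if not n.unavailable.is_set()]


if __name__ == "__main__":
    members = [Node("a", is_primary=True), Node("b"), Node("c")]
    members[1].mark_unavailable()
    print([n.name for n in eligible_nodes(members)])
```

The sketch does not model the requirement about reusing an existing connection, which is a property of the wire protocol rather than of the flag itself.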

As it currently stands, the replSetMaintenance command does not yet satisfy requirements 1, 5, or 6. For requirement 1 in particular, setting maintenance mode requires a global write lock, which the command acquires after locking the replset mutex (see SERVER-14982).
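
For reference, this is roughly how an operator would invoke the existing command from Python. It is a hedged sketch: `set_maintenance` is a hypothetical helper, and `admin_db` is assumed to be a PyMongo `Database` handle for `admin` (for example `MongoClient(...).admin`) connected directly to the secondary in question. Any object with a compatible `command` method works, which keeps the sketch testable without a server.

```python
def set_maintenance(admin_db, enabled: bool):
    """Send {replSetMaintenance: <bool>} to the node behind admin_db.

    admin_db is assumed to expose PyMongo's Database.command(name, value).
    The call can block on the server side for the locking reasons
    described above, and any error raised by the driver propagates.
    """
    return admin_db.command("replSetMaintenance", enabled)
```

Calling `set_maintenance(admin_db, True)` enables maintenance mode, and `set_maintenance(admin_db, False)` disables it again.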

This ticket provides a functional spec for a feature that we do not currently have. It should leave us flexibility in choosing a solution, should we decide to implement something other than the replSetMaintenance command.

related to

SERVER-14982 replSetMaintenance command should not block (Closed)
SERVER-5921 Lightweight method to mark a replica as unhealthy (Closed)
SERVER-13925 Allow replSetMaintenance on primary servers (Backlog)
SERVER-16349 Expose replSetMaintenance status counter in db.serverStatus() (Closed)

Assignee: [DO NOT USE] Backlog - Replication Team
Reporter: Alexander Komyagin (Inactive)
Participants: [DO NOT USE] Backlog - Replication Team, Alexander Komyagin, Eric Milkie, Scott Hernandez
Votes: 1
Watchers: 16

Created: Aug 21 2014 04:32:56 AM UTC
Updated: Dec 06 2022 05:02:30 AM UTC
