A common problem scenario in production systems is an underprovisioned mongod node becoming overloaded by a sudden load peak, causing service timeouts at the application level. In this case a plain retry strategy adds more load to the already overloaded system, frequently causing massive service degradation and prolonged outages.
The ideal, yet very complicated, way to handle the situation gracefully is for drivers or mongoS to detect that the node is overloaded and route user traffic away from it (there is a ticket for that, I think). The node would then be able to process its backlog and recover without disrupting the service further.
A simpler solution is to defer the judgement to the operator and give them the ability to immediately mark the node as unable to service user queries.
Functionally, it should be a trigger that meets the following requirements:
1. It must not block on anything, and must take effect immediately upon triggering.
2. Any error must be reported.
3. While it is activated, mongoS and drivers must not send new user queries to that node.
4. serverStatus and/or rs.status() must report the state of the trigger.
5. Activating the trigger should not require a new connection to be established.
6. A special override should be supported to allow execution on Primary servers, causing them to step down.
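To make the requirements above more concrete, here is a minimal sketch of the trigger's core state, assuming it can be kept in a pair of atomic flags. Every name in it (QueryDrainTrigger, TriggerResult, activate, forceStepDown, and so on) is hypothetical and invented for illustration; none of it is existing mongod code. It only covers the non-blocking toggle, error reporting, state reporting, and the Primary override. How mongoS and drivers learn about the flag, and how the command arrives without a new connection, are left out.

```cpp
#include <atomic>
#include <string>

// Hypothetical result codes, so that every outcome is reported to the caller.
enum class TriggerResult { kOk, kAlreadyActive, kRefusedOnPrimary };

// Hypothetical sketch of the operator trigger. Not mongod code.
class QueryDrainTrigger {
public:
    // Takes no locks: the only shared-state operations are atomic stores
    // and one atomic exchange, so the call cannot block behind a busy node.
    // On a primary the request is refused unless forceStepDown is set.
    TriggerResult activate(bool isPrimary, bool forceStepDown) {
        if (isPrimary && !forceStepDown) {
            return TriggerResult::kRefusedOnPrimary;
        }
        if (isPrimary) {
            // Assumption: some other component watches this flag and
            // performs the actual step-down.
            _stepDownRequested.store(true, std::memory_order_release);
        }
        const bool wasActive = _draining.exchange(true, std::memory_order_acq_rel);
        return wasActive ? TriggerResult::kAlreadyActive : TriggerResult::kOk;
    }

    // Clears the trigger and any pending step-down request.
    void deactivate() {
        _stepDownRequested.store(false, std::memory_order_release);
        _draining.store(false, std::memory_order_release);
    }

    // What routing code would consult before sending new user queries.
    bool acceptsUserQueries() const {
        return !_draining.load(std::memory_order_acquire);
    }

    bool stepDownRequested() const {
        return _stepDownRequested.load(std::memory_order_acquire);
    }

    // A value that a status command could include in its output.
    std::string statusField() const {
        return acceptsUserQueries() ? "normal" : "draining";
    }

private:
    std::atomic<bool> _draining{false};
    std::atomic<bool> _stepDownRequested{false};
};
```

In this sketch an operator command handler would call activate(isPrimary, forceStepDown) and return the TriggerResult to the client, while a status command would include statusField() in its output. Atomic flags were chosen over a mutex precisely so that activation never waits behind other work.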
As it currently stands, we have the replSetMaintenance command, which doesn't yet satisfy requirements 1, 5, or 6. In particular, for (1), setting maintenance mode requires a global write lock, which it grabs after locking the replset mutex (see
This ticket is filed to provide a functional spec for a feature that we currently don't have. It should give us flexibility in choosing a solution, should we decide to implement something other than the replSetMaintenance command.