[SERVER-14983] Ability to immediately mark the node as unable to service user queries Created: 21/Aug/14 Updated: 06/Dec/22 |
|
| Status: | Open |
| Project: | Core Server |
| Component/s: | Replication, Usability |
| Affects Version/s: | 2.6.4 |
| Fix Version/s: | Needs Further Definition |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Alexander Komyagin | Assignee: | Backlog - Replication Team |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Replication |
| Description |
|
In production systems there is a common problem scenario in which an underprovisioned mongod node becomes overloaded by a sudden load peak, causing service timeouts at the application level. In this case a plain retry strategy adds more load to the already overloaded system, frequently causing massive service degradation and prolonged outages.

The ideal, yet very complicated, way to handle the situation gracefully is for drivers or mongoS to detect that the node is overloaded and route user traffic away from it (there is a ticket for that, I think). This way the node can process its backlog and get through the issue without disrupting the service further.

A simpler solution is to defer the judgement to the operator and give them the ability to immediately mark the node as unable to service user queries. Functionally, it should be a trigger that meets the following requirements:
As it currently stands, we have a replSetMaintenance command, which doesn't yet satisfy 1, 5, or 6. In particular for (1), setting maintenance mode requires a global write lock, which it grabs after locking the replset mutex (see

This ticket is filed to provide a functional spec for the feature that we currently don't have. It should give us flexibility in choosing a solution, should we decide to implement something other than the replSetMaintenance command. |
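To make the retry-amplification argument in the description concrete, here is a toy model in Python. It is only an illustrative sketch, not a model of mongod internals: the function name `simulate`, all of its parameters, the tick-based queue, and the rule that a timed-out request is re-sent exactly once on the next tick are assumptions made for this example.

```python
def simulate(ticks, capacity, base_load, spike_load, spike_ticks,
             timeout_ticks=2, retry=True, drain_from=None):
    """Toy single-node queue; returns the backlog left after each tick.

    Hypothetical model: the node serves `capacity` requests per tick, and a
    request that lands deeper in the queue than `capacity * timeout_ticks`
    times out on the client side while still occupying the node.
    """
    horizon = capacity * timeout_ticks
    backlog = 0
    retries = 0
    history = []
    for t in range(ticks):
        if drain_from is not None and t >= drain_from:
            # Operator has marked the node unavailable: new traffic and
            # retries are routed elsewhere, the node only works off its queue.
            incoming = 0
        else:
            fresh = spike_load if t in spike_ticks else base_load
            incoming = fresh + retries
        backlog += incoming
        # Requests from this tick that landed past the timeout horizon.
        late = max(0, min(incoming, backlog - horizon))
        # With a plain retry strategy each of them is sent again next tick.
        retries = late if retry else 0
        backlog -= min(backlog, capacity)
        history.append(backlog)
    return history


if __name__ == "__main__":
    spike = range(5, 10)  # five ticks of overload
    for label, kwargs in [
        ("no retries", dict(retry=False)),
        ("plain retries", dict(retry=True)),
        ("plain retries, node drained at tick 10", dict(retry=True, drain_from=10)),
    ]:
        backlog = simulate(30, 100, 80, 150, spike, **kwargs)
        print(f"{label}: final backlog = {backlog[-1]}")
```

In this model the backlog keeps growing after the spike ends when clients retry blindly, while taking the node out of rotation lets it work off its queue in a bounded number of ticks. The numbers are arbitrary; only the qualitative behaviour is the point.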
| Comments |
| Comment by Alexander Komyagin [ 21/Aug/14 ] |
|
Ideally it should not require a network request, but as Eric pointed out, that's not mandatory. I apologize for omitting context here. This ticket was filed to provide a functional spec for the feature that we currently don't have. I filed separate tickets for specific issues with the setMaintenance command (see SERVER-13925 and |
| Comment by Eric Milkie [ 21/Aug/14 ] |
|
I think a network request will be fine, since it will take a network request for drivers (or other nodes) to determine that the state of the node has changed, anyway. |
| Comment by Scott Hernandez (Inactive) [ 21/Aug/14 ] |
|
That sounds like a separate issue wrt locking pre 2.7. Are you saying we need a mechanism which does not require a network request? |
| Comment by Eric Milkie [ 21/Aug/14 ] |
|
I think "stepdown or maintenance mode" doesn't yet satisfy 1, 5, or 6. In particular for (1), setting maintenance mode requires a global write lock, which it grabs after locking the replset mutex. |
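For reference, the existing maintenance-mode trigger discussed in this thread is invoked by sending the `replSetMaintenance` command with a boolean value to the admin database of the node in question. A minimal Python sketch follows; the helper name `set_maintenance` is hypothetical, and `admin_db` is assumed to be a PyMongo-style `Database` handle exposing `command(name, value)`. As noted above, this path goes through a network request and the server-side locking, so it illustrates the current mechanism rather than the feature requested in this ticket.

```python
def set_maintenance(admin_db, enabled):
    """Ask the node behind `admin_db` to enter or leave maintenance mode.

    `admin_db` is assumed to behave like a PyMongo Database for "admin",
    i.e. admin_db.command("replSetMaintenance", True) sends
    {replSetMaintenance: true} to the server.
    """
    return admin_db.command("replSetMaintenance", bool(enabled))


# Hypothetical usage with PyMongo, connected directly to one secondary
# (the host name is a placeholder):
#
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://node-b.example:27017", directConnection=True)
#   set_maintenance(client.admin, True)    # stop serving user reads
#   set_maintenance(client.admin, False)   # return to normal operation
```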
| Comment by Scott Hernandez (Inactive) [ 21/Aug/14 ] |
|
How is this different than stepdown for the primary, or maintenance mode? |