Type: Improvement
Resolution: Won't Fix
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Networking
Labels: None
Assigned Teams: Service Arch
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Currently, each time we get a network error* in the ARS, we tell the ReplicaSetMonitor (RSM) that the host is down. This prevents the RSM from sending any more requests to that node until the next scan, but does not itself trigger a scan. This has a number of problems. First, nothing ensures that the error actually happened after the RSM last updated the state of that host, so the host may have recovered and been marked as up, only to be marked as down again by a stale error. Second, because of the maxConnecting limits, hitting a timeout while trying to get a connection out of the pool doesn't necessarily mean there is anything wrong with the host. It may just be that all of the connections were in use and that the connections that became available happened to be handed to luckier requests.

All in all, the current logic means that after a failure or a temporary overload, we make things worse while the system is trying to recover. By telling the RSM that the host is down, we force requests to go to another host, which may in turn cause that host to seem down, making the problem even worse.

* We also treat NotMaster errors the same way. It looks like the code is trying to do something different, but markHostNotMaster and markHostUnreachable both just call ReplicaSetMonitor::failedHost().
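
To illustrate the stale-error problem described above, here is a minimal sketch of one way a failure report could be checked against the host's most recent state update. This is not the actual ReplicaSetMonitor code or a proposed patch. HostStateTracker, recordScanResult, select and reportFailure are all hypothetical names, and the per-host generation counter is an assumption made for the example.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for per-host bookkeeping. NOT the real ReplicaSetMonitor API.
class HostStateTracker {
public:
    // What a caller learns when it picks a host for a request.
    struct Selection {
        bool usable;
        std::uint64_t generation;
    };

    // Records an authoritative observation (e.g. from a scan) and bumps the
    // host's generation so that older failure reports can be recognized.
    void recordScanResult(const std::string& host, bool isUp) {
        auto& state = _hosts[host];
        state.isUp = isUp;
        ++state.generation;
    }

    // Returns the host's current state together with the generation the
    // caller should hand back if its request later fails.
    Selection select(const std::string& host) const {
        auto it = _hosts.find(host);
        if (it == _hosts.end()) {
            return {false, 0};
        }
        return {it->second.isUp, it->second.generation};
    }

    // Marks the host down only if nothing has updated its state since the
    // caller selected it. Returns false when the report is ignored as stale.
    bool reportFailure(const std::string& host, std::uint64_t observedGeneration) {
        auto it = _hosts.find(host);
        if (it == _hosts.end() || it->second.generation != observedGeneration) {
            return false;
        }
        it->second.isUp = false;
        ++it->second.generation;
        return true;
    }

    bool isUp(const std::string& host) const {
        auto it = _hosts.find(host);
        return it != _hosts.end() && it->second.isUp;
    }

private:
    struct HostState {
        bool isUp = false;
        std::uint64_t generation = 0;
    };
    std::unordered_map<std::string, HostState> _hosts;
};
```

In this sketch, a request that selected the host before a newer scan result arrived cannot undo that result: its failure report carries an old generation and is dropped. The sketch does not address the pool-timeout case, where the error says nothing about the host's health in the first place.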

Assignee: [DO NOT USE] Backlog - Service Architecture
Reporter: Mathias Stearn
Participants: [DO NOT USE] Backlog - Service Architecture, Mathias Stearn, Ratika Gandhi
Votes: 0
Watchers: 6

Created: Jun 04 2019 07:06:42 PM UTC
Updated: Dec 06 2022 02:58:08 AM UTC
Resolved: Feb 10 2020 04:08:17 PM UTC