[SERVER-41514] ShardRemote::updateReplicaSetMonitor() should trigger a scan rather than unconditionally marking a host as down Created: 04/Jun/19  Updated: 06/Dec/22  Resolved: 10/Feb/20

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Mathias Stearn Assignee: Backlog - Service Architecture
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Service Arch
Participants:

 Description   

Currently, each time we get a network error* in the ARS, we tell the ReplicaSetMonitor that the host is down. This prevents the RSM from sending any more requests to that node until the next scan, but does not actually trigger another scan. This has a number of problems. For one, nothing ensures that the error actually happened after the RSM last updated the state of that host, so the host may have recovered, been marked as up, and then been marked as down again. Additionally, because of the maxConnecting limit, hitting a timeout while trying to get a connection out of the pool doesn't necessarily mean anything is wrong with the host. It may simply be that all of the connections were in use and that the connections that became available happened to be handed to luckier requests.

All in all, the current logic means that following a failure or a temporary overload, we make things worse while the host is recovering on the path back to health. By telling the RSM that the host is down, we force requests onto other hosts, which may in turn make those hosts appear down, compounding the problem.

* We also treat NotMaster errors the same way. It looks like the code is trying to do something different, but markHostNotMaster and markHostUnreachable both just call ReplicaSetMonitor::failedHost().



 Comments   
Comment by Ratika Gandhi [ 10/Feb/20 ]

Irrelevant in a world in which the RSM is SDAM (Server Discovery and Monitoring) compatible.

Comment by Ratika Gandhi [ 15/Jul/19 ]

Check how drivers deal with this problem.

Generated at Thu Feb 08 04:57:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.