Core Server / SERVER-41514

ShardRemote::updateReplicaSetMonitor() should trigger a scan rather than unconditionally marking a host as down



    • Type: Improvement
    • Status: Open
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: Backlog
    • Component/s: Networking
    • Labels:


      Currently, each time we get a network error* in the ARS, we tell the ReplicaSetMonitor (RSM) that the host is down. This prevents the RSM from sending any more requests to that node until the next scan, yet it does not trigger that scan. This has several problems. First, nothing ensures that the error actually happened after the RSM last updated the state of that host, so the host may have recovered and been marked as up, only to be marked as down again by a stale error. Second, because of the maxConnecting limits, hitting a timeout while trying to get a connection out of the pool doesn't necessarily mean there is anything wrong with the host. It may just be that all of the connections were in use and that the ones that became available happened to be handed to luckier requests.

      All in all, the current logic means that following a failure or a temporary overload, we make things worse while the system is recovering. By telling the RSM that the host is down, we force requests to go to another host, which may in turn cause that host to seem down, making the problem even worse.

      * We also treat NotMaster errors the same way. It looks like the code is trying to do something different, but markHostNotMaster and markHostUnreachable both just call ReplicaSetMonitor::failedHost().
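
      To make the proposal concrete, here is a minimal sketch of the difference between unconditionally marking a host as down and requesting a scan while ignoring stale errors. It is a toy model, not the server code: ToyMonitor, recordScanResult, failedHostUnconditional, reportError and the scan counter are all hypothetical names invented for illustration, and the real ReplicaSetMonitor interface may look quite different.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy stand-in for a replica set monitor. Every name in this class is
// hypothetical and does not mirror the real ReplicaSetMonitor interface.
class ToyMonitor {
public:
    // Records the outcome of scanning one host and advances the scan counter.
    void recordScanResult(const std::string& host, bool isUp) {
        ++_scanCount;
        HostState& state = _hosts[host];
        state.isUp = isUp;
        state.lastScan = _scanCount;
        _scanRequested = false;
    }

    // Behavior described in the ticket: mark the host down unconditionally
    // and do not ask for a new scan.
    void failedHostUnconditional(const std::string& host) {
        _hosts[host].isUp = false;
    }

    // Sketch of the alternative. observedScanCount is the value of
    // scanCount() that the caller saw when it started the failed request.
    // Errors older than the newest scan of this host are ignored; otherwise
    // a scan is requested and the host state is left untouched.
    // Returns true if a scan was requested.
    bool reportError(const std::string& host, std::uint64_t observedScanCount) {
        assert(observedScanCount <= _scanCount);
        auto it = _hosts.find(host);
        if (it != _hosts.end() && it->second.lastScan > observedScanCount) {
            return false;  // Stale error: the host was rescanned since then.
        }
        _scanRequested = true;
        return true;
    }

    bool isUp(const std::string& host) const {
        auto it = _hosts.find(host);
        return it != _hosts.end() && it->second.isUp;
    }

    bool scanRequested() const { return _scanRequested; }
    std::uint64_t scanCount() const { return _scanCount; }

private:
    struct HostState {
        bool isUp = false;
        std::uint64_t lastScan = 0;
    };

    std::map<std::string, HostState> _hosts;
    std::uint64_t _scanCount = 0;
    bool _scanRequested = false;
};
```

      In this sketch a caller would remember scanCount() before sending a request and pass it to reportError() on failure. A pool timeout caused by maxConnecting would then only prompt a fresh scan, and the scan itself would decide whether the host is really down.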



