Core Server / SERVER-23355

ReplicaSetMonitor: getHostOrRefresh hangs on non-responsive hosts, leading to timeout and inappropriate failures


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 4.1.11
    • Affects Version/s: None
    • Component/s: Sharding
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: Service Arch 2019-01-28, Service Arch 2019-02-11, Service Arch 2019-02-25, Service Arch 2019-03-11, Service Arch 2019-04-08, Service Arch 2019-04-22

    Description

      ReplicaSetMonitor::getHostOrRefresh repeatedly attempts to refresh the replica set connections until it finds a host that matches its ReadPreferenceSetting. It does this in a while loop until the timeout is reached. The logic is ordered so that the loop fails to find a matching host, then checks for a timeout, and only then searches again.

      The problem is that the scan goes through the connections serially and hangs for 5 seconds on any connection that doesn't respond.
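
      To make the cost of that serial scan concrete, here is a small self-contained sketch of the behavior (illustrative only: Host, scanSerially, and the delays are hypothetical stand-ins, not the actual monitor internals; the 5-second stall matches the socket timeout in the logs below):

          // Illustrative sketch only -- not ReplicaSetMonitor code.
          #include <chrono>
          #include <iostream>
          #include <optional>
          #include <string>
          #include <thread>
          #include <vector>

          struct Host {
              std::string name;
              bool isPrimary;    // what the host would report via isMaster
              bool responsive;   // false => the check stalls until the socket timeout
          };

          // One serial pass, like the scan described above: every unresponsive
          // host costs the full socket timeout before the next host is tried.
          std::optional<std::string> scanSerially(const std::vector<Host>& hosts,
                                                  std::chrono::milliseconds socketTimeout) {
              for (const auto& h : hosts) {
                  if (!h.responsive) {
                      std::this_thread::sleep_for(socketTimeout);  // RECV_TIMEOUT
                      continue;
                  }
                  if (h.isPrimary)
                      return h.name;
              }
              return std::nullopt;  // no match this pass
          }

          int main() {
              using namespace std::chrono;
              // Mirrors the scenario below: 20266 is not yet primary when checked,
              // 20267 is down, 20268 is a secondary.
              std::vector<Host> hosts = {{"20266", false, true},
                                         {"20267", false, false},
                                         {"20268", false, true}};
              auto start = steady_clock::now();
              auto match = scanSerially(hosts, milliseconds(5000));
              auto elapsed = duration_cast<milliseconds>(steady_clock::now() - start);
              std::cout << "match: " << (match ? *match : std::string("none"))
                        << ", elapsed: " << elapsed.count() << "ms\n";  // ~5000ms, no match
          }

      A single pass over three hosts costs roughly the full 5 seconds, which is enough to exhaust the caller's remaining budget before the next pass could observe a new primary.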

      The scenario in which this is an issue:

      • A command is issued with read pref mode "primary".
      • The config primary (20267) goes down.
      • getHostOrRefresh is called but can't get a host matching "primary"; it keeps retrying the connections for a time, and then:
      • getHostOrRefresh checks the first host, 20266: not primary.
      • Config 20266 becomes primary.
      • getHostOrRefresh checks the next host, 20267, and gets stuck for 5 seconds because 20267 is still down.
      • getHostOrRefresh checks the next host, 20268: not primary.
      • getHostOrRefresh checks for a timeout and exits.
      • The command returns an error/failure.

      The command fails even though a config primary has been up and working for the past 5 seconds. Not desirable.

      Relevant code in ReplicaSetMonitor::getHostOrRefresh:

      ...
          while (true) {
              // We might not have found any matching hosts because the scan that just
              // completed may have seen stale data from before we joined. Therefore we
              // should participate in a new scan to make sure all hosts are contacted
              // at least once (possibly by other threads) before this function gives up.
              Refresher refresher(startOrContinueRefresh());
       
              HostAndPort out = refresher.refreshUntilMatches(criteria);
              if (!out.empty())
                  return out;
       
              if (!isSetUsable()) {
                  return Status(ErrorCodes::ReplicaSetNotFound,
                                str::stream() << "None of the hosts for replica set " << getName()
                                              << " could be contacted.");
              }
       
              const Milliseconds remaining = maxWait - (Date_t::now() - startTimeMs);
       
              if (remaining < kFindHostMaxBackOffTime) {
                  break;
              }
       
              // Back-off so we don't spam the replica set hosts too much
              sleepFor(kFindHostMaxBackOffTime);
          }
      ...
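
      For illustration, a hedged sketch of one possible mitigation (not necessarily the fix that shipped): probe the hosts concurrently so that a single unresponsive host cannot serialize the scan. Host, probeIsPrimary, and findPrimary below are hypothetical stand-ins, not ReplicaSetMonitor internals:

          // Illustrative sketch only -- concurrent probes under a shared deadline.
          #include <chrono>
          #include <future>
          #include <iostream>
          #include <optional>
          #include <string>
          #include <thread>
          #include <utility>
          #include <vector>

          using Ms = std::chrono::milliseconds;

          struct Host {
              std::string name;
              bool isPrimary;
              Ms replyDelay;  // 5000ms models a host stuck until RECV_TIMEOUT
          };

          // Hypothetical stand-in for the isMaster round trip.
          bool probeIsPrimary(const Host& h) {
              std::this_thread::sleep_for(h.replyDelay);
              return h.isPrimary;
          }

          std::optional<std::string> findPrimary(const std::vector<Host>& hosts, Ms maxWait) {
              // Launch one probe per host instead of walking the hosts serially.
              std::vector<std::pair<std::string, std::future<bool>>> probes;
              for (const auto& h : hosts)
                  probes.emplace_back(h.name, std::async(std::launch::async, probeIsPrimary, h));

              const auto deadline = std::chrono::steady_clock::now() + maxWait;
              for (auto& [name, fut] : probes) {
                  // Wait on each probe only as long as the overall budget allows.
                  if (fut.wait_until(deadline) == std::future_status::ready && fut.get())
                      return name;  // found within budget despite the dead host
              }
              return std::nullopt;
              // Caveat: std::async futures join in their destructor, so this toy
              // still waits for straggler probes on scope exit; a real monitor
              // would park them on a background thread.
          }

          int main() {
              // Same shape as the scenario above: 20266 is primary, 20267 is down.
              std::vector<Host> hosts = {{"20266", true, Ms(50)},
                                         {"20267", false, Ms(5000)},
                                         {"20268", false, Ms(50)}};
              auto primary = findPrimary(hosts, Ms(2000));
              std::cout << (primary ? *primary : std::string("none")) << "\n";  // 20266
          }

      With concurrent probes, the 5-second stall on 20267 overlaps the fast replies from 20266 and 20268, so the deadline check no longer races against a pass that finished on stale data.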
      

      Relevant failure logs:

       
      [js_test:recovering_slaveok] 2016-03-18T22:30:36.605+0000 s20269| 2016-03-18T22:30:30.699+0000 D ASIO     [conn1] startCommand: RemoteCommand 169 -- target:ip-10-63-79-92:20267 db:config expDate:2016-03-18T22:31:00.699+0000 cmd:{ update: "collections", updates: [ { q: { _id: "test.foo" }, u: { _id: "test.foo", lastmodEpoch: ObjectId('56ec81841e56700a43aaf54d'), lastmod: new Date(4294967296), dropped: false, key: { _id: 1.0 }, unique: false }, multi: false, upsert: true } ], writeConcern: { w: "majority", wtimeout: 15000 }, maxTimeMS: 30000 }
      ....
      [js_test:recovering_slaveok] 2016-03-18T22:30:48.839+0000 c20267| 2016-03-18T22:30:30.714+0000 I COMMAND  [conn13] Attempting to step down in response to replSetStepDown command
      ....
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.313+0000 s20269| 2016-03-18T22:30:37.484+0000 I NETWORK  [conn1] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [10.63.79.92:20267]
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.316+0000 s20269| 2016-03-18T22:30:37.484+0000 D -        [conn1] User Assertion: 6:network error while attempting to run command 'ismaster' on host 'ip-10-63-79-92:20267'
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.317+0000 s20269| 2016-03-18T22:30:37.484+0000 I NETWORK  [conn1] Detected bad connection created at 1458340230761664 microSec, clearing pool for ip-10-63-79-92:20267 of 0 connections
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.317+0000 s20269| 2016-03-18T22:30:37.484+0000 D NETWORK  [conn1] Marking host ip-10-63-79-92:20267 as failed
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.317+0000 s20269| 2016-03-18T22:30:37.484+0000 W NETWORK  [conn1] No primary detected for set recovering_slaveok-configRS
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.318+0000 s20269| 2016-03-18T22:30:37.984+0000 D NETWORK  [conn1] Starting new refresh of replica set recovering_slaveok-configRS
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.318+0000 s20269| 2016-03-18T22:30:37.984+0000 D NETWORK  [conn1] creating new connection to:ip-10-63-79-92:20267                                                     <- starts contacting hosts and gets stuck
      ....
      [js_test:recovering_slaveok] 2016-03-18T22:30:43.823+0000 c20266| 2016-03-18T22:30:37.985+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted                                  <- new primary is up
      ....
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.318+0000 s20269| 2016-03-18T22:30:37.984+0000 D NETWORK  [Balancer] polling for status of connection to 10.63.79.92:20268, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.319+0000 s20269| 2016-03-18T22:30:37.985+0000 D NETWORK  [Balancer] polling for status of connection to 10.63.79.92:20266, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.319+0000 s20269| 2016-03-18T22:30:38.018+0000 D COMMAND  [ConnectBG] BackgroundJob starting: ConnectBG
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.319+0000 s20269| 2016-03-18T22:30:38.024+0000 D NETWORK  [conn1] connected to server ip-10-63-79-92:20267 (10.63.79.92)
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.320+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: recovering_slaveok-rs1
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.348+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] Starting new refresh of replica set recovering_slaveok-rs1
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.348+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20263, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.349+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20264, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.349+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20265, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.350+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: recovering_slaveok-rs0
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.350+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] Starting new refresh of replica set recovering_slaveok-rs0
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.351+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20260, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.351+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20261, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.351+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20262, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.352+0000 s20269| 2016-03-18T22:30:38.788+0000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: recovering_slaveok-configRS
      ....
      .... 5 seconds pass before hearing from s20269 again
      ....
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.664+0000 s20269| 2016-03-18T22:30:43.023+0000 I NETWORK  [conn1] Socket recv() timeout  10.63.79.92:20267
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.665+0000 s20269| 2016-03-18T22:30:43.023+0000 I NETWORK  [conn1] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [10.63.79.92:20267]
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.677+0000 s20269| 2016-03-18T22:30:43.024+0000 D -        [conn1] User Assertion: 6:network error while attempting to run command 'isMaster' on host 'ip-10-63-79-92:20267'
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.677+0000 s20269| 2016-03-18T22:30:43.024+0000 D -        [conn1] User Assertion: 13328:connection pool: connect failed ip-10-63-79-92:20267 : network error while attempting to run command 'isMaster' on host 'ip-10-63-79-92:20267'
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.678+0000 s20269| 2016-03-18T22:30:43.024+0000 D NETWORK  [conn1] Marking host ip-10-63-79-92:20267 as failed
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.678+0000 s20269| 2016-03-18T22:30:43.024+0000 W NETWORK  [conn1] No primary detected for set recovering_slaveok-configRS
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.679+0000 s20269| 2016-03-18T22:30:43.024+0000 D -        [conn1] User Assertion: 133:collection metadata write failed :: caused by :: FailedToSatisfyReadPreference: could not find host matching read preference { mode: "primary" } for set recovering_slaveok-configRS
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.679+0000 s20269| 2016-03-18T22:30:43.024+0000 D NETWORK  [conn1] Starting new refresh of replica set recovering_slaveok-configRS
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.680+0000 s20269| 2016-03-18T22:30:43.024+0000 D NETWORK  [conn1] creating new connection to:ip-10-63-79-92:20267
      ....
      ....
      ....
      [js_test:recovering_slaveok] 2016-03-18T22:30:45.247+0000 {
      [js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000   "code" : 133,
      [js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000   "ok" : 0,
      [js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000   "errmsg" : "collection metadata write failed :: caused by :: FailedToSatisfyReadPreference: could not find host matching read preference { mode: \"primary\" } for set recovering_slaveok-configRS"
      [js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000 }
      


          People

            Assignee: A. Jesse Jiryu Davis (jesse@mongodb.com)
            Reporter: Dianna Hohensee (dianna.hohensee@mongodb.com)
            Votes: 1
            Watchers: 8
