Core Server / SERVER-35132

Regression: all connections to {{mongos}} forced to reconnect during failover for clients with tight deadlines

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 4.1.8
    • Affects Version/s: 3.2.20, 3.4.15, 3.6.4
    • Component/s: Sharding
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Steps To Reproduce:
      1. Set socketTimeout to 2s
      2. Set $maxTimeMS to 1s for all queries
      3. Throw lots of queries at the cluster
      4. rs.stepDown()
    • Sprint: Service Arch 2019-01-14, Service Arch 2019-01-28

      For clients of mongos that have tight deadlines, such as those that expect all queries to take less than 1s and that have maxTimeMS and socketTimeout set appropriately (1s and 2s respectively in our testing), a failover will force all connections from the client bound for the shard in transition to close and be re-established. This can be problematic for environments with lots of connections (in addition to high throughput), as establishing connections can be expensive (e.g., thread create/destroy per connection).
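
      As a rough sketch of the kind of client described above (not the exact harness used in our testing), a mongocxx-based loop with a 2s socket timeout and a 1s maxTimeMS might look like the following; the host name and namespace are placeholders:

      #include <chrono>
      #include <exception>

      #include <bsoncxx/builder/basic/document.hpp>
      #include <mongocxx/client.hpp>
      #include <mongocxx/instance.hpp>
      #include <mongocxx/options/find.hpp>
      #include <mongocxx/uri.hpp>

      int main() {
          mongocxx::instance inst{};  // driver bootstrap, once per process

          // socketTimeoutMS=2000: give up on a connection that is silent for 2s.
          mongocxx::client conn{
              mongocxx::uri{"mongodb://mongos.example.net:27017/?socketTimeoutMS=2000"}};
          auto coll = conn["test"]["data"];

          // maxTimeMS=1000: ask the server to abandon any query after 1s.
          mongocxx::options::find opts;
          opts.max_time(std::chrono::milliseconds{1000});

          // Throw lots of queries at the cluster; run rs.stepDown() on a shard
          // primary while this loop is executing to observe the behavior above.
          for (int i = 0; i < 10000; ++i) {
              try {
                  auto cursor = coll.find(bsoncxx::builder::basic::make_document(), opts);
                  for (auto&& doc : cursor) {
                      (void)doc;  // drain the cursor
                  }
              } catch (const std::exception&) {
                  // With the current mongos behavior, operations routed to the shard
                  // in transition block well past maxTimeMS, the 2s socket timeout
                  // fires, and the connection must be re-established.
              }
          }
          return 0;
      }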

      It should be noted that while setting socketTimeout to be greater than the failover period would allow existing connections to persist rather than time out, this is not a solution either. It simply causes the application either to queue operations excessively on its side while waiting for existing connections to free up, or to open new connections in the interim to service those operations, again consuming excessive resources on the mongos side and depriving the application of the timely feedback it requires.

      Prior to 3.2, this was not an issue: mongos would immediately pass {{ReplicaSetMonitor no master found for set}} errors back to the client, allowing it to decide how to handle retries while reusing existing connections. Since 3.2, however, the client connection to mongos will hang while trying to find an acceptable replica set member until its configured timeout expires (20s in versions >= 3.4, 11s in 3.2) or until an acceptable member becomes available, with no way to control that timeout.
      Digging through the code in version 3.2, we can see that this issue is understood by the developers and that there is an intention to address it (see src/mongo/client/remote_command_targeter.cpp):

      // This value is used if the operation doesn't have a user-specified max wait time. It should be
      // closer to (preferably higher than) the replication electionTimeoutMillis in order to ensure that
      // lack of primary due to replication election does not cause findHost failures.
      const Seconds kDefaultFindHostMaxWaitTime(11);
      
      ...
      
      Milliseconds RemoteCommandTargeter::selectFindHostMaxWaitTime(OperationContext* txn) {
          // TODO: Get remaining max time from 'txn'.
          Milliseconds remainingMaxTime(0);
          if (remainingMaxTime > Milliseconds::zero()) {
              return std::min(remainingMaxTime - kFindHostTimeoutPad,
                              Milliseconds(kDefaultFindHostMaxWaitTime));
          }
      
          return kDefaultFindHostMaxWaitTime;
      }
      

      Here we see that the default time to wait for an acceptable server is 11s, that the intention is to allow the client to influence this time ("This value is used if the operation doesn't have a user-specified max wait time."), presumably via $maxTimeMS, and that this is yet to be implemented ("TODO: Get remaining max time from 'txn'"). We also see this acknowledged in other parts of the code in 3.2 (see src/mongo/s/query/async_results_merger.cpp):

          // TODO: Pass down an OperationContext* to use here.
          auto findHostStatus = shard->getTargeter()->findHost(
              readPref, RemoteCommandTargeter::selectFindHostMaxWaitTime(nullptr));
      

      This code gets executed via the following code path:

      • mongo/client/replica_set_monitor.cpp:520 ReplicaSetMonitor::Refresher::getNextStep
      • mongo/client/replica_set_monitor.cpp:815 ReplicaSetMonitor::Refresher::_refreshUntilMatches
      • mongo/client/replica_set_monitor.h:274 ReplicaSetMonitor::Refresher::refreshUntilMatches
      • mongo/client/replica_set_monitor.cpp:317 ReplicaSetMonitor::getHostOrRefresh
        • 500ms backoff here
      • mongo/client/remote_command_targeter_rs.cpp:61 RemoteCommandTargeterRS::findHost
      • mongo/s/query/async_results_merger.cpp:652 AsyncResultsMerger::RemoteCursorData::resolveShardIdToHostAndPort
        • RemoteCommandTargeter::selectFindHostMaxWaitTime called here to retrieve the 11s max wait time
      • mongo/s/query/async_results_merger.cpp:256 AsyncResultsMerger::askForNextBatch_inlock
      • mongo/s/query/async_results_merger.cpp:315 AsyncResultsMerger::nextEvent
      • mongo/s/query/router_stage_merge.cpp:43 RouterStageMerge::next
      • mongo/s/query/cluster_client_cursor_impl.cpp:75 ClusterClientCursorImpl::next
      • mongo/s/query/cluster_find.cpp:196 runQueryWithoutRetrying
      • mongo/s/query/cluster_find.cpp:348 ClusterFind::runQuery
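
      For illustration only, completing that TODO might look roughly like the sketch below; the accessor used to read the operation's remaining max time is our assumption about the OperationContext API in this branch, not a verified signature:

      // Hypothetical sketch, not actual server code: bound the findHost wait
      // by the operation's remaining max time instead of always using 11s.
      Milliseconds RemoteCommandTargeter::selectFindHostMaxWaitTime(OperationContext* txn) {
          // Assumed accessor for the operation's remaining max time; the exact
          // getter on OperationContext may differ in this branch.
          Milliseconds remainingMaxTime =
              txn ? txn->getRemainingMaxTimeMillis() : Milliseconds(0);

          if (remainingMaxTime > Milliseconds::zero()) {
              // Never wait longer than the operation is allowed to run, minus a
              // small pad so failure is reported before the client's deadline.
              return std::min(remainingMaxTime - kFindHostTimeoutPad,
                              Milliseconds(kDefaultFindHostMaxWaitTime));
          }

          return kDefaultFindHostMaxWaitTime;
      }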

      In versions >= 3.4, RemoteCommandTargeter::selectFindHostMaxWaitTime disappears, but the problem remains: the wait time is instead hard-coded to 20s in various places (see src/mongo/s/query/async_results_merger.cpp):

          // TODO: Pass down an OperationContext* to use here.
          auto findHostStatus = shard->getTargeter()->findHostWithMaxWait(readPref, Seconds{20});
      

      We see further evidence of the intention to fix this issue in src/mongo/client/remote_command_targeter_rs.cpp:

              // Enforce a 20-second ceiling on the time spent looking for a host. This conforms with the
              // behavior used throughout mongos prior to version 3.4, but is not fundamentally desirable.
              // See comment in remote_command_targeter.h for details.
              if (clock->now() - startDate > Seconds{20}) {
                  return host;
              }
      

      And in src/mongo/client/remote_command_targeter.h, as mentioned in the previous comment:

          /**
           * Finds a host matching readPref blocking up to 20 seconds or until the given operation is
           * interrupted or its deadline expires.
           *
           * TODO(schwerin): Once operation max-time behavior is more uniformly integrated into sharding,
           * remove the 20-second ceiling on wait time.
           */
          virtual StatusWith<HostAndPort> findHost(OperationContext* txn,
                                                   const ReadPreferenceSetting& readPref) = 0;
      

      The code path is only slightly different in 3.4:

      • mongo/client/replica_set_monitor.cpp:481 ReplicaSetMonitor::Refresher::getNextStep
      • mongo/client/replica_set_monitor.cpp:797 ReplicaSetMonitor::Refresher::_refreshUntilMatches
      • mongo/client/replica_set_monitor.h:294 ReplicaSetMonitor::Refresher::refreshUntilMatches
      • mongo/client/replica_set_monitor.cpp:266 ReplicaSetMonitor::getHostOrRefresh
        • 500ms backoff here
      • mongo/client/remote_command_targeter_rs.cpp:63 RemoteCommandTargeterRS::findHostWithMaxWait
      • mongo/s/query/async_results_merger.cpp:692 AsyncResultsMerger::RemoteCursorData::resolveShardIdToHostAndPort
        • hard-coded to wait for 20s here
      • mongo/s/query/async_results_merger.cpp:261 AsyncResultsMerger::askForNextBatch_inlock
      • mongo/s/query/async_results_merger.cpp:324 AsyncResultsMerger::nextEvent
      • mongo/s/query/router_stage_merge.cpp:43 RouterStageMerge::next
      • mongo/s/query/cluster_client_cursor_impl.cpp:75 ClusterClientCursorImpl::next
      • mongo/s/query/cluster_find.cpp:153 runQueryWithoutRetrying
      • mongo/s/query/cluster_find.cpp:305 ClusterFind::runQuery

      As noted in the affected versions above, this appears to be the same behavior in 3.6, although we have not yet tested against 3.6.
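
      As an illustration of the kind of change being requested for 3.4 and later, a call site like the one above could cap the hard-coded ceiling by the operation's remaining max time. This is only a sketch: both the availability of an OperationContext at this point (the existing TODO) and the hasDeadline()/getRemainingMaxTimeMillis() accessors are assumptions, not verified APIs in those branches:

          // Hypothetical sketch, not actual server code: cap the 20s ceiling by
          // the operation's remaining max time when the client has set one.
          Milliseconds maxWait = Seconds{20};
          if (opCtx && opCtx->hasDeadline()) {  // assumed accessor
              // assumed accessor for the operation's remaining max time
              maxWait = std::min(maxWait, opCtx->getRemainingMaxTimeMillis());
          }
          auto findHostStatus = shard->getTargeter()->findHostWithMaxWait(readPref, maxWait);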

      Given the developer comments and current undesirable behavior, we would like to see this issue addressed and/or understand what roadblocks are currently preventing implementation of a solution.

            Assignee: Mathias Stearn (mathias@mongodb.com)
            Reporter: Gregory Banks (gregbanks)
            Votes: 1
            Watchers: 16
