[SERVER-35132] Regression: all connections to mongos forced to reconnect during failover for clients with tight deadlines Created: 21/May/18  Updated: 29/Oct/23  Resolved: 24/Jan/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.20, 3.4.15, 3.6.4
Fix Version/s: 4.1.8

Type: Bug Priority: Major - P3
Reporter: Gregory Banks Assignee: Mathias Stearn
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:
  1. Set socketTimeout to 2s
  2. Set $maxTimeMS to 1s for all queries
  3. Throw lots of queries at the cluster
  4. rs.stepDown()
Sprint: Service Arch 2019-01-14, Service Arch 2019-01-28
Participants:

 Description   

For clients of mongos that have tight deadlines, such as those that expect all queries to take less than 1s and that set maxTimeMS and socketTimeout accordingly (1s and 2s respectively in our testing), a failover forces every connection from the client bound for the shard in transition to close and be re-established. This is problematic in environments with many connections (in addition to high throughput), as establishing connections can be expensive (e.g., a thread create/destroy per connection).
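
For concreteness, a minimal sketch of such a client using the MongoDB C++ driver (mongocxx); the host name, namespace, and loop structure are illustrative and not part of the original report:

    #include <chrono>

    #include <bsoncxx/builder/basic/document.hpp>
    #include <mongocxx/client.hpp>
    #include <mongocxx/instance.hpp>
    #include <mongocxx/options/find.hpp>
    #include <mongocxx/uri.hpp>

    int main() {
        mongocxx::instance inst{};  // driver bootstrap, one per process

        // socketTimeoutMS=2000: the client severs any connection whose
        // read blocks for more than 2s.
        mongocxx::client client{mongocxx::uri{
            "mongodb://mongos.example.net:27017/?socketTimeoutMS=2000"}};

        auto coll = client["app"]["events"];

        // maxTimeMS=1000: ask the server to abort any query running > 1s.
        mongocxx::options::find opts;
        opts.max_time(std::chrono::milliseconds{1000});

        // During a failover, queries like this one hit the socket timeout
        // and force the client to tear down and re-establish connections.
        auto cursor = coll.find(bsoncxx::builder::basic::make_document(), opts);
        for (auto&& doc : cursor) {
            (void)doc;  // consume results
        }
    }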

Note that setting socketTimeout higher than the failover period is not a solution either, even though it would let existing connections persist rather than time out. It merely moves the problem: the application either queues operations excessively on its side while waiting for existing connections to free up, or opens new connections in the interim to service those operations, again consuming excessive resources on the mongos side and denying the application the timely feedback it requires.

Prior to 3.2 this was not an issue: mongos immediately passed "ReplicaSetMonitor no master found for set" errors back to the client, letting it decide how to handle retries while reusing its existing connections (a fail-fast pattern sketched below). Since 3.2, however, a client connection to mongos hangs while mongos tries to find an acceptable replica set member, either until its configured timeout expires (20s in versions >= 3.4 and 11s in version 3.2) or until an acceptable member becomes available, with no way for the client to control that timeout.
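
A minimal sketch of that fail-fast pattern; the function name, exception type, and backoff schedule here are illustrative assumptions rather than driver API:

    #include <chrono>
    #include <stdexcept>
    #include <thread>

    // With fail-fast errors the client owns the retry policy: it can back
    // off briefly and retry over the SAME connection instead of dropping it.
    template <typename Op>
    auto retryOnNoMaster(Op runQuery, int maxAttempts = 5) {
        for (int attempt = 1;; ++attempt) {
            try {
                return runQuery();  // e.g. a find() with maxTimeMS applied
            } catch (const std::runtime_error&) {
                // A real client would match only the
                // "ReplicaSetMonitor no master found for set ..." error.
                if (attempt == maxAttempts)
                    throw;
                std::this_thread::sleep_for(
                    std::chrono::milliseconds{100 * attempt});  // backoff
            }
        }
    }
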
Digging through the 3.2 code, we can see that the developers understand this issue and intend to address it (see src/mongo/client/remote_command_targeter.cpp):

// This value is used if the operation doesn't have a user-specified max wait time. It should be
// closer to (preferably higher than) the replication electionTimeoutMillis in order to ensure that
// lack of primary due to replication election does not cause findHost failures.
const Seconds kDefaultFindHostMaxWaitTime(11);
 
...
 
Milliseconds RemoteCommandTargeter::selectFindHostMaxWaitTime(OperationContext* txn) {
    // TODO: Get remaining max time from 'txn'.
    Milliseconds remainingMaxTime(0);
    if (remainingMaxTime > Milliseconds::zero()) {
        return std::min(remainingMaxTime - kFindHostTimeoutPad,
                        Milliseconds(kDefaultFindHostMaxWaitTime));
    }
 
    return kDefaultFindHostMaxWaitTime;
}

Here we see that the default time to wait for an acceptable server is 11s, that the intention is to let the client influence this time ("This value is used if the operation doesn't have a user-specified max wait time"), presumably via $maxTimeMS, and that this remains to be implemented ("TODO: Get remaining max time from 'txn'"); a standalone sketch of the intended logic follows the call path below. We also see this acknowledged elsewhere in the 3.2 code (see src/mongo/s/query/async_results_merger.cpp):

    // TODO: Pass down an OperationContext* to use here.
    auto findHostStatus = shard->getTargeter()->findHost(
        readPref, RemoteCommandTargeter::selectFindHostMaxWaitTime(nullptr));

This code is executed via the following code path (innermost frame first):

  • mongo/client/replica_set_monitor.cpp:520 - ReplicaSetMonitor::Refresher::getNextStep
  • mongo/client/replica_set_monitor.cpp:815 - ReplicaSetMonitor::Refresher::_refreshUntilMatches
  • mongo/client/replica_set_monitor.h:274 - ReplicaSetMonitor::Refresher::refreshUntilMatches
  • mongo/client/replica_set_monitor.cpp:317 - ReplicaSetMonitor::getHostOrRefresh
    • 500ms backoff here
  • mongo/client/remote_command_targeter_rs.cpp:61 - RemoteCommandTargeterRS::findHost
  • mongo/s/query/async_results_merger.cpp:652 - AsyncResultsMerger::RemoteCursorData::resolveShardIdToHostAndPort
    • RemoteCommandTargeter::selectFindHostMaxWaitTime is called here to retrieve the 11s max wait time
  • mongo/s/query/async_results_merger.cpp:256 - AsyncResultsMerger::askForNextBatch_inlock
  • mongo/s/query/async_results_merger.cpp:315 - AsyncResultsMerger::nextEvent
  • mongo/s/query/router_stage_merge.cpp:43 - RouterStageMerge::next
  • mongo/s/query/cluster_client_cursor_impl.cpp:75 - ClusterClientCursorImpl::next
  • mongo/s/query/cluster_find.cpp:196 - runQueryWithoutRetrying
  • mongo/s/query/cluster_find.cpp:348 - ClusterFind::runQuery
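
As promised above, a standalone sketch of what the elided TODO appears to intend: deriving the findHost wait from the operation's remaining maxTimeMS instead of always using the 11s default. The free-function signature and the pad value are assumptions for illustration; in the real code the remaining time would come from the OperationContext:

    #include <algorithm>
    #include <chrono>

    using Milliseconds = std::chrono::milliseconds;

    // Constants mirroring the 3.2 snippet above; the pad value is assumed.
    const Milliseconds kDefaultFindHostMaxWaitTime{std::chrono::seconds{11}};
    const Milliseconds kFindHostTimeoutPad{500};

    // 'remainingMaxTime' stands in for whatever the OperationContext would
    // report as the operation's remaining maxTimeMS.
    Milliseconds selectFindHostMaxWaitTime(Milliseconds remainingMaxTime) {
        if (remainingMaxTime > Milliseconds::zero()) {
            // Honor the client's deadline, minus a pad so the failure can
            // still reach the client in time, capped at the 11s default.
            return std::min(remainingMaxTime - kFindHostTimeoutPad,
                            kDefaultFindHostMaxWaitTime);
        }
        return kDefaultFindHostMaxWaitTime;  // no deadline: old behavior
    }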

In versions >= 3.4, RemoteCommandTargeter::selectFindHostMaxWaitTime disappears, but the problem remains: the wait is instead hard-coded to 20s in various places (see src/mongo/s/query/async_results_merger.cpp):

    // TODO: Pass down an OperationContext* to use here.
    auto findHostStatus = shard->getTargeter()->findHostWithMaxWait(readPref, Seconds{20});

We see further evidence of the intention to fix this issue in src/mongo/client/remote_command_targeter_rs.cpp:

        // Enforce a 20-second ceiling on the time spent looking for a host. This conforms with the
        // behavior used throughout mongos prior to version 3.4, but is not fundamentally desirable.
        // See comment in remote_command_targeter.h for details.
        if (clock->now() - startDate > Seconds{20}) {
            return host;
        }

And in src/mongo/client/remote_command_targeter.h, referenced by the comment above:

    /**
     * Finds a host matching readPref blocking up to 20 seconds or until the given operation is
     * interrupted or its deadline expires.
     *
     * TODO(schwerin): Once operation max-time behavior is more uniformly integrated into sharding,
     * remove the 20-second ceiling on wait time.
     */
    virtual StatusWith<HostAndPort> findHost(OperationContext* txn,
                                             const ReadPreferenceSetting& readPref) = 0;

The code path is only slightly different in 3.4 (again innermost frame first):

  • mongo/client/replica_set_monitor.cpp:481 - ReplicaSetMonitor::Refresher::getNextStep
  • mongo/client/replica_set_monitor.cpp:797 - ReplicaSetMonitor::Refresher::_refreshUntilMatches
  • mongo/client/replica_set_monitor.h:294 - ReplicaSetMonitor::Refresher::refreshUntilMatches
  • mongo/client/replica_set_monitor.cpp:266 - ReplicaSetMonitor::getHostOrRefresh
    • 500ms backoff here
  • mongo/client/remote_command_targeter_rs.cpp:63 - RemoteCommandTargeterRS::findHostWithMaxWait
  • mongo/s/query/async_results_merger.cpp:692 - AsyncResultsMerger::RemoteCursorData::resolveShardIdToHostAndPort
    • hard-coded to wait for 20s here
  • mongo/s/query/async_results_merger.cpp:261 - AsyncResultsMerger::askForNextBatch_inlock
  • mongo/s/query/async_results_merger.cpp:324 - AsyncResultsMerger::nextEvent
  • mongo/s/query/router_stage_merge.cpp:43 - RouterStageMerge::next
  • mongo/s/query/cluster_client_cursor_impl.cpp:75 - ClusterClientCursorImpl::next
  • mongo/s/query/cluster_find.cpp:153 - runQueryWithoutRetrying
  • mongo/s/query/cluster_find.cpp:305 - ClusterFind::runQuery
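
For reference, the direction eventually taken (per the SERVER-35689/SERVER-35688 comments below) is to make replica set targeting honor the operation's maxTimeMS. A minimal sketch of that clamp, with the function name and the source of the remaining time being illustrative assumptions, not the actual patch:

    #include <algorithm>
    #include <chrono>

    using namespace std::chrono;

    // Bound the host-selection wait by the operation's remaining maxTimeMS,
    // keeping the historical 20s ceiling for operations with no deadline.
    milliseconds findHostWaitBudget(milliseconds remainingMaxTime) {
        const milliseconds kCeiling = seconds{20};
        if (remainingMaxTime > milliseconds::zero())
            return std::min(remainingMaxTime, kCeiling);
        return kCeiling;
    }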

This appears to be the same behavior in 3.6 as well (it is listed under Affects Version/s), although we have not yet tested against 3.6.

Given the developer comments and the current undesirable behavior, we would like to see this issue addressed and/or to understand what roadblocks currently prevent a solution.



 Comments   
Comment by Githook User [ 24/Jan/19 ]

Author: Benety Goh (benety, benety@mongodb.com)

Message: SERVER-35132 add requires_replication tag
Branch: master
https://github.com/mongodb/mongo/commit/a8b513ff6e2e3db87179fcb2f99499f19d47e8dc

Comment by Mathias Stearn [ 24/Jan/19 ]

The work done in SERVER-35689 made replica set targeting honor maxTimeMS in our most common code paths, including for inserts. This was expanded with SERVER-35688 to cover many more cases and should now be complete.

Comment by Githook User [ 23/Jan/19 ]

Author: Mathias Stearn (RedBeard0531, mathias@10gen.com)

Message: SERVER-35132 Test that $maxTimeMS is honored during rs targeting
Branch: master
https://github.com/mongodb/mongo/commit/05e7c83d95ae64cd0547d20f88efe8e7cb5839ee

Comment by Ramon Fernandez Marina [ 31/May/18 ]

Thanks for your detailed report, gregbanks; this ticket has been sent to the Sharding Team for evaluation.

Regards,
Ramón.
