[SERVER-49998] MongoS task executor stalls intermittently while establishing connections Created: 29/Jul/20  Updated: 27/Oct/23  Resolved: 05/Oct/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.16
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Patrick Hensley Assignee: Dmitry Agranat
Resolution: Community Answered Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2020-08-10 at 12.54.08.png     PNG File setup-latency-client-connections-3-hours.png     PNG File setup-latency-client-connections-48-hours.png     PNG File setup-latency-client-connections.png     File setup-latency.bt     Text File setup-latency.log    
Operating System: ALL
Steps To Reproduce:

We have not yet been able to observe or reproduce the issue in a non-production environment.

Participants:

 Description   

Description

Our MongoS (version 3.6.16) task executors intermittently experience increased latency while establishing new connections to MongoD. The latency occurs between ASIOConnection::setup's call to strand()->dispatch() and the invocation of the dispatched lambda, which can delay the call to NetworkInterfaceASIO::_connect by up to 20 seconds.

The latency appears to max out at 20 seconds, suggesting it may be related to options.refreshTimeout, which is also used as a timeout during connection establishment.

We have found no external cause or correlation. Stalls occur randomly and intermittently, and appear to happen under otherwise normal operating conditions.
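
To make the suspected mechanism concrete, below is a minimal standalone sketch using Boost.Asio (not the ASIO bundled with MongoDB 3.6, and not MongoDB source) of how a strand dispatch issued from a non-executor thread only queues the handler: if the executor's io thread is busy, the lambda cannot run until that thread frees up, which is the same gap we measure between strand()->dispatch() and _connect().

    #include <boost/asio.hpp>
    #include <chrono>
    #include <iostream>
    #include <thread>

    int main() {
        boost::asio::io_context io;
        boost::asio::io_context::strand strand(io);
        auto work = boost::asio::make_work_guard(io);

        // Single io thread, standing in for the task executor's network thread.
        std::thread ioThread([&io] { io.run(); });

        // Keep the io thread busy for a while, standing in for whatever
        // occupies the executor during a stall.
        boost::asio::post(io, [] {
            std::this_thread::sleep_for(std::chrono::seconds(3));
        });

        auto t0 = std::chrono::steady_clock::now();

        // Analogous to ASIOConnection::setup() calling strand().dispatch(...):
        // invoked from a non-executor thread, so the lambda is only queued here
        // and cannot run until the io thread is free again.
        strand.dispatch([t0] {
            auto delayMs = std::chrono::duration_cast<std::chrono::milliseconds>(
                               std::chrono::steady_clock::now() - t0).count();
            std::cout << "dispatch lambda ran after " << delayMs << " ms\n"; // ~3000 ms
        });

        work.reset();
        ioThread.join();
        return 0;
    }

In the real system the io thread is obviously not sleeping; the point is only that anything occupying the executor's thread between dispatch and invocation shows up as exactly the kind of setup latency measured below.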

Runtime Environment

We run MongoS in Kubernetes 1.15.12, Docker 18.6.1, kernel 5.4.15-1.el7.elrepo.x86_64.

Our Docker image installs /usr/bin/mongos from the official MongoDB package mongodb-org-mongos 3.6.16 (apt source: "deb http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org-unstable/3.6 multiverse"); the binary is unmodified.

Sequence of Events

The task executor stall and related side effects occur as follows:

  1. Queries arrive at our MongoS instance and call ClusterFind::runQuery which selects the next task executor and calls mongo::establishCursors.
  2. ConnectionPool::SpecificPool::getConnection(hostAndPort) is called.
  3. ConnectionPool::SpecificPool::spawnConnections is called, which sees the available connection pool is too small and constructs a new connection, then calls ASIOConnection::setup to establish the connection and add it to the ready pool.
  4. ASIOConnection::setup dispatches a callback to the ASIO strand with _impl->strand().dispatch() and a lambda callback [this, timeout, cb] { ... }
  5. Something occurs which delays invoking the dispatch lambda callback.
    • We have observed this latency reach up to 20 seconds.
  6. This ASIOConnection::setup process repeats for each host:port passed to getConnection on the stalled task executor.
    • The maximum number of simultaneous "connecting" connections per host defaults to 2, so if getConnection is called for 6 distinct host:port endpoints we can see the number of "connecting" connections spike to 12 (see the toy calculation after this list). This also inflates the "refreshing" connection count in the pool stats.
  7. Queries queue up at the stalled task executor waiting for an available connection.
  8. Applications connected via the Java driver (using an SRV-based cluster connection) select among servers randomly. The client-side connection pool for the stalled server becomes exhausted as all of its connections wait on stalled queries.
  9. Applications gradually begin opening additional connections to the server with the stalled task executor, attempting to increase throughput.
    • Some fraction of these new queries will also become queued at the stalled task executor.
  10. The affected MongoS instance sees a gradual increase in client connections.
  11. Other task executors in the affected MongoS continue processing queries and maintain normal latencies.
    • Only the stalled executor shows increased query latencies, since queries are not being executed and thus don't reach a MongoD instance for many seconds.
  12. Finally the dispatched lambda callbacks are invoked; each calls ASIOConnection::setTimeout and then initiates the connection process by calling _global->_impl->_connect(_impl.get()); at line 251.
  13. No connection-related errors are observed in the logs.
  14. Latency between the _connect() and _setupCallback lambda calls appears normal (comparable to the steady-state connect times of a few milliseconds). The increased latency appears only between strand()->dispatch() and _connect().
  15. Application clients observe increased latency for some random subset of queries executed on the stalled instance during the stall event.
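
As a sanity check on the pool accounting in step 6, here is a toy calculation (not ConnectionPool code; the constant name is invented for illustration) showing how a per-host cap of 2 simultaneous setups across 6 stalled endpoints yields the 12 "connecting"/"refreshing" connections we observe:

    #include <iostream>

    int main() {
        // Values taken from step 6 above; kMaxConnectingPerHost is an
        // illustrative name, not an actual MongoDB constant.
        const int kMaxConnectingPerHost = 2;  // default simultaneous setups per host
        const int stalledEndpoints = 6;       // distinct host:port pairs hit during the stall

        std::cout << "connections stuck in 'connecting': "
                  << kMaxConnectingPerHost * stalledEndpoints << "\n";  // prints 12
        return 0;
    }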

Observations

Using bpftrace I've measured the latency between the following points inside connection_pool_asio.cpp - all line numbers are for the r3.6.16 version of the source:

  1. At the start of the ASIOConnection::setup method invoked at line 207.
  2. Immediately before the call to setTimeout(timeout, [this, access, generation] { ... }) at line 242. This indicates our lambda has been invoked by strand().dispatch(...) and we are about to call _global->_impl->_connect(). Latency is delta from (1).
  3. At the start of execution of the _setupCallback lambda defined at line 209. This indicates the completion of the _connect operation. Latency is delta from (1).

In steady state we see the latency hovering around a few milliseconds. At certain times the task executor stalls for up to 20 seconds between points (1) and (2). We believe this may involve the refresh timeout (default 20 seconds) but have not yet been able to trace the root cause.

Attached is a bpftrace script and a log showing the latencies during a single query stall event.

For example, at 22:44:54.411 the first call to point (1), ASIOConnection::setup(), is made for connection handle 93874978094592. It takes 16.821 seconds to reach point (2), the call to setTimeout(), at 22:45:11.232, and point (3), _setupCallback, about 7 ms after that.

Graphs

I've attached graphs showing the client connection spike associated with the task executor stall (setup-latency-client-connections.png), as well as 3-hour and 48-hour graphs showing the frequency of stalls over time.

Stalls occur with varying frequency depending on the volume of request traffic going to a particular MongoS instance. The probability of a stall increases with traffic: the rate of stalls is lower overnight and higher at mid-day peak traffic. Higher-traffic tiers experience more frequent spikes than lower-traffic tiers.



 Comments   
Comment by Dmitry Agranat [ 21/Sep/20 ]

Hi phensley@squarespace.com,

Our internal investigation concluded that the reported incident (i.e., intermittent stalls) is potentially due to using ServiceExecutorAdaptive, which is an experimental feature, not enabled by default, and deprecated in newer versions. If the same behavior is observed after switching back to the default service executor, we'll be happy to investigate the underlying cause.

Thanks,
Dima

Comment by Dmitry Agranat [ 10/Aug/20 ]

Thanks phensley@squarespace.com, I am passing this to one of our teams to analyze.

Comment by Patrick Hensley [ 06/Aug/20 ]

The mongos-pod-mnzmb-client-connections.png overview graph is in NY time UTC-4, so 13:35 refers to the first event starting around 17:35:00 UTC.

The setup-latency-mongos-pod-mnzmb.log file is in UTC, and the first dispatch/connect stall occurs at 17:35:43.034 for connection handle 94440919596544, which is dispatched 9.7 seconds later at 17:35:53.735:

17:35:43.034  48013 conn27847         1. ASIOConnection::setup() 94440919596544
17:35:53.735  22462 Network.ool-1-0   2. call setTimeout()       94440919596544   9700 ms
17:35:53.743  22462 Network.ool-1-0   3. in _setupCallback       94440919596544   9709 ms

The mongos-server.log is also in UTC, and at 17:35:43.008 client connections gradually increase from 646 up to a peak of 996 at 17:35:58.362. By 17:37:50 all excess connections have gone idle and been closed by the application connection pools.

Comment by Dmitry Agranat [ 06/Aug/20 ]

Thanks phensley@squarespace.com, could you please provide the times in UTC that we should be looking at? From the notes I am not sure what 13:35 refers to; is it UTC-4?

Comment by Patrick Hensley [ 05/Aug/20 ]

Hi Dima, I've collected diagnostics files and server logs for a new capture, along with the other related files, and have uploaded them via the support uploader tool.

Comment by Patrick Hensley [ 03/Aug/20 ]

Hi Dima, since we run mongoS under Kubernetes/Docker we log to stdout and don't set the logpath option on startup, which apparently means the diagnostic.data directory never gets created.

I'll look into enabling that on a single pod in production, capturing another tracing session, and uploading it via the support uploader.

Comment by Dmitry Agranat [ 02/Aug/20 ]

Hi phensley@squarespace.com,

Thank you for a detailed report.

For the mongoS in question and covering the time of the event, would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Thanks,
Dima
