[SERVER-49998] MongoS task executor stalls intermittently while establishing connections Created: 29/Jul/20 Updated: 27/Oct/23 Resolved: 05/Oct/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.6.16 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Patrick Hensley | Assignee: | Dmitry Agranat |
| Resolution: | Community Answered | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Steps To Reproduce: | We have not yet been able to observe or reproduce the issue in a non-production environment. |
| Participants: |
| Description |
|
Our MongoS (version 3.6.16) task executors intermittently experience increased latency while establishing new connections to MongoD. The latency occurs between ASIOConnection::setup's call to strand()->dispatch() and the invocation of the dispatched lambda, which can delay the call to NetworkInterfaceASIO::_connect for up to 20 seconds. The delay appears to cap at 20 seconds, suggesting it may involve options.refreshTimeout, which is also used as a timeout during connection establishment. We can find no external cause or correlation: stalls occur randomly and intermittently, under otherwise normal conditions.
Runtime Environment
We run MongoS in Kubernetes 1.15.12, Docker 18.6.1, kernel 5.4.15-1.el7.elrepo.x86_64. Our Docker image for /usr/bin/mongos is built from the official MongoDB package mongodb-org-mongos 3.6.16 ("deb http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org-unstable/3.6 multiverse") and the binary is unmodified.
Sequence of Events
The task executor stall and related side effects occur as follows:
Observations
Using bpftrace I've measured the latency between the following points inside connection_pool_asio.cpp (line numbers in the attached script refer to the r3.6.16 version of the source; a simplified sketch of the measured window follows the list):
1. ASIOConnection::setup() is called
2. setTimeout() is called
3. _setupCallback() is entered
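To make the measured window concrete, the following is a minimal sketch (plain Boost.Asio with invented names, not MongoDB source) of timing the gap between handing a handler to a strand and that handler actually running, which is the same gap the probes capture between points (1) and (2):

```cpp
// Minimal sketch (not MongoDB source): times the gap between dispatching a
// handler onto an asio strand and the handler running. This models the window
// described in the report, between ASIOConnection::setup() handing work to
// strand()->dispatch() and the lambda that eventually calls _connect().
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>

int main() {
    boost::asio::io_context io;
    auto strand = boost::asio::make_strand(io);

    const auto dispatched = std::chrono::steady_clock::now();

    // Roughly analogous to strand()->dispatch() inside ASIOConnection::setup().
    boost::asio::dispatch(strand, [&] {
        const auto started = std::chrono::steady_clock::now();
        const auto delayMs = std::chrono::duration_cast<std::chrono::milliseconds>(
                                 started - dispatched).count();
        std::cout << "dispatch -> handler latency: " << delayMs << " ms\n";
        // Connection establishment (the _connect() analogue) would start here.
    });

    io.run();  // if the threads servicing the io_context are saturated, this gap grows
    return 0;
}
```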
In steady state the latency hovers around a few milliseconds. At certain times the task executor stalls for up to 20 seconds between points (1) and (2). We believe this may involve the refresh timeout (default 20 seconds) but have not yet been able to trace the root cause. Attached are a bpftrace script and a log showing the latencies during a single query stall event. For example, at 22:44:54.411 the first call to ASIOConnection::setup() (point 1) is made for connection handle 93874978094592; it takes 16.821 seconds to reach the setTimeout() call (point 2) at 22:45:11.232, and _setupCallback (point 3) is entered about 7 ms after that.
Graphs
I've attached a graph showing the client connection spike associated with the task executor stall (setup-latency-client-connections.png), as well as 3-hour and 48-hour graphs showing stall frequency over time. Stalls occur with varying frequency depending on the volume of request traffic going to a particular MongoS instance. The probability of a stall increases with traffic: the rate of stalls is lower overnight and higher at mid-day peak, and higher-traffic tiers experience more frequent spikes than lower-traffic tiers. |
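Returning to the observation that stalls appear to cap at 20 seconds: one unconfirmed explanation is that connection setup is guarded by a timer armed with options.refreshTimeout, so a dispatch stall longer than the timeout would surface at roughly the timeout value. The sketch below models that generic pattern with scaled-down, illustrative durations (a 2 s timeout versus a 3 s stall); it is not mongos's actual implementation:

```cpp
// Hedged illustration with scaled-down durations: when a setup attempt is
// guarded by a timeout, the observed latency is capped near the timeout value.
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>

int main() {
    using namespace std::chrono_literals;
    boost::asio::io_context io;

    // Stand-in for options.refreshTimeout (20 s in production, 2 s here).
    boost::asio::steady_timer setupTimeout(io, 2s);
    // Stand-in for a dispatch/connect stall that outlasts the timeout (3 s here).
    boost::asio::steady_timer stalledSetup(io, 3s);

    setupTimeout.async_wait([&](const boost::system::error_code& ec) {
        if (!ec) {
            std::cout << "setup timed out: observed latency is capped at the timeout\n";
            stalledSetup.cancel();  // abandon the stalled attempt
        }
    });
    stalledSetup.async_wait([](const boost::system::error_code& ec) {
        if (ec == boost::asio::error::operation_aborted) {
            std::cout << "stalled setup attempt was cancelled by the timeout\n";
        } else {
            std::cout << "setup eventually completed\n";
        }
    });

    io.run();
    return 0;
}
```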
| Comments |
| Comment by Dmitry Agranat [ 21/Sep/20 ] | |||
|
Our internal investigation concluded that the reported incident (i.e., intermittent stalls) is potentially due to the use of ServiceExecutorAdaptive, which is an experimental feature, is not enabled by default, and is deprecated in newer versions. If the same behavior is observed after switching back to the default service executor, we'll be happy to investigate the underlying cause. Thanks, | |||
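For anyone following along: assuming the standard YAML configuration file is in use, the service executor on 3.6 is controlled by the net.serviceExecutor setting (or the equivalent --serviceExecutor startup flag). A sketch, to be verified against the 3.6 documentation for your deployment:

```yaml
# mongos configuration excerpt (3.6 series); sketch only, verify before use.
net:
  # "synchronous" is the default service executor; "adaptive" enables the
  # experimental ServiceExecutorAdaptive referenced above.
  serviceExecutor: synchronous
```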
| Comment by Dmitry Agranat [ 10/Aug/20 ] | |||
|
Thanks phensley@squarespace.com, I am passing this to one of our teams to analyze. | |||
| Comment by Patrick Hensley [ 06/Aug/20 ] | |||
|
The mongos-pod-mnzmb-client-connections.png overview graph is in NY time (UTC-4), so 13:35 refers to the first event starting around 17:35:00 UTC. The setup-latency-mongos-pod-mnzmb.log file is in UTC, and the first dispatch/connect stall occurs at 17:35:43.034 for connection handle 94440919596544, which is dispatched 9.7 seconds later at 17:35:53.735:
The mongos-server.log is also in UTC; beginning at 17:35:43.008, client connections gradually increase from 646 to a peak of 996 at 17:35:58.362. By 17:37:50 all excess connections have gone idle and been closed by the application connection pools. | |||
| Comment by Dmitry Agranat [ 06/Aug/20 ] | |||
|
Thanks phensley@squarespace.com, could you please provide the times (in UTC) we should be looking at? From the notes I am not sure what 13:35 refers to; is it UTC-4? | |||
| Comment by Patrick Hensley [ 05/Aug/20 ] | |||
|
Hi Dima, I've collected diagnostics files and server logs for a new capture, along with the other related files, and have uploaded them via the support uploader tool. | |||
| Comment by Patrick Hensley [ 03/Aug/20 ] | |||
|
Hi Dima, since we run mongoS under Kubernetes/Docker we log to stdout, and because we don't set the logpath option on startup the diagnostic.data directory apparently is never created. I'll look into getting file logging enabled on a single pod in production, capturing another tracing session, and uploading it via the support uploader. | |||
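For reference, a hedged sketch of what enabling file logging on a single mongos pod could look like; the paths are illustrative, and diagnosticDataCollectionDirectoryPath is an optional setParameter for pinning the FTDC output location explicitly:

```yaml
# mongos configuration excerpt; illustrative paths, not production values.
systemLog:
  destination: file
  path: /var/log/mongodb/mongos.log   # with a log path set, the diagnostic.data directory can be created
  logAppend: true
setParameter:
  # Optional: place the FTDC output explicitly, e.g. on a mounted volume.
  diagnosticDataCollectionDirectoryPath: /var/log/mongodb/mongos.diagnostic.data
```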
| Comment by Dmitry Agranat [ 02/Aug/20 ] | |||
|
Thank you for the detailed report. For the mongoS in question, and covering the time of the event, would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Thanks, |