[SERVER-40773] Slow queries every 80s under moderate load Created: 23/Apr/19 Updated: 16/Jun/20 Resolved: 10/Jul/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, WiredTiger |
| Affects Version/s: | 3.6.6, 3.6.12 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Devon Yang | Assignee: | Dmitry Agranat |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Ubuntu 16.04.4 LTS, AWS i3.4xl hvm |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | This pattern reproduces daily in our production clusters, can get additional information if useful. |
| Participants: | |
| Description |
|
We are seeing very consistent, periodic degradation on our primary in our mongod logs. Every 60-90s or so we see a large batch of queries that is seemingly blocked by a global lock before resolving on the order of seconds. These are generally fully indexed queries, often by _id, and they span multiple databases and collections at the same time. We originally noticed this at a consistent 80s interval in 3.6.6; since upgrading to 3.6.12 it is less prevalent but still happening, with the interval generally varying between 60 and 90 seconds. We have approximately 150 databases, with approximately 150 collections in each one. We are running on an AWS i3.8xlarge HVM instance on Ubuntu 16.04 LTS.

Here are a couple of logs, one from each batch of failures, showing redacted operations that we believe should resolve quickly. I have the diagnostics directory for this time frame if you can provide a secure upload link.
|
| Comments |
| Comment by Dmitry Agranat [ 10/Jul/19 ] | |||||||||
|
Hi ronnel, I am closing this ticket as a duplicate. Thank you for bringing this to our attention. |
| Comment by Dmitry Agranat [ 05/Jun/19 ] | |||||||||
|
Hi ronnel,

After looking at all the information provided so far, the main issue with the periodic drop in throughput is related to an issue tracked separately.

There is another issue we've identified in this workload. The Linux kernel's swappiness setting influences the behavior of the Virtual Memory manager when it needs to allocate swap. Periodic swap usage (on the xvdh volume) adds more performance degradation events in this cluster. This might be due to a vm.swappiness misconfiguration or other reasons. However, tuning the system's performance is beyond the scope of the SERVER project.

Thank you, |
| Comment by Ronnel Boettcher [ 03/Jun/19 ] | |||||||||
|
Hi Dima,

At the moment I do not have any metrics through which to verify your observation about the throughput; I will re-run the experiment, measure it, and get back to you. What I did observe, though, was a sustained increase in the number of queries appearing in the mongod logs (meaning an overall increase in queries slower than 100ms), even during the periods in between checkpoints, which would indicate slower processing by our mongo cluster, given that our traffic did not change substantially. I will verify this, though.

Regarding the two drives, neither of those actually seems related to mongo: our mongo data is stored on a 1.8T mapper/secure drive. nvme1n1 appears to be used for a Linux RAID partition, and xvdh appears to be our swap partition. |
| Comment by Dmitry Agranat [ 30/May/19 ] | |||||||||
|
Thanks ronnel for uploading new data. I am able to see the event that you've described.
After this event was over (not shown above), the workload was back to normal and, based on the partial data available, the throughput almost doubled (please double-check this). I suspect the cause of this event was the fact that we had to sweep and drop 150,000 dhandles in such a short period of time (dhandles dropped from 270K to 120K). We still see periodic (every ~60 seconds) events of degraded performance. We will continue investigating this. One more point: could you please clarify the purpose of these two volumes?

Specifically, I am trying to understand why there is contention on volume xvdh every 60 seconds, given the assumption that nvme1n1 is being used to store mongod data. Thank you, |
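As a side note (not part of the original comment): the dhandle counts discussed above can be sampled from the WiredTiger section of serverStatus. A minimal mongo shell sketch, assuming this 3.6 build exposes the "data-handle" statistics with the field names used below:

```javascript
// Hedged sketch: sample WiredTiger data handle (dhandle) statistics twice,
// one minute apart, to see how many handles the sweep server closed.
// The field names are assumed to be present in serverStatus() on this build.
function sampleDhandles() {
  var dh = db.serverStatus().wiredTiger["data-handle"];
  return {
    active: Number(dh["connection data handles currently active"]),
    swept:  Number(dh["connection sweep dhandles closed"])
  };
}

var before = sampleDhandles();
sleep(60 * 1000);  // mongo shell sleep, in milliseconds
var after = sampleDhandles();
print("active data handles: " + before.active + " -> " + after.active);
print("dhandles closed by sweep in this window: " + (after.swept - before.swept));
```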
| Comment by Ronnel Boettcher [ 29/May/19 ] | |||||||||
|
@Dima, we removed DataDog monitoring in one of our lower environments to remove the listCollections and listDatabases calls that occur every 15 seconds, then changed close_idle_time to 30 seconds. I was not able to identify what was calling listIndexes. The result was an initial spike, with queries taking hundreds of seconds to complete, followed by degraded performance for the couple of hours that close_idle_time was configured to 30s. After setting close_idle_time to 100000s, performance immediately returned to normal. I uploaded the diagnostics file metrics.2019-05-24T15-19-30Z-00000 for this time period. We flipped the setting around May 24th 12:25 pm PST. Thanks again for your help; let me know if you need anything else from us. Ronnel |
| Comment by Devon Yang [ 23/May/19 ] | |||||||||
|
@Dima, thank you for this additional information; it's very helpful. We think this is related to our DataDog monitoring, so we'll look into tweaking that. I do have a couple of follow-up questions, though. We failed over the primary to mitigate, and everything seemed to work perfectly fine on another replica. Based on historical data, it seems like this issue is something that degrades over time on a replica; it's not immediate, even though those queries are happening all the time. Would increasing the interval at which these queries occur help? For example, if we changed it to 30 minutes, would that prevent this from happening? Or should we just try to avoid making these types of queries in general? Thanks in advance, we really appreciate you looking into this. |
| Comment by Dmitry Agranat [ 23/May/19 ] | |||||||||
|
Hi devony,

After looking at the latest data, it appears that this workload "touches" all the data, all the time. Specifically, this workload executes listCollections and listDatabases every ~15 seconds and listIndexes every 5 minutes. When you execute these commands, you effectively "touch" the collections and all the associated indexes, making all of these tables active.

Please note that the proposed file_manager settings are suitable whenever an application has a large number of tables (a table is either a collection or an index) but regularly accesses only a relatively small proportion of them. Applications that, like this workload, consistently access data from across a large set of collections and indexes are likely to see worse performance with the suggested configuration, and this is exactly what we see based on the latest data.

In order to move forward:
Thank you, | |||||||||
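One way to confirm the command cadence described above (not something suggested in the original thread) is to diff the per-command counters in serverStatus over a known interval. A minimal mongo shell sketch, assuming serverStatus().metrics.commands exposes these counters on this 3.6 build:

```javascript
// Hedged sketch: estimate how often listCollections / listDatabases /
// listIndexes are issued by diffing serverStatus command counters over
// a one-minute window.
function commandTotals() {
  var c = db.serverStatus().metrics.commands;
  return {
    listCollections: Number(c.listCollections.total),
    listDatabases:   Number(c.listDatabases.total),
    listIndexes:     Number(c.listIndexes.total)
  };
}

var before = commandTotals();
sleep(60 * 1000);  // one minute
var after = commandTotals();
["listCollections", "listDatabases", "listIndexes"].forEach(function (cmd) {
  print(cmd + " per minute: " + (after[cmd] - before[cmd]));
});
```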
| Comment by Devon Yang [ 22/May/19 ] | |||||||||
|
Uploaded metrics in Archive.zip | |||||||||
| Comment by Devon Yang [ 22/May/19 ] | |||||||||
|
@Kelsey we tried making this change in a couple of ways. I'm also going to upload newer diagnostics information so you can have a look with more context. Here's what we noticed. On 5/21 (today) around 12:00PM PT, we noticed a significant spike in latency that extended until 3:30PM PT. I am curious whether you see anything out of the ordinary that may have started around that time. Even since then, we are still seeing spikes in latency every so often. At 5:13PM we set close_idle_time to 3600, which did not appear to have much impact.
We then attempted to set close_idle_time to 30 but the operation got rejected:
At 5:35PM, we then attempted to set it to 300, which worked. This caused a HUGE spike in performance, so we reverted it back to 3600 at 5:42PM. At 5:54PM, we set it to 1800, and that's where it is currently set. It's unclear at this moment whether it has improved or whether any flattening is just variance. |
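A minimal mongo shell sketch for checking what is currently in effect, assuming the changes above were made via the wiredTigerEngineRuntimeConfig parameter (the exact commands used were not captured here) and that serverStatus() exposes WiredTiger's "files currently open" counter:

```javascript
// Hedged sketch: read back the currently applied WiredTiger runtime
// configuration and how many files WiredTiger has open right now.
var param = db.adminCommand({ getParameter: 1, wiredTigerEngineRuntimeConfig: 1 });
print("wiredTigerEngineRuntimeConfig: " + param.wiredTigerEngineRuntimeConfig);

var conn = db.serverStatus().wiredTiger.connection;
print("files currently open: " + conn["files currently open"]);
```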
| Comment by Kelsey Schubert [ 21/May/19 ] | |||||||||
|
Yes, you'd see improvement from having fewer active files if you had larger collections rather than the same data distributed across many DBs (assuming you aren't regularly running COLLSCANs). And yes, there is a tradeoff with the cost associated with opening and closing files too frequently. Have you had a chance to try the settings I suggested? |
| Comment by Ronnel Boettcher [ 21/May/19 ] | |||||||||
|
@Kelsey any update on this? | |||||||||
| Comment by Ronnel Boettcher [ 13/May/19 ] | |||||||||
|
Is the issue the overhead of active files, or is it the size of the collection and index files? Would we see improvement if we had larger collections in a single DB instead of the same data distributed across many DBs? Also, is there a risk of setting close_idle_time too small and causing a large percentage of operations to need to open a file first? |
| Comment by Kelsey Schubert [ 10/May/19 ] | |||||||||
|
Hi devony,

Sorry for the delay getting back to you. I've examined the diagnostic data that you've provided. During a checkpoint, WiredTiger has to examine each recently accessed index or collection file. Unfortunately, with many active indexes or collections this process may currently result in a drop in throughput. I've plotted an example from the diagnostic data that you've provided:

One workaround would be to reduce the number of active files held open by MongoDB. This can be accomplished in a few ways, but one option that doesn't require an application or deployment change would be to set the following parameter:
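(The parameter text itself was not preserved in this copy of the ticket. Based on the close_idle_time and file_manager discussion elsewhere in the thread, it was presumably along the lines of the following mongo shell sketch, which assumes the wiredTigerEngineRuntimeConfig parameter; treat the exact option string as a reconstruction rather than the verbatim suggestion.)

```javascript
// Assumed reconstruction, not the verbatim setting from this ticket:
// lower WiredTiger's idle-handle close timeout to 30 seconds at runtime.
db.adminCommand({
  setParameter: 1,
  wiredTigerEngineRuntimeConfig: "file_manager=(close_idle_time=30)"
});
// An equivalent string can also be supplied at startup, e.g. via
// storage.wiredTiger.engineConfig.configString in the config file.
```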
This will reduce the amount of time a collection has to be idle before its file is closed from a bit over a day to 30 seconds, and may benefit your workload if many collections aren't being accessed within that time window. This parameter can be set using setParameter or on startup in your config file. Would you please set this parameter and let us know if it resolves the issue? Thank you, |
| Comment by Devon Yang [ 09/May/19 ] | |||||||||
|
Any update on this? | |||||||||
| Comment by Devon Yang [ 24/Apr/19 ] | |||||||||
|
@Eric I've uploaded the metric files from the relevant time frame. |
| Comment by Eric Sedor [ 23/Apr/19 ] | |||||||||
|
Thanks for your report, devony. Would you please archive (tar or zip) the $dbpath/diagnostic.data directory (described here) and attach it to this ticket? |
| Comment by Devon Yang [ 23/Apr/19 ] | |||||||||
|
I mistyped; the instance size is an i3.8xlarge. |