[SERVER-36291] Severe Performance regression from 3.4 to 3.6 w/ faulty logging or secondaryPreferred query Created: 26/Jul/18 Updated: 08/Mar/19 Resolved: 06/Feb/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance |
| Affects Version/s: | 3.6.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Chad Kreimendahl | Assignee: | Dmitry Agranat |
| Resolution: | Done | Votes: | 2 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
When running our systems on 3.6 vs 3.4 with the exact same data, we're seeing intermittent slowness, with slow queries logging as secondaryPreferred, while not a single one of our queries is against a secondary. We only do primaryPreferred queries.
See my attachments |
| Comments |
| Comment by Chad Kreimendahl [ 06/Feb/19 ] | |||
|
Definitely not yet resolved. We'll attempt the test with the config file change soon. Our only problem is that these types of issues are not easy to reproduce outside the demands of production. Recreating production issues in dev is nearly impossible. | |||
| Comment by Dmitry Agranat [ 06/Feb/19 ] | |||
|
Hi sallgeud, We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket. Regards, | |||
| Comment by Dmitry Agranat [ 31/Dec/18 ] | |||
|
Hi sallgeud,
This might be a good use case for micro-sharding: you can co-locate multiple MongoDB processes on the same server, isolating each process and its overall resource utilization with a container technology, for example Docker or cgroups. Thank you, | |||
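A minimal docker-compose sketch of the co-location idea above: two mongod processes share one host, each confined to its own port, data directory, and memory limit. Service names, image tags, paths, ports, and sizes are illustrative placeholders, not values from this ticket.

```yaml
# Illustrative only: two co-located mongod 3.6 processes, each resource-fenced
# by its container. All names, paths, ports, and limits are placeholders.
version: "2.4"
services:
  mongod-a:
    image: mongo:3.6
    command: ["mongod", "--port", "27017", "--wiredTigerCacheSizeGB", "4"]
    ports:
      - "27017:27017"
    volumes:
      - /srv/mongo-a:/data/db
    mem_limit: 8g
  mongod-b:
    image: mongo:3.6
    command: ["mongod", "--port", "27018", "--wiredTigerCacheSizeGB", "4"]
    ports:
      - "27018:27018"
    volumes:
      - /srv/mongo-b:/data/db
    mem_limit: 8g
```

The same isolation can be achieved with cgroups directly; the point is that each mongod gets a bounded cache and memory footprint instead of competing for the whole host.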
| Comment by Chad Kreimendahl [ 27/Dec/18 ] | |||
|
What's considered a "small proportion"? Let's say we have 350k "tables", and we regularly access about 25-50k of them. What about regularly accessing 10k vs 100k of them, compared to 25-50k? Also, what is "consistently access"? Hourly, daily, etc? | |||
| Comment by Dmitry Agranat [ 27/Dec/18 ] | |||
|
Hi sallgeud, To set the mentioned WT parameter, update the mongod.conf files for the Primary and Secondary nodes to include the following and restart the process:
As a reminder, these settings are suitable whenever an application has a large number of tables (a table is either a collection or an index) but regularly accesses only a relatively small proportion of them. Applications that consistently access data from across a large set of collections and indexes are likely to see worse performance with the suggested configuration. Thank you, | |||
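The configuration snippet referenced in the comment above is not shown on this page. The fragment below is a hedged sketch of how a WiredTiger parameter is passed through mongod.conf, assuming the parameter under discussion is file_manager's close_idle_time (the dhandle close TTL whose 100000-second default is mentioned in the 11/Dec comment below); the value 600 is a placeholder, not a recommendation from the ticket.

```yaml
# Sketch only: pass a WiredTiger file_manager setting through mongod.conf.
# close_idle_time=600 is a placeholder value; the assumption is that this is
# the dhandle-close TTL discussed in the comments, which defaults to 100000s.
storage:
  wiredTiger:
    engineConfig:
      configString: "file_manager=(close_idle_time=600)"
```

As described in the comment, the file would be updated on both the primary and secondary nodes and each process restarted for the setting to take effect.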
| Comment by Chad Kreimendahl [ 24/Dec/18 ] | |||
|
How would one set those specific WT parameters? Or where? I assume this works on 3.4? I'd love to share how we organize our data and indexes in a way that would be helpful. If you all want, I can set up a web conference and show / discuss our methods. We've got hundreds of databases. Our method to get around these file limitations, which include time-to-boot (very long), and various problems that seem greatly exacerbated in 3.6 (vs 3.4), is to simply have more physical hardware and/or more services. So when we have a machine that's running into file problems, but has plenty of memory and disk, we simply configure another service with a different port and data directory. It allows us to scale within a machine, but it's a poor scaling method. Unfortunately, sharding is nearly pointless in our environment, as > 90% of customers will never really use more than 50GB (we're very data-light, on purpose). | |||
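As a sketch of the workaround described in the comment above (an extra mongod service per host with its own port and data directory), a hypothetical second-instance config might look like the following; every path, port, and name here is a placeholder rather than a value from this ticket.

```yaml
# Hypothetical /etc/mongod-27018.conf for a second mongod service on the same
# host. Paths, port, and log location are placeholders.
storage:
  dbPath: /var/lib/mongodb-27018
systemLog:
  destination: file
  path: /var/log/mongodb/mongod-27018.log
  logAppend: true
net:
  port: 27018
  bindIp: 127.0.0.1
```

Each additional service gets its own dbPath and port, which is what allows scaling within one machine at the cost of running and monitoring more processes.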
| Comment by Dmitry Agranat [ 11/Dec/18 ] | |||
|
Hi Chad, I'd like to address your latest comment about the large number of files. Under certain workloads, the number of active files MongoDB uses can impact a system's performance. Sometimes the bottleneck in these workloads is the eviction server's single thread identifying candidate pages for eviction (since we need to walk many trees). To mitigate this issue, we can try to close non-active dhandles sooner by setting this WT parameter:
Metrics explanation:
The default TTL for non-active dhandles is 100000 seconds. Please note that these settings are suitable whenever an application has a large number of tables (a table is either a collection or an index) but regularly accesses only a relatively small proportion of them. Applications that consistently access data from across a large set of collections and indexes are likely to see worse performance with the suggested configuration. Another mitigation with a large number of tables is to avoid unnecessarily marking tables as active. For example, when you execute the collStats() command, you effectively "touch" that collection and all of its associated indexes, marking all of those tables as active. If your application needs to "touch" many tables frequently, effectively marking a large number of tables and cursors as active, you might consider horizontal scaling by sharding while leaving individual collections unsharded. We've deprioritized Wildcard indexes, which will be documented under Thank you, | |||
| Comment by Chad Kreimendahl [ 06/Dec/18 ] | |||
|
I'm going to say "probably". We've now got several clusters in production. The 3.6 cluster now has substantially fewer files on it than the 3.4 cluster. It seems like with a low file count we're fine, but as we get into the 10k-100k range somewhere, we get the badness. Our research suggests we could probably cut the number of indexes we have in half, which will help a tiny bit, but when you've got 400k files in a data directory, cutting it to 200k only seems to help a little. I'd love to see a single file for indexes, which used to be identified for 4, but appears to have been indefinitely postponed. | |||
| Comment by Kelsey Schubert [ 05/Dec/18 ] | |||
|
Hi sallgeud, Is this still an issue for you? Have you been able to gather diagnostic.data and logs from 3.4 and 3.6 nodes running against the same workload that shows a regression? Thanks, | |||
| Comment by Chad Kreimendahl [ 14/Sep/18 ] | |||
|
Also, if you could privately email me the actual log lines that caused those 2... I'm having difficulty locating the specific log file we sent over. | |||
| Comment by Chad Kreimendahl [ 14/Sep/18 ] | |||
|
We don't have a way to downgrade the system on which these were created. I could share logs of a system which has several orders of magnitude more traffic. However, that system has quite a bit more memory, and the fastest enterprise disk in existence (Optane). That said, we had the exact same issue in production in the two-week period that it ran a mix of 3.4 and 3.6. Our only way of generating a mix of both is to upgrade one server in production, then make it primary. However, that would have a fairly significant impact on our users, so we'd have to figure out how exactly we'd go about it without causing too much pain. | |||
| Comment by Nick Brewer [ 04/Sep/18 ] | |||
|
sallgeud, bgentry Thanks for your patience. I’m curious where you’re seeing that slow queries occur in approximately 60-second batches - looking at the logs, slow queries do not seem to follow such a pattern. For example, a roughly two-minute window that contains multiple slow queries: From the information you’ve provided, it appears that the most pronounced periods of increased slowness are directly correlated with numerous extremely large collection scans. For example, during the period of slowness exhibited above, where both disk reads and the number of documents returned increased dramatically, multiple large collection scans were performed:
Note that I’ve sanitized the collection / database names in this query, since this ticket is public. Another example from the same time frame:
The slow query instances that we’ve managed to find so far all match up with periods where several collection scans were performed with a large number of documents examined. In order to demonstrate a regression, we would need to see logs and diagnostic.data for a 3.4 primary with the same workload. -Nick | |||
| Comment by Chad Kreimendahl [ 04/Sep/18 ] | |||
|
It makes me wonder if the combination of fairly regular deletes across hundreds (out of thousands) of collections could be a major culprit. Was there anything that changed from 3.4.x to 3.6.x, either in how files are managed in WT or in how TTL deletes are done? | |||
| Comment by bgentry [ 29/Aug/18 ] | |||
|
In one of our logs, we noticed that the slow requests were occurring in batches just slightly more than 60 seconds apart. This weekend, I saw in the docs that the background task that removes expired documents based on TTL indexes runs every 60 seconds. Could this be what is causing the performance issue? | |||
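One hedged way to test this hypothesis, not proposed in the ticket itself, is to briefly disable the TTL monitor and watch whether the roughly 60-second batches of slow queries stop. ttlMonitorEnabled is a standard server parameter; the fragment below is only an illustration, and leaving the monitor off means expired documents accumulate, so it is a short-lived diagnostic rather than a fix.

```yaml
# Diagnostic sketch: disable the background TTL monitor so its 60-second cycle
# can be ruled in or out as the source of the periodic slow queries.
setParameter:
  ttlMonitorEnabled: false
```

The same parameter can also be flipped at runtime with the setParameter command, which avoids a restart.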
| Comment by Nick Brewer [ 08/Aug/18 ] | |||
|
sallgeud I agree that the logging of secondaryPreferred is confusing - though in my testing so far, I haven't seen it when an explicit read preference is set on an actual query; for example, I see the correct result in the logs when running something like db.collection.find().readPref('primaryPreferred'). I'm still testing this, and it may result in a separate SERVER ticket to get some clarification. The read concern change I mentioned in my previous update is one possible cause for the increased slowness you're seeing, but I'm taking a closer look at the diagnostic information you provided in your previous ticket, just to narrow down the potential source. -Nick | |||
| Comment by Chad Kreimendahl [ 07/Aug/18 ] | |||
|
If these need to be broken out into separate issues, we can do that. However, a severe performance regression still exists. And the incorrect logging of the query type still exists, independently. I wouldn't care if the logging took a ton of time to fix. However, we really need the performance issues fixed soon. | |||
| Comment by Chad Kreimendahl [ 06/Aug/18 ] | |||
|
Fair enough. Though, in our replicaSet, it seems strange that a query that is explicitly called as primaryPreferred would ever log as secondaryPreferred... which made us assume it was potentially related. | |||
| Comment by Nick Brewer [ 06/Aug/18 ] | |||
|
In 3.6 and above, MongoDB logs the slaveOK bit as secondaryPreferred - because of this, you can in fact see secondaryPreferred in your logs, even in a standalone environment. This change was part of the work outlined here: One thing I should note is that 3.6 enables the "majority" read concern by default. -Nick | |||
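For reference, the startup setting that governs this behavior in 3.6 is replication.enableMajorityReadConcern, shown below only as a hedged illustration of where the knob lives; whether changing it is appropriate for this deployment is not established in this ticket.

```yaml
# Illustration only: the mongod.conf setting controlling "majority" read
# concern support, which 3.6 enables by default.
replication:
  enableMajorityReadConcern: false
```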
| Comment by Chad Kreimendahl [ 27/Jul/18 ] | |||
|
The node from which the logs were sent was a primary system 100% of the time. Our connection (through the C# driver) does not make any secondary queries, so it shouldn't be sending secondaryPreferred. When we see slow query logs on the exact same databases but under 3.4, they do not log secondaryPreferred. We have one command-line tool that's run once a week that makes an aggregation query. That's literally the only place anywhere in any of our code that we would do anything related to secondaryPreferred. Point being, you should basically never see a secondaryPreferred query out of our system, yet we're seeing them logged by the thousands. We're also noticing a ton more slow queries on the exact same databases.
We migrated back to 3.4 in production and 100% of our performance issues were immediately resolved. | |||
| Comment by Nick Brewer [ 26/Jul/18 ] | |||
|
sallgeud Drivers that perform direct connections to a mongod set the slaveOK bit by default. The server logs the slaveOK bit as "secondaryPreferred" because that's the read preference that most closely matches the behavior of slaveOK. Picking up from the other ticket:
Looking at the logs you provided previously, it appears that both the "before" and "after" primary switch logs contain multiple entries with secondaryPreferred. Just to clarify, these logs are both from the 3.6 node, before and after it is elected as primary, correct? Thanks, |