[SERVER-27700] WT secondary performance drops to near-zero with cache full Created: 17/Jan/17 Updated: 06/Feb/19 Resolved: 02/Mar/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, WiredTiger |
| Affects Version/s: | 3.2.9, 3.2.10, 3.2.11 |
| Fix Version/s: | 3.2.13, 3.4.3, 3.5.4 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Adrien Jarthon | Assignee: | Michael Cahill (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v3.4, v3.2 |
| Steps To Reproduce: | No easy steps, of course; in our case we have a secondary under replication load, with some mongodumps running every night, and the problem appears progressively. |
| Sprint: | Storage 2017-03-06 |
| Participants: |
| Case: | (copied to CRM) |
| Description |
|
Hello, I'm sorry but I'll have to reopen the difficult topic of WT dying when the cache is full, which has been reported multiple times already but always closed hoping 3.2.10 would fix the issue. We upgraded to 3.2.10 and later 3.2.11 as suggested, but unfortunately it didn't solve this issue (though it improved other performance problems).
We're currently running 3.2.11 in a replica set on 2 capable servers (dual Xeon, 256GB RAM, RAID 10 SSD) and running only mongo dumps on the secondary every night. The servers were running fine for weeks; we recently added another mongo dump with the --oplog option every night, and since then (about 5 days ago), our secondary has been getting slower and slower, up to the point that today we couldn't do anything on it: even running db.getReplicationInfo() takes 30 to 60 seconds. This is the exact same problem I explained in
I've uploaded the diagnostic data dir to your secure portal (diagnostic.tar.bz2). I hope we'll be able to fix this once and for all, and that our primary server won't suffer from the same issue. Here is the server status before the restart (at its worst):
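The cache figures discussed throughout this ticket come from the wiredTiger.cache section of serverStatus. A minimal sketch for capturing them from the mongo shell, assuming 3.2-era statistic names (they may differ slightly between versions):

```javascript
// Snapshot of the WiredTiger cache figures referenced in this ticket.
// Statistic names follow 3.2-era serverStatus output and may vary by version.
var cache = db.serverStatus().wiredTiger.cache;
var GB = Math.pow(1024, 3);
print("configured max : " + (cache["maximum bytes configured"] / GB).toFixed(1) + " GB");
print("currently used : " + (cache["bytes currently in the cache"] / GB).toFixed(1) + " GB");
print("tracked dirty  : " + (cache["tracked dirty bytes in the cache"] / GB).toFixed(1) + " GB");
```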
|
| Comments |
| Comment by Scott Glajch [ 04/May/17 ] | |||||||||||||||||||
|
Updating from 3.2.12 to 3.2.13 has significantly improved our system's performance (and specifically, the variability in performance), and we are pretty sure it was due to this bug fix. | |||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 28/Apr/17 ] | |||||||||||||||||||
|
Thanks for the update bigbourin@gmail.com, and very glad to hear that MongoDB 3.4.3 has taken care of the issue and is working well for you. Regards, | |||||||||||||||||||
| Comment by Adrien Jarthon [ 28/Apr/17 ] | |||||||||||||||||||
|
Thanks, and for the record, we have been running 3.4.3 in production since last week with no performance issues so far; the WT cache usage level is MUCH lower (~105G now, whereas it was ~150G before):
| |||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 28/Apr/17 ] | |||||||||||||||||||
|
This is to let watchers know that a 3.2.13-rc0 release candidate, which includes a fix for this ticket, is available for testing. We're planning to release 3.2.13 for production use early next week. Regards, | |||||||||||||||||||
| Comment by Adrien Jarthon [ 22/Apr/17 ] | |||||||||||||||||||
|
Yeah the Request time in queue and Mean requests duration are web server level charts, provided by our ELK stack. | |||||||||||||||||||
| Comment by Jérémy Carlier [ 22/Apr/17 ] | |||||||||||||||||||
|
Yeah, those graphs are nice; they're really helpful. We don't manage this ourselves, it's totally managed by our web host. The system providing those charts is Grafana. It can give you lots of information about Mongo, MySQL, your CPU, etc. |
| Comment by Chad Kreimendahl [ 22/Apr/17 ] | |||||||||||||||||||
|
Jeremy, where do you get those pretty charts of your mongo performance? |
| Comment by Jérémy Carlier [ 12/Apr/17 ] | |||||||||||||||||||
|
I work with @adrien jarthon; we are from the same company and we are investigating the same MongoDB servers. So could you please have a look and confirm whether you think it's the same issue, rather than asking me to create another ticket which would end up being a duplicate? Thank you |
| Comment by Ramon Fernandez Marina [ 10/Apr/17 ] | |||||||||||||||||||
|
keysen, it's always preferable to open separate tickets to avoid confusion and allow for parallel assignments. Thanks, | |||||||||||||||||||
| Comment by Jérémy Carlier [ 10/Apr/17 ] | |||||||||||||||||||
|
Hello, I'm coming back with some news on this ticket because we had another downtime a few days ago in my company, on our product using MongoDB. For your information, we are still using 3.2.11 in production; soon we will move to 3.4.3. We think it's the same issue, but we would like your confirmation. The issue occurred on the 4th of April. Thank you. |
| Comment by Adrien Jarthon [ 03/Apr/17 ] | |||||||||||||||||||
|
Thank you for the analysis, Michael. OK, so hopefully with 3.4.3 we should not have this problem again. |
| Comment by Michael Cahill (Inactive) [ 03/Apr/17 ] | |||||||||||||||||||
|
bigbourin@gmail.com, I realized your most recent report had not gotten a response so I took a look. I note that this data is from an unpatched 3.2.11 server.
This is essentially the same issue as before: there is a phase change in the workload causing many pages to be read into cache. When the cache becomes full, MongoDB's yielding behavior causes many more cursors to be opened (yielding causes churn, which here means we don't keep the required cursors cached). Opening all of those cursors (prior to
The improvements in concurrency in |
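The cursor churn described above is visible on a live server through the cursor counters in serverStatus. A hedged sketch, assuming 3.2-era statistic names:

```javascript
// Watch cursor creation over a short interval; a rapidly growing
// "cursor create calls" counter reflects the churn described above.
var before = db.serverStatus().wiredTiger.cursor["cursor create calls"];
sleep(10 * 1000);  // mongo shell sleep(), in milliseconds
var after = db.serverStatus().wiredTiger.cursor["cursor create calls"];
print("cursors created in the last 10s: " + (after - before));
```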
| Comment by Kelsey Schubert [ 21/Mar/17 ] | |||||||||||||||||||
|
Hi aqueen, The fixVersion, "3.2 required", indicates that the backport to MongoDB 3.2 has been requested and approved. Please continue to watch this ticket for updates. Thank you, | |||||||||||||||||||
| Comment by Aaron Queen [ 21/Mar/17 ] | |||||||||||||||||||
|
Can the affected versions be updated to include 3.2.12, and can this be backported to 3.2 and hopefully land in 3.2.13? Thanks! |
| Comment by Adrien Jarthon [ 20/Mar/17 ] | |||||||||||||||||||
|
Hi, these are the 3 slowest queries during the issue, and I believe the first and last to be the culprits as they were exceptional (started from a console), whereas the second one happens all the time, so it was most probably a consequence here:
Here is our web request processing during the incident; we can see it's not totally blocked, but almost. Here is the MongoDB primary memory usage (careful, the timeframe is not the same); we can see the cache rising quickly as the full scan runs, before starving and slowing down at top capacity. Max cache size is 152G for us (60% of RAM - 1G); we're getting above that here, but I guess there are probably other things than the WT cache included? I uploaded the diagnostic.data from our primary to the same portal: diagnostic.data.tgz (108M) |
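As a side check on the "60% of RAM - 1G" figure, the configured ceiling can be compared against host memory from the shell; a sketch assuming the 3.2-era statistic name and the hostInfo command:

```javascript
// Cross-check the configured WT cache ceiling against 0.6 * RAM - 1GB,
// the default sizing referred to in the comment above.
var memGB = db.hostInfo().system.memSizeMB / 1024;
var maxGB = db.serverStatus().wiredTiger.cache["maximum bytes configured"] / Math.pow(1024, 3);
print("host RAM                       : " + memGB.toFixed(0) + " GB");
print("expected cache (0.6 * RAM - 1) : " + (0.6 * memGB - 1).toFixed(0) + " GB");
print("configured cache               : " + maxGB.toFixed(0) + " GB");
```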
| Comment by Adrien Jarthon [ 01/Mar/17 ] | |||||||||||||||||||
|
Ok great | |||||||||||||||||||
| Comment by Michael Cahill (Inactive) [ 01/Mar/17 ] | |||||||||||||||||||
|
bigbourin@gmail.com, the synthetic workload I referenced above can be tweaked to show eviction being much more efficient with this set of changes (e.g., 50x less work in eviction to support the same / higher workload). We have now heard from several MongoDB users including you that these changes also help in real world workloads, so I'm confident that the synthetic workload is reproducing the condition you reported. | |||||||||||||||||||
| Comment by Adrien Jarthon [ 24/Feb/17 ] | |||||||||||||||||||
|
Well, we can see the eviction attempts and "queue empty after locking" lasting for hours after all load stops on the first version, which is why I observed that the server didn't recover in a timely manner the first time we had the issue. In my test here the performance came back quickly, but when we had the issue in production it was much worse: 12-14 hours at 100% CPU stuck doing eviction is a hang to me. Did this WT test workload show any improvement with this fix? |
| Comment by Michael Cahill (Inactive) [ 24/Feb/17 ] | |||||||||||||||||||
|
bigbourin@gmail.com, there is no evidence of a hang in this data: the eviction state did recover in the unpatched version; it just did so inefficiently. The WiredTiger workload used for testing this is here. It is part of our regular performance regression test suite. |
| Comment by Adrien Jarthon [ 23/Feb/17 ] | |||||||||||||||||||
|
Indeed, the numbers look great. I also like the fact that we can see, in the previous version, WT trying to walk through pages for eviction long after the server load stopped without managing to do anything, whereas in the patched build it never gets stuck trying to evict pages indefinitely. I'm glad you figured this out; it's a really scary bug when you think your main database can hang under a regular load and keep dying and getting slower and slower even when the load decreases, without crashing (which would be much better). Did you manage to reproduce the issue in a synthetic way to add to your performance regression test suite? |
| Comment by Michael Cahill (Inactive) [ 23/Feb/17 ] | |||||||||||||||||||
|
bigbourin@gmail.com, here's a summary of what I can see from the diagnostic data. In the original workload, there were frequent stalls caused by eviction (on average, out of 728 requests for eviction per second, 523 (or 72%) failed because the eviction queue was not kept full):
In contrast, with the patch build, only 79 / 659 requests (12%) find the queue empty:
Another way of looking at the same data is comparing "pages queued" vs "pages seen", which gives an estimate of how efficiently eviction is working:
The result is that MongoDB is supporting higher throughput without stalls and with less CPU load from eviction. As I said, I'll let you know when these changes are in a release. Thanks again. | |||||||||||||||||||
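For watchers who want to follow the same ratios on their own systems, the underlying counters are exposed through serverStatus; a hedged sketch, assuming 3.2-era statistic names (they may differ between versions):

```javascript
// Counters behind the "queue empty" and "pages queued vs pages seen" ratios.
var c = db.serverStatus().wiredTiger.cache;
var empty    = c["eviction server candidate queue empty when topping up"];
var notEmpty = c["eviction server candidate queue not empty when topping up"];
print("queue empty ratio : " + (100 * empty / (empty + notEmpty)).toFixed(1) + " %");
print("pages queued/seen : " + (c["pages queued for eviction"] / c["pages seen by eviction walk"]).toFixed(3));
```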
| Comment by Adrien Jarthon [ 23/Feb/17 ] | |||||||||||||||||||
|
Sure, I just uploaded it! (diagnostic.tar.bz2, 96.7 MB) | |||||||||||||||||||
| Comment by Michael Cahill (Inactive) [ 22/Feb/17 ] | |||||||||||||||||||
|
bigbourin@gmail.com, that's great news! Thanks for running this test and letting us know how it went. Would you mind uploading the diagnostic.data from that test run to the same portal as before? I will let you know when we have these changes scheduled into a release. |
| Comment by Adrien Jarthon [ 22/Feb/17 ] | |||||||||||||||||||
|
Good news: I wasn't able to reproduce the issue with this custom build. The secondary never went over 129G of cache usage (whereas it was going up to 153G before) and always performed just as fast (~1200 req/s on my benchmark); mongodump consistently lasted about 1.2h. Looks like a win to me, good job guys \o/ Please let me know in which production release these fixes will land, and if you need any more info. |
| Comment by Adrien Jarthon [ 21/Feb/17 ] | |||||||||||||||||||
|
Thanks, we'll give that a try! | |||||||||||||||||||
| Comment by Michael Cahill (Inactive) [ 20/Feb/17 ] | |||||||||||||||||||
|
bigbourin@gmail.com, I have created a patch build here with the current tip of MongoDB 3.2 combined with the current WiredTiger develop. The link is for Debian 7.1 binaries which matches your previous data: if you need something different, please let me know. This build includes the WiredTiger changes listed above to make eviction more efficient in workloads with many active collections. Please let me know what you see when you have a chance to try it out. | |||||||||||||||||||
| Comment by Adrien Jarthon [ 10/Feb/17 ] | |||||||||||||||||||
|
Mm the only thing running frequently on this mongo is the monitoring (telegraf), which is issuing two commands: "listDatabases" and "dbStats" every 30 seconds:
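For context, those two commands amount to roughly the following when run from the mongo shell (a sketch of the polling pattern described above, not telegraf's actual implementation):

```javascript
// listDatabases, then dbStats against each database, as the monitoring does
// every 30 seconds according to the comment above.
db.adminCommand({ listDatabases: 1 }).databases.forEach(function (d) {
  var s = db.getSiblingDB(d.name).runCommand({ dbStats: 1 });
  print(d.name + ": " + s.collections + " collections, " + s.indexes + " indexes");
});
```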
Is this keeping all the handles open to compute the stats? OK, let me know when the test version is ready. |
| Comment by Michael Cahill (Inactive) [ 10/Feb/17 ] | |||||||||||||||||||
|
bigbourin@gmail.com, thank you for the new data, and apologies for the delay. The setting I recommended doesn't seem to be very effective for your workload: in the statistics I still see the "data handles currently active" statistic sitting at around 12,000. Could there be that many collections and indexes accessed every 30 seconds?
What is happening during the slower periods is as explained before: most threads spend most of their time waiting for shared access to MongoDB's global lock. The global lock is held mostly by threads applying oplog operations on the secondary. When the oplog reads start, the amount of oplog data in cache grows by ~20GB in 20 minutes. Eviction is inefficient during this phase because it spends a lot of time working through the 12,000 trees in cache to find the ones containing most of the data. That slows down oplog application, which in turn slows down reads.
As explained earlier, we have made a number of changes that improve performance in this kind of workload in our testing. There is one further change I would like to make ( |
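The handle count mentioned above can be tracked directly from the shell; a hedged sketch, assuming the 3.2-era statistic name under the data-handle section of serverStatus:

```javascript
// Number of WiredTiger data handles (tables and indexes) currently open.
var dh = db.serverStatus().wiredTiger["data-handle"];
print("data handles currently active: " + dh["connection data handles currently active"]);
```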
| Comment by Adrien Jarthon [ 03/Feb/17 ] | |||||||||||||||||||
|
Ok that's done | |||||||||||||||||||
| Comment by Kelsey Schubert [ 03/Feb/17 ] | |||||||||||||||||||
|
Sorry for the confusion, portals (and their contents) are routinely deleted to protect user information. Going forward on this ticket, would you please upload to this new portal? Thank you, | |||||||||||||||||||
| Comment by Adrien Jarthon [ 03/Feb/17 ] | |||||||||||||||||||
|
Hi, I've just uploaded
I am OK to run a custom build of MongoDB, but not a too edgy one and not too often, because it's still our production secondary and we need to ask our hosting provider to install it ☺
How shall I upload it? |
| Comment by Michael Cahill (Inactive) [ 02/Feb/17 ] | |||||||||||||||||||
|
bigbourin@gmail.com, I'm sorry to hear that the experiment was unsuccessful. I think it's telling that it made a difference; in other words, I believe we're on the right track. Would you be able to attach the diagnostic data from that run?
You are correct about the parsing of config strings: the last setting wins, so the experiment was run with the suggested settings.
In terms of what is going on here, having many open handles is causing the existing eviction algorithm to be inefficient. The setting should reduce the number of open handles, which probably explains why performance was better, but the diagnostic data includes statistics that allow us to track the open handles directly.
Note that, as I said earlier, we are working on various changes to WiredTiger that should help. Would you be prepared to try a custom build of MongoDB with those changes to see if it helps in your case? |
| Comment by Adrien Jarthon [ 31/Jan/17 ] | |||||||||||||||||||
|
OK, I tried with the suggested configuration:
Unfortunately it didn't fix the problem, though it is a bit better: I tried 4 times and couldn't get below 200 req/s, so the performance drop was smaller, and the time it takes to dump the oplog didn't go above 20 minutes, whereas it took 1h in my last comment. But the problem is still there, as the normal performance is 1200 req/s and the cache was still full (153G). This is somewhat expected to me, as the problem is not really about how long we keep the handles open when idle, but more about how we evict them when very active, IMO. Is there a way to double-check that the settings are currently active for the WT engine? I saw this in the log:
So the file_manager options are duplicated (new ones at the end); I guess this is on purpose and WT keeps the last ones? |
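One way to double-check what mongod was actually started with (though not how WiredTiger merged the duplicates) is to read the options back from the shell; a sketch using the getCmdLineOpts command:

```javascript
// getCmdLineOpts echoes the parsed startup options, including any
// storage.wiredTiger.engineConfig.configString that was supplied.
var opts = db.adminCommand({ getCmdLineOpts: 1 });
printjson(opts.parsed);  // look for storage.wiredTiger.engineConfig.configString
```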
| Comment by Eric Milkie [ 26/Jan/17 ] | |||||||||||||||||||
|
mongodump does not actually record the oplog during the dump. Instead, it remembers the last entry in the oplog when it started dumping. When it then logs "writing captured oplog to xxxx", it at that moment issues a query to the oplog to read all the oplog entries since that remembered entry. This can be rather expensive. | |||||||||||||||||||
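In other words, the "writing captured oplog" step boils down to one large oplog range read. Roughly, with a placeholder start timestamp rather than the one mongodump actually records:

```javascript
// Approximation of the query mongodump --oplog issues at the end of the dump:
// read every oplog entry written since the timestamp remembered at the start.
var startTs = Timestamp(1485388800, 1);  // placeholder value
db.getSiblingDB("local").oplog.rs
  .find({ ts: { $gt: startTs } })
  .itcount();  // on a long dump this can be hours' worth of entries
```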
| Comment by Adrien Jarthon [ 26/Jan/17 ] | |||||||||||||||||||
|
OK, I think I managed to reproduce the issue. As I said previously, I took our existing secondary (so with an active replication stream) and I spammed it with read requests AND ran a mongodump --oplog at the same time. This all goes well until the mongodump reaches the "writing captured oplog to" line. The dump took 1 hour up to this line, and has now been stuck at this line for about 1h.
My script, which sends read requests in a loop, was at 1200 req/sec on an idle mongo and during the whole dump up to this line, when it dropped to about 100-150 req/sec (on average, but basically requests are blocked for dozens of seconds and then processed in batches all at once). db.getReplicationInfo() now takes 5-20 sec to respond and shows up to 1 minute of synchronization lag. It seems to slowly be getting slower and slower, and if I stop the dump now, the server will slowly regain its performance and free memory. I got this slowdown 3 times on 3 attempts, including one after a fresh restart of mongo.
Here is the log of the end of the dump, where we can see the 55-minute gap and the oplog counter going whaaaat; as soon as the progress bar started (17:46:50), the performance was getting better and the server freed 20G of cache at once (going from 153G to 133G):
Why is this step so intensive? We don't seem to have the problem with mongodump without "--oplog". I know it records the oplog during the dump, but where is it stored? Maybe this is the part causing trouble with cache eviction? I'll now try with your suggested configuration to see if I can still reproduce the problem. |
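For reference, the 5-20 second figure can be measured directly; a small probe along these lines, run against the secondary, shows both the response time and the reported oplog window:

```javascript
// Time db.getReplicationInfo() and print the oplog window it reports.
var t0 = Date.now();
var info = db.getReplicationInfo();
print("getReplicationInfo took " + (Date.now() - t0) + " ms");
print("oplog window: " + info.timeDiffHours + " hours (" + info.usedMB + " MB used)");
```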
| Comment by Adrien Jarthon [ 25/Jan/17 ] | |||||||||||||||||||
|
Ok thanks for the clarifications.
I totally understand why the handles are kept open when the server is idle, that's what the cache is for, but in my case it was not idle: it was struggling to handle the replication and dump load. In this case, when the cache is full, isn't it supposed to evict the least recently used items from the cache to free it? Was it failing to do that? Was it considering everything impossible to evict? I'm still trying to reproduce the issue but haven't managed so far.
BTW, if this helps, I had a very similar issue (probably the same cause, but I can't be sure) on a much smaller DB with a dozen collections (for https://updown.io). I have a primary and a slave; replication is about 120 req/sec and the secondary does nothing more than replication and dumps. All was going well for months, and one day I tried issuing a few queries to the secondary (just some tests from the console; I couldn't go further because I found a bug in the client ☺) and a few hours after that, I got an MMS alert telling me my secondary was lagging behind the primary. So I ssh'ed in to check, and the mongo was basically in the state you know: 100% CPU, doing nothing, and replication totally stuck or processing a few ops per minute, I don't recall exactly. The cache was full of course, but it's always full on this secondary because I configured it with only 1GB of cache. I restarted the mongo process and everything has been fine since then (it was 4-5 months ago). Is there anything special about those console queries that could explain this? Something different from the dump queries? Maybe because they use indexes? I just checked whether I still have the diagnostic.data for this event, but it's way too old; I only keep 4 days of diagnostics. |
| Comment by Michael Cahill (Inactive) [ 25/Jan/17 ] | |||||||||||||||||||
|
bigbourin@gmail.com, thanks for your reply. This issue is certainly more pronounced on secondaries because any delays while holding the Global lock cause everything to stall. It's certainly possible that some of the same underlying inefficiency is in play on primaries but it is less likely to disrupt other operations. To answer your question about why handles are not closed automatically: MongoDB configures WiredTiger to not do that – the setting change I sent earlier does exactly this. One motivation for MongoDB's default setting is that some users were surprised when handles were closed in an idle server (causing additional I/O when the server became busy again). Please do let us know if you are able to reproduce the issue and try with the configuration change. I still believe that may be sufficient to resolve the issue. The storage team has been working to improve WiredTiger's handling of workloads with many active collections and indexes. In particular, we have changes in code review for:
Based on what we've seen in our testing so far, I'm hopeful that the combination of these changes will significantly improve WiredTiger's handling of workloads like yours, even with MongoDB's default settings. | |||||||||||||||||||
| Comment by Adrien Jarthon [ 20/Jan/17 ] | |||||||||||||||||||
|
Thanks for investigating this issue; your diagnosis seems to match what I saw. You said this is linked to the secondary using a global exclusive lock when syncing the oplog, so I guess the same problem can't happen on a primary? Because we haven't seen this issue on the primary so far (fortunately). The WT cache eviction problem is probably the same, I guess, but without the global lock it's probably much less visible?
We do have a lot of collections (7k) and indexes (29k), which are all opened (and probably stay "active" forever, if you say the default idle time is > 1 day) due to the daily dump. But I don't really understand how the issue can get bigger across several days while the activity stays the same. In the beginning we did execute some queries on the secondaries, but we stopped doing that as we noticed their performance was recently getting 10 to 100 times slower while the dump was running. This was 4-5 days before the secondary really hit the ground. So maybe the replication + dumps + queries traffic started to put the db in a bad state? And even once we stopped the queries, our secondary kept getting slower and slower for 4-5 days until the complete stop.
So when WT reaches a point where the cache is full of active handles, how can it not close some of them to keep going? It's not like we were requesting all of them at the same time; mongodump only queries 4 collections at a time. There's the replication load of course, but this is write-only; does it use the cache at all (except for updating it, of course)? I know WT is a complex engine, but I have trouble understanding how it can fail so randomly under such regular traffic.
I would love to test this configuration, but I'll try to reproduce the issue first, otherwise we have no way to tell whether it helped. I'm not sure if running the secondary without queries (just replication & dumps) can reproduce it, though, and I can't put the queries back at the risk of impacting all our clients. I developed a script to replay mongo queries from the Ruby logs (yeah, just before you released mongoreplay ☹), so I'll try to hammer the secondary with the typical queries we send there in the hope of finding a way to reproduce the issue. |
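The replay script itself is not attached to this ticket; as a stand-in, a minimal read-hammering loop from the mongo shell looks like the following (the database, collection and filter are hypothetical, and rs.slaveOk() is the 3.2-era helper for reading from a secondary):

```javascript
// Minimal stand-in for a query-replay load generator against a secondary.
rs.slaveOk();                        // allow reads on this secondary
var appDb = db.getSiblingDB("app");  // hypothetical database name
for (var i = 0; i < 100000; i++) {
  appDb.events.find({ status: "open" }).limit(20).toArray();  // hypothetical query
}
```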
| Comment by Michael Cahill (Inactive) [ 20/Jan/17 ] | |||||||||||||||||||
|
bigbourin@gmail.com, thanks for the diagnostic data: I have analyzed the performance issue you are seeing here. We are currently trying to reproduce the issue in testing (so far without success). What we see is a combination of two issues causing slow performance:
1. MongoDB holds the Global lock exclusive while applying operations to secondaries; and
2. Applying operations is slow because of inefficiency in WiredTiger's cache eviction implementation.
The main factor causing slowness in eviction is the large number of open collections and indexes, even though it appears that only a small fraction of those collections are active. If you can give us any more information about the access patterns in your application, that may help us reproduce this behavior more quickly. One configuration change we have suggested in this situation is to tune WiredTiger to close inactive handles more aggressively (by default with MongoDB, inactive handles are kept open for a little over a day). The configuration change is as follows:
If you are able to test this configuration, you may find it resolves the performance issues (the main downside of this change is that if the server becomes idle, handles will be closed so that next time the data is accessed it will have to be read back into cache). Regardless, we will continue trying to reproduce this behavior locally so that we can be confident it is fixed. | |||||||||||||||||||
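The configuration snippet itself is not preserved in this export. Purely as an illustration of the kind of setting described (the values below are placeholders, not the original recommendation), WiredTiger's file_manager options are passed to mongod at startup and their effect can then be watched from the shell:

```javascript
// Illustration only: placeholder values, not the original recommendation.
// The file_manager tuning is supplied to mongod at startup, e.g.
//   mongod --wiredTigerEngineConfigString \
//       "file_manager=(close_idle_time=600,close_scan_interval=60)"
// (or storage.wiredTiger.engineConfig.configString in mongod.conf). The number
// of open handles can then be watched dropping as idle ones are closed:
print(db.serverStatus().wiredTiger["data-handle"]["connection data handles currently active"]);
```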
| Comment by Kelsey Schubert [ 18/Jan/17 ] | |||||||||||||||||||
|
I'm sorry you're still encountering issues. Thank you for letting us know and providing the diagnostic.data – we're looking into it. Kind regards, |