[SERVER-42698] WT Cache eviction server performance regression in 3.6.13 (compared to 3.4.16) Created: 08/Aug/19 Updated: 27/Oct/23 Resolved: 29/Jul/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.6.13 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Adrien Jarthon | Assignee: | Rachelle Palmer |
| Resolution: | Gone away | Votes: | 5 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
While investigating, we added some charts to our monitoring about this, since we're hitting the limits of the eviction server (the single-threaded one that fills the eviction queue, apparently). On one of our primaries on 3.4.13 we regularly hit 30k+ pages queued for eviction per second, and at that point it's pretty much saturated and can't fill up the queue any more, increasing cache usage to 95% and generating app thread evictions (this is
So then we put a secondary on 3.6.13. Nothing bad so far, except that we already see some big cache-used spikes with almost no traffic (only mongodump), as mentioned in https://jira.mongodb.org/browse/SERVER-42256?focusedCommentId=2354831&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-2354831. But then more recently we started creating new indexes on our 2 RS (one is 3.4.16 only, and the other has the secondary on 3.6.13), and that's when the disaster came in: the 3.6.13 secondary was queuing pages for eviction more than 10 times slower than that (it looks capped at 3k pages/sec) even though the pressure generated by the index creation is bigger, meaning 95%+ cache use pretty quickly, whereas the 3.4.16 secondary was performing correctly. We can see this well on this chart, where we tried creating a couple of indexes one by one on one RS and then the other:
Here we can see the 3.6.13 secondary in green and the 3.4.16 secondary in yellow. This clearly looks like a huge performance regression to me, and it also means we can't even risk putting 3.6.13 in production, as it'll probably die the second we do it, don't you think? You have the diagnostic.data already uploaded for these 2 servers during this operation in |
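For reference, the counters discussed above are all exposed by serverStatus. Below is a minimal sketch (Python/pymongo; the host name and polling interval are assumptions, and WiredTiger counter names can vary slightly between versions) of sampling the cache-used percentage and the pages-queued-for-eviction rate:

```python
# Hypothetical sketch: poll serverStatus and print the WiredTiger cache /
# eviction counters discussed above. The host name and exact counter keys
# are assumptions; key names vary slightly between MongoDB versions.
import time
import pymongo

client = pymongo.MongoClient("mongodb://mongo1.s2:27017")  # hypothetical host

prev = None
while True:
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    used_pct = 100.0 * cache["bytes currently in the cache"] / cache["maximum bytes configured"]
    queued = cache["pages queued for eviction"]          # cumulative counter
    if prev is not None:
        rate = (queued - prev) / 10.0                    # pages queued per second
        print(f"cache used: {used_pct:5.1f}%  pages queued for eviction/s: {rate:8.0f}")
    prev = queued
    time.sleep(10)
```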
| Comments |
| Comment by Alexander Gorrod [ 10/Jun/20 ] |
|
That is great news - thanks for circling back around adrien.jarthon@ringcentral.com |
| Comment by Adrien Rey-Jarthon [ 09/Jun/20 ] |
|
I have good news for you: we were recently able to give 4.0.18 a try (to try to mitigate another issue, which turned out to be even worse: |
| Comment by Driss Tahraoui--Maldague [ 04/Dec/19 ] |
|
Hi! I'm working with Adrien and we recently put MongoDB 3.6.15 through the same test we used for the other versions. Just as a reminder, here's exactly what we did:
Please note that during this test the node that was using version 3.6.15 was secondary and was hidden.
Since it behaved much better than 3.6.13 with the same test, we were confident enough to try it as primary in production, so we did it on a Sunday afternoon when we had less traffic than usual, and here's what happened: as soon as our 3.6.15 node was elected primary, the cache used reached 87.5% pretty fast and stayed there for a really long time (~40 minutes). The cache activity was super high BUT the pages queued for eviction were too low compared to the cache pressure we had at that moment. You can see how the 3.4.16 node responded to the (almost) same cache pressure when we put it back as primary. The IOs were also heavily used during that time (I assume this is a consequence of

It shows that the eviction performance regression is still present in 3.6.15, but it's mitigated by the fact that from 87.5% cache used MongoDB no longer caches read pages and relies on disk IOs instead.

We have recurrent jobs performed each hour at around :20 that create significant cache pressure (which 3.4.16 handles easily, whereas 3.6.15 cannot keep its cache usage at 80% during that time). Here are some graphs showing occurrences of these recurrent jobs and their effect on a 3.4.16 primary node (in yellow) and on our 3.6.15 node when it was primary (in green):

An interesting thing to look at as well is the IOs during these cache usage spikes (the flat purple line is 3.4.16, the blue one is 3.6.15):

The good news is that we were able to use MongoDB 3.6.8 as primary in production as of yesterday and it behaves way better than more recent 3.6 versions (it also confirms what Adrien said in a previous comment: the cache eviction performance regression was introduced in 3.6.9). Here are some graphs showing that 3.6.8 behaves as well as 3.4.16 did (the primary was 3.4.16 until we switched to 3.6.8 on 12/3 at 11:00 PM):
|
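The 80% / 87.5% / 95% figures in the comment above line up with WiredTiger's cache fill thresholds. As a hedged illustration only (the eviction_target / eviction_trigger defaults used below are assumptions and can differ per version or tuning), here is a small sketch that compares the current fill and dirty ratios against those thresholds:

```python
# Hedged sketch: compare the cache fill and dirty ratios against WiredTiger's
# commonly documented default eviction thresholds. These default values are an
# assumption and may differ between versions or custom tunings.
import pymongo

EVICTION_TARGET = 0.80   # background eviction tries to keep fill below this
EVICTION_TRIGGER = 0.95  # application threads start evicting above this
DIRTY_TARGET = 0.05
DIRTY_TRIGGER = 0.20

client = pymongo.MongoClient("mongodb://mongo1.s2:27017")  # hypothetical host
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

max_bytes = cache["maximum bytes configured"]
fill = cache["bytes currently in the cache"] / max_bytes
dirty = cache["tracked dirty bytes in the cache"] / max_bytes

print(f"fill ratio:  {fill:.1%} (target {EVICTION_TARGET:.0%}, trigger {EVICTION_TRIGGER:.0%})")
print(f"dirty ratio: {dirty:.1%} (target {DIRTY_TARGET:.0%}, trigger {DIRTY_TRIGGER:.0%})")
if fill >= EVICTION_TRIGGER:
    print("application threads are likely being drafted into eviction")
```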
| Comment by Danny Hatcher (Inactive) [ 29/Oct/19 ] |
|
I had been doing some testing of my own on the difference between 3.6.8 and 3.6.9, and that picture was part of a question I was asking internally. Unfortunately, while comments can be made private, pictures cannot, which can cause a lack of context. In that screenshot we can see a distinct difference in how WiredTiger is treating eviction, but there was a large package of WiredTiger-related changes between those two versions (122 in 3.6.9). The specific ticket(s) that triggered the change in patterns I saw may not be the same as yours, so I did not want to present the screenshot as absolute proof of something wrong. I believe Brian has the right tack here; if we focus our investigation on the other large set of changes in 3.6.14 (and the upcoming 3.6.15), we can make progress that gives you the best performance moving forward rather than trying to recreate decent performance in the past. |
| Comment by Adrien Rey-Jarthon [ 28/Oct/19 ] |
|
Thanks. I saw a new chart uploaded showing 3.6.8 and 3.6.9 without any comment; is there anything you noticed here? Are you not looking at the changes between these two versions, which seem to have introduced most of the regression? We're still testing other versions and will be able to provide more feedback later; we'll add 3.6.15 to the test when it's available! |
| Comment by Brian Lane [ 28/Oct/19 ] |
|
Thanks for the nice writeup of this issue. We have done some recent performance optimizations with eviction in newer releases, which are starting to show up in some of the 3.6.x releases now. 3.6.14 had some of these optimizations, and there are a couple more targeting the 3.6.15 release. I would be interested in getting your feedback on any testing you could do once 3.6.15 is released. -Brian |
| Comment by Adrien Jarthon [ 13/Sep/19 ] |
|
Hi again. Today I gave 3.6.8 a quick try with a small index creation and noticed it went pretty well, with no significant increase in cache use, so similar behavior to 3.4.16 for me, which would mean the change is between 3.6.8 and 3.6.13. Then I tried 3.6.9, because there's a lot of changes in WT between the two, and I already noticed quite a visible difference:
In both these tests I did the same thing:
3.6.9 is creating the index a bit faster than 3.6.8, but the eviction can't keep up and the cache size increases. Not by a lot in this example, because it's a small index, but you can imagine that with a bigger one it looks like my previous examples. I wanted to try with a bigger index after that to show you more data, but unfortunately it killed my primary, which couldn't keep up with the index creation and live traffic at the same time (not really what I call "background" index creation, but that's another story ☺). I'll make more tests; if you think of particular things I should try, let me know, but it already looks to me like at least a good part of the regression is between 3.6.8 and 3.6.9. |
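For illustration, a sketch of the kind of test described in the comments above: build an index with background:true (pre-4.2 semantics) through the replica set while sampling cache usage on the 3.6 secondary. Host, database, collection, and field names are all hypothetical, and directConnection requires a recent PyMongo:

```python
# Hypothetical sketch of the index-creation test: create a background index
# through the replica set and watch the 3.6 secondary's cache fill while it
# replicates the build. All hosts, names and intervals are assumptions.
import threading
import time
import pymongo

rs_client = pymongo.MongoClient("mongodb://mongo2.s2,mongo1.s2/?replicaSet=s2")
secondary = pymongo.MongoClient("mongodb://mongo1.s2:27017", directConnection=True)

coll = rs_client["appdb"]["events"]   # hypothetical 5M-document collection

def watch_cache(stop_event):
    # Sample the secondary's cache fill while the index build replicates.
    while not stop_event.is_set():
        c = secondary.admin.command("serverStatus")["wiredTiger"]["cache"]
        pct = 100.0 * c["bytes currently in the cache"] / c["maximum bytes configured"]
        print(f"secondary cache used: {pct:.1f}%")
        time.sleep(5)

stop = threading.Event()
watcher = threading.Thread(target=watch_cache, args=(stop,))
watcher.start()
try:
    # background=True matters on 3.6 servers; it is ignored by 4.2+.
    coll.create_index([("created_at", pymongo.ASCENDING)], background=True)
finally:
    stop.set()
    watcher.join()
```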
| Comment by Adrien Jarthon [ 12/Sep/19 ] |
Yes, of course it has much bigger cache activity after the switch because the cache wasn't warm, though these numbers stay reasonable for this machine and we've seen much bigger numbers on 3.4 with no eviction issue; for example, this is what we have every weekday on our 3.4 primary:
I don't have the diagnostic data for these tests any more, but I still have Grafana instrumentation (not for long, 45 days retention) to which I could add the "pages requested from cache" metric. Here it is for 3.6:
So with this new number I took more screenshots of previous events, if it is of any help. And this is the August 11th event when we tried 3.6 as primary: I think we'll try installing other intermediary versions like 3.6.8 to see if we can reproduce; that would probably help reduce the diff to find the cause. I have one way to reproduce this easily on our production secondary (which is to create indexes), but unfortunately when I tried to reproduce this with my script it didn't show the same effect, so I'm not sure how else I can manage to reduce the search area? |
| Comment by Bruce Lucas (Inactive) [ 10/Sep/19 ] |
|
Hi Adrien,

Thanks for uploading the new data and sorry for the delay. In the new data there is considerable cache pressure on the 3.6 node, seen in "pages read into cache", that was not present before it was primary nor on the 3.4 node when it was primary, probably because a node that has been secondary for an extended period will have to warm the cache for the primary query load when it becomes primary. This difference in conditions between the 3.4 node and the 3.6 node makes it a little difficult to know whether the issue seen in the Aug 11th data on 3.6 was due to this higher cache pressure, or whether it was due to a difference between 3.4 and 3.6.

However, there is one possibly interesting difference between 3.4 and 3.6 that we saw both in the Aug 11th and in the Aug 7th data: the 3.6 node, in spite of servicing a generally lower workload (presumably because of the high cache fill ratio), showed a much higher rate (> 2x) of "pages requested from cache". We don't know the reason for this, but it seems possible that this could cause a difference in cache eviction efficiency.

You mentioned above some tests with different workloads where no difference was seen between 3.6 and 3.4. Do you still have the diagnostic.data for those tests that you can upload, together with details (where and when, exactly)? We can check that to see if the hypothesis about a larger "pages requested from cache" being involved is supported by that data.

In addition, do you think it would be possible to add "pages requested from cache" to your monitoring and do some further experiments to see if a higher value of "pages requested from cache" is well correlated with the lower eviction efficiency that you are seeing? It would be ideal if it were possible to isolate a specific workload that triggered this behavior, such that we could reproduce it internally, although I understand that could be difficult.

Thanks, |
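A hedged sketch of the experiment suggested here: sample the two cumulative counters, convert them to per-interval rates, and compute their correlation. The exact counter key names and the use of "pages queued for eviction" as an eviction-efficiency proxy are assumptions:

```python
# Hedged sketch: sample two cumulative WiredTiger counters, turn them into
# per-interval rates, and check how well they correlate. The counter names
# and the choice of "pages queued for eviction" as an efficiency proxy are
# assumptions; adjust for the keys your MongoDB version actually reports.
import time
import pymongo

client = pymongo.MongoClient("mongodb://mongo1.s2:27017")  # hypothetical host

def sample():
    c = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    return c["pages requested from the cache"], c["pages queued for eviction"]

requested_rates, queued_rates = [], []
prev = sample()
for _ in range(60):                      # ~10 minutes of 10-second samples
    time.sleep(10)
    cur = sample()
    requested_rates.append((cur[0] - prev[0]) / 10.0)
    queued_rates.append((cur[1] - prev[1]) / 10.0)
    prev = cur

# Pearson correlation, computed by hand to avoid extra dependencies.
n = len(requested_rates)
mx = sum(requested_rates) / n
my = sum(queued_rates) / n
cov = sum((x - mx) * (y - my) for x, y in zip(requested_rates, queued_rates))
sx = sum((x - mx) ** 2 for x in requested_rates) ** 0.5
sy = sum((y - my) ** 2 for y in queued_rates) ** 0.5
print("correlation:", cov / (sx * sy) if sx and sy else float("nan"))
```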
| Comment by Adrien Jarthon [ 05/Sep/19 ] |
|
Sorry to bump this but did you start looking into it? |
| Comment by Adrien Jarthon [ 12/Aug/19 ] |
|
If it is of any help, we also noticed this eviction performance issue at the end of the mongodump; for example, this morning at 8am (UTC+2) we can clearly see it on mongo1.s2: Whereas on 3.4.16 there is no visible issue, neither on mongo2.s1 nor on mongo1.s2 when it was running 3.4.16. |
| Comment by Adrien Jarthon [ 12/Aug/19 ] |
|
Hi again, I'm bringing more bad news. Yesterday we tried switching the 3.6.13 node to primary, to see if this issue was maybe secondary-specific and whether it would work fine as primary, but it didn't: as we expected, it quickly rose to 95% cache usage and stayed there without being able to evict fast enough, while performance was degrading (app threads doing eviction). After 10 minutes we rolled back to 3.4.13. Here's how it looked in our Grafana: So the 3.6.13 node (mongo1.s2) was primary between 10:46 and 10:57 (UTC+2). |
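For reference, a switchover like the one described above can be driven by asking the current primary to step down; here is a hedged sketch (host names, and the assumption that the 3.6.13 node then wins the election, are ours):

```python
# Hypothetical sketch of the switchover: ask the current primary to step down
# so the other node can be elected. replSetStepDown closes all connections,
# so pymongo raising a connection error here is expected, not a failure.
import pymongo
from pymongo.errors import AutoReconnect, ConnectionFailure

# Direct connection to the node that is currently primary (assumed host).
primary = pymongo.MongoClient("mongodb://mongo2.s2:27017", directConnection=True)
try:
    # Step down for 600s, during which this node will not seek re-election.
    primary.admin.command("replSetStepDown", 600)
except (AutoReconnect, ConnectionFailure):
    pass  # expected: the primary drops connections when it steps down

# ...observe the 3.6.13 node as primary, then step it down again to roll back.
```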
| Comment by Adrien Jarthon [ 09/Aug/19 ] |
|
For the record, I've done more tests to try to narrow down the possible explanations for this performance issue. Here are some charts showing the cache numbers during my benchmark for 3.4.16: This test was with a very small max cache size (1G); I also tried with 50G but the results were similar.
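As an illustration of a small-cache benchmark setup like the one mentioned above (the paths, port, and use of a throwaway instance are assumptions; --wiredTigerCacheSizeGB is the flag that caps the WiredTiger cache):

```python
# Hypothetical sketch: start a throwaway mongod with a deliberately small
# WiredTiger cache (1 GB) for the benchmark. The dbpath, port and log path
# are assumptions made for this example.
import os
import subprocess

os.makedirs("/tmp/wt-bench", exist_ok=True)
mongod = subprocess.Popen([
    "mongod",
    "--dbpath", "/tmp/wt-bench",
    "--port", "27117",
    "--wiredTigerCacheSizeGB", "1",
    "--logpath", "/tmp/wt-bench/mongod.log",
])
# ...run the index-creation benchmark against port 27117, then:
# mongod.terminate(); mongod.wait()
```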
Then I also tried reproducing the issue on our production 3.6 secondary but without the queries (to eliminate that from the equation), so I marked the secondary as hidden and re-created one index on a 5M-record collection to compare whether the impact and eviction speed would be the same, and yes, it's the same. With secondary query load (yesterday): Without secondary query load (today): The longer tail visible on the first chart is actually a couple more indexes that were created just after, but basically the shape is exactly the same for the first big index creation we can see here (basically the first spike in cache activity). So these 2 tests confirm IMO that:
So maybe something related to the RS mechanism? The replication load? Something secondary-specific? |
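For completeness, a hedged sketch of the "mark the secondary as hidden" step used in the test above, via replSetGetConfig / replSetReconfig (the member host name and the client pointing at the current primary are assumptions):

```python
# Hedged sketch of hiding one replica-set member for an isolated test:
# fetch the config, flag the member as hidden with priority 0, and reconfig.
# The host names are assumptions; replSetReconfig must run on the primary.
import pymongo

client = pymongo.MongoClient("mongodb://mongo2.s2:27017")   # current primary, hypothetical
cfg = client.admin.command("replSetGetConfig")["config"]

for member in cfg["members"]:
    if member["host"].startswith("mongo1.s2"):              # the 3.6 secondary under test
        member["hidden"] = True
        member["priority"] = 0

cfg["version"] += 1                                          # reconfig requires a version bump
client.admin.command("replSetReconfig", cfg)
```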
| Comment by Adrien Jarthon [ 08/Aug/19 ] |
|
Sure! I've just uploaded the log files for mongo1.s2 and mongo2.s1 for August 7th. OK, no problem for the new portal! |
| Comment by Danny Hatcher (Inactive) [ 08/Aug/19 ] |
|
Hi Adrien,

There is a difference in eviction patterns between your 3.4 and 3.6 servers, but we'll have to dig into it more to determine the cause. Could you please upload the logs from the two servers covering your entire index creation test time? I'd like to compare specific timestamps. You can use this unique portal for this ticket so we can keep files distinct, as the investigation paths between this and |