[SERVER-42698] WT Cache eviction server performance regression in 3.6.13 (compared to 3.4.16) Created: 08/Aug/19  Updated: 27/Oct/23  Resolved: 29/Jul/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.13
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Adrien Jarthon Assignee: Rachelle Palmer
Resolution: Gone away Votes: 5
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 1s2_v3.6.15_B-C.png     PNG File 2s2_v3.4.16_A-B.png     PNG File 2s2_v3.4.16_C-D.png     PNG File Screen Shot 2019-08-08 at 3.55.23 PM.png     PNG File Screen Shot 2019-10-23 at 12.44.55 PM.png     PNG File Screenshot 2019-12-06 at 15.15.40.png     PNG File diag_A-B-C-D.png     PNG File eviction.png     PNG File image-2019-08-07-18-30-35-724.png     PNG File image-2019-08-08-10-46-36-023.png     PNG File image-2019-08-09-13-24-05-160.png     PNG File index_creation_mongo_3_6_15_secondary.png     PNG File mongo3_6_15_primary_cache.png     PNG File mongo3_6_15_primary_ios.png     PNG File mongo_3_4_16_vs_3_6_15_cache_cron.png     PNG File mongo_3_4_16_vs_3_6_15_ios_cron.png     PNG File mongo_3_6_8_cache.png     PNG File mongo_3_6_8_cache2.png     PNG File screenshot-1.png     PNG File screenshot-10.png     PNG File screenshot-2.png     PNG File screenshot-2.png     PNG File screenshot-3.png     PNG File screenshot-3.png     PNG File screenshot-4.png     PNG File screenshot-5.png     PNG File screenshot-6.png     PNG File screenshot-7.png     PNG File screenshot-8.png     PNG File screenshot-9.png    
Issue Links:
Related
related to SERVER-42256 Mongo unable to evict cache Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:
Case:

 Description   

While investigating SERVER-42256 we tried 3.6.13 as suggested (only on a secondary for now) and so far we have noticed a memory leak (SERVER-42662) and what we believe is a huge cache eviction server performance regression.

We added some charts to our monitoring for this since we're hitting the limits of the eviction server (the single-threaded one that fills the eviction queue, apparently). On one of our primaries on 3.4.13 we regularly hit 30k+ pages queued for eviction per second; at that point the server is pretty much saturated and can't fill the queue any faster, so cache usage climbs to 95% and application threads start evicting (this is SERVER-42256):

So we then put a secondary on 3.6.13; nothing bad so far, except that we already see some big cache used spikes with almost no traffic (only mongodump), as mentioned in https://jira.mongodb.org/browse/SERVER-42256?focusedCommentId=2354831&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-2354831

But more recently we started creating new indexes on our 2 replica sets (one is 3.4.16 only, the other has its secondary on 3.6.13) and that's when disaster struck: the 3.6.13 secondary was queuing pages for eviction more than 10 times slower (it looks capped at ~3k pages/sec) even though the pressure generated by the index creation was bigger, which means 95%+ cache use pretty quickly, whereas the 3.4.16 secondary performed correctly.

We can see this well on this chart, where we tried creating a couple of indexes one by one on one RS and then the other:

Here the 3.6.13 secondary is shown in green and the 3.4.16 secondary in yellow.
At 17:43 and 17:45 for example we created 2 identical indexes on similar collections, generating approximately the same pressure on the cache (in MB/s read), and we can see that the 3.4.16 secondary managed to queue pages for eviction at the same speed and thus kept the cache at 80% without trouble, whereas the 3.6.13 server looks capped at ~3k pages/sec and therefore evicts much slower than the cache fills up.

This clearly looks like a huge performance regression to me, and it also means we can't even risk putting 3.6.13 in production, as it'll probably die the second we do, don't you think?

The diagnostic.data for these 2 servers during this operation is already uploaded in SERVER-42662; it was yesterday (August 7th) at 17:40 UTC+2.
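For reference, the "pages queued for eviction" and cache fill numbers in these charts come from serverStatus. Here is a minimal sketch of the kind of sampling our monitoring does (illustrative only, not our exact collection script; it assumes a local mongod, the Ruby driver, and the WiredTiger stat names we see on 3.4/3.6, which may vary slightly between versions):

#!/usr/bin/env ruby
require 'mongo'

client = Mongo::Client.new([ 'localhost:27017' ])

loop do
  cache = client.database.command(serverStatus: 1).documents.first['wiredTiger']['cache']
  fill  = 100.0 * cache['bytes currently in the cache'] / cache['maximum bytes configured']
  puts format('%s fill=%.1f%% queued=%d app_evicted=%d',
              Time.now, fill,
              cache.fetch('pages queued for eviction', 0),
              cache.fetch('pages evicted by application threads', 0))
  sleep 1 # counters are cumulative since startup; the charts plot per-second deltas
end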



 Comments   
Comment by Alexander Gorrod [ 10/Jun/20 ]

That is great news - thanks for circling back around, adrien.jarthon@ringcentral.com

Comment by Adrien Rey-Jarthon [ 09/Jun/20 ]

I have good news: we recently were able to give 4.0.18 a try (to mitigate another issue which turned out to be even worse: WT-5479) and we did not see any eviction performance issue on this version, so it looks like the regression introduced in 3.6.9 and reported here (which was still present in 3.6.15) is NOT present in 4.0.18.

Comment by Driss Tahraoui--Maldague [ 04/Dec/19 ]

Hi! I'm working with Adrien and we recently put mongo 3.6.15 through the same test we used for other versions:

Just as a reminder, here's exactly what we did:

  • Create an index on a medium collection (~6M records)
  • Create an index on a bigger collection (~21M records)
  • Launch mongodump and let it run for a while (~15 minutes)

Please note that during this test the node running 3.6.15 was a hidden secondary.
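In terms of commands, the test boils down to roughly this (sketch only; the collection names, the index key and the dump destination are placeholders, not our real ones, and the Ruby driver is used for consistency with the benchmark script posted in an earlier comment):

#!/usr/bin/env ruby
require 'mongo'

client = Mongo::Client.new([ 'localhost:27017' ])
db = client.use('database_name') # placeholder database name

# 1) index on the medium (~6M records) collection, then 2) on the bigger (~21M records) one
[:medium_collection, :big_collection].each do |coll|
  puts "#{Time.now}: creating index on #{coll}"
  db[coll].indexes.create_one({ texts: 1 }, background: true)
end

# 3) mongodump to generate sustained cache pressure (it ran for ~15 minutes in our case)
system('mongodump', '--host', 'localhost:27017', '--out', '/tmp/dump')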
Here are the cache-related graphs during our test:

 
We can see that during both index creations the cache used goes over 80% without eviction working hard enough to keep it at 80% (like it did in 3.4.16).
The interesting part happened during the mongodump (the last big cache used bump):

  • As soon as it started, the cache used skyrocketed to 87.5% and didn't go back to 80% until I stopped it
  • The cache activity was pretty high too, but eviction wasn't keeping up (not enough to keep the cache used at 80%)

 

 

Since it behaved much better than 3.6.13 on the same test, we were confident enough to try it as primary in production. We did it on a Sunday afternoon, when we had less traffic than usual, and here's what happened:

As soon as our 3.6.15 node was elected primary, the cache used reached 87.5% pretty fast and stayed there for a really long time (~40 minutes). The cache activity was super high, BUT the rate of pages queued for eviction was too low compared to the cache pressure we had at that moment. You can see how the 3.4.16 node responded to (almost) the same cache pressure when we put it back as primary.

The IOs were also heavily used during that time (I assume this is a consequence of WT-4869 which, from what I understood, prevents read pages from being cached when cache used is at 87.5% or more).

This shows that the eviction performance regression is still present in 3.6.15, but it's mitigated by the fact that from 87.5% cache used MongoDB no longer caches read pages and relies on disk IOs instead.
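For what it's worth, 87.5% is exactly the midpoint between WiredTiger's default eviction_target (80%) and eviction_trigger (95%), which would be consistent with that threshold. Here is a small sketch of how we flag the cache fill ratio against those levels (Ruby driver assumed; the threshold semantics are our reading of WT-4869, not something we've verified in the WiredTiger code):

#!/usr/bin/env ruby
require 'mongo'

# WiredTiger defaults: eviction_target = 80%, eviction_trigger = 95%;
# 87.5% is exactly the midpoint between the two.
EVICTION_TARGET  = 80.0
EVICTION_TRIGGER = 95.0
MIDPOINT = (EVICTION_TARGET + EVICTION_TRIGGER) / 2 # => 87.5

client = Mongo::Client.new([ 'localhost:27017' ])
cache  = client.database.command(serverStatus: 1).documents.first['wiredTiger']['cache']
fill   = 100.0 * cache['bytes currently in the cache'] / cache['maximum bytes configured']

if fill >= EVICTION_TRIGGER
  puts format('%.1f%% cache used: above eviction_trigger, application threads start evicting', fill)
elsif fill >= MIDPOINT
  puts format('%.1f%% cache used: above 87.5%%, where our 3.6.15 node plateaus', fill)
elsif fill >= EVICTION_TARGET
  puts format('%.1f%% cache used: above eviction_target, the eviction server should be working', fill)
else
  puts format('%.1f%% cache used: below eviction_target', fill)
end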

We have recurring jobs that run every hour at around :20 and create significant cache pressure (which 3.4.16 handles easily, whereas 3.6.15 cannot keep its cache used at 80% during that time). Here are some graphs showing occurrences of these recurring jobs and their effect on a 3.4.16 primary node (in yellow) and on our 3.6.15 node when it was primary (in green):

An interesting thing to look at as well is the IOs during these cache usage spikes (the flat purple line is 3.4.16, the blue one is 3.6.15):

The good news is that we were able to put mongo 3.6.8 as primary in production as of yesterday and it behaves much better than the more recent 3.6 versions (it also confirms what Adrien said in a previous comment: the cache eviction performance regression was introduced in 3.6.9). Here are some graphs showing that 3.6.8 behaves as well as 3.4.16 did (the primary was 3.4.16 until we switched to 3.6.8 on 12/3 at 11:00 PM):


We also noticed that the pages requested from cache almost doubled between 3.4.16 and 3.6.8, is that normal? It didn't degrade the cache eviction performance, but we don't understand why there's such a big difference.
I'm currently uploading diagnostic data for both nodes (the 3.6.15 one and the 3.4.16 one) to the portal you provided in the first comment.

Comment by Danny Hatcher (Inactive) [ 29/Oct/19 ]

I had been doing some testing of my own on the difference between 3.6.8 and 3.6.9, and that picture was part of a question I was asking internally. Unfortunately, while comments can be made private, pictures cannot, which can cause a lack of context. In that screenshot we can see a distinct difference in how WiredTiger is treating eviction, but there was a large package of WiredTiger-related changes between those two versions (122 in 3.6.9). The specific ticket(s) that triggered my change in patterns may not be the same as yours, so I did not want to present the screenshot as absolute proof of something wrong. I believe Brian has the right tack here; if we focus our investigation on the other large set of changes in 3.6.14 (and the upcoming 3.6.15), we can make progress that gives you the best performance moving forward, rather than trying to recreate decent performance in the past.

Comment by Adrien Rey-Jarthon [ 28/Oct/19 ]

Thanks, I saw a new chart uploaded showing 3.6.8 and 3.6.9 without a comment; is there anything you noticed here? Are you not looking at the changes between these two versions, which seem to have introduced most of the regression? We're still testing other versions and will be able to provide more feedback later; we'll add 3.6.15 to the test when it's available!

Comment by Brian Lane [ 28/Oct/19 ]

Hi bigbourin@gmail.com,

Thanks for the nice writeup of this issue. We have made some recent performance optimizations to eviction in newer releases, and these are starting to show up in some of the 3.6.x releases now. 3.6.14 had some of these optimizations, and there are a couple more targeting the 3.6.15 release. I would be interested in your feedback on any testing you could do once 3.6.15 is released.

-Brian

Comment by Adrien Jarthon [ 13/Sep/19 ]

Hi again, today I gave 3.6.8 a quick try with a small index creation and it went pretty well, with no significant increase in cache use, so similar behavior to 3.4.16 for me, which would mean the change is between 3.6.8 and 3.6.13. Then I tried 3.6.9, because there are a lot of WT changes between the two, and I already noticed quite a visible difference:

(charts: 3.6.8 vs 3.6.9)

In both of these tests I did the same thing:

  • The secondary was hidden but still had production replication traffic.
  • It was then restarted on version 3.6.x.
  • I then ran a mongodump on it to fill the cache up to 80%.
  • An existing index was dropped and re-created on the primary; the spikes visible in these charts are the index creation on the secondary.

3.6.9 creates the index a bit faster than 3.6.8, but eviction can't keep up and cache usage increases. Not by a lot in this example, because it's a small index, but you can imagine that with a bigger one it looks like my previous examples. I wanted to try with a bigger index after that to show you more data, but unfortunately it killed my primary, which couldn't keep up with the index creation and live traffic at the same time (not really what I'd call "background" index creation, but that's another story ☺)

I'll run more tests; if you think of particular things I should try, let me know, but to me it already looks like at least a good part of the regression lies between 3.6.8 and 3.6.9.

Comment by Adrien Jarthon [ 12/Sep/19 ]

In the new data there is considerable cache pressure on the 3.6 node, seen in "pages read into cache", that was not present before it was primary nor on the 3.4 node when it was primary, probably because a node that has been secondary for an extended period will have to warm the cache for the primary query load when it becomes primary. This difference in conditions between the 3.4 node and the 3.6 node makes it a little difficult to know whether the issue seen in the Aug 11th data on 3.6 was due to this higher cache pressure, or whether it was due to a difference between 3.4 and 3.6.

Yes, of course there is much bigger cache activity after the switch because the cache wasn't warm, but these numbers stay reasonable for this machine and we've seen much bigger ones on 3.4 with no eviction issue; for example, this is what we see every weekday on our 3.4 primary:

The cache read activity here stays, for hours, above the levels 3.6 couldn't cope with for even 5 minutes.

You mentioned above some tests with different workloads where no difference was seen between 3.6 and 3.4. Do you still have the diagnostic.data for those tests that you can upload, together with details (where and when, exactly)? We can check that to see if the hypothesis about a larger "pages requested from cache" being involved is supported by that data.

I don't have the diagnostic data for these tests any more, but I still have Grafana instrumentation (not for long, 45-day retention) to which I could add "pages requested from cache". Here it is for 3.6:

And for 3.4:

The numbers look similar, but since my test failed to reproduce the issue we have in production, that's kind of expected.

In addition, do you think it would be possible to add "pages requested from cache" to your monitoring and do some further experiments to see if a higher value of "pages requested from cache" is well correlated with the lower eviction efficiency that you are seeing? It would be ideal if it were possible to isolate a specific workload that triggered this behavior, such that we could reproduce it internally, although I understand that could be difficult.

So with this new metric I took more screenshots of previous events, in case it's of any help.
Here is the first event (August 7th), comparing index creation on the 3.4 secondary and the 3.6 secondary:

And this is the August 11th event, when we tried 3.6 as primary:

I think we'll try installing other intermediate versions like 3.6.8 to see if we can reproduce; that would probably help narrow down the diff to find the cause. I have one way to reproduce this easily on our production secondary (creating indexes), but unfortunately when I tried to reproduce it with my script it didn't show the same effect, so I'm not sure how else I can narrow the search area.

Comment by Bruce Lucas (Inactive) [ 10/Sep/19 ]

Hi Adrien,

Thanks for uploading the new data and sorry for the delay.

In the new data there is considerable cache pressure on the 3.6 node, seen in "pages read into cache", that was not present before it was primary nor on the 3.4 node when it was primary, probably because a node that has been secondary for an extended period will have to warm the cache for the primary query load when it becomes primary. This difference in conditions between the 3.4 node and the 3.6 node makes it a little difficult to know whether the issue seen in the Aug 11th data on 3.6 was due to this higher cache pressure, or whether it was due to a difference between 3.4 and 3.6.

However there is one possibly interesting difference between 3.4 and 3.6 that we saw both in the Aug 11th and in the Aug 7th data: the 3.6 node, in spite of servicing a generally lower workload (presumably because of the high cache fill ratio), showed a much higher rate (> 2x) of "pages requested from cache". We don't know the reason for this, but it seems possible that this could cause a difference in cache eviction efficiency.

You mentioned above some tests with different workloads where no difference was seen between 3.6 and 3.4. Do you still have the diagnostic.data for those tests that you can upload, together with details (where and when, exactly)? We can check that to see if the hypothesis about a larger "pages requested from cache" being involved is supported by that data.

In addition, do you think it would be possible to add "pages requested from cache" to your monitoring and do some further experiments to see if a higher value of "pages requested from cache" is well correlated with the lower eviction efficiency that you are seeing? It would be ideal if it were possible to isolate a specific workload that triggered this behavior, such that we could reproduce it internally, although I understand that could be difficult.
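Something along these lines would be enough; this is only a sketch using the Ruby driver (since that is what your benchmark script uses), pointed at whichever node you are monitoring, and the counter is cumulative, so the rate comes from the delta between two samples:

require 'mongo'

client = Mongo::Client.new([ 'localhost:27017' ])

pages_requested = lambda do
  status = client.database.command(serverStatus: 1).documents.first
  status['wiredTiger']['cache']['pages requested from cache'].to_i
end

# the counter is cumulative since startup, so sample it twice and take the delta
before = pages_requested.call
sleep 10
after = pages_requested.call
puts "pages requested from cache: #{(after - before) / 10.0}/s"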

Thanks,
Bruce

Comment by Adrien Jarthon [ 05/Sep/19 ]

Sorry to bump this but did you start looking into it?

Comment by Adrien Jarthon [ 12/Aug/19 ]

If it is of any help, we also noticed this eviction performance issue at the end of the mongodump; for example this morning at 8am (UTC+2) we can clearly see it on mongo1.s2:

Whereas on 3.4.16 there is no visible issue on mongo2.s1, nor even on mongo1.s2 when it was running 3.4.16.
You already have the logs and diagnostic data covering many samples of this event (it's a daily dump).

Comment by Adrien Jarthon [ 12/Aug/19 ]

Hi again, I'm bringing more bad news: yesterday we tried switching the 3.6.13 node to primary, to see whether this issue was secondary-specific and it would work fine as primary, but it didn't. As we expected, it quickly rose to 95% cache usage and stayed there, unable to evict fast enough while performance degraded (app threads doing eviction). After 10 minutes we rolled back to 3.4.13.

Here's how it looked in our grafana:

So the 3.6.13 node (mongo1.s2) was primary between 10:46 and 10:57 (UTC+2).
I uploaded the new diagnostic data archive (for all 4 servers, but you're only interested in mongo1.s2 and mongo2.s2 here; *.s1 is a different RS) and the server logs for the two servers during this day.
I think this pretty much rules out a "secondary only" regression ☹

Comment by Adrien Jarthon [ 09/Aug/19 ]

For the record, I've done more tests to try to narrow down the possible explanations for this performance issue.
First, on our staging server, I tried deleting and re-creating some indexes (while the cache was already full, to force evictions) on multiple collections in parallel to create more pressure, but in these tests both versions (3.4.16 and 3.6.13) showed similar performance (good, but no help in pinpointing the issue).

Here are some charts showing the cache numbers during my benchmark for 3.4.16:

3.6.13:

This test was done with a very small max cache size (1G); I also tried with 50G but the results were similar.
Here is the script I used to start the 3 index creations in parallel (10s between each start):

#!/usr/bin/env ruby
 
require 'mongo'
client = Mongo::Client.new([ 'localhost:27017' ])
db = client.use('database_name') # placeholder database name
 
# 3 big collections (13G) identical in content
threads = [
  [:perftest1, {texts: 1}],
  [:perftest2, {texts: 1}],
  [:perftest3, {texts: 1}],
].each_with_index.map do |(collection, index), n|
  Thread.new {
    sleep 7 + n * 10 # stagger the three builds by 10s
    puts "#{Time.now}: re-creating index #{index} on #{collection} in background"
    db[collection].indexes.drop_one( index ) rescue nil # ignore the error if the index doesn't exist yet
    db[collection].indexes.create_one( index, background: true )
    puts "#{Time.now}: index #{index} on #{collection} finished"
  }
end
threads.each(&:join) # wait for the index builds so the script doesn't exit early

Then I also tried reproducing the issue on our production 3.6 secondary but without the queries (to eliminate them from the equation): I marked the secondary as hidden and re-created one index on a 5M-record collection to see whether the impact and eviction speed would be the same, and yes, they are.

With secondary query load (yesterday):

Without secondary query load (today):

The longer tail visible on the first chart is actually a couple more indexes that were created just after, but the shape is exactly the same for the first big index creation we can see here (the first spike in cache activity).
Of course, before that spike the index is created on the primary (yellow on the left chart) with absolutely no visible impact on cache use (3.4.16).

So these 2 tests confirm, IMO, that:

  1. The issue is not linked to our workload on the secondary
  2. It's not a global "WiredTiger eviction is slower in 3.6" problem, as it looked quite OK in isolated testing.

So maybe something related to the replica set mechanism? The replication load? Something secondary-specific?

Comment by Adrien Jarthon [ 08/Aug/19 ]

Sure! I've just uploaded the log files for mongo1.s2 and mongo2.s1 for August 7th. OK, no problem for the new portal!

Comment by Danny Hatcher (Inactive) [ 08/Aug/19 ]

Hi Adrien,

There is a difference in eviction patterns between your 3.4 and 3.6 servers but we'll have to dig into it more to determine the cause. Could you please upload the logs from the two servers covering your entire index creation test time? I'd like to compare specific timestamps.

You can use this unique portal for this ticket so we can keep files distinct as the investigation paths between this and SERVER-42662 diverge. I've already copied the diagnostics so you don't have to worry about them.
