[SERVER-42256] Mongo unable to evict cache Created: 17/Jul/19  Updated: 27/Oct/23  Resolved: 26/Jul/21

Status: Closed
Project: Core Server
Component/s: Performance, Stability, WiredTiger
Affects Version/s: 3.4.16
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sébastien Puyet Assignee: Edwin Zhou
Resolution: Gone away Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screenshot from 2019-07-17 14-12-00.png     PNG File Screenshot from 2019-07-17 14-42-58.png     PNG File Screenshot from 2019-07-18 17-27-37.png     PNG File eviction_server.png     PNG File screenshot-1.png     PNG File screenshot-2.png     PNG File workload_shift.png    
Issue Links:
Related
is related to SERVER-42698 WT Cache eviction server performance ... Closed
Operating System: ALL
Participants:

 Description   

Yesterday, we had multiple slowdowns on our primary server:

From our metrics, we can see that the cache eviction threads were not able to evict pages fast enough, the cache used reached 95%, and application threads started to help the dedicated threads (source: http://source.wiredtiger.com/mongodb-3.4/tune_cache.html):
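A minimal sketch of the kind of check behind these metrics, assuming pymongo and that the WiredTiger stat names used below exist on this server version (names can vary between releases):

```python
# Minimal sketch: report WiredTiger cache fill ratio and whether
# application threads have had to evict pages themselves.
# Stat names are assumptions and may differ between MongoDB versions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical URI
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

used = cache["bytes currently in the cache"]
configured = cache["maximum bytes configured"]
app_evicted = cache.get("pages evicted by application threads", 0)

print(f"cache used: {100.0 * used / configured:.1f}% of configured size")
print(f"pages evicted by application threads (cumulative): {app_evicted}")
```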

We already encountered this issue with mongo 3.2 two years ago (on the secondary server though) and the only thing that fixed the issue was to upgrade to mongo 3.4: https://jira.mongodb.org/browse/SERVER-27700

We plan to upgrade to mongo 3.6 in the following months but can you check that a regression has not been introduced in our current mongo version?

Another related ticket: https://jira.mongodb.org/browse/SERVER-27753

I can provide you with diagnostic.data and slow log files. Thx for your help.



 Comments   
Comment by Edwin Zhou [ 26/Jul/21 ]

Hi bigbourin@gmail.com,

We appreciate you following up and are happy to hear that you are no longer seeing the regression you described earlier.

Best,
Edwin

Comment by Adrien Jarthon [ 20/Jul/21 ]

Hi !

Following up on our tests in SERVER-42698, we finally managed to upgrade to version 4.0.18, which at least does not have the cache eviction regression introduced in 3.6.9. This is what I said in SERVER-42698:

We were recently able to give 4.0.18 a try (to try to mitigate another issue which turned out to be even worse: WT-5479) and we did not see any eviction performance issue on this version, so it looks like the regression introduced in 3.6.9 reported here (which was still present in 3.6.15) is NOT present in 4.0.18.

It's not clear whether 4.0.18+ is better than pre-3.6.9 versions in terms of cache eviction performance, but at least the biggest regression is not present in the latest versions, and we did not see any issue with cache eviction in our production environment. So I believe this ticket can be closed now ^^

Comment by Edwin Zhou [ 19/Jul/21 ]

Hi adrien.jarthon@ringcentral.com,

Thank you for your participation in helping us investigate and identify the regression you're seeing in 3.4. As you may already know, both versions 3.4 and 3.6 are EOL per our support policy.

We're interested in whether you're still seeing regressions in eviction performance on later versions of MongoDB (v4.0 and later). If you're interested in continuing this investigation, can you please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket?
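A minimal sketch of one way to build that archive, assuming Python is available on the host and using a placeholder dbpath (substitute your real --dbpath):

```python
# Minimal sketch: archive $dbpath/diagnostic.data for upload.
# The dbpath below is a placeholder; substitute the real --dbpath.
import tarfile

dbpath = "/var/lib/mongodb"  # hypothetical dbpath
with tarfile.open("diagnostic.data.tar.gz", "w:gz") as tar:
    tar.add(f"{dbpath}/diagnostic.data", arcname="diagnostic.data")
print("wrote diagnostic.data.tar.gz")
```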

Best,
Edwin

Comment by Adrien Rey-Jarthon [ 28/Oct/19 ]

Well, "there have been many improvements made to the eviction server since" is all relative, as we are not even able to upgrade yet due to the drastic eviction performance decrease in 3.6.13. But yes, I agree that the worse performance of 3.6 is more important to fix than the poor performance of 3.4 ☺ We're currently spending some time testing various 3.6 versions to help you pinpoint the regression in SERVER-42698, so we won't spend too much time on 3.4 as it's mostly abandoned performance-wise anyway. Just hope we'll be able to upgrade at some point.

Comment by Danny Hatcher (Inactive) [ 28/Oct/19 ]

As mentioned in an earlier comment by Dmitry, this specific case is the result of immense cache pressure possibly related to a bottleneck in the eviction server. Because this version of MongoDB is fairly old and there have been many improvements made to the eviction server since, I'm not sure that there is a large amount of value in trying to resolve issues on this specific version. We have SERVER-42698 open to investigate possible issues with the eviction server on 3.6 so I propose we continue on that ticket and close this one.

If you would like to continue looking at 3.4, I think the best path forward would be to compare two discrete 3.4.x versions where performance is "acceptable" on one patch version and "not acceptable" on another. However, our focus would still be on 3.6 and newer work.

Comment by Adrien Jarthon [ 08/Aug/19 ]

And for the record I also created SERVER-42662

Comment by Adrien Jarthon [ 08/Aug/19 ]

I did this in SERVER-42698 to separate the two issues, as this one is about the not-so-great eviction performance of 3.4.16, whereas the new one is about the 3.6.13 regression on the same matter.

Comment by Dmitry Agranat [ 08/Aug/19 ]

Thanks for the update Adrien. Please upload the diagnostic.data from the primary and the secondary which was upgraded to 3.6.13 to the secure portal. Also, please provide the timestamps and timezone of the events you'd like us to take a look at.

Comment by Adrien Jarthon [ 08/Aug/19 ]

We added some charts to our monitoring about this since we're hitting the limits of the eviction server (the single-threaded one filling the queue, apparently). On this server in 3.4.13 we regularly hit 30k+ pages queued for eviction per second, and at that point it's pretty much saturated and can't fill up the queue any more, increasing cache usage to 95% and generating app-thread evictions:
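A minimal sketch of how such a per-second rate can be derived from the cumulative counter, assuming pymongo and that the "pages queued for eviction" stat name is present on this version:

```python
# Minimal sketch: approximate the per-second rate of pages queued for
# eviction by sampling the cumulative WiredTiger counter twice.
# Stat name is an assumption and may differ between versions.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical URI

def pages_queued():
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    return cache.get("pages queued for eviction", 0)

interval = 10  # seconds between samples
before = pages_queued()
time.sleep(interval)
after = pages_queued()
print(f"~{(after - before) / interval:.0f} pages queued for eviction per second")
```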

Comment by Adrien Jarthon [ 01/Aug/19 ]

Ok thanks. We started by updating the secondary to 3.6.13 and so far it's not looking good: even on the secondary, which never had any eviction performance issue with 3.4, we saw some spikes to 95% cache usage after one day on 3.6.13:

This is the MAX cache usage for primary (yellow, 3.4) and secondary (green, 3.6.13 since the drop to 0 visible on the chart). We can clearly see that the secondary used to be OK and has started showing some dangerous spikes since the update, while the primary is actually close to dying every day because of this. We'll monitor for a couple more days, but it doesn't look very safe to update in these conditions...

Comment by Dmitry Agranat [ 01/Aug/19 ]

Hi Adrien,

The eviction server is single-threaded. Its performance, contention and issues need to be carefully diagnosed in order to get to the root cause. There is no single metric which can surface eviction server efficiency and point to the root cause. You can try monitoring the wt cache pages evicted by application threads metric, which can be a good indication that eviction is struggling and needs to be investigated.
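A minimal sketch of that kind of monitoring, assuming pymongo and treating the exact stat name as version-dependent:

```python
# Minimal sketch of the monitoring suggested above: track growth of the
# cumulative "pages evicted by application threads" counter. A sustained
# non-zero delta suggests eviction is struggling and application threads
# are being drafted in. Stat name as quoted in the comment; availability
# may vary by MongoDB/WiredTiger version.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical URI

def app_thread_evictions():
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    return cache.get("pages evicted by application threads", 0)

previous = app_thread_evictions()
for _ in range(6):  # sample once every 10s for a minute
    time.sleep(10)
    current = app_thread_evictions()
    print(f"pages evicted by application threads in last 10s: {current - previous}")
    previous = current
```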

Just to reiterate, after the upgrade to the latest MongoDB version, please upload the diagnostic.data and we will be happy to take a look.

Thank you,
Dima

Comment by Adrien Jarthon [ 31/Jul/19 ]

Hi, so if I understand correctly (I'm working with Sébastien), mongo has an "eviction server" which generates candidates for eviction, or something like that? And then the multiple eviction threads (or application threads) evict items from a queue which is filled by this server? So if the server is not fast enough, evictions wait (blocking the entire server if we're at 95%) for the eviction server?

Is this a multi-threaded thing? As this is another MongoDB bottleneck we would need to monitor closely, how can we see how close we are to eviction server saturation? If we can measure this we can also see how much of an improvement 3.6 brings in this regard.

Thanks!

Comment by Dmitry Agranat [ 31/Jul/19 ]

Hi spuyet@gmail.com,

Apologies for a delayed response.

We can see that starting from 05:30 AM UTC, July 16th, there is a shift in the workload and the operations rate gradually increased by ~3x.

It looks to me like the main problem is that the eviction server is not queuing pages fast enough:

  • This is a zoom into the last hour (12:05 - 13:05) of the above workload increase, marked by A-B
  • eviction calls to get a page found queue empty is not as large as eviction calls to get a page, so eviction workers are not starved for work all the time, but they still found the queue empty in roughly 27% of their calls. In other words, the eviction server failed to keep the queue populated for about a quarter of the workers' attempts (see the sketch after this list for computing this ratio).
  • application thread time evicting is small, application thread time waiting for cache is large, indicating that application threads aren't finding eviction work queued for them so are waiting instead.
  • eviction empty score is large, meaning the eviction server often found the eviction queue empty, i.e. eviction workers are draining it faster than the server can fill it.
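A minimal sketch of how the queue-empty ratio above can be computed from serverStatus, assuming pymongo and that the stat names quoted in the bullets are present on this version:

```python
# Minimal sketch: share of eviction worker calls that found the queue
# empty, from the stats named in the bullets above. Stat names are
# assumptions and may differ between releases.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical URI
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

calls = cache.get("eviction calls to get a page", 0)
empty = cache.get("eviction calls to get a page found queue empty", 0)
if calls:
    print(f"eviction workers found the queue empty on ~{100.0 * empty / calls:.0f}% of calls")
else:
    print("eviction call stats not available on this version")
```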

We plan to upgrade to mongo 3.6 in the following months but can you check that a regression has not been introduced in our current mongo version?

I do not believe this is a regression of SERVER-27700, which you mentioned. SERVER-27700 explicitly talks about a secondary member, while here we are investigating the primary member during the reported event. In addition, we do not have information about the issue you encountered two years ago, so we cannot determine whether this is a regression.

It will be difficult to determine which out of ~400 WT improvements we've made since your current MongoDB version would help this workload with faster page eviction. We would be happy to re-evaluate this after the upgrade if the issue reoccurs. In addition, if possible, please consider upgrading to the latest of 4.0 (as of today, 4.0.11).

Comment by Sébastien Puyet [ 29/Jul/19 ]

Dmitry, any news about this issue?

Comment by Sébastien Puyet [ 18/Jul/19 ]

Yes, the cache used started to grow above 80% from the morning:

But we started to have a real impact on production between 2:00 PM UTC+2 and 3:05 PM UTC+2 (when we stepped down the server). Not sure that the timezone really matters here, though.

Comment by Dmitry Agranat [ 18/Jul/19 ]

Thanks spuyet@gmail.com. Just to confirm, we are looking into the event that started with the workload increase/shift on July 16th at 06:00 AM and lasted until 01:00 PM UTC?

Comment by Sébastien Puyet [ 18/Jul/19 ]

Hello Dmitry,

I've just uploaded the diagnostic data and the slow log files.

We don't see anything different from other days.

Comment by Dmitry Agranat [ 18/Jul/19 ]

Hi spuyet@gmail.com,

Thank you for the report, we'll be happy to look into this.

Has anything changed on the application side in terms of the workload prior to this event?

Please upload the diagnostic data and mongod logs, as well as the exact time and timezone of the reported issue.
You can upload all the data into this secure portal.

Thank you,
Dima

Comment by Sébastien Puyet [ 17/Jul/19 ]

An interesting thing to notice is that after the stepdown the server was back to a "normal" state.
