[SERVER-42256] Mongo unable to evict cache Created: 17/Jul/19 Updated: 27/Oct/23 Resolved: 26/Jul/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, Stability, WiredTiger |
| Affects Version/s: | 3.4.16 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Sébastien Puyet | Assignee: | Edwin Zhou |
| Resolution: | Gone away | Votes: | 3 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Issue Links: |
|
| Operating System: | ALL |
| Participants: | |
| Description |
|
Yesterday, we had multiple slowdowns on our primary server:
From our metrics, we can see that the cache eviction threads were not able to evict pages fast enough: cache usage reached 95% and application threads started to help the dedicated eviction threads (source: http://source.wiredtiger.com/mongodb-3.4/tune_cache.html):
We already encountered this issue with mongo 3.2 two years ago (on the secondary server, though), and the only thing that fixed it was upgrading to mongo 3.4: https://jira.mongodb.org/browse/SERVER-27700 We plan to upgrade to mongo 3.6 in the coming months, but can you check that a regression has not been introduced in our current mongo version? Another related ticket: https://jira.mongodb.org/browse/SERVER-27753 I can provide diagnostic.data and slow log files. Thanks for your help. |
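For reference, the 95% threshold mentioned above corresponds to WiredTiger's default eviction_trigger, the point at which application threads are pulled into eviction. A minimal mongo-shell sketch for checking how close the cache is to that threshold, assuming the usual serverStatus().wiredTiger.cache field names on 3.4-era builds:

{code:javascript}
// Approximate the WiredTiger cache fill ratio from serverStatus().
// Field names are assumed from typical 3.4-era serverStatus() output.
var cache = db.serverStatus().wiredTiger.cache;
var used = cache["bytes currently in the cache"];
var max  = cache["maximum bytes configured"];
print("cache fill: " + (100 * used / max).toFixed(1) + "%" +
      " (application threads join eviction near the 95% eviction_trigger default)");
{code}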
| Comments |
| Comment by Edwin Zhou [ 26/Jul/21 ] |
|
Thank you for following up; we're happy to hear that you are no longer seeing the regression you described earlier. Best, |
| Comment by Adrien Jarthon [ 20/Jul/21 ] |
|
Hi! Following up on our tests in
It's not clear whether 4.0.18+ is better than pre-3.6.9 versions in terms of cache eviction performance, but at least the biggest regression is not present in the latest versions, and we did not see any issue with cache eviction in our production environment. So I believe this ticket can be closed now ^^ |
| Comment by Edwin Zhou [ 19/Jul/21 ] |
|
Hi adrien.jarthon@ringcentral.com, Thank you for your participation in helping us investigate and identify the regression you're seeing in 3.4. As you may already know, both versions 3.4 and 3.6 are EOL per our support policy. We're interested in whether you're still seeing regressions in eviction performance on later versions of MongoDB (v4.0 and later). If you're interested in continuing this investigation, can you please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket? Best, |
| Comment by Adrien Rey-Jarthon [ 28/Oct/19 ] |
|
Well "there have been many improvements made to the eviction server since" is all relative, as we are not even able to upgrade yet due to the drastic eviction performance decrease in 3.6.13. But yes, I agree that the worse performance of 3.6 is more important to fix than the poor performance of 3.4 ☺ We're currently spending some time testing various 3.6 versions to help you pinpoint the regression in |
| Comment by Danny Hatcher (Inactive) [ 28/Oct/19 ] |
|
As mentioned in an earlier comment by Dmitry, this specific case is the result of immense cache pressure possibly related to a bottleneck in the eviction server. Because this version of MongoDB is fairly old and there have been many improvements made to the eviction server since, I'm not sure that there is a large amount of value in trying to resolve issues on this specific version. We have If you would like to continue looking at 3.4, I think the best path forward would be to compare two discrete 3.4.x versions where performance is "acceptable" on one patch version and "not acceptable" on another. However, our focus would still be on 3.6 and newer work. |
| Comment by Adrien Jarthon [ 08/Aug/19 ] |
|
And for the record, I also created |
| Comment by Adrien Jarthon [ 08/Aug/19 ] |
|
I did this in |
| Comment by Dmitry Agranat [ 08/Aug/19 ] |
|
Thanks for the update, Adrien. Please upload the diagnostic.data from the primary and from the secondary that was upgraded to 3.6.13 to the secure portal. Also, please provide the timestamps and timezone of the events you'd like us to take a look at. |
| Comment by Adrien Jarthon [ 08/Aug/19 ] |
|
We added some charts to our monitoring for this, since we're hitting the limits of the eviction server (the single-threaded one filling the queue, apparently). On this server, running 3.4.13, we regularly hit 30k+ pages queued for eviction per second, and at that point the eviction server is pretty much saturated and can't fill the queue any faster, so cache usage climbs to 95% and application-thread evictions kick in:
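A rough way to reproduce this "pages queued for eviction per second" number outside of a full monitoring stack is to diff two serverStatus() samples; a minimal mongo-shell sketch, assuming the cumulative counter is exposed as wiredTiger.cache["pages queued for eviction"] on this build:

{code:javascript}
// Derive a per-second rate from the cumulative "pages queued for eviction" counter.
function queuedPagesPerSecond(intervalMs) {
  var before = db.serverStatus().wiredTiger.cache["pages queued for eviction"];
  sleep(intervalMs); // mongo shell sleep(), in milliseconds
  var after = db.serverStatus().wiredTiger.cache["pages queued for eviction"];
  return (after - before) / (intervalMs / 1000);
}
print("pages queued for eviction / s: " + queuedPagesPerSecond(10000));
{code}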
|
| Comment by Adrien Jarthon [ 01/Aug/19 ] |
|
OK, thanks. We started by updating the secondary to 3.6.13, and so far it's not looking good: even on the secondary, which never had any eviction performance issue with 3.4, we saw some spikes to 95% cache usage after one day on 3.6.13:
This is the MAX cache usage for the primary (yellow, 3.4) and the secondary (green, on 3.6.13 since the drop to 0 visible on the chart). We can clearly see that the secondary used to be fine and started showing some dangerous spikes since the update, while the primary is actually close to dying every day because of this. We'll monitor for a couple more days, but it doesn't look very safe to update in these conditions... |
| Comment by Dmitry Agranat [ 01/Aug/19 ] |
|
Hi Adrien, The eviction server is single-threaded. Its performance, contention, and issues need to be carefully diagnosed in order to get to the root cause; there is no single metric which can surface eviction server efficiency and point to the root cause. You can try monitoring the WT cache "pages evicted by application threads" metric, which can be a good indication that eviction is struggling and needs to be investigated. Just to reiterate, after the upgrade to the latest MongoDB version, please upload the diagnostic.data and we will be happy to take a look. Thank you, |
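For reference, a minimal mongo-shell sketch for the metric suggested above, also sampling the dedicated eviction-worker counter for comparison (both field names are assumed from typical serverStatus().wiredTiger.cache output; a fast-growing application-thread count relative to the worker count suggests eviction is falling behind):

{code:javascript}
// Compare evictions performed by application threads (a sign eviction is struggling)
// with evictions performed by the dedicated eviction worker threads.
var c = db.serverStatus().wiredTiger.cache;
print("pages evicted by application threads: " + c["pages evicted by application threads"]);
print("eviction worker thread evicting pages: " + c["eviction worker thread evicting pages"]);
{code}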
| Comment by Adrien Jarthon [ 31/Jul/19 ] |
|
Hi, so if I understand correctly (I'm working with Sébastien), mongo has an "eviction server" which generates candidates for eviction, or something like that? Then the multiple eviction threads (or application threads) evict pages from a queue which is filled by this server? So if the server is not fast enough, evictions wait for the eviction server (blocking the entire server if we're at 95%)? Is this a multi-threaded thing? As this is another MongoDB bottleneck we would need to monitor closely, how can we see how close we are to eviction server saturation? If we can measure this, we can also see how much of an improvement 3.6 brings in this regard. Thanks! |
| Comment by Dmitry Agranat [ 31/Jul/19 ] |
|
Hi spuyet@gmail.com, Apologies for the delayed response. We can see that starting from 05:30 AM UTC on July 16th, there is a shift in the workload and the operations rate gradually increased by roughly 3x. It looks to me like the main problem is that the eviction server is not queuing pages fast enough:
I do not believe this is a regression of It will be difficult to determine which of the ~400 WT improvements we've made since your current MongoDB version would help this workload with faster page eviction. We would be happy to re-evaluate this after the upgrade if the issue reoccurs. In addition, if possible, please consider upgrading to the latest 4.0 release (as of today, 4.0.11). |
| Comment by Sébastien Puyet [ 29/Jul/19 ] |
|
Dmitry, any news about this issue? |
| Comment by Sébastien Puyet [ 18/Jul/19 ] |
|
Yes, the cache used started to grow above 80% from the morning:
But we started to see a real impact on production between 2:00 PM UTC+2 and 3:05 PM UTC+2 (when we stepped down the server). Not sure that the timezone really matters here, though. |
| Comment by Dmitry Agranat [ 18/Jul/19 ] |
|
Thanks spuyet@gmail.com. Just to confirm, we are looking into the event that started with the workload increase/shift on July 16th at 06:00 AM and lasted until 01:00 PM UTC? |
| Comment by Sébastien Puyet [ 18/Jul/19 ] |
|
Hello Dmitry, I've just uploaded the diagnostic data and the slow log files. We don't see anything different from other days. |
| Comment by Dmitry Agranat [ 18/Jul/19 ] |
|
Hi spuyet@gmail.com, Thank you for the report; we'll be happy to look into this. Has anything changed on the application side in terms of the workload prior to this event? Please upload the diagnostic data and mongod logs, as well as the exact time and timezone of the reported issue. Thank you, |
| Comment by Sébastien Puyet [ 17/Jul/19 ] |
|
An interesting thing to notice is that after the stepdown the server was back to a "normal" state. |