[SERVER-27753] wiredTiger cache 95% performance issues Created: 19/Jan/17 Updated: 28/Apr/17 Resolved: 01/Mar/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.2.11 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Luke Morfitt | Assignee: | Michael Cahill (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Production |
||
| Attachments: |
|
||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Description |
|
Hello, Performing a mongodump on numerous databases serially causes the WiredTiger cache to become 95% populated. In some instances the mongodump halts and our backups do not complete. What I think is happening is the cache is growing above 95% used. At this point mongo is trying to clear data from the cache and failing, at which point mongodump hangs. Changing the size of the WiredTiger cache (up or down) can cause the process to continue; I suspect it forces a flush of some cache data. Please can you provide a list of data you would like me to capture and an upload location for this. Kind regards, |
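A minimal sketch (not part of the original report) for watching the cache fill level while the serial mongodump runs, to confirm the ~95% ceiling described above. The connection URI is a placeholder, and the stat names are read from serverStatus().wiredTiger.cache, so they may differ between MongoDB versions:

```python
# Sketch: poll the WiredTiger cache fill ratio while mongodump runs.
# URI is a placeholder; stat names come from serverStatus().wiredTiger.cache
# and may vary by MongoDB/WiredTiger version.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # adjust for your secondary

while True:
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    used = cache["bytes currently in the cache"]
    limit = cache["maximum bytes configured"]
    dirty = cache.get("tracked dirty bytes in the cache", 0)
    print(f"cache fill: {used / limit:6.1%}  dirty: {dirty / limit:6.1%}")
    time.sleep(5)
```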
| Comments |
| Comment by sumit jha [ 28/Apr/17 ] | ||||||||
|
Great, thanks @Thomas | ||||||||
| Comment by Kelsey Schubert [ 25/Apr/17 ] | ||||||||
|
Hi sumitjha, The fixes in the patch build that Michael provided earlier are included in MongoDB 3.2.13-rc0, our release candidate for MongoDB 3.2.13, which we intend to release next week. I would recommend using MongoDB 3.2.13-rc0 for your tests as this will provide the best upgrade path to a stable release that contains these fixes. Thank you, | ||||||||
| Comment by sumit jha [ 25/Apr/17 ] | ||||||||
|
Hi Michael, can you post the changes you have made on top of the stable release? I will verify them against a few other use cases. | ||||||||
| Comment by Michael Cahill (Inactive) [ 01/Mar/17 ] | ||||||||
|
lukemorfitt, the changes you need are tracked in | ||||||||
| Comment by Michael Cahill (Inactive) [ 01/Mar/17 ] | ||||||||
|
lukemorfitt, thank you for testing the patch build: I'm glad to hear that it improved performance in your workload. Here is a summary view from the diagnostic data:
You can see the baseline 3.2.11 stats on the left, and the patched results on the right. The improvement is stark: with the patched build, there is virtually no time spent waiting for space in cache, and that is because the internal stats show that eviction is orders of magnitude more efficient. We are working now to get these changes into a 3.4.x release; once that is stable we will get them into a 3.2.x release. It will take at least a few weeks for a production-ready release. | ||||||||
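For readers without access to the uploaded diagnostic data, a rough way to reproduce this kind of before/after view is to diff the cache counters from two serverStatus snapshots, one taken on the baseline binary and one on the patched binary. This is only a sketch; the host names are hypothetical:

```python
# Sketch: compare WiredTiger cache counters from two serverStatus snapshots
# (e.g. a 3.2.11 baseline node vs. a node running the patched build).
# Hosts are placeholders; not all counters are comparable across versions.
from pymongo import MongoClient

def cache_stats(uri):
    ss = MongoClient(uri).admin.command("serverStatus")
    return ss["wiredTiger"]["cache"]

baseline = cache_stats("mongodb://baseline-host:27017")   # hypothetical host
patched = cache_stats("mongodb://patched-host:27017")     # hypothetical host

for key in sorted(baseline):
    b, p = baseline[key], patched.get(key)
    if isinstance(b, int) and isinstance(p, int) and (b or p):
        print(f"{key:60s} baseline={b:>15d} patched={p:>15d}")
```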
| Comment by Luke Morfitt [ 28/Feb/17 ] | ||||||||
|
Hello, Please see the attached screenshots. I've also uploaded diagnostic data from before and after the custom binary for your review. We deployed the custom binary on the 24th. As you can see from the graphs, the performance characteristics have improved significantly. The WiredTiger cache appears to be evicting more aggressively and is no longer reaching the high water mark and hanging. The global locks have dropped from over a second to less than 10ms. Are these changes going to be available in a future mongo build? Kind regards, | ||||||||
| Comment by Luke Morfitt [ 23/Feb/17 ] | ||||||||
|
Michael Cahill, thank you for the custom build. We will perform some testing and get back to you. Kind regards, | ||||||||
| Comment by Michael Cahill (Inactive) [ 21/Feb/17 ] | ||||||||
|
lukemorfitt, thank you for uploading the diagnostic data; I have looked through it. Your case looks similar to what we have seen in several workloads recently. In particular, there are over 20,000 active collections + indexes, and in that case with some workloads we have seen WiredTiger's cache management become inefficient. See the linked issue. We have been working to improve WiredTiger for this type of workload (see here for a summary of the changes). Are you able to test a patch build of MongoDB with your workload and measure whether it improves performance in this situation for you? If so, a build for Ubuntu 14.04 is here. | ||||||||
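A quick way to check whether a deployment is in the "over 20,000 active collections + indexes" regime described above is to count them directly. This is a sketch only; it assumes a recent pymongo, a placeholder URI, and a member that allows reads:

```python
# Sketch: count collections and indexes across all databases to gauge whether
# the deployment is in the many-collections regime discussed above.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
collections = 0
indexes = 0
for db_name in client.list_database_names():
    db = client[db_name]
    for coll_name in db.list_collection_names():
        collections += 1
        indexes += len(list(db[coll_name].list_indexes()))
print(f"collections: {collections}, indexes: {indexes}, "
      f"total: {collections + indexes}")
```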
| Comment by Luke Morfitt [ 26/Jan/17 ] | ||||||||
|
Cache usage graphs | ||||||||
| Comment by Luke Morfitt [ 26/Jan/17 ] | ||||||||
|
Hello ~mark.agarunov, It feels like the cache eviction threads are not able to keep up with the incoming data, and thus the performance drops off. Is it possible to tune the aggressiveness of the cache eviction process? As the cache reaches 95% usage the performance of the server drops off and the lock times increase. Resizing the cache allows mongo to continue until it reaches the new 95% level. The server discussed in this post is a secondary in a 3-node replica set (r3.8xlarge, 32 cores, 244GB RAM, 1TB data disk, 334GB journal disk). It is storing 326GB on the data disk with a 50GB oplog. It is hosting 176 databases; only 14 databases consume any significant space (more than 1GB in size), as per below. dbname 1.173GB I've uploaded the following files to the portal. diagnostic.data.tar.gz Kind regards, | ||||||||
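On the tuning question above: WiredTiger accepts an engine configuration string at runtime through setParameter, and eviction thread counts are among the settings it understands. This is a sketch only; the values are illustrative, not a recommendation, and which knobs are honoured depends on the MongoDB/WiredTiger version:

```python
# Sketch: adjust WiredTiger eviction thread counts at runtime via
# wiredTigerEngineRuntimeConfig. Values are illustrative placeholders, not
# tuning advice; behaviour depends on the server version.
from pymongo import MongoClient

admin = MongoClient("mongodb://localhost:27017").admin  # placeholder URI
result = admin.command(
    "setParameter", 1,
    wiredTigerEngineRuntimeConfig="eviction=(threads_min=4,threads_max=8)",
)
print(result)
```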
| Comment by Mark Agarunov [ 20/Jan/17 ] | ||||||||
|
Hello lukemorfitt, I've created a safe, secure upload portal so your logs can only be accessed by us for the purpose of debugging this issue. Just to reiterate:
Thanks, | ||||||||
| Comment by Luke Morfitt [ 20/Jan/17 ] | ||||||||
|
Hello Mark, Thank you for the update. I will gather the information for you. Please can you provide a secure upload location, as our server logs will contain sensitive information. Thanks, | ||||||||
| Comment by Mark Agarunov [ 19/Jan/17 ] | ||||||||
|
Hello lukemorfitt, Thank you for the report. This may be expected behavior; at 95% full, some of the application threads are used to do cache evictions which causes a slowdown. However, to definitively determine if this is due to a bug or expected behavior, please provide the following if possible:
Thanks, | ||||||||
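To illustrate the point above that application threads start doing eviction work themselves once the cache is around 95% full, the eviction-related counters in serverStatus can be inspected directly. This sketch filters cache stats by key name rather than hard-coding them, since the exact names differ between versions:

```python
# Sketch: print cache counters related to eviction and to application-thread
# cache work from serverStatus().wiredTiger.cache. URI is a placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
for key, value in sorted(cache.items()):
    if "application threads" in key or "eviction" in key:
        print(f"{key}: {value}")
```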
| Comment by Luke Morfitt [ 19/Jan/17 ] | ||||||||
|
We are also seeing this issue whilst trying to add a new hidden node to the replica set. The new node is trying to replicate data from a mongo secondary server. Once this server hits 95% cache utilization (105GB cache), the replicating node slows down whilst replicating data from the secondary. Downsizing the cache from 105G to 104G allowed the replication to continue.
Cache was downscaled here.
thanks, |
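The runtime cache downsize described in this comment (105G to 104G) can be issued without a restart through the same wiredTigerEngineRuntimeConfig parameter; a sketch, with a placeholder host and size:

```python
# Sketch: shrink the WiredTiger cache at runtime, which forces eviction and is
# presumably why it unblocked replication here. Host and size are placeholders.
from pymongo import MongoClient

admin = MongoClient("mongodb://secondary-host:27017").admin  # placeholder host
result = admin.command(
    "setParameter", 1,
    wiredTigerEngineRuntimeConfig="cache_size=104G",
)
print(result)
```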