[SERVER-27753] wiredTiger cache 95% performance issues Created: 19/Jan/17  Updated: 28/Apr/17  Resolved: 01/Mar/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.2.11
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Luke Morfitt Assignee: Michael Cahill (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Production


Attachments: PNG File CB Screen Shot 2017-02-28 at 09.24.25.png     PNG File CB Screen Shot 2017-02-28 at 09.24.38.png     PNG File CB Screen Shot 2017-02-28 at 09.25.02.png     PNG File CB Screen Shot 2017-02-28 at 09.25.22.png     PNG File CB Screen Shot 2017-02-28 at 09.25.34.png     PNG File Cache Filling and hitting 95 image 1.png     PNG File Cache filling and hitting 95 images 2.png     PNG File Resized cache image1.png     PNG File Resized cahce image 2.png     PNG File server-27753-comparison.png    
Issue Links:
Duplicate
duplicates SERVER-27700 WT secondary performance drops to nea... Closed
is duplicated by SERVER-27153 Mongodump on secondaries causes perfo... Closed
Related
is related to SERVER-28316 Related to SERVER-27753 - Custom bui... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Hello,
We are experiencing performance drop-offs when the WiredTiger cache becomes 95% full.

Performing mongodump against numerous databases serially drives the WiredTiger cache to 95% utilization.

In some instances the mongodump halts and our backups do not complete.

What I think is happening is that the cache grows above 95% used; at that point mongod tries to evict data from the cache and fails, and mongodump hangs.

Changing the size of the WiredTiger cache (up or down) can cause the process to continue; I suspect this forces a flush of some cache data.
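
For illustration, a minimal pymongo sketch of the kind of cache monitoring described above, assuming a direct connection to the affected node (the hostname is hypothetical; the statistic names are standard WiredTiger fields in serverStatus):

```python
# Minimal sketch (not from the ticket): poll WiredTiger cache utilisation
# while the backups run. Hostname is hypothetical; when pointing at a
# secondary you may need directConnection=true in the URI.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://secondary-host:27017")

while True:
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    used = cache["bytes currently in the cache"]
    limit = cache["maximum bytes configured"]
    dirty = cache["tracked dirty bytes in the cache"]
    print("cache used %.1f%% (dirty %.1f%%)"
          % (100.0 * used / limit, 100.0 * dirty / limit))
    time.sleep(5)
```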

Please can you provide a list of the data you would like me to capture and an upload location for it.

Kind regards,
Luke



 Comments   
Comment by sumit jha [ 28/Apr/17 ]

Great, thanks @Thomas

Comment by Kelsey Schubert [ 25/Apr/17 ]

Hi sumitjha,

The fixes in the patch build that Michael provided earlier are included in MongoDB 3.2.13-rc0, our release candidate for MongoDB 3.2.13, which we intend to release next week. I would recommend using MongoDB 3.2.13-rc0 for your tests as this will provide the best upgrade path to a stable release that contains these fixes.

Thank you,
Thomas

Comment by sumit jha [ 25/Apr/17 ]

Hi Michael, can you post the changes you have made on top of the stable release? I will verify them against a few other use cases.

Comment by Michael Cahill (Inactive) [ 01/Mar/17 ]

lukemorfitt, the changes you need are tracked in SERVER-27700, so I'm going to close this ticket as a duplicate of that one.

Comment by Michael Cahill (Inactive) [ 01/Mar/17 ]

lukemorfitt, thank you for testing the patch build: I'm glad to hear that it improved performance in your workload.

Here is a summary view from the diagnostic data:

You can see the baseline 3.2.11 stats on the left, and the patched results on the right. The improvement is stark: with the patched build, there is virtually no time spent waiting for space in cache, and that is because the internal stats show that eviction is orders of magnitude more efficient.
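
For readers without access to the diagnostic data, a hedged pymongo sketch of pulling the equivalent eviction counters from serverStatus (hostnames hypothetical; exact statistic names vary somewhat between MongoDB/WiredTiger versions, hence the .get() lookups):

```python
# Sketch: snapshot a few cache/eviction counters to compare a baseline node
# with a patched one. Field names are fetched with .get() because they differ
# slightly across WiredTiger versions.
from pymongo import MongoClient

def eviction_snapshot(uri):
    cache = MongoClient(uri).admin.command("serverStatus")["wiredTiger"]["cache"]
    keys = [
        "pages walked for eviction",
        "pages evicted by application threads",
        "modified pages evicted",
        "unmodified pages evicted",
    ]
    return {k: cache.get(k) for k in keys}

print(eviction_snapshot("mongodb://baseline-host:27017"))   # 3.2.11 node
print(eviction_snapshot("mongodb://patched-host:27017"))    # patched node
```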

We are working now to get these changes into a 3.4.x release; once that is stable, we will get them into a 3.2.x release. It will take at least a few weeks for a production-ready release.

Comment by Luke Morfitt [ 28/Feb/17 ]

Hello,

Please see attached screenshots. I've also uploaded diagnostic data from before and after the custom binary for your review.

We placed the custom binary on the 24th. As you can see from the graphs, the performance characteristics have improved significantly.

The WiredTiger cache appears to be evicting more aggressively and is no longer reaching the high-water mark and hanging.

The global locks have dropped from over a second to less than 10ms.
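
For reference, a small pymongo sketch of checking global lock queueing and acquisition waits from serverStatus (hostname hypothetical; the layout of the "locks" section differs slightly between server versions):

```python
# Sketch: report global lock queue depth and acquisition wait counters.
from pymongo import MongoClient

status = MongoClient("mongodb://secondary-host:27017").admin.command("serverStatus")

queue = status["globalLock"]["currentQueue"]
print("queued readers:", queue["readers"], "queued writers:", queue["writers"])

# Per-lock wait counters; the Global entry may be absent if nothing has waited.
global_lock = status.get("locks", {}).get("Global", {})
print("time acquiring Global lock (usecs):", global_lock.get("timeAcquiringMicros"))
```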

Are these changes going to be available in a future mongo build?

Kind regards,
Luke

Comment by Luke Morfitt [ 23/Feb/17 ]

Michael Cahill, thank you for the custom build.

We will perform some testing and get back to you.

Kind regards,
Luke

Comment by Michael Cahill (Inactive) [ 21/Feb/17 ]

lukemorfitt, thank you for uploading the diagnostic data, I have looked through it.

Your case looks similar to what we have seen in several workloads recently. In particular, there are over 20,000 active collections and indexes, and with that many trees we have seen WiredTiger's cache management become inefficient in some workloads. See SERVER-27700 for more details; the short version is that eviction wastes time looking for pages to evict in trees that occupy very little space in cache.
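
To check whether a deployment is in this regime, a rough pymongo sketch that counts collections and indexes across all databases (hostname hypothetical):

```python
# Sketch: count collections and indexes ("trees") across all databases.
from pymongo import MongoClient

# When targeting a secondary directly, readPreference=secondaryPreferred
# may be required for listCollections to succeed.
client = MongoClient("mongodb://secondary-host:27017/?readPreference=secondaryPreferred")

collections = 0
indexes = 0
for db_name in client.list_database_names():
    db = client[db_name]
    for coll_name in db.list_collection_names():
        collections += 1
        indexes += len(db[coll_name].index_information())

print("collections:", collections, "indexes:", indexes,
      "total trees:", collections + indexes)
```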

We have been working to improve WiredTiger for this type of workload (see here for a summary of the changes).

Are you able to test a patch build of MongoDB with your workload and measure whether it improves performance in this situation for you? If so, a build for Ubuntu 14.04 is here.

Comment by Luke Morfitt [ 26/Jan/17 ]

Cache usage graphs

Comment by Luke Morfitt [ 26/Jan/17 ]

Hello mark.agarunov,

It feels like the cache eviction threads are not able to keep up with the incoming data and thus the performance drops off. Is it possible to tune the aggressiveness of the cache eviction process?

As the cache reaches 95% usage the performance of the server drops off and the lock times increase.

Resizing the cache allows mongo to continue until it reaches the new 95% level.
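
For what it's worth, WiredTiger's eviction settings (thread counts and target/trigger percentages) can be adjusted at runtime through the wiredTigerEngineRuntimeConfig server parameter; a hedged sketch follows, with purely illustrative values and a hypothetical hostname, not a tuning recommendation:

```python
# Sketch: adjust eviction thread counts and trigger thresholds at runtime.
# Values are illustrative only; the defaults are usually appropriate.
from pymongo import MongoClient

client = MongoClient("mongodb://secondary-host:27017")
client.admin.command(
    "setParameter", 1,
    wiredTigerEngineRuntimeConfig="eviction=(threads_min=4,threads_max=8),"
                                  "eviction_target=80,eviction_trigger=95")
```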

The server discussed in this post is a secondary in a 3-node replica set (r3.8xlarge: 32 cores, 244GB RAM, 1TB data disk, 334GB journal disk). It is storing 326GB on the data disk with a 50GB oplog. It hosts 176 databases; only 14 databases consume any significant space (more than 1GB in size), as per below.

dbname 1.173GB
dbname 1.374GB
dbname 1.407GB
dbname 1.839GB
dbname 2.060GB
dbname 23.149GB
local 26.939GB
dbname 4.240GB
dbname 48.836GB
dbname 6.272GB
dbname 60.971GB
dbname 66.852GB
dbname 67.230GB
dbname 7.656GB
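
A small pymongo sketch that reproduces the per-database size listing above via the listDatabases command (hostname hypothetical; sizeOnDisk is reported in bytes):

```python
# Sketch: list databases by on-disk size, largest first.
from pymongo import MongoClient

client = MongoClient("mongodb://secondary-host:27017")
result = client.admin.command("listDatabases")
for db in sorted(result["databases"], key=lambda d: d["sizeOnDisk"], reverse=True):
    print("%-20s %.3f GB" % (db["name"], db["sizeOnDisk"] / 1024.0 ** 3))
```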

I've uploaded the following files to the portal.

diagnostic.data.tar.gz
mongod.log.cache_hits_95.gz
mongod.log.cache_resize.gz

Kind regards,
Luke

Comment by Mark Agarunov [ 20/Jan/17 ]

Hello lukemorfitt,

I've created a safe, secure upload portal so your logs can only be accessed by us for the purpose of debugging this issue. Just to reiterate:

  • Please archive (tar or zip) the $dbpath/diagnostic.data directory and upload the archive via the portal.
  • Please upload the logs from mongod for the time period when this issue was present.

Thanks,
Mark

Comment by Luke Morfitt [ 20/Jan/17 ]

Hello Mark,

Thank you for the update. I will gather the information for you.

Please can you provide a secure upload location as our server logs will contain sensitive information.

Thanks,
Luke

Comment by Mark Agarunov [ 19/Jan/17 ]

Hello lukemorfitt,

Thank you for the report. This may be expected behavior: at 95% full, some application threads are used to perform cache eviction, which causes a slowdown. However, to determine definitively whether this is a bug or expected behavior, please provide the following if possible:

  • Please archive (tar or zip) the $dbpath/diagnostic.data directory and attach it to this ticket
  • Please attach the logs from mongod for the time period when this issue was present.

Thanks,
Mark

Comment by Luke Morfitt [ 19/Jan/17 ]

We are also seeing this issue whilst trying to add a new hidden node to the replica set.

The new node replicates data from a secondary. Once that secondary hits 95% cache utilization (105GB cache), the new node's replication from it slows down.

Downsizing the cache from 105GB to 104GB allowed replication to continue.

insert query update delete getmore command % dirty % used flushes  vsize   res qr|qw ar|aw  netIn netOut conn             set repl                 time
  *180    *0   *769    *35       0     4|0     0.3   95.0       0   105G 96.8G   0|0  0|16   503b  23.1k   37 repl-gsp-aeu002  SEC 2017-01-19T13:48:52Z
  *290    41  *1113    *68       1     2|0     0.3   95.0       0   105G 96.8G   0|0   0|1  7.57k  30.7k   37 repl-gsp-aeu002  SEC 2017-01-19T13:48:53Z
  *228    *0   *843   *103       0     3|0     0.3   95.0       0   105G 96.8G   0|0  0|16   445b  22.7k   37 repl-gsp-aeu002  SEC 2017-01-19T13:48:54Z

Cache was downscaled here.

*1512   100  *6017  *3507       0     3|0     0.4   91.5       0   105G 96.8G   0|0  0|16  17.8k  47.3k   37 repl-gsp-aeu002  SEC 2017-01-19T13:48:55Z
*7032   402 *27015  *3692       0     4|0     0.7   91.7       0   105G 96.8G   0|0   1|0  71.6k  98.3k   37 repl-gsp-aeu002  SEC 2017-01-19T13:48:56Z
  *74     2   *438    *10      14     8|0     0.7   91.8       0   105G 96.8G   0|0   1|0  1.10k  45.4k   35 repl-gsp-aeu002  SEC 2017-01-19T13:48:57Z
 *106    *0   *454     *9      11     3|0     0.7   91.8       0   105G 96.8G   0|1   0|0   445b  22.7k   35 repl-gsp-aeu002  SEC 2017-01-19T13:48:58Z
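
The downsize described above can be issued at runtime without restarting mongod; a minimal sketch, assuming a direct connection to the secondary (hostname hypothetical):

```python
# Sketch: shrink the WiredTiger cache from 105GB to 104GB at runtime to force
# eviction pressure, as described above.
from pymongo import MongoClient

client = MongoClient("mongodb://secondary-host:27017")
client.admin.command("setParameter", 1,
                     wiredTigerEngineRuntimeConfig="cache_size=104G")
```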

thanks,
Luke
