[SERVER-31141] Poor performance with 95% cache and many writers Created: 18/Sep/17  Updated: 29/Jan/18  Resolved: 08/Jan/18

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bob Potter Assignee: Alexander Gorrod
Resolution: Done Votes: 2
Labels: MAREF
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2017-09-18 at 5.36.27 PM.png     PNG File Screen Shot 2017-09-19 at 4.57.59 PM.png     PNG File eviction.png     Text File mongostat.txt    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

On mongo 3.4.9 we are seeing poor performance when the cache hits 95% utilization. I've attached a snippet of mongostat output from a time when this was happening. When this slowdown happens there is almost no disk activity.

Some additional information about our environment:

  • The server is running on an r4.8xlarge instance (244GB RAM / 32 cores)
  • We are using gzip compression
  • We have many concurrent writers, and when this slowdown happens we max out all of the write tickets. (Is it possible that eviction is having trouble making progress with that many application threads? See the sketch after this list.)
  • This collection has 12 indexes.
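
For reference, a rough way to watch write ticket availability and cache fill from the mongo shell is sketched below. This is only a sketch; the serverStatus field names are taken from 3.4-era WiredTiger output and may differ between versions.

// Sketch: poll write ticket usage and cache/dirty fill ratios (field names assumed from 3.4-era serverStatus)
var ss = db.serverStatus();
var tickets = ss.wiredTiger.concurrentTransactions.write;
var cache = ss.wiredTiger.cache;
print("write tickets in use: " + tickets.out + " of " + tickets.totalTickets);
print("cache fill: " + (100 * cache["bytes currently in the cache"] / cache["maximum bytes configured"]).toFixed(1) + "%");
print("dirty fill: " + (100 * cache["tracked dirty bytes in the cache"] / cache["maximum bytes configured"]).toFixed(1) + "%");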

Are there any configuration settings we should try changing to work around this?

I can upload the diagnostic.data to the private portal if that would help.



 Comments   
Comment by Bob Potter [ 03/Jan/18 ]

Great! We are already on 3.6.1 and that issue has gone away.

We are now seeing a different performance issue that triggers even when the cache isn't maxed out. I'll open a new ticket for that once I can reproduce it again and collect diagnostic information.

Comment by Alexander Gorrod [ 03/Jan/18 ]

bpot Sorry for the long delay in getting back to you, and thanks for uploading the data. I've taken a look, and the symptoms you see correspond to some that were fixed in WT-3079. That fix is in the latest version of MongoDB 3.6, but hasn't yet landed in a 3.4 release. I'd recommend upgrading to the next version of MongoDB 3.4 when it becomes available, as I expect it to improve the behavior for you.

In particular your workload has a collection that uses the majority of the cache, and the fix in WT-3079 meant we more fairly added pages to the LRU queue in cases like that.
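
As a quick sanity check after upgrading, the running server version can be confirmed from the mongo shell (a minimal sketch; the release notes for that version will say whether the WT-3079 fix is included):

db.version()               // prints the server version string, e.g. "3.4.x"
db.serverStatus().version  // the same information via serverStatus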

Comment by Bob Potter [ 26/Sep/17 ]

I've uploaded logs for about a 10 minute period where we were experiencing the issue. Let me know if there is any additional information which would be helpful.

Thanks for looking into this.

Comment by Alexander Gorrod [ 26/Sep/17 ]

bpot Thanks for uploading the logs - we have investigated the data. We haven't seen similar behavior in other workloads, so it would help us understand the behavior on your system if you could share additional diagnostic information while the system is in the described state. It's possible that the additional information gathering will have some effect on throughput, so please monitor carefully when it is enabled.

How you enable the logging depends on how you administer your database; the instructions below work through the mongo shell.

Enable the additional logging using:

db.getSiblingDB('admin').runCommand( { setParameter: 1, wiredTigerEngineRuntimeConfig: "statistics=[fast,cache_walk],statistics_log=(wait=10,json=false,sources=[file:])" } )

Allow the machine to run for approximately 10 minutes with this setting enabled. Disable the setting by running:

db.getSiblingDB('admin').runCommand( { setParameter: 1, wiredTigerEngineRuntimeConfig: "statistics=[fast,cache_walk],statistics_log=(wait=0,json=false,sources=[file:])" } )
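
To confirm the parameter took effect, the current value can usually be read back with getParameter (an assumption about your version; if it isn't returned this way, checking that WiredTigerStat.* files start appearing in the database directory is an alternative):

db.getSiblingDB('admin').runCommand( { getParameter: 1, wiredTigerEngineRuntimeConfig: 1 } )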

Then please upload the following for the time period corresponding to when the additional logging is enabled:

  • The mongod log
  • The diagnostic.data subdirectory of the database directory
  • Any files matching the pattern WiredTigerStat.XX.XX in the database directory, where each X is a numerical digit.

The additional information enabled here doesn't track application-specific content - it only records abstract database operation statistics - so there shouldn't be any privacy issues.

Comment by Bob Potter [ 19/Sep/17 ]

Hi,

I don't have the logs from the original time period but I've uploaded redacted logs and diagnostic data for a more recent occurrence.

Thanks,
Bob

Comment by Kelsey Schubert [ 18/Sep/17 ]

Hi bpot,

Thanks for providing the diagnostic.data, we're still examining it to better understand what is going on, but an initial review suggests that it would be helpful if you could provide the complete log files covering the same time period so we can correlate these events. If there is sensitive material in the logs, would you please redact it using the same procedure as in SERVER-22642?

Kind regards,
Kelsey

Comment by Bob Potter [ 18/Sep/17 ]

Thanks. I've uploaded the data.

Comment by Kelsey Schubert [ 18/Sep/17 ]

Hi bpot,

I've created a secure upload portal for you to use to provide an archive of the diagnostic.data directory.

Thank you,
Kelsey
