[SERVER-24094] Server cache use can take up to an hour to recover from heavy load Created: 08/May/16  Updated: 22/Jul/16  Resolved: 22/Jul/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.2.4, 3.2.5, 3.2.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Mike Templeman Assignee: Ramon Fernandez Marina
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive mongo-5-6-2.log.zip     File mongo-mongodb_mongo_v3.2_ubuntu1404_30162fa8bbb9d7e7f7a789361aed7e046995f7b3_16_06_28_18_55_58.tgz     HTML File mongo-server-stall     Zip Archive mongoPrimaryDiagnostics.zip     PNG File wired_tiger_cache_full.png     Text File wiredtiger status.txt    
Issue Links:
Duplicate
duplicates SERVER-24580 Improve performance when WiredTiger c... Closed
Operating System: ALL
Steps To Reproduce:

On our system this occurs several times a day during the work week (PST).

Participants:

 Description   

We have found that the last three releases of mongod can get into a state in which the server is unable to service read tickets in a timely manner. We see this occurring when the primary shard is under heavy read and write load. Once the server has been in this state (%cache used ~=96%, %cache dirty ~> 30, read tickets available ~>60) for more than ~15 minutes, it remains in that state for up to an hour, even when query load is significantly reduced (to ~1/3 of the previous load).

Checking tps and CPU load with iostat, I see that the tps load is very low (usually under 1,000 on ephemeral SSD storage) but CPU load is very high (usually 85%, with system time taking anywhere between 15% and 50% of CPU load).

Looking in the logs for long-running queries or heavy query loads while in this condition shows no document scans. Checking db.currentOp() can return over 200 outstanding operations, but none with secs_running > 1.
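For anyone trying to correlate these observations, a minimal sketch of sampling the relevant counters from the mongo shell is shown below; the cache and ticket field names are those reported by serverStatus() on 3.2, and the one-second threshold mirrors the check described above:

// Sketch: sample WiredTiger cache and ticket counters from the mongo shell.
var ss = db.serverStatus();
var cache = ss.wiredTiger.cache;
var usedPct  = 100 * cache["bytes currently in the cache"] / cache["maximum bytes configured"];
var dirtyPct = 100 * cache["tracked dirty bytes in the cache"] / cache["maximum bytes configured"];
var readTickets = ss.wiredTiger.concurrentTransactions.read.available;
print("cache used %: " + usedPct.toFixed(1) +
      ", cache dirty %: " + dirtyPct.toFixed(1) +
      ", read tickets available: " + readTickets);
// Count long-running operations; during the stall described above this stays at 0.
print(db.currentOp({ secs_running: { $gt: 1 } }).inprog.length);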

We are running a 3.2.6 sharded cluster on EC2. The AMI ID is amzn-ami-hvm-2015.03.0.x86_64-gp2 (ami-1ecae776). The instance type is i2.xlarge. No other services are running on the server.



 Comments   
Comment by Kelsey Schubert [ 22/Jul/16 ]

Hi miketempleman,

gheppner has seen encouraging results after upgrading to 3.2.8. Therefore, I am closing this ticket as a duplicate of SERVER-24580. Please upgrade to 3.2.8 and confirm that the issue is resolved.

If you still experience issues after upgrading, please let us know and we will continue to investigate.

Thank you,
Thomas

Comment by Ramon Fernandez Marina [ 28/Jun/16 ]

When I wrote the comment above, the changes in the provided patch build had only made it to the development branch and into the 3.3.9 development release (announced earlier today).

As of today, these changes have also been added to the v3.2 branch, and the goal is to make them part of the next stable release, 3.2.8, scheduled for the coming weeks.

Hope that clarifies things, but please let us know if you have additional questions.

Thanks,
Ramón.

Comment by George Heppner [ 28/Jun/16 ]

Thanks. I'll keep you posted. As I mentioned, we see this less frequently in our dev environment, so it may be a while before I have any conclusive results. I too am somewhat hesitant to put something like this into production, although I might consider it if we don't see any issues running it in dev.

Can you clarify for me when the changes in this patch would see a production release? You mentioned a 3.3.x development release, so does that mean it won't be until 3.4.x?

Comment by Ramon Fernandez Marina [ 28/Jun/16 ]

Thanks gheppner, I've attached a file to this ticket with a patched version of MongoDB for Ubuntu 14.04 LTS that you can use for testing:

mongo-mongodb_mongo_v3.2_ubuntu1404_30162fa8bbb9d7e7f7a789361aed7e046995f7b3_16_06_28_18_55_58.tgz

Please let us know how it goes.

Thanks,
Ramón.

Comment by George Heppner [ 28/Jun/16 ]

We're on Ubuntu 14.04 LTS.

Comment by Ramon Fernandez Marina [ 28/Jun/16 ]

gheppner, we have a patch that has passed all our internal qualification tests, so we can send it your way for testing. Can you please tell me what platform(s) you will be testing on so I can produce the right binaries?

Thanks,
Ramón.

Comment by Ramon Fernandez Marina [ 27/Jun/16 ]

Thanks for the additional input, gheppner. The current changes for SERVER-24580 are part of development release 3.3.9, which will be available soon. In the meantime, I'll work on getting a patch build of MongoDB 3.2 available in case you're able to find resources to test it.

Comment by George Heppner [ 27/Jun/16 ]

Per my comment, the node in "blue" is routinely using 100% of its cache despite query volume dropping off, while the other two shards sit at only 80%.

Comment by George Heppner [ 27/Jun/16 ]

I wanted to chime in and say we are being impacted by this exact same issue, on MongoDB 3.2.5. We did not see this problem in the early versions of 3.2.x and first noticed it on 3.2.4.

I'm attaching a graph that tells a very similar story to what the reporter is seeing. In our case I do (much less frequently) see this same issue in our dev environment, so testing a patch might be an option for us. This is causing us fairly serious problems: as you can see in the attached image, having one shard experience such huge delays (which come with a huge load average and system % time) is badly hurting our throughput. I am debating scaling up on either shards or memory, since I know our working set is fairly large, but I'm hesitant because, if it were not for this issue, I feel we'd be getting adequate performance with what we have today.

If providing correlating diagnostic data from our deployment would be of help here, I'm happy to try to provide it.

Comment by Ramon Fernandez Marina [ 22/Jun/16 ]

No problem, given your current situation I'd do the same thing. Thanks for getting back to us; we'll keep you posted on any developments.

Comment by Michael Templeman [ 22/Jun/16 ]

Ramon

I am understandably reluctant to run a patch on our production system, particularly if the patch has any possibility of data loss or corruption. I also cannot effectively simulate this problem on our staging system (load, system setup, etc.). Given that the problem occurs on our primary shard, any potential data loss in semi-relational collections (yeah, I know) would require us to roll back to the last hourly backup and restore it. Since we have not tested a restore for months (again, yeah, I know), I cannot expect 100% data recovery.

Mike

Mike Templeman
Head of Development

Comment by Ramon Fernandez Marina [ 22/Jun/16 ]

miketempleman, after taking a closer look at the diagnostic data you uploaded, the evidence that you are affected by SERVER-24580 is inconclusive. We're working on a fix for it, and I was wondering if you have an environment where you could try out a patch build with such a tentative fix.

I understand testing a patch build is not ideal if you're seeing this in production, but field testing would help us make progress on this ticket.

Please let us know your thoughts. In the meantime we're also investigating whether this could be a manifestation of a variation of SERVER-23622.

Thanks,
Ramón.

Comment by Mike Templeman [ 21/Jun/16 ]

Ramon

No worries about the delay. This is clearly a difficult problem, and it may well be the same one as SERVER-24580. While I do not have the same analysis tools available to your folks, I have examined this issue for some months now and have some observations that might help a little:

  • the affected mongod process is not I/O bound at all. I am using ephemeral SSD storage, and when in the stalled state TPS is usually less than 1,000 (I often see TPS burst to >10,000 without any issues)
  • iostat %system virtually always goes above 40% during the stall period
  • %dirty is generally >25% during the entire stall period
  • %used is invariably ~96%, and very rarely higher
  • vsize seems more important than resident size. The default cache size on my servers is 16GB (the WT default). When vsize reaches ~24GB on the primary shard's primary during the busy daytime, I start to schedule my usual workaround
  • the number of read tickets available during the stall drops to <30 (average); write tickets drop to ~100
  • the workaround is to (1) reduce the number of reads, (2) step down the primary, and (3) restart it (see the sketch after this list). This often works unless the secondary's vsize is >20GB, in which case the same situation occurs on the new primary within a minute or two
  • balancing is turned off
  • fast compression is used.
  • OS is amzn-ami-hvm-2015.03.0.x86_64-gp2 (ami-1ecae776)
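
For reference, here is a rough sketch of the stepdown part of that workaround, as run from a mongo shell connected to the stalled primary; the 60-second value is illustrative and not taken from the original report:

// Sketch only: step down the stalled primary of the affected shard.
// 60 is an illustrative number of seconds during which the member stays ineligible to become primary again.
rs.stepDown(60);
// Once another member has been elected primary, the affected mongod is restarted
// from the operating system (e.g. via its service manager), which happens outside the shell.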

The attached image shows the period of time going from (1) everything working without a problem, to (2) a stall occurring on the primary shard's primary, to (3) stepdown of the primary followed by a restart of the affected server.

Some of the behavior does suggest a possible deadly embrace, though I would not expect CPU load to go so high in that situation. I was thinking that the issue was limited CPU capacity when the cache is fully loaded, but after reading SERVER-24580 I am less certain.

I hope this information provides a little help to the engineers.

Mike

Comment by Ramon Fernandez Marina [ 21/Jun/16 ]

Apologies for the radio silence, miketempleman. This issue has proven elusive, but there have been some recent developments: I believe the symptoms you describe are the same as those in SERVER-24580, which is being investigated at high priority.

I'll check with my colleagues, but if they agree that you're likely running into SERVER-24580, I'll close this ticket as a duplicate of SERVER-24580, which I suggest you watch for updates.

Thanks for your patience on this matter.

Regards,
Ramón.

Comment by Mike Templeman [ 12/May/16 ]

Could the issue simply be that I need more CPU resources to handle WT's compression/decompression processing given the load? It doesn't appear that I/O latency is an issue.
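
(As a quick way to check this locally: the block compressor a collection was created with can be read from the mongo shell, as sketched below. "myCollection" is only a placeholder name, not one from this ticket.)

// Sketch: confirm which WiredTiger block compressor a collection uses.
var cs = db.myCollection.stats().wiredTiger.creationString;
print(cs.match(/block_compressor=\w*/)[0]);  // e.g. block_compressor=snappy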

Comment by Mike Templeman [ 10/May/16 ]

The %system CPU load when in this state seems very high as well. Here is some iostat 5 output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          51.00    0.00   47.49    0.10    0.05    1.35
 
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvda              0.40         0.00         4.80          0         24
xvdb           1077.40      6228.80     19368.00      31144      96840
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          52.54    0.00   45.00    0.40    0.05    2.01
 
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvda              0.00         0.00         0.00          0          0
xvdb           3578.40      5400.00     69345.60      27000     346728

Comment by Mike Templeman [ 09/May/16 ]

Attached are the diagnostic logs for the last five days. The latest log covers a period during which the core server/WT entered the cache-load/ticket-exhaustion state.

Comment by Daniel Pasette (Inactive) [ 08/May/16 ]

Hi Mike,
If possible, can you zip up and attach to this ticket the diagnostic log files found in $dbpath/diagnostic.data for the server?

Thanks, Dan
