[SERVER-24094] Server cache use can take up to an hour to recover from heavy load
| Created: | 08/May/16 | Updated: | 22/Jul/16 | Resolved: | 22/Jul/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.2.4, 3.2.5, 3.2.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Mike Templeman | Assignee: | Ramon Fernandez Marina |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | On our system this occurs several times a day during the work week (PST). |
| Participants: | |
| Description |
|
We have found that the last three releases of mongod can get into a state in which the server is unable to service read tickets in a timely manner. We see this when the primary shard is under heavy read and write load. Once the server has been in this state (% cache used ~96%, % cache dirty > ~30, read tickets available > ~60) for more than ~15 minutes, it remains there for up to an hour, even when query load is cut to roughly a third of its previous level.

Checking tps and CPU load with iostat, I see that tps is very low (usually under 1000 on ephemeral SSD storage) but CPU load is very high (usually 85%, with system time accounting for anywhere between 15% and 50% of it). Looking in the logs for long-running queries or heavy query loads while in this condition shows no document scans. db.currentOp() can return over 200 outstanding operations, but none with secs_running > 1.

We are running a 3.2.6 sharded cluster on EC2. The AMI is amzn-ami-hvm-2015.03.0.x86_64-gp2 (ami-1ecae776) and the instance type is i2.xlarge. No other services are running on the server.
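For readers trying to reproduce these observations, here is a minimal mongo shell sketch of how the figures quoted above can be read back on a 3.2 node. The field names come from db.serverStatus(); the percentages in the comments are simply the values reported in this ticket, not recommended thresholds.

    // Sketch only: read the WiredTiger cache and ticket metrics referenced above.
    var ss      = db.serverStatus();
    var cache   = ss.wiredTiger.cache;
    var tickets = ss.wiredTiger.concurrentTransactions;

    var usedPct  = 100 * cache["bytes currently in the cache"] /
                         cache["maximum bytes configured"];
    var dirtyPct = 100 * cache["tracked dirty bytes in the cache"] /
                         cache["maximum bytes configured"];

    print("% cache used:            " + usedPct.toFixed(1));   // ~96% in the state described
    print("% cache dirty:           " + dirtyPct.toFixed(1));  // ~30% in the state described
    print("read tickets available:  " + tickets.read.available);
    print("write tickets available: " + tickets.write.available);

    // Operations running longer than one second; the report above sees
    // 200+ outstanding operations but none matching this filter.
    printjson(db.currentOp({ secs_running: { $gt: 1 } }).inprog);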
| Comments |
| Comment by Kelsey Schubert [ 22/Jul/16 ] |
|
Hi miketempleman,

gheppner has seen encouraging results after upgrading to 3.2.8. Therefore, I am closing this ticket as a duplicate of the linked issue. If you still experience issues after upgrading, please let us know and we will continue to investigate.

Thank you,
| Comment by Ramon Fernandez Marina [ 28/Jun/16 ] |
|
When I wrote the comment above, the changes in the provided patch build had only made it to the development branch and into the 3.3.9 development release (announced earlier today). As of today, these changes have also been added to the v3.2 branch, and the goal is to make them part of the next stable release, 3.2.8, scheduled for the coming weeks.

Hope that clarifies things, but please let us know if you have additional questions.

Thanks,
| Comment by George Heppner [ 28/Jun/16 ] |
|
Thanks. I'll keep you posted. As I mentioned, we see this less frequently in our dev environment, so it may be a while before I have any conclusive results. I too am somewhat hesitant to put something like this out in production, although I might consider it if we don't see any issues running it in dev.

Can you clarify for me when the changes in this patch would see a production release? You mentioned a 3.3.x development release, so does that mean it won't be until 3.4.x?
| Comment by Ramon Fernandez Marina [ 28/Jun/16 ] |
|
Thanks gheppner, I've attached a file to this ticket with a patched version of MongoDB for Ubuntu 14.04 LTS that you can use for testing.

Please let us know how it goes.

Thanks,
| Comment by George Heppner [ 28/Jun/16 ] |
|
We're on Ubuntu 14.04 LTS.
| Comment by Ramon Fernandez Marina [ 28/Jun/16 ] |
|
gheppner, we have a patch that has passed all our internal qualification tests, so we can send it your way for testing. Can you please tell me what platform(s) you will be testing on, so I can produce the right binaries?

Thanks,
| Comment by Ramon Fernandez Marina [ 27/Jun/16 ] |
|
Thanks for the additional input gheppner. The current changes for
| Comment by George Heppner [ 27/Jun/16 ] |
|
Per my comment: the node in "blue" is routinely using 100% of its cache despite query volume dropping off, while the other two shards are only at around 80%.
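Purely as an illustration of the comparison being described, a hedged mongo shell sketch for reading the cache-used percentage from each shard primary; the host names are placeholders, and authentication (if enabled) is not shown.

    // Sketch: compare WiredTiger cache usage across shard primaries.
    // Host names are placeholders for the deployment's actual primaries.
    var primaries = [
        "shard0-primary.example.net:27017",
        "shard1-primary.example.net:27017",
        "shard2-primary.example.net:27017"
    ];

    primaries.forEach(function (host) {
        var conn  = new Mongo(host);      // direct connection, no auth shown
        var cache = conn.getDB("admin").serverStatus().wiredTiger.cache;
        var pct   = 100 * cache["bytes currently in the cache"] /
                          cache["maximum bytes configured"];
        print(host + "  % cache used: " + pct.toFixed(1));
    });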
| Comment by George Heppner [ 27/Jun/16 ] |
|
I wanted to chime in and say we are being impacted by this exact same issue, on MongoDB 3.2.5. In the early versions of 3.2.x we did not see this problem; we first noticed it on 3.2.4. I'm attaching a graph that tells a very similar story to what the reporter is seeing.

In our case, I do (much less frequently) see this same issue in our dev environment, so testing a patch might be an option for us.

This is causing us fairly serious problems. As you can see in the attached image, having one shard experience such huge delays (which come with a huge load average and system CPU time) is badly hurting our throughput. I am debating scaling up on either shards or memory, since I know our working set is fairly large, but I'm hesitant because, if it were not for this issue, I feel we'd be getting adequate performance with what we have today.

If providing correlating diagnostic data from our deployment would be of help here, I'm happy to try and provide it.
| Comment by Ramon Fernandez Marina [ 22/Jun/16 ] |
|
No problem; given your current situation I'd do the same thing. Thanks for getting back to us, and we'll keep you posted on any developments.
| Comment by Michael Templeman [ 22/Jun/16 ] |
|
Ramon, I am understandably reluctant to run a patch on our production system.

Mike
| Comment by Ramon Fernandez Marina [ 22/Jun/16 ] |
|
miketempleman, after taking a closer look at the diagnostic data you uploaded, the evidence suggests you are indeed affected by the linked ticket.

I understand testing a patch build is not ideal if you're seeing this in production, but field testing would help us make progress on this ticket. Please let us know your thoughts.

In the meantime we're also investigating whether this could be a manifestation of a variation of a separate known issue.

Thanks,
| Comment by Mike Templeman [ 21/Jun/16 ] |
|
Ramon,

No worries about the delay. This clearly is a difficult problem, and it might well be the same one as the linked ticket.

The attached image shows the period of time going from (1) everything working without a problem, to (2) a stall occurring on the primary of the primary shard, to (3) a stepdown of that primary followed by a restart of the affected server. Some of the behavior does indicate a possible deadly embrace, though I would not expect CPU load to go so high in that situation. I was thinking that the issue was limited CPU capacity when the cache is fully loaded, but after reading the linked ticket I'm less sure.

I hope this information provides a little help to the engineers.

Mike
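For clarity, a hedged sketch of step (3) as described above, run from a shell connected to the affected primary; the 60-second stepdown window is illustrative only, and the subsequent restart of the mongod process happens on the host itself rather than from the shell.

    // Sketch of step (3): demote the affected primary so a secondary
    // takes over. 60 is an illustrative number of seconds during which
    // this node will not seek re-election.
    rs.stepDown(60);
    // The mongod process on the affected server is then restarted via
    // the host's service manager (not expressible from the shell).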
| Comment by Ramon Fernandez Marina [ 21/Jun/16 ] |
|
Apologies for the radio silence miketempleman. This issue has proven to be elusive, but there have been some recent developments: I believe the symptoms you describe are the same as those in the linked ticket.

I'll check with my colleagues, but if they agree that you're likely running into it, we can look at providing a patch build for testing.

Thanks for your patience on this matter.

Regards,
| Comment by Mike Templeman [ 12/May/16 ] |
|
Could the issue simply be that I need more CPU resources to handle the compression/decompression processing done by WiredTiger under this load? It doesn't appear that I/O latency is an issue.
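As a quick sanity check related to the question above, a hedged sketch for seeing which block compressor a collection is actually using; the database and collection names are placeholders, and snappy is the WiredTiger default in 3.2.

    // Sketch: inspect the WiredTiger creation string for a collection's
    // block compressor. "mydb" and "mycoll" are placeholder names.
    var cs = db.getSiblingDB("mydb").getCollection("mycoll")
               .stats().wiredTiger.creationString;
    print(cs.match(/block_compressor=[^,]*/)[0]);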
| Comment by Mike Templeman [ 10/May/16 ] |
|
The %system CPU load when in this state seems very high as well. Here is some iostat 5 output:
| Comment by Mike Templeman [ 09/May/16 ] |
|
Attached are the diagnostic logs for the last 5 days. The latest log covers a period where the core server/WiredTiger entered the cache-load/ticket-exhaustion state.
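For anyone following along, a hedged sketch of how to confirm that the diagnostic data referred to here (full-time diagnostic data capture, written under the dbpath in diagnostic.data) is being collected on a 3.2 node:

    // Sketch: check whether full-time diagnostic data capture is enabled.
    db.adminCommand({ getParameter: 1, diagnosticDataCollectionEnabled: 1 });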
| Comment by Daniel Pasette (Inactive) [ 08/May/16 ] |
|
Hi Mike,

Thanks,
Dan