[SERVER-23001] Occasional 100% cache use cripples server
Created: 08/Mar/16 | Updated: 18/Jun/16 | Resolved: 28/Mar/16
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.0.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Paul Ridgway | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Distributor ID: Ubuntu; mongod -version; AWS c4.4xlarge; 1 x 6TB gp2 volume for mongo exclusively; no other significant processes or services running/consuming resources |
| Attachments: | |
| Issue Links: | |
| Operating System: | Linux |
| Participants: | |
| Description |
|
Apologies, this may be a WT ticket. When running WiredTiger in a 3-node replica set, after a period of time one or more of the nodes will consistently run above 80% cache used and grow until it hits 100%, at which point it struggles to serve reads (or writes when master) in a timely fashion. Usually a restart and/or stepDown helps. So far, adding more eviction threads or reducing the eviction trigger and target does not help (see the sketch after this description), nor does providing more RAM/cache.

Before this happens mongod uses ~45-50% RAM; once the issue presents, RAM goes over 50% - anecdotally sometimes up to 80%. The issue seems to occur randomly after days or more of apparent stability. There is no apparent stress on the VM. More RAM did not help; more CPU seems to have reduced the rate of occurrences (we were previously on an m4.2xlarge). Before running on the current instances there was a lot of system CPU. Running on 3 x AWS c4.4xlarge, 16 cores, 60 GB RAM.

Mongostat output from the affected slave (during the issue):
IOStat:
iotop:
|
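The eviction tuning mentioned in the description (more eviction threads, lower eviction trigger/target) is normally applied at runtime through the wiredTigerEngineRuntimeConfig server parameter. The following is only a hedged sketch with illustrative values, not the settings that were tried on this cluster:

```sh
# Hedged sketch: runtime eviction tuning of the kind the description refers to.
# The thread counts and percentages are illustrative values only.
mongo admin --eval 'printjson(db.adminCommand({
  setParameter: 1,
  wiredTigerEngineRuntimeConfig:
    "eviction=(threads_min=4,threads_max=8),eviction_target=75,eviction_trigger=90"
}))'
```

In WiredTiger, eviction_target and eviction_trigger are percentages of the configured cache size: background eviction tries to keep usage below the target, and application threads are pulled into eviction once the trigger is crossed.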
| Comments |
| Comment by Bruce Lucas (Inactive) [ 18/Jun/16 ] |
|
Thanks Paul. I can confirm that this looks like an instance of the issue or issues we're investigating in the linked tickets. |
| Comment by Paul Ridgway [ 18/Jun/16 ] |
|
I've got the diagnostics from the 3 nodes. The primary moved around a lot in the chaos and there was lots of restarting and resizing instances (none of which helped). I uploaded them using the Support File Upload tool - hope it helps! Thanks |
| Comment by Ramon Fernandez Marina [ 17/Jun/16 ] |
|
Thanks for the update paul@thefloow.com, sorry to hear 3.2 did not fix your use case. We're investigating performance issues when the WiredTiger cache fills up in a related ticket. MongoDB 3.2 automatically collects diagnostic data in the diagnostic.data directory inside your dbpath. If you still have those files and could upload them to this ticket, we could investigate whether this indeed matches that issue. Thanks, |
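As a hedged illustration of collecting the diagnostic.data directory mentioned above, the directory can simply be archived and uploaded; the dbPath below is an assumed example, not taken from this ticket:

```sh
# Hedged sketch: package the diagnostic data MongoDB 3.2 writes inside the
# dbPath. /data/db is an assumed dbPath; substitute your own.
tar czf diagnostic-data-$(hostname).tar.gz -C /data/db diagnostic.data
```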
| Comment by Paul Ridgway [ 17/Jun/16 ] |
|
In a high-volume prod env 3.2 was catastrophically worse and we had to roll back. We plan to update some slaves to get logs when we can. In testing it was fine but it was disastrous under the production workload ... EDIT: I should add it was crippling every node, master and slave, within moments of a restart at times, but not predictably. On 3.0 a stepDown and restart would cure a stressed node. |
| Comment by Ramon Fernandez Marina [ 28/Mar/16 ] |
|
paul.ridgway, I am going to mark this ticket as a duplicate of the linked ticket. Thanks, |
| Comment by Paul Ridgway [ 22/Mar/16 ] |
|
Thanks - I guess, since a restart helps, there are pages suitable for eviction and it just can't find them fast enough? We are trialling 3.2 in dev currently so an upgrade is certainly an option - do you know when it is likely to be backported? Cheers |
| Comment by Bruce Lucas (Inactive) [ 22/Mar/16 ] |
|
Thanks paul.ridgway. This looks similar to the previous incident:
We suspect this could be related to the linked ticket. Thanks, |
| Comment by Paul Ridgway [ 22/Mar/16 ] |
|
Hi, It happened again at ~07:50 on the same node I had stepped down at ~6AM (now a slave): 100% cache use, 'slow' queries as a result, 59.8% system RAM usage. Logs submitted (including ss.log) as node3-22mar2016.zip. It appeared to recover following a restart. EDIT: We now log some mongo metrics with more granularity in our monitoring system, so I am attaching plots of cache and RAM usage - note node3 was primary, stepped down, and node1 took over at 8am. Interestingly node2 is quite choppy on cache and is the only one using over 50% system RAM. Attached is the WT cache usage during the incident; also attached is a plot for the 3 nodes, colour-matching WT cache (solid) to RAM (dotted). As a reminder, these VMs have 30GB of RAM. Thanks, Paul |
| Comment by Paul Ridgway [ 22/Mar/16 ] |
|
Hi, It happened again this morning on the primary, possibly as early as 2AM, and stopped at 6:08AM once it was stepped down. Mongo system memory usage was up to 58.3% and cache use was solidly at 100%, causing slow queries (queries which would not normally be considered slow). It seems the ss.log writer crashed at some point beforehand, but I have iostat. Logs submitted. Thanks, Paul |
| Comment by Paul Ridgway [ 19/Mar/16 ] |
|
Hi Bruce, I tried both before and after - tried a large range of min and max threads too - it all made zero impact. Cheers, Paul |
| Comment by Bruce Lucas (Inactive) [ 19/Mar/16 ] |
|
Hi Paul, The tool used to generate the graphs is an internal tool that we don't yet feel is in a condition to make public. Thanks for the confirmation. One other question occurs to me: you mentioned that adding more eviction threads didn't help, but you also mentioned that adding CPU did help somewhat. Did you try increasing the eviction threads before or after you increased CPU? If it was before, would you be able to try more eviction threads now that you have more CPU? Thanks, |
| Comment by Paul Ridgway [ 19/Mar/16 ] |
|
Hi Bruce, Thanks for following up! This incident was not exactly as we've seen before - as you say, there are some larger queries (aggregate) running. In previous incidents we've stopped all but the core consumer services and seen no recovery. Additionally, despite letting the aggregate queries keep running, a restart was enough to recover the instance and have the cache sit around 80%. The memory did peak over 50% but not as badly as before. As you note, the main issue still seems to be an eviction rate that cannot keep up despite plenty of resources being available (CPU, RAM, IOPS, disk throughput, etc.). Cheers, Paul P.S. - is there a way for us to produce those plots or is it a private tool? |
| Comment by Bruce Lucas (Inactive) [ 18/Mar/16 ] |
|
paul.ridgway, thanks for uploading that snapshot. Here's a plot of some of the data from the end of the snapshot:
We can see that starting at C, at about 2016-03-17 09:13:38Z, the cache usage begins to rise, and around D, at about 2016-03-17 09:16:58.828Z, it reaches 100%. At this point we see application threads begin to do evictions because of the 100% cache full condition, and this would have impacted performance. Can you confirm that this corresponds to events as you observed them? Regarding your opening comment that mongod RAM usage sometimes goes as high as 80%, it doesn't appear that we are seeing that in this instance, as RAM usage appears about as expected, with virtual memory around 19 GB; can you confirm? The trigger in this case appears, based on the coincidence in timing, to be a number of aggregation queries that started at C, around 2016-03-17 09:13:38Z. I imagine these queries were probably accessing data not in cache so were creating increased cache pressure; does this seem plausible to you? Here are some statistics about cache churn just before and just after C:
We see an increase in pages read into the cache, presumably a result of the aggregations, but a decrease in both modified and unmodified pages evicted per second, for a net increase of about 500-600 pages per second in the cache, consistent with the cache growth we see during this period. What is not clear so far, however, is why evictions weren't able to keep up in order to keep cache usage at target levels below 100%; we will continue to look into that. Bruce |
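The cache-churn figures above (pages read into the cache versus modified/unmodified pages evicted) come from counters that serverStatus exposes under wiredTiger.cache. A hedged sketch of sampling them yourself; the 10-second window is an arbitrary illustrative choice:

```sh
# Hedged sketch: print per-second cache churn from two serverStatus samples.
# Counter names follow the standard serverStatus wiredTiger.cache section.
mongo --quiet --eval '
  function snap() {
    var c = db.serverStatus().wiredTiger.cache;
    return {
      used:    c["bytes currently in the cache"],
      max:     c["maximum bytes configured"],
      readIn:  c["pages read into cache"],
      evMod:   c["modified pages evicted"],
      evClean: c["unmodified pages evicted"]
    };
  }
  var a = snap();
  sleep(10 * 1000);                    // 10-second sample window
  var b = snap();
  print("cache fill %:            " + (100 * b.used / b.max).toFixed(1));
  print("pages read into cache/s: " + (b.readIn - a.readIn) / 10);
  print("pages evicted/s:         " + ((b.evMod - a.evMod) + (b.evClean - a.evClean)) / 10);
'
```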
| Comment by Paul Ridgway [ 17/Mar/16 ] |
|
One slave did it this morning without a big query - I snapshotted the logs while it was happening and uploaded those (logs-17mar2016.zip). |
| Comment by Paul Ridgway [ 15/Mar/16 ] |
|
I was running an unindexed query this morning which, the first time, completed without drama in ~1000 seconds; the second run (same query) ran for an hour before I killed it, got the cache to 100%, and had mongo using 60% system RAM. After killing the query mongo eventually recovered. I'll submit the logs anyway, though I'm not certain this is the same issue - even though the symptoms are identical, we were not running large unindexed queries when it last happened... I stopped the query no more than a minute or so before taking a copy of the ss.log and iostat.log files. |
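For context on the unindexed query above: whether a query falls back to a collection scan (and therefore pulls cold data through the cache) can be checked with explain. The database, collection, and filter below are hypothetical placeholders, not taken from this ticket:

```sh
# Hedged sketch: check whether a query is served by an index or a collection
# scan. "mydb", "events", and the filter are hypothetical placeholders.
mongo mydb --quiet --eval '
  var plan = db.events.find({ userId: "u123" }).explain("executionStats");
  print("winning plan stage: " + plan.queryPlanner.winningPlan.stage); // COLLSCAN = unindexed
  print("docs examined:      " + plan.executionStats.totalDocsExamined);
'
```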
| Comment by Paul Ridgway [ 08/Mar/16 ] |
|
Thanks - I did wonder if 3.2 was better, but looking at the release notes it was hard to know if WT itself was better or whether you had back-ported critical/performance/engine fixes. Having said that, in another cluster we run 3.2 and can make it thrash the cache, but it was a different workload so it's not necessarily comparable. |
| Comment by Ramon Fernandez Marina [ 08/Mar/16 ] |
|
Understood, we'll keep this ticket open for the time being. Note that MongoDB 3.2 includes a lot of improvements over 3.0, especially in WiredTiger, so if you have the ability to test the same scenario in a 3.2 replica set it would be useful to see if this particular issue has been addressed, and whether 3.2 will be a better fit for your use case. Anyway, please let us know when you have the data and have uploaded it, and we'll take a closer look. Thanks, |
| Comment by Paul Ridgway [ 08/Mar/16 ] |
|
Hi ramon.fernandez, Thank you for getting back to me so quickly. I'll set the logging running and get back to you once the issue has presented again - it could take a week or so, it seems. Cheers, Paul |
| Comment by Ramon Fernandez Marina [ 08/Mar/16 ] |
|
paul.ridgway, in order to troubleshoot this issue we'll need some diagnostic information. I'd like to ask you to run the following commands from the command line on each of your nodes:
These commands will create two files on each node, ss.log and iostat.log. Please start data collection now; when the issue appears again, let the data collection continue for a bit (as much as you can afford without impacting your production) and then upload the resulting files here (these files won't contain any sensitive information, and they'll only be available to MongoDB staff). Thanks, |
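The exact command block from this comment did not survive the export. Purely as a hedged illustration, loops of roughly the following shape would produce files named ss.log and iostat.log; they are an assumption, not the original instructions:

```sh
# Hedged sketch only - not the original commands from this comment.
# Append one serverStatus JSON document per second to ss.log:
nohup mongo --quiet --eval '
  while (true) {
    print(new Date().toISOString() + " " + JSON.stringify(db.serverStatus()));
    sleep(1000);
  }
' > ss.log 2>&1 &

# Extended per-device I/O statistics once per second:
nohup iostat -xmt 1 > iostat.log 2>&1 &
```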