[SERVER-22482] Cache growing to 100% followed by crash Created: 05/Feb/16 Updated: 12/Feb/16 Resolved: 12/Feb/16

| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.2.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Mike Templeman | Assignee: | Kelsey Schubert |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | Start primary shard server during a busy day and wait a couple of hours |
| Participants: | |
| Description |
The primary server on my primary shard has repeatedly run into crashes in production. mongostat reports the cache used% growing to 100% (and sometimes 101%) and the dirty% to over 90%. Once this situation occurs it is only a matter of time until the server crashes. The process's memory size and resident set do not grow to the point where the server would crash from running out of memory. Opening the log file for either server (this happened to both the promoted secondary and the original primary) I find thousands of lines with these error messages:

The end of the log has no crash information at all.
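
For context on the numbers above, the used% and dirty% that mongostat shows can also be read directly from serverStatus. The following is a minimal sketch run against the affected mongod; the host and port are assumptions:

```sh
# Minimal sketch: compute the cache used% / dirty% that mongostat reports,
# straight from serverStatus. Host and port are assumptions; adjust as needed.
mongo --host localhost --port 27017 --quiet --eval '
  var c     = db.serverStatus().wiredTiger.cache;
  var max   = c["maximum bytes configured"];
  var used  = c["bytes currently in the cache"];
  var dirty = c["tracked dirty bytes in the cache"];
  print("cache used %:  " + (100 * used  / max).toFixed(1));
  print("cache dirty %: " + (100 * dirty / max).toFixed(1));
'
```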
| Comments |
| Comment by Ramon Fernandez Marina [ 12/Feb/16 ] |

Hi Mike, the 3.2.2 release is currently scheduled for next week. The 3.2.2-rc2 release candidate should have the same bits.

Regards,
| Comment by Michael Templeman [ 12/Feb/16 ] |

Thanks Thomas. Just as an FYI, do you have an expected date for releasing 3.2.2?

– Mike Templeman
| Comment by Kelsey Schubert [ 12/Feb/16 ] |

Hi miketempleman, thank you for opening this ticket. I'm sorry you have run into these issues. I'd like to summarize our findings:

Since we have identified the causes for this behavior, I will be closing this ticket as a duplicate of these tickets. If you see the WT cache growing significantly past its limit, or if you experience any other issues that cannot be explained by these tickets, please do not hesitate to open a new ticket. Thank you again,
| Comment by Michael Templeman [ 05/Feb/16 ] |

Bruce, I will set the cache size on all instances to the default. I am surprised by the number of connections in comparison to the other instances.

Finally and most importantly, please accept my personal apology for my earlier message.

– Mike Templeman
| Comment by Michael Templeman [ 05/Feb/16 ] |

Ramon, can you delete my last message? It was both unprofessional and unfair.

Thanks,
– Mike Templeman
| Comment by Bruce Lucas (Inactive) [ 05/Feb/16 ] |

Hi Mike,

We've reviewed the data and it appears that:

Given this, it doesn't appear there are any additional issues beyond what we are already aware of related to memory usage, and reducing the WT cache to its default value should eliminate the OOM crashes. (Note, by the way, that reducing the WT cache does not necessarily reduce performance, and under some conditions can in fact improve it, because the remaining memory is used by the kernel as a filesystem cache, which can have an even greater benefit than the WT cache.)
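
For reference, here is a minimal sketch of what reducing the WT cache to its default value could look like, reusing the reporter's startup options quoted in a later comment and simply omitting --wiredTigerCacheSizeGB; the xxxxx placeholders are kept from that comment:

```sh
# Minimal sketch: restart mongod without --wiredTigerCacheSizeGB so WiredTiger
# sizes its cache using the built-in default instead of the explicit 24 GB.
# The xxxxx placeholders are kept from the reporter's command and must be
# replaced with the real log path and replica set name.
mongod --storageEngine wiredTiger \
       --logpath xxxxx \
       --oplogSize 90000 \
       --fork \
       --replSet xxxxx
```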
| Comment by Ramon Fernandez Marina [ 05/Feb/16 ] |

miketempleman, my colleagues in the Replication team tell me that the Fatal assertion 28723 error you're seeing in the logs has been fixed for the upcoming 3.2.2 release.

I looked at the diagnostic data and it doesn't seem like there were any other problems going on, but I'd still recommend you consider using the default cache size for the WiredTiger engine. I understand if your decision to move back to 3.0 is firm, but I'd still like to encourage you to give 3.2.2-rc0 a try.

Thanks,
| Comment by Ramon Fernandez Marina [ 05/Feb/16 ] |

Thanks for the additional information, miketempleman. I'm sorry to hear that the experience with 3.2.1 hasn't been good so far, and I hope we can get to the bottom of these issues to make MongoDB 3.2 better. If you want to downgrade, I'd recommend 3.0.9, the latest stable release in the 3.0 series.

The first thing I need to mention is that it is not recommended to increase the WiredTiger cache size: this setting is the size limit for the WiredTiger cache only. mongod will use additional memory beyond it, so increasing the cache size may reduce performance and increases the chances that the OOM killer will terminate mongod. That being said, the log does show a fatal assertion that needs a closer look:

I'm also looking at the diagnostic data; please continue to watch this ticket for updates.

Thanks,
| Comment by Mike Templeman [ 05/Feb/16 ] |

Log of failed restart on secondary due to oplog inconsistencies
| Comment by Mike Templeman [ 05/Feb/16 ] |

Files in <data>/diagnostic.data/
| Comment by Mike Templeman [ 05/Feb/16 ] |

Configuration is an AWS EC2 i2.xlarge instance (4 virtual CPUs, 30.5 GB RAM, 800 GB ephemeral SSD) running AWS Linux (ami-1ecae776). The data is stored on the ephemeral drive. Startup options are:

--storageEngine wiredTiger --wiredTigerCacheSizeGB 24 --logpath xxxxx --oplogSize 90000 --fork --replSet xxxxx

The log file of the server after restart is for a secondary. Are you sure you want that log?
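
Since the host has 30.5 GB of RAM and the cache limit was raised to 24 GB, it can help to compare the host's free memory with mongod's resident set before and after changing the cache size. A minimal sketch, assuming a Linux host and a process named mongod:

```sh
# Minimal sketch, assuming a Linux host and a process named "mongod":
# compare total/available memory with mongod's resident set size (RSS)
# to see how much headroom remains for the filesystem cache.
free -m
ps -o pid,rss,vsz,comm -C mongod
```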
| Comment by Ramon Fernandez Marina [ 05/Feb/16 ] |

miketempleman, I'm not sure the behavior you describe (cache growth followed by a crash) is related to the messages you're seeing in the logs. The WTIndex::updatePosition errors you're seeing look like a manifestation of a known issue. As for the cache growth and crash, please send us the following information so we can investigate further:

If the end of the log has no information at all, it's possible that the mongod process was killed by the OS, which can happen if the machine runs out of memory; there is also a known issue that can allow the cache to grow past its configured limit. The information requested above should point us in the right direction.

Thanks,
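
As a rough illustration of how to check for OOM-killer activity and package the diagnostic data mentioned above, here is a minimal sketch assuming a typical Linux host; kernel log locations and the <dbpath>/<logpath> placeholders vary by system:

```sh
# Minimal sketch, assuming a typical Linux host; adjust paths as needed.

# 1. Look for evidence that the kernel's OOM killer terminated mongod:
dmesg | grep -iE 'out of memory|killed process'

# 2. Package the diagnostic data and the mongod log for upload
#    (<dbpath> and <logpath> are placeholders, as elsewhere in this ticket):
tar czf server-22482-data.tar.gz <dbpath>/diagnostic.data <logpath>
```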