[SERVER-22000] MongoDB version 3.2; WiredTiger is killed by OOM killer
Created: 23/Dec/15 | Updated: 14/Apr/17 | Resolved: 08/Feb/16

| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability, WiredTiger |
| Affects Version/s: | 3.2.0 |
| Fix Version/s: | None |
| Type: | Bug |
| Priority: | Critical - P2 |
| Reporter: | Amanpreet Singh |
| Assignee: | Ramon Fernandez Marina |
| Resolution: | Duplicate |
| Votes: | 0 |
| Labels: | WTmem, WTplaybook |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
I just upgraded MongoDB from 3.0.8 to 3.2 and changed the storage engine to WiredTiger. Now the primary of my replica set crashes almost every day. There's no error in the mongod log. Tailing dmesg shows this -

The nproc, ulimit, etc. values are set. The instance has 30 GB of RAM, and the same instance used to run MongoDB with MMAPv1 without any issues.
| Comments |
| Comment by Ramon Fernandez Marina [ 14/Apr/17 ] |
pieterwjordaanpc@gmail.com, the graphs are generated by internal tools. You can find information about the FTDC implementation here.
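As an illustrative aside (not part of the original reply), whether FTDC collection is enabled on a node can be confirmed from the mongo shell; the parameter name below is the documented one, but treat the snippet as a sketch:

```javascript
// Hedged sketch: confirm that FTDC (the source of diagnostic.data) is enabled.
db.adminCommand({ getParameter: 1, diagnosticDataCollectionEnabled: 1 });
// Expected shape: { "diagnosticDataCollectionEnabled" : true, "ok" : 1 }
```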
| Comment by Pieter Jordaan [ 14/Apr/17 ] |
Though not relevant, how are the graphs generated from the diagnostic data? I've seen many Jira entries showing them but don't know how to create them.
| Comment by Ramon Fernandez Marina [ 08/Feb/16 ] |
amanpreet@codigami.com, we believe you're running into

At A, B, and C we see jumps in the WiredTiger "open cursor count", and at those points or shortly thereafter we see the jumps in "current_allocated_bytes" that account for a large part of the excess. I'd recommend you review the places in your application that need to open that many cursors and see if they can be cleaned up after use. Depending on your use case, an alternative may be to lower the idle cursor timeout as described in

I'm going to resolve this ticket as a duplicate of

Regards,
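For illustration only (not part of Ramon's reply), the "open cursor count" referred to above can be read from the mongo shell via serverStatus; the field paths below assume the 3.2-era WiredTiger statistics layout:

```javascript
// Hedged sketch: read WiredTiger session/cursor counters from serverStatus.
// Field names assume the 3.2-era statistics layout; verify on your build.
var wt = db.serverStatus().wiredTiger;
printjson({
    openCursors:  wt.session["open cursor count"],
    openSessions: wt.session["open session count"]
});
```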
| Comment by Tomasz Pajor [ 27/Jan/16 ] |
Actually if you could do this with 3.2.1 it would be great.
| Comment by Amanpreet Singh [ 27/Jan/16 ] |
Okay. Will do that.
| Comment by Tomasz Pajor [ 27/Jan/16 ] |
Sad to hear that.
| Comment by Amanpreet Singh [ 27/Jan/16 ] |
Just now, it crashed with a 15 GB cache size as well.
| Comment by Amanpreet Singh [ 26/Jan/16 ] |
Hi Tomasz, I reduced cache_size from 17 GB to 15 GB. I could not test with 3.2.1 yet, since it's a prod DB and I can't afford to see it having problems again.
| Comment by Tomasz Pajor [ 25/Jan/16 ] |
Hello, does this still happen with 3.2.1? How drastically did you have to reduce cache_size?
| Comment by Amanpreet Singh [ 25/Jan/16 ] |
Hi, any update on this? I see the status is still "debugging with submitter". I have downgraded mongod to 3.0.8 and reduced the WiredTiger cache_size to deal with this issue for now.
| Comment by Amanpreet Singh [ 11/Jan/16 ] |
Yes, that's correct. It's running only the mongod process.
| Comment by Tomasz Pajor [ 11/Jan/16 ] |
This is just a suggestion from my end; I don't know if it will fix your issue. So on this instance you are only running the mongo server, nothing else that might consume the resources?
| Comment by Amanpreet Singh [ 11/Jan/16 ] |
The relevant log output is:

This instance is an r3.xlarge on AWS. It has ~30 GB of memory. But I'm surprised: if this is the issue, why is the default value set so high?
| Comment by Tomasz Pajor [ 11/Jan/16 ] |
From the documentation (https://docs.mongodb.org/manual/reference/configuration-options/#storage.wiredTiger.engineConfig.cacheSizeGB):

Changed in version 3.2: With WiredTiger, MongoDB utilizes both the filesystem cache and the WiredTiger cache. By default, starting in MongoDB 3.2, the WiredTiger cache will use 60% of RAM minus 1 GB, or it will use 1 GB, whichever is larger. For systems with up to 10 GB of RAM, this is less than or equal to the 3.0 setting. For systems with more than 10 GB of RAM, the configuration is greater than the 3.0 setting.

For me it helped to lower it from the 60% default to 8 GB, because based on the 60% it was setting it to 18 GB. Can you show me the log output? Please look for a line like "wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=sna
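To put that formula in context: for the reporter's ~30 GB instance, 0.6 × 30 GB − 1 GB ≈ 17 GB, which matches the 17G cache size mentioned elsewhere in this thread. As a minimal sketch (not part of the original comment), the effective limit can be checked, and lowered without a restart, from the mongo shell; the persistent equivalent is storage.wiredTiger.engineConfig.cacheSizeGB in mongod.conf:

```javascript
// Hedged sketch: inspect the configured WiredTiger cache limit and, if
// desired, lower it at runtime. wiredTigerEngineRuntimeConfig is a documented
// setParameter; persist the change via cacheSizeGB in mongod.conf.
db.serverStatus().wiredTiger.cache["maximum bytes configured"];   // current limit, in bytes
db.adminCommand({ setParameter: 1,
                  wiredTigerEngineRuntimeConfig: "cache_size=8G" });
```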
| Comment by Amanpreet Singh [ 10/Jan/16 ] |
Hi Tomasz, how do I verify if it's due to cacheSizeGB? I haven't manually set the cacheSizeGB value. The instance is running just one mongod process and nothing else. Shouldn't the default value be okay?
| Comment by Tomasz Pajor [ 10/Jan/16 ] |
Can you please verify whether this is happening due to cacheSizeGB? I had a similar problem and changing it fixed it for me.
| Comment by Amanpreet Singh [ 31/Dec/15 ] |
Hi Dan, rs.status() doesn't show any information about protocolVersion. But I believe it would be version 0, since I just upgraded mongod from 3.0.8 to 3.2 on each member one-by-one, and the replica set was initialized before version 3.2.
| Comment by Daniel Pasette (Inactive) [ 31/Dec/15 ] |
You can check by calling rs.status() in the mongo shell and looking for the protocol version in the output. If you have simply upgraded, or are using a replica set with mixed versions, then you are using protocol version 1. Thanks for the additional information; we are continuing to investigate the issue.
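As an illustrative aside (not part of the original exchange), the protocol version can also be read from the replica set configuration document in the mongo shell; the handling of a missing field below is an assumption based on the pre-3.2 default:

```javascript
// Hedged sketch: rs.conf() returns the replica set configuration.
// An absent protocolVersion field is assumed to mean the pre-3.2
// default, protocol version 0.
var cfg = rs.conf();
print("protocolVersion: " + (cfg.protocolVersion === undefined ? 0 : cfg.protocolVersion));
```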
| Comment by Amanpreet Singh [ 29/Dec/15 ] |
Hi Bruce, I'll try to answer the questions as much as possible:

Thanks. Let me know if you need anything else. Best,
| Comment by Bruce Lucas (Inactive) [ 28/Dec/15 ] |
Hi Aman, understood. We may have identified another system internally that is showing similar behavior. For comparison purposes, can you tell us generally about your setup:

Thanks,
| Comment by Amanpreet Singh [ 24/Dec/15 ] |
Hi Bruce, I'm afraid providing logs won't be possible, since there are some operations logged which contain sensitive data. Best,
| Comment by Bruce Lucas (Inactive) [ 24/Dec/15 ] |
Hi Aman, thanks for providing that data. Here is a timeline of memory usage:

We see increases in allocated memory at some specific times, for example around A and B above. I can't find anything in the stats to account for those increases, but perhaps if we look in the mongod log we will find some clues. Would you be able to upload the mongod log files corresponding to this timeframe (the mongod restart at around 15-21-21 08:00 UTC, and the following day or so)? Thanks,
| Comment by Amanpreet Singh [ 24/Dec/15 ] |
Hey Bruce, I have uploaded the diagnostic.data tarball at the private link you provided. Let me know if you find anything. Thanks! Best,
| Comment by Bruce Lucas (Inactive) [ 23/Dec/15 ] |
Hi Amanpreet, I understand your concern. We don't at the moment have a viewer for the data that is ready for external use. However, you can see the data that it collects here, consisting of the following, with the commands you can run from the mongo shell to see what data is collected:

If you are willing to share that information with us (with the assurance that we will keep it private) but don't wish to post it on this public ticket, you can upload it here. Thanks,
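For illustration only (the original list of commands is not reproduced above), two mongo shell commands commonly used to inspect the same kind of data are shown below; both exist in 3.2:

```javascript
// Illustrative sketch: commands that surface the kind of statistics FTDC
// records (memory, WiredTiger, and replica set state).
db.serverStatus();                          // memory, tcmalloc, and WiredTiger sections
db.adminCommand({ replSetGetStatus: 1 });   // replica set status
```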
| Comment by Amanpreet Singh [ 23/Dec/15 ] |
Hi Bruce, thanks for adding the comment. The files in the diagnostic.data directory look like binary. Could you please tell me how to view this data? I don't want to unknowingly share any secret data. Best,
| Comment by Bruce Lucas (Inactive) [ 23/Dec/15 ] |
Hi Amanpreet, can you please archive and attach to this ticket the $dbpath/diagnostic.data directory from the affected node? It will contain periodically collected serverStatus data, including memory statistics, that will help us diagnose the cause. Thanks,
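As a final illustrative note (not part of Bruce's request), the dbpath that contains diagnostic.data can be read from the mongo shell; the field path below assumes dbPath was set explicitly via the command line or config file:

```javascript
// Hedged sketch: locate the dbpath holding diagnostic.data.
// parsed.storage.dbPath is present when dbPath was set explicitly;
// otherwise the server is using the default path.
var opts = db.adminCommand({ getCmdLineOpts: 1 });
print("dbpath: " + (opts.parsed.storage ? opts.parsed.storage.dbPath : "(default)"));
```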