[SERVER-27818] mongod 3.2.11 out of memory - killed by OOM killer Created: 26/Jan/17  Updated: 31/Jan/17  Resolved: 31/Jan/17

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.2.11
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Meni Livne Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

EC2 m3.xlarge server with 15GB of RAM and 15GB of swap space.


Attachments: Text File shard0-dmesg.txt    
Operating System: ALL
Participants:

 Description   

Our deployment has 4 shards running version 3.2.11, each consisting of a primary, a secondary, and an arbiter, using the WiredTiger engine. Primaries and secondaries run on m3.xlarge servers with 15 GB of RAM and 15 GB of swap space. After running smoothly for several months, mongod has recently started getting killed by the kernel for running out of memory. This usually happens on secondaries, but can also happen on primaries. The servers do not appear to be under unusual query load when it happens.

We do not have any text indexes so it does not seem related to SERVER-18926.

Attached is example dmesg output from one of the shards. We can also provide the contents of the diagnostic.data directory.
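For reference, a minimal sketch of one way to scan a saved dmesg dump for OOM-killer events; the default file name and the message patterns below are assumptions based on typical kernel output, not taken from the attachment itself:

# scan a saved dmesg dump (e.g. shard0-dmesg.txt) for OOM-killer messages
import re
import sys

OOM_PATTERNS = (
    re.compile(r"invoked oom-killer"),
    re.compile(r"Out of memory: Kill process"),
    re.compile(r"Killed process \d+ \(mongod\)"),
)

def find_oom_events(path):
    """Return the dmesg lines that mention the OOM killer."""
    with open(path, errors="replace") as f:
        return [line.rstrip() for line in f
                if any(p.search(line) for p in OOM_PATTERNS)]

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "shard0-dmesg.txt"
    for event in find_oom_events(path):
        print(event)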



 Comments   
Comment by Kelsey Schubert [ 30/Jan/17 ]

Hi meni@contactspls.com,

It looks like the issue you are experiencing is likely caused by the configuration of Xen. Memory ballooning and/or a full swap could explain this behavior.

From the information you have provided, I do not see anything to indicate a bug in the MongoDB server. For MongoDB-related support, please post on the mongodb-user group or on Stack Overflow with the mongodb tag; a question like this, which needs further discussion, is best suited to the mongodb-user group.
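A minimal sketch of one way to confirm whether swap is being exhausted on such a node, assuming Linux and the standard /proc/meminfo layout:

# read swap usage from /proc/meminfo (values are reported in kB)
def swap_usage_kb():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])
    return info["SwapTotal"], info["SwapFree"]

if __name__ == "__main__":
    total, free = swap_usage_kb()
    print("swap used: {} kB of {} kB".format(total - free, total))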

Kind regards,
Thomas

Comment by Meni Livne [ 28/Jan/17 ]

There are no containers. As for limits, max resident set is unlimited. These are the full limits for the mongod process.

Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            8388608       unlimited     bytes
Max core file size        0             unlimited     bytes
Max resident set          unlimited     unlimited     bytes
Max processes             64000         64000         processes
Max open files            64000         64000         files
Max locked memory         65536         65536         bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       60081         60081         signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us

(Kernel version: Linux 4.4.0-59-generic #80-Ubuntu SMP Fri Jan 6 17:47:47 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux)
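(For reference, a minimal sketch of how these figures can be captured from a running mongod, assuming Linux /proc and a single local mongod process:)

# capture /proc/<pid>/limits for the first mongod process found
import subprocess

def mongod_limits():
    """Read the kernel-reported resource limits of the running mongod."""
    pid = subprocess.check_output(["pidof", "mongod"]).split()[0].decode()
    with open("/proc/{}/limits".format(pid)) as f:
        return f.read()

if __name__ == "__main__":
    print(mongod_limits())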

Comment by Kelsey Schubert [ 27/Jan/17 ]

Hi meni@contactspls.com,

Thank you for uploading the files. I've examined the system logs and see that total memory utilization when the system killed mongod was 8259476 kB, or about 8 GB. Are you aware of any containers or ulimits that are constraining the available memory?
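A minimal sketch of one way to check for such constraints, assuming Linux; the cgroup v1 path below is an assumption and may differ on this system:

# report process memory rlimits and any cgroup v1 memory limit
import resource

def report_memory_constraints():
    # Note: these rlimits are those of the calling process; run this under the
    # same user/service configuration as mongod for a meaningful comparison.
    for name, rlim in (("address space", resource.RLIMIT_AS),
                       ("resident set", resource.RLIMIT_RSS)):
        soft, hard = resource.getrlimit(rlim)
        print("max {}: soft={} hard={}".format(name, soft, hard))
    try:
        with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
            print("cgroup memory limit (bytes):", f.read().strip())
    except OSError:
        print("no cgroup v1 memory controller at the assumed path")

if __name__ == "__main__":
    report_memory_constraints()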

Kind regards,
Thomas

Comment by Meni Livne [ 26/Jan/17 ]

We've uploaded the diagnostic.data files for the 3 shards whose secondaries experienced the OOM. The files were copied when the mongod processes were down after being killed by the kernel.

We've also uploaded the log files for those servers, covering the time from when they were brought up until they were killed, and afterwards.
operationProfiling.slowOpThresholdMs is set to 200 ms, so the logs contain many entries for queries that exceeded that threshold.
Note that the shard2 server was killed a second time due to OOM a few minutes after restarting it, while it was still syncing the missed ops from the primary.
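A minimal sketch of one way to read the current slow-op threshold and memory figures from a running node; it assumes pymongo is installed and that mongod listens on localhost:27017:

# query the profiling threshold and memory statistics over the wire
from pymongo import MongoClient

client = MongoClient("localhost", 27017)

# the profile command with -1 reports the current settings without changing them
settings = client.admin.command("profile", -1)
print("slowms:", settings.get("slowms"))

status = client.admin.command("serverStatus")
print("resident MB:", status["mem"]["resident"])
print("WiredTiger cache bytes:",
      status["wiredTiger"]["cache"]["bytes currently in the cache"])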

Comment by Kelsey Schubert [ 26/Jan/17 ]

Hi meni@contactspls.com,

Would you please upload the diagnostic.data and complete logs for the affected node to this secure portal?

Thank you,
Thomas
