[SERVER-33809] Secondary member crashed due to OOM in production Created: 12/Mar/18  Updated: 27/Oct/23  Resolved: 13/Mar/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Roie Yossef Assignee: Dmitry Agranat
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File SERVER-33809.png     File metrics.2018-03-11T03-33-58Z-00000.gz     File metrics.2018-03-11T14-13-58Z-00000.gz     File metrics.2018-03-11T22-40-51Z-00000.gz    

 Description   

Hi All,
We have a MongoDB replica set configured with 6 members, all in the same subnet. Our application is configured to prefer reads from secondaries.
Yesterday one of the nodes ran out of memory and the kernel triggered an OOM event that killed the mongod process.
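For reference, a minimal sketch of the kind of client-side setting meant by "prefer reads from secondaries", assuming the Python driver (PyMongo); the hosts and replica set name below are placeholders, not our real topology:

# Sketch only: open a connection that prefers reading from secondaries.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017",  # placeholder hosts
    replicaSet="rs0",                                  # placeholder set name
    readPreference="secondaryPreferred",               # read from a secondary when one is available
)
print(client.admin.command("ping"))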

MongoDB is the only thing we run on this machine (inside Docker). I've checked the system logs and found this:
Mar 11 22:40:02 ip-xxx-xx-x-xxx kernel: [10313490.393104] conn14048276 invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Mar 11 22:40:02 ip-xxx-xx-x-xxx kernel: [10313490.393109] conn14048276 cpuset=aa6b9fb618f6296d15a964eea9cab273f3d9476fc66e3b24d2ca4ec8e2784e73 mems_allowed=0
Mar 11 22:40:02 ip-xxx-xx-x-xxx kernel: [10313490.393112] CPU: 0 PID: 9200 Comm: conn14048276 Not tainted 3.13.0-88-generic #135-Ubuntu
Mar 11 22:40:02 ip-xxx-xx-x-xxx kernel: [10313490.393113] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
........
Mar 11 22:40:02 ip-xxx-xx-x-xxx kernel: [10313490.393323] Out of memory: Kill process 4340 (mongod) score 912 or sacrifice child
Mar 11 22:40:02 ip-xxx-xx-x-xxx kernel: [10313490.404537] Killed process 4340 (mongod) total-vm:7052416kB, anon-rss:3681108kB, file-rss:0kB

I do have the diagnostic data, but I'm not sure how to analyse it (see attached). We don't have any swap space on the machine; is it recommended to add some?
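For context, a minimal sketch of the kind of live check I can run against the node with the Python driver (PyMongo); the host and port are placeholders, and the attached metrics files are in MongoDB's internal diagnostic format, which this does not cover:

# Sketch only: pull basic memory numbers from a running mongod.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder host/port
status = client.admin.command("serverStatus")

print("resident MB:", status["mem"]["resident"])
print("virtual MB: ", status["mem"]["virtual"])

cache = status["wiredTiger"]["cache"]
print("WT cache max bytes: ", cache["maximum bytes configured"])
print("WT cache used bytes:", cache["bytes currently in the cache"])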

This is a production machine and it's the first time we have encountered such a failure, so it's important for us to understand the root cause and what is needed to avoid such cases.

Thanks



 Comments   
Comment by Dmitry Agranat [ 13/Mar/18 ]

Hi roiey,

All good questions. Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion, please post on the mongodb-user group or Stack Overflow with the mongodb tag. Questions like this, which involve more discussion, are best posted there.

Thanks,
Dima

Comment by Roie Yossef [ 13/Mar/18 ]

Hi Dima,
Thanks for you answer. we running a replicaSet with 5 members. its the first time we encountered OOM in production. why this specific node failed and the other sastains healthy? is it possible that one heavy specific query handled by the machine and cause that? if yes , how can we recognize the problematic query? if not , how can i avoid the node from failing? swap will help or just slow down the mongo?

Regarding the disks, what do you consider poor IO performance, and what disks should we have? We use an AWS EBS volume, io1 type, with 5000 IOPS.

Comment by Dmitry Agranat [ 13/Mar/18 ]

Hi roiey,

After looking at the provided data, I do not see an indication of a bug. I believe that the reported issue is due to insufficient available memory (4GB) relative to your workload.

As the memory required for operations grows, the working set in the WT cache cannot be pushed back onto disk. Instead, the operating system begins swapping application memory to disk until it runs out of space and kills the process that is using the most memory, which, in this case, is MongoDB.

Based on the additional metrics observed (see below), it appears that this server is also suffering from poor IO performance and mostly reading the data from either the FS cache or disk.
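(For completeness, a rough way to see this on a live node is to compare how many page requests the WiredTiger cache served against how many pages had to be read into the cache from outside it. A sketch assuming PyMongo and a placeholder host; the statistic names are taken from serverStatus output.)

# Sketch only: rough indicator of reads that were not served from the WT cache.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder host/port
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

requested = cache["pages requested from the cache"]
read_in = cache["pages read into cache"]

print("pages requested from the cache:", requested)
print("pages read into cache:         ", read_in)
# A high read-in/requested ratio suggests many reads come from the FS cache or disk.
print("read-in ratio: %.4f" % (read_in / max(requested, 1)))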

All the above indicate that the server is underprovisioned to sustain your workload.

Thanks,
Dima
