[SERVER-32652] mongod killed by OOM while one secondary lagging Created: 11/Jan/18  Updated: 27/Oct/23  Resolved: 26/Jan/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Michael Smith Assignee: Mark Agarunov
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 16.04


Attachments: PNG File ftdc-mem.png    
Issue Links:
Related
related to SERVER-32398 Primary freezes during background ind... Closed
Operating System: ALL
Participants:

 Description   

mongod 3.4.10 on Ubuntu 16.04 in a replica set with 3 nodes. The primary and a secondary consumed all RAM and were killed by the kernel OOM killer within a couple of minutes of each other.

It's on the default setting for storage.wiredTiger.engineConfig.cacheSizeGB, so I would expect it to use around 50% of RAM.
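
(Editorial sketch, not part of the original report: for 3.4 the documented default cache ceiling is roughly 50% of (RAM minus 1 GB). One way to confirm what that works out to on a given node, and how full the cache actually is, is to read the WiredTiger cache stats from serverStatus; the commands below assume the legacy mongo shell, and the exact field names may vary slightly by version.)

# Hypothetical check: compare the configured cache ceiling with actual usage.
mongo --quiet --eval '
  var c = db.serverStatus().wiredTiger.cache;
  print("configured max bytes: " + c["maximum bytes configured"]);
  print("bytes in cache:       " + c["bytes currently in the cache"]);
'
# To cap the cache explicitly instead of relying on the default, e.g. on a 16 GB host:
#   mongod --wiredTigerCacheSizeGB 7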

At the time I think there was some heavy insert activity.

Possibly related, the other secondary had started lagging about 15-20 minutes earlier. That node is in Azure and tends to lag under load because of SERVER-31215 / WT-3461.

Primary detecting lag:

Jan 10 23:19:50 primary monit[12013]: 'mongo_replcheck' '/usr/local/bin/mongo_replcheck.sh' failed with exit status (1) -- azureslave:27017 lag of 445 sec exceeds threshold 300
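
(The contents of /usr/local/bin/mongo_replcheck.sh are not included in this ticket; a minimal sketch of such a lag check, assuming the legacy mongo shell, a hard-coded secondary name, and a 300-second threshold, might look like this.)

#!/bin/bash
# Hypothetical lag check in the spirit of mongo_replcheck.sh: compare the
# primary's optime with the named secondary's optime from rs.status().
HOST="azureslave:27017"
THRESHOLD=300
LAG=$(mongo --quiet --eval '
  var s = rs.status();
  var pri = s.members.filter(function (x) { return x.stateStr === "PRIMARY"; })[0];
  var sec = s.members.filter(function (x) { return x.name === "'"$HOST"'"; })[0];
  print(Math.round((pri.optimeDate - sec.optimeDate) / 1000));
')
if [ "$LAG" -gt "$THRESHOLD" ]; then
  echo "$HOST lag of $LAG sec exceeds threshold $THRESHOLD"
  exit 1
fi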

Primary running out of memory:

Jan 10 23:31:01 primary kernel: [2494318.535214] Out of memory: Kill process 1100 (mongod) score 954 or sacrifice child
Jan 10 23:31:01 primary kernel: [2494318.548316] Killed process 1100 (mongod) total-vm:17422016kB, anon-rss:15654812kB, file-rss:0kB

Secondary running out of memory:

Jan 10 23:33:46 secondary kernel: [2496035.849027] Out of memory: Kill process 26160 (mongod) score 955 or sacrifice child
Jan 10 23:33:46 secondary kernel: [2496035.862134] Killed process 26160 (mongod) total-vm:17415872kB, anon-rss:15675724kB, file-rss:0kB

Total memory on the primary and secondary is 16431148 kB; the Azure node has 14350764 kB. The anon-rss at kill time (15654812 kB on the primary) is therefore about 95% of total memory, well above the ~50% I'd expect the WiredTiger cache to use.

I can provide the FTDC logs privately if you are interested.



 Comments   
Comment by Mark Agarunov [ 26/Jan/18 ]

Hello mzs,

Thank you for the additional information. As there have been numerous fixes in newer versions of MongoDB, upgrading was probably the best course of action. I'll close this ticket for now; we can reopen it if you encounter the issue again in the future.

Thanks
Mark

Comment by Michael Smith [ 26/Jan/18 ]

Hi Mark,

I'm upgrading to MongoDB 3.6.2. Since so many variables will have changed, we could close this issue and if the problem recurs I can open a new one with heap profiling output.

Best,
Mike

Comment by Mark Agarunov [ 22/Jan/18 ]

Hello mzs,

Thank you for the update. While we have observed a 10-30% performance impact on typical CPU-bound workloads with heap profiling enabled, the diagnostic data you previously provided doesn't indicate that CPU is a bottleneck, so I believe the impact should be minimal.

Thanks
Mark

Comment by Michael Smith [ 19/Jan/18 ]

Hi Mark,

I did some reading and I think heap profiling is safe to leave running in production so I'll turn it on and get back to you if the memory condition recurs.

Best,
Mike

Comment by Mark Agarunov [ 18/Jan/18 ]

Hello mzs,

Thank you for providing this information. Unfortunately, while the diagnostic data and logs show a clear increase in memory usage leading up to the OOM kill, they don't indicate what is driving it. To get a better idea of the root cause, I'd like to request diagnostic data collected with heap profiling enabled. This can be done by starting mongod with the heapProfilingEnabled parameter set to true:

 
mongod --setParameter heapProfilingEnabled=true

This should let us determine which components are using memory at a much more granular level.
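
(Editorial note, not part of the original comment: the parameter can equivalently be set under setParameter in the mongod config file. Once enabled, the profiler's output is expected to appear in the heapProfile section of serverStatus, which FTDC then samples; the exact field layout may differ by version, so treat the command below as a hedged sanity check rather than a guaranteed interface.)

# Hypothetical quick check that heap profiling is active on a running node.
mongo --quiet --eval 'printjson(db.serverStatus().heapProfile)'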

Thanks,
Mark

Comment by Michael Smith [ 11/Jan/18 ]

Thanks Kelsey! I've uploaded the files. The OOMs occurred at 23:31 and 23:33 UTC.

Comment by Kelsey Schubert [ 11/Jan/18 ]

Hi mzs,

Thank you for reporting this issue. I've created a secure portal for you to provide the diagnostic.data and logs from both nodes.

Thanks again,
Kelsey
