[SERVER-32652] mongod killed by OOM while one secondary lagging Created: 11/Jan/18 Updated: 27/Oct/23 Resolved: 26/Jan/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.4.10 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Michael Smith | Assignee: | Mark Agarunov |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Ubuntu 16.04 |
| Operating System: | ALL |
| Description |
|
mongod 3.4.10 on Ubuntu 16.04 in a replica set with 3 nodes. The primary and a secondary consumed all RAM and were killed by the kernel OOM killer within a couple of minutes of each other. It's on the default setting for storage.wiredTiger.engineConfig.cacheSizeGB, so I would expect it to use around 50% of RAM. At the time I think there was some heavy insert activity. Possibly related, the other secondary had started lagging about 15-20 minutes earlier. That node is in Azure and tends to lag under load.
Primary detecting lag: (graph)
Primary running out of memory: (graph)
Secondary running out of memory: (graph)
Total memory on the primary and secondary is 16431148 KB; the Azure node has 14350764 KB. I can provide the FTDC logs privately if you are interested. |
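For reference, this is roughly what an explicit cache cap would look like in mongod.conf; the value 7 is only illustrative for a 16 GB host, and this deployment leaves cacheSizeGB unset, so WiredTiger defaults to about 50% of (RAM - 1 GB):

```
# Illustrative sketch only -- not taken from this deployment's config.
# With cacheSizeGB unset, WiredTiger uses max(0.5 * (RAM - 1 GB), 256 MB).
# Note this caps only the WiredTiger cache; mongod allocates additional
# memory outside it (connections, in-memory sorts, replication buffers),
# so total resident memory can exceed this figure.
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 7
```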
| Comments |
| Comment by Mark Agarunov [ 26/Jan/18 ] | ||
|
Hello mzs, Thank you for the additional information. As there have been numerous fixes in newer versions of MongoDB, upgrading is likely the best course of action. I'll close this ticket for now and we can reopen it if you encounter the issue in the future. Thanks | ||
| Comment by Michael Smith [ 26/Jan/18 ] | ||
|
Hi Mark, I'm upgrading to MongoDB 3.6.2. Since so many variables will have changed, we could close this issue and if the problem recurs I can open a new one with heap profiling output. Best, | ||
| Comment by Mark Agarunov [ 22/Jan/18 ] | ||
|
Hello mzs, Thank you for the update. While we have observed a 10-30% performance impact on typical CPU-bound workloads with heap profiling enabled, the diagnostic data you previously provided doesn't seem to indicate that the CPU is a bottleneck, so I believe the impact should be minimal. Thanks | ||
| Comment by Michael Smith [ 19/Jan/18 ] | ||
|
Hi Mark, I did some reading and I think heap profiling is safe to leave running in production. Best, | ||
| Comment by Mark Agarunov [ 18/Jan/18 ] | ||
|
Hello mzs, Thank you for providing this information. Unfortunately, while the diagnostic data and logs show a clear increase in memory usage leading to the OOM, there does not appear to be any indication of the cause. To get a better idea of the root cause, I'd like to request the diagnostic data with heap profiling enabled. This can be accomplished by starting mongod with the heapProfilingEnabled=true parameter:
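For example (a sketch; `<existing mongod options>` is a placeholder for the node's usual startup options):

```
# Set the parameter at startup on the command line:
mongod --setParameter heapProfilingEnabled=true <existing mongod options>

# Or equivalently in the mongod YAML config file:
setParameter:
  heapProfilingEnabled: true
```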
This should let us determine which components are using memory at a much more granular level. Thanks, | ||
| Comment by Michael Smith [ 11/Jan/18 ] | ||
|
Thanks Kelsey! I've uploaded the files. The OOMs occurred at 23:31 and 23:33 UTC. | ||
| Comment by Kelsey Schubert [ 11/Jan/18 ] | ||
|
Hi mzs, Thank you for reporting this issue. I've created a secure portal for you to provide the diagnostic.data and logs from both nodes. Thanks again, |