[SERVER-35666] mongod process consumes all memory and then exits or is killed by oom-killer
Created: 19/Jun/18 | Updated: 16/Nov/21 | Resolved: 03/Aug/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 3.6.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Sasa Skevin | Assignee: | Matthew Saltz (Inactive) |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | SWKB |
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Google Cloud Instance n1-highmem-4 (4 vCPUs, 26 GB memory) with Debian 9.4. |
| Operating System: | ALL |
| Description |
A two-shard MongoDB database regularly crashes with an out-of-memory error or is killed by the oom-killer. The system runs on GCE Debian 9.4 with MongoDB v3.6.5 and the WiredTiger storage engine. The servers are n1-highmem-4 (4 vCPUs, 26 GB memory); only mongod runs on them, there are no other services, and the mongos routers are on separate servers. Initially the servers had no swap, as is the practice on Google Cloud, but a small 1 GB swap was added later; the problem remained the same. We have tried tuning the cacheSizeGB parameter and reduced it to 10 GB, but the crash still happens, usually at least once a day. A chunk-moving process is currently underway on the database, and crashes happen only on the shard instances that receive chunks; the sending instances have never crashed. When the mongod process is killed by the oom-killer, this can be seen in the syslog:
Sometimes mongod instead exits, leaving this in the mongod.log:
I would suspect that memory management is not working properly while receiving chunks.
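For reference, the cache cap mentioned above is normally applied with the --wiredTigerCacheSizeGB startup flag or the equivalent mongod.conf setting; a minimal sketch (the 10 GB value comes from this report, the config file path is illustrative):

```
# Command-line form: cap the WiredTiger cache at 10 GB.
mongod --config /etc/mongod.conf --wiredTigerCacheSizeGB 10

# Equivalent mongod.conf (YAML) setting:
#   storage:
#     wiredTiger:
#       engineConfig:
#         cacheSizeGB: 10
```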
| Comments |
| Comment by Sasa Skevin [ 03/Aug/18 ] |
Hi @Matthew Saltz, I can confirm. There was a long-running query (daily stats) that we weren't aware of. After we stopped this query, the database has not crashed in the last week. We can close this ticket. Thank you for your help and support.
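For anyone tracking down a similar culprit, long-running operations can be listed from the mongo shell; a minimal sketch (the 600-second threshold is an arbitrary example):

```
// Show operations that have been active for more than 10 minutes.
db.currentOp({ "active": true, "secs_running": { "$gt": 600 } })

// A runaway operation can then be terminated by its opid, e.g.:
// db.killOp(12345)
```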
| Comment by Matthew Saltz (Inactive) [ 24/Jul/18 ] |
Whenever the ChunkManager is updated with a new chunk from a move, it creates a new one by copying the existing one and then adding the metadata for the new chunk. The old ChunkManager is only deleted when there are no longer any active cursors using that particular version of the ChunkManager. Furthermore, with the current algorithm, if the oldest ChunkManager has an open cursor, none of the ChunkManager objects will be cleaned up, even if there are newer expired ChunkManagers without any open cursors.
Based on a rough calculation of the amount of memory the ChunkManager would take, given the number of chunks you have on the recipient node and the key you're using (ObjectId), I can see that the amount of memory buildup you're seeing corresponds to the scenario where none of the 430 ChunkManager objects being created are ever cleaned up. This would indicate that there is some long-running cursor, attached to an early ChunkManager object created prior to all of the moves, that never gets cleaned up.
Looking at the FTDC data from the node that crashed, I've confirmed that the memory growth begins around the same time that the first pinned notimeout cursor shows up and the chunk moves begin (see attached). ssasa, from your application, do you see where you might be leaving cursors open indefinitely?
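One quick way to look for the pinned no-timeout cursors described above is the cursor section of serverStatus; a sketch for the mongo shell (field names as reported by 3.6):

```
// open.noTimeout counts cursors opened with noCursorTimeout,
// open.pinned counts cursors currently pinned by an operation.
var c = db.serverStatus().metrics.cursor;
printjson({
    open: c.open.total,
    noTimeout: c.open.noTimeout,
    pinned: c.open.pinned,
    timedOut: c.timedOut
});
```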
| Comment by Sasa Skevin [ 25/Jun/18 ] |
Hi Bruce, yes, a single collection is sharded and chunk moves are in one direction. Our setup is not fully by the book (not ideal) and is composed of two shards. The first shard runs on better and faster hardware (more CPU, more memory, SSD disks) and contains all databases and collections (not sharded). The other shard runs on slower hardware and contains just this one sharded collection. The sharded collection holds user usage logs, and the idea behind the setup was to keep recent data (one to two months) on the fast shard, while all historical data (which is rarely used) lives on the "slow and cheaper" shard. When we initiate chunk migration, i.e. change the shard key range, chunks always move from the faster shard to the slower one, never the opposite way. This also answers the second question: no matter which mongod is restarted on the slow shard, that shard is always on the receiving side of the chunks. And yes, the number of chunks on the recipient shard is increasing. Here is the shard distribution command output:
Thanks, Sasa
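For context, output like the one referenced above is usually produced with the getShardDistribution() shell helper; a sketch (the database and collection names are placeholders):

```
// Per-shard data size, document count and chunk count for a sharded collection.
db.getSiblingDB("mydb").usage_logs.getShardDistribution()

// Overall sharding status, including chunk counts per shard:
sh.status()
```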
| Comment by Bruce Lucas (Inactive) [ 25/Jun/18 ] |
Hi Sasa, At the end of the diagnostic data file metrics.2018-06-23T22-56-35Z-00000, there were about 5 GB of memory allocated outside the WT cache. About half was allocated by stacks looking like this:
and most of the rest by stacks like this:
So it appears that the memory relates to data structures that are used by mongod to track chunk metadata. For debugging this it will be helpful to have some more information about your chunk distribution and the chunk moves that are taking place:
Thanks,
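The chunk distribution and recent chunk moves asked about here can also be read directly from the config database; a sketch (run through mongos, the namespace is a placeholder):

```
// Chunk count per shard for one namespace (config.chunks is keyed by "ns" in 3.6).
db.getSiblingDB("config").chunks.aggregate([
    { $match: { ns: "mydb.usage_logs" } },
    { $group: { _id: "$shard", chunks: { $sum: 1 } } }
])

// Most recent chunk-move events recorded in the config changelog:
db.getSiblingDB("config").changelog.find({ what: /moveChunk/ }).sort({ time: -1 }).limit(10)
```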
| Comment by Sasa Skevin [ 24/Jun/18 ] |
Hi Bruce, the pre-release tarball version is now in use, and after a few hours the process was again killed by the oom-killer. The diagnostics file diag-20180624.tar.gz has been uploaded. Thanks, Sasa
| Comment by Bruce Lucas (Inactive) [ 21/Jun/18 ] |
Hi Sasa, we have prepared a 3.6.6 (not 3.6.7 as I mistakenly typed above) pre-release tarball containing that fix.
Thanks,
| Comment by Sasa Skevin [ 21/Jun/18 ] |
Thanks for the update. Yes, I will try the 3.6.7 pre-release version, but I can't do that before Sunday. If a crash happens in the meantime, I will try to send the data.
| Comment by Bruce Lucas (Inactive) [ 21/Jun/18 ] |
Thanks for pointing that out Sasa, and apologies for missing it. Unfortunately the profiling data collection wasn't successful, because the heap-profile data pushed the serverStatus document over the 16 MB limit. We've fixed this for 4.0, but have not backported it to 3.6. We are working to do that and prepare a 3.6.7 pre-release for you to install for collecting heap profile data on one node, if you are willing to do that.
Meanwhile you might continue to run with heap profiling enabled in your current version of 3.6. In this case the serverStatus document is only 16977838 bytes, just a little larger than 16 MB, so it is possible that, due to random variation in the heap profiler samples, a future incident may not hit the 16 MB limit. Let me know if and when you have another incident you would like me to look at.
Thanks for your help,
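The size being discussed can be checked directly from the mongo shell; a minimal sketch:

```
// BSON size, in bytes, of the serverStatus document (heap-profiler data
// is included in it when profiling is enabled).
Object.bsonsize(db.serverStatus())
```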
| Comment by Sasa Skevin [ 20/Jun/18 ] |
The node has been a secondary for the last couple of hours, since it crashed. Before that it was a primary all night long and was receiving chunks. Could you please look at that period?
| Comment by Bruce Lucas (Inactive) [ 20/Jun/18 ] |
Thanks Sasa. It looks like this node is not experiencing memory growth because it is currently a secondary. In the previous data, in each case the node transitioned from secondary to primary about 6 hours after it was started, and at that point the memory growth began because the node started processing chunk moves. This is seen in the graph above, where the member state chart goes from yellow (secondary) to green (primary). I didn't look into why the node transitioned from secondary to primary; I assumed either there was some action you took to cause it, or perhaps the current primary that was processing chunk moves hit OOM, causing a failover. In any case, please allow this node to continue collecting heap profiling data until it becomes primary, or if necessary force a failover by stepping down the current primary.
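If a manual failover is needed, as suggested above, the usual shell commands look like this (the 60-second step-down window is only an example):

```
// On the current primary of the recipient shard: step down for 60 seconds
// so that the secondary collecting heap-profile data can be elected.
rs.stepDown(60)

// Verify member states afterwards:
rs.status().members.forEach(function (m) { print(m.name + "  " + m.stateStr); })
```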
| Comment by Sasa Skevin [ 20/Jun/18 ] |
New diagnostic data with the profiling option turned on has been uploaded to the secure portal.
| Comment by Bruce Lucas (Inactive) [ 19/Jun/18 ] |
Some numbers: in one instance there were 430 chunk moves before crashing, and this accumulated about 15 GB of excess memory, or about 35 MB per chunk move. The rate at which the excess memory increased ("allocated minus wt cache") seems to roughly correspond to the rate at which chunk moves are occurring (_recvChunkStart), suggesting this is memory accumulated per chunk move. There are several things happening at that rate that are possible suspects; the heap profiler information should tell us.
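The "allocated minus wt cache" series mentioned above can be approximated from serverStatus; a sketch (assumes mongod is using the bundled tcmalloc allocator, as the official binaries do):

```
// Heap bytes currently allocated by mongod minus bytes held in the
// WiredTiger cache; steady growth here is the excess memory discussed above.
var s = db.serverStatus();
var allocated = s.tcmalloc.generic.current_allocated_bytes;
var wtCache = s.wiredTiger.cache["bytes currently in the cache"];
print("excess MB: " + ((allocated - wtCache) / 1024 / 1024).toFixed(1));
```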
| Comment by Bruce Lucas (Inactive) [ 19/Jun/18 ] |
Hi Sasa, thanks for the data. It shows the node, as you say, repeatedly crashing after allocating 15 GB of memory aside from the WiredTiger cache. This allocated memory does indeed seem to be related to the chunk moves, as the rate at which it increases appears to be roughly proportional to the number of chunk moves executed. Unfortunately the normal data collection is not able to identify a specific cause. Would you be able to enable the built-in heap profiler on the primary of the shard that is receiving chunks? To do this please restart mongod setting the heapProfilingEnabled parameter; for example, on the command line:
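A minimal sketch of such an invocation, assuming the parameter is simply added to the node's existing startup options (the config file path is illustrative):

```
# Restart mongod with the built-in heap profiler enabled.
mongod --config /etc/mongod.conf --setParameter heapProfilingEnabled=true

# Or in mongod.conf:
#   setParameter:
#     heapProfilingEnabled: true
```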
Then, when the node encounters OOM, please upload
Thanks,
| Comment by Sasa Skevin [ 19/Jun/18 ] |
@Ramon Fernandez, the diagnostic data and the last two logs have been uploaded.
| Comment by Ramon Fernandez Marina [ 19/Jun/18 ] |
ssasa, will you please upload the contents of the diagnostic.data directory for the affected node, as well as the full logs? You can use this secure portal. Thanks,
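For reference, diagnostic.data lives inside the mongod dbPath; a sketch of bundling it for upload (paths are illustrative, the Debian package defaults are shown):

```
# Bundle the FTDC metrics and the mongod log for upload.
tar czf mongod-diagnostics.tar.gz \
    /var/lib/mongodb/diagnostic.data \
    /var/log/mongodb/mongod.log
```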