[SERVER-35666] mongod process consumes all memory and then exits or is being killed by oom-killer Created: 19/Jun/18  Updated: 16/Nov/21  Resolved: 03/Aug/18

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 3.6.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sasa Skevin Assignee: Matthew Saltz (Inactive)
Resolution: Duplicate Votes: 1
Labels: SWKB
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Google Cloud Instance n1-highmem-4 (4 vCPUs, 26 GB memory) with Debian 9.4.


Attachments: PNG File mem_without_wt_cache.png     PNG File memstack1.png     PNG File memstack2.png     PNG File pinned_cursors.png     HTML File stacks.html     PNG File suspects.png    
Issue Links:
Duplicate
duplicates SERVER-36443 Long-running queries should not cause... Closed
Related
is related to SERVER-35444 Heap stacks should not be included in... Closed
is related to SERVER-42138 Memory used by ChunkManager objects c... Closed
Operating System: ALL
Steps To Reproduce:
  1. Initiate chunk moving
  2. Wait for several hours and mongod will either exit with out-of-memory error or be killed by oom-killer
Participants:

 Description   

A two shards MongoDB database regularly crashes with out-of-memory error or is being killed by the oom-killer. The system runs on GCE Debian 9.4 with MongoDB v3.6.5WiredTiger storage engine. The servers are n1-highmem-4 (4 vCPUs26 GB memory). On the server runs just mongod and there are no other services. mongos are on different servers.

Initially the servers were without swap as it is the practice on Google Cloud, but afterwards a small 1GB swap has been added, but the problem is the same.

We have tried to play with the cacheSizeGB parameter and reduced it to 10GB but still the crash happens.

Usually process exit/crash happens al least once a day.

Currently on the database there is a chunk moving process underway and crashes happens only on the shard instances that receive chunks. Sending chunks instances never crashed.

If mongod process is killed by oom-killer this can be seen in the syslogs:

Jun 15 14:45:17 server4 kernel: [1731430.432189] Out of memory: Kill process 13130 (mongod) score 980 or sacrifice child
Jun 15 14:45:17 server4 kernel: [1731430.441717] Killed process 13130 (mongod) total-vm:28280536kB, anon-rss:26174876kB, file-rss:0kB, shmem-rss:0kB

{{}}Sometimes mongod exits with leaving this in the mongod.log:

2018-06-15T02:14:32.456+0200 F - [rsSync] out of memory.0x55cbc8535751 0x55cbc8534d84 0x55cbc8623b4b 0x55cbc86c665c 0x55cbc70fccff 0x55cbc70f8b02 0x55cbc707b3f1 0x55cbc86449b0 0x7fbbf3507494 0x7fbbf3249acf
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"55CBC6305000","o":"2230751","s":"_ZN5mongo15printStackTraceERSo"},{"b":"55CBC6305000","o":"222FD84","s":"_ZN5mongo29reportOutOfMemoryErrorAndExitEv"},{"b":"55CBC6305000","o":"231EB4B"},{"b":"55CBC6305000","o":"23C165C","s":"_Znam"},{"b":"55CBC6305000","o":"DF7CFF","s":"_ZN5mongo4repl8SyncTail7OpQueueC1Ev"},{"b":"55CBC6305000","o":"DF3B02","s":"_ZN5mongo4repl8SyncTail16oplogApplicationEPNS0_22ReplicationCoordinatorE"},{"b":"55CBC6305000","o":"D763F1","s":"_ZN5mongo4repl10RSDataSync4_runEv"},{"b":"55CBC6305000","o":"233F9B0"},{"b":"7FBBF3500000","o":"7494"},{"b":"7FBBF3161000","o":"E8ACF","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.6.5", "gitVersion" : "a20ecd3e3a174162052ff99913bc2ca9a839d618", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.9.0-6-amd64", "version" : "#1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07)", "machine" : "x86_64" }, "somap" : [ { "b" : "55CBC6305000", "elfType" : 3, "buildId" : "7D4592BDFAA6C15459D2319DEAB7F10E9EB4E7D7" }, { "b" : "7FFC48D98000", "path" : "linux-vdso.so.1", "elfType" : 3, "buildId" : "A3207CC9FE1CAA3374AE7061AA5C3C5619B8A0E5" }, { "b" : "7FBBF4743000", "path" : "/lib/x86_64-linux-gnu/libresolv.so.2", "elfType" : 3, "buildId" : "713D47D5F599289C0A91ADE8F0122B2B4AA78B2E" }, { "b" : "7FBBF42B0000", "path" : "/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1", "elfType" : 3, "buildId" : "2CFE882A331D7857E9CE1B5DE3255E6DA76EF899" }, { "b" : "7FBBF4044000", "path" : "/usr/lib/x86_64-linux-gnu/libssl.so.1.1", "elfType" : 3, "buildId" : "E2AA3B39763D943F56B3BD05C8E36E639BA95E12" }, { "b" : "7FBBF3E40000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "B895F0831F623C5F23603401D4069F9F94C24761" }, { "b" : "7FBBF3C38000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "5D83E0642E645026DBB11F89F7DF7106BD821495" }, { "b" : "7FBBF3934000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1B95E3A8B8788B07E4F59EE69B1877F9DEB42033" }, { "b" : "7FBBF371D000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "51AD5FD294CD6C813BED40717347A53434B80B7A" }, { "b" : "7FBBF3500000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "4285CD3158DDE596765C747AE210AB6CBD258B22" }, { "b" : "7FBBF3161000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "AA889E26A70F98FA8D230D088F7CC5BF43573163" }, { "b" : "7FBBF495A000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "263F909DBE11A66F7C6233E3FF0521148D9F8370" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x55cbc8535751]
 mongod(_ZN5mongo29reportOutOfMemoryErrorAndExitEv+0x84) [0x55cbc8534d84]
 mongod(+0x231EB4B) [0x55cbc8623b4b]
 mongod(_Znam+0x21C) [0x55cbc86c665c]
 mongod(_ZN5mongo4repl8SyncTail7OpQueueC1Ev+0x7F) [0x55cbc70fccff]
 mongod(_ZN5mongo4repl8SyncTail16oplogApplicationEPNS0_22ReplicationCoordinatorE+0x402) [0x55cbc70f8b02]
 mongod(_ZN5mongo4repl10RSDataSync4_runEv+0x111) [0x55cbc707b3f1]
 mongod(+0x233F9B0) [0x55cbc86449b0]
 libpthread.so.0(+0x7494) [0x7fbbf3507494]
 libc.so.6(clone+0x3F) [0x7fbbf3249acf]
----- END BACKTRACE -----

I would suspect memory management is not working properly while receiving chunks.



 Comments   
Comment by Sasa Skevin [ 03/Aug/18 ]

Hi @Matthew Saltz,

I can confirm. There was a long running query (daily stats) that we weren't aware of. After we stopped this query, the database didn't crash in the last week. We can close this ticket.

Thank you for your help and support.

Comment by Matthew Saltz (Inactive) [ 24/Jul/18 ]

Whenever the ChunkManager is updated with a new chunk from a move, it creates a new one by copying the existing one and then adding the metadata for the new chunk. The old ChunkManager is only deleted when there are no longer any active cursors using that particular version of the ChunkManager. Furthermore, with the current algorithm, if the oldest ChunkManager has an open cursor, none of the ChunkManager objects will be cleaned up, even if there are newer expired ChunkManagers without any open cursors. Based on a rough calculation of the amount of memory the ChunkManager would take given the number of chunks you have on the recipient node and the key you're using (ObjectId), I can see that the amount of memory buildup you're having corresponds to the scenario where none of the 430 ChunkManager objects getting created are ever cleaned up. This would indicate that there's some long-running cursor attached to an early ChunkManager object prior to all of the moves that never gets cleaned up. Looking at the FTDC data from the node that crashed, I've confirmed that the memory growth begins to happen around the same time that the first pinned notimeout cursor shows up and the chunk moves begin (see attached).

ssasa from your application, do you see where you might be leaving cursors open indefinitely?


Comment by Sasa Skevin [ 25/Jun/18 ]

Hi Bruce,

Yes, a single collection is sharded and chunk moves are in one direction. Our setup is not fully by the book (not ideal) and is composed of two shards. First shard is on better and faster hardware (more CPU, memory and SSD disk) and contains all databases and collections (not sharded). The other shard is on slower hardware and contains just this one sharded collection. The sharded collection in place contains user usage logs and the idea behind the setup was to contain recent data (one to two months) in the fast shard, while all historical data (which is rarely used) in the "slow and cheaper" shard.

When we initiate chunk migration, i.e. change shard key range, then chunk moving is always from the faster shard to the slower. Never the opposite way. This answers the second question, no matter which mongod is restarted on the slow shard, this slow shard is always on the receiving side of the chunks. And yes, the number of chunks on the recipient shard is increasing.

Here is the shard distribution command output:

mongos> db.usage.getShardDistribution()
Shard rs0 at rs0/mmm1:27017,mmm2:27017
 data : 1508.62GiB docs : 2091330189 chunks : 37511
 estimated data per chunk : 41.18MiB
 estimated docs per chunk : 55752
Shard rs1 at rs1/mmm3:27017,mmm4:27017
 data : 8261.48GiB docs : 12443011945 chunks : 264714
 estimated data per chunk : 31.95MiB
 estimated docs per chunk : 47005
Totals
 data : 9770.1GiB docs : 14534342134 chunks : 302225
 Shard rs0 contains 15.44% data, 14.38% docs in cluster, avg obj size on shard : 774B
 Shard rs1 contains 84.55% data, 85.61% docs in cluster, avg obj size on shard : 712B

Thanks,

Sasa

 

Comment by Bruce Lucas (Inactive) [ 25/Jun/18 ]

Hi Sasa,

At the end of the diagnostic data file metrics.2018-06-23T22-56-35Z-00000, there were about 5 GB of memory allocated outside the WT cache. About half was allocated by stacks looking like this:

operator new[]
std::_Rb_tree_node
...
std::_Rb_tree_node
mongo::ChunkManager::makeUpdated
0x563f8c325173
0x563f8b97d0ad
0x563f8b97daaa
0x563f8c32da9d
mongo::ThreadPool::_doOneTask
mongo::ThreadPool::_consumeTasks
mongo::ThreadPool::_workerThreadBody
0x563f8cc676c0
0x7f106102e494
clone

and most of the rest by stacks like this:

operator new[]
std::_Rb_tree_node
...
std::_Rb_tree_node
mongo::MetadataManager::_setActiveMetadata
mongo::MetadataManager::refreshActiveMetadata
mongo::CollectionShardingState::refreshMetadata
mongo::ShardingState::_refreshMetadata
mongo::ShardingState::onStaleShardVersion
0x563f8b51cb77
mongo::ErrmsgCommandDeprecated::run
mongo::BasicCommand::enhancedRun
mongo::Command::publicRun
0x563f8b537f3b
0x563f8b53a22d
mongo::ServiceEntryPointMongod::handleRequest
mongo::ServiceStateMachine::_processMessage
mongo::ServiceStateMachine::_runNextInGuard
0x563f8b5465a1
mongo::transport::ServiceExecutorSynchronous::schedule
mongo::ServiceStateMachine::_scheduleNextWithGuard
mongo::ServiceStateMachine::_sourceCallback
mongo::ServiceStateMachine::_sourceMessage
mongo::ServiceStateMachine::_runNextInGuard
0x563f8b5465a1
0x563f8c453f65
0x563f8ca0f4c4
0x7f106102e494
clone

So it appears that the memory relates to data structures that are used by mongod to track chunk metadata. For debugging this it will be helpful to have some more information about your chunk distribution and the chunk moves that are taking place:

  • From the logs it appears that all of the chunk moves involve a single collection, and involve chunk moves from and to the same hosts, is this correct?
  • Have the chunk moves been completely in the same direction - in other words, on each restart of the mongod you continue moving chunks to that mongod, so the total number of chunks continues rising on the recipient shard, independent of the mongod restarts?
  • Can you tell us about the distribution of chunks across shards by posting the output of db.collection.getShardDistribution() for the collection in question (or upload it to the portal if you wish to keep it private).

Thanks,
Bruce

Comment by Sasa Skevin [ 24/Jun/18 ]

Hi Bruce,

pre-release tarball version used and after few hours killed by oom-killer. Diagnostics file diag-20180624.tar.gz uploaded. Thanks,

Sasa

Comment by Bruce Lucas (Inactive) [ 21/Jun/18 ]

Hi Sasa,

We have prepared a 3.6.6 (not 3.6.7 as I mistakenly typed above) pre-release tarball containing the fix for SERVER-35444 that you can download from here; click on the "binaries" link. Please install and run this with heap profiling enabled at your convenience.

Thanks,
Bruce

 

Comment by Sasa Skevin [ 21/Jun/18 ]

Thanks for the update. Yes, I would try 3.6.7 pre-release version. But can't do that before Sunday. If a crash happens in the mean time, will try to send the data.

Comment by Bruce Lucas (Inactive) [ 21/Jun/18 ]

Thanks for pointing that out Sasa, and apologies for missing it.

Unfortunately the profiling data collection wasn't successful due to SERVER-35444: if there are a lot of unique and/or large stack traces that need to be recorded by the profiling, it can overflow the 16 MB document size limit for the serverStatus document. The fix is not to record the stack traces per se in serverStatus, but record only the amount of memory used for each stack, as this is not needed because the stacks themselves are recorded in the log file.

We've fixed this for 4.0, but have not backported it to 3.6. We are working to do that and prepare a 3.6.7 pre-release for you to install for collecting heap profile data one one node, if you are willing to do that.

Meanwhile you might continue to run with the heap profiling enabled in your current version of 3.6. In this case the serverStatus document is only 16977838, just a little larger than 16 MB, so it is possible that due to random variation in the heap profiler samples, a future incident may not hit the 16 MB limit. Let me know if and when you another incident you would like me to look at.

Thanks for your help,
Bruce

 

Comment by Sasa Skevin [ 20/Jun/18 ]

The node is a secondary for the last couple of hours after it crashed. Before that all night long it was a primary and receiving chunks. Could you please look at this period?

Comment by Bruce Lucas (Inactive) [ 20/Jun/18 ]

Thanks Sasa. It looks like this node is not experiencing memory growth as it is currently a secondary. In the previous data in each case the node transitioned from secondary to primary about 6 hours after it was started, and at that point the memory growth began because the node starting processing chunk moves. This is seen in the graph above where the member state chart goes from yellow indicating secondary to green indicating primary. I didn't look into why the node transitioned from secondary to primary - I assumed either there was some action you took to cause it to happen, or perhaps the current primary that was processing chunk moves hit OOM causing a failover.

In any case please allow this node to continue collecting heap profiling data until it becomes primary, or if necessary force a failover by stepping down the current primary.

Comment by Sasa Skevin [ 20/Jun/18 ]

New diagnostic data with profiling option turned on is uploaded to secure portal.

Comment by Bruce Lucas (Inactive) [ 19/Jun/18 ]

Some numbers: in one instance there were 430 chunk moves before crashing, and this accumulated about 15 GB of excess memory, or about 35 MB per chunk move. The rate at which the excess memory increased ("allocated minus wt cache") seems to roughly correspond to the rate at which chunk moves are occurring (_recvChunkStart), suggesting this is memory accumulated per chunk move. There are several things happening at that rate that are possible suspects; the heap profiler information should tell us.

Comment by Bruce Lucas (Inactive) [ 19/Jun/18 ]

Hi Sasa,

Thanks for the data. It shows the node as you say repeatedly crashing after allocating 15 GB of memory aside from the WiredTiger cache. This allocated memory does indeed seem to be related to the chunk moves as the rate at which it increases appears to be roughly proportional to the number of chunk moves executed.

Unfortunately the normal data collection is not able to identify a specific cause. Would you be able to enable the built-in heap profiler on the primary of a node that is receiving chunks? To do this please restart mongod setting the heapProfilingEnabled parameter; for example, on the command line:

mongod --setParameter heapProfilingEnabled=true ...

Then, when the node encounters OOM, please upload

  • the contents of diagnostic.data
  • the complete mongod log file covering the time from the restart to the OOM.

Thanks,
Bruce

 

 

Comment by Sasa Skevin [ 19/Jun/18 ]

@Ramon Fernandez diagnostic data and last two logs uploaded.

Comment by Ramon Fernandez Marina [ 19/Jun/18 ]

ssasa, will you please upload the contents of the diagnostic.data directory for the affected node as well as the full logs? You can use this secure portal.

Thanks,
Ramón.

Generated at Thu Feb 08 04:40:34 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.