[SERVER-38170] mongod stopped due to out of memory Created: 16/Nov/18  Updated: 03/Dec/18  Resolved: 29/Nov/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.0.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Laxman P Assignee: Danny Hatcher (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

Our application has been using a MongoDB database for several days without any memory-related issues. We use a sharded cluster with 2 shards, each backed by a replica set with two data-bearing members plus an arbiter.

Mongo Version 4.0.2

On Nov 15th at 4:26 am UTC, mongod crashed due to an out-of-memory condition on one of the primaries (VM is *02b).

We would like to understand the root cause of the issue. We are using WiredTiger with a cache size of 2 GB.
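
For reference, this is the standard storage.wiredTiger.engineConfig.cacheSizeGB setting; the limit the server actually applied can be confirmed from the mongo shell:

// Returns the configured WiredTiger cache limit in bytes (2 GB => 2147483648)
db.serverStatus().wiredTiger.cache["maximum bytes configured"]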

Here is the sequence of events:

1. The primary was 02b, and it was the first to fail due to out of memory (the server has 3.75 GB of RAM and mongod is configured with a WiredTiger cache of 2 GB, so some other operation must have been consuming a lot of memory).
2. The application started its purge process at 4 am, and it failed with the following error: Unexpected error in purge process:

{ "serverUsed" : "127.0.0.1:27017" , "ok" : 1 , "n" : 148927 , "err" : "Write results unavailable from hitron-db-02b:27018 :: caused by :: Connection was closed" , "code" : 6}

The last exception was thrown while deviceLog was being purged; the cause is that the mongod process had stopped.
3. When 02b went down, 02a took over as PRIMARY at around 4:27. But it then failed due to out of memory, same as 02b.
4. At this point, 02a restarted several times, always failing with out of memory.
5. At around 7:00, I noticed the problem and decreased the WiredTiger cache to 1 GB on both 02a and 02b to see if there was enough memory to start up (a sketch of the cache-size change follows this list). 02a still restarted several times but, miraculously, it started successfully at 7:55 (after using a lot of memory, it freed the memory and is running now with 1 GB).
6. Service was restored and worked well until 9:36; from that time, the disk started getting slower. I'm not sure if this was due to the IOPS limit (Amazon allows a certain burst, but if you run out of credits it throttles your IOPS). The provisioned IOPS for that volume is 750, and we are using around 820-830 IOPS right now.
7. I started hitron-db-02b at around 10:00 and, same as the other one, it started using a lot of memory but survived… after a while it freed some memory and was using 100% CPU in the "evict" threads (this I figured out using "top -H" and pstack). The state of mongod was ROLLBACK, and it did not start replication, I guessed because it was still rolling back.
8. At 11:27 the rollback finished, but with some out-of-memory errors… and mongod aborted.
9. At 11:31 I restarted mongod on 02b, but it failed again due to OOM.
10. At 11:40, I temporarily changed the instance type to c5.xlarge to increase capacity and see if it would start correctly.
11. At 12:11 I stopped both WiFi servers because the disk on 02a was almost full (it waits for the replica to be up to date and does not allow dropping any collection).
12. At 12:20, 02b failed again with OOM, so I changed the instance type to c5.2xlarge and tried again.
13. One curious thing: while the WiFi servers were down, and 02b was down, 02a still used a lot of IOPS and CPU, all from one thread: "WT Reco.plog.rs". This is something we should clarify with the MongoDB team.
14. After using up to 11 GB of memory, 02b started to replicate from 02a.
15. At around 13:00, 02a became unresponsive; I don't know why. I cleanly shut down 02b, and then shut down 02a since it was starting to use more and more memory. It completed the shutdown at 13:08.
16. 02a started at 13:10 and began recovery due to the unclean shutdown.
17. I managed to start 02a, and then 02b, with replication up to date, and then I shut down 02b to change the instance type. At this point, 02a tried to use a large amount of memory, crashed with OOM, and started recovery AGAIN. Both 02a and 02b are running recovery.
18. 02b recovered and took over as primary.
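
Referenced from step 5: a sketch of how a WiredTiger cache reduction can be applied. The persistent setting is storage.wiredTiger.engineConfig.cacheSizeGB in the config file; on a node that is already running, the limit can also be changed from the mongo shell without a restart:

// Lowers the WiredTiger cache limit to 1 GB at runtime (not persisted across restarts)
db.adminCommand({ setParameter: 1, wiredTigerEngineRuntimeConfig: "cache_size=1G" })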

Some snippets from the 02b mongod log file:

2018-11-15T04:10:03.136+0000 I CONTROL  [LogicalSessionCacheRefresh] Failed to refresh session cache: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true } at rs1
...
2018-11-15T04:26:31.478+0000 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 21, after connections: 21, after dur: 21, after extra_info: 102, after freeMonitoring: 251, after globalLock: 251, after locks: 251, after logicalSessionRecordCache: 251, after network: 251, after opLatencies: 261, after opcounters: 261, after opcountersRepl: 261, after repl: 553, after security: 553, after shardingStatistics: 563, after storageEngine: 563, after tcmalloc: 901, after transactions: 901, after transportSecurity: 901, after wiredTiger: 1358, at end: 1376 }
..
2018-11-15T04:26:42.467+0000 I COMMAND  [conn53] command admin.$cmd command: serverStatus { serverStatus: 1, $clusterTime: { clusterTime: Timestamp(1542256000, 1), signature: { hash: BinData(0, 33AC32E9BDEDD35FED6A1FC98C944CE2FD449627), keyId: 6569071982954938369 } }, $configServerState: { opTime: { ts: Timestamp(1542256000, 1), t: 13 } }, $db: "admin" } numYields:0 reslen:32596 locks:{ Global: { acquireCount: { r: 2 } } } protocol:op_msg 1958ms
2018-11-15T04:26:43.776+0000 E STORAGE  [conn33133] WiredTiger error (12) [1542256003:693212][2040:0x7f23f7c88700], file:hitron/collection-14--7480144262999383718.wt, WT_SESSION.commit_transaction: __realloc_func, 133: memory allocation of 401408 bytes failed: Cannot allocate memory Raw: [1542256003:693212][2040:0x7f23f7c88700], file:hitron/collection-14--7480144262999383718.wt, WT_SESSION.commit_transaction: __realloc_func, 133: memory allocation of 401408 bytes failed: Cannot allocate memory
 
2018-11-15T04:26:44.346+0000 E STORAGE  [conn33146] WiredTiger error (12) [1542256004:338744][2040:0x7f23f7a86700], file:hitron/collection-14--7480144262999383718.wt, WT_SESSION.commit_transaction: __wt_malloc, 88: memory allocation of 340292 bytes failed: Cannot allocate memory Raw: [1542256004:338744][2040:0x7f23f7a86700], file:hitron/collection-14--7480144262999383718.wt, WT_SESSION.commit_transaction: __wt_malloc, 88: memory allocation of 340292 bytes failed: Cannot allocate memory
2018-11-15T04:26:44.346+0000 E STORAGE  [thread32312] WiredTiger error (-31804) [1542256004:338606][2040:0x7f24059b9700], eviction-server: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic Raw: [1542256004:338606][2040:0x7f24059b9700], eviction-server: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic



 Comments   
Comment by Danny Hatcher (Inactive) [ 29/Nov/18 ]

Hello,

For a record to be deleted it has to be pulled into cache, and it cannot be evicted until the change is persisted to disk. If you are deleting more records, more data has to be pulled into cache and evicted when the process ends, which can produce a waterfall effect if your server was already under the strain of regular load. If you continue to experience issues and are unable to improve your hardware, you may wish to restrict how many deletes happen at any one time.
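
For illustration only, a purge can be throttled into bounded batches so that only a limited number of documents sits in cache at any time. This sketch uses the deviceLog collection mentioned above, but the createdAt field, batch size, and retention window are placeholders, not taken from your schema:

// Hypothetical batched purge: remove at most 1000 old documents per pass,
// pausing between passes so eviction and replication can keep up.
var cutoff = new Date(Date.now() - 30 * 24 * 3600 * 1000);  // example 30-day retention
var deleted;
do {
    var ids = db.deviceLog.find({ createdAt: { $lt: cutoff } }, { _id: 1 })
                          .limit(1000)
                          .toArray()
                          .map(function (d) { return d._id; });
    deleted = ids.length ? db.deviceLog.deleteMany({ _id: { $in: ids } }).deletedCount : 0;
    sleep(500);  // mongo shell built-in, in milliseconds
} while (deleted > 0);

Deleting in small _id batches keeps each write modest and gives eviction and the secondaries a chance to catch up between passes.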

From your responses, I do not see anything to indicate a bug in the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-user group.

See also our Technical Support page for additional support resources.

Thank you,

Danny

Comment by Laxman P [ 29/Nov/18 ]

Hi all - Is there any update?

Comment by Laxman P [ 22/Nov/18 ]

The purge process runs in a single thread where we remove old entries from several collections. The purge process runs every day, and maybe on this occasion there were more records than usual (perhaps on the order of < 100k entries).
Does removing that many records entail a lot of memory?
Also, when I restarted the mongod process without the application running, mongod was using a lot of CPU, specifically in the evict threads. How can that be explained?
In the end, we started 02a with an initial sync, and after several hours 02a became primary automatically and 02b crashed again. After this point, we started the application and everything came back to normal.

Comment by Danny Hatcher (Inactive) [ 19/Nov/18 ]

Hello,

From looking through the available diagnostic data, your replica set is simply overwhelmed by the load the application is sending to it. You mentioned that you were running a "purge process" when your problems started happening. Could you go into some detail as to what that entails? If your application touches data that is not normally used, it has to draw that data into cache. That is an expensive operation and can cause the cache to fill up. I can see that the cache was struggling to evict old data in a timely manner and was forced to start using application threads to help. Disk utilization and wait times also became very high. All of these are signs that the nodes are under load that they cannot support.
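
As a sketch, the cache-pressure symptoms described above can be read out of serverStatus from the mongo shell (these statistic names exist in 4.0; other eviction counters in the same document vary by version):

// Compare current and dirty cache usage against the configured limit.
var cache = db.serverStatus().wiredTiger.cache;
cache["maximum bytes configured"];          // the configured cache limit
cache["bytes currently in the cache"];      // values pinned near the limit indicate cache pressure
cache["tracked dirty bytes in the cache"];  // dirty data that must be written out before pages can be evicted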

From the application side, you can try increasing your write concern. While higher write concerns provide many robustness benefits, they can also cause queries to take longer to return to the application. In your case, this may also be a benefit because it provides a way for the system to self-"throttle" the traffic and may help prevent your nodes from becoming overloaded.
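
A minimal sketch of what a stronger write concern looks like on a purge delete in the mongo shell (the filter and date are placeholders):

db.deviceLog.deleteMany(
    { createdAt: { $lt: new Date("2018-10-15") } },       // hypothetical purge filter
    { writeConcern: { w: "majority", wtimeout: 15000 } }   // wait for majority acknowledgement, 15 s timeout
);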

If increasing the write concern is not sufficient to alleviate the problem, the best solution likely lies in allocating better hardware to the servers. I can see that CPU, disk, and RAM are all overloaded during the problem period; upgrading to a better machine will ensure your servers can handle the load.

Thank you,

Danny

Comment by Laxman P [ 16/Nov/18 ]

Hi Bruce - Data for the secondary has also been uploaded now. CPU has remained high on the primary for several hours. Can you provide some input on what "WT Reco.plog.rs" means and what the evict threads are meant to do (so we can try to figure out if there's a way to mitigate this from the application point of view)?

Comment by Bruce Lucas (Inactive) [ 16/Nov/18 ]

laxmanpv, can you please upload the diagnostic.data for the secondaries as well? The primary is being impacted by replication lag on the secondary and we'd like to see the cause of this.

Comment by Laxman P [ 16/Nov/18 ]

Another data point: currently on 02b, which is the only node in the replica set running, top -H shows high CPU on "WT Reco.plog.rs" for several hours, and the evict threads are also active.
We saw similar behavior yesterday.

% top -H
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8412 mongod    20   0 5288m 2.4g    0 R 89.6 66.4 307:57.40 WT Reco.plog.rs
 8405 mongod    20   0 5288m 2.4g    0 R 21.4 66.4 115:55.81 mongod
 8406 mongod    20   0 5288m 2.4g    0 R 21.4 66.4 115:02.72 mongod
 8408 mongod    20   0 5288m 2.4g    0 R 19.5 66.4 113:30.27 mongod
 8407 mongod    20   0 5288m 2.4g    0 R 15.6 66.4 112:35.47 mongod

pstack snippet

Thread 117 (Thread 0x7f3b06ebd700 (LWP 8412)):
#0  0x0000563fa821d792 in mongo::LockStats<long>::reset() ()
#1  0x0000563fa821bb5b in mongo::LockerImpl<false>::LockerImpl() ()
#2  0x0000563fa716b729 in mongo::ServiceContextMongoD::_newOpCtx(mongo::Client*, unsigned int) ()
#3  0x0000563fa89def11 in mongo::ServiceContext::makeOperationContext(mongo::Client*) ()
#4  0x0000563fa89d94b7 in mongo::Client::makeOperationContext() ()
#5  0x0000563fa7185dbb in ?? ()
#6  0x0000563fa7186570 in ?? ()
#7  0x0000563fa84e9b31 in mongo::BackgroundJob::jobBody() ()
#8  0x0000563fa8b9c9f0 in ?? ()
#9  0x00007f3b1274bde5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f3b1247830d in clone () from /lib64/libc.so.6
....
Thread 113 (Thread 0x7f3b0a6c4700 (LWP 8405)):
#0  0x00007f3b1274fcb2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000563fa71fea79 in __wt_cond_wait_signal ()
#2  0x0000563fa71e60e4 in ?? ()
#3  0x0000563fa71e8d87 in __wt_evict_thread_run ()
#4  0x0000563fa723f909 in ?? ()
#5  0x00007f3b1274bde5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f3b1247830d in clone () from /lib64/libc.so.6
Thread 112 (Thread 0x7f3b09ec3700 (LWP 8406)):
#0  0x00007f3b1274fcb2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000563fa71fea79 in __wt_cond_wait_signal ()
#2  0x0000563fa71ebfde in __wt_evict_thread_run ()
#3  0x0000563fa723f909 in ?? ()
#4  0x00007f3b1274bde5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f3b1247830d in clone () from /lib64/libc.so.6
Thread 111 (Thread 0x7f3b096c2700 (LWP 8407)):
#0  0x00007f3b1274fcb2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000563fa71fea79 in __wt_cond_wait_signal ()
#2  0x0000563fa71e60e4 in ?? ()
#3  0x0000563fa71e8d87 in __wt_evict_thread_run ()
#4  0x0000563fa723f909 in ?? ()
#5  0x00007f3b1274bde5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f3b1247830d in clone () from /lib64/libc.so.6
Thread 110 (Thread 0x7f3b08ec1700 (LWP 8408)):
#0  0x00007f3b1274fcb2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000563fa71fea79 in __wt_cond_wait_signal ()
#2  0x0000563fa71e60e4 in ?? ()
#3  0x0000563fa71e8d87 in __wt_evict_thread_run ()
#4  0x0000563fa723f909 in ?? ()
#5  0x00007f3b1274bde5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f3b1247830d in clone () from /lib64/libc.so.6
...

Comment by Laxman P [ 16/Nov/18 ]

Hi Kelsey - I have uploaded the data as you requested. This is from the mongod instance "02b".

Comment by Kelsey Schubert [ 16/Nov/18 ]

Hi laxmanpv,

Thanks for your report. Would you please upload an archive of the diagnostic.data directory and logs for the nodes in the affected shard's replica set? For your convenience, I've created a secure upload portal that you can use to provide these files. This information will allow us to continue our investigation.

Thanks,
Kelsey
