- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Fix Version/s: None
- Affects Version/s: 4.0.2
- Component/s: None
- Labels: None
- Operating System: ALL
Our application had been using MongoDB for several days without any memory-related issues. We run a sharded cluster with 2 shards, each of them a replica set with two data-bearing members plus an arbiter.
MongoDB version: 4.0.2
On Nov 15th at 4:26 am UTC, mongod crashed due to an out-of-memory condition on one of the primaries (VM is *02b).
We would like to understand the root cause of the issue. We are using WiredTiger with a cacheSize of 2 GB.
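For reference, that limit is set via storage.wiredTiger.engineConfig.cacheSizeGB; a minimal mongod.conf sketch (our real configuration has more settings than shown):

storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 2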
Here is the sequence of events:
1. The primary was 02b, and it was the first to fail due to out of memory (the server has 3.75 GB of RAM and mongod is configured with a 2 GB WiredTiger cache, so some other operation must have been taking a lot of memory).
2. The application started its purge process at 4 am, and it failed with the following error: Unexpected error in purge process:
The last exception was thrown while deviceLog was being purged; the cause was that the mongod process had stopped.
3. When 02b went down, 02a took over as PRIMARY at around 4:27, but it then failed due to out of memory, same as 02b.
4. At this point, 02a restarted several times, always failing with out of memory.
5. At around 7:00, I noticed the problem and decreased the WiredTiger cache to 1 GB on both 02a and 02b to see whether there was enough memory to start up (see the cache-lowering sketch after this list). 02a still restarted several times but, miraculously, it started successfully at 7:55 (after using a lot of memory, it freed it and is now running with 1 GB).
6. Service was restored and worked well until 9:36; from that point, the disk started getting slower. I'm not sure whether this was due to the IOPS limit (Amazon allows a certain burst, but once the credits run out it throttles your IOPS). The limit for that volume is 750 IOPS, and we are doing around 820-830 IOPS right now (see the diagnostics sketch after this list).
7. I started hitron-db-02b at around 10:00, and same as the other one, it used a lot of memory but survived… after a while it freed some memory and was burning 100% CPU in the "evict" threads (this I figured out using "top -H" and pstack; see the diagnostics sketch after this list). The state of mongod was ROLLBACK, and it did not start replication, I guessed because it was still rolling back.
8. At 11:27 the rollback finished, but with some out-of-memory errors… and mongod aborted.
9. At 11:31 I restarted mongod on 02b, but it failed again due to OOM.
10. At 11:40, I temporarily changed the instance type to c5.xlarge to increase capacity and see whether it would start correctly.
11. At 12:11 I stopped both WiFi servers because the disk on 02a was almost full (it waits for the replica to be up to date and does not allow dropping any collection).
12. At 12:20 02b failed again with OOM, so I changed the instance type to c5.2xlarge and tried again.
13. One curious thing: while the WiFi servers were down, and 02b was down, 02a still used a lot of IOPS and CPU, all from one thread: "WT Reco.plog.rs". This is something we should clarify with the MongoDB team.
14. After using up to 11 GB of memory, 02b started to replicate from 02a.
15. At around 13:00, 02a became unresponsive; I don't know why. I cleanly shut down 02b, and then shut down 02a since it was starting to use more and more memory. It completed shutdown at 13:08.
16. 02a started at 13:10 and began recovery due to the unclean shutdown.
17. I managed to start 02a, and then 02b, with replication up to date, and then I shut down 02b to change its instance type. At this point, 02a tried to use a large amount of memory, crashed with OOM, and started recovery AGAIN. Both 02a and 02b were then running recovery.
18. db02b recovered and took over as primary.
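Regarding step 5: a sketch of the two ways to lower the WiredTiger cache to 1 GB. Since our nodes were crash-looping we changed the config file before restarting; the runtime command is only an option on a node that is already up.

# mongod.conf - takes effect on restart
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 1

# or, from the mongo shell on a running member:
db.adminCommand({ setParameter: 1, wiredTigerEngineRuntimeConfig: "cache_size=1GB" })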
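Regarding steps 6-7: the per-thread CPU and stack information came from standard Linux tools, roughly as below (exact invocations approximate, from memory). The iostat line is just one way to read per-device IOPS from inside the VM.

# per-thread CPU view; the WiredTiger "evict" threads were pegged near 100%
top -H -p $(pidof mongod)

# dump stacks of all mongod threads to see what the hot ones were doing
pstack $(pidof mongod)

# per-device read/write rates (r/s + w/s approximate IOPS), to compare against the 750 IOPS volume limit
iostat -x 5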
Some snippets from the 02b mongod log file:
2018-11-15T04:10:03.136+0000 I CONTROL [LogicalSessionCacheRefresh] Failed to refresh session cache: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true } at rs1
...
2018-11-15T04:26:31.478+0000 I COMMAND [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 21, after connections: 21, after dur: 21, after extra_info: 102, after freeMonitoring: 251, after globalLock: 251, after locks: 251, after logicalSessionRecordCache: 251, after network: 251, after opLatencies: 261, after opcounters: 261, after opcountersRepl: 261, after repl: 553, after security: 553, after shardingStatistics: 563, after storageEngine: 563, after tcmalloc: 901, after transactions: 901, after transportSecurity: 901, after wiredTiger: 1358, at end: 1376 }
...
2018-11-15T04:26:42.467+0000 I COMMAND [conn53] command admin.$cmd command: serverStatus { serverStatus: 1, $clusterTime: { clusterTime: Timestamp(1542256000, 1), signature: { hash: BinData(0, 33AC32E9BDEDD35FED6A1FC98C944CE2FD449627), keyId: 6569071982954938369 } }, $configServerState: { opTime: { ts: Timestamp(1542256000, 1), t: 13 } }, $db: "admin" } numYields:0 reslen:32596 locks:{ Global: { acquireCount: { r: 2 } } } protocol:op_msg 1958ms
2018-11-15T04:26:43.776+0000 E STORAGE [conn33133] WiredTiger error (12) [1542256003:693212][2040:0x7f23f7c88700], file:hitron/collection-14--7480144262999383718.wt, WT_SESSION.commit_transaction: __realloc_func, 133: memory allocation of 401408 bytes failed: Cannot allocate memory Raw: [1542256003:693212][2040:0x7f23f7c88700], file:hitron/collection-14--7480144262999383718.wt, WT_SESSION.commit_transaction: __realloc_func, 133: memory allocation of 401408 bytes failed: Cannot allocate memory
2018-11-15T04:26:44.346+0000 E STORAGE [conn33146] WiredTiger error (12) [1542256004:338744][2040:0x7f23f7a86700], file:hitron/collection-14--7480144262999383718.wt, WT_SESSION.commit_transaction: __wt_malloc, 88: memory allocation of 340292 bytes failed: Cannot allocate memory Raw: [1542256004:338744][2040:0x7f23f7a86700], file:hitron/collection-14--7480144262999383718.wt, WT_SESSION.commit_transaction: __wt_malloc, 88: memory allocation of 340292 bytes failed: Cannot allocate memory
2018-11-15T04:26:44.346+0000 E STORAGE [thread32312] WiredTiger error (-31804) [1542256004:338606][2040:0x7f24059b9700], eviction-server: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic Raw: [1542256004:338606][2040:0x7f24059b9700], eviction-server: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic
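In case it helps the investigation: next time memory climbs we can capture the allocator and cache breakdown from the mongo shell. A sketch using standard serverStatus sections (the tcmalloc verbosity argument is, to my knowledge, what is usually requested in these tickets):

// resident vs. virtual memory of the mongod process, in MB
db.serverStatus().mem

// how much of the WiredTiger cache is actually in use, in bytes
db.serverStatus().wiredTiger.cache["bytes currently in the cache"]

// detailed tcmalloc allocator report (verbosity 2 adds the full breakdown)
db.serverStatus({ tcmalloc: 2 }).tcmalloc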