- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Fix Version/s: None
- Affects Version/s: 4.0.2
- Component/s: None
- Labels: None
- Operating System: ALL
Our application had been using MongoDB for several days without any memory-related issues. We run a sharded cluster with 2 shards, each of them a replica set with two data-bearing members plus an arbiter.
MongoDB version: 4.0.2
On Nov 15th at 4:26 am UTC, mongod crashed due to an out-of-memory condition on one of the primaries (VM is *02b).
We would like to understand the root cause of the issue. We are using WiredTiger with a cacheSize of 2 GB.
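For reference, that limit is set via storage.wiredTiger.engineConfig.cacheSizeGB; a minimal mongod.conf sketch (our real configuration has more settings than shown):

storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 2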
Here is the sequence of events:
1. The primary was 02b, and it was the first to fail due to out of memory (the server has 3.75 GB of RAM and mongod is configured with a 2 GB WiredTiger cache, so some other operation must have been taking a lot of memory).
2. The application started its purge process at 4 am, and it failed with the following error: Unexpected error in purge process:
The last exception was thrown while deviceLog was being purged; the cause was that the mongod process had stopped.
3. When 02b went down, 02a took over as PRIMARY at around 4:27, but it then failed due to out of memory, same as 02b.
4. At this point, 02a restarted several times, always failing with out of memory.
5. At around 7:00, I noticed the problem and decreased the WiredTiger cache to 1 GB on both 02a and 02b to see whether there was enough memory to start up (see the cache-lowering sketch after this list). 02a still restarted several times but, miraculously, it started successfully at 7:55 (after using a lot of memory, it freed it and is now running with 1 GB).
6. Service was restored and worked well until 9:36; from that point, the disk started getting slower. I'm not sure whether this was due to the IOPS limit (Amazon allows a certain burst, but once the credits run out it throttles your IOPS). The limit for that volume is 750 IOPS, and we are doing around 820-830 IOPS right now (see the diagnostics sketch after this list).
7. I started hitron-db-02b at around 10:00, and same as the other one, it used a lot of memory but survived… after a while it freed some memory and was burning 100% CPU in the "evict" threads (this I figured out using "top -H" and pstack; see the diagnostics sketch after this list). The state of mongod was ROLLBACK, and it did not start replication, I guessed because it was still rolling back.
8. At 11:27 the rollback finished, but with some out-of-memory errors… and mongod aborted.
9. At 11:31 I restarted mongod on 02b, but it failed again due to OOM.
10. At 11:40, I temporarily changed the instance type to c5.xlarge to increase capacity and see whether it would start correctly.
11. At 12:11 I stopped both WiFi servers because the disk on 02a was almost full (it waits for the replica to be up to date and does not allow dropping any collection).
12. At 12:20 02b failed again with OOM, so I changed the instance type to c5.2xlarge and tried again.
13. One curious thing: while the WiFi servers were down, and 02b was down, 02a still used a lot of IOPS and CPU, all from one thread: "WT Reco.plog.rs". This is something we should clarify with the MongoDB team.
14. After using up to 11 GB of memory, 02b started to replicate from 02a.
15. At around 13:00, 02a became unresponsive; I don't know why. I cleanly shut down 02b, and then shut down 02a since it was starting to use more and more memory. It completed shutdown at 13:08.
16. 02a started at 13:10 and began recovery due to the unclean shutdown.
17. I managed to start 02a, and then 02b, with replication up to date, and then I shut down 02b to change its instance type. At this point, 02a tried to use a large amount of memory, crashed with OOM, and started recovery AGAIN. Both 02a and 02b were then running recovery.
18. db02b recovered and took over as primary.
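Regarding step 5: a sketch of the two ways to lower the WiredTiger cache to 1 GB. Since our nodes were crash-looping we changed the config file before restarting; the runtime command is only an option on a node that is already up.

# mongod.conf - takes effect on restart
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 1

# or, from the mongo shell on a running member:
db.adminCommand({ setParameter: 1, wiredTigerEngineRuntimeConfig: "cache_size=1GB" })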
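Regarding steps 6-7: the per-thread CPU and stack information came from standard Linux tools, roughly as below (exact invocations approximate, from memory). The iostat line is just one way to read per-device IOPS from inside the VM.

# per-thread CPU view; the WiredTiger "evict" threads were pegged near 100%
top -H -p $(pidof mongod)

# dump stacks of all mongod threads to see what the hot ones were doing
pstack $(pidof mongod)

# per-device read/write rates (r/s + w/s approximate IOPS), to compare against the 750 IOPS volume limit
iostat -x 5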
Some snippets from the 02b mongod log file:
2018-11-15T04:10:03.136+0000 I CONTROL [LogicalSessionCacheRefresh] Failed to refresh session cache: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true } at rs1
...
2018-11-15T04:26:31.478+0000 I COMMAND [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 21, after connections: 21, after dur: 21, after extra_info: 102, after freeMonitoring: 251, after globalLock: 251, after locks: 251, after logicalSessionRecordCache: 251, after network: 251, after opLatencies: 261, after opcounters: 261, after opcountersRepl: 261, after repl: 553, after security: 553, after shardingStatistics: 563, after storageEngine: 563, after tcmalloc: 901, after transactions: 901, after transportSecurity: 901, after wiredTiger: 1358, at end: 1376 }
...
2018-11-15T04:26:42.467+0000 I COMMAND [conn53] command admin.$cmd command: serverStatus { serverStatus: 1, $clusterTime: { clusterTime: Timestamp(1542256000, 1), signature: { hash: BinData(0, 33AC32E9BDEDD35FED6A1FC98C944CE2FD449627), keyId: 6569071982954938369 } }, $configServerState: { opTime: { ts: Timestamp(1542256000, 1), t: 13 } }, $db: "admin" } numYields:0 reslen:32596 locks:{ Global: { acquireCount: { r: 2 } } } protocol:op_msg 1958ms
2018-11-15T04:26:43.776+0000 E STORAGE [conn33133] WiredTiger error (12) [1542256003:693212][2040:0x7f23f7c88700], file:hitron/collection-14--7480144262999383718.wt, WT_SESSION.commit_transaction: __realloc_func, 133: memory allocation of 401408 bytes failed: Cannot allocate memory Raw: [1542256003:693212][2040:0x7f23f7c88700], file:hitron/collection-14--7480144262999383718.wt, WT_SESSION.commit_transaction: __realloc_func, 133: memory allocation of 401408 bytes failed: Cannot allocate memory
2018-11-15T04:26:44.346+0000 E STORAGE [conn33146] WiredTiger error (12) [1542256004:338744][2040:0x7f23f7a86700], file:hitron/collection-14--7480144262999383718.wt, WT_SESSION.commit_transaction: __wt_malloc, 88: memory allocation of 340292 bytes failed: Cannot allocate memory Raw: [1542256004:338744][2040:0x7f23f7a86700], file:hitron/collection-14--7480144262999383718.wt, WT_SESSION.commit_transaction: __wt_malloc, 88: memory allocation of 340292 bytes failed: Cannot allocate memory
2018-11-15T04:26:44.346+0000 E STORAGE [thread32312] WiredTiger error (-31804) [1542256004:338606][2040:0x7f24059b9700], eviction-server: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic Raw: [1542256004:338606][2040:0x7f24059b9700], eviction-server: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic
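In case it helps the investigation: next time memory climbs we can capture the allocator and cache breakdown from the mongo shell. A sketch using standard serverStatus sections (the tcmalloc verbosity argument is, to my knowledge, what is usually requested in these tickets):

// resident vs. virtual memory of the mongod process, in MB
db.serverStatus().mem

// how much of the WiredTiger cache is actually in use, in bytes
db.serverStatus().wiredTiger.cache["bytes currently in the cache"]

// detailed tcmalloc allocator report (verbosity 2 adds the full breakdown)
db.serverStatus({ tcmalloc: 2 }).tcmalloc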