[SERVER-34398] Mongo WiredTiger Memory Spike And OOM Issue Created: 09/Apr/18  Updated: 23/Jul/18  Resolved: 21/Jun/18

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Alex Etling Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File disk_reads.png     File metrics.2018-04-08T00-12-09Z-00000     File metrics.2018-04-08T04-01-51Z-00000     PNG File mongo_memory_spike.png     PNG File wired_tiger_bytes_in_the_cache.png    
Operating System: ALL
Participants:

 Description   

Over the last couple of months we have seen the hot secondary on one of our replica sets have a memory spike and then get killed by the Linux oom_killer.

Setup:

  • Mongod Version: 3.4.4
  • Replica set config with 3 boxes:
    • One primary and one hot secondary, each running with 8 cores and ~64 GB of memory and swap disabled (see the cache-size sketch after this list)
    • One backup secondary with 4 cores and 32 GB of memory
  • Each box runs in Amazon EC2
  • Each box has a 1500 GB file system mounted for its data directory
  • Each box has a 50 GB file system mounted for the db logs
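
As a point of reference for the memory numbers above, here is a minimal sketch (not from this deployment; it assumes pymongo is installed, mongod is reachable on localhost:27017, and the 3.4 default cache size of roughly 50% of (RAM - 1 GB) applies) that compares host RAM against the WiredTiger cache maximum reported by serverStatus:

```python
# Sketch (assumptions: pymongo installed, mongod on localhost:27017).
# Compares host RAM with the WiredTiger cache maximum that serverStatus
# reports; in 3.4 the default is roughly 50% of (RAM - 1 GB), min 256 MB.
import os
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")

cache_max_gb = status["wiredTiger"]["cache"]["maximum bytes configured"] / 1024 ** 3
ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024 ** 3

print(f"host RAM:             {ram_gb:.1f} GB")
print(f"WT cache maximum:     {cache_max_gb:.1f} GB")
print(f"expected 3.4 default: {max(0.5 * (ram_gb - 1), 0.25):.1f} GB")
```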

The load on the hot secondary is the replication work needed to keep in sync with the primary, plus some queries on 2 collections in this replica set. These queries sometimes include table scans.

Every once in a while our hot secondary's memory usage will spike, causing the oom_killer to kill the running mongod process (see the attached mongo memory spike screenshot).

There are some other strange things going on when the mongod process runs out of memory. One is that the WiredTiger cache size does not seem to increase during this period (see the attached cache usage image).
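
One way to see whether the growth happens inside or outside the cache would be to poll serverStatus and log the cache bytes next to the process's resident memory; a rough sketch (host, port, and polling interval are placeholders):

```python
# Sketch: poll serverStatus once a minute and print WiredTiger cache usage
# next to mongod's resident memory, so a spike outside the cache is visible.
# Host/port and the 60 s interval are assumptions, not from this ticket.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

while True:
    status = client.admin.command("serverStatus")
    cache_gb = status["wiredTiger"]["cache"]["bytes currently in the cache"] / 1024 ** 3
    resident_gb = status["mem"]["resident"] / 1024  # mem.resident is reported in MB
    print(f"{time.strftime('%H:%M:%S')}  cache={cache_gb:.1f} GB  resident={resident_gb:.1f} GB")
    time.sleep(60)
```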

There also seems to be a huge spike in the amount of data read from disk right before the mongod memory usage spikes - about 40 GB worth (see the attached disk read spike). A weird thing here is that those disk reads hit the mounted file system which holds the mongo logs (located at `/`), not the one that holds the mongo data.
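
To double-check that the two directories really sit on different mounts, a small stdlib-only sketch (both paths are placeholders for this deployment's actual dbPath and log directory):

```python
# Sketch: compare device IDs to confirm the data directory and the log
# directory live on different file systems. Both paths are placeholders.
import os

DB_PATH = "/data/db"           # assumed dbPath on the 1500 GB mount
LOG_PATH = "/var/log/mongodb"  # assumed log directory on the 50 GB mount

for path in (DB_PATH, LOG_PATH):
    print(f"{path}: device id {os.stat(path).st_dev}")
```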

I have noticed https://jira.mongodb.org/browse/SERVER-27909, which seems like it could be related.

I have also attached the diagnostic.data files from during and after the incident. I believe the incident should be towards the end of metrics.2018-04-08T00-12-09Z-00000. Let me know if there is any other data you need me to provide. Any help would be greatly appreciated.



 Comments   
Comment by Kelsey Schubert [ 21/Jun/18 ]

Hi paetling@gmail.com,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Kelsey

Comment by Kelsey Schubert [ 10/Apr/18 ]

Hi paetling@gmail.com,

I've reviewed the metrics files you've provided, but unfortunately do not have enough information to conclusively diagnose this issue.

So we can continue to investigate, would you please restart mongod with --setParameter heapProfilingEnabled=true? After encountering the issue again, please upload the following information:

  1. the complete mongod log files since the restart. (You can look for heap profiler messages in the log file to verify that it is running correctly.)
  2. diagnostic.data covering the entire time since the restart

These files will record information that should enable us to track the source of the memory increase. For this purpose it is important that we have complete logs and diagnostic.data covering the time since the restart. Since the required files may be too large to attach to this ticket, I've generated a secure upload portal for you to use.
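
To confirm the parameter took effect after the restart, a minimal check could look like the following (assuming heapProfilingEnabled is readable through getParameter and mongod listens on the default local port):

```python
# Sketch: verify heapProfilingEnabled after the restart. The connection
# string is a placeholder; if the parameter is not exposed via getParameter,
# the heap profiler messages in the mongod log are the fallback check.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
result = client.admin.command({"getParameter": 1, "heapProfilingEnabled": 1})
print("heapProfilingEnabled:", result.get("heapProfilingEnabled"))
```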

Thank you for your help,
Kelsey
