[SERVER-22115] Primary Server Crashes due to Memory Leak Created: 10/Jan/16  Updated: 11/Feb/16  Resolved: 10/Feb/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Moshe Kaplan [X] Assignee: Ramon Fernandez Marina
Resolution: Incomplete Votes: 0
Labels: WTmem
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Crash_Memory.png (PNG), all_graphs.png (PNG), iostat.log (text), log_crash.txt (text)
Operating System: ALL
Steps To Reproduce:

Server details (see the startup sketch after this list):
RAM: 14GB
Data Size: 47.5GB (storage size ~15GB)
cacheSizeGB: 7
4 Cores
MongoDB version: 3.0.8
CentOS Linux release 7.2.1511 (Core): Azure, Linux version 3.10.0-229.11.1.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP

Replica set: Primary, Secondary and Arbiter.
Engine: WiredTiger
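
For reference, a minimal startup sketch matching the details above (the replica set name and paths below are illustrative assumptions, not taken from the ticket):

# Illustrative only: rs0, /var/lib/mongo and the log path are assumptions.
mongod --replSet rs0 \
       --storageEngine wiredTiger \
       --wiredTigerCacheSizeGB 7 \
       --dbpath /var/lib/mongo \
       --logpath /var/log/mongodb/mongod.log \
       --fork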


 Description   
  • A replica set in Azure that was deployed a few months ago.
  • The setup started crashing on Fri, Jan 8 (log example attached).
  • No upgrade or change to the system or application was made in the last week.
  • The crash happens only when the instance is primary, after a few hours of operation. During these hours memory usage increases, while all other parameters such as the number of connections remain constant (see the attached files).
  • The replica set was on 3.0.6 and was upgraded today to 3.0.8. We also removed the node from the replica set and recreated it, yet the problem continues to happen on this machine.
  • This is one of a few replica sets in Azure (the issue is not reproduced on the others), yet it is the most active one.


 Comments   
Comment by Moshe Kaplan [X] [ 11/Feb/16 ]

The customer is happy with the 3.2.1 installation and is currently not willing to put more effort into this.
Thanks for your help!

Comment by Ramon Fernandez Marina [ 10/Feb/16 ]

MosheKaplan, without either the ss.log file (from the 3.0 affected node) or the contents of diagnostic.data (from a 3.2 affected node) it's not possible for us to investigate further, so I'm going to close this ticket.

If this is still an issue for you please provide one of the two data options requested above and we'll reopen the ticket to take a closer look.

Thanks,
Ramón.

Comment by Ramon Fernandez Marina [ 25/Jan/16 ]

MosheKaplan, if you're able to observe this behavior on a 3.2 node, can you please upload the contents of the diagnostic.data directory within your dbpath? This directory contains the same information that you collected above in the ss.log file, and should help us understand what's going on.
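
A minimal sketch of one way to package that directory, assuming a dbpath of /var/lib/mongo (adjust to your deployment):

# Archive diagnostic.data from the dbpath; the path is an assumption, not from this ticket.
tar czf diagnostic.data.tar.gz -C /var/lib/mongo diagnostic.data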

Comment by Moshe Kaplan [X] [ 18/Jan/16 ]

Checking on that.
P.S. The major difference is that in 3.0.8 the cache was not utilized at all, while in 3.2 it is actually utilized. I would look in that direction (a memory leak in the cache).
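
For reference, a minimal shell sketch for spot-checking cache utilization; the field names are those exposed in the wiredTiger.cache section of serverStatus():

# Compare the configured WiredTiger cache size with what is currently held in the cache.
mongo --quiet --eval 'var c = db.serverStatus().wiredTiger.cache; print("maximum bytes configured:     " + c["maximum bytes configured"]); print("bytes currently in the cache: " + c["bytes currently in the cache"])'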

Comment by Ramon Fernandez Marina [ 18/Jan/16 ]

Thanks for the additional information MosheKaplan; when running the script above you should have ended up with another file, ss.log, which is the one with the key information that can help debug this issue. Can you please upload it as well?

Comment by Moshe Kaplan [X] [ 18/Jan/16 ]

Some more info:
1. iostat output is attached
2. Scaling the machine to 32GB of RAM did not help
3. Upgrading to 3.2 made a major improvement

Comment by Moshe Kaplan [X] [ 18/Jan/16 ]

iostat information

Comment by Ramon Fernandez Marina [ 11/Jan/16 ]

Sorry you're running into this issue MosheKaplan. In order to diagnose this problem, can you please run the following shell script while you reproduce the crash?

# Sampling interval in seconds
delay=1
# Sample serverStatus (including tcmalloc allocator stats) once per interval
mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep($delay*1000)}" >ss.log &
# Sample extended, timestamped disk statistics at the same interval
iostat -k -t -x $delay >iostat.log &

You can adjust the delay depending on how long the issue takes to trigger; if it's, say, 24h, the delay can be 5s to keep the resulting files from getting too large. If you could then upload the ss.log and iostat.log files along with the mongod.log for the affected server, that should give us sufficient information to understand the source of the problem.
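
As a quick sanity check on the collected data, a minimal sketch (assuming the ss.log produced by the script above) for watching resident memory, in MB, across the sampled serverStatus documents:

# Pull the mem.resident value (MB) out of each sampled serverStatus line.
grep -o '"resident":[0-9]*' ss.log | tail -n 20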

Thanks,
Ramón.
