[SERVER-27909] OOM and possible memory leaks Created: 03/Feb/17  Updated: 12/Feb/17  Resolved: 07/Feb/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.4.1
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Vitaly Dyatlov Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 1
Labels: Bug, crash
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File WTcursors.png     PNG File mongo-mem.png     File status_20170203_1730.json     File status_20170203_1800.json     File status_20170203_1830.json     File status_20170203_1900.json    
Issue Links:
Duplicate
duplicates WT-1228 Improve performance of WT_SESSION::op... Closed
Operating System: ALL
Participants:

 Description   

We have switched one of our servers to WT to test it out.
Mongo version: 3.4.1
aws: i2.2xlarge
FS: ext4
Swap: off
daemon: mongod

settings:

systemLog:
    destination: file
    path: "/var/log/mongodb/mongodb.log"
    quiet: false
    logAppend: true
    logRotate: reopen
net:
    http:
        enabled: true
        RESTInterfaceEnabled: true
    port: 27017
operationProfiling:
    mode: slowOp
    slowOpThresholdMs: 1000
 
storage:
    dbPath: /ssd/var/lib/mongodb
 
    engine: wiredTiger

After 24 hours it had consumed all server memory and failed with an OOM:

2017-02-02T11:08:44.353+0000 I NETWORK  [thread1] connection accepted from xxx.xxx.xxx.xxx:41681 #27019 (1611 connections now open)
2017-02-02T11:08:44.927+0000 F -        [statsSnapshot] out of memory.
 
 0x7f02698f3221 0x7f02698f2734 0x7f0269ab02cb 0x7f026a3e9d44 0x7f02694f2df2 0x7f02694ec34e 0x7f02694ec95c 0x7f02698629fd 0x7f026a3681d0 0x7f0266fd6184 0x7f0266d0337d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"7F0268374000","o":"157F221","s":"_ZN5mongo15printStackTraceERSo"},{"b":"7F0268374000","o":"157E734","s":"_ZN5mongo29reportOutOfMemoryErrorAndExitEv"},{"b":"7F0268374000","o":"173C2CB"},{"b":"7F0268374000","o":"2075D44","s":"tc_new"},{"b":"7F0268374000","o":"117EDF2","s":"_ZNK5mongo3Top8cloneMapERNS_21UnorderedFastKeyTableINS_10StringDataENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS0_14CollectionDataENS_15StringMapTraitsEEE"},{"b":"7F0268374000","o":"117834E","s":"_ZN5mongo9Snapshots12takeSnapshotEv"},{"b":"7F0268374000","o":"117895C","s":"_ZN5mongo19StatsSnapshotThread3runEv"},{"b":"7F0268374000","o":"14EE9FD","s":"_ZN5mongo13BackgroundJob7jobBodyEv"},{"b":"7F0268374000","o":"1FF41D0"},{"b":"7F0266FCE000","o":"8184"},{"b":"7F0266C09000","o":"FA37D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.4.1", "gitVersion" : "5e103c4f5583e2566a45d740225dc250baacfbd7", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.13.0-29-generic", "version" : "#53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "b" : "7F0268374000", "elfType" : 3, "buildId" : "B0B5767DC6B7AB9D129191D0FA1F55666B34170F" }, { "b" : "7FFFEDCFE000", "elfType" : 3, "buildId" : "3D068D088E7EAC15D9DA7C3AC912E783C0897EE7" }, { "b" : "7F0267EF0000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "15A6A69B862FAC0AA589921DB1B0DF8A673FEC81" }, { "b" : "7F0267B14000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "607C4A6F3428B6F065D55691AFDC4E343786A8A5" }, { "b" : "7F026790C000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "E2A6DD5048A0A051FD61043BDB69D8CC68192AB7" }, { "b" : "7F0267708000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "D0F537904076D73F29E4A37341F8A449E2EF6CD0" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x7f02698f3221] 
 mongod(_ZN5mongo29reportOutOfMemoryErrorAndExitEv+0x84) [0x7f02698f2734] 
 mongod(+0x173C2CB) [0x7f0269ab02cb]
 mongod(tc_new+0x224) [0x7f026a3e9d44]
 mongod(_ZNK5mongo3Top8cloneMapERNS_21UnorderedFastKeyTableINS_10StringDataENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS0_14CollectionDataENS_15StringMapTraitsEEE+0x62) [0x7f02694f2df2]
 mongod(_ZN5mongo9Snapshots12takeSnapshotEv+0x4E) [0x7f02694ec34e]
 mongod(_ZN5mongo19StatsSnapshotThread3runEv+0x9C) [0x7f02694ec95c] 
 mongod(_ZN5mongo13BackgroundJob7jobBodyEv+0x16D) [0x7f02698629fd] 
 mongod(+0x1FF41D0) [0x7f026a3681d0]
 libpthread.so.0(+0x8184) [0x7f0266fd6184] 
 libc.so.6(clone+0x6D) [0x7f0266d0337d]
-----  END BACKTRACE  -----
2017-02-02T11:20:05.538+0000 I CONTROL  [main] ***** SERVER RESTARTED *****

I restarted the mongod server and added 10 GB of swap. For some time memory usage stayed at about 65% and then started to climb again.
After 24 hours I see this picture:

root@xxxxxx:/var/log/mongodb# free -h
             total       used       free     shared    buffers     cached
Mem:           29G        29G       227M        36K       2.4M        62M
-/+ buffers/cache:        29G       292M
Swap:           9G       2.5G       7.5G

So all memory is consumed and roughly 2.5 GB of swap is occupied.
I don't think this is correct or expected behavior; mongod should not use more memory than is available.
The server is still running, and I am now collecting serverStatus output every 30 minutes.

Please find the last 4 serverStatus reports attached.
With MMAPv1 we never saw memory usage above 15-20%, let alone an OOM.
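For reference, this is roughly how the 30-minute serverStatus snapshots can be collected; a minimal sketch assuming the mongo shell is available on the same host and the output directory is writable (the filenames are illustrative):

# Crontab entry: dump serverStatus every 30 minutes (% must be escaped in cron)
*/30 * * * * mongo --quiet --eval 'printjson(db.serverStatus())' > /var/log/mongodb/status_$(date +\%Y\%m\%d_\%H\%M).json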



 Comments   
Comment by Vitaly Dyatlov [ 12/Feb/17 ]

An update:

Collection create/removal helps for some time.

Today's picture, 4 days after we refreshed the cursor cache (by adding and deleting a collection):

root@xxxx:~# free -h
             total       used       free     shared    buffers     cached
Mem:           29G        29G       194M        40K       492K        20M
-/+ buffers/cache:        29G       215M
Swap:           9G         9G         0B

All memory is occupied again.

Server load: `load average: 120.84, 144.86, 123.1` (the usual load is 4-8).

We had to restart the server immediately.

Don't you have a check in your code to limit cache usage? It is indeed a leak and a serious sign of instability.
We constantly have about 1,500-1,600 open connections; why does your cache keep growing and never stop?
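For what it's worth, the relevant counters can be watched from the mongo shell; a minimal sketch (the wiredTiger.session field names below are as they appear in 3.4-era serverStatus output, so treat them as assumptions and verify against your own serverStatus dump):

# Print connection and WiredTiger session/cursor counters from serverStatus
mongo --quiet --eval '
    var s = db.serverStatus();
    print("connections:   " + s.connections.current);
    print("open sessions: " + s.wiredTiger.session["open session count"]);
    print("open cursors:  " + s.wiredTiger.session["open cursor count"]);
'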

Comment by Kelsey Schubert [ 07/Feb/17 ]

Hi dyatlov,

The majority of users see significant improvements after switching to WiredTiger and run in production without issue. However, we are aware of a few specific workloads (many tables, workloads with many small updates) where WiredTiger does not perform as well as MMAPv1. I understand the frustration of not being able to upgrade at this time, and we are working very hard to improve WiredTiger's performance for these use cases.

In your case, there are a significant number of data handles, which is exacerbating this issue. With MMAPv1 it was preferable to have many collections, as it cannot provide the document-level locking that WiredTiger allows. In addition to WT-1228, we are working on improving how WiredTiger behaves with many tables. This work would also improve the performance you observe, and is currently tracked in SERVER-17675 and SERVER-21629.

Kind regards,
Thomas

Comment by Vitaly Dyatlov [ 07/Feb/17 ]

Thomas,

Should I read your answer as "WiredTiger is not stable enough for production"?
In what use cases is it safe to use WT over MMAPv1?

Vitali.

Comment by Kelsey Schubert [ 07/Feb/17 ]

Hi dyatlov,

Thank you for providing the diagnostic.data. The cause of the OOM appears to be an accumulation of WiredTiger cursors in the session cache. Near the beginning of the diagnostic.data, we see that an operation takes place that invalidates the WT cursor cache, and the memory usage drops.

We're investigating long term solutions to this issue in WT-1228. While you could periodically invalidate the cursor cache by executing commands like dropping a collection, my current recommendation would be to stay on MMAPv1 until WT-1228 is resolved.
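As a rough illustration of that interim workaround, creating and dropping a throwaway collection from the mongo shell forces the cached cursors to be invalidated; the database and collection names here are purely hypothetical:

# Hypothetical names; per the discussion above, the create/drop cycle is what
# invalidates the cached WiredTiger cursors
mongo --quiet --eval '
    var tmp = db.getSiblingDB("cursor_flush_tmp");
    tmp.createCollection("flush");
    tmp.flush.drop();
'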

Please feel free to vote for WT-1228 and watch it for updates.

Kind regards,
Thomas

Comment by Vitaly Dyatlov [ 05/Feb/17 ]

A small update to this: 4.5 GB of the swap file is already occupied (the peak was 4.7 GB last night).

Comment by Kelsey Schubert [ 03/Feb/17 ]

Hi dyatlov,

Please note that files uploaded to this ticket are public. If you are concerned about the privacy of your logs, I've created a secure upload portal for you to use.

Kind regards,
Thomas

Comment by Kelsey Schubert [ 03/Feb/17 ]

Hi dyatlov,

Would you please attach an archive of the diagnostic.data directory and the complete logs for the affected mongod?

Thank you,
Thomas
