[SERVER-17456] Mongodb 3.0 wiredTiger storage engine memory usage too high. Created: 04/Mar/15  Updated: 06/Sep/15  Resolved: 06/Sep/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.0
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Feng Yu Assignee: Bruce Lucas (Inactive)
Resolution: Duplicate Votes: 12
Labels: RF, WTmem, wttt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 09-05-detail.png     PNG File 09-05.png     PNG File all.png     Text File ay_mr01a_ss.txt     PNG File early.png     PNG File leak.png     Text File mongod.log     Text File mongod.log     File ss.log.xz     File ss.log.xz     File status.json     HTML File syslog     File syslog.1    
Issue Links:
Duplicate
duplicates SERVER-19673 Excessive memory allocated by WiredTi... Closed
duplicates SERVER-17386 Cursor cache causes excessive memory ... Closed
duplicates SERVER-20306 75% excess memory usage under WiredTi... Closed
Related
related to SERVER-16977 Memory increase trend when running ha... Closed
is related to SERVER-20159 Out of memory on index build during i... Closed
is related to SERVER-17424 WiredTiger uses substantially more me... Closed
is related to SERVER-20248 Memory growth in __wt_session_get_btr... Closed
Backwards Compatibility: Fully Compatible
Participants:

 Description   

My OS is Ubuntu Server 14.04.2, and my MongoDB version is 3.0.

I ran a stress test against the new WiredTiger storage engine, but I found that memory usage never decreases. Eventually mongod was killed by the kernel due to an out-of-memory condition:

localhost kernel: [12422.302136] Out of memory: Kill process 5893 (mongod) score 405 or sacrifice child.

Here is the db.serverStatus():

> db.serverStatus()
{
	"host" : "rtdstest",
	"version" : "3.0.0",
	"process" : "mongod",
	"pid" : NumberLong(14049),
	"uptime" : 5109,
	"uptimeMillis" : NumberLong(5108843),
	"uptimeEstimate" : 4731,
	"localTime" : ISODate("2015-03-04T05:56:32.167Z"),
	"asserts" : {
		"regular" : 0,
		"warning" : 0,
		"msg" : 0,
		"user" : 0,
		"rollovers" : 0
	},
	"connections" : {
		"current" : 6,
		"available" : 51194,
		"totalCreated" : NumberLong(12)
	},
	"cursors" : {
		"note" : "deprecated, use server status metrics",
		"clientCursors_size" : 0,
		"totalOpen" : 0,
		"pinned" : 0,
		"totalNoTimeout" : 0,
		"timedOut" : 0
	},
	"extra_info" : {
		"note" : "fields vary by platform",
		"heap_usage_bytes" : 1181006888,
		"page_faults" : 59806
	},
	"globalLock" : {
		"totalTime" : NumberLong("5108855000"),
		"currentQueue" : {
			"total" : 0,
			"readers" : 0,
			"writers" : 0
		},
		"activeClients" : {
			"total" : 11,
			"readers" : 0,
			"writers" : 0
		}
	},
	"locks" : {
		"Global" : {
			"acquireCount" : {
				"r" : NumberLong(1040910),
				"w" : NumberLong(416818),
				"W" : NumberLong(4)
			}
		},
		"MMAPV1Journal" : {
			"acquireCount" : {
				"r" : NumberLong(2),
				"w" : NumberLong(5)
			}
		},
		"Database" : {
			"acquireCount" : {
				"r" : NumberLong(1040902),
				"w" : NumberLong(413816),
				"R" : NumberLong(8),
				"W" : NumberLong(3002)
			},
			"acquireWaitCount" : {
				"r" : NumberLong(526),
				"w" : NumberLong(2926),
				"W" : NumberLong(2787)
			},
			"timeAcquiringMicros" : {
				"r" : NumberLong(83807792),
				"w" : NumberLong(117579505),
				"W" : NumberLong(318323222)
			}
		},
		"Collection" : {
			"acquireCount" : {
				"r" : NumberLong(1113677),
				"w" : NumberLong(416816)
			}
		}
	},
	"network" : {
		"bytesIn" : 125421456,
		"bytesOut" : 10352393,
		"numRequests" : 417271
	},
	"opcounters" : {
		"insert" : 413816,
		"query" : 1,
		"update" : 0,
		"delete" : 0,
		"getmore" : 0,
		"command" : 3456
	},
	"opcountersRepl" : {
		"insert" : 0,
		"query" : 0,
		"update" : 0,
		"delete" : 0,
		"getmore" : 0,
		"command" : 0
	},
	"storageEngine" : {
		"name" : "wiredTiger"
	},
	"wiredTiger" : {
		"uri" : "statistics:",
		"LSM" : {
			"sleep for LSM checkpoint throttle" : 0,
			"sleep for LSM merge throttle" : 0,
			"rows merged in an LSM tree" : 0,
			"application work units currently queued" : 0,
			"merge work units currently queued" : 0,
			"tree queue hit maximum" : 0,
			"switch work units currently queued" : 0,
			"tree maintenance operations scheduled" : 0,
			"tree maintenance operations discarded" : 0,
			"tree maintenance operations executed" : 0
		},
		"async" : {
			"number of allocation state races" : 0,
			"number of operation slots viewed for allocation" : 0,
			"current work queue length" : 0,
			"number of flush calls" : 0,
			"number of times operation allocation failed" : 0,
			"maximum work queue length" : 0,
			"number of times worker found no work" : 0,
			"total allocations" : 0,
			"total compact calls" : 0,
			"total insert calls" : 0,
			"total remove calls" : 0,
			"total search calls" : 0,
			"total update calls" : 0
		},
		"block-manager" : {
			"mapped bytes read" : 0,
			"bytes read" : 661356544,
			"bytes written" : 3154300928,
			"mapped blocks read" : 0,
			"blocks pre-loaded" : 6,
			"blocks read" : 161464,
			"blocks written" : 732235
		},
		"cache" : {
			"tracked dirty bytes in the cache" : 145758110,
			"bytes currently in the cache" : 218847451,
			"maximum bytes configured" : 1073741824,
			"bytes read into cache" : 306,
			"bytes written from cache" : 1474341161,
			"pages evicted by application threads" : 0,
			"checkpoint blocked page eviction" : 0,
			"unmodified pages evicted" : 0,
			"page split during eviction deepened the tree" : 0,
			"modified pages evicted" : 3000,
			"pages selected for eviction unable to be evicted" : 0,
			"pages evicted because they exceeded the in-memory maximum" : 0,
			"pages evicted because they had chains of deleted items" : 0,
			"failed eviction of pages that exceeded the in-memory maximum" : 0,
			"hazard pointer blocked page eviction" : 0,
			"internal pages evicted" : 0,
			"maximum page size at eviction" : 249,
			"eviction server candidate queue empty when topping up" : 0,
			"eviction server candidate queue not empty when topping up" : 0,
			"eviction server evicting pages" : 0,
			"eviction server populating queue, but not evicting pages" : 0,
			"eviction server unable to reach eviction goal" : 0,
			"pages split during eviction" : 0,
			"pages walked for eviction" : 0,
			"eviction worker thread evicting pages" : 0,
			"in-memory page splits" : 0,
			"percentage overhead" : 8,
			"tracked dirty pages in the cache" : 7243,
			"pages currently held in the cache" : 18009,
			"pages read into cache" : 6,
			"pages written from cache" : 400314
		},
		"connection" : {
			"pthread mutex condition wait calls" : 235351,
			"files currently open" : 9007,
			"memory allocations" : 19023254,
			"memory frees" : 9436838,
			"memory re-allocations" : 23706,
			"total read I/Os" : 176475,
			"pthread mutex shared lock read-lock calls" : 3201232,
			"pthread mutex shared lock write-lock calls" : 797865,
			"total write I/Os" : 1152662
		},
		"cursor" : {
			"cursor create calls" : 2802022,
			"cursor insert calls" : 1523920,
			"cursor next calls" : 3012,
			"cursor prev calls" : 3003,
			"cursor remove calls" : 0,
			"cursor reset calls" : 1613872,
			"cursor search calls" : 1972827,
			"cursor search near calls" : 52143,
			"cursor update calls" : 0
		},
		"data-handle" : {
			"connection dhandles swept" : 6,
			"connection candidate referenced" : 448192,
			"connection sweeps" : 507,
			"connection time-of-death sets" : 181568,
			"session dhandles swept" : 325685,
			"session sweep attempts" : 13896
		},
		"log" : {
			"log buffer size increases" : 1,
			"total log buffer size" : 22503424,
			"log bytes of payload data" : 142640845,
			"log bytes written" : 161999360,
			"yields waiting for previous log file close" : 0,
			"total size of compressed records" : 67924473,
			"total in-memory size of compressed records" : 226948923,
			"log records too small to compress" : 6082,
			"log records not compressed" : 240504,
			"log records compressed" : 194371,
			"maximum log file size" : 104857600,
			"pre-allocated log files prepared" : 2,
			"number of pre-allocated log files to create" : 1,
			"pre-allocated log files used" : 1,
			"log read operations" : 0,
			"records processed by log scan" : 0,
			"log scan records requiring two reads" : 0,
			"log scan operations" : 0,
			"consolidated slot closures" : 364624,
			"logging bytes consolidated" : 144881408,
			"consolidated slot joins" : 394190,
			"consolidated slot join races" : 5,
			"slots selected for switching that were unavailable" : 7,
			"record size exceeded maximum" : 0,
			"failed to find a slot large enough for record" : 1,
			"consolidated slot join transitions" : 4,
			"log sync operations" : 132,
			"log write operations" : 440958
		},
		"reconciliation" : {
			"page reconciliation calls" : 392016,
			"page reconciliation calls for eviction" : 3000,
			"split bytes currently awaiting free" : 0,
			"split objects currently awaiting free" : 0
		},
		"session" : {
			"open cursor count" : 413754,
			"open session count" : 90
		},
		"thread-yield" : {
			"page acquire busy blocked" : 0,
			"page acquire eviction blocked" : 0,
			"page acquire locked blocked" : 0,
			"page acquire read blocked" : 0,
			"page acquire time sleeping (usecs)" : 0
		},
		"transaction" : {
			"transaction begins" : 428981,
			"transaction checkpoints" : 79,
			"transaction checkpoint currently running" : 1,
			"transaction checkpoint max time (msecs)" : 15011,
			"transaction checkpoint min time (msecs)" : 1,
			"transaction checkpoint most recent time (msecs)" : 14423,
			"transaction checkpoint total time (msecs)" : 201933,
			"transactions committed" : 428980,
			"transaction failures due to cache overflow" : 0,
			"transaction range of IDs currently pinned" : 1933,
			"transactions rolled back" : 1
		},
		"concurrentTransactions" : {
			"write" : {
				"out" : 0,
				"available" : 128,
				"totalTickets" : 128
			},
			"read" : {
				"out" : 1,
				"available" : 127,
				"totalTickets" : 128
			}
		}
	},
	"writeBacksQueued" : false,
	"mem" : {
		"bits" : 64,
		"resident" : 1153,
		"virtual" : 1299,
		"supported" : true,
		"mapped" : 0,
		"mappedWithJournal" : 0
	},
	"metrics" : {
		"commands" : {
			"buildInfo" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(6)
			},
			"createIndexes" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(3000)
			},
			"dbStats" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(3)
			},
			"getLog" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(5)
			},
			"getnonce" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(1)
			},
			"isMaster" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(403)
			},
			"listCollections" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(5)
			},
			"listDatabases" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(7)
			},
			"ping" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(6)
			},
			"replSetGetStatus" : {
				"failed" : NumberLong(5),
				"total" : NumberLong(5)
			},
			"serverStatus" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(1)
			},
			"top" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(9)
			},
			"whatsmyuri" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(5)
			}
		},
		"cursor" : {
			"timedOut" : NumberLong(0),
			"open" : {
				"noTimeout" : NumberLong(0),
				"pinned" : NumberLong(0),
				"total" : NumberLong(0)
			}
		},
		"document" : {
			"deleted" : NumberLong(0),
			"inserted" : NumberLong(413816),
			"returned" : NumberLong(0),
			"updated" : NumberLong(0)
		},
		"getLastError" : {
			"wtime" : {
				"num" : 0,
				"totalMillis" : 0
			},
			"wtimeouts" : NumberLong(0)
		},
		"operation" : {
			"fastmod" : NumberLong(0),
			"idhack" : NumberLong(0),
			"scanAndOrder" : NumberLong(0),
			"writeConflicts" : NumberLong(0)
		},
		"queryExecutor" : {
			"scanned" : NumberLong(0),
			"scannedObjects" : NumberLong(0)
		},
		"record" : {
			"moves" : NumberLong(0)
		},
		"repl" : {
			"apply" : {
				"batches" : {
					"num" : 0,
					"totalMillis" : 0
				},
				"ops" : NumberLong(0)
			},
			"buffer" : {
				"count" : NumberLong(0),
				"maxSizeBytes" : 268435456,
				"sizeBytes" : NumberLong(0)
			},
			"network" : {
				"bytes" : NumberLong(0),
				"getmores" : {
					"num" : 0,
					"totalMillis" : 0
				},
				"ops" : NumberLong(0),
				"readersCreated" : NumberLong(0)
			},
			"preload" : {
				"docs" : {
					"num" : 0,
					"totalMillis" : 0
				},
				"indexes" : {
					"num" : 0,
					"totalMillis" : 0
				}
			}
		},
		"storage" : {
			"freelist" : {
				"search" : {
					"bucketExhausted" : NumberLong(0),
					"requests" : NumberLong(0),
					"scanned" : NumberLong(0)
				}
			}
		},
		"ttl" : {
			"deletedDocuments" : NumberLong(0),
			"passes" : NumberLong(85)
		}
	},
	"ok" : 1
}



 Comments   
Comment by Bruce Lucas (Inactive) [ 06/Sep/15 ]

Hi Feng,

I have been able to reproduce an issue whose signature matches the latest issue you reported under 3.0.6. I opened SERVER-20306 to track this specific issue. At this point I think the current ticket has run its course, having covered several issues, so I will close it now. Please follow SERVER-20306 to track progress on this latest issue.

I want to really thank you for bringing these issues to our attention, and for your patience as we have diagnosed and fixed them. Please don't hesitate at all to open a new ticket if you encounter any new issues.

Thanks,
Bruce

Comment by Bruce Lucas (Inactive) [ 06/Sep/15 ]

Feng, thanks for uploading that data. Here's what I'm seeing over the course of the run:

  • until the point marked above, allocated memory and virtual size fairly closely tracked WT cache size, and both were less than the configured 1 GB (specifically, around the target of 80% of the configured size).
  • at A there was a jump in virtual memory and heap size so that memory usage was roughly 1.5x configured WT cache size. I'll examine this below.
  • but allocated memory and WT cache memory remained bounded as expected at about 80% of the configured WT cache size.
  • after A all memory usage stats remained constant, so it appears there is no longer any issue of memory growing without bound.

However, regarding the jump in virtual and heap memory to about 1.5x the configured cache size, here's a detailed view:

It appears that the jump was caused by some activity that required reading data into the cache, marked above. Looking at the numbers (in MB) before and after the jump:

                                   before   after    diff
bytes read into cache                  51     639     587   (A)
bytes currently in the cache          737     818      81   (B)
current allocated bytes               797     851      55   (C)
virtual                               967    1562     595   (D)
heap size                             812    1383     571   (E)
central cache free bytes                8     499     491   (F)

  • A: the activity during this interval has required reading about 500 MB into the cache. This will require allocating about 500 MB of memory.
  • B: but WT has, as expected, kept the number of bytes in the cache about constant, around its target of 80% of the 1 GB cache. To do this it has had to evict about 500 MB of data from the cache.
  • C: as it evicted the data it has freed the memory for the evicted data, thus keeping total allocated bytes also about constant.
  • D, E, F: however during this time virtual memory, allocator heap size, and allocator central cache free bytes have all grown by about 500 MB.

Thus it appears that

  • the 500 MB of new allocations required to read the 500 MB of new data into the cache were not satisfied from the 500 MB of memory freed as WT evicted data from the cache,
  • but rather were satisfied from a fresh 500 MB of memory obtained from the OS,
  • and the 500 MB of freed memory were returned to the allocator central cache instead.

I suspect this might reflect a memory fragmentation issue: possibly the memory freed by evicting the 500 MB of data from the cache consisted of many small non-contiguous buffers, whereas the data read from disk required larger buffers that could not be accommodated in the small non-contiguous regions of free space left by the eviction, and so had to be satisfied by obtaining new memory from the OS.

michael.cahill, is the scenario outlined above plausible? If so, do you think there's anything that could be done to improve this? If true, it implies that in the worst case we could need something like 2x the configured cache size: 1x for "large buffers" of data freshly read into the cache, and 1x for "small buffers" of data that has aged in the cache. Getting to 2x would probably require a fairly pathological case, but here we seem to have reached about 1.5x.
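The before/after arithmetic above can be checked mechanically. The following sketch recomputes the deltas from the rounded MB values in the table (small off-by-one differences versus the table's diff column are rounding artifacts):

```python
# Rounded MB values from the table above (before and after the jump)
before = {"bytes read into cache": 51, "bytes currently in the cache": 737,
          "current allocated bytes": 797, "virtual": 967,
          "heap size": 812, "central cache free bytes": 8}
after = {"bytes read into cache": 639, "bytes currently in the cache": 818,
         "current allocated bytes": 851, "virtual": 1562,
         "heap size": 1383, "central cache free bytes": 499}

diff = {k: after[k] - before[k] for k in before}

# ~500 MB was read into the cache (A), while cache contents (B) and
# allocated bytes (C) stayed roughly flat, near the ~80% eviction target
# of the 1 GB cache (0.8 * 1024 MB = ~819 MB) ...
assert abs(diff["bytes currently in the cache"]) < 100
assert abs(diff["current allocated bytes"]) < 100
# ... yet virtual size (D), heap size (E), and the allocator's central
# free list (F) each grew by roughly the same ~500 MB.
for k in ("virtual", "heap size", "central cache free bytes"):
    assert diff[k] > 450
```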

Comment by Bruce Lucas (Inactive) [ 06/Sep/15 ]

ben.woodrum@contactive.com, thanks for uploading those logs. I've opened a ticket in our Community Private project, SUPPORT-1434, so that we can continue exchanging information privately about this issue; that ticket will be visible only to us and to yourself. This will also separate out the discussion of your issue; it will keep things clearer if we don't presume at the outset that you are seeing the same issue as Feng.

Comment by Feng Yu [ 05/Sep/15 ]

I've uploaded my ss.log and mongod.log. In the end, mongod used 28-30% of memory.

Comment by Daniel Pasette (Inactive) [ 04/Sep/15 ]

Thanks for the update Ben. Thanks also for getting those DataDog PR's in.

Comment by Ben McCann [ 04/Sep/15 ]

For what it's worth, we've seen really great memory improvements going from 3.0.4 to 3.0.6. We went from filling up the cache and crashing multiple times per day to completely stable. We tracked down a lot of problems using a patched version of Datadog (see the current pending pull requests: https://github.com/DataDog/dd-agent/pulls?utf8=%E2%9C%93&q=%5Bmongo%5D+is%3Apr+is%3Aopen+) that provided us with lots of extra metrics on MongoDB and WiredTiger.

Comment by Daniel Pasette (Inactive) [ 04/Sep/15 ]

Hi Ben,
You can upload your data to this endpoint, where it is stored encrypted and is viewable only by MongoDB engineering staff:
https://10gen-httpsupload.s3.amazonaws.com/upload_forms/d6377b89-8e83-4529-b3ab-38d5cec57ecd.html

Comment by Ben Woodrum [ 04/Sep/15 ]

Hi Bruce,

I've also experienced this issue and have been following this ticket since its creation. I've just generated and captured the crash data for ss.log, syslog and mongodb.log and would like to send it to you privately to help with this investigation. Is there a secure way I can send you our logs?

Comment by Bruce Lucas (Inactive) [ 02/Sep/15 ]

Hi Feng,

I've been investigating further, and it looks like the problem I reproduced above on 3.0.5 does not occur on 3.0.6. Since you are still seeing problematic behavior on 3.0.6 we'll need to take a look at your logs and serverStatus time series on 3.0.6. Can you please collect serverStatus time series like you did before (described here) as mongod memory usage grows unacceptably under 3.0.6, then upload that time series data and the corresponding mongod log file?

Thanks,
Bruce

Comment by Bruce Lucas (Inactive) [ 29/Aug/15 ]

Hi Feng,

Thanks for uploading the latest information, and sorry for the delay in responding. Based on that information I have been able to reproduce what I assume is the same slow leak you are seeing in your latest tests. The following graphs show total allocated memory, the portion of allocated memory used by the cache, and the portion of allocated memory not accounted for by the cache (in MB):

In my test, cache usage grows during the first part of the run but stops growing when the configured cache limit is reached; allocated memory, however, continues to grow, and the difference between allocated memory and cache usage grows steadily throughout the run. This is the presumed leak. I will continue to investigate and will update you with my findings.

Comment by Feng Yu [ 28/Aug/15 ]

This issue hasn't been resolved as of 3.0.6. I'm still following this issue.

Comment by Bruce Lucas (Inactive) [ 13/Aug/15 ]

chengas123, I've opened SERVER-19941 to request that tcmalloc:true be enabled by default.

Comment by Ben McCann [ 13/Aug/15 ]

Is there a reason the tcmalloc stats aren't included by default, so that you need to specify the extra option in the serverStatus call? Is it expensive? I'm wondering if we should always run with it on.

We're having this problem, or a very similar one, on 3.0.4. We just set up DataDog, which collects hundreds of metrics about our instances (running with this patch to add WiredTiger metrics: https://github.com/DataDog/dd-agent/pull/1825), and that let us figure out what was happening. I could add the tcmalloc option to DataDog to allow collecting and graphing those metrics as well, but I want to understand that option better first so that I add support for it correctly.

Comment by Feng Yu [ 12/Aug/15 ]

Hi Bruce Lucas, I've uploaded my logs (ss.log.xz, syslog, syslog.1, mongod.log).

Here is mongod.conf:

systemLog:
   destination: file
   path: "/var/log/mongodb/mongod.log"
   logAppend: true
   logRotate: reopen
storage:
   journal:
      enabled: true
   dbPath: "/mnt/mongodb_wt"
   engine: wiredTiger
   wiredTiger.engineConfig.cacheSizeGB: 1
net:
   bindIp: 127.0.0.1
   port: 27017
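(Note that the last storage line above uses a dotted key; the fully nested spelling of the same setting, shown here only for illustration, is:)

```yaml
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 1
```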

But mongod used nearly half of physical memory:

# ps aux | grep mongod
mongodb  12305  6.5 48.9 2813748 1982144 ?     S<sl 10:52  51:00 /usr/bin/mongod --config /etc/mongod.conf

And it is still growing slowly.

Comment by Feng Yu [ 12/Aug/15 ]

Wait: I will run the collection script on the production server.
The production server has 4 GB of physical memory, and I've already set storage.wiredTiger.engineConfig.cacheSizeGB to 1, but mongod's memory usage grows beyond 50% within 6 hours. Perhaps in the production environment we can learn something more useful.

I will restart mongod and run the collection script now.

Comment by Bruce Lucas (Inactive) [ 11/Aug/15 ]

abcfy2, can you please repeat the serverStatus time series data collection that you did before while reproducing the most recent problem you describe above, the mongorestore with indexes? That is:

  • start mongod as usual
  • start the following data collection script:

    mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1000)}" >ss.log &
    

  • after the mongorestore when memory utilization is higher than you expect, please attach both syslog and the ss.log output and also mongod.log. This will give us a clearer picture of how memory usage varies over the course of the run.
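Once captured, the ss.log series can be summarized offline. A minimal sketch (assuming one serverStatus JSON document per line, as the script above produces; the helper name is illustrative):

```python
import json

def memory_series(lines):
    """Extract per-sample memory stats (MB) from serverStatus JSON lines."""
    series = []
    for line in lines:
        try:
            ss = json.loads(line)
        except ValueError:
            continue  # skip shell banner or other non-JSON noise
        series.append({
            "resident_mb": ss["mem"]["resident"],
            "virtual_mb": ss["mem"]["virtual"],
            # WT reports cache size in bytes; convert to MB
            "wt_cache_mb": ss["wiredTiger"]["cache"]["bytes currently in the cache"] // 2**20,
        })
    return series

# Hypothetical two-line sample; a real ss.log has one line per second
sample = [
    '{"mem": {"resident": 1153, "virtual": 1299}, "wiredTiger": {"cache": {"bytes currently in the cache": 218847451}}}',
    'not json',
]
print(memory_series(sample))
```

Plotting resident_mb against wt_cache_mb over time makes a leak (resident growing while the cache stays bounded) easy to spot.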

Regarding the mongorestore, is there any possibility you could share the data with us so we can reproduce it in house? I understand that the dump is about 180 GB, which is a large data set for upload, but might just be feasible if you are willing to do it. If you are I can provide a secure, private upload location.

arunasf, praveenak - would you be so kind as to open separate tickets? We find that it keeps the analysis clearer if we focus each ticket on one problem. The symptom of excess memory usage is very generic and can have many causes, so we would prefer to separate the issues until and unless analysis shows that they are related. If you could collect and include the serverStatus timeseries as described above when you open the tickets that will help speed the process.

Thanks,
Bruce

Comment by Feng Yu [ 06/Aug/15 ]

I can't get a stack trace in the log. I've tried many times (setting storage.wiredTiger.engineConfig.cacheSizeGB to 3 or 4), but mongod is simply killed by the kernel OOM killer. I also find that during "building index using bulk method", memory usage grows very fast.

So I tried deleting the indexes and restoring again:

# rm -f  dump/**/{system.indexes.bson,*.metadata.json}
# mongorestore dump/

Without building indexes, memory usage grows gently. But after a long time, mongod had used so much memory that it effectively caused a denial of service, though it was not OOM-killed.

# mongorestore failed
2015-08-06T02:37:24.060+0800    error: Closed explicitly
2015-08-06T02:37:24.111+0800    error: Closed explicitly
2015-08-06T02:37:24.111+0800    [#######################.]  ksRealtimeDB.ks-861  221.2 MB/223.1 MB  (99.2%)
2015-08-06T02:37:24.111+0800    [#######################.]  ksRealtimeDB.ks-718  222.3 MB/223.0 MB  (99.7%)
2015-08-06T02:37:24.111+0800    [##################......]  ksRealtimeDB.ks-804  172.8 MB/222.2 MB  (77.8%)
2015-08-06T02:37:24.111+0800    [........................]  ksRealtimeDB.ks-860    3.3 MB/218.5 MB   (1.5%)
2015-08-06T02:37:24.111+0800
2015-08-06T02:37:24.203+0800    error: Closed explicitly
2015-08-06T02:37:24.229+0800    error: Closed explicitly
2015-08-06T02:37:24.229+0800    no indexes to restore
2015-08-06T02:37:24.229+0800    finished restoring ksRealtimeDB.ks-718 (660668 documents)
2015-08-06T02:37:24.229+0800    restoring ksRealtimeDB.ks-878 from file dump/ksRealtimeDB/ks-878.bson
2015-08-06T02:47:23.091+0800    error: Closed explicitly
2015-08-06T02:47:23.127+0800    [#######################.]  ksRealtimeDB.ks-861  222.6 MB/223.1 MB  (99.8%)
2015-08-06T02:47:23.128+0800    [##################......]  ksRealtimeDB.ks-804  174.4 MB/222.2 MB  (78.5%)
2015-08-06T02:47:23.128+0800    [........................]  ksRealtimeDB.ks-860    3.6 MB/218.5 MB   (1.6%)
2015-08-06T02:47:23.128+0800
2015-08-06T02:47:23.128+0800    [########################]  ksRealtimeDB.ks-861  223.1 MB/223.1 MB  (100.0%)
2015-08-06T02:47:23.128+0800    [##################......]  ksRealtimeDB.ks-804  174.6 MB/222.2 MB   (78.6%)
2015-08-06T02:47:23.128+0800    [........................]  ksRealtimeDB.ks-860    4.6 MB/218.5 MB    (2.1%)
2015-08-06T02:47:23.128+0800
2015-08-06T02:47:23.156+0800    no indexes to restore
2015-08-06T02:47:23.156+0800    finished restoring ksRealtimeDB.ks-861 (685740 documents)
2015-08-06T02:47:23.160+0800    error: Closed explicitly
2015-08-06T02:47:23.201+0800    Failed: restore error: ksRealtimeDB.ks-860: error restoring from dump/ksRealtimeDB/ks-860.bson: insertion error: no reachable servers
 
# ps aux | grep mongo
mongodb   3966 21.8 88.4 3763336 3581324 ?     Ssl   8月05 253:38 /usr/bin/mongod --config /etc/mongod.conf

Comment by Praveen Akinapally [ 04/Aug/15 ]

We have upgraded our replicated, sharded MongoDB setup to the latest 3.0.5, but our memory usage issues didn't go away. We are facing the same issue as described by Arunas F: memory usage increases over time and only a restart releases the allocated memory. Even if we set the cache size to 1 GB or 2 GB it uses far more, and the system kills the process.

Today one of the four shard servers running on our primary instance failed with an out-of-memory error and a stack trace. It has a 13 GB cache size configured. Initially the mongod process took about 5 GB, and it gradually grew to 17 GB, using all of the system's memory. Below is the stack trace:

2015-08-04T17:21:56.567+0000 I COMMAND  [conn19473] command admin.$cmd command: replSetUpdatePosition { replSetUpdatePosition: 1, optimes: [ { _id: ObjectId('55a85f47244866407d69f5f7'), optime: Timestamp 1438366559000|90, memberId: 0, cfgver: 1, config: { _id: 0, host: "XXX.com:27022", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 } }, { _id: ObjectId('558de07e4547ad2a4f5537dd'), optime: Timestamp 1438708912000|290, memberId: 1, cfgver: 1, config: { _id: 1, host: "XXXX.com:27022", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 } }, { _id: ObjectId('558ea820d26ca3ea78419218'), optime: Timestamp 1438708912000|291, memberId: 2, cfgver: 1, config: { _id: 2, host: "XXX.com:27022", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } } ] } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:97 locks:{} 335ms
2015-08-04T17:21:56.847+0000 I COMMAND  [conn19518] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "sg0s01", pv: 1, v: 1, from: "XXX.com:27022", fromId: 2, checkEmpty: false } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:200 locks:{} 771ms
2015-08-04T17:21:57.001+0000 F -        [conn19515] out of memory.
 
 0xf5ba59 0xf5b589 0xede852 0xb9a5dc 0xaae04e 0xab340d 0x80e06d 0xf0e9fb 0x7f2eb2a9b182 0x7f2eb156430d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B5BA59"},{"b":"400000","o":"B5B589"},{"b":"400000","o":"ADE852"},{"b":"400000","o":"79A5DC"},{"b":"400000","o":"6AE04E"},{"b":"400000","o":"6B340D"},{"b":"400000","o":"40E06D"},{"b":"400000","o":"B0E9FB"},{"b":"7F2EB2A93000","o":"8182"},{"b":"7F2EB1469000","o":"FB30D"}],"processInfo":{ "mongodbVersion" : "3.0.5", "gitVersion" : "8bc4ae20708dbb493cb09338d9e7be6698e4a3a3", "uname" : { "sysname" : "Linux", "release" : "3.13.0-29-generic", "version" : "#53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "695FC6828398A9DB1F99718671147885B5ED116D" }, { "b" : "7FFF2EAFE000", "elfType" : 3, "buildId" : "3D068D088E7EAC15D9DA7C3AC912E783C0897EE7" }, { "b" : "7F2EB2A93000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "FE662C4D7B14EE804E0C1902FB55218A106BC5CB" }, { "b" : "7F2EB2835000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "6C7AE380840DB9034D7763771B55E51B31BCAF14" }, { "b" : "7F2EB245B000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "3D522D8E04F5FD7904AE69B50CA8835A71024490" }, { "b" : "7F2EB2253000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7F2EB204F000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7F2EB1D4B000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "19EFDDAB11B3BF5C71570078C59F91CF6592CE9E" }, { "b" : "7F2EB1A45000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "574C6350381DA194C00FF555E0C1784618C05569" }, { "b" : "7F2EB182F000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "CC0D578C2E0D86237CA7B0CE8913261C506A629A" }, { "b" : "7F2EB1469000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 
3, "buildId" : "B571F83A8A6F5BB22D3558CDDDA9F943A2A67FD1" }, { "b" : "7F2EB2CB1000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf5ba59]
 mongod(_ZN5mongo29reportOutOfMemoryErrorAndExitEv+0x49) [0xf5b589]
 mongod(_ZN5mongo11mongoMallocEm+0x22) [0xede852]
 mongod(_ZN5mongo7getMoreEPNS_16OperationContextEPKcixRNS_5CurOpEiRbPb+0x24C) [0xb9a5dc]
 mongod(_ZN5mongo15receivedGetMoreEPNS_16OperationContextERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0x35E) [0xaae04e]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x13CD) [0xab340d]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xDD) [0x80e06d]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x34B) [0xf0e9fb]
libpthread.so.0(+0x8182) [0x7f2eb2a9b182]
 libc.so.6(clone+0x6D) [0x7f2eb156430d]
-----  END BACKTRACE  -----

Comment by Arunas F [ 04/Aug/15 ]

We have the same problem: MongoDB 3.0.5 with the WiredTiger storage engine on CentOS 6. MongoDB runs standalone, without a replica set.
Memory usage increases over time and performance degrades (queries run slower and slower), until an OOM event occurs (we have no swap) and the kernel kills the mongod process.
There are no stack traces or similar in mongodb.log; the kernel just kills the process.

Restarting the process helps a lot (we do it at least twice a day now): it releases the allocated memory and performance is great again.

Setting storage.wiredTiger.engineConfig.cacheSizeGB to 1GB makes no difference, memory usage goes over the roof, until process is killed.

MongoDB usage is from light to moderate, mostly read operations (90%). Data size ~150GB uncompressed. Some documents quite large - about 200Kb.

Inspecting of serverStatus() showed that WT "open cursor count" and "open session count" grows boundlessly.

Software is using java 2.13 driver to connect to database.

We were using Mongo 2.X with mmapv1 for several years without any problems.
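The cursor/session growth described above can be watched programmatically against saved db.serverStatus() output; a minimal sketch in JavaScript (the `extractSessionStats` helper and its alert thresholds are illustrative, not part of MongoDB; the sample values are figures reported in another comment on this ticket):

```javascript
// Extract WiredTiger session/cursor counters from a db.serverStatus() document.
// Illustrative helper; the thresholds below are arbitrary examples, not
// official limits.
function extractSessionStats(status) {
  var s = status.wiredTiger.session;
  return {
    cursors: s["open cursor count"],
    sessions: s["open session count"],
    // Flag the unbounded growth reported in this ticket.
    suspicious: s["open cursor count"] > 10000 || s["open session count"] > 1000
  };
}

// Sample counters matching figures reported elsewhere in this ticket.
var sample = {
  wiredTiger: {
    session: { "open cursor count": 67390, "open session count": 15059 }
  }
};
console.log(extractSessionStats(sample));
```

Polling this once a minute and graphing the two counters makes the leak pattern obvious long before the OOM killer fires.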

Comment by Feng Yu [ 04/Aug/15 ]

I've tried setting this value to 3. The first time, mongod was killed by the kernel OOM killer without writing any stack trace to mongod.log. Then I killed all the other processes to free memory and tried again; this time mongod leveled off at 75% memory usage.

I will set storage.wiredTiger.engineConfig.cacheSizeGB to 4 and try again.

Comment by Bruce Lucas (Inactive) [ 04/Aug/15 ]

It is possible for SERVER-19673 to affect the primary as well. Would you be able to increase storage.wiredTiger.engineConfig.cacheSizeGB in order to hopefully induce mongod to run into an out-of-memory condition and terminate with a stack trace in mongod.log that may help us identify the cause?

Comment by Feng Yu [ 04/Aug/15 ]

I find that storage.wiredTiger.engineConfig.cacheSizeGB does not reduce the memory usage. With the default setting, after restoring a dump, mongod used 35% of memory.

# ps aux | grep mongo
mongodb    642  133 35.0 1643340 1419756 ?     Ssl  09:27 333:17 /usr/bin/mongod --config /etc/mongod.conf

But after I changed storage.wiredTiger.engineConfig.cacheSizeGB to 1 and ran mongorestore again, mongod used 38% of memory.

# ps aux | grep mongod
mongodb  30586  129 38.0 1743628 1515648 ?     Ssl  16:05 182:38 /usr/bin/mongod --config /etc/mongod.conf

Comment by Feng Yu [ 01/Aug/15 ]

I'm not using a replica set, just a single standalone node. This node is under a heavy write load.

Comment by Bruce Lucas (Inactive) [ 01/Aug/15 ]

Can you confirm my understanding that this latest issue is occurring on 3.0.5 on a secondary? We have discovered a new issue with memory consumption under WT, SERVER-19673, that is particularly likely to affect the secondary under heavy replication. If you would like to confirm that this is the issue you have hit, you can increase the cache size in order to intentionally allow the secondary to reach an out-of-memory condition, and this may generate stack traces in the log file that will confirm where the out-of-memory condition occurred.

Comment by Feng Yu [ 31/Jul/15 ]

Maybe there is still a memory leak. I've set storage.wiredTiger.engineConfig.cacheSizeGB to 1, but after restoring from a 180GB dump directory, mongod used 44% of memory.

# ps aux | grep mongo
mongodb   2987 33.4 44.0 1975648 1784104 ?     Ssl   7月29 859:45 /usr/bin/mongod --config /etc/mongod.conf

Comment by Daniel Pasette (Inactive) [ 29/Jul/15 ]

You can set the cache size in megabytes, rather than in whole gigabytes. This is undocumented in the MongoDB documentation, but is visible in the underlying WiredTiger docs.

For example, the following is a valid startup parameter to set your cache size to 200MB:

--wiredTigerEngineConfigString="cache_size=200M"
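For comparison, the whole-gigabyte option goes in the config file; a sketch of the 3.0 YAML form (values here are examples, not a recommendation):

```yaml
# /etc/mongod.conf -- MongoDB 3.0 YAML config format (example values).
storage:
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      # Whole-gigabyte granularity only; for sub-GB sizes use the
      # --wiredTigerEngineConfigString="cache_size=200M" startup flag
      # mentioned in the comment above.
      cacheSizeGB: 1
```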

Comment by Feng Yu [ 29/Jul/15 ]

I tried setting storage.wiredTiger.engineConfig.cacheSizeGB: 1 in /etc/mongod.conf, but mongod still uses 47.6% of memory. I usually have to restart mongod to release the memory.

# ps aux | grep mongo
root      4370  0.0  0.0  12156   928 pts/3    S+   21:12   0:00 grep --color=auto mongo
mongodb   5366  6.1 47.6 2087248 1929408 ?     Ssl  12:44  31:24 /usr/bin/mongod --config /etc/mongod.conf
 
# mongo
MongoDB shell version: 3.0.5
connecting to: test
> show dbs;
ksRealtimeDB     33.053GB
local             0.000GB
metadataMongoDB   0.014GB
> db.serverStatus()
{
	"host" : "ks",
	"version" : "3.0.5",
	"process" : "mongod",
	"pid" : NumberLong(5366),
	"uptime" : 30633,
	"uptimeMillis" : NumberLong(30632295),
	"uptimeEstimate" : 30334,
	"localTime" : ISODate("2015-07-29T13:14:55.076Z"),
	"asserts" : {
		"regular" : 0,
		"warning" : 0,
		"msg" : 0,
		"user" : 0,
		"rollovers" : 0
	},
	"connections" : {
		"current" : 11,
		"available" : 51189,
		"totalCreated" : NumberLong(11)
	},
	"cursors" : {
		"note" : "deprecated, use server status metrics",
		"clientCursors_size" : 0,
		"totalOpen" : 0,
		"pinned" : 0,
		"totalNoTimeout" : 0,
		"timedOut" : 0
	},
	"extra_info" : {
		"note" : "fields vary by platform",
		"heap_usage_bytes" : 1951008568,
		"page_faults" : 940
	},
	"globalLock" : {
		"totalTime" : NumberLong("30632271000"),
		"currentQueue" : {
			"total" : 0,
			"readers" : 0,
			"writers" : 0
		},
		"activeClients" : {
			"total" : 16,
			"readers" : 0,
			"writers" : 0
		}
	},
	"locks" : {
		"Global" : {
			"acquireCount" : {
				"r" : NumberLong(12778513),
				"w" : NumberLong(2239505),
				"W" : NumberLong(4)
			},
			"acquireWaitCount" : {
				"r" : NumberLong(1)
			},
			"timeAcquiringMicros" : {
				"r" : NumberLong(7224)
			}
		},
		"Database" : {
			"acquireCount" : {
				"r" : NumberLong(5269498),
				"w" : NumberLong(1172777),
				"R" : NumberLong(4),
				"W" : NumberLong(1066734)
			},
			"acquireWaitCount" : {
				"r" : NumberLong(2822),
				"W" : NumberLong(1375)
			},
			"timeAcquiringMicros" : {
				"r" : NumberLong(12840574),
				"W" : NumberLong(12555320)
			}
		},
		"Collection" : {
			"acquireCount" : {
				"r" : NumberLong(5617480),
				"w" : NumberLong(1172777)
			}
		}
	},
	"network" : {
		"bytesIn" : 654568095,
		"bytesOut" : 211110891,
		"numRequests" : 2252134
	},
	"opcounters" : {
		"insert" : 1172774,
		"query" : 1398,
		"update" : 0,
		"delete" : 0,
		"getmore" : 33,
		"command" : 1075299
	},
	"opcountersRepl" : {
		"insert" : 0,
		"query" : 0,
		"update" : 0,
		"delete" : 0,
		"getmore" : 0,
		"command" : 0
	},
	"storageEngine" : {
		"name" : "wiredTiger"
	},
	"wiredTiger" : {
		"uri" : "statistics:",
		"LSM" : {
			"sleep for LSM checkpoint throttle" : 0,
			"sleep for LSM merge throttle" : 0,
			"rows merged in an LSM tree" : 0,
			"application work units currently queued" : 0,
			"merge work units currently queued" : 0,
			"tree queue hit maximum" : 0,
			"switch work units currently queued" : 0,
			"tree maintenance operations scheduled" : 0,
			"tree maintenance operations discarded" : 0,
			"tree maintenance operations executed" : 0
		},
		"async" : {
			"number of allocation state races" : 0,
			"number of operation slots viewed for allocation" : 0,
			"current work queue length" : 0,
			"number of flush calls" : 0,
			"number of times operation allocation failed" : 0,
			"maximum work queue length" : 0,
			"number of times worker found no work" : 0,
			"total allocations" : 0,
			"total compact calls" : 0,
			"total insert calls" : 0,
			"total remove calls" : 0,
			"total search calls" : 0,
			"total update calls" : 0
		},
		"block-manager" : {
			"mapped bytes read" : 0,
			"bytes read" : 1964785664,
			"bytes written" : 20398272512,
			"mapped blocks read" : 0,
			"blocks pre-loaded" : 215404,
			"blocks read" : 471600,
			"blocks written" : 2791937
		},
		"cache" : {
			"tracked dirty bytes in the cache" : 562019570,
			"tracked bytes belonging to internal pages in the cache" : 47628364,
			"bytes currently in the cache" : 851385954,
			"tracked bytes belonging to leaf pages in the cache" : 1026113460,
			"maximum bytes configured" : 1073741824,
			"tracked bytes belonging to overflow pages in the cache" : 0,
			"bytes read into cache" : 232338402,
			"bytes written from cache" : 16237644684,
			"pages evicted by application threads" : 0,
			"checkpoint blocked page eviction" : 1,
			"unmodified pages evicted" : 450,
			"page split during eviction deepened the tree" : 0,
			"modified pages evicted" : 663,
			"pages selected for eviction unable to be evicted" : 89,
			"pages evicted because they exceeded the in-memory maximum" : 0,
			"pages evicted because they had chains of deleted items" : 0,
			"failed eviction of pages that exceeded the in-memory maximum" : 0,
			"hazard pointer blocked page eviction" : 1,
			"internal pages evicted" : 0,
			"maximum page size at eviction" : 1496235,
			"eviction server candidate queue empty when topping up" : 12,
			"eviction server candidate queue not empty when topping up" : 9,
			"eviction server evicting pages" : 93,
			"eviction server populating queue, but not evicting pages" : 14,
			"eviction server unable to reach eviction goal" : 0,
			"pages split during eviction" : 588,
			"pages walked for eviction" : 581490,
			"eviction worker thread evicting pages" : 1109,
			"in-memory page splits" : 0,
			"percentage overhead" : 8,
			"tracked dirty pages in the cache" : 1763,
			"pages currently held in the cache" : 10025,
			"pages read into cache" : 11109,
			"pages written from cache" : 1873598
		},
		"connection" : {
			"pthread mutex condition wait calls" : 1127704,
			"files currently open" : 1507,
			"memory allocations" : 40274294,
			"memory frees" : 33063359,
			"memory re-allocations" : 9711,
			"total read I/Os" : 473140,
			"pthread mutex shared lock read-lock calls" : 3299221,
			"pthread mutex shared lock write-lock calls" : 491083,
			"total write I/Os" : 3966531
		},
		"cursor" : {
			"cursor create calls" : 2029287,
			"cursor insert calls" : 4158946,
			"cursor next calls" : 1541183,
			"cursor prev calls" : 197271,
			"cursor remove calls" : 0,
			"cursor reset calls" : 7219504,
			"cursor search calls" : 5217589,
			"cursor search near calls" : 35225,
			"cursor update calls" : 0
		},
		"data-handle" : {
			"connection dhandles swept" : 0,
			"connection candidate referenced" : 0,
			"connection sweeps" : 3054,
			"connection time-of-death sets" : 458879,
			"session dhandles swept" : 1109,
			"session sweep attempts" : 537
		},
		"log" : {
			"log buffer size increases" : 0,
			"total log buffer size" : 8388608,
			"log bytes of payload data" : 388280660,
			"log bytes written" : 494305792,
			"yields waiting for previous log file close" : 0,
			"total size of compressed records" : 372863683,
			"total in-memory size of compressed records" : 909213742,
			"log records too small to compress" : 3207,
			"log records not compressed" : 88213,
			"log records compressed" : 1082733,
			"maximum log file size" : 104857600,
			"pre-allocated log files prepared" : 5,
			"number of pre-allocated log files to create" : 1,
			"pre-allocated log files used" : 4,
			"log read operations" : 0,
			"log release advances write LSN" : 1174153,
			"records processed by log scan" : 11,
			"log scan records requiring two reads" : 2,
			"log scan operations" : 5,
			"consolidated slot closures" : 0,
			"logging bytes consolidated" : 0,
			"consolidated slot joins" : 0,
			"consolidated slot join races" : 0,
			"slots selected for switching that were unavailable" : 0,
			"record size exceeded maximum" : 0,
			"failed to find a slot large enough for record" : 0,
			"consolidated slot join transitions" : 0,
			"log sync operations" : 1797,
			"log sync_dir operations" : 5,
			"log server thread advances write LSN" : 0,
			"log write operations" : 1174153
		},
		"reconciliation" : {
			"page reconciliation calls" : 1429966,
			"page reconciliation calls for eviction" : 393,
			"split bytes currently awaiting free" : 17288,
			"split objects currently awaiting free" : 16
		},
		"session" : {
			"open cursor count" : 3400,
			"open session count" : 17
		},
		"thread-yield" : {
			"page acquire busy blocked" : 0,
			"page acquire eviction blocked" : 0,
			"page acquire locked blocked" : 0,
			"page acquire read blocked" : 4672350,
			"page acquire time sleeping (usecs)" : 22619981
		},
		"transaction" : {
			"transaction begins" : 2255968,
			"transaction checkpoints" : 429,
			"transaction checkpoint generation" : 429,
			"transaction checkpoint currently running" : 1,
			"transaction checkpoint max time (msecs)" : 16326,
			"transaction checkpoint min time (msecs)" : 76,
			"transaction checkpoint most recent time (msecs)" : 6671,
			"transaction checkpoint total time (msecs)" : 3036981,
			"transactions committed" : 1173290,
			"transaction failures due to cache overflow" : 0,
			"transaction range of IDs currently pinned by a checkpoint" : 621,
			"transaction range of IDs currently pinned" : 0,
			"transactions rolled back" : 1082678
		},
		"concurrentTransactions" : {
			"write" : {
				"out" : 0,
				"available" : 128,
				"totalTickets" : 128
			},
			"read" : {
				"out" : 1,
				"available" : 127,
				"totalTickets" : 128
			}
		}
	},
	"writeBacksQueued" : false,
	"mem" : {
		"bits" : 64,
		"resident" : 2016,
		"virtual" : 2238,
		"supported" : true,
		"mapped" : 0,
		"mappedWithJournal" : 0
	},
	"metrics" : {
		"commands" : {
			"buildInfo" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(10)
			},
			"count" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(37)
			},
			"createIndexes" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(1066044)
			},
			"getLastError" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(10)
			},
			"getLog" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(1)
			},
			"isMaster" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(9192)
			},
			"listCollections" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(1)
			},
			"listDatabases" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(1)
			},
			"replSetGetStatus" : {
				"failed" : NumberLong(1),
				"total" : NumberLong(1)
			},
			"serverStatus" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(1)
			},
			"whatsmyuri" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(1)
			}
		},
		"cursor" : {
			"timedOut" : NumberLong(0),
			"open" : {
				"noTimeout" : NumberLong(0),
				"pinned" : NumberLong(0),
				"total" : NumberLong(0)
			}
		},
		"document" : {
			"deleted" : NumberLong(0),
			"inserted" : NumberLong(1172774),
			"returned" : NumberLong(194265),
			"updated" : NumberLong(0)
		},
		"getLastError" : {
			"wtime" : {
				"num" : 0,
				"totalMillis" : 0
			},
			"wtimeouts" : NumberLong(0)
		},
		"operation" : {
			"fastmod" : NumberLong(0),
			"idhack" : NumberLong(0),
			"scanAndOrder" : NumberLong(0),
			"writeConflicts" : NumberLong(0)
		},
		"queryExecutor" : {
			"scanned" : NumberLong(1340203),
			"scannedObjects" : NumberLong(1340203)
		},
		"record" : {
			"moves" : NumberLong(0)
		},
		"repl" : {
			"apply" : {
				"batches" : {
					"num" : 0,
					"totalMillis" : 0
				},
				"ops" : NumberLong(0)
			},
			"buffer" : {
				"count" : NumberLong(0),
				"maxSizeBytes" : 268435456,
				"sizeBytes" : NumberLong(0)
			},
			"network" : {
				"bytes" : NumberLong(0),
				"getmores" : {
					"num" : 0,
					"totalMillis" : 0
				},
				"ops" : NumberLong(0),
				"readersCreated" : NumberLong(0)
			},
			"preload" : {
				"docs" : {
					"num" : 0,
					"totalMillis" : 0
				},
				"indexes" : {
					"num" : 0,
					"totalMillis" : 0
				}
			}
		},
		"storage" : {
			"freelist" : {
				"search" : {
					"bucketExhausted" : NumberLong(0),
					"requests" : NumberLong(0),
					"scanned" : NumberLong(0)
				}
			}
		},
		"ttl" : {
			"deletedDocuments" : NumberLong(0),
			"passes" : NumberLong(509)
		}
	},
	"ok" : 1
}
> 

Comment by Feng Yu [ 29/Jul/15 ]

Hi Michael Cahill. This server runs many other services such as Tomcat, MySQL, Redis, and so on. MongoDB used so much memory that it forced heavy swapping, which degraded performance.

storage.wiredTiger.engineConfig.cacheSizeGB seems to be a useful option, but it can only be set to an integer number of gigabytes. Is it possible to set a fractional value like 0.5?

Lastly, dropping a database does not release the memory, which is not ideal; I have to restart the mongod service to release it. Maybe introducing garbage collection would be better, cleaning up stale data in memory regularly.

Comment by Michael Cahill (Inactive) [ 29/Jul/15 ]

abcfy2, that behavior sounds like what I would expect. The default WiredTiger cache size is half of RAM (so 2GB in your case), and it is not unusual for other memory allocations in mongod to total 400MB, so a total of 2.4GB is not surprising.

I gather that the load completed, so presumably there was no problem with memory growing larger and larger over time.

Is the memory use actually causing a problem? MongoDB has configured WiredTiger not to evict trees from cache unless there is memory pressure, so RAM use will not go down automatically just because a MongoDB workload ends.

The best starting point for memory sizing with WiredTiger is probably: http://docs.mongodb.org/manual/reference/configuration-options/#storage.wiredTiger.engineConfig.cacheSizeGB
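The default sizing rule described in this comment (half of physical RAM in MongoDB 3.0) can be sketched as a quick calculation; the function name is illustrative, and the formula applies to 3.0 only, since later versions changed it:

```javascript
// Default WiredTiger cache size in MongoDB 3.0, per the comment above:
// half of physical RAM. (Illustrative only; later versions use a
// different formula.)
function defaultCacheSizeGB(physicalRamGB) {
  return physicalRamGB / 2;
}

console.log(defaultCacheSizeGB(4)); // matches the 2GB figure cited above
```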

Comment by Feng Yu [ 29/Jul/15 ]

Hi developers. Was this issue fixed in 3.0.5? I see the related issue's fix version is 3.0.5, but when I upgraded to 3.0.5 this issue still appeared: WiredTiger used so much memory that it caused heavy swap usage (over 50%).

I tried to restore from a dump of about 200GB. Afterwards mongod used over 60% of memory (4GB physical), and db.dropDatabase() did not release the memory.

When will this issue be fixed? And what is the minimum amount of memory mongod needs?

Comment by Jean-Pierre [ 19/Jun/15 ]

I am experiencing the same issue, please help.

Here is the result of db.serverStatus({tcmalloc:true}):

	"storageEngine" : { "name" : "wiredTiger" },
	"tcmalloc" : {
		"generic" : { "current_allocated_bytes" : NumberLong("10818777200"), "heap_size" : NumberLong("13169491968") },
		"tcmalloc" : { "pageheap_free_bytes" : NumberLong(1584193536), "pageheap_unmapped_bytes" : 283467776, "max_total_thread_cache_bytes" : NumberLong(1073741824), "current_total_thread_cache_bytes" : 47305000, "central_cache_free_bytes" : 421448760, "transfer_cache_free_bytes" : 14299696, "thread_cache_free_bytes" : 47305000, "aggressive_memory_decommit" : 0 },
		"formattedString" : "-----------------------------------------------\nMALLOC: 10818777200 (10317.6 MiB) Bytes in use by application\nMALLOC: + 1584193536 ( 1510.8 MiB) Bytes in page heap freelist\nMALLOC: + 421448760 ( 401.9 MiB) Bytes in central cache freelist\nMALLOC: + 14299696 ( 13.6 MiB) Bytes in transfer cache freelist\nMALLOC: + 47305000 ( 45.1 MiB) Bytes in thread cache freelists\nMALLOC: + 39727264 ( 37.9 MiB) Bytes in malloc metadata\nMALLOC: ------------\nMALLOC: = 12925751456 (12327.0 MiB) Actual memory used (physical + swap)\nMALLOC: + 283467776 ( 270.3 MiB) Bytes released to OS (aka unmapped)\nMALLOC: ------------\nMALLOC: = 13209219232 (12597.3 MiB) Virtual address space used\nMALLOC:\nMALLOC: 437763 Spans in use\nMALLOC: 221 Thread heaps in use\nMALLOC: 8192 Tcmalloc page size\n-----------------------------------------------\nCall ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).\nBytes released to the OS take up virtual address space but no physical memory.\n"
	},
	"wiredTiger" : {
		"uri" : "statistics:",
		"LSM" : { "sleep for LSM checkpoint throttle" : 0, "sleep for LSM merge throttle" : 0, "rows merged in an LSM tree" : 0, "application work units currently queued" : 0, "merge work units currently queued" : 0, "tree queue hit maximum" : 0, "switch work units currently queued" : 0, "tree maintenance operations scheduled" : 0, "tree maintenance operations discarded" : 0, "tree maintenance operations executed" : 0 },
		"async" : { "number of allocation state races" : 0, "number of operation slots viewed for allocation" : 0, "current work queue length" : 0, "number of flush calls" : 0, "number of times operation allocation failed" : 0, "maximum work queue length" : 0, "number of times worker found no work" : 0, "total allocations" : 0, "total compact calls" : 0, "total insert calls" : 0, "total remove calls" : 0, "total search calls" : 0, "total update calls" : 0 },
		"block-manager" : { "mapped bytes read" : 0, "bytes read" : 161093787648, "bytes written" : 95925149696, "mapped blocks read" : 0, "blocks pre-loaded" : 1447, "blocks read" : 15805695, "blocks written" : 11314236 },
		"cache" : { "tracked dirty bytes in the cache" : 63008714, "tracked bytes belonging to internal pages in the cache" : 5224866301, "bytes currently in the cache" : 11166329226, "tracked bytes belonging to leaf pages in the cache" : 8733777411, "maximum bytes configured" : 13958643712, "tracked bytes belonging to overflow pages in the cache" : 0, "bytes read into cache" : 403477130712, "bytes written from cache" : 245260253309, "pages evicted by application threads" : 10, "checkpoint blocked page eviction" : 24548, "unmodified pages evicted" : 7523027, "page split during eviction deepened the tree" : 0, "modified pages evicted" : 7972566, "pages selected for eviction unable to be evicted" : 400587, "pages evicted because they exceeded the in-memory maximum" : 54, "pages evicted because they had chains of deleted items" : 6, "failed eviction of pages that exceeded the in-memory maximum" : 0, "hazard pointer blocked page eviction" : 332338, "internal pages evicted" : 326946, "maximum page size at eviction" : 10485970, "eviction server candidate queue empty when topping up" : 116131, "eviction server candidate queue not empty when topping up" : 1005839, "eviction server evicting pages" : 310014, "eviction server populating queue, but not evicting pages" : 1108831, "eviction server unable to reach eviction goal" : 5, "pages split during eviction" : 616954, "pages walked for eviction" : 12768203795, "eviction worker thread evicting pages" : 15592287, "in-memory page splits" : 51, "percentage overhead" : 8, "tracked dirty pages in the cache" : 4018, "pages currently held in the cache" : 303763, "pages read into cache" : 15805426, "pages written from cache" : 11313730 },
		"connection" : { "pthread mutex condition wait calls" : 2891587, "files currently open" : 27, "memory allocations" : 489279295, "memory frees" : 475486585, "memory re-allocations" : 9801, "total read I/Os" : 15805728, "pthread mutex shared lock read-lock calls" : 16867, "pthread mutex shared lock write-lock calls" : 73414, "total write I/Os" : 20721524 },
		"cursor" : { "cursor create calls" : 8208, "cursor insert calls" : 37627405, "cursor next calls" : 347708668, "cursor prev calls" : 1824, "cursor remove calls" : 18813460, "cursor reset calls" : 215092881, "cursor search calls" : 98875308, "cursor search near calls" : 130497814, "cursor update calls" : 0 },
		"data-handle" : { "connection dhandles swept" : 0, "connection candidate referenced" : 207, "connection sweeps" : 14423, "connection time-of-death sets" : 217, "session dhandles swept" : 487, "session sweep attempts" : 481 },
		"log" : { "log buffer size increases" : 0, "total log buffer size" : 1048576, "log bytes of payload data" : 10840117536, "log bytes written" : 11398885120, "yields waiting for previous log file close" : 0, "total size of compressed records" : 10840085895, "total in-memory size of compressed records" : 15422419601, "log records too small to compress" : 244, "log records not compressed" : 133, "log records compressed" : 9406776, "maximum log file size" : 104857600, "pre-allocated log files prepared" : 109, "number of pre-allocated log files to create" : 1, "pre-allocated log files used" : 108, "log read operations" : 0, "log release advances write LSN" : 9407153, "records processed by log scan" : 10, "log scan records requiring two reads" : 0, "log scan operations" : 5, "consolidated slot closures" : 0, "logging bytes consolidated" : 0, "consolidated slot joins" : 0, "consolidated slot join races" : 0, "slots selected for switching that were unavailable" : 0, "record size exceeded maximum" : 0, "failed to find a slot large enough for record" : 0, "consolidated slot join transitions" : 0, "log sync operations" : 422, "log sync_dir operations" : 102, "log server thread advances write LSN" : 0, "log write operations" : 9407153 },
		"reconciliation" : { "page reconciliation calls" : 10364692, "page reconciliation calls for eviction" : 6135116, "split bytes currently awaiting free" : 4824, "split objects currently awaiting free" : 6 },
		"session" : { "open cursor count" : 1374, "open session count" : 126 },
		"thread-yield" : { "page acquire busy blocked" : 543, "page acquire eviction blocked" : 0, "page acquire locked blocked" : 26980285, "page acquire read blocked" : 14675, "page acquire time sleeping (usecs)" : 24088026 },
		"transaction" : { "transaction begins" : 14443774, "transaction checkpoints" : 245, "transaction checkpoint generation" : 244, "transaction checkpoint currently running" : 1, "transaction checkpoint max time (msecs)" : 121972, "transaction checkpoint min time (msecs)" : 1, "transaction checkpoint most recent time (msecs)" : 56556, "transaction checkpoint total time (msecs)" : 2350919, "transactions committed" : 9406866, "transaction failures due to cache overflow" : 0, "transaction range of IDs currently pinned by a checkpoint" : 0, "transaction range of IDs currently pinned" : 1, "transactions rolled back" : 5036908 },
		"concurrentTransactions" : {
			"write" : { "out" : 0, "available" : 128, "totalTickets" : 128 },
			"read" : { "out" : 1, "available" : 127, "totalTickets" : 128 }
		}
	},
	"writeBacksQueued" : false,
	"mem" : { "bits" : 64, "resident" : 12546, "virtual" : 13495, "supported" : true, "mapped" : 0, "mappedWithJournal" : 0 },
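The gap between the tcmalloc heap size and the bytes actually in use by the application can be computed directly from the "generic" figures in the serverStatus output above; a small illustrative sketch (values copied from that output):

```javascript
// Difference between tcmalloc's heap and the bytes in use by the
// application, using the "generic" figures from the serverStatus
// output above. This gap is memory tcmalloc holds in freelists and
// caches rather than returning to the OS.
var heapSize = 13169491968;   // "heap_size"
var allocated = 10818777200;  // "current_allocated_bytes"
var heldByAllocator = heapSize - allocated;
console.log((heldByAllocator / 1024 / 1024).toFixed(1) + " MiB held by tcmalloc");
```

A large, growing value here points at allocator retention rather than at WiredTiger cache usage itself.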

Comment by Ramon Fernandez Marina [ 15/Jun/15 ]

All,

we believe the root cause of this issue is the same one reported in SERVER-17386, so I'm closing this ticket as a duplicate.

cronos, it could be useful if you could share your reproducer, so when SERVER-17386 is fixed we can verify that the fix addresses your specific scenario as well.

Regards,
Ramón.

Comment by Gal Waleczny [ 15/Jun/15 ]

"session" : { "open cursor count" : 67390, "open session count" : 15059 }

I have the same problem; I can reproduce it every time within 30 seconds, so let me know if you would like more data. I am running a simple benchmark using ab and Node.js, with two identical queries on two collections that return the same data. Memory usage climbs to 4 GB in 10-20 seconds while the benchmark is running. I've attached the status.json file.

Comment by Patrick Easters [ 03/Jun/15 ]

We've been running into this issue too. I've attached ay_mr01a_ss.txt with the output of db.serverStatus({tcmalloc:true}) taken just a few minutes before it was killed. The server had ~190MB of RAM free out of 20GB. This seems to happen about every 30 hours for us.

I'm on CentOS 7

Linux ay-prtp-mr01a.ayetier.systems 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Jun  3 10:36:09 ay-prtp-mr01a kernel: Out of memory: Kill process 2215 (mongod) score 975 or sacrifice child
Jun  3 10:36:09 ay-prtp-mr01a kernel: Killed process 2215 (mongod) total-vm:22956780kB, anon-rss:19821844kB, file-rss:0kB

Comment by yuantao.meng [ 29/May/15 ]

I am experiencing the same issue and hope this can help you.
uname -a
Linux xxx 2.6.32-504.3.3.el6.centos.plus.x86_64 #1 SMP Wed Dec 17 01:21:03 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

server_status.json

{
	"host" : "xxx",
	"version" : "3.0.3",
	"process" : "mongod",
	"pid" : NumberLong(1003),
	"uptime" : 234783,
	"uptimeMillis" : NumberLong(234782832),
	"uptimeEstimate" : 233275,
	"localTime" : ISODate("2015-05-29T20:42:06.124Z"),
	"asserts" : {
		"regular" : 0,
		"warning" : 0,
		"msg" : 0,
		"user" : 87,
		"rollovers" : 0
	},
	"connections" : {
		"current" : 98,
		"available" : 81822,
		"totalCreated" : NumberLong(16881)
	},
	"cursors" : {
		"note" : "deprecated, use server status metrics",
		"clientCursors_size" : 1,
		"totalOpen" : 1,
		"pinned" : 0,
		"totalNoTimeout" : 0,
		"timedOut" : 3
	},
	"extra_info" : {
		"note" : "fields vary by platform",
		"heap_usage_bytes" : 1792327968,
		"page_faults" : 5287
	},
	"globalLock" : {
		"totalTime" : NumberLong("234782976000"),
		"currentQueue" : {
			"total" : 0,
			"readers" : 0,
			"writers" : 0
		},
		"activeClients" : {
			"total" : 124,
			"readers" : 0,
			"writers" : 0
		}
	},
	"locks" : {
		"Global" : {
			"acquireCount" : {
				"r" : NumberLong(11671352),
				"w" : NumberLong(78476913),
				"R" : NumberLong(203),
				"W" : NumberLong(8)
			},
			"acquireWaitCount" : {
				"r" : NumberLong(1),
				"w" : NumberLong(58),
				"R" : NumberLong(1),
				"W" : NumberLong(1)
			},
			"timeAcquiringMicros" : {
				"r" : NumberLong(1736),
				"w" : NumberLong(495182),
				"R" : NumberLong(300591),
				"W" : NumberLong(4)
			}
		},
		"Database" : {
			"acquireCount" : {
				"r" : NumberLong(11671139),
				"w" : NumberLong(64296245),
				"R" : NumberLong(213),
				"W" : NumberLong(14180668)
			},
			"acquireWaitCount" : {
				"r" : NumberLong(217),
				"w" : NumberLong(726),
				"W" : NumberLong(230)
			},
			"timeAcquiringMicros" : {
				"r" : NumberLong(15032),
				"w" : NumberLong(62960),
				"W" : NumberLong(4285)
			}
		},
		"Collection" : {
			"acquireCount" : {
				"r" : NumberLong(2592648),
				"w" : NumberLong(48184423)
			}
		},
		"oplog" : {
			"acquireCount" : {
				"r" : NumberLong(9129204),
				"w" : NumberLong(16111823)
			}
		}
	},
	"network" : {
		"bytesIn" : 7260734541,
		"bytesOut" : 20284843790,
		"numRequests" : 20453492
	},
	"opcounters" : {
		"insert" : 1216138,
		"query" : 971862,
		"update" : 6429396,
		"delete" : 177781,
		"getmore" : 4053392,
		"command" : 7463080
	},
	"opcountersRepl" : {
		"insert" : 22306881,
		"query" : 0,
		"update" : 3202,
		"delete" : 17996278,
		"getmore" : 0,
		"command" : 0
	},
	"repl" : {
		"setName" : "gifshow_comment",
		"setVersion" : 3,
		"ismaster" : true,
		"secondary" : false,
		"hosts" : [
			"s3.zw:27018",
			"sd-d6.zw:27018",
			"sd-d7.zw:27018"
		],
		"primary" : "sd-d7.zw:27018",
		"me" : "sd-d7.zw:27018",
		"electionId" : ObjectId("55674b8ec486a02be5c82573"),
		"rbid" : 404533360
	},
	"storageEngine" : {
		"name" : "wiredTiger"
	},
	"tcmalloc" : {
		"generic" : {
			"current_allocated_bytes" : NumberLong("49036969696"),
			"heap_size" : NumberLong("49967849472")
		},
		"tcmalloc" : {
			"pageheap_free_bytes" : 272842752,
			"pageheap_unmapped_bytes" : 123133952,
			"max_total_thread_cache_bytes" : NumberLong(1073741824),
			"current_total_thread_cache_bytes" : 53402688,
			"central_cache_free_bytes" : 453545296,
			"transfer_cache_free_bytes" : 27955088,
			"thread_cache_free_bytes" : 53402688,
			"aggressive_memory_decommit" : 0
		},
		"formattedString" : "------------------------------------------------\nMALLOC:    49036969696 (46765.3 MiB) Bytes in use by application\nMALLOC: +    272842752 (  260.2 MiB) Bytes in page heap freelist\nMALLOC: +    453545296 (  432.5 MiB) Bytes in central cache freelist\nMALLOC: +     27955088 (   26.7 MiB) Bytes in transfer cache freelist\nMALLOC: +     53402688 (   50.9 MiB) Bytes in thread cache freelists\nMALLOC: +    170029216 (  162.2 MiB) Bytes in malloc metadata\nMALLOC:   ------------\nMALLOC: =  50014744736 (47697.8 MiB) Actual memory used (physical + swap)\nMALLOC: +    123133952 (  117.4 MiB) Bytes released to OS (aka unmapped)\nMALLOC:   ------------\nMALLOC: =  50137878688 (47815.2 MiB) Virtual address space used\nMALLOC:\nMALLOC:        2290756              Spans in use\nMALLOC:            147              Thread heaps in use\nMALLOC:           8192              Tcmalloc page size\n------------------------------------------------\nCall ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).\nBytes released to the OS take up virtual address space but no physical memory.\n"
	},
	"wiredTiger" : {
		"uri" : "statistics:",
		"LSM" : {
			"sleep for LSM checkpoint throttle" : 0,
			"sleep for LSM merge throttle" : 0,
			"rows merged in an LSM tree" : 0,
			"application work units currently queued" : 0,
			"merge work units currently queued" : 0,
			"tree queue hit maximum" : 0,
			"switch work units currently queued" : 0,
			"tree maintenance operations scheduled" : 0,
			"tree maintenance operations discarded" : 0,
			"tree maintenance operations executed" : 0
		},
		"async" : {
			"number of allocation state races" : 0,
			"number of operation slots viewed for allocation" : 0,
			"current work queue length" : 0,
			"number of flush calls" : 0,
			"number of times operation allocation failed" : 0,
			"maximum work queue length" : 0,
			"number of times worker found no work" : 0,
			"total allocations" : 0,
			"total compact calls" : 0,
			"total insert calls" : 0,
			"total remove calls" : 0,
			"total search calls" : 0,
			"total update calls" : 0
		},
		"block-manager" : {
			"mapped bytes read" : 0,
			"bytes read" : 24622915584,
			"bytes written" : 351526281216,
			"mapped blocks read" : 0,
			"blocks pre-loaded" : 2750,
			"blocks read" : 1470378,
			"blocks written" : 30992724
		},
		"cache" : {
			"tracked dirty bytes in the cache" : 148551891,
			"tracked bytes belonging to internal pages in the cache" : 165144664,
			"bytes currently in the cache" : 50610056164,
			"tracked bytes belonging to leaf pages in the cache" : 66406848424,
			"maximum bytes configured" : 66571993088,
			"tracked bytes belonging to overflow pages in the cache" : 0,
			"bytes read into cache" : 29445741246,
			"bytes written from cache" : 313199529011,
			"pages evicted by application threads" : 0,
			"checkpoint blocked page eviction" : 13226,
			"unmodified pages evicted" : 0,
			"page split during eviction deepened the tree" : 2,
			"modified pages evicted" : 78115,
			"pages selected for eviction unable to be evicted" : 2482,
			"pages evicted because they exceeded the in-memory maximum" : 3169,
			"pages evicted because they had chains of deleted items" : 9898,
			"failed eviction of pages that exceeded the in-memory maximum" : 889,
			"hazard pointer blocked page eviction" : 1705,
			"internal pages evicted" : 356,
			"maximum page size at eviction" : 10497085,
			"eviction server candidate queue empty when topping up" : 3129,
			"eviction server candidate queue not empty when topping up" : 196,
			"eviction server evicting pages" : 18630,
			"eviction server populating queue, but not evicting pages" : 173,
			"eviction server unable to reach eviction goal" : 0,
			"pages split during eviction" : 2669,
			"pages walked for eviction" : 562506307,
			"eviction worker thread evicting pages" : 50607,
			"in-memory page splits" : 2587,
			"percentage overhead" : 8,
			"tracked dirty pages in the cache" : 205,
			"pages currently held in the cache" : 1297333,
			"pages read into cache" : 1368314,
			"pages written from cache" : 30788552
		},
		"connection" : {
			"pthread mutex condition wait calls" : 16758169,
			"files currently open" : 48,
			"memory allocations" : 954581752,
			"memory frees" : 785342202,
			"memory re-allocations" : 822914,
			"total read I/Os" : 1522884,
			"pthread mutex shared lock read-lock calls" : 765801,
			"pthread mutex shared lock write-lock calls" : 432862,
			"total write I/Os" : 89470557
		},
		"cursor" : {
			"cursor create calls" : 459706,
			"cursor insert calls" : 177472350,
			"cursor next calls" : 107673340,
			"cursor prev calls" : 2888903,
			"cursor remove calls" : 99771867,
			"cursor reset calls" : 416692862,
			"cursor search calls" : 96158328,
			"cursor search near calls" : 72805618,
			"cursor update calls" : 0
		},
		"data-handle" : {
			"connection dhandles swept" : 0,
			"connection candidate referenced" : 98524,
			"connection sweeps" : 36372,
			"connection time-of-death sets" : 98554,
			"session dhandles swept" : 98500,
			"session sweep attempts" : 4381
		},
		"log" : {
			"log buffer size increases" : 1,
			"total log buffer size" : 7888896,
			"log bytes of payload data" : 14434150786,
			"log bytes written" : 18157866880,
			"yields waiting for previous log file close" : 0,
			"total size of compressed records" : 4911046605,
			"total in-memory size of compressed records" : 9835096752,
			"log records too small to compress" : 25076133,
			"log records not compressed" : 35432212,
			"log records compressed" : 9736540,
			"maximum log file size" : 104857600,
			"pre-allocated log files prepared" : 178,
			"number of pre-allocated log files to create" : 1,
			"pre-allocated log files used" : 177,
			"log read operations" : 0,
			"log release advances write LSN" : 11242,
			"records processed by log scan" : 33067,
			"log scan records requiring two reads" : 19351,
			"log scan operations" : 3,
			"consolidated slot closures" : 58475323,
			"logging bytes consolidated" : 18569641984,
			"consolidated slot joins" : 70347430,
			"consolidated slot join races" : 499137,
			"slots selected for switching that were unavailable" : 46233890,
			"record size exceeded maximum" : 0,
			"failed to find a slot large enough for record" : 1,
			"consolidated slot join transitions" : 204318,
			"log sync operations" : 10933,
			"log sync_dir operations" : 155,
			"log server thread advances write LSN" : 58464391,
			"log write operations" : 70262480
		},
		"reconciliation" : {
			"page reconciliation calls" : 22727633,
			"page reconciliation calls for eviction" : 5662,
			"split bytes currently awaiting free" : 483823,
			"split objects currently awaiting free" : 6
		},
		"session" : {
			"open cursor count" : 4080,
			"open session count" : 140
		},
		"thread-yield" : {
			"page acquire busy blocked" : 5,
			"page acquire eviction blocked" : 324,
			"page acquire locked blocked" : 3306197,
			"page acquire read blocked" : 263169319,
			"page acquire time sleeping (usecs)" : 795089487
		},
		"transaction" : {
			"transaction begins" : 80889468,
			"transaction checkpoints" : 3546,
			"transaction checkpoint generation" : 3546,
			"transaction checkpoint currently running" : 0,
			"transaction checkpoint max time (msecs)" : 32295,
			"transaction checkpoint min time (msecs)" : 2,
			"transaction checkpoint most recent time (msecs)" : 1551,
			"transaction checkpoint total time (msecs)" : 8492572,
			"transactions committed" : 70285520,
			"transaction failures due to cache overflow" : 0,
			"transaction range of IDs currently pinned by a checkpoint" : 0,
			"transaction range of IDs currently pinned" : 39,
			"transactions rolled back" : 10666794
		},
		"concurrentTransactions" : {
			"write" : {
				"out" : 0,
				"available" : 128,
				"totalTickets" : 128
			},
			"read" : {
				"out" : 1,
				"available" : 127,
				"totalTickets" : 128
			}
		}
	},
	"writeBacksQueued" : false,
	"mem" : {
		"bits" : 64,
		"resident" : 47584,
		"virtual" : 48632,
		"supported" : true,
		"mapped" : 0,
		"mappedWithJournal" : 0
	},
	"metrics" : {
		"commands" : {
			"buildInfo" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(1318)
			},
			"delete" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(177781)
			},
			"getLastError" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(1318)
			},
			"getLog" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(4)
			},
			"getnonce" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(1)
			},
			"insert" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(1196251)
			},
			"isMaster" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(1046456)
			},
			"listCollections" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(2)
			},
			"ping" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(19)
			},
			"replSetElect" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(25)
			},
			"replSetFresh" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(106)
			},
			"replSetGetStatus" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(133)
			},
			"replSetHeartbeat" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(232451)
			},
			"replSetUpdatePosition" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(6181191)
			},
			"serverStatus" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(43)
			},
			"update" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(6429396)
			},
			"whatsmyuri" : {
				"failed" : NumberLong(0),
				"total" : NumberLong(13)
			}
		},
		"cursor" : {
			"timedOut" : NumberLong(3),
			"open" : {
				"noTimeout" : NumberLong(0),
				"pinned" : NumberLong(0),
				"total" : NumberLong(1)
			}
		},
		"document" : {
			"deleted" : NumberLong(148383),
			"inserted" : NumberLong(1216138),
			"returned" : NumberLong(17187205),
			"updated" : NumberLong(6429396)
		},
		"getLastError" : {
			"wtime" : {
				"num" : 0,
				"totalMillis" : 0
			},
			"wtimeouts" : NumberLong(0)
		},
		"operation" : {
			"fastmod" : NumberLong(0),
			"idhack" : NumberLong(721797),
			"scanAndOrder" : NumberLong(10),
			"writeConflicts" : NumberLong(17)
		},
		"queryExecutor" : {
			"scanned" : NumberLong(1589246),
			"scannedObjects" : NumberLong(35525321)
		},
		"record" : {
			"moves" : NumberLong(0)
		},
		"repl" : {
			"apply" : {
				"batches" : {
					"num" : 7090319,
					"totalMillis" : 482498
				},
				"ops" : NumberLong(40306361)
			},
			"buffer" : {
				"count" : NumberLong(0),
				"maxSizeBytes" : 268435456,
				"sizeBytes" : NumberLong(0)
			},
			"network" : {
				"bytes" : NumberLong("6927059542"),
				"getmores" : {
					"num" : 6996686,
					"totalMillis" : 116533294
				},
				"ops" : NumberLong(40306361),
				"readersCreated" : NumberLong(1)
			},
			"preload" : {
				"docs" : {
					"num" : 0,
					"totalMillis" : 0
				},
				"indexes" : {
					"num" : 0,
					"totalMillis" : 0
				}
			}
		},
		"storage" : {
			"freelist" : {
				"search" : {
					"bucketExhausted" : NumberLong(0),
					"requests" : NumberLong(0),
					"scanned" : NumberLong(0)
				}
			}
		},
		"ttl" : {
			"deletedDocuments" : NumberLong(8083424),
			"passes" : NumberLong(3901)
		}
	},
	"ok" : 1
}
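The tcmalloc `formattedString` above is just the structured stats fields re-summed: "Actual memory used" is application bytes plus all freelists plus malloc metadata, and adding the unmapped bytes gives the virtual address space figure. A minimal sketch of that accounting, using the figures reported above (the property names in the input object are my own labels, not serverStatus field names):

```javascript
// Reconstruct tcmalloc's memory accounting from the serverStatus output above.
// "Actual memory used" = application bytes + all freelists + malloc metadata;
// adding the bytes released to the OS (unmapped) gives the virtual figure.
function tcmallocTotals(s) {
  var actual =
    s.bytesInUse +
    s.pageheapFreeBytes +
    s.centralCacheFreeBytes +
    s.transferCacheFreeBytes +
    s.threadCacheFreeBytes +
    s.mallocMetadataBytes;
  return { actual: actual, virtual: actual + s.pageheapUnmappedBytes };
}

// Figures taken from the formattedString above.
var totals = tcmallocTotals({
  bytesInUse: 49036969696,          // Bytes in use by application
  pageheapFreeBytes: 272842752,     // Bytes in page heap freelist
  centralCacheFreeBytes: 453545296, // Bytes in central cache freelist
  transferCacheFreeBytes: 27955088, // Bytes in transfer cache freelist
  threadCacheFreeBytes: 53402688,   // Bytes in thread cache freelists
  mallocMetadataBytes: 170029216,   // Bytes in malloc metadata
  pageheapUnmappedBytes: 123133952  // Bytes released to OS (aka unmapped)
});

console.log(totals.actual);  // 50014744736 -- matches "Actual memory used"
console.log(totals.virtual); // 50137878688 -- matches "Virtual address space used"
```

Note that the ~46.8 GB in use by the application is well above the 62 GB cache cap only when combined with non-cache allocations, which is what this ticket is about.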

Comment by Bruce Lucas (Inactive) [ 28/May/15 ]

mamoos1, to help us determine the cause, we would like to look at serverStatus information.

  • If the node is currently in a state where memory utilization is high, can you please run the following mongo shell command and post the complete output here:

    db.serverStatus({tcmalloc:true})
    

  • If it is not currently in the state where it's using a lot of memory, or if you are willing to restart the mongod so we can monitor it while memory usage grows, can you please start the following monitoring process on the node that has the problem:

      mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(10*1000)}" >ss.log &
    

    When it reaches a point where it's using a lot of memory (for example the OS kills it) please attach the ss.log file for our analysis.
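    As a rough sketch of what we do with that file (assuming one JSON document per line in ss.log, and Node.js rather than the mongo shell; the field paths are those reported by serverStatus with `tcmalloc:1`), the log can be reduced to a memory-over-time series like this:

    ```javascript
    // Sketch: reduce an ss.log (one serverStatus JSON document per line) to a
    // time series of tcmalloc heap size vs WiredTiger cache size, to spot
    // steady growth outside the cache.
    function heapSeries(logText) {
      return logText
        .split('\n')
        .filter(function (line) { return line.trim().length > 0; })
        .map(function (line) {
          var ss = JSON.parse(line);
          return {
            time: ss.localTime,
            heapBytes: ss.tcmalloc.generic.heap_size,
            cacheBytes: ss.wiredTiger.cache['bytes currently in the cache']
          };
        });
    }
    ```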

Thanks,
Bruce

Comment by Roy [ 25/May/15 ]

Hello,

I am experiencing the exact same issue; however, I do not have multiple cursors open.

When I look in MMS, it looks as if resident memory is slowly rising until the process is ultimately killed by the OS.
I'm using MongoDB 3.0.3 on Ubuntu 14.04.2, with wiredTiger as the storage engine. The node is a secondary in a replica set whose other members run 2.6.9; I'm not sure if that's relevant.
I have no special memory configuration (presumably the cache is configured as half of the available memory; that should be the default, right?).
I'm not sure where to start debugging, since memory rises slowly but steadily while all other metrics (cursors, page faults, opcounters, connections, storage, queues, etc.) remain fairly constant.
The exact same application used to run against a 2.6.9 node and worked perfectly. It runs various queries, including counts, aggregations, and map reduces.
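For reference, the configured cache cap is visible in serverStatus as `wiredTiger.cache["maximum bytes configured"]`; a minimal sketch for reading it (in the output at the top of this ticket, the value 66571993088 works out to exactly 62 GB):

```javascript
// Sketch: read the configured WiredTiger cache size, in GB, from a
// serverStatus document.
function wtCacheConfiguredGB(serverStatus) {
  var bytes = serverStatus.wiredTiger.cache['maximum bytes configured'];
  return bytes / (1024 * 1024 * 1024);
}

// Using the value from the serverStatus output in this ticket:
console.log(wtCacheConfiguredGB({
  wiredTiger: { cache: { 'maximum bytes configured': 66571993088 } }
})); // 62
```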

Comment by Vasily Timofeev [ 30/Apr/15 ]

Hello, I added manual cursor closing and now it works well (mmapv1).

Comment by Vasily Timofeev [ 30/Apr/15 ]

Server status when 85% memory is used.

 
> db.serverStatus()
{
        "host" : "ws-9-mongodb",
        "version" : "3.0.2",
        "process" : "mongod",
        "pid" : NumberLong(13182),
        "uptime" : 457,
        "uptimeMillis" : NumberLong(457300),
        "uptimeEstimate" : 446,
        "localTime" : ISODate("2015-04-30T11:31:45.767Z"),
        "asserts" : {
                "regular" : 0,
                "warning" : 0,
                "msg" : 0,
                "user" : 0,
                "rollovers" : 0
        },
        "backgroundFlushing" : {
                "flushes" : 7,
                "total_ms" : 53,
                "average_ms" : 7.571428571428571,
                "last_ms" : 30,
                "last_finished" : ISODate("2015-04-30T11:31:08.574Z")
        },
        "connections" : {
                "current" : 1,
                "available" : 51199,
                "totalCreated" : NumberLong(606)
        },
        "cursors" : {
                "note" : "deprecated, use server status metrics",
                "clientCursors_size" : 160332,
                "totalOpen" : 160332,
                "pinned" : 0,
                "totalNoTimeout" : 0,
                "timedOut" : 0
        },
        "dur" : {
                "commits" : 4,
                "journaledMB" : 0,
                "writeToDataFilesMB" : 0,
                "compression" : 0,
                "commitsInWriteLock" : 0,
                "earlyCommits" : 0,
                "timeMs" : {
                        "dt" : 0,
                        "prepLogBuffer" : 0,
                        "writeToJournal" : 0,
                        "writeToDataFiles" : 0,
                        "remapPrivateView" : 0,
                        "commits" : 0,
                        "commitsInWriteLock" : 0
                }
        },
        "extra_info" : {
                "note" : "fields vary by platform",
                "heap_usage_bytes" : 1713573064,
                "page_faults" : 30
        },
        "globalLock" : {
                "totalTime" : NumberLong(457299000),
                "currentQueue" : {
                        "total" : 0,
                        "readers" : 0,
                        "writers" : 0
                },
                "activeClients" : {
                        "total" : 11,
                        "readers" : 0,
                        "writers" : 0
                }
        },
        "locks" : {
                "Global" : {
                        "acquireCount" : {
                                "r" : NumberLong(178491),
                                "w" : NumberLong(18031),
                                "W" : NumberLong(5)
                        }
                },
                "MMAPV1Journal" : {
                        "acquireCount" : {
                                "r" : NumberLong(209740),
                                "w" : NumberLong(45129),
                                "R" : NumberLong(4426),
                                "W" : NumberLong(98)
                        },
                        "acquireWaitCount" : {
                                "r" : NumberLong(132),
                                "w" : NumberLong(200),
                                "R" : NumberLong(51),
                                "W" : NumberLong(20)
                        },
                        "timeAcquiringMicros" : {
                                "r" : NumberLong(315583),
                                "w" : NumberLong(821871),
                                "R" : NumberLong(84356),
                                "W" : NumberLong(9091)
                        }
                },
                "Database" : {
                        "acquireCount" : {
                                "r" : NumberLong(178488),
                                "w" : NumberLong(17813),
                                "R" : NumberLong(3),
                                "W" : NumberLong(218)
                        },
                        "acquireWaitCount" : {
                                "w" : NumberLong(1),
                                "W" : NumberLong(18)
                        },
                        "timeAcquiringMicros" : {
                                "w" : NumberLong(68),
                                "W" : NumberLong(2842)
                        }
                },
                "Collection" : {
                        "acquireCount" : {
                                "R" : NumberLong(178565),
                                "W" : NumberLong(17815)
                        },
                        "acquireWaitCount" : {
                                "R" : NumberLong(31251),
                                "W" : NumberLong(9244)
                        },
                        "timeAcquiringMicros" : {
                                "R" : NumberLong(113015872),
                                "W" : NumberLong(19730141)
                        }
                },
                "Metadata" : {
                        "acquireCount" : {
                                "W" : NumberLong(1)
                        }
                }
        },
        "network" : {
                "bytesIn" : 30605343,
                "bytesOut" : 43043585,
                "numRequests" : 196383
        },
        "opcounters" : {
                "insert" : 17814,
                "query" : 176583,
                "update" : 1,
                "delete" : 0,
                "getmore" : 0,
                "command" : 1983
        },
        "opcountersRepl" : {
                "insert" : 0,
                "query" : 0,
                "update" : 0,
                "delete" : 0,
                "getmore" : 0,
                "command" : 0
        },
        "storageEngine" : {
                "name" : "mmapv1"
        },
        "writeBacksQueued" : false,
        "mem" : {
                "bits" : 64,
                "resident" : 1716,
                "virtual" : 6333,
                "supported" : true,
                "mapped" : 2160,
                "mappedWithJournal" : 4320
        },
        "metrics" : {
                "commands" : {
                        "createIndexes" : {
                                "failed" : NumberLong(0),
                                "total" : NumberLong(200)
                        },
                        "createUser" : {
                                "failed" : NumberLong(0),
                                "total" : NumberLong(1)
                        },
                        "getLastError" : {
                                "failed" : NumberLong(0),
                                "total" : NumberLong(2)
                        },
                        "getLog" : {
                                "failed" : NumberLong(0),
                                "total" : NumberLong(3)
                        },
                        "insert" : {
                                "failed" : NumberLong(0),
                                "total" : NumberLong(17813)
                        },
                        "isMaster" : {
                                "failed" : NumberLong(0),
                                "total" : NumberLong(235)
                        },
                        "replSetGetStatus" : {
                                "failed" : NumberLong(3),
                                "total" : NumberLong(3)
                        },
                        "saslContinue" : {
                                "failed" : NumberLong(0),
                                "total" : NumberLong(1012)
                        },
                        "saslStart" : {
                                "failed" : NumberLong(0),
                                "total" : NumberLong(506)
                        },
                        "serverStatus" : {
                                "failed" : NumberLong(0),
                                "total" : NumberLong(13)
                        },
                        "usersInfo" : {
                                "failed" : NumberLong(0),
                                "total" : NumberLong(2)
                        },
                        "whatsmyuri" : {
                                "failed" : NumberLong(0),
                                "total" : NumberLong(6)
                        }
                },
                "cursor" : {
                        "timedOut" : NumberLong(0),
                        "open" : {
                                "noTimeout" : NumberLong(0),
                                "pinned" : NumberLong(0),
                                "total" : NumberLong(160332)
                        }
                },
                "document" : {
                        "deleted" : NumberLong(0),
                        "inserted" : NumberLong(17814),
                        "returned" : NumberLong(1603322),
                        "updated" : NumberLong(1)
                },
                "getLastError" : {
                        "wtime" : {
                                "num" : 0,
                                "totalMillis" : 0
                        },
                        "wtimeouts" : NumberLong(0)
                },
                "operation" : {
                        "fastmod" : NumberLong(0),
                        "idhack" : NumberLong(0),
                        "scanAndOrder" : NumberLong(0),
                        "writeConflicts" : NumberLong(0)
                },
                "queryExecutor" : {
                        "scanned" : NumberLong(1683805),
                        "scannedObjects" : NumberLong(1603329)
                },
                "record" : {
                        "moves" : NumberLong(0)
                },
                "repl" : {
                        "apply" : {
                                "batches" : {
                                        "num" : 0,
                                        "totalMillis" : 0
                                },
                                "ops" : NumberLong(0)
                        },
                        "buffer" : {
                                "count" : NumberLong(0),
                                "maxSizeBytes" : 268435456,
                                "sizeBytes" : NumberLong(0)
                        },
                        "network" : {
                                "bytes" : NumberLong(0),
                                "getmores" : {
                                        "num" : 0,
                                        "totalMillis" : 0
                                },
                                "ops" : NumberLong(0),
                                "readersCreated" : NumberLong(0)
                        },
                        "preload" : {
                                "docs" : {
                                        "num" : 0,
                                        "totalMillis" : 0
                                },
                                "indexes" : {
                                        "num" : 0,
                                        "totalMillis" : 0
                                }
                        }
                },
                "storage" : {
                        "freelist" : {
                                "search" : {
                                        "bucketExhausted" : NumberLong(0),
                                        "requests" : NumberLong(18147),
                                        "scanned" : NumberLong(0)
                                }
                        }
                },
                "ttl" : {
                        "deletedDocuments" : NumberLong(0),
                        "passes" : NumberLong(7)
                }
        },
        "ok" : 1
}

Comment by Vasily Timofeev [ 30/Apr/15 ]

Now I have caught the same issue on mmapv1, after 3 minutes and ~200,000 insert/find operations.

Comment by Vasily Timofeev [ 30/Apr/15 ]

After ~10 seconds and ~10,000 insert/find operations:

https://yadi.sk/i/YuWf5yebgMCVR

Comment by Vasily Timofeev [ 30/Apr/15 ]

Hi, Dan.

In the last test, the mongo server (3.0.2, engine=wiredTiger) showed a huge memory leak after 10 seconds of execution (~12,000 operations: 10% insert, 90% find).

Code:

 
var bm = require('../libs/basic/basic-mongo');
var async = require('async');
var _ = require('lodash');
var DbState = bm.DbState;
var s = { i: 0, e: 0, f: 0 };
var generator = {
    messageRecord: function () {
        var rnd = _.random(100000000000);
        var rndTo = _.random(100000000000);
        return {
            toId: rndTo + 'to',
            message: 'one two three four five one two three four five one two three four five one two three four five one two three four five one two three four five one two three four five one two three four five one two three four five one two three four five ',
            ownerId: rnd + 'user',
            updatedAt: Date.now(),
            createdAt: Date.now()
        };
    },
    userRecord: function () {
        var rnd = _.random(100000000000);
        var cityRnd = _.random(100000);
        return {
            name: 'name ' + rnd,
            email: 'mail' + rnd + '@gmail.com',
            phone: rnd,
            age: _.random(100),
            gender: _.random(1),
            cityId: cityRnd,
            loc: { type: 'Point', coordinates: [180 - Math.random() * 360, 90 - Math.random() * 180] },
            groups: [],
            updatedAt: Date.now(),
            createdAt: Date.now()
        };
    },
    getUserRecord: function () {
        var fromRnd = _.random(100);
        var toRnd = _.random(fromRnd, 100);
        return { query: { age: { $gt: fromRnd, $lt: toRnd }, gender: _.random(1) }, fields: { _id: 1 }, limit: 10 };
    },
    locationRecord: function () {
        var rnd = _.random(100000000000);
        return {
            name: 'The Point ' + _.random(1000000),
            ownerId: rnd + 'user',
            loc: { type: 'Point', coordinates: [180 - Math.random() * 360, 90 - Math.random() * 180] }
        };
    }
};
var collectionName = 'testUserCollectionToKillMongoIn5seconds';
var collectionKeys = [{ age: 1, gender: 1, cityId: 1 }, { email: 1 }];
var collectionInsertRecord = generator.userRecord;
var collectionFindRecord = generator.getUserRecord;
function createServerTestPoint() {
    var bmi = new bm.BasicMongo('mongodb://user:password@host/database');
    bmi.on(DbState.CONNECTED, connectHandler);
    bmi.on(DbState.READY, readyHandler);
    bmi.connect();
    function connectHandler() {
        console.log('mng:connected');
        bmi.init([{ name: collectionName, keys: collectionKeys }]);
    }
    function readyHandler() {
        console.log('mng:ready');
        var collection = bmi.getClient().collection(collectionName);
        function insertOneRecord(cb) {
            collection.insert(collectionInsertRecord(), function (e, result) {
                (e ? s.e++ : s.i++);
                console.log('insert ', s.i, s.e, !!result);
                cb(e, result);
            });
        }
        function getOneRecord(cb) {
            var getData = collectionFindRecord();
            collection.find(getData.query, getData.fields).limit(getData.limit).toArray(function (e, result) {
                (e ? s.e++ : s.f++);
                console.log('find ', s.f, s.e, !!result);
                cb(e, result);
            });
        }
        var insertSeries = _.chain(_.range(1000)).map(function (v) {
            return insertOneRecord;
        }).value();
        var findSeries = _.chain(_.range(10000)).map(function (v) {
            return getOneRecord;
        }).value();
        var resultSeries = _.shuffle([].concat(insertSeries, findSeries));
        console.log('Start execute %d operations', resultSeries.length, resultSeries);
        async.series(resultSeries, function (e, results) {
            console.log('Done', e, results.length, _.first(results), _.last(results));
        });
    }
    return bmi;
}
_.chain(_.range(100)).each(function (v) { createServerTestPoint(); });
//# sourceMappingURL=basic-mongo-test.js.map

Config:

# mongod.conf
 
#where to log
logpath=/var/log/mongodb/mongod.log
 
logappend=true
 
# fork and run in background
fork=true
 
#port=27017
 
dbpath=/var/lib/mongo-wt
storageEngine=wiredTiger
wiredTigerEngineConfigString=cache_size=1G,direct_io=[data]
 
# location of pidfile
pidfilepath=/var/run/mongodb/mongod.pid
 
# Listen to local interface only. Comment out to listen on all interfaces.
#bind_ip=127.0.0.1
 
# Disables write-ahead journaling
# nojournal=true
 
# Enables periodic logging of CPU utilization and I/O wait
cpu=true
 
# Turn on/off security.  Off is currently the default
#noauth=true
auth=true
 
# Verbose logging output.
#verbose=true
 
# Inspect all client data for validity on receipt (useful for
# developing drivers)
#objcheck=true
 
# Enable db quota management
#quota=true
 
# Set oplogging level where n is
#   0=off (default)
#   1=W
#   2=R
#   3=both
#   7=W+some reads
#diaglog=0
 
# Ignore query hints
#nohints=true
 
# Enable the HTTP interface (Defaults to port 28017).
httpinterface=true
 
# Turns off server-side scripting.  This will result in greatly limited
# functionality
#noscripting=true
 
# Turns off table scans.  Any query that would do a table scan fails.
#notablescan=true
 
# Disable data file preallocation.
#noprealloc=true
 
# Specify .ns file size for new databases.
# nssize=<size>
 
# Replication Options
 
# in replicated mongo databases, specify the replica set name here
#replSet=setname
# maximum size in megabytes for replication operation log
#oplogSize=1024
# path to a key file storing authentication info for connections
# between replica set members
#keyFile=/path/to/keyfile
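
As a side note on the config above: `wiredTigerEngineConfigString` is a low-level pass-through whose contents are not validated by mongod. In 3.0 the cache cap has a supported first-class option, `storage.wiredTiger.engineConfig.cacheSizeGB` (or `--wiredTigerCacheSizeGB`). A sketch of the equivalent in the 3.0 YAML config format; the `direct_io=[data]` setting has no first-class equivalent and would still need the engine config string:

```yaml
# Supported 3.0 YAML equivalent of cache_size=1G
storage:
  dbPath: /var/lib/mongo-wt
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 1
```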

Comment by Daniel Pasette (Inactive) [ 28/Apr/15 ]

Hi vasily.timofeev@gmail.com, can you post your stress test? Also, which version of MongoDB are you testing with, and what startup parameters are you using?

Comment by Vasily Timofeev [ 28/Apr/15 ]

I have the same issue with WiredTiger on CentOS 7.0.
I created a simple stress test. It inserts into a collection that has two indexes (one string, one numeric).
After 2-3 minutes mongod had used all available memory (2 GB) and was killed by the system's OOM killer.

Comment by Bruce Lucas (Inactive) [ 10/Mar/15 ]

Thanks for the additional information. I have not been able to exactly replicate the specific pattern of memory growth, cursor growth, etc. that your runs showed; to do that I think I would need to set up the complete application, as the specific pattern of those metrics depends on the exact sequence of operations.

However, I have done some more investigation into the amount of memory used by cursors, and, based on my preliminary results, it is quite a bit higher than I had thought: about 1.4-1.5 kB per cursor in my tests. If this holds up, the 580k cursors we see in your runs would account for about 800 MB of memory, or about half of the memory utilization you see. So at this point I think SERVER-17386 accounts for most of the issue you are seeing.
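The arithmetic behind that estimate can be sketched as follows (1450 bytes is my own midpoint of the measured 1.4-1.5 kB range):

```javascript
// Rough arithmetic behind the cursor-memory estimate: per-cursor overhead
// times the number of cached cursors, converted to MiB.
function cursorMemoryMiB(cursorCount, bytesPerCursor) {
  return (cursorCount * bytesPerCursor) / (1024 * 1024);
}

console.log(cursorMemoryMiB(580000, 1450)); // ~802 MiB, i.e. "about 800 MB"
```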

I will continue to investigate and will keep you posted.

Comment by Feng Yu [ 10/Mar/15 ]

Thanks. I will have a try.

I ran a stress test, so each collection keeps growing and memory usage never decreases. Eventually the process is OOM-killed.

Only the mmapv1 storage engine uses much less memory.

Comment by Gabriel [ 10/Mar/15 ]

Unfortunately I can't find the cause of your issue. Maybe you can move the lines

def mongo = MongoConnectionPoolUtils.instance.mongo
def mongoDB = mongo.getDB("testDB")

out of the eventbus handler closure? What happens if you remove the ensureIndex() call?

How many objects, approximately, do you store in each collection? Maybe you can reuse the result of mongoDB.getCollection:

if (colname != "test-$id") {
  colname = "test-$id";
  coll = mongoDB.getCollection(colname);
}

Comment by Feng Yu [ 10/Mar/15 ]

@Gabriel
Of course. Please check:

MongoConnectionPoolUtils.groovy

package com.myModule.dataaccess.mongo
 
import com.mongodb.Mongo
import com.mongodb.MongoOptions
import com.mongodb.ServerAddress
 
class MongoConnectionPoolUtils {
 
    private def mongo
    private static String address = "127.0.0.1"
    private static String port = "27017"
    private static def instance = new MongoConnectionPoolUtils()
 
    private MongoConnectionPoolUtils(){
        def addr = new ServerAddress(address,port as int)
        def opt = new MongoOptions()
        opt.connectionsPerHost = 50
        opt.threadsAllowedToBlockForConnectionMultiplier = 50
        mongo = new Mongo(addr, opt)
    }
 
    public static synchronized getInstance(){
        instance
    }
 
    def getMongo(){
        mongo
    }
}

Yes, you're right. I see that ensureIndex has been deprecated since 3.0, but this code has to work from MongoDB 2.0 through 3.0, so I can't change ensureIndex to createIndex.

Comment by Gabriel [ 10/Mar/15 ]

Maybe I can help: the code is executed in a single thread (see http://vertx.io/manual.html#concurrency).

LOOP
1) Get the DB connection via the connection pool (class MongoConnectionPoolUtils is not shown here ...)
2) Get a collection with the given name ("test-" + $id); if the collection does not exist, a new collection is created.
3) Call the deprecated ensureIndex on that collection (creates the index if it does not already exist).
4) Insert the data.
5) Print the time spent (only if it's more than 5 seconds).

@Feng Yu: Could you give us the code of MongoConnectionPoolUtils?

Comment by Feng Yu [ 10/Mar/15 ]

Sorry, I don't know the details of the relationship between threads, collection creation, index creation, and inserts. This code lives in a Vert.x verticle. Here is the starter code:

class MongoWorker extends Verticle{
 
    def mongodbAddress = "dataaccess.mongodb"
    def eb
    def log
 
    def start(){
        eb = vertx.eventBus
        log = container.logger
        vertx.eventBus.registerHandler(mongodbAddress){msg->
            def data = msg.body
            def startTime = System.currentTimeMillis()
            def mongo = MongoConnectionPoolUtils.instance.mongo
            def id = data.get("id")
            def mongoDB = mongo.getDB("testDB")
            def coll = mongoDB.getCollection("test-$id")
            coll.ensureIndex([timestamp:-1] as BasicDBObject)
            coll.insert(data.data as BasicDBObject)
            def spendTime = System.currentTimeMillis() - startTime
            if(spendTime > 5000){
                log.info("mongo save spent ${spendTime} ms")
            }
        }
    }
}

It runs in a worker verticle. Every JSON object is stored in MongoDB with an index on timestamp. When I use the mmapv1 storage engine, memory usage is normal, but when I switch to wiredTiger the same code drives mongod out of memory.
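If, as the handler above suggests, ensureIndex runs on every insert, one low-risk change is to remember which collections have already been indexed. This is an untested sketch against the same legacy driver API; `mongoDB` is assumed to be obtained once in start() rather than per message:

```groovy
// Sketch: ensure the index once per collection name instead of on every
// message. The handler runs on a single event-loop thread, so a plain
// Set needs no synchronization here.
def indexed = [] as Set

vertx.eventBus.registerHandler(mongodbAddress){ msg ->
    def data = msg.body
    def coll = mongoDB.getCollection("test-${data.get('id')}")
    if (!indexed.contains(coll.name)) {
        coll.ensureIndex([timestamp: -1] as BasicDBObject)
        indexed << coll.name
    }
    coll.insert(data.data as BasicDBObject)
}
```

This avoids one round trip per insert, though it does not by itself explain the server-side cursor growth discussed above.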

Comment by Bruce Lucas (Inactive) [ 09/Mar/15 ]

Thanks for the additional detail. Can you describe the overall flow of the application, perhaps using pseudo-code. In particular I need to know the relationship between threads, collection creation, index creation, and inserts. Is your program doing ensureIndex for each insertion (as the snippet above might suggest), or are you creating each collection, then doing ensureIndex, then inserting some number of documents in a loop? Does the test run you showed me above include creating the collections as part of the run, or are those created prior to the run?
Thanks,
Bruce

Comment by Feng Yu [ 09/Mar/15 ]

Hi @Bruce Lucas. My program is written in Groovy. Here is some of the related code:

test.groovy

    import com.mongodb.Mongo
    import com.mongodb.MongoOptions
    import com.mongodb.ServerAddress
    private def mongo
    private static String address = "127.0.0.1"
    private static String port = "27017"
 
    def addr = new ServerAddress(address,port as int)
    def opt = new MongoOptions()
    opt.connectionsPerHost = 50
    opt.threadsAllowedToBlockForConnectionMultiplier = 50
    mongo = new Mongo(addr, opt)
 
    def mongoDB = mongo.getDB("testDB")
    def coll = mongoDB.getCollection("test-$id")
    coll.ensureIndex([timestamp:-1] as BasicDBObject)
    coll.insert(data.data as BasicDBObject)

There are about 3,000 collections in testDB. Here is an example document:

{ "_id" : ObjectId("54f81f0044f4d4a4dd146777"), "31" : "1.0", "32" : "1.0", "34" : "0.0", "35" : "0.0", "10" : "1.0", "9" : "0.0", "7" : "1.0", "8" : "0.0", "6" : "2118.0", "1" : "4.0", "2" : "1.0", "3" : "84.0", "5" : "2894.0", "4" : "2573.0", "timestamp" : NumberLong("1425546342041"), "a" : 1, "b" : 1, "c" : 169.52 }

You can see I use a thread pool.

Comment by Bruce Lucas (Inactive) [ 05/Mar/15 ]

Thanks for providing those logs, it is very helpful. Here are some relevant stats from ss.log:

  • The number of allocated bytes grows to 1.6 GB, whereas there are only about 470 MB of data in the WiredTiger cache. We need to account for all the extra memory usage.
  • At the point marked A, the number of open cursors begins to climb dramatically, reaching about 580k by the end of the run. I suspect that SERVER-17386 (which involves the way mongod caches WT cursors) is involved here. To confirm that the number of open cursors is accounted for by SERVER-17386, I would like to know: how many threads the application uses, how many total collections plus indexes are involved, and what application behavior changed at A to account for the sudden increasing trend in the number of cursors. (Note: these are WiredTiger cursors, not MongoDB cursors, and they don't map 1-1, so I don't suspect that the application explicitly opened 580k cursors.)
  • My back-of-the-envelope calculation suggests, however, that the cursors only account for 200 MB or so of memory.

Focusing on the early part of the run before A:

  • The number of bytes in the cache has risen to about 105 MB, while at the same time the number of allocated bytes has risen to about 561 MB. Based on the fact that the rise was correlated, I think you might be encountering SERVER-17424. However, the excessive memory usage in that case was about 20%, whereas here it is about 400%. So to confirm that it is the same issue I would like to do some memory profiling on your app, if you would be willing to share it with us.

To summarize my requests:

  1. Was there an application behavior change at A (e.g. number of connections, number of collections or indexes, type of operation)?
  2. How many threads (connections) does the app use, and how many total collections plus indexes?
  3. Can you share the Java app with us (let me know if you want a private upload location)?

Thanks,
Bruce

Comment by Feng Yu [ 05/Mar/15 ]

Hi @Bruce Lucas. Please check the logs. I used a Java program to run this test, but the Java process hit OOM first.

Even after the Java program crashed with OOM, mongod's memory was still not released.

ps aux | grep mongo
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mongodb  17896 16.4 41.1 1812976 1666736 ?     Ssl  09:48  18:33 /usr/bin/mongod --config /etc/mongod.conf

I've uploaded my ss.log and syslog. Please check.

Comment by Feng Yu [ 04/Mar/15 ]

Of course. I ran this test on my test server; there is no important data on it.

Please wait; I will upload the log file after following your steps.

The test server is a VirtualBox guest with 4 GB of memory and no swap.

I also find that wiredTiger memory usage never decreases. I used mongorestore to restore a 70 GB database, and after it finished, mongod's memory usage stayed at about 2 GB. mmapv1 is normal and releases the memory automatically.

Comment by Bruce Lucas (Inactive) [ 04/Mar/15 ]

Can you tell me how much physical memory the system has? Also, can you please attach the syslog file (/var/log/syslog* on Ubuntu) that covers the time when the OOM condition occurred? This will contain additional information about system memory usage that will help us understand the incident.

Is this issue readily reproducible? If so, can you please do the following:

  • start mongod as usual
  • start the following data collection script:

    mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1000)}" >ss.log &

  • after the OOM condition occurs, please attach both syslog and the ss.log output. This will give us a clearer picture of how memory usage varies over the course of the run.
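Once ss.log exists, the allocated-bytes series can be pulled out with standard tools. This is only a sketch: a single hypothetical sample line stands in for the real file, and the field path assumes the tcmalloc section of the serverStatus output captured by the script above:

```shell
# Extract tcmalloc's current_allocated_bytes from each JSON line of ss.log
# (a hypothetical sample line is piped in here instead of the file).
echo '{"tcmalloc":{"generic":{"current_allocated_bytes":1610612736}}}' \
  | grep -o '"current_allocated_bytes":[0-9]*' \
  | cut -d: -f2
```

Against the real file this would be `grep -o ... ss.log | cut -d: -f2`, giving one value per one-second sample, which is easy to plot.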

Finally, we may want to try to reproduce this issue in-house. Is there any possibility you would be able to share the load generator with us? We could provide a private upload location if you don't want to share it on this public ticket.

Thanks,
Bruce

Comment by Feng Yu [ 04/Mar/15 ]

Edit: Here is my /etc/mongod.conf:
cat /etc/mongod.conf
systemLog:
  destination: file
  path: "/var/log/mongodb/mongodb.log"
  logAppend: true
storage:
  journal:
    enabled: true
  dbPath: "/mnt/mongodb"
  engine: wiredTiger
net:
  bindIp: 127.0.0.1
  port: 27017

I tested with mmapv1 and it works very well; the memory usage is normal.
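Not part of this config, but worth noting for small hosts like this 4 GB guest: the WiredTiger cache can be capped explicitly in the same YAML file via `storage.wiredTiger.engineConfig.cacheSizeGB`. The 1 GB value below is only an illustrative choice, not a recommendation from this thread:

```yaml
storage:
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 1
```

This bounds the cache itself, though (as the analysis above shows) cursor caching and allocator overhead can still push total RSS well beyond the cache size.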

Generated at Thu Feb 08 03:44:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.