[SERVER-19795] mongod memory consumption higher than WT cache size (at least on Windows 2008 R2) Created: 06/Aug/15  Updated: 11/Jan/18  Resolved: 26/Sep/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Marc Girollet Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: RF, WTmem
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

W 2008 R2


Attachments: PNG File 3.1.8 direct IO off.png     PNG File HighMemoryUsage.png     PNG File MongoNoRelease_1.png     PNG File MongoPartialMemRelease.png     PNG File MongoShutdownCacheRelease.png     PNG File OS Caching.png     PNG File ShutDownMongoD.png     PNG File WT_RAMMAP.png     PNG File WT_WORKLOAD.png     PNG File direct_IO_ON.png     PNG File mongo_318.png     File serverstatus.json     PNG File sqlserver.png    
Issue Links:
Related
related to SERVER-20991 WiredTiger caching strategy needs imp... Closed
related to WT-1990 Prevent Windows from mapping.wt files Closed
Operating System: Windows
Participants:
Case:

 Description   

Although the mongod.exe process memory is properly capped, as required by --wiredTigerCacheSizeGB, the OS (Windows 2008 R2) keeps a huge part of its memory active as a file cache.

See the screenshot from the RAMMap tool:

  • green: no problem; because --wiredTigerEngineConfigString direct_io=[data] is in use, no part of the data files is cached by the OS.
  • red: the OS keeps large parts of the files in RAM (see the "Active File" column). That by itself is OK, but the issue here is that the total RAM used because of mongod ("mongod private bytes" + "OS active file") exceeds --wiredTigerCacheSizeGB; in the end, the whole database would be held in memory. This leads to OOM conditions we cannot prevent, since we cannot cap the "OS active file" memory the OS keeps locked on WiredTiger's behalf.

Using --wiredTigerEngineConfigString direct_io=[data] is a workaround for this problem, but it makes queries far too slow and in practice is not applicable to our process and data volumes.

Could you make --wiredTigerCacheSizeGB take into account the whole amount of RAM used because of mongod (OS + process), please?

"Mongo64_2008+\mongod.exe" --port 4444 --dbpath D:\Homeware\XOne_services\PreBETA\MongoData\CurvesFR --directoryperdb --journal --nohttpinterface --wiredTigerCacheSizeGB 1 --wiredTigerDirectoryForIndexes --replSet MongoServiceCacheCurves --oplogSize 1024 --storageEngine wiredTiger --auth --keyFile x.keyfile
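For reference, the direct_io workaround mentioned above would be added to a command line like the one above via the engine config string. This is a sketch only; the dbpath is a placeholder, and direct I/O carries the query-performance cost described in this ticket:

```shell
mongod --port 4444 ^
       --dbpath D:\path\to\data ^
       --storageEngine wiredTiger ^
       --wiredTigerCacheSizeGB 1 ^
       --wiredTigerEngineConfigString "direct_io=[data]"
```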



 Comments   
Comment by Nick Judson [ 19/Oct/15 ]

@Michael Cahill - I found this an interesting read: http://winntfs.com/2012/11/29/windows-write-caching-part-2-an-overview-for-application-developers/

..."Some well known applications such as Microsoft SQL and the Microsoft JET database (ships with Windows Server SKUs) specify FILE_FLAG_NO_BUFFERING with the CreateFile API."...

Is there a flag where we can get verbose output from WT (and track the frequency of flushes etc.)?

Comment by Nick Judson [ 07/Oct/15 ]

Thanks for looking @Michael Cahill - appreciated. I don't know if/what the fix might be and unfortunately I don't have a C++ build environment. Looking at the windows docs and experimenting with artificially restricted system caches doesn't reveal much of interest - other than for my workload it doesn't appear to provide any benefit.

I don't see how it's possible to restrict the system file cache by process, so it may be out of your hands. Any in-house Windows experts able to comment?

The part I find odd is that the cache never seems to be released. I'm wondering if there is something non-standard about the way WT creates/locks files.

Comment by Michael Cahill (Inactive) [ 06/Oct/15 ]

nick@innsenroute.com, the os_cache_max setting in WiredTiger currently relies on posix_fadvise: we don't have a Windows implementation so it will not have any effect.

I'm genuinely sorry that you are having problems using WiredTiger on Windows. It is a good sign that enabling direct I/O helps in some cases: what I suspect it relies on is the I/O subsystem being fast enough that reads and writes don't stall when going to disk synchronously.

I did a review today of what interfaces are available on Windows and how WiredTiger uses them. One thing I noticed is that WiredTiger's direct_io setting does two things on Windows: it sets FILE_FLAG_NO_BUFFERING to disable filesystem cache and FILE_FLAG_WRITE_THROUGH to make writes synchronous to disk.

I don't think WiredTiger needs FILE_FLAG_WRITE_THROUGH for correctness: we will also call FlushFileBuffers for durability. Given that, it may be worth trying performance runs without writethrough semantics. Unfortunately, doing that requires a source code change:

diff --git a/src/third_party/wiredtiger/src/os_win/os_open.c b/src/third_party/wiredtiger/src/os_win/os_open.c
index c7b3040..633db36 100644
--- a/src/third_party/wiredtiger/src/os_win/os_open.c
+++ b/src/third_party/wiredtiger/src/os_win/os_open.c
@@ -79,7 +79,7 @@ __wt_open(WT_SESSION_IMPL *session,
                dwCreationDisposition = OPEN_EXISTING;
 
        if (dio_type && FLD_ISSET(conn->direct_io, dio_type)) {
-               f |= FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH;
+               f |= FILE_FLAG_NO_BUFFERING;
                direct_io = true;
        }

If anyone is prepared to give this a try, please let me know.

Comment by Nick Judson [ 02/Oct/15 ]

I'm wondering about the os_cache_max setting in WT. If that's settable then it would be interesting to try.

Comment by Stephen JANNIN [ 02/Oct/15 ]

During our activation of WiredTiger in production in July, we tried to use direct_io for a few days, but we had many slow requests. Memory was no longer leaked, but performance was catastrophic.
We finally went back to mmap.

Comment by Nick Judson [ 02/Oct/15 ]

Retest with Direct IO off took 7h:32m vs. 5h:48m, a ~30% speed difference. Direct IO also uses 6GB less physical RAM.

Comment by Nick Judson [ 01/Oct/15 ]

Ramon, there seem to be a few issues at play here:

1. MongoDB memory usage > WT Cache size due to TCMalloc cache going over 1GB. (SERVER-20104).
2. MongoDB memory usage > WT Cache size due to OS caching of WT files.
2A. On 64-bit Windows, there is no limit to how much physical memory the OS will use for file buffering, and it appears that for certain workloads it will use all available physical memory. This cache does not appear to be released until MongoDB is shut down, even when the system is under memory stress and MongoDB is idle. This makes it impossible to constrain MongoDB when using WT, and other processes on the same box may suffer memory starvation. (This ticket)
2B. In the case of 2A, MongoDB performance itself drops by ~25% when compared to direct IO. Possibly this is due to a larger OS cache being flushed continually. Perhaps this is similar to why, for certain workloads, a smaller WT cache size improves performance.

To reiterate, for a pleb such as myself, it seems odd that MongoDB soaks up all the system memory when other databases I've used do not. Surely this cannot be the expected behavior (this ticket).

If you want me to create a ticket for 2B I will, but it seems strongly correlated with this ticket.

Comment by Ramon Fernandez Marina [ 01/Oct/15 ]

nick@innsenroute.com, will you please open a separate ticket and post your results in it? Whether better memory management is needed on Windows, or whether it makes sense to make directIO the default on this platform, these are broader and different topics than the behavior/bug described in this ticket (using the --wiredTigerCacheSizeGB to limit mongod's memory consumption).

Thanks,
Ramón.

Comment by Nick Judson [ 01/Oct/15 ]

ok - so I ran a test last night with my standard workload on 3.1.8 with WT cache size set to 4 GB and direct IO turned on. Fastest run I have ever seen - finishing in 5h:48m - fantastic, consistent performance (see attached). Lots of system memory left idle.

I'm currently running the same test without direct IO. Initially, the speeds are faster, but an hour or so into the test the speeds have dropped by 25%. The OS has paged over 6GB of WT files into memory and is bumping up against the physical memory limit.

The direct IO off test is still running, but when it finishes I'll upload the perf chart for that.

Comment by Nick Judson [ 29/Sep/15 ]

I also tested whether mongod releases memory back to the OS, and I can confirm that it does not. Neither mongod's memory nor the OS cache is released.

Edit: it took a while but it did release half of the process memory, but none of the file cache memory.

Killed MongoD and the WT file cache was released, and SQL Server climbed back up to 1GB (see screen shot).

Comment by Nick Judson [ 29/Sep/15 ]

A few notes for a baseline:

I configured SQL Server with a 1GB cap and ran my test. See the SqlServer attachment, which shows overall memory usage of 5.2GB, with almost nothing cached by the OS and 1GB of physical RAM in use by the SQL process.

I ran the same test with 3.1.8 with WT cache set to 1GB (see Mongo_318 attachment). It shows the overall usage at 7GB, with 2.5 GB OS cache and MongoD using 1.3 GB.

SQL = (System usage) + 1GB = 5.2 GB.
Mongo = (System usage) + 1.3GB + 2.5 GB = 7GB.

It's difficult to illustrate, but in the SQL scenario there is less memory pressure and other processes are consuming more memory. In the Mongo scenario, many of those same processes are using less memory. I suspect that if this were taken into account, the system memory would be closer to 8GB for mongod. So in this short test, MongoDB is consuming 3.8GB of RAM with the WT cache size set to 1GB: 1.8 to 2.8GB more memory than SQL Server restricted to 1GB.

I will re-test on my work machine which is much beefier - but from my earlier results it appears the OS cache is 5GB in that scenario. From my notes it appears on 3.1.7 with WT cache set to 3GB, the actual usage is 10 GB!

Comment by Nick Judson [ 28/Sep/15 ]

And one totally unrelated comment:

"...Note that setting a large value for the WiredTiger cache to improve performance..." isn't necessarily true. I was surprised to learn that lowering the WT cache size for insert-heavy workloads yields a large performance improvement.

Comment by Nick Judson [ 28/Sep/15 ]

A few comments:

1. Use --wiredTigerCacheSizeGB to limit the WiredTiger cache. This works.
2. There is currently an issue with TCMalloc + MongoDB where an indeterminate amount of memory is used in addition to the WT cache. My understanding is that TCMalloc's cache was to be limited to ~1GB; however, this isn't currently the case, and significantly more may be used (there is an existing ticket for this).
3. The OS is caching/buffering WT's files. This is expected. What is not expected is the amount of memory the OS uses for this. The OS cache appears to be roughly equal to the WT cache size, although this may just be a coincidence in my testing.
4. The OS does not appear to release the WT file cache, so other processes on my test box end up starved for memory.
5. The OS does not appear to release the WT file cache even as physical memory is exhausted; even with only MongoDB on the box, paging kicks in and hurts performance.
6. The OS cache memory is tricky to track down because it doesn't show up in Task Manager; it requires special tools (RAMMap) to show where the memory is getting soaked up.

As previously mentioned, I will attempt to run some further analysis and perhaps compare this behavior with Sql Server. Is this something @Mark Benvenuto could comment on?

Comment by Stephen JANNIN [ 28/Sep/15 ]

I disagree too. We cannot use WiredTiger in these conditions.

Comment by Nick Judson [ 26/Sep/15 ]

Ramon,

I disagree. I'll perform further testing but I don't think this makes sense.

Comment by Ramon Fernandez Marina [ 26/Sep/15 ]

marc.girollet@sgcib.com, if my understanding of the issue description above is correct, this is expected behavior: --wiredTigerCacheSizeGB only limits the WiredTiger cache, not the total amount of memory consumed by mongod.

There's currently no global setting that caps the total amount of memory used by mongod. Also, the OS may use additional memory for buffering, which may be released to other processes under memory pressure. Note that setting a large value for the WiredTiger cache to improve performance will reduce the amount of memory the OS can use for buffering, which may itself have a negative effect on performance.

Regards,
Ramón.

Comment by Nick Judson [ 25/Aug/15 ]

[ShutDownMongoD.png] This shows the memory drop when mongod is shut down (~10G). Task manager details tab shows the working set to be ~5G.

Comment by Eitan Klein [ 25/Aug/15 ]

Attached a file that highlights the visibility of this issue for Windows users.

Comment by Nick Judson [ 25/Aug/15 ]

Mods, please link this to https://jira.mongodb.org/browse/WT-1990?jql=text%20~%20%22Cache%20windows%22 which appears to be the original ticket.

Comment by Nick Judson [ 25/Aug/15 ]

I also see this behavior on my systems. Even though in Task Manager Mongo will appear to use 5GB (even with CacheSize set to 3GB, but that is a different issue), the actual memory usage will be close to double that. RAMMap doesn't yet support Windows 10, but it's easy to see that the overall system memory usage drops by close to double what is reported by the Task Manager details tab when mongod is shut down.

I don't think this is the correct behavior and, for example, a capped SQL Server instance does not appear to use the OS cache in this manner (i.e., a cap of 4GB = 4GB of physical memory usage).

I'm working with Eitan on some other related issues but I've mentioned this to him.

Comment by Stephen JANNIN [ 07/Aug/15 ]

Comparing the wiredTiger source code and the mmapv1 source code, I have a few remarks:

1/ Differences in the way memory-mapped files are created and used:

  • wiredTiger: uses MapViewOfFile.
  • mmapv1: uses MapViewOfFileEx and specifies a requested base address using "getNextMemoryMappedFileLocation".

2/ wiredTiger: FlushViewOfFile and FlushFileBuffers are never called, whereas I think something happens in mmapv1, maybe in data_file_sync.cpp, where flushAllFiles is called in a loop from a thread.

Comment by Marc Girollet [ 07/Aug/15 ]

Hi Michael,
Thanks for your reply. We are actually using version 3.0.4 (and have tested 3.0.5); both have the problem.

A stable, representative case leaks 1,172,768K (previous screenshot), with updates (100 per minute) and queries (150 per minute).
Net in: 3MB, net out: 16MB; DB size on disk: 1.44 GB.

Attached: activity graph and server status.

--------------------

We managed to restrict the OS cache usage (see http://www.uwe-sieber.de/ntcacheset_e.html); we used SetSystemFileCacheSize.

The point is that we didn't have to do this with the mmap engine.

Comment by Michael Cahill (Inactive) [ 06/Aug/15 ]

marc.girollet@sgcib.com, I have moved this ticket to the SERVER project, which deals with MongoDB issues.

Note that the WiredTiger cache is not designed to include all sources of memory allocated by mongod. WiredTiger itself allocates memory for various purposes other than the page cache, such as for buffering log records and caching keys and values in cursors.

To make progress with the pattern of memory use you are seeing, please include more information about the version of MongoDB you are running and the workload that is causing this behavior. The direct I/O configuration is included to avoid having the operating system cache filesystem buffers, but that is intended for performance tuning rather than to avoid out-of-memory issues.

Generated at Thu Feb 08 03:52:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.