[SERVER-20991] WiredTiger caching strategy needs improvement on Windows Created: 17/Oct/15  Updated: 06/Jun/17  Resolved: 12/Nov/15

Status: Closed
Project: Core Server
Component/s: Storage, WiredTiger
Affects Version/s: None
Fix Version/s: 3.2.0-rc3

Type: Bug Priority: Major - P3
Reporter: Nick Judson Assignee: Mark Benvenuto
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on WT-2200 Change WiredTiger caching strategy on... Closed
depends on SERVER-21349 WiredTiger changes for 3.2.0-rc3 Closed
Related
related to SERVER-29465 Add warning about Windows SystemFileC... Closed
is related to SERVER-19795 mongod memory consumption higher than... Closed
is related to WT-1990 Prevent Windows from mapping.wt files Closed
Backwards Compatibility: Fully Compatible
Operating System: Windows
Steps To Reproduce:

Run MongoDB with WiredTiger (tested on 3.1.8 and 3.2 RC0) on Windows with a cache size significantly smaller than physical RAM. Run an insert-heavy workload (I can provide one).

Sprint: Platform C (11/20/15)
Participants:

Description

When deploying MongoDB with WiredTiger on Windows, physical memory is exhausted by file system caching, irrespective of wiredTigerCacheSizeGB. This makes it impossible to limit the amount of physical memory used by MongoDB, which means it cannot be deployed alongside other processes on the same physical hardware.

The file system cache dedicated to WiredTiger I/O does not appear to be released under memory pressure, and other processes may be starved for memory.

Once the file system cache has consumed the majority of physical memory, leaving MongoDB idle does nothing to release it; the only way to reclaim that memory is to restart MongoDB.

This effectively makes MongoDB with WiredTiger unusable as an embedded database, an area I think MongoDB (the company) underestimates.



Comments
Comment by Mark Benvenuto [ 20/Oct/16 ]

This KB article describes the issue on Windows Server 2008 R2. It also reproduces for me with MongoDB and YCSB on Windows Server 2012 R2:
https://support.microsoft.com/en-us/kb/976618

https://blogs.msdn.microsoft.com/ntdebugging/2007/11/27/too-much-cache/
https://blogs.msdn.microsoft.com/ntdebugging/2009/02/06/microsoft-windows-dynamic-cache-service/
https://blogs.technet.microsoft.com/yongrhee/2010/02/16/windows-7-and-windows-server-2008-r2-do-you-still-need-the-microsoft-windows-dynamic-cache-service/#3477119

Here is a PowerShell script that will automatically adjust the cache size:
http://serverfault.com/questions/325277/windows-server-2008-r2-metafile-ram-usage/527466#527466
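
For reference, the Win32 API that the Dynamic Cache Service and the linked script wrap is SetSystemFileCacheSize. Below is a minimal C sketch of that call; the 4 GB cap and 100 MB floor are arbitrary illustration values, a 64-bit build is assumed, and the process must run elevated with SeIncreaseQuotaPrivilege available:

#include <windows.h>
#include <stdio.h>

/* Enable SeIncreaseQuotaPrivilege, which SetSystemFileCacheSize requires.
 * Link against advapi32 for the token APIs. */
static BOOL enable_increase_quota_privilege(void)
{
    HANDLE token;
    TOKEN_PRIVILEGES tp;

    if (!OpenProcessToken(GetCurrentProcess(),
            TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &token))
        return FALSE;
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    if (!LookupPrivilegeValueA(NULL, "SeIncreaseQuotaPrivilege",
            &tp.Privileges[0].Luid)) {
        CloseHandle(token);
        return FALSE;
    }
    AdjustTokenPrivileges(token, FALSE, &tp, 0, NULL, NULL);
    CloseHandle(token);
    return GetLastError() == ERROR_SUCCESS;
}

int main(void)
{
    /* Illustrative values only: 100 MB floor, 4 GB hard maximum. */
    SIZE_T min_cache = 100 * 1024 * 1024;
    SIZE_T max_cache = (SIZE_T)4096 * 1024 * 1024;

    if (!enable_increase_quota_privilege()) {
        fprintf(stderr, "could not enable SeIncreaseQuotaPrivilege\n");
        return 1;
    }
    if (!SetSystemFileCacheSize(min_cache, max_cache,
            FILE_CACHE_MAX_HARD_ENABLE)) {
        fprintf(stderr, "SetSystemFileCacheSize failed: %lu\n", GetLastError());
        return 1;
    }
    printf("system file cache capped at %llu bytes\n",
        (unsigned long long)max_cache);
    return 0;
}

The Dynamic Cache Service and the linked script essentially re-apply this call on a schedule as memory conditions change.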

Comment by Daniel Pasette (Inactive) [ 16/Dec/15 ]

See SERVER-21792. You can still use this approach for workloads which it improves, but we need to go back to the drawing board for the general solution.

Comment by Mark Benvenuto [ 16/Dec/15 ]

To get the old behavior after the revert, use --wiredTigerEngineConfigString="direct_io=(data)".

It was reverted after investigating this ticket: SERVER-21792.

Comment by Nick Judson [ 16/Dec/15 ]

Why was it reverted?

Comment by Mark Benvenuto [ 16/Dec/15 ]

The reverts were made against 3.2.1 and 3.3.0. The original fix shipped in releases 3.2.0-rc3 up to and including 3.2.0, and was reverted in 3.2.1 and 3.3.0.

Comment by Githook User [ 15/Dec/15 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: Revert "SERVER-20991 Change WiredTiger caching strategy on Windows"

This reverts commit 884644ac56de2edc2223b75ceabe9e1a6fef6dab.
Branch: v3.2
https://github.com/mongodb/mongo/commit/c7b065227470a27c40c45f07a8c967b7aa7af9db

Comment by Githook User [ 15/Dec/15 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: Revert "SERVER-20991 Change WiredTiger caching strategy on Windows"

This reverts commit 884644ac56de2edc2223b75ceabe9e1a6fef6dab.
Branch: master
https://github.com/mongodb/mongo/commit/439a56d7af3ce4bad983f5829b3485bb0af7f6c3

Comment by Githook User [ 12/Nov/15 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-20991 Change WiredTiger caching strategy on Windows
Branch: master
https://github.com/mongodb/mongo/commit/884644ac56de2edc2223b75ceabe9e1a6fef6dab

Comment by Nick Judson [ 07/Nov/15 ]

I've confirmed with Mark that his initial build appears to solve the problem. Thanks for the update.

Comment by Daniel Pasette (Inactive) [ 07/Nov/15 ]

Hi Nick, we're still ironing out some issues in code review. We'll make a build available next week for testing.

Comment by Nick Judson [ 02/Nov/15 ]

Mark - is it possible to get copies of these different builds to load test? I don't have a machine that will build Mongo/WT.

Comment by Mark Benvenuto [ 02/Nov/15 ]

I ran the same tests to compare buffering vs. FILE_FLAG_WRITE_THROUGH without FILE_FLAG_NO_BUFFERING. I ran these tests on a different machine than the first example, and saw very different I/O performance compared to the first host.

In this case, I do not see a performance difference between buffering and FILE_FLAG_WRITE_THROUGH. Note that the system cache still buffers writes, and as such 100% of the machine's memory is in use.

Test Machine:
c3.2xlarge - 2x80 GB SSD disk - RAID 1, 15 GB RAM, 8 CPU
Git Hash# 08bcdca185912cf9f8c6c6bf7faa94b23ea76583 (~3.2.0rc1)

Test Case:
YCSB - 8 threads load, 24 thread run

Workload 1:
5000000 Records
10000000 Operations
10 fields, each 100 bytes, zipfian request distribution
50% Read, 50% Update

Load Phase (Insert Only)

CreateFile Buffering    | WT Cache Size (GB) | Run 1 Duration (seconds) | Run 2 Duration (seconds)
Buffered                | 8                  | 160                      | -
FILE_FLAG_WRITE_THROUGH | 8                  | 166                      | 169

Run Phase (50% Read, 50% Update)

CreateFile Buffering    | WT Cache Size (GB) | Run 1 Duration (seconds) | Run 2 Duration (seconds)
Buffered                | 8                  | 528                      | -
FILE_FLAG_WRITE_THROUGH | 8                  | 505                      | 504

Comment by Nick Judson [ 31/Oct/15 ]

Looks like a solid 15% perf improvement. I assume this also stops Windows from loading the .wt files into the system buffer cache?

Comment by Mark Benvenuto [ 30/Oct/15 ]

I have investigated the performance with YCSB in three scenarios (load, 50/50, 50/45/5) on AWS with three variations. In short, we can address most of the issues by enabling FILE_FLAG_NO_BUFFERING as the default in WiredTiger. This has better performance than using direct I/O or normal buffering because Windows will not page out mongod.exe.

Variations

  1. Normal = Default MongoDB
  2. Non-Buffered = FILE_FLAG_NO_BUFFERING for data files
  3. Direct_IO = FILE_FLAG_NO_BUFFERING && FILE_FLAG_WRITE_THROUGH for data files

Test Machine:
c3.2xlarge - 2x80 GB SSD disk - RAID 1, 15 GB RAM, 8 CPU
Git Hash# 08bcdca185912cf9f8c6c6bf7faa94b23ea76583 (~3.2.0rc1)

Test Case:
YCSB - 8 threads load, 24 thread run

Workload 1:
5000000 Records
10000000 Operations
10 fields, each 100 bytes, zipfian request distribution
50% Read, 50% Update

Load Phase (Insert Only)

CreateFile Buffering | WT Cache Size (GB) | Run 1 Duration (seconds) | Run 2 Duration (seconds)
Buffered             | 8                  | 318                      | 321
Non-Buffered         | 8                  | 311                      | 311
Direct_IO            | 8                  | 312                      | 314

Run Phase (50% Read, 50% Update)

CreateFile Buffering | WT Cache Size (GB) | Run 1 Duration (seconds) | Run 2 Duration (seconds)
Buffered             | 8                  | 847                      | 839
Non-Buffered         | 8                  | 708                      | 711
Direct_IO            | 8                  | 769                      | 766

Workload 2:
5000000 Records
10000000 Operations
10 fields, each 100 bytes, zipfian request distribution
50% Read, 45% Update, 5% Insert

Run Phase (50% Read, 45% Update, 5% Insert)

CreateFile Buffering | WT Cache Size (GB) | Run 1 Duration (seconds) | Run 2 Duration (seconds)
Buffered             | 8                  | 900                      | 905
Non-Buffered         | 8                  | 816                      | 740
Direct_IO            | 8                  | 878                      | 874

POC Patch:

diff --git a/src/third_party/wiredtiger/SConscript b/src/third_party/wiredtiger/SConscript
index 29ffc58..b83951f 100644
--- a/src/third_party/wiredtiger/SConscript
+++ b/src/third_party/wiredtiger/SConscript
@@ -38,6 +39,7 @@ if env.TargetOSIs('windows'):
     ])
     if get_option('allocator') == 'tcmalloc':
         env.InjectThirdPartyIncludePaths(libraries=['gperftools'])
+        env.Append(CPPDEFINES=['HAVE_POSIX_MEMALIGN'])
         env.Append(CPPDEFINES=['HAVE_LIBTCMALLOC'])
 elif env.TargetOSIs('osx'):
     env.Append(CPPPATH=["build_darwin"])
diff --git a/src/third_party/wiredtiger/src/os_win/os_open.c b/src/third_party/wiredtiger/src/os_win/os_open.c
index c7b3040..143fb3a 100644
--- a/src/third_party/wiredtiger/src/os_win/os_open.c
+++ b/src/third_party/wiredtiger/src/os_win/os_open.c
@@ -91,7 +91,7 @@ __wt_open(WT_SESSION_IMPL *session,
        /* Disable read-ahead on trees: it slows down random read workloads. */
        if (dio_type == WT_FILE_TYPE_DATA ||
            dio_type == WT_FILE_TYPE_CHECKPOINT)
-               f |= FILE_FLAG_RANDOM_ACCESS;
+               f |= FILE_FLAG_RANDOM_ACCESS | FILE_FLAG_NO_BUFFERING;
 
        filehandle = CreateFileA(path,
                                (GENERIC_READ | GENERIC_WRITE),
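
For anyone wanting to reproduce the three variations outside of WiredTiger, the sketch below is hypothetical test code (not the WT file-handling path; "example.wt" is a placeholder name). It shows how the CreateFileA flags differ and why FILE_FLAG_NO_BUFFERING forces sector-aligned buffers, which is also why the POC patch turns on an aligned allocator (HAVE_POSIX_MEMALIGN) in the build:

#include <windows.h>
#include <malloc.h>
#include <string.h>
#include <stdio.h>

#define SECTOR_SIZE 4096  /* assumed; query the volume sector size in real code */

int main(void)
{
    /* Variation 3 (Direct_IO): bypass the system cache and push writes to the
     * media. Dropping FILE_FLAG_WRITE_THROUGH gives variation 2 (Non-Buffered);
     * dropping both flags gives the default buffered behavior. */
    HANDLE h = CreateFileA("example.wt", GENERIC_READ | GENERIC_WRITE,
        FILE_SHARE_READ, NULL, CREATE_ALWAYS,
        FILE_ATTRIBUTE_NORMAL | FILE_FLAG_RANDOM_ACCESS |
        FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFileA failed: %lu\n", GetLastError());
        return 1;
    }

    /* With FILE_FLAG_NO_BUFFERING, buffer addresses, file offsets, and
     * transfer sizes must all be multiples of the volume sector size. */
    DWORD written;
    void *buf = _aligned_malloc(SECTOR_SIZE, SECTOR_SIZE);
    if (buf == NULL) {
        CloseHandle(h);
        return 1;
    }
    memset(buf, 0xAB, SECTOR_SIZE);
    if (!WriteFile(h, buf, SECTOR_SIZE, &written, NULL))
        fprintf(stderr, "WriteFile failed: %lu\n", GetLastError());

    _aligned_free(buf);
    CloseHandle(h);
    return 0;
}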

Comment by Michael Cahill (Inactive) [ 26/Oct/15 ]

nick@innsenroute.com, MongoDB does not currently make any use of the mmap feature in WiredTiger, and it would be quite difficult to change that in any general-purpose way, since access must be read-only. I am hopeful that the impact of direct I/O on reads will be less pronounced without FILE_FLAG_NO_BUFFERING, but we'll have to wait until we run some tests to find out.

Comment by Nick Judson [ 26/Oct/15 ]

Michael Cahill - I don't see any write performance hit on the three drives I tested: SSD, 7.2K, and a sad USB external (5.4K, I think). Read performance suffers big time. Is it possible to try out a read-only (capped) mmap for the read cache?

Since my implementation is a message queue, I like that the data I've just written is part of the WT cache, as it gets read back out very quickly. After it's been read, though, I very rarely need it again.

Comment by Michael Cahill (Inactive) [ 26/Oct/15 ]

alessandro.gherardi@schneider-electric.com, my hope is that avoiding the filesystem cache with FILE_FLAG_WRITE_THROUGH will alleviate the issue reported here without the performance degradation that can result from FILE_FLAG_NO_BUFFERING (which is highly dependent on the speed of the underlying volume).

All I am suggesting at this point is that we will investigate the performance implications; any discussion about how this would be configured will only happen once a benefit has been demonstrated.

Comment by Alessandro Gherardi [ 26/Oct/15 ]

>>>> For the record, setting "mmap=false" will have no effect
Thanks for clarifying.

>>>> We will investigate a mode that allows FILE_FLAG_WRITE_THROUGH without FILE_FLAG_NO_BUFFERING
Since the goal of this ticket is to cap the amount of physical memory used by MongoDB, can you please explain why one would want FILE_FLAG_WRITE_THROUGH without FILE_FLAG_NO_BUFFERING?

Per https://support.microsoft.com/en-us/kb/99794, FILE_FLAG_WRITE_THROUGH still causes data to be stored in the disk cache.

Is your intent to allow MongoDB administrators to configure FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING via two separate WT options, so that administrators have more control over how WT behaves?

Comment by Michael Cahill (Inactive) [ 26/Oct/15 ]

For the record, setting "mmap=false" will have no effect on MongoDB runs with the default btree access method in WiredTiger. The "mmap" setting only applies to read-only access to WiredTiger checkpoints, which MongoDB does not currently use.

We will investigate a mode that allows FILE_FLAG_WRITE_THROUGH without FILE_FLAG_NO_BUFFERING.

Comment by Nick Judson [ 25/Oct/15 ]

http://source.wiredtiger.com/develop/tuning.html

Many Linux systems do not support mixing O_DIRECT and memory mapping or normal I/O to the same file, and attempting to do so can result in data loss or corruption. For this reason:
• WiredTiger silently ignores the setting of the mmap configuration to the wiredtiger_open function in those cases, and will never memory map a file which is configured for direct I/O

I think setting direct_io=[data] is equivalent to direct_io=[data] AND mmap=false.

I don't know how the WT cache operates vs the file system cache, so I can't comment on what makes the most sense. I actually think mmapping would be ok if it's for read-only usage and is capped by WT (again I'm no expert here but I suspect the view on the file(s) can be configured to be a specific size). Direct_IO seems to be great for heavy writing (on Windows), but we definitely want some read cache somewhere.
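
As a concrete illustration of how these settings combine at the WiredTiger level, here is a minimal standalone sketch against the WT C API (this is not MongoDB's integration code; "example_home" and the 8 GB cache size are placeholder values). Per the quoted tuning documentation, direct_io=[data] means the data files are never memory-mapped, so mmap=false here is redundant but harmless:

#include <stdio.h>
#include <wiredtiger.h>

int main(void)
{
    WT_CONNECTION *conn;
    /* direct_io=[data] opens data files for direct I/O and disables memory
     * mapping of those files; cache_size caps WiredTiger's own cache. */
    int ret = wiredtiger_open("example_home", NULL,
        "create,cache_size=8GB,direct_io=[data],mmap=false", &conn);
    if (ret != 0) {
        fprintf(stderr, "wiredtiger_open: %s\n", wiredtiger_strerror(ret));
        return 1;
    }
    /* ... open sessions and run the workload ... */
    conn->close(conn, NULL);
    return 0;
}

MongoDB forwards the contents of --wiredTigerEngineConfigString into this same open-configuration string, which is how the workaround noted earlier in this ticket plugs in.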

Comment by Alessandro Gherardi [ 25/Oct/15 ]

Hi Nick,
Thanks for your reply.

>>>> I ran my test using that option (which is not compatible with direct_io)
Where in the documentation did you find out that mmap=false is not compatible with direct_io?

>>>> and the system cache still fills up
That makes sense. If one sets mmap=false WITHOUT also setting direct_io, Windows uses the system cache, and that ends up consuming all available RAM.

>>>> I'm still thinking 'write_through' plus a read-only cache (WT or MMap) would be ideal.
I'm concerned that, if WT is allowed to use memory-mapped files (i.e., mmap=true), WT can end up consuming RAM in an unchecked fashion. In other words, it's no better than the MMAPv1 storage engine in terms of hogging RAM. With mmap=false instead, WT uses its cache, whose size is capped via wiredTigerCacheSizeGB.

In other words, for the purpose of capping the RAM used by WT, it seems to me that specifying BOTH mmap=false AND direct_io=[data] is the best option.

What do you think?

Comment by Nick Judson [ 24/Oct/15 ]

Alessandro. No I don't. I ran my test using that option (which is not compatible with direct_io) and the system cache still fills up (although perhaps not quite as quickly). On my workload it appears to be roughly equal in speed to direct_io, although queries do run much faster once they've been hit once and loaded into cache.

I'm still thinking 'write_through' plus a read-only cache (WT or MMap) would be ideal.

Comment by Alessandro Gherardi [ 24/Oct/15 ]

Hi Nick,
When you run your workload test, do you also set wiredTigerEngineConfigString = mmap=false in the mongod configuration file?

Without mmap=false, WT uses memory-mapped files rather than the WT cache for reads. That defeats the purpose of trying to cap the RAM that mongod uses via the wiredTigerCacheSizeGB option. I'm actually wondering if wiredTigerCacheSizeGB is completely useless unless one also sets mmap=false.

Thoughts?

Comment by Nick Judson [ 21/Oct/15 ]

Would it be possible to have writes be direct to disk (direct_io) but reads be loaded into the cache - either WT or OS? The write data could go into the cache, but it wouldn't get flushed to disk (writes would be direct). Having a read-only cache might solve the problem of large/constant flushes and also allow for fast queries.

Comment by Nick Judson [ 19/Oct/15 ]

http://winntfs.com/2012/11/29/windows-write-caching-part-2-an-overview-for-application-developers/

Apparently SQL Server uses direct IO (MS SQL is what my product currently uses).

Also, if I understand correctly, a large WT cache will hamper write speed (which is what I see in my tests), but obviously it's beneficial when reading the data. If the WT cache doesn't contain the data to be read, a read request is passed to the I/O subsystem. The system file cache (if used) may have the required data in memory, in which case no disk I/O is required. WT stores uncompressed data in its cache, whereas the OS caches the compressed bytes.

Is it likely that there will be a significant overlap in what WT has cached and what the OS has cached - essentially double-buffering data (both uncompressed and compressed)? If that is the case, does it make sense to max out the RAM with WT cache and use direct_IO?

For write-heavy workloads like mine, the following paragraph may explain why a large WT/system file cache can hurt performance:

..."Liberal use of the FlushFileBuffers API can severely affect system throughput. This is because at the file system layer, it is quite clear what data blocks belong to what file. So when FlushFileBuffers is invoked, it is also apparent what data buffers need to be written out to media. However, at the block storage layer – shown as “Sector I/O” in Figure 1, it is difficult to track what blocks are associated with what files. Consequently the only way to honor any FlushFileBuffers call is to make sure all data is flushed to media. Therefore, not only is more data written out than originally intended, but the larger amount of data can affect other I/O optimizations such as queuing of the writes."

Comment by Nick Judson [ 19/Oct/15 ]

Testing against my workload, I get better performance with direct IO on both SSD and 7.2K spinners, with memory usage capped at the WT cache size. I haven't run a full spectrum of tests, but so far (with my workload) I'm not sure what the downside of direct IO is.

Comment by Nick Judson [ 17/Oct/15 ]

This ticket (SERVER-19795) illustrates the file system buffering consuming all the physical RAM.
