[SERVER-20991] WiredTiger caching strategy needs improvement on Windows
Created: 17/Oct/15  Updated: 06/Jun/17  Resolved: 12/Nov/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage, WiredTiger |
| Affects Version/s: | None |
| Fix Version/s: | 3.2.0-rc3 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Nick Judson | Assignee: | Mark Benvenuto |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | Windows |
| Steps To Reproduce: | Run MongoDB with WT (tested on 3.1.8 and 3.2RC0) on Windows with a cache size significantly less than physical RAM. Run an insert-heavy workload (I can provide one). |
| Sprint: | Platform C (11/20/15) |
| Participants: | |
| Description |
|
When deploying MongoDB with WT on Windows, physical memory is exhausted by file system caching, irrespective of wiredTigerCacheSizeGB. This makes it impossible to limit the amount of physical memory used by MongoDB, which means it cannot be deployed alongside other processes on the same physical hardware. The file system cache dedicated to WT I/O does not appear to be released under memory pressure, and other processes may be starved for memory. Once the file system cache has consumed the majority of physical memory, idling MongoDB has no effect on releasing file system memory. The only way to reclaim the file system memory is to restart MongoDB. This effectively makes MongoDB with WT unusable as an embedded database - an area I think MongoDB (the company) underestimates. |
| Comments |
| Comment by Mark Benvenuto [ 20/Oct/16 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The KB article below talks about the issue for Windows 2008 R2. It appears to repro with MongoDB and YCSB on Windows 2012 R2 for me.
https://blogs.msdn.microsoft.com/ntdebugging/2007/11/27/too-much-cache/
Here is a PowerShell script that will automatically adjust the cache size: |
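The script itself is not shown above; as a rough, hedged illustration of the mechanism described in the linked article (and not the attached script), the system file cache can be hard-capped with the SetSystemFileCacheSize API. The 4 GB ceiling below is an arbitrary example value, and the sketch assumes a 64-bit build running elevated so SeIncreaseQuotaPrivilege can be enabled:
{code}
/* Hedged illustration, not the attached script: hard-cap the Windows
 * system file cache via SetSystemFileCacheSize. The 4 GB ceiling is an
 * arbitrary example value; 64-bit build and elevated process assumed. */
#include <windows.h>
#include <stdio.h>

static BOOL enable_increase_quota_privilege(void)
{
    HANDLE token;
    TOKEN_PRIVILEGES tp;

    if (!OpenProcessToken(GetCurrentProcess(),
                          TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &token))
        return FALSE;
    if (!LookupPrivilegeValue(NULL, SE_INCREASE_QUOTA_NAME,
                              &tp.Privileges[0].Luid)) {
        CloseHandle(token);
        return FALSE;
    }
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    BOOL ok = AdjustTokenPrivileges(token, FALSE, &tp, 0, NULL, NULL) &&
              GetLastError() == ERROR_SUCCESS;
    CloseHandle(token);
    return ok;
}

int main(void)
{
    SIZE_T curMin = 0, curMax = 0;
    DWORD curFlags = 0;

    /* SetSystemFileCacheSize requires SeIncreaseQuotaPrivilege. */
    if (!enable_increase_quota_privilege()) {
        fprintf(stderr, "could not enable SeIncreaseQuotaPrivilege\n");
        return 1;
    }

    if (GetSystemFileCacheSize(&curMin, &curMax, &curFlags))
        printf("current file cache: min=%Iu max=%Iu flags=0x%lx\n",
               curMin, curMax, curFlags);

    /* Keep the current minimum, hard-cap the maximum at 4 GB. */
    if (!SetSystemFileCacheSize(curMin, (SIZE_T)4 << 30,
                                FILE_CACHE_MAX_HARD_ENABLE)) {
        fprintf(stderr, "SetSystemFileCacheSize failed: %lu\n",
                GetLastError());
        return 1;
    }
    return 0;
}
{code}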
| Comment by Daniel Pasette (Inactive) [ 16/Dec/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
See | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Mark Benvenuto [ 16/Dec/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
To get the old behavior after the revert, use --wiredTigerEngineConfigString="direct_io=(data)". It was reverted after investigating this ticket: | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nick Judson [ 16/Dec/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Why was it reverted? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Mark Benvenuto [ 16/Dec/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The reverts were made against 3.2.1 and 3.3.0. The original fix shipped in releases 3.2.0-rc3 up to and including 3.2.0, and was reverted in 3.2.1 and 3.3.0. |
| Comment by Githook User [ 15/Dec/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>
Message: Revert "
This reverts commit 884644ac56de2edc2223b75ceabe9e1a6fef6dab. |
| Comment by Githook User [ 15/Dec/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>
Message: Revert "
This reverts commit 884644ac56de2edc2223b75ceabe9e1a6fef6dab. |
| Comment by Githook User [ 12/Nov/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>
Message: |
| Comment by Nick Judson [ 07/Nov/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I've confirmed with Mark that his initial build appears to solve the problem. Thanks for the update. |
| Comment by Daniel Pasette (Inactive) [ 07/Nov/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Nick, we're still ironing out some issues in code review. We'll make a build available next week for test. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nick Judson [ 02/Nov/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Mark - is it possible to get copies of these different builds to load test? I don't have a machine that will build Mongo/WT. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Mark Benvenuto [ 02/Nov/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I ran the same tests to compare buffering vs. FILE_FLAG_WRITE_THROUGH without FILE_FLAG_NO_BUFFERING. I ran these tests on a different machine than the first example, and saw very different I/O performance compared to the first host. In this case, I do not see a performance difference between buffering and FILE_FLAG_WRITE_THROUGH. Note that the system cache still buffers writes, and as such 100% of the machine's memory is in use.
Test Machine:
Test Case:
Workload 1: Load Phase (Insert Only)
Run Phase (50% Read, 50% Update)
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nick Judson [ 31/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Looks like a solid 15% perf improvement. I assume this also stops Windows from loading the .wt files into the system buffer cache? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Mark Benvenuto [ 30/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I have investigated the performance with YCSB in three scenarios (load, 50/50, 50/45/5) on AWS with three variations. In short, we can address most of the issues by enabling FILE_FLAG_NO_BUFFERING as the default in WiredTiger. This has better performance than using Direct I/O or normal buffering because Windows will not page out mongod.exe.
Variations
Test Machine:
Test Case:
Workload 1: Load Phase (Insert Only)
Run Phase (50% Read, 50% Update)
Workload 2: Run Phase (50% Read, 45% Update, 5% Insert)
POC Patch:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
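For readers unfamiliar with the Windows flags being compared above, here is a hedged sketch (not the WiredTiger change itself) of how the CreateFile flag combinations differ when opening a data file; the file name and enum are placeholders, and note that FILE_FLAG_NO_BUFFERING additionally requires sector-aligned buffers, offsets, and transfer sizes:
{code}
/* Illustrative only -- not WiredTiger code. Shows the CreateFile flag
 * combinations under discussion; "collection.wt" is a placeholder name. */
#include <windows.h>

enum io_mode { IO_BUFFERED, IO_WRITE_THROUGH, IO_UNBUFFERED };

static HANDLE open_data_file(enum io_mode mode)
{
    DWORD flags = FILE_ATTRIBUTE_NORMAL;

    switch (mode) {
    case IO_BUFFERED:
        /* Default: reads and writes go through the system file cache. */
        break;
    case IO_WRITE_THROUGH:
        /* Writes are pushed through to disk, but the system file cache
         * is still populated (so reads can still grow it). */
        flags |= FILE_FLAG_WRITE_THROUGH;
        break;
    case IO_UNBUFFERED:
        /* Bypasses the system file cache entirely; buffers, offsets and
         * transfer sizes must be aligned to the volume sector size. */
        flags |= FILE_FLAG_NO_BUFFERING;
        break;
    }

    return CreateFileA("collection.wt",
                       GENERIC_READ | GENERIC_WRITE,
                       FILE_SHARE_READ,
                       NULL,            /* default security attributes */
                       OPEN_ALWAYS,
                       flags,
                       NULL);
}
{code}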
| Comment by Michael Cahill (Inactive) [ 26/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
nick@innsenroute.com, MongoDB does not currently make any use of the mmap feature in WiredTiger, and it would be quite difficult to change that in any general purpose way, since access must be read-only. I am hopeful that the impact of direct I/O on reads will be less pronounced without FILE_FLAG_NO_BUFFERING but we'll have to wait until we run some tests to find out. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nick Judson [ 26/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Michael Cahill - I don't see any write performance hit on the three drives I tested: SSD, 7.2K, and a sad USB external (5.4K I think). Read performance suffers big time. Is it possible to try out read-only (capped) mmap for read cache? Since my implementation is a message queue, I like that the data I've just written is part of the WT cache, as it gets read back out very quickly. After it's been read though, I very rarely ever need it again. |
| Comment by Michael Cahill (Inactive) [ 26/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
alessandro.gherardi@schneider-electric.com, my hope is that avoiding filesystem cache with FILE_FLAG_WRITE_THROUGH will alleviate the issue reported here without the performance degradation that can result from FILE_FLAG_NO_BUFFERING (which is highly dependent on the speed of the underlying volume). All I am suggesting at this point is that we will investigate the performance implications: any discussion about how this would be configured will only happen once there is demonstration of a benefit. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Alessandro Gherardi [ 26/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
>>>> For the record, setting "mmap=false" will have no effect
>>>> We will investigate a mode that allows FILE_FLAG_WRITE_THROUGH without FILE_FLAG_NO_BUFFERING

Per https://support.microsoft.com/en-us/kb/99794, FILE_FLAG_WRITE_THROUGH still causes data to be stored in the disk cache. Is your intent to allow MongoDB administrators to configure FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING via two separate WT options, so that administrators have more control over how WT behaves? |
| Comment by Michael Cahill (Inactive) [ 26/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
For the record, setting "mmap=false" will have no effect on MongoDB runs with the default btree access method in WiredTiger. The "mmap" setting only applies to read-only access to WiredTiger checkpoints, which MongoDB does not currently use. We will investigate a mode that allows FILE_FLAG_WRITE_THROUGH without FILE_FLAG_NO_BUFFERING. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nick Judson [ 25/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
http://source.wiredtiger.com/develop/tuning.html
I think setting direct_io=[data] alone is equivalent to setting both direct_io=[data] and mmap=false. I don't know how the WT cache operates vs. the file system cache, so I can't comment on what makes the most sense. I actually think mmapping would be ok if it's for read-only usage and is capped by WT (again, I'm no expert here, but I suspect the view on the file(s) can be configured to be a specific size). Direct_IO seems to be great for heavy writing (on Windows), but we definitely want some read cache somewhere. |
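As a hedged sketch of where these knobs live in the WiredTiger C API (not how MongoDB itself wires them up), the same settings from the tuning page can be passed in wiredtiger_open's configuration string; the home directory and 1GB cache size below are placeholder values:
{code}
/* Hedged sketch: passing cache_size, direct_io and mmap straight to the
 * WiredTiger C API. "WT_HOME" and the 1GB cache are placeholder values. */
#include <stdio.h>
#include <wiredtiger.h>

int main(void)
{
    WT_CONNECTION *conn;
    /* cache_size caps the WT cache, direct_io=[data] bypasses the OS file
     * cache for data files, and mmap=false disables memory-mapped reads. */
    int ret = wiredtiger_open("WT_HOME", NULL,
                              "create,cache_size=1GB,direct_io=[data],mmap=false",
                              &conn);
    if (ret != 0) {
        fprintf(stderr, "wiredtiger_open: %s\n", wiredtiger_strerror(ret));
        return 1;
    }
    conn->close(conn, NULL);
    return 0;
}
{code}
From mongod, extra items such as direct_io are passed through --wiredTigerEngineConfigString, as noted elsewhere in this ticket.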
| Comment by Alessandro Gherardi [ 25/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Nick,
>>>> I ran my test using that option (which is not compatible with direct_io) and the system cache still fills up
>>>> I'm still thinking 'write_through' plus a read-only cache (WT or MMap) would be ideal.

In other words, for the purpose of capping the RAM used by WT, it seems to me that specifying BOTH mmap=false AND direct_io=[data] is the best option. What do you think? |
| Comment by Nick Judson [ 24/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Alessandro. No I don't. I ran my test using that option (which is not compatible with direct_io) and the system cache still fills up (although perhaps not quite as quickly). On my workload it appears to be roughly equal in speed to direct_io, although queries do run much faster once they've been hit once and loaded into cache. I'm still thinking 'write_through' plus a read-only cache (WT or MMap) would be ideal. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Alessandro Gherardi [ 24/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Nick, Without mmap=false, WT uses memory-mapped files rather than the WT cache for reads. That defeats the purpose of trying to cap the RAM that mongoD uses via the wiredTigerCacheSizeGB option. I'm actually wondering if wiredTigerCacheSizeGB is completely useless unless one also sets mmap=false. Thoughts? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nick Judson [ 21/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Would it be possible to have writes be direct to disk (direct_io) but reads be loaded into the cache - either WT or OS? The write data could go into the cache, but it wouldn't get flushed to disk (writes would be direct). Having a read-only cache might solve the problem of large/constant flushes and also allow for fast queries. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nick Judson [ 19/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
http://winntfs.com/2012/11/29/windows-write-caching-part-2-an-overview-for-application-developers/

Apparently SQL Server uses direct IO (MS SQL is what my product currently uses). Also, if I understand correctly, a large WT cache will hamper write speed (which is what I see in my tests), but obviously it's beneficial when reading the data. If the WT cache doesn't contain the data to be read, a read request is passed to the IO subsystem. The system file cache (if used) may have the required data in memory, in which case no disk IO is required.

WT stores uncompressed data in its cache whereas the OS will be storing compressed bytes. Is it likely that there will be a significant overlap in what WT has cached and what the OS has cached - essentially double-buffering data (both uncompressed and compressed)? If that is the case, does it make sense to max out the RAM with WT cache and use direct_IO?

For write-heavy workloads like mine, the following paragraph may explain why a large WT/system file cache can hurt performance:

"Liberal use of the FlushFileBuffers API can severely affect system throughput. This is because at the file system layer, it is quite clear what data blocks belong to what file. So when FlushFileBuffers is invoked, it is also apparent what data buffers need to be written out to media. However, at the block storage layer - shown as "Sector I/O" in Figure 1, it is difficult to track what blocks are associated with what files. Consequently the only way to honor any FlushFileBuffers call is to make sure all data is flushed to media. Therefore, not only is more data written out than originally intended, but the larger amount of data can affect other I/O optimizations such as queuing of the writes." |
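As a hedged Win32 illustration of the quoted point (not MongoDB or WiredTiger code), buffered writes land in the system file cache and FlushFileBuffers is what forces them to the device, which is why liberal flushing of a large cache is costly; the file name below is a placeholder:
{code}
/* Illustrative only: buffered WriteFile calls land in the system file
 * cache; FlushFileBuffers forces the cached data for the handle to disk.
 * "journal.bin" is a placeholder file name. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE h = CreateFileA("journal.bin", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    const char record[] = "example record";
    DWORD written;
    for (int i = 0; i < 1000; i++) {
        /* Goes to the system file cache, not necessarily to the media. */
        if (!WriteFile(h, record, sizeof record, &written, NULL)) {
            fprintf(stderr, "WriteFile failed: %lu\n", GetLastError());
            break;
        }
    }

    /* Forces everything cached for this handle down to the device; calling
     * this often is the throughput cost the quoted paragraph describes. */
    if (!FlushFileBuffers(h))
        fprintf(stderr, "FlushFileBuffers failed: %lu\n", GetLastError());

    CloseHandle(h);
    return 0;
}
{code}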
| Comment by Nick Judson [ 19/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Testing against my workload, I get better performance with direct IO on both SSD and 7.2K spinners, with memory usage capped at the WT cache size. I haven't run a full spectrum of tests, but so far (with my workload) I'm not sure what the downside of direct IO is. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nick Judson [ 17/Oct/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
This ticket ( |