[SERVER-12733] Flush mmap files in parallel to achieve better flush times on Windows Created: 14/Feb/14  Updated: 06/Dec/22  Resolved: 14/Sep/18

Status: Closed
Project: Core Server
Component/s: MMAPv1, Performance, Storage
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Critical - P2
Reporter: Anil Kumar
Assignee: Backlog - Storage Execution Team
Resolution: Won't Fix
Votes: 1
Labels: Windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Storage Execution
Participants:

 Description   

Presently, mmap'ed files are flushed sequentially, one file after another. This results in long flush times on Azure / Windows platforms, where the OS is not able to do concurrent flushes of the contents of a file. See SERVER-12401 for more details.

The issue is especially critical if there are a lot of random updates that dirty large parts of the mmap'ed region. One way to address it, which is available in the short term and can be done solely on the mongod side, is to flush database files in parallel. We could see that this does result in the OS flushing data in parallel and achieves higher throughput. This could be one of the ways to get better flush times on all platforms.

The proposed changes for number 2 are as follows:
1. A fixed number of threads (8 is proposed) using mongo::ThreadPool to process file flushes.
2. MongoFile::_flushAll will now schedule 1 file flush per file into this thread pool. When all flush requests are done, _flushAll will finish.
3. A change of _globalFlushMutex (a Windows-only lock) to a read-write lock, so that WRITETODATAFILES would take an exclusive lock and file flushes would take a read lock. Individual files are allowed to flush in parallel with each other per SERVER-7378, but not in parallel with WRITETODATAFILES. Also, we will ensure the lock is only held for the duration of the FlushViewOfFile call, and not the additional FlushFileBuffers call.
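
The three changes above can be sketched roughly as follows. This is a minimal, portable illustration and not the actual mongod code: it uses std::thread and std::shared_mutex in place of mongo::ThreadPool and the real _globalFlushMutex, and the names FileFlusher, viewFlush, bufferFlush, writeToDataFiles and flushAll are hypothetical. The two callbacks stand in for the FlushViewOfFile and FlushFileBuffers calls.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for _globalFlushMutex, turned into a reader-writer lock.
std::shared_mutex globalFlushMutex;

// Hypothetical per-file flusher. viewFlush stands in for FlushViewOfFile,
// bufferFlush for FlushFileBuffers.
struct FileFlusher {
    std::function<void()> viewFlush;
    std::function<void()> bufferFlush;

    void flush() const {
        {
            // Shared (read) lock: flushes may overlap each other,
            // but never an exclusive holder.
            std::shared_lock<std::shared_mutex> lk(globalFlushMutex);
            viewFlush();
        }
        // The lock is released before the second, slower call.
        bufferFlush();
    }
};

// Stand-in for WRITETODATAFILES: holds the lock exclusively while applying writes.
void writeToDataFiles(const std::function<void()>& apply) {
    std::unique_lock<std::shared_mutex> lk(globalFlushMutex);
    apply();
}

// Flushes every file using at most numThreads workers and returns
// only after all flushes have completed.
void flushAll(const std::vector<FileFlusher>& files, std::size_t numThreads = 8) {
    assert(numThreads > 0);
    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        // Each worker repeatedly claims the next unflushed file index.
        for (std::size_t i = next.fetch_add(1); i < files.size(); i = next.fetch_add(1)) {
            files[i].flush();
        }
    };
    std::vector<std::thread> pool;
    const std::size_t n = std::min(numThreads, files.size());
    for (std::size_t t = 0; t < n; ++t) {
        pool.emplace_back(worker);
    }
    for (auto& t : pool) {
        t.join();  // barrier: all flush requests are done
    }
}
```

In this sketch the workers are created per flushAll call for brevity; a long-lived pool, as the proposal describes, would avoid the thread start-up cost on every flush cycle.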



 Comments   
Comment by Andy Schwerin [ 24/Mar/14 ]

On systems supporting the POSIX mmap interface, you might be able to get away with one thread that msyncs all the files two times, once with the MS_ASYNC flag set, and once with the MS_SYNC flag set. The first pass would let the OS start scheduling all of the disk I/O, while the second would essentially provide a barrier to wait for it to finish. Maybe there's a better way to handle that barrier, too.
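
A minimal sketch of that two-pass idea, assuming a POSIX system. MappedRegion and flushTwoPass are hypothetical names, and each addr is assumed to be the page-aligned start of a mapping returned by mmap. Note that the Linux msync man page describes MS_ASYNC as a no-op since kernel 2.6.19, so on Linux the first pass may not change much.

```cpp
#include <sys/mman.h>

#include <cstddef>
#include <vector>

// Hypothetical description of one mapped file: start address and length in bytes.
struct MappedRegion {
    void* addr;
    std::size_t length;
};

// Single-threaded two-pass flush. Returns true if every msync call succeeded.
bool flushTwoPass(const std::vector<MappedRegion>& regions) {
    bool ok = true;
    // Pass 1: ask the OS to start writing back every region without waiting.
    for (const auto& r : regions) {
        if (msync(r.addr, r.length, MS_ASYNC) != 0) {
            ok = false;
        }
    }
    // Pass 2: acts as the barrier; each call blocks until that region is written.
    for (const auto& r : regions) {
        if (msync(r.addr, r.length, MS_SYNC) != 0) {
            ok = false;
        }
    }
    return ok;
}
```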

Comment by Alexander Komyagin [ 14/Feb/14 ]

I wouldn't expect it to make any significant difference on platforms that are already doing async disk I/O under the hood (e.g. Linux), but it should help on Windows according to our tests.
