[SERVER-9754] Every write to journal costs additional seek to update file metadata Created: 22/May/13  Updated: 11/Jul/16  Resolved: 05/Jun/13

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: 2.4.3
Fix Version/s: 2.4.6, 2.5.1

Type: Improvement Priority: Major - P3
Reporter: Dmitry Naumov Assignee: Tad Marshall
Resolution: Done Votes: 0
Labels: pull-request
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows 7 Professional x64 SP1


Attachments: JPEG File Pattern1.jpg     JPEG File Pattern2.jpg    
Backwards Compatibility: Fully Compatible
Participants:

 Description   

The journal file is created with the FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH attributes, but FILE_FLAG_WRITE_THROUGH is not needed here and, even worse, it flushes NTFS metadata on every write operation. Removing it doesn't affect durability (FILE_FLAG_NO_BUFFERING does its job), but greatly improves performance and reduces disk load.

You can read a detailed explanation here: http://dmitry-naumov.blogspot.ru/2013/05/mongodb-vs-ms-sql-how-to-write-to.html
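For illustration only, a minimal Win32 sketch of the two ways of opening the journal file; this is not the actual mongod code, and the path and access modes are placeholders:

#include <windows.h>

// Original behaviour: unbuffered AND write-through. With write-through,
// NTFS also flushes file metadata on every WriteFile call.
HANDLE openJournalWriteThrough(const wchar_t* path) {
    return CreateFileW(path, GENERIC_WRITE, FILE_SHARE_READ, NULL,
                       OPEN_ALWAYS,
                       FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                       NULL);
}

// Proposed behaviour: unbuffered only. Per the report above, data still
// bypasses the system cache, but metadata is not forced out on each write.
HANDLE openJournalNoBuffering(const wchar_t* path) {
    return CreateFileW(path, GENERIC_WRITE, FILE_SHARE_READ, NULL,
                       OPEN_ALWAYS,
                       FILE_FLAG_NO_BUFFERING,
                       NULL);
}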



 Comments   
Comment by auto [ 29/Jul/13 ]

Author: Dmitry Naumov (DmitryNaumov) <dnaumov@paladyne.com>

Message: SERVER-9754 Remove FILE_FLAG_WRITE_THROUGH for journal files

Signed-off-by: Tad Marshall <tad@10gen.com>
Branch: v2.4
https://github.com/mongodb/mongo/commit/896ad48104be797f76dd8e65d18ff21a7be34d38

Comment by Dmitry Naumov [ 05/Jun/13 ]

I hope the less lucky users who don't have RAID-0 will benefit from it even more.

Comment by Tad Marshall [ 05/Jun/13 ]

Suggest backport to 2.4 and 2.2.

Comment by auto [ 05/Jun/13 ]

Author: Dmitry Naumov (DmitryNaumov) <dnaumov@paladyne.com>

Message: SERVER-9754 Remove FILE_FLAG_WRITE_THROUGH for journal files

Signed-off-by: Tad Marshall <tad@10gen.com>
Branch: master
https://github.com/mongodb/mongo/commit/54e11a3cfa5de27351da1c9c55d539b3b17a0b00

Comment by Tad Marshall [ 05/Jun/13 ]

I ran a subset of the cases in your test and verified the performance improvement you saw. I ran mongod.exe with --journalCommitInterval 2 and --syncdelay 10000, without and with your patch:

// no patch
|Threads|Document Size|Elapsed|
|      1|         1718|  11749|
|      1|         4418|  11806|
|      2|         1718|  11711|
|      2|         4418|  12045|
|      4|         1718|  12116|
|      4|         4418|  12384|
 
// patch
|Threads|Document Size|Elapsed|
|      1|         1718|  10002|
|      1|         4418|  10003|
|      2|         1718|  10028|
|      2|         4418|  10029|
|      4|         1718|  10078|
|      4|         4418|  10204|

For the case where the client is using j:true and the journalCommitInterval is small, this gives a 17% or better speed improvement. I'm sold. In the cases where clock time is not affected as much, we're still cutting disk I/O and so freeing up the system for other activity.

Comment by Dmitry Naumov [ 05/Jun/13 ]

The default project configuration builds for Any CPU, so on an x64 machine it will run as an x64 executable; there is no need to change anything.
Unfortunately I don't know of a way to set journalCommitInterval directly from the benchmark, as you did in your js benchmark.

Comment by Tad Marshall [ 04/Jun/13 ]

Yes, I had not set journalCommitInterval ... user error! VS apparently defaulted to 32-bit ... ? I haven't played with .NET on a 64-bit machine, and I guess I was expecting it to JIT to "native" (i.e. 64-bit) code ... I'll retest with journalCommitInterval and 64-bit and see what happens. Testing the slightly stale code from this morning, it is at least producing lines of output in minutes instead of no output in 20 minutes.

I'm testing on a RAID 0 pair of spinning disks at first. I'll rebuild and let you know what I find tomorrow.

Thanks for the tip(s)!

Comment by Dmitry Naumov [ 04/Jun/13 ]

I've updated my github repo:
  • Changed to VS 2010
  • Added package restore
  • Added progress indication

Comment by Dmitry Naumov [ 04/Jun/13 ]

Tad,
you should create your new project targeting .NET 4.0, or you can switch the target framework on the existing project.
Concerning your benchmark performance, the only idea I have is that you possibly forgot to set journalCommitInterval to 2 and syncDelay to something huge.
To debug this I would run it under the debugger (use F5 to start) and then hit pause to see the threads' stacks.

Comment by Tad Marshall [ 04/Jun/13 ]

Hi Dmitry,

I'm trying to run your C# performance test and not having much luck. This is probably user error on my part, but perhaps you could guide me in what to look for.

I cloned your https://github.com/DmitryNaumov/MongoDbVsMsSqlInsertPerformance repo and created a new project in Visual Studio 2010 (since I didn't want to install VS 2012 on this machine). I installed the MongoDB C# driver from NuGet and added references until VS stopped complaining. I'm using .NET 4.0 libraries, though it asked for 2.0 ... is that bad?

When I run your program, it prints a header and then appears to run slower than a sedated snail. In 20 minutes of elapsed time, it has consumed 0.202 seconds of total CPU time and it is doing 120 bytes of I/O per second. Performance Monitor shows it sending and receiving about 4 KB per second on the network. Looking at the threads, all I ever see is waiting ...

How would you debug this?

Tad

Comment by Dmitry Naumov [ 31/May/13 ]

Thank you, Tad! This improvement wouldn't have been possible without your involvement and your patience in reading my posts in tangled English.
I hope we will continue our cooperation on another issue, related to single-threaded compression.

Comment by Tad Marshall [ 31/May/13 ]

Hi Dmitry,

Thanks for investigating cluster sizes; it simplifies life a little to not have to modify code to optimize for that.

I've been tied up with some other things and so have been slow to do my verification testing of your patch, but I think that we are ready to make the change you suggested (remove the FILE_FLAG_WRITE_THROUGH flag) as soon as I get to testing. It shouldn't be much longer.

I see that you've reopened your pull request, so we can merge it and you will get credit for the change in the Git changelog. A small change that will make a big difference!

Thanks again for all your work and for making MongoDB faster on Windows! A lot of people will benefit from what you've done, and we really appreciate it!

Tad

Comment by Dmitry Naumov [ 30/May/13 ]

Tad,
you are right about cluster sizes. The benchmark shows no difference in results for cluster sizes from 4K to 64K.
And the documentation confirms it:

Alignment and File Access Requirements
As previously discussed, an application must meet certain requirements when working with files opened with FILE_FLAG_NO_BUFFERING. The following specifics apply:
File access sizes, including the optional file offset in the OVERLAPPED structure, if specified, must be for a number of bytes that is an integer multiple of the volume sector size. For example, if the sector size is 512 bytes, an application can request reads and writes of 512, 1,024, 1,536, or 2,048 bytes, but not of 335, 981, or 7,171 bytes.
File access buffer addresses for read and write operations should be physical sector-aligned, which means aligned on addresses in memory that are integer multiples of the volume's physical sector size. Depending on the disk, this requirement may not be enforced.

Application developers should take note of new types of storage devices being introduced into the market with a physical media sector size of 4,096 bytes. The industry name for these devices is "Advanced Format". As there may be compatibility issues with directly introducing 4,096 bytes as the unit of addressing for the media, a temporary compatibility solution is to introduce devices that emulate a regular 512-byte sector storage device but make available information about the true sector size through standard ATA and SCSI commands.
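To make the quoted requirements concrete, here is a minimal sketch of a sector-aligned, sector-multiple write under FILE_FLAG_NO_BUFFERING; the function name is illustrative, and the sector size must come from the actual volume:

#include <windows.h>
#include <malloc.h>
#include <string.h>

// Sketch only: pad the payload to a whole number of sectors and use a
// sector-aligned buffer, as FILE_FLAG_NO_BUFFERING requires.
bool writeUnbuffered(HANDLE h, const void* data, size_t len, DWORD sectorSize) {
    size_t padded = ((len + sectorSize - 1) / sectorSize) * sectorSize;
    void* buf = _aligned_malloc(padded, sectorSize);
    if (!buf) {
        return false;
    }
    memset(buf, 0, padded);
    memcpy(buf, data, len);
    DWORD written = 0;
    BOOL ok = WriteFile(h, buf, (DWORD)padded, &written, NULL);
    _aligned_free(buf);
    return ok && written == padded;
}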

Comment by Dmitry Naumov [ 29/May/13 ]

Tad,
It was obvious that all the performance improvements are related to a reduced number of disk head seeks, but now I was lucky enough to find another piece of software which, alongside my benchmarks, confirms my theory.

Cluster sizes: in my understanding, to write an 8K block into a 64K cluster the OS should read the 64K cluster, map the modified 8K into it, and after that write all 64K back.
I'm going to create a small separate partition so I can play with different cluster sizes, and we'll see whether I'm right or not. Again, it's very important to interpret the results correctly. Multiple factors exist - as we already saw with journalCommitInterval, document size + compression, disk head seeks, etc. - and combined they can lead to a false positive or negative result. Anyway, there is already a benefit for me, as I got practice with many useful tools for analyzing storage performance.

I think you can easily reproduce the same test with a slightly modified version of your .js benchmark - the results, I hope, should be the same.

Comment by Tad Marshall [ 29/May/13 ]

Hi Dmitry,

You keep posting interesting stuff ... thanks!

Re: single-threaded journal compression. I think you are right that this hurts performance in the case where writers are using j:true. I don't know the "snappy" algorithm enough to say whether we could change to multi-threaded and keep the disk format unchanged. If not, this is a delicate change potentially affecting forward-and-backward compatibility. Thanks for filing SERVER-9802 for this question!

Re: Windows cluster size and 8 KB blocks. My understanding is that cluster size is used in disk space allocation and does not determine disk I/O the way physical sector size does. That is, I think that an 8 KB write with a 64 KB cluster size will still only write 8 KB. If your measurements say otherwise, that would be interesting. Larger clusters make for a smaller partition bitmap and potentially less fragmentation and so are more efficient in that sense, but I'm not sure that we would write faster to a 64 KB cluster file system by writing larger blocks. Let me know if I'm wrong!
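For reference, both numbers can be read for a volume with a call like the sketch below (the drive letter is a placeholder): GetDiskFreeSpace reports the sector size, which governs unbuffered I/O granularity, and the cluster size, which governs allocation.

#include <windows.h>
#include <stdio.h>

int main() {
    DWORD sectorsPerCluster = 0, bytesPerSector = 0;
    DWORD freeClusters = 0, totalClusters = 0;
    // "C:\\" is a placeholder for the volume holding the journal.
    if (GetDiskFreeSpaceW(L"C:\\", &sectorsPerCluster, &bytesPerSector,
                          &freeClusters, &totalClusters)) {
        printf("sector size:  %lu bytes\n", bytesPerSector);
        printf("cluster size: %lu bytes\n", sectorsPerCluster * bytesPerSector);
    }
    return 0;
}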

Re: charts and performance improvement. It does seem like your change cuts the number of disk seeks dramatically and this probably explains most of the performance gain.

I'll try your C# program and let you know if I have trouble duplicating your result. I still haven't tried PerfView but I will do that as well.

Thanks again for all your work and research on this!

Tad

Comment by Dmitry Naumov [ 29/May/13 ]

Actually, I've found another ace up my sleeve.
Here are two graphs from Windows Performance Analyzer. The first one is for the unpatched version:

[graph attached to this ticket]

The green line at 0 is the $MFT file, the green line at 280M is the journal file (j._2).
The vertical grey lines between them are all disk head seeks!

Now compare with the patched version:

[graph attached to this ticket]

The green line at 20M is the journal. There are only a few seeks, most of them to $MFT. That's it.

Comment by Dmitry Naumov [ 28/May/13 ]

Here are the multi-threaded benchmark results. It is the same benchmark but with a fixed document size of 8918.
There are 10K documents per thread to insert, so 1 thread processes 10K documents, 2 threads process 20K documents and so on.

||Threads||Documents||Unpatched||Patched||
|1|10000|22870|11545|
|2|20000|21904|11381|
|3|30000|23801|18800|
|4|40000|26650|19169|
|5|50000|27827|20784|
|6|60000|29929|22083|
|7|70000|31585|24373|
|8|80000|33100|27405|

I expected to see better results for the patched version, so I started yet another investigation.

dur.cpp - durThread() is responsible for persisting pending journal writes (and it is also the place where we add at least 1 ms of delay to our operation). As I understand it, there is a dedicated thread which runs an endless loop within this method.
durThread calls durThreadGroupCommit(), which calls groupCommit(), which calls _groupCommit(), which calls (yep, I know this looks annoying but wait a second) WRITETOJOURNAL, which has this interesting line:

j.journal(h, uncompressed);

uncompressed looks very interesting!

        void Journal::journal(const JSectHeader& h, const AlignedBuilder& uncompressed) {
            RACECHECK
            static AlignedBuilder b(32*1024*1024);
            /* buffer to journal will be
               JSectHeader
               compressed operations
               JSectFooter
            */
            const unsigned headTailSize = sizeof(JSectHeader) + sizeof(JSectFooter);
            const unsigned max = maxCompressedLength(uncompressed.len()) + headTailSize;
            b.reset(max);
 
            {
                dassert( h.sectionLen() == (unsigned) 0xffffffff ); // we will backfill later
                b.appendStruct(h);
            }
 
            size_t compressedLength = 0;
            rawCompress(uncompressed.buf(), uncompressed.len(), b.cur(), &compressedLength);
            verify( compressedLength < 0xffffffff );
            verify( compressedLength < max );
            b.skip(compressedLength);

Wait a second!!! Do you really compress the pending buffers in a single thread?! I have 8 threads sleeping and waiting for the journal to be flushed to disk!!!

Tad, tell me that I'm wrong!
UPDATED: I don't want to insult anyone. I just meant that now it is all clear: the unpatched version is limited by disk I/O, and I expected to see better results for the patched version because the disk idles, but it is limited by single-threaded compression.

I've put all my aces on the table; I have no idea what else to prove. Maybe you can suggest additional tests to run?

Comment by Dmitry Naumov [ 28/May/13 ]

Tad, while I'm collecting the multi-threaded benchmark results, I want to point out another issue I think I saw in the code. If I remember correctly, all writes to the journal are made in 8K blocks, which is fine as long as my underlying filesystem cluster size is less than or equal to 8K. But what if it is larger, let's say 64K (which can be useful for 3TB disks)? What do you think, is it an issue or not? Should I file it as a separate issue?

Comment by Dmitry Naumov [ 28/May/13 ]

Updated benchmark results for latest sources in v2.4 (https://github.com/mongodb/mongo/tree/v2.4)
Benchmark code is available here (https://github.com/DmitryNaumov/MongoDbVsMsSqlInsertPerformance/tree/master/MongoDbBenchmark)

||Document Size||Elapsed||
|98|13608|
|818|14332|
|1718|16459|
|4418|21108|
|7118|22129|
|8018|22414|
|8918|23208|
|9918|23288|
|10918|23321|
|11918|26454|
|12918|26359|
|13918|26473|
|14918|26646|
|15918|27503|
|16918|26948|
|17918|26838|
|18918|27349|
|48918|42665|
|98918|63959|

I don't see any significant difference from the previous unpatched results.

Comment by Tad Marshall [ 27/May/13 ]

Thanks, Dmitry! Those are pretty compelling looking numbers!

Can you post the code you used to do the benchmarks to help us reproduce the results?

It would be nice to verify that there is no interaction with SERVER-9242 and SERVER-9721 since those also affected performance on some things quite a bit. That is, we should verify that your change does the same thing (improves performance) with the other fixes, since they are already in the code (2.4.4-rc0 for example).

Did you close your pull request? (https://github.com/mongodb/mongo/pull/431)

Comment by Dmitry Naumov [ 27/May/13 ]

Here are my benchmarks with different document sizes.
Same environment as in the previous post. All tests were run on the HDD with journalCommitInterval set to 2 and syncdelay set to 3000 to prevent MMF flushing during the test.

Document size is the size of the BSON-serialized stream, not how it is transferred over the network or stored on disk.
Elapsed time is the average (btw, should we use best, not average?) of three measurements, in milliseconds.

Unpatched version refers to r2.4.3 tag https://github.com/DmitryNaumov/mongo/tree/r2.4.3
Patched version with one line "fix" is https://github.com/DmitryNaumov/mongo/tree/SERVER-9754

||Document size||Unpatched||Patched||Ratio||
|98|14261|10072|0.71|
|818|15310|10064|0.66|
|1718|18924|10069|0.53|
|4418|21345|10197|0.48|
|7118|21982|10204|0.46|
|8018|22634|10203|0.45|
|8918|23034|10434|0.45|
|9918|23823|10422|0.44|
|10918|24690|11146|0.45|
|11918|26677|12885|0.48|
|12918|27371|16289|0.60|
|13918|26744|18766|0.70|
|14918|27171|19269|0.71|
|15918|27942|19649|0.70|
|16918|27158|20259|0.75|
|17918|28130|20111|0.71|
|18918|28737|20059|0.70|
|48918|42022|30389|0.72|
|98918|65974|54118|0.82|
|208918|111400|105800|0.95|

Some observations:

  • the unpatched version produces much more load on the HDD, according to performance counters
  • smaller documents gain more from the patch than bigger documents - something else limits throughput (CPU: serialization/compression?)
  • for small documents in some range (<10000) the patched version produces the same results no matter what the document size is

Maybe there is some correlation between cluster size and document size, but it's hard to tell because of the number of factors involved, like bson_size != storage_size, journal compression, etc.

UPDATED: I'm rerunning the benchmark for large documents because I forgot to restart mongod between tests.
UPDATED: I've posted the new results. I'm now using the best of 5 attempts because it is hard to produce stable results with averages.

Comment by Dmitry Naumov [ 24/May/13 ]

I'm doing some testing with different document sizes and numbers of concurrent threads. It suddenly struck me that in order to obtain stable results every single benchmark run should take less than the syncdelay; otherwise I see additional disk activity when the MMF's dirty pages are flushed to disk.

Comment by Dmitry Naumov [ 23/May/13 ]

Tad, you got me right. The speed of inserts is limited by journalCommitInterval, and only in the case with j:true. To see the difference between patched/unpatched we should use the smallest journalCommitInterval possible, which is 2. Imho we don't see a speedup with large documents because even with compression they are larger than 8K, meaning the journal grows on every insert. I don't want to guess; it would be better to preallocate the journal and re-run the benchmark.

I do understand that a single-threaded benchmark is very synthetic, and I posted multi-threaded results in my blog. What I clearly saw is that the patch greatly reduces disk load (performance counter "disk queue size", or Resource Monitor's "Active Time"), making the disk available for other consumers, like queries for example. And I think this is a real benefit for all mongo users and clients.

Comment by Tad Marshall [ 23/May/13 ]

I was wondering about preallocation. We do a test on Linux to decide if preallocation is faster, but on Windows it seems that we just don't preallocate. I'm not sure why the code is different for the two platforms.

Testing with both small and large (and very large? maybe 1 MB or larger?) documents seems like a good idea. If a change is a huge win for small documents and not worse for larger ones, that would be good to know.

Comment by Dmitry Naumov [ 23/May/13 ]

Environment:

Windows 7 Professional SP1 x64
Intel Core i7-2600 @ 3.40 GHz
SSD Corsair Force 3
HDD Seagate ST320DM000 320GB @ 7200 rpm
16 GB RAM

More than 8 GB of RAM is available during the benchmark (which may be important for MongoDB's memory-mapped-file architecture). The absence of other CPU/IO activity during the benchmark was verified with Process Explorer + Resource Monitor. There is a lot of pre-trimmed free space on the SSD.

Unpatched version refers to r2.4.3 tag https://github.com/DmitryNaumov/mongo/tree/r2.4.3
Patched version with one line "fix" is https://github.com/DmitryNaumov/mongo/tree/SERVER-9754

Case 1: without j:true, default journalCommitInterval

Unpatched HDD: 13846, 13698, 12619
Unpatched SSD: 11298, 10864, 11342

Patched HDD: 13387, 12764, 13646
Patched SSD: 10493, 12577, 10690

||Device||Unpatched||Patched||
|HDD|13388|13266|
|SSD|11168|11253|

Case 2: with j:true, default journalCommitInterval

Unpatched HDD: 79613, 79544, 79253
Unpatched SSD: 72018, 72321, 72049

Patched HDD: 75782, 76490, 75897
Patched SSD: 72201, 71609, 71523

||Device||Unpatched||Patched||
|HDD|79470|76056|
|SSD|72129|71778|

Case 3: without j:true, journalCommitInterval 2

Unpatched HDD: 14385, 13285, 13750
Unpatched SSD: 11277, 11851, 10132

Patched HDD: 13814, 13837, 12314
Patched SSD: 10813, 12010, 11057

||Device||Unpatched||Patched||
|HDD|13807|13322|
|SSD|11087|11293|

Case 4: with j:true, journalCommitInterval 2 (this one really differs from your results)

Unpatched HDD: 23639, 23003, 22997
Unpatched SSD: 18149, 16702, 16049

Patched HDD: 19731, 20750, 19458
Patched SSD: 19059, 19414, 17603

||Device||Unpatched||Patched||
|HDD|23213|19980|
|SSD|16967|18692|

Case 5: same as case 4, but with small documents (only 10 ints in array, instead of 10K)

Unpatched HDD: 3896, 3572, 3468
Unpatched SSD: 2614, 2573, 2223

Patched HDD: 2329, 2322, 2283
Patched SSD: 2240, 2212, 2180

||Device||Unpatched||Patched||
|HDD|3645|2311|
|SSD|2470|2211|

I'll prepare averages and tables later.

It looks like, imho, this fits my theory: cases 1-3 should not depend on the patch; the differences in the average results are within the measurement deviation. Case 4 shows an improvement for the HDD, as I expected, but a real breakthrough is achieved only in case 5 with small documents.

That means my one-liner fix is not enough. We should either preallocate journal files or grow them in larger blocks. I can do this and then we can re-measure case 4.

Comment by Tad Marshall [ 23/May/13 ]

Hi Dmitry,

Formatting first, since it's easy:

The indented block with line numbers is done by putting {code} on the line before the quoted text and another {code} on the line after it. You can get a scrolling block without line numbers by using {noformat} before and after instead of {code}. I quoted the wavy braces with backslash in this paragraph to make them visible.

Tables are created using the vertical bar symbol "|". Each line begins and ends with one or two vertical bars, with text in between and bars separating the columns. Use doubled bars for the header, single bars for content lines.

||header 1||header 2||
|data 1|data 2|

produces

header 1 header 2
data 1 data 2

Some of this is documented at https://jira.mongodb.org/secure/WikiRendererHelpAction.jspa?section=texteffects , which you can get to from the "?" symbol below the comment entry box. You can click the button to the left of the "?" to see a preview to see if the formatting is working as you expected.

If I understand your comments, I should test j: true to see the changes made by the patch. Did I get that right?

Paraphrasing, the speed of inserts will be limited (with j: true) by the journalCommitInterval, so shorter intervals should make everything faster and may make the improvements from the patch more visible. Does that sound right to you?

I think you're saying that inserting documents that are over 8 KB each (as with the 10,000 element array) will mask the effect of the change by growing the journal file on every write. This would make the "unnecessary" metadata writes "necessary", preventing the optimization from happening.

Note that the journal is compressed, so an 8 KB document may be smaller than 8 KB in the journal.

My understanding of the documentation (and interpreting the flag name FILE_FLAG_WRITE_THROUGH as meaningful) suggests that the metadata will be written in any case, but that FILE_FLAG_WRITE_THROUGH causes it to happen at the time of the write. This may be why we don't always see a speedup; we haven't prevented the writes, just moved them in time.

It is also important to remember that this is a single-threaded (single client) test, and that the effects may be different with multiple clients trying to write at the same time. It is also vastly more complicated with multiple clients, since we enter a world of write locks and yielding heuristics. It might be informative to do additional benchmarking with parallel shells (look for startParallelShell in the JavaScript source code in src/mongo/shell/servers_misc.js and in the jstests files) and with benchRun (http://www.mongodb.org/about/contributors/js-benchmarking-harness/). We would like to improve the multi-user case, since that is the common use case.

Reducing unnecessary disk I/O should improve all cases, so these goals are not necessarily in conflict, but we need to test both to be sure.

Thanks for your work on this!

Tad

Comment by Dmitry Naumov [ 23/May/13 ]

Tad,
I have run your benchmark (my results are very similar to yours... except case 4), and while I'm trying to prepare a nice and shiny report I want you to think about one thing. In my post I mentioned that journal files should actually be preallocated to get maximum performance, but it doesn't really matter in my benchmark. Why? When the journal file grows, metadata has to be updated. This costs disk I/O and you can't avoid it. The journal grows in 8K blocks (I can find that in the sources if needed). But it doesn't grow on every test iteration in my benchmark, because my documents are small, much smaller than 8K. Your benchmark is the opposite: an array of 10K integers is larger than 8K, which means that every time you write to the journal it grows and the filesystem metadata is also updated. That's my theory; I'll check with PerfView whether metadata really is updated on every insert when the journal grows.

As I said before, when inserting large documents the key point is to either preallocate the journal file or (as MS SQL does) let it grow in huge chunks, not every 8K.
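To make that concrete, here is a minimal sketch of the grow-in-large-chunks idea using plain Win32 calls; this is not mongod's code, and the chunk size is an arbitrary illustrative value:

#include <windows.h>

// Sketch only: extend the file by one large chunk so that subsequent
// small appends do not change the file size (and so do not force a
// metadata update on every write).
bool growFileByChunk(HANDLE h, LONGLONG chunkBytes) {
    LARGE_INTEGER size;
    if (!GetFileSizeEx(h, &size)) {
        return false;
    }
    LARGE_INTEGER newEnd;
    newEnd.QuadPart = size.QuadPart + chunkBytes;
    if (!SetFilePointerEx(h, newEnd, NULL, FILE_BEGIN)) {
        return false;
    }
    // Allocates the space and updates the file size metadata once, up front.
    return SetEndOfFile(h) != 0;
}

// Hypothetical usage: growFileByChunk(journalHandle, 256LL * 1024 * 1024);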

Comment by Dmitry Naumov [ 23/May/13 ]

Tad,
thank you for your test and the detailed results & explanation. I will run it shortly, as soon as I get to the same desktop I used for my measurements.
But before that, let me dare to comment on some of the results:

1. without j: true, default journalCommitInterval: without j: true the disk should not matter, and the results should be the same for both HDD and SSD. With the default journalCommitInterval of 100 ms it also doesn't matter whether we do additional disk I/O to update the timestamp or not. That's my explanation; I don't yet have an answer for why you saw the -13% result.

2. with j: true, default journalCommitInterval: as I explained in one of my posts (http://dmitry-naumov.blogspot.ru/2013/05/mongodb-vs-ms-sql-journalcommitinterval.html), the default journalCommitInterval adds a 100/3 + 1 = 34 ms penalty to each insert. That's why you can't get results better than 2000 iterations * 34 ms = 68000 ms. In my understanding this added latency hides any additional disk I/O costs, and again the HDD and SSD should score nearly the same results.

3. without j: true, journalCommitInterval 2: same as case 1, the benchmark doesn't depend on the disk at all, so the results should be close to those obtained in case 1. Yes, the journal is updated more frequently, and I would agree that it may affect other queries which hit the disk, but I don't see how this affects the benchmark.

4. with j: true, journalCommitInterval 2: imho the only case where my patch matters. What puzzles me is that the results are only 2 times better than in case 2. journalCommitInterval is reduced from 100 to 2, which means the total added latency for the test is reduced from 68000 to 2000, and what we got is 38314 vs 80125 (unpatched) and 36047 vs 81819 (patched). If you saw my post where I figured out the effect of journalCommitInterval, my improvement was from 29 inserts/sec to 700 inserts/sec, which makes sense to me. Also, in this case the unpatched SSD should outperform the HDD by a wide margin, because disk I/O costs much less on an SSD.

It would be nice if you could re-run case 4, or maybe you have an explanation for such a small improvement in case 4 vs case 2.

I'll come back shortly with my results from your benchmark, but to remove additional factors I am going to run it against pure r2.4.3 and against my patched version of 2.4.3, which differs only in one line.

BTW, Tad, how do you post such well-formatted text? What do you use to format code and tables? Thanks!

Comment by Tad Marshall [ 22/May/13 ]

Hi Dmitry,

I did some benchmarking today, and the results I got were not the solid win I was hoping for. Out of 8 cases I tested, your patch made 5 of them faster and 3 slower.

I wrote a test script and ran it on two builds of mongod.exe; one with the fix for SERVER-9721 and one with that fix and your patch.

Here's the test file I ran. The "'--journalCommitInterval', '2'" arguments were only there for the sections marked "journalCommitInterval 2" in the tables. I passed either "--eval 'waitForJournal=false'" or "--eval 'waitForJournal=true'" on the command line. I tested on a Seagate 2 TB hard disk and on a Samsung 840 Pro SSD.

// Test the performance of saving to a collection with and without j: true.
//
// Related to SERVER-9754.
//
 
var makeArray = function () {
    var arr = [];
    for (var i = 0; i < 10000; ++i) {
        arr.push(i);
    }
    return arr;
}
 
var writeData = function (testDB, arr, waitForJournal) {
    var howMany = 2000;
    for (var i = 0; i < howMany; ++i) {
        testDB.test.insert({ i: i, arr: arr });
        if (waitForJournal) {
            testDB.test.runCommand({ getLastError: 1, j: true });
        }
    }
}
 
var time = function (f, testDB, arr, waitForJournal) {
    var start = new ISODate();
    f(testDB, arr, waitForJournal);
    var end = new ISODate();
    return (end - start);
}
 
var doTest = function (waitForJournal) {
    // Time creation of the test collection
    //
    var baseName = 'save_performance';
    var dbpath = '/data/db/perfTests/' + baseName
    var port = allocatePorts(1)[0];
    var mongod = startMongod('--port', port, '--dbpath', dbpath, '--journalCommitInterval', '2');
    sleep(50);
    testDB = new Mongo('localhost:' + port).getDB(baseName);
    testDB.dropDatabase();
    sleep(50);
    var arr = makeArray();
    var saveTime = time(writeData, testDB, arr, waitForJournal);
    stopMongod(port, 15);
 
    jsTest.log('Test ' + (waitForJournal ? 'with' : 'without') + ' waitForJournal took ' + saveTime + ' ms');
}
 
doTest(waitForJournal);

Each of the numbers below is the number of milliseconds to complete the test, and is the average of three trials, each in a clean directory. There was some variation in some of the timings, and I don't have a theory for why that would happen. The tests were run on an otherwise idle desktop computer under Windows 7, no virtualization and nothing competing for CPU or disk I/O. But the variability does not explain the differences. In the first data line below, the 14718 number is the average of 14727, 14727 and 14700 while the 16650 number is the average of 19808, 15407 and 14735. This makes the best of the "patched" times slower than the worst of the "unpatched" times.

Windows without j: true, default journalCommitInterval:

||Device||Unpatched||Patched||Speedup||
|spinning disk|14718|16650|-13.13%|
|SSD|13590|13282|2.27%|

Windows with j: true, default journalCommitInterval:

||Device||Unpatched||Patched||Speedup||
|spinning disk|80125|81819|-2.11%|
|SSD|76154|75635|0.68%|

Windows without j: true, journalCommitInterval 2:

||Device||Unpatched||Patched||Speedup||
|spinning disk|12397|13379|-7.92%|
|SSD|11705|11351|3.02%|

Windows with j: true, journalCommitInterval 2:

||Device||Unpatched||Patched||Speedup||
|spinning disk|38314|36047|5.92%|
|SSD|33457|32469|2.95%|

Since these results are so different from what you posted, I'm wondering if you can provide a better test, one that will show results like what you saw.

Your explanation makes logical sense; don't do disk I/O to update timestamps, and things should be faster. We may be seeing some of this in these results, but we're also seeing slowdowns that don't match the theory.

The fix for SERVER-9721 went into the master and 2.4 branches today and so it should show up in the nightly builds soon.

If you can post a better test than what I tried, that would be great, thanks!

Tad

Comment by Dmitry Naumov [ 22/May/13 ]

Tad,
at a glance SERVER-9721 may be connected with my findings, but I don't see any connection with SERVER-9242. Let me know when SERVER-9721 is fixed, so I can recheck everything again.

BTW, I've submitted my proposed solution https://github.com/mongodb/mongo/pull/431

Comment by Tad Marshall [ 22/May/13 ]

Hi Dmitry,

Thanks a lot for this report!

If you have been looking at version 2.4.3 on Windows (as your Affects Version and Environment say), then you are hitting both SERVER-9242 and SERVER-9721, which clobber Windows performance. SERVER-9242 is fixed in the 2.4 nightly and the fix for SERVER-9721 is coming soon. It would be great if you could update your results once SERVER-9721 is fixed and see whether journaling continues to be a performance bottleneck for you.

We'll be testing your suggestion; it is much appreciated!

Tad
