[SERVER-31574] Throughput degradation with j:true after restart of mongod Created: 15/Oct/17  Updated: 07/Jan/18  Resolved: 07/Nov/17

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 3.6.0-rc0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Susan LoVerso
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File 349_fresh     HTML File 349_restart     HTML File 360_fresh     HTML File 360_restart     PNG File Screen Shot 2017-10-16 at 2.55.31 pm.png     PNG File disk.png     PNG File jtrue.png     PNG File jtrue2.png     File metrics.349_3     File metrics.360_3    
Issue Links:
Duplicate
duplicates WT-3681 Don't truncate the last log file in r... Closed
Related
related to WT-3681 Don't truncate the last log file in r... Closed
Operating System: ALL
Participants:

 Description   

Test consists of 20 threads inserting empty documents with j:true:

function run() {
    for (i = 0; i < 10000; i++)
        db.c.insert({}, {writeConcern: {j: true}})
}

Throughput after the first and each subsequent restart of mongod is about 40% lower than throughput on the initial startup of mongod. (The first run below was on 3.4.9 and the second on 3.6.0-rc0, but that turned out not to be the important difference: it was the restart that reduced performance, on all versions tested.)
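
For reference, a minimal end-to-end driver for the fresh-vs-restart comparison might look like the following sketch (the dbpath, logpath, and the way the 20 parallel mongo shells are launched are assumptions based on the description above):

    # start a fresh mongod (paths are placeholders)
    mongod --dbpath /data/db --logpath /data/db/mongod.log --fork

    run_test() {
        for t in $(seq 20); do
            mongo --eval '
                for (i = 0; i < 10000; i++)
                    db.c.insert({}, {writeConcern: {j: true}})
            ' &
        done
        wait
    }

    time run_test    # throughput on the initial startup

    # restart mongod and repeat; this second run is the one that is ~40% slower
    mongod --dbpath /data/db --shutdown
    mongod --dbpath /data/db --logpath /data/db/mongod.log --fork
    time run_test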



 Comments   
Comment by Susan LoVerso [ 24/Oct/17 ]

I sent bruce.lucas the diff in email and he confirms that the fix in WT-3681 fixes this problem.

Comment by Susan LoVerso [ 24/Oct/17 ]

I have figured this out - during recovery we are truncating the new log file we just created and pre-allocated. There are paths that take care to avoid that, and I found one path that does not. I have opened WT-3681 for this WiredTiger source change. bruce.lucas, can you test a 1-line change to WT to see if it improves your run? With the fix I see the log file not getting truncated and remaining pre-allocated. But I've never seen any performance difference on any machine I've run on.

Comment by Susan LoVerso [ 23/Oct/17 ]

bruce.lucas I can reproduce what you see with the log file. My log file reaches 100 MB quickly enough that it doesn't seem to impact throughput on this machine, but I do see the log file starting small and growing rather than being pre-allocated.

Comment by Susan LoVerso [ 20/Oct/17 ]

That is unexpected. On restart, yes, we remove any pre-allocated files that existed from an earlier run. However, on wiredtiger_open we will create the next, new log file and go through the exact same pre-allocation path that the internal thread takes when it creates the pre-alloc files. So the log file in use should already be pre-allocated and ready to go (including an fsync after the pre-allocation OS call returns) before mongod is even finished initializing. This will need more investigation.

Comment by Bruce Lucas (Inactive) [ 20/Oct/17 ]

sue.loverso, I think I see what's going on.

I was able to eliminate the performance reduction on restart by removing the journal files before the restart.

If I don't remove the journal files before the restart, watching the journal directory while the test runs I see

  • on the initial run there is a pre-allocated 100 MB log file (and two 100 MB preplog files) that don't change in size, so they're being overwritten.
  • after the restart, however, instead of a pre-allocated 100 MB log file I see a log file that is steadily growing (and two 100 MB preplog files), until it reaches 100 MB and then switches to one of the preplog files. I imagine the performance decline, extra writes to disk, and lack of merges are due to updating the file metadata as its size changes.
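
A minimal way to watch this (a sketch; the journal path assumes the default dbpath, and WiredTigerLog.* / WiredTigerPreplog.* are the usual WiredTiger journal file names):

    # watch the journal directory while the insert test runs;
    # on a fresh start the in-use WiredTigerLog.* file sits at ~100 MB,
    # after a restart it starts small and grows
    watch -n 1 'ls -l /data/db/journal'

    # removing the journal files before the restart (mongod stopped,
    # throwaway test instance only) is what eliminated the slowdown for me
    rm /data/db/journal/WiredTigerLog.* /data/db/journal/WiredTigerPreplog.*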

I didn't see this before because the test didn't run long enough, but at the point where it switches to the pre-allocated log file the performance goes back up:

This could explain reports we've seen of initial low performance after a restart.

Comment by Vamsi Boyapati [ 19/Oct/17 ]

Statistics of my successful repro are attached: 349_fresh, 349_restart, 360_fresh, 360_restart

The machine on which I ran these tests is `ocelot`

Comment by Susan LoVerso [ 19/Oct/17 ]

All of these are on a 3.6 tree. I did 2 runs of the benchrun command (writeCmd:true) on both the fresh and restarted mongod:

Fresh:
     "totalOps/s" : 44550.645528354486,
     "totalOps/s" : 42683.899114884065,
Restart:
     "totalOps/s" : 43692.905267700866,
     "totalOps/s" : 48128.65077857071,

With writeCmd:false I see results in the 100-110K ops/sec range.

These are run on my AWS instance on the SSD on /mnt/fast.

Comment by Bruce Lucas (Inactive) [ 19/Oct/17 ]

What kind of numbers are you seeing? Here are my benchrun numbers, before and after restart:

	"totalOps/s" : 30337.019862240675,
   	"totalOps/s" : 23456.449623449553,

If your numbers are much lower than that, it may affect the timing as well.

What kind of disk? I'm using SSD.

vamsi.krishna, what are the details of your successful repro?

Comment by Susan LoVerso [ 19/Oct/17 ]

I used exactly that, within a function run(), and then timed the run in my script. My times were always identical. I even upped it to 100000 inserts but there was still no difference in time to execute. Since the times were the same, I did not dive into the FTDC data. With writeCmd:true it is much slower than with false, but there is still no change fresh versus restart (using the benchRun command).

Comment by Bruce Lucas (Inactive) [ 19/Oct/17 ]

That's interesting - I find the same thing: it does not reproduce using that benchRun command. It does reproduce using the original shell loop posted on the ticket, e.g.

    for i in $(seq 20); do
        mongo --eval '
            for (i = 0; i < 10000; i++)
                db.c.insert({}, {writeConcern: {j: true}})
        ' &
    done        
    wait

I find it also reproduces using benchRun specifying writeCmd:true. The difference is that in both repro cases the throughput is about an order of magnitude slower; I suspect that creates some timing interaction with journal flushes. The question is: why is that different after a restart?
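
(For completeness, the benchRun variant is just the command from the comment below with writeCmd:true; a sketch:)

    mongo --eval '
        benchRun({
            ops: [{
                op: "insert",
                doc: {},
                ns: "test.c",
                writeCmd: true,
                writeConcern: {j: true}
            }],
            seconds: 30, parallel: 20
        })'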

Comment by Susan LoVerso [ 19/Oct/17 ]

I have not been able to reproduce this. I ran the function above, spawning threads and timing the whole thing on both a fresh and restarted mongod on both 3.4 and 3.6. I still see the huge throughput drop from 3.4 -> 3.6 that is SUPPORT-2435. But I see no difference in throughput within the same release with fresh versus restarted. I've been running:

        mongo --eval '
            benchRun({
                ops: [{
                    op: "insert",
                    doc: {},
                    ns: "test.c",
                    writeCmd: false,
                    writeConcern: {j: true}
                }],
                seconds: 30, parallel: 20
            })'

Comment by Eric Milkie [ 16/Oct/17 ]

By “instance” are you referring to the Amazon definition of that term, or do you mean “mongod process instance”?

Comment by Bruce Lucas (Inactive) [ 16/Oct/17 ]

So it looks like the issue is not 3.4.9 vs 3.6.0-rc0, but rather performance on a freshly started instance vs performance after a restart. For some reason the performance degradation noted above is seen after the first (and after each subsequent) restart, and that effect is seen on both 3.4.9 and 3.6.0-rc0.

Comment by Bruce Lucas (Inactive) [ 16/Oct/17 ]

Yes, my test was standalone, single ssd disk.

vamsi.krishna, from the FTDC data we can see two disks in your setup, apparently striped using RAID or LVM? Also there are some periods of very low throughput, accompanied by very high average wait times for one or the other of the disks. I wonder what the reason for that is. Your setup shows no writes merged in either case, and I suspect that accounts for the differences in performance.
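
(If it helps, a sketch of how to confirm the disk topology and per-disk wait/merge behavior with standard Linux tools; device names will differ on your machine:)

    # show the block device layout, including any md RAID or LVM stacking
    lsblk
    cat /proc/mdstat     # md RAID status, if md is in use

    # per-device extended stats while the test runs; wrqm/s shows merged
    # write requests and the await columns show average wait times
    iostat -x 1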

Comment by Vamsi Boyapati [ 16/Oct/17 ]

Attached please find diagnostic data from my runs.
349_3 from 3.4.9: metrics.349_3
360_3 from 3.6.0-rc0: metrics.360_3

Comment by Eric Milkie [ 16/Oct/17 ]

Since the absolute performance is so different (inserts per sec), it might be affecting the results. We are going to try the reproducer again with a faster disk.

Comment by Alexander Gorrod [ 16/Oct/17 ]

vamsi.krishna Could you please attach the diagnostic data from your runs to the ticket?

Comment by Vamsi Boyapati [ 16/Oct/17 ]

I tried to reproduce this on standalone instances and did not see much difference. Below are my results:

The left side is 3.4.9 and the right side is 3.6.0-rc0.

I ran the test on a machine with 40 CPUs and 128 GB RAM.

Any suggestions for reproduction?

Comment by Alexander Gorrod [ 15/Oct/17 ]

vamsi.krishna and neha.khatri, please reproduce this performance degradation and start investigating the root cause of the difference.

Comment by Bruce Lucas (Inactive) [ 15/Oct/17 ]

The disk metrics are interesting (3.4.9 left, 3.6.0-rc0 right):

  • The rate of "sda bytes written" to the disk has not gone down even though the rate of inserts has gone down and the rate of "log bytes written" by the journaling system has gone down correspondingly. We are writing more bytes at the disk layer even though we're not writing more bytes overall to the journal.
  • We have very large write amplification (bytes written to disk / bytes written to journal) presumably due to the frequent journal flushes to implement j:true. The preceding point implies that the write amplification has gone up, from 5.6x under 3.4.9 to 8.4x under 3.6.0-rc0.
  • Rate of write requests merged has gone to nearly 0 under 3.6.0-rc0, suggesting a more random pattern of writes at the disk layer.

Theories:

  • A change in the way journal flushes are managed in the integration layer has had a side effect of reducing the efficiency of journaling.
  • A change in the journal system has reduced the efficiency of journaling in the presence of frequent flushes.
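
A rough way to measure the write-amplification ratio above outside of FTDC, sketched with a placeholder device name and assuming the WiredTiger "log bytes written" serverStatus statistic:

    # device bytes written come from /proc/diskstats (field 10 is sectors
    # written, 512 bytes each); journal bytes come from serverStatus
    disk_bytes() { awk '$3 == "sda" {print $10 * 512}' /proc/diskstats; }
    log_bytes()  { mongo --quiet --eval 'print(db.serverStatus().wiredTiger.log["log bytes written"])'; }

    D1=$(disk_bytes); L1=$(log_bytes)
    sleep 60                            # sample while the insert test runs
    D2=$(disk_bytes); L2=$(log_bytes)
    awk -v d=$((D2 - D1)) -v l=$((L2 - L1)) \
        'BEGIN { printf "write amplification: %.1fx\n", d / l }'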