[SERVER-23622] Inconsistent throughput during insert workload Created: 08/Apr/16  Updated: 21/Dec/16  Resolved: 10/Nov/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Susan LoVerso
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 23622.0901.tar.gz     File 23622.20.tar.gz     File 23622.nolog.tar.gz     File 23622.results.tar.gz     File 2816.20.tar.gz     File aggressive.43min.tar.gz     PNG File cs29611.png     File diagnostic.data.tar     File exp.tar.gz     PNG File lagging.png     File patch-diagnostic.data.tar     PNG File patch.png     PNG File run-branch.png     PNG File run8-zoom.png     File run8.tar.gz     PNG File stalls.png    
Issue Links:
Duplicate
is duplicated by SERVER-18081 Tailing the oplog requires paging in ... Closed
Related
is related to SERVER-18081 Tailing the oplog requires paging in ... Closed
is related to SERVER-22831 Low query rate with heavy cache press... Closed
is related to SERVER-26988 Secondary delay causes large drop in ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

With a heavy insert load and a secondary that is delayed, the cache on the primary fills to 100% and operation rates drop. Here's a run showing the behavior on the primary with the secondary delayed due to lag; a similar effect is seen if the secondary is intentionally delayed using slaveDelay.

  • from D-E the cache is 95% full and insert rate drops considerably, possibly due to application threads doing evictions?
  • F-G and H-I seem to be related to checkpoints, possibly also in combination with the full cache?
  • the rate of pages walked for eviction is generally very high, about 6k times the rate of pages actually evicted, suggesting that the issue is difficulty finding pages to evict to keep the cache at target levels

The high rate of pages walked for eviction suggests a connection to SERVER-22831, which also showed that symptom in connection with a full cache; however, the above run was on 3.2.5-rc1, where SERVER-22831 was fixed, so it seems there is a different issue here.
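
For anyone reproducing this, the cache fill level and dirty bytes can be watched live from the shell; a minimal polling sketch, with stat names assumed from the serverStatus().wiredTiger.cache section (exact names vary across versions):

# Poll cache fill level and dirty bytes once per second (sketch only).
while true; do
    mongo --quiet --eval '
        var c = db.serverStatus().wiredTiger.cache;
        print(Date(),
              "fill %:", (100 * c["bytes currently in the cache"] /
                                c["maximum bytes configured"]).toFixed(1),
              "dirty MB:", (c["tracked dirty bytes in the cache"] / 1048576).toFixed(0));
    '
    sleep 1
done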

The above test involved

  • 3.2.5-rc1
  • 2-node replica set
  • 25 GB cache
  • 100 GB oplog
  • 5 threads inserting 800 byte documents into 5 separate collections

# Start 5 background mongo shells, each using benchRun to insert batches of
# 1000 ~800-byte documents into its own collection (test.c1 .. test.c5).
# Note: the --eval string is double-quoted, so the shell substitutes $t
# before the JavaScript runs, giving each thread its own namespace.
for t in $(seq 5); do
    mongo --eval "
        x = ''
        for (var i=0; i<800; i++)
            x += 'x'
        docs = []
        for (var i=0; i<1000; i++)
            docs.push({x:x})
        ops = [{
            op: 'insert',
            ns: 'test.c' + '$t',
            doc: docs,
        }]
        res = benchRun({
            ops: ops,
            seconds: 10000,
            parallel: 1
        })
    " &
done
 
wait



 Comments   
Comment by Michael Cahill (Inactive) [ 10/Nov/16 ]

Closing this issue because it ended up focusing on standalone performance. Opened SERVER-26988 to track the originally reported issue of cache pressure caused by a delayed secondary.

Comment by Susan LoVerso [ 10/Nov/16 ]

michael.cahill Is there more to do in this ticket? We have the changes from WT-2816 and WT-2866. I do still see occasional long latencies when running, but no dropouts in the time before my disk fills. When I ran on my old, very large SSD AWS box, it ran for an hour with 3 drops to zero, all exactly 15 minutes apart; I assume those were some kind of periodic SSD housekeeping event.

Comment by Michael Cahill (Inactive) [ 08/Sep/16 ]

sue.loverso, thanks for the new data. I see two primary causes of the remaining stalls:

1. threads contending to split a large page when appending: this leads to plateaus in "range of IDs currently pinned", causing a stall; and
2. ineffective scrubbing on the tip of the WT-2233 branch, leading to stalls during checkpoints (because we're at the dirty limit and checkpoint blocks eviction of dirty pages).

I'll fix 2 first up this morning: the issue there is that scrubbing was the cause of the regression in the YCSB 50/50 workload, so what we want is something that allows scrubbing to reduce the amount of dirty data in cache with minimal impact on application threads.

To some extent, case 1 has been around forever, and AFAIK none of the recent changes to eviction has made it worse. In the past I've tried to make the forced eviction code try harder to get exclusive access to these pages, but that has never been a net win.

Comment by Susan LoVerso [ 07/Sep/16 ]

I tried to run this test for an hour. It could only run for 43 minutes before filling the file system (which was nearly empty at the start), so it essentially ran for twice as long as the earlier test. It had 83 stalls, of which 39 were drops to zero, and it completed 29 checkpoints in the 43 minutes. There were several multi-second stalls. I have attached aggressive.43min.tar.gz if you want its data.

Comment by Susan LoVerso [ 07/Sep/16 ]

michael.cahill I ran your branch today and things are much improved generally. I am going to run a longer test too, since the only dropouts to 0 occurred in minutes 18-19. When I ran your branch last night, there were 14 stalls, of which 10 were drops to 0. This morning, after your changes, I see 6 stalls, of which 3 are drops to 0. Five of the six occurred in a one-minute period, 18 minutes into the test, but they were not grouped together (i.e., several intermittent drops, not one longer one like in yesterday's run). Performance is good. This is a 20-minute standalone, no-journal version of this insert test.

Ops/Sec/Thread   Num Ckpts   Num Stalls   Num Zero   Long Latency
160              13          6           3          53.5 seconds!

There were three super long latencies, on the order of 47-53 seconds and a handful more of about 10 seconds. Those are troubling.

Here's a chart of the entire run. I've uploaded this as run8.tar.gz. A-B is where 5 of the 6 stalls happened.

Here's the zoomed in portion A-B:

Comment by Bruce Lucas (Inactive) [ 02/Sep/16 ]

Looks indeed like the stalls coincide with checkpoints, particularly with the long fsync that occurs from the beginning of the checkpoint to the point where we allocate a transaction id (marked by range of ids pinned increasing):

One thing that can cause this is the journal being on the same drive as the data files - are you using separate disks for journal and data?
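
A quick way to check (sketched below, with /data/db standing in for the actual dbpath): if the data files and the journal directory report the same filesystem, checkpoint fsyncs and journal writes are hitting the same device.

DBPATH=/data/db                      # substitute the dbpath used for the test
df -h "$DBPATH" "$DBPATH/journal"    # same Filesystem column => same device
ls -ld "$DBPATH/journal"             # a symlink here means the journal was relocated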

Comment by Bruce Lucas (Inactive) [ 02/Sep/16 ]

"It has been impossible to run pmp or break with gdb when stalls happen"

I don't think it requires an unresponsive disk. I've seen this consistently when we are fsyncing the files during a checkpoint. I believe what is happening is that the kernel does not allow the fsync call to be interrupted, and being interruptible is required to attach gdb. You can check by correlating the stalls in the ftdc data with checkpoints, and more specifically with the stats that record when we do the fsyncs.
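
If decoding the ftdc data is inconvenient, the same correlation can be approximated live; a sketch that polls the checkpoint and fsync counters once a second (stat names assumed from the wiredTiger.transaction and wiredTiger.connection sections of serverStatus):

while true; do
    mongo --quiet --eval '
        var wt = db.serverStatus().wiredTiger;
        print(Date(),
              "ckpt running:", wt.transaction["transaction checkpoint currently running"],
              "last ckpt msecs:", wt.transaction["transaction checkpoint most recent time (msecs)"],
              "fsync I/Os:", wt.connection["total fsync I/Os"]);
    '
    sleep 1
done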

Comment by Susan LoVerso [ 02/Sep/16 ]

michael.cahill Here is the diagnostic data from my latest run.

Comment by Alexander Gorrod [ 01/Sep/16 ]

Thanks for the analysis sue.loverso

"It has been impossible to run pmp or break with gdb when stalls happen"

I've seen two causes for this:

  • There are lots and lots of threads.
  • The disk is unresponsive for the same time period.

Do you have statistics for the run? Could you check to see if the checkpoint sync duration approximately matches the periods where throughput stalls?

Comment by Susan LoVerso [ 01/Sep/16 ]

Running the standalone 20-minute test with current mongo master as of Aug 30, performance is better. Stalls still exist but checkpoints are maintained.

Ops/sec   Num ckpts   Stall secs   Zero stall secs
156       14          22           14

It has been impossible to run pmp or break in with gdb when stalls happen; it takes far longer than the stall time to connect to the process. Instead, I was able to watch with perf top and run perf record every 3 seconds. Here is what I captured from perf top: tree_walk_internal is always at the top or 2nd, but during a stall it spikes up to over 20%:

Normal running:
  12.25%  libc-2.17.so        [.] __memcpy_ssse3_back
   5.64%  mongod              [.] __tree_walk_internal
   5.07%  mongod              [.] __rec_txn_read
   4.86%  mongod              [.] __wt_row_search
   4.50%  mongod              [.] __free_skip_list
   4.46%  mongod              [.] __wt_row_modify
 
During a stall:
  21.38%  mongod               [.] __tree_walk_internal
  10.14%  libc-2.17.so         [.] __memcpy_ssse3_back
   4.13%  mongod               [.] __rec_txn_read
   3.94%  mongod               [.] __wt_row_search
   3.93%  mongod               [.] __wt_row_modify

Running perf report on the perf data recorded, I see similar information during a stall. perf report orders things in decreasing order based on Children percentage. During a stall, tree_walk_internal is the highest mongod function with any self time. During normal operation, tree_walk_internal is about the 20th mongod function listed.

Normal:
  Children      Self  Command        Shared Object       Symbol 
+   12.12%    11.45%  mongod         mongod              [.] __tree_walk_internal                                                                                      
 
During stall:
  Children      Self  Command        Shared Object       Symbol     
+   24.38%     0.00%  mongod         mongod              [.] __wt_thread_run                                                                                           
+   24.35%     0.89%  mongod         mongod              [.] __wt_evict_thread_run                                                                                     
+   23.44%    21.76%  mongod         mongod              [.] __tree_walk_internal                                                                                      
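
For reference, sampling like the above can be done along these lines (the exact perf flags used weren't recorded in this comment, so the invocation is illustrative):

# Watch hot symbols live in the mongod process:
perf top -p $(pgrep -x mongod)

# Take a short profile every few seconds so a stall window is covered by at
# least one capture, then inspect whichever file overlaps a stall:
for i in $(seq 100); do
    perf record -F 99 -g -p $(pgrep -x mongod) -o perf.$i.data -- sleep 3
done
perf report -i perf.42.data   # for example, a capture that overlaps a stall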

Comment by Alexander Gorrod [ 12/Apr/16 ]

bruce.lucas I have been able to reproduce similar insert throughput slowdowns during checkpoints, though it takes a lot longer for me to see the symptoms - between 10 and 15 minutes before they start. Now that I can reproduce it, I'll start digging deeper.

Comment by Eric Milkie [ 11/Apr/16 ]

Whoops, I was indeed confused. Thanks for the clarification. It makes sense now.

Comment by Bruce Lucas (Inactive) [ 11/Apr/16 ]

alexander.gorrod, the run above was without slaveDelay, and the delay was, I believe, due to the lagging secondary. When I did a run with slaveDelay I used 120 seconds.

Yes, all the processes were on the same machine (12 cores, 24 CPUs), but overall CPU utilization was well below 100%. The customer was not running with all processes on the same machine.

The graph above is the entire run, so it took less than 2 minutes to repro the issue.

You might try more writer threads, as the effect disappeared with a small enough number of threads. Are you seeing secondary lag in your repro? Also, cache size may matter - I was using 25 GB; the customer had a much larger cache (309 GB) and saw the symptom much more dramatically.

Diagnostic data attached (from a run with lag but not slaveDelay). This is a different run from the one graphed above, but parameters and outcome are the same.

Regarding the possibility that there aren't many pages that are available for eviction, I'm not sure how that could be the case: the workload is inserting at the end of collection, index, and oplog; and is reading from some point in the middle of the oplog. That should leave lots of pages that could be evicted, at least from a naive viewpoint about what pages could be evicted...

milkie, my initial description didn't make it clear that the slowdown occurred on the primary; I've corrected that now. Does that clarify? The delayed secondary means that it is reading from the middle of the oplog on the primary, not the end, which changes the interaction with the cache.
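
For reference, scaling up the writer threads only means changing the outer loop in the script from the description (and, optionally, benchRun's parallel setting); a hypothetical 20-thread variant, with counts that are purely illustrative:

for t in $(seq 20); do
    mongo --eval "
        x = ''
        for (var i=0; i<800; i++)
            x += 'x'
        docs = []
        for (var i=0; i<1000; i++)
            docs.push({x:x})
        benchRun({
            ops: [{op: 'insert', ns: 'test.c' + '$t', doc: docs}],
            seconds: 10000,
            parallel: 1
        })
    " &
done
wait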

Comment by Eric Milkie [ 11/Apr/16 ]

I don't see how slave delay could affect the behavior of caching. The storage engine doesn't even know about any delay, and the replicated operations should be presented by the replication system to the storage engine in the same relative order, more or less. It will be interesting to see what the underlying cause is.

Comment by Alexander Gorrod [ 11/Apr/16 ]

bruce.lucas I attempted to reproduce this without any success.

Could you give me some more information about your setup? Specifically:

  • What value did you use for slaveDelay?
  • Were you running all of the processes on the same machine?
  • How long did the test run before getting into the state in your graph?

Also - would you mind uploading the diagnostic data you used to generate the graph?

Re: the symptom of walking lots of pages per eviction: it is an indication that something is inefficient, but there are two different types of cause:
1) The eviction server algorithm is inefficient. That is, it is doing a poor job of finding, choosing, and queuing candidates.
2) The particular workload means that there aren't many pages available for eviction, so the server is working very hard to find the few candidates that are available.

It will be interesting to see which category this falls into.
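
One way to see which category a run falls into is to compare the rate of pages walked for eviction against pages actually evicted over an interval; a rough sketch, with stat names assumed from serverStatus().wiredTiger.cache (they vary somewhat across versions):

mongo --quiet --eval '
    function snap() {
        var c = db.serverStatus().wiredTiger.cache;
        return {walked:  c["pages walked for eviction"],
                evicted: c["modified pages evicted"] + c["unmodified pages evicted"]};
    }
    var a = snap();
    sleep(10 * 1000);                 // sample over a 10 second window
    var b = snap();
    var evicted = b.evicted - a.evicted;
    print("pages walked per page evicted:",
          evicted > 0 ? ((b.walked - a.walked) / evicted).toFixed(0) : "no evictions");
'

A high ratio alone doesn't distinguish (1) from (2); it needs to be read alongside how much of the cache is dirty or pinned at the time.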
