[SERVER-23622] Inconsistent throughput during insert workload
Created: 08/Apr/16 | Updated: 21/Dec/16 | Resolved: 10/Nov/16
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Susan LoVerso |
| Resolution: | Done | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: | |
| Description |
|
With a heavy insert load and a secondary that is delayed, the cache on the primary fills to 100% and operation rates drop. Here's a run showing the behavior on the primary with the secondary delayed due to lag; a similar effect is seen if the secondary is intentionally delayed using slaveDelay.
The high rate of pages walked for eviction suggests a connection to
The above test involved
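For anyone trying to reproduce this, below is a minimal sketch of the kind of load involved, assuming pymongo against a three-member replica set. The host names, thread count, document shape, and the slaveDelay reconfiguration are illustrative assumptions, not the exact harness used for the runs in this ticket.

```python
# Hypothetical repro sketch (not the exact harness used in this ticket):
# a heavy multi-threaded insert load against the primary of a replica set
# whose secondary is delayed, either by natural lag or by slaveDelay.
# Host names, thread count, and document shape are assumptions.
import threading

from pymongo import MongoClient

NUM_THREADS = 32  # the effect reportedly disappears with too few writers


def insert_worker(thread_id):
    client = MongoClient("mongodb://primary-host:27017/?replicaSet=rs0")
    coll = client["test"]["insertload"]
    while True:
        # Unordered bulk inserts keep constant pressure on the primary's cache.
        docs = [{"thread": thread_id, "i": i, "payload": "x" * 1024}
                for i in range(1000)]
        coll.insert_many(docs, ordered=False)


# To force the delayed-secondary case instead of waiting for natural lag,
# the secondary can be reconfigured with slaveDelay (pre-3.6 field name),
# e.g. in the mongo shell:
#   cfg = rs.conf()
#   cfg.members[1].priority = 0
#   cfg.members[1].hidden = true
#   cfg.members[1].slaveDelay = 120
#   rs.reconfig(cfg)

if __name__ == "__main__":
    threads = [threading.Thread(target=insert_worker, args=(t,))
               for t in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```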
|
| Comments |
| Comment by Michael Cahill (Inactive) [ 10/Nov/16 ] |
|
Closing this issue because it ended up focusing on standalone performance. Opened a follow-up ticket for the remaining investigation.
| Comment by Susan LoVerso [ 10/Nov/16 ] |
|
michael.cahill Is there more to do in this ticket? We have the changes from your earlier branch.
| Comment by Michael Cahill (Inactive) [ 08/Sep/16 ] |
|
sue.loverso, thanks for the new data. I see two primary causes of the remaining stalls:

1. threads contending to split a large page when appending: this leads to plateaus in "range of IDs currently pinned", causing a stall; and
2. the scrubbing behavior discussed below.

I'll fix 2 first up this morning: the issue there is that scrubbing was the cause of the regression in the YCSB 50/50 workload, so what we want is something that allows scrubbing to reduce the amount of dirty data in cache with minimal impact on application threads.

To some extent, case 1 has been around forever, and AFAIK none of the recent changes to eviction has made it worse. In the past I've tried to make the forced eviction code try harder to get exclusive access to these pages, but that has never been a net win.
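For experimenting with how aggressively dirty data is written out of the cache, the relevant WiredTiger dirty-eviction settings can be adjusted at runtime through MongoDB's wiredTigerEngineRuntimeConfig server parameter. This is only a hedged illustration of the knobs involved, with assumed values and an assumed localhost connection; it is not the change that came out of this ticket.

```python
# Hedged illustration only: adjust WiredTiger's dirty-eviction settings at
# runtime via the wiredTigerEngineRuntimeConfig server parameter. The option
# values below are assumptions (valid names and ranges depend on the
# WiredTiger version in use); this is NOT the fix made for this ticket.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed host/port

# eviction_dirty_target: level at which eviction starts writing dirty pages.
# eviction_dirty_trigger: level at which application threads help with eviction.
# The string is handed to WiredTiger's connection reconfigure call.
client.admin.command(
    "setParameter", 1,
    wiredTigerEngineRuntimeConfig="eviction_dirty_target=60,eviction_dirty_trigger=80",
)
```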
| Comment by Susan LoVerso [ 07/Sep/16 ] |
|
I tried to run this test for an hour. It could only run for 43 minutes before filling the file system (which was nearly empty at the start), so it essentially ran twice as long as the earlier test. It had 83 stalls, of which 39 were drop-outs to zero, and it completed 29 checkpoints in the 43 minutes. There were several multi-second stalls. I have attached aggressive.43min.tar.gz if you want its data.
| Comment by Susan LoVerso [ 07/Sep/16 ] |
|
michael.cahill I ran your branch today and things are much improved generally. I am going to run a longer test too, since the only drop-outs to 0 occurred in minutes 18-19. When I ran your branch last night, there were 14 stalls, of which 10 were drops to 0. This morning, after your changes, I see 6 stalls, of which 3 are drops to 0. Five of the six occurred in a one-minute period, 18 minutes into the test, but they were not grouped together (i.e. it is several intermittent drops, not one longer one like yesterday's run). Performance is good. This is a 20-minute standalone no-journal version of this insert test.

There were three super long latencies, on the order of 47-53 seconds, and a handful more of about 10 seconds. Those are troubling. Here's a chart of the entire run; I've uploaded this as run8.tar.gz. A-B is where 5 of the 6 stalls happened. Here's the zoomed-in portion A-B:
| Comment by Bruce Lucas (Inactive) [ 02/Sep/16 ] |
|
Looks indeed like the stalls coincide with checkpoints, particularly with the long fsync that occurs from the beginning of the checkpoint to the point where we allocate a transaction id (marked by "range of IDs pinned" increasing):

One thing that can cause this is the journal being on the same drive as the data files. Are you using separate disks for journal and data?
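One way to check that correlation outside of the FTDC tooling is a small polling loop over serverStatus. A hedged sketch, assuming a locally reachable mongod and pymongo; the stat names are taken from the serverStatus wiredTiger section and may differ slightly between server versions.

```python
# Hedged sketch: poll serverStatus once per second and log checkpoint, fsync,
# and pinned-transaction-ID activity next to the insert counter, so throughput
# stalls can be lined up with checkpoint syncs. Connection string is assumed;
# stat names may vary by version, hence the .get() lookups.
import time

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
prev_inserts = None
while True:
    s = client.admin.command("serverStatus")
    wt = s["wiredTiger"]
    ckpt_running = wt["transaction"].get("transaction checkpoint currently running")
    ckpt_ms = wt["transaction"].get("transaction checkpoint most recent time (msecs)")
    fsyncs = wt["connection"].get("total fsync I/Os")
    pinned = wt["transaction"].get("transaction range of IDs currently pinned")
    inserts = s["opcounters"]["insert"]
    delta = None if prev_inserts is None else inserts - prev_inserts
    prev_inserts = inserts
    print("ckpt_running=%s last_ckpt_ms=%s fsyncs=%s pinned=%s inserts_delta=%s"
          % (ckpt_running, ckpt_ms, fsyncs, pinned, delta))
    time.sleep(1)
```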
| Comment by Bruce Lucas (Inactive) [ 02/Sep/16 ] |
I don't think it requires an unresponsive disk. I've seen this consistently when we are fsyncing the files during a checkpoint. I believe what is happening is that the kernel does not allow the fsync call to be interrupted, and interrupting it is required to attach gdb. You can check by correlating the stalls in the ftdc data with checkpoints, and more specifically with the stats that record when we do the fsyncs.
| Comment by Susan LoVerso [ 02/Sep/16 ] |
|
michael.cahill Here is the diagnostic data from my latest run.
| Comment by Alexander Gorrod [ 01/Sep/16 ] |
|
Thanks for the analysis, sue.loverso.

I've seen two causes for this:

Do you have statistics for the run? Could you check to see whether the checkpoint sync duration approximately matches the periods where throughput stalls?
| Comment by Susan LoVerso [ 01/Sep/16 ] |
|
Running the standalone 20-minute test with current mongo master as of Aug 30, performance is better. Stalls still exist, but checkpoints are maintained.

It has been impossible to run pmp or break in with gdb when stalls happen; it takes far longer than the stall time to connect to the process. Instead I was able to watch with perf top and run perf record every 3 seconds. I was able to capture this from perf top: tree_walk_internal is always at the top or 2nd, but during a stall it spikes up to over 20%:
Running perf report on the perf data recorded, I see similar information during a stall. perf report orders things in decreasing order based on Children percentage. During a stall, tree_walk_internal is the highest mongod function with any self time. During normal operation, tree_walk_internal is about the 20th mongod function listed.
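For reference, the "perf record every 3 seconds" approach can be scripted so that a capture taken during a stall can be compared against one taken during normal operation. A hedged sketch, assuming perf is installed, sufficient privileges (perf typically needs root), and illustrative file names and intervals:

```python
# Hedged sketch: take a short system-wide perf capture every few seconds so a
# profile from a stall can be compared with one from normal operation.
# Each capture can later be inspected with: perf report -i perf-NNN.data
import subprocess
import time

SAMPLE_SECS = 3   # length of each capture
PAUSE_SECS = 3    # gap between captures

i = 0
while True:
    out = "perf-%03d.data" % i
    # -a: all CPUs, -g: record call graphs; the trailing "sleep" bounds the capture.
    subprocess.call(["perf", "record", "-a", "-g", "-o", out,
                     "--", "sleep", str(SAMPLE_SECS)])
    i += 1
    time.sleep(PAUSE_SECS)
```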
| Comment by Alexander Gorrod [ 12/Apr/16 ] |
|
bruce.lucas I have been able to reproduce similar insert throughput slowdowns during checkpoints, though it takes a lot longer for me to see the symptoms: between 10 and 15 minutes before they start. Now that I can reproduce it, I'll start digging deeper.
| Comment by Eric Milkie [ 11/Apr/16 ] |
|
Whoops, I was indeed confused. Thanks for the clarification. It makes sense now.
| Comment by Bruce Lucas (Inactive) [ 11/Apr/16 ] |
|
alexander.gorrod, the run above was without slaveDelay and was, I believe, due to the lagging secondary. When I did a run with slaveDelay I used 120 seconds. Yes, all the processes were on the same machine (12 cores, 24 CPUs), but overall CPU utilization was well below 100%. The customer was not running with all processes on the same machine.

The graph above is the entire run, so it took less than 2 minutes to repro the issue. You might try more writer threads, as the effect disappeared with a small enough number of threads. Are you seeing secondary lag in your repro? Also, cache size may matter: I was using 25 GB; the customer had a much larger cache (309 GB) and saw the symptom much more dramatically. Diagnostic data is attached (from a run with lag but not slaveDelay); this is a different run from the one graphed above, but the parameters and outcome are the same.

Regarding the possibility that there aren't many pages available for eviction, I'm not sure how that could be the case: the workload is inserting at the end of the collection, index, and oplog, and is reading from some point in the middle of the oplog. That should leave lots of pages that could be evicted, at least from a naive viewpoint about which pages are evictable.

milkie, my initial description didn't make it clear that the slowdown occurred on the primary; I've corrected that now. Does that clarify? The delayed secondary means it is reading from the middle of the oplog on the primary, not the end, which changes the interaction with the cache.
| Comment by Eric Milkie [ 11/Apr/16 ] |
|
I don't see how slave delay could affect the behavior of caching. The storage engine doesn't even know about any delay, and the replicated operations should be presented by the replication system to the storage engine in the same relative order, more or less. It will be interesting to see what the underlying cause is.
| Comment by Alexander Gorrod [ 11/Apr/16 ] |
|
bruce.lucas I attempted to reproduce this without success. Could you give me some more information about your setup? Specifically:

Also, would you mind uploading the diagnostic data you used to generate the graph?

Re: the symptom of walking lots of pages per eviction, it is an indication that something is inefficient, but there are two different types of cause. It will be interesting to see which category this falls into.