[SERVER-26804] Pause in inserts for YCSB Created: 27/Oct/16  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Performance, Storage, WiredTiger
Affects Version/s: 3.4.0-rc0, 3.4.0-rc1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: David Daly Assignee: Backlog - Storage Engines Team
Resolution: Unresolved Votes: 0
Labels: 3.7BackgroundTask
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File pause.png     HTML File timeseries.html     PNG File timeseries.p4.png    
Issue Links:
Depends
Assigned Teams:
Storage Engines
Operating System: ALL
Steps To Reproduce:

YCSB as run in longevity regression suite. 3 shard cluster in AWS using c3.2xlarge instances.

Sprint: Storage 2016-11-21, Storage 2016-12-12
Participants:
Linked BF Score: 0

 Description   

Running YCSB against a sharded cluster in our regression framework, we see a 10+ second pause in inserts correlated with eviction in aggressive mode. The pause appears to be on the primary of the second shard of a three shard cluster. The test was run using c3.2xlarge instances, using local SSD for data and journal. The journal is on a separate device from the data.

Here's a plot of some key stats during the pause (attached as pause.png):



 Comments   
Comment by David Daly [ 26/Jan/17 ]

I finally got something running here. For this test I dropped the YCSB collection between runs, but the server stayed up. YCSB uses the same document ids each time it runs, so if you don't drop the collection you get duplicate key errors. There is an option to change the start point of the documents, so we can work around this if needed.
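
For reference, a rough sketch of the two workarounds just mentioned, assuming YCSB's default usertable collection name and the same load command as in the 28/Oct comment below; insertstart is a standard YCSB core workload property, and the offset value here is only illustrative:

# Option 1: drop the collection between runs
mongo 10.2.0.99:27017/ycsb --eval 'db.usertable.drop()'

# Option 2: shift the key range so the next load doesn't reuse the same document ids
./bin/ycsb load mongodb -s -P workloads/workloadLongevity -p mongodb.url=mongodb://10.2.0.99:27017/ycsb -p insertstart=100000000 -threads 32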

Looking at the results there are a few interesting things:

  1. The first two runs show drops to 0 lasting over 10 seconds: twice on the first run (220 and 550 seconds in), and once on the second run (520 seconds in).
  2. There are no 10 second drops to zero after the first two runs.
  3. The throughput goes up dramatically across the first four runs.
  4. One of the secondaries falls off the oplog during the third run and stays in recovery for the rest of the run.
  5. Looking at the FTDC data, the performance appears much more stable a run or two after the third node falls off the oplog.

Does this data help confirm or deny your suspicions? It does seem to get much more stable over time, at least for the nodes that don't fall over.

Comment by David Hows [ 12/Dec/16 ]

Hi David Daly,

Sorry for the delay.

"What makes sense to do here? One simple experiment I could do is to run the load phase twice, either dropping the collection in between, or loading to a new collection the second time. Would that make sense?"

It would make sense to try testing the load phase twice.

I understand about the physical hardware.

Let me know how that goes.

Comment by David Daly [ 01/Dec/16 ]

david.hows I can definitely do the pre-heating of the collections. What makes sense to do here? One simple experiment I could do is to run the load phase twice, either dropping the collection in between, or loading to a new collection the second time. Would that make sense?

Testing this with physical hardware requires a fair amount of additional work to get a real cluster up and make sure it falls over again. If we want to try just a standalone with an oplog, I can run that locally on my machine. I don't know if it will reproduce there, since it's a different set of hardware than we're using in AWS. If you think there's something to learn from that experiment, I would be glad to do it. Thanks.

Comment by David Hows [ 01/Dec/16 ]

To Henrik's comments.

The issue here is not only about pre-heating the disk itself, although that could be a factor. It is about pre-heating the WiredTiger cache following a restart of the MongoDB instance. It is not unusual for an instance to take some time ramping up before it can keep up with eviction, and there are other factors around the first use of a collection, as we have to do things like initial writes.

One thing worth testing here would be to heat up the instance and collection first, with an initial warm-up pass to build the collection ahead of the workload. This should hopefully work around the worst case where a large early checkpoint saturates the disk.
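
As a rough sketch of that warm-up pass (reusing the load command from the 28/Oct comment below and assuming the default usertable collection; this is illustrative, not a prescribed procedure):

# untimed pass: warms the WiredTiger cache and pays the first-use costs for the collection
./bin/ycsb load mongodb -s -P workloads/workloadLongevity -p mongodb.url=mongodb://10.2.0.99:27017/ycsb -threads 32
# drop so the measured pass doesn't hit duplicate key errors
mongo 10.2.0.99:27017/ycsb --eval 'db.usertable.drop()'
# measured pass
./bin/ycsb load mongodb -s -P workloads/workloadLongevity -p mongodb.url=mongodb://10.2.0.99:27017/ycsb -threads 32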

To David Daly's comments.

Understand where you are coming from. Are you able to look at doing some testing with physical hardware? Or pre-heating the collections as suggested above? Hopefully with some of those changes we can minimize the stalls.

Comment by Susan LoVerso [ 02/Nov/16 ]

I looked at the FTDC from the run you showed. The statistics I'm viewing show a lot of IO going on during the stalls. The number of active write system calls in progress goes up and stays up at 31 for the duration of the stall. There is also 1 fsync and a checkpoint running for the entire time. The system (2nd line) shows it is spending its time in iowait. Coincidentally eviction is aggressive for the exact duration of the stall as well.

Comment by David Daly [ 28/Oct/16 ]

Hi sue.loverso, we're running YCSB from here: https://github.com/mongodb-labs/YCSB/tree/evergreen on the evergreen branch.
To set it up, clone the repo, and

  • cd YCSB/ycsb-mongo
  • ./setup.sh
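
Spelled out end to end (the clone invocation is my reading of the repo URL above, which points at the evergreen branch):

git clone -b evergreen https://github.com/mongodb-labs/YCSB.git
cd YCSB/ycsb-mongo
./setup.sh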

And we're running it with this command line

./bin/ycsb load mongodb -s -P workloads/workloadLongevity -p mongodb.url=mongodb://10.2.0.99:27017/ycsb -threads 32

Change the mongodb.url to the appropriate target for you.

We were running the tests with a separate EC2 node (c3.2xlarge using local SSD) for the client, and one for each node in the cluster. I tried simplifying to a 3-node repl set and a standalone. The issue reproduced on the 3-node repl set. The standalone shows drops in throughput, but none going to zero. I'm not sure if that is random variation, or because there's no oplog to run with. I kicked off a 1-node repl set run as well to see what happens there.

I think it makes sense to start with a 1-node repl set and see if you can reproduce with that locally. If it doesn't reproduce, we can start pulling apart what's different between your local environment and our test environment.
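
For anyone trying this locally, a minimal 1-node repl set might look like the following (the dbpath, port, and replica set name are placeholders):

mkdir -p /data/rs0
mongod --replSet rs0 --dbpath /data/rs0 --logpath /data/rs0/mongod.log --fork
mongo --eval 'rs.initiate()'
./bin/ycsb load mongodb -s -P workloads/workloadLongevity -p mongodb.url=mongodb://localhost:27017/ycsb -threads 32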

Comment by David Daly [ 27/Oct/16 ]

Attaching raw timeseries.html file for the primary of the second shard.
timeseries.html
