[WT-2868] Add sample_interval to checkpoint-stress wtperf config

Add sample_interval to generate monitor file for gathering throughput and latency stats

WT-2868 Add sample_interval to checkpoint-stress.wtperf (#2989)
Sue LoVerso, I have updated the Jenkins job perf-long to run checkpoint-stress.wtperf and generate graphs of throughput, number of updates and checkpoint count. Although I tested my changes separately, the job has yet to complete a full run.

Please take a look and let me know if you have any suggestions for improving this change to the Jenkins job.

WT-2868 Add sample_interval to checkpoint-stress.wtperf (#2989)
Branch: mongodb-3.4

Message: Import wiredtiger: 7d3c0f9f50862798270cf38663255202e5bcf3fd from branch mongodb-3.4

ref: 2566118fc6..7d3c0f9f50
for: 3.3.12

WT-2865 eviction thread error failure
WT-2868 Add sample_interval to checkpoint-stress wtperf config
WT-2869 Performance regression on secondaries
Branch: master

Sulabh Mahajan The run completed. I have a couple suggestions that I'll let you fix so that you learn those parts of Jenkins:

  • Overall I have reservations about including the checkpoint-stress numbers in the collective latency calculations. The other 4 tests are all related to each other in configuration, duration and data and run for a couple hours. This new test is unrelated to those. However, I'm a bit on the fence because adding a lot of new plots gets confusing too. I am okay with it individually added to the Max Latency chart as long as it is in the same ballpark, but it may skew the "total number of warnings" chart. (However, it isn't doing that yet - see next bullet.)
  • You need to add a setting max_latency=2000 to checkpoint-stress.wtperf in order to get any latency warning messages. So, for now it isn't contributing to that value. While you're in there please move the sample* lines so they're alphabetized. Thanks.
  • In Jenkins, you need to add a label to all your data. In each plot definition, when you pick "Load data from properties file" it will bring up a box labelled "Data series legend label" for you to type in what the data is. This is particularly important in any plot with more than one data item such as the Max Latency plot.
  • In Jenkins, when it displays the plots (when you select the "Plots" link to see the charts) it displays them alphabetically by "Plot title". That is why I named the others "Test1" - "Test4". But again, those 4 tests are related to each other, using the same data initially created in Test1, the populate phase. You don't need to call yours "Test5" and that might be misleading since it isn't related to the other four. However, I don't have a good suggestion because if you want it at the bottom after the other 4, then anything I come up with inserts it in the beginning or middle.
  • In order to see the plots on http://source.wiredtiger.com/jenkins/plots/ you need to add entries for them in the wiredtiger/jenkins/plots/index.mh file (in the repo https://github.com/wiredtiger/wiredtiger.github.com). The entries are simply numbered, so add another line for each additional plot you add. Then run build.sh to generate the HTML. That requires pandoc, so if you don't have it on your system just let me know and I can add the plots there.
  • I did fix a typo in the property filenames in the checkpoint stress min/max throughput plot. But there is no data in the csv file yet because of the typo.
  • You could eliminate one step in getting the update min/max throughput numbers by doing cut -d ' ' -f 5 instead of the two cut commands you're using.
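
As a minimal sketch of the last suggestion, the snippet below compares a chained pair of cut commands with the single cut -d ' ' -f 5 form. The input line and the two-step pipeline are assumptions made for illustration; the thread shows neither the actual wtperf output line nor the original two commands.

```shell
# Hypothetical summary line; the real wtperf output format may differ.
line='Min update throughput is 1200 ops/sec'

# Two-step form: an assumed example of chaining two cut commands.
# The first cut keeps field 5 onward, the second keeps the first of those.
two_step=$(printf '%s\n' "$line" | cut -d ' ' -f 5- | cut -d ' ' -f 1)

# Single-step form suggested in the review: select field 5 directly.
one_step=$(printf '%s\n' "$line" | cut -d ' ' -f 5)

echo "two_step=$two_step one_step=$one_step"
```

Both forms select the same space-delimited field, so the single command is a drop-in replacement whenever the wanted value is always the fifth field.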

But thanks for adding this!

Sue LoVerso, the following addresses your comments:

  • I discussed with Alex, and we are of the opinion that we do not need latency measurements for the checkpoint-stress test. So I removed all latency-related changes.
  • I will alphabetize the sample* lines in the wtperf file as a separate change.
  • Added legend label to graphs that were missing it
  • Renamed from Test5 to Test_Checkpoint_Stress, which keeps these graphs at the bottom
  • I have added the graph to the jenkins/plots/index.mh file in the wiredtiger.github.com repo
  • Shortened to cut -d ' ' -f 5 to get min/max throughput

Thanks for helping me out with this one.

I discussed with Alex, and we are of the opinion that we do not need latency measurements for the checkpoint-stress test. So I removed all latency related changes

I'm going to push back and ask both you and Alexander Gorrod what the point of adding the sample* lines is if you're neither measuring nor plotting any latency. The current new plots are 1. number of checkpoints, 2. update counts, 3. min/max throughput. None of those requires the monitor thread that records latencies into the monitor file. The other thing the monitor thread can measure and warn about is a minimum throughput, but we don't use that anywhere right now.

I really like having the number-of-checkpoints plot. I think it is telling that we're only completing 1 checkpoint in 10 minutes.

So I'll put these additional comments out there:

  • If we don't want latency measurements, then you can remove the sample lines. I can agree that the long test1-4 measures that sort of thing already and does checkpoints once per minute as well. It does not have an update-only workload though.
  • However, I'll point out that long latencies have been more common around the syncs of checkpoints. So it can be interesting.
  • Are we trying to get close to how MongoDB uses WT? If so, I have more suggestions on the config file:
    • Add 4 eviction threads to the connection config.
    • Increase cache size to 16GB (half of the memory of the AWS perf machine we use).
    • For table config, use leaf_page_max=16k,memory_page_max=10M.
    • Consider turning on fast, json statistics.
  • If we're not trying to resemble MongoDB usage, then a comment for the values in the config file would be helpful to know why/how those numbers were chosen.
  • I thought of this while typing the above: why not make this test another member of the related 500m tests, as an update-only version? Then it fits in with the other plots and already has the MongoDB-related setup. Its run-time would have to increase as well. I'm kind of liking this idea. If you do this we should figure out how best to incorporate the number of checkpoints information you added (i.e. just for this test, or the sum from all tests, and how to know what a good/bad number is, etc). What do you think?
FYI I hand-edited the csv files so that the plots are consistently labeled with your new labels and they look like you'd expect.
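
The config suggestions made in this thread could be collected roughly as follows. This is only a sketch under stated assumptions: the file name, the sample_interval and sample_rate values, the statistics_log wait value, and the use of threads_max for "4 eviction threads" are all hypothetical, and the real checkpoint-stress.wtperf contains other settings that are not shown here. Only max_latency=2000, the 16GB cache size and the leaf_page_max/memory_page_max values come from the comments above.

```shell
# Write a hypothetical fragment to a scratch file (not the real config).
cfg_dir=$(mktemp -d)
cfg="$cfg_dir/checkpoint-stress-sketch.wtperf"

# Keys are kept in alphabetical order, as requested in the review.
cat > "$cfg" <<'EOF'
conn_config="cache_size=16G,eviction=(threads_max=4),statistics=(fast),statistics_log=(json,wait=30)"
max_latency=2000
sample_interval=5
sample_rate=1
table_config="leaf_page_max=16k,memory_page_max=10M"
EOF

# Check that the option names are alphabetized; sort -c fails if they are not.
if cut -d '=' -f 1 "$cfg" | LC_ALL=C sort -c; then
    keys_sorted=yes
else
    keys_sorted=no
fi
echo "keys_sorted=$keys_sorted"
```

Keeping one option per line and checking the order with sort -c makes it cheap to keep the file alphabetized as options are added.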

Sue LoVerso, thanks for the input. I will work on this and get back to you.

Sue LoVerso, the changes to the wtperf file are under review:

  • We decided to keep the sample* lines for possible future use. I can remove them if you feel otherwise.
  • We are not necessarily imitating how MongoDB uses WT, but I have added eviction threads and increased the cache size.
  • Since this test is partly intended for performance measurements, I am inclined not to turn on statistics until they are needed. Let me know if you feel otherwise.
  • This wtperf configuration is mostly based on the test that was used for WT-2389, to keep track of any performance regression in the number of updates with stressed checkpoints.
    I am not sure whether this fits well with any of the other tests. For now I am inclined not to merge it with any other test.
WT-2868 Add sample_interval to checkpoint-stress.wtperf (#2989)
Branch: mongodb-3.2

WT-2868 Add sample_interval to checkpoint-stress wtperf config

