[SERVER-43804] Establish baseline performance metrics for startup, shutdown, initiation portions of ReplSetTest and ShardingTest Created: 03/Oct/19  Updated: 20/Oct/19  Resolved: 17/Oct/19

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: William Schultz (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: faster-local-testing-snapshot.zip (Zip Archive), replica_sets.csv, replica_sets.png, sharding.csv, sharding.png
Issue Links: depends on SERVER-43803 Add log messages to aid measurement o... (Closed)
Sprint: Repl 2019-10-21
Participants:

 Description   

To measure the success of the Faster Local Testing project, we should, before making any changes, measure the execution times of the startup, shutdown, and initiation phases of ReplSetTest and ShardingTest across all tests in the replica_sets and sharding suites. We should record the durations of the relevant sections of code for each test so that we can rerun this experiment after the project is complete and see how performance improved for each relevant section (e.g. ReplSetTest.startSet, ShardingTest.stop).



 Comments   
Comment by William Schultz (Inactive) [ 20/Oct/19 ]

After some more investigation, I think the distinct performance profile of stopSet can be explained by the fact that its execution time includes the time taken to validate each replica set node. If validation is one of the slowest parts of stopSet, then the amount of data written during a test may affect stopSet's duration much more than the size of the replica set does.

Comment by William Schultz (Inactive) [ 17/Oct/19 ]

After some discussion about these metrics, there are a few other follow-up experiments that might be interesting to run in the future. Adding them here for reference:

  • Record metrics from running locally on a MacBook
  • Explore the potential impacts of different Evergreen/AWS instance types
  • Explore the discrepancies between local workstation runs and Evergreen runs in more depth
Comment by William Schultz (Inactive) [ 17/Oct/19 ]

judah.schvimer I turned off the icecream daemon (sudo systemctl stop iceccd.service) before running the benchmarks on my local machine. There may be other factors affecting performance in some way, but I will leave that to a follow-up investigation. I do not want to deal with the issues of uploading potentially private credentials to a public repo, so I am attaching a zip directory that contains a snapshot of the scripts I used: faster-local-testing-snapshot.zip. I may change these scripts locally, but this directory contains the scripts as they exist today and instructions on how to run them. Basically, the report generation has three stages (a rough sketch of stage 2 follows the list):

  1. Download log files from Evergreen for a given task ID.
  2. Extract duration metrics from the downloaded logs and compile them into a CSV file.
  3. Generate plots based on the generated CSV file.
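For reference, a minimal Python sketch of what stage 2 might look like is below. The exact log message format added in SERVER-43803 is not reproduced here, so the regex, file layout, and column names are illustrative assumptions, not the actual scripts from faster-local-testing-snapshot.zip.

    import csv
    import glob
    import re

    # Hypothetical log line shape, e.g. "ReplSetTest startSet took 1616ms";
    # the real format comes from the log messages added in SERVER-43803.
    METRIC_RE = re.compile(r"(?P<fixture>ReplSetTest|ShardingTest) (?P<metric>\w+) took (?P<ms>\d+)ms")

    def extract_metrics(log_dir, out_csv):
        """Scan every downloaded Evergreen log and write one row per metric occurrence."""
        with open(out_csv, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(["log_file", "fixture", "metric", "duration_ms"])
            for path in glob.glob(f"{log_dir}/*.log"):
                with open(path) as log:
                    for line in log:
                        match = METRIC_RE.search(line)
                        if match:
                            writer.writerow([path, match["fixture"], match["metric"], match["ms"]])

    if __name__ == "__main__":
        extract_metrics("downloaded_logs", "replica_sets.csv")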
Comment by Jason Chan [ 15/Oct/19 ]

The above report LGTM.

Comment by William Schultz (Inactive) [ 14/Oct/19 ]

jason.chan If you could review the above report and check that it is all clear to you, that would be great.

Comment by William Schultz (Inactive) [ 14/Oct/19 ]

Performance Baseline

The attached charts display data for setup and teardown performance across different Evergreen variants and on my local workstation, for ReplSetTest and ShardingTest. These results are based on runs from this Evergreen version and this patch build. The full data sets used to generate these charts are also attached: replica_sets.csv and sharding.csv. The metrics captured are those added in log messages in SERVER-43803.

Each individual plot is a histogram where the x-axis is the duration of that portion of the test fixture in milliseconds, and the y-axis is the number of tests whose relevant metric fell into that execution time bucket. The x-axis is capped at 10,000 ms to exclude significant outliers. To help interpret the data further, each histogram bar is colored to indicate the number of nodes (or shards) used in the test that the data point came from. For example, for ReplSetTest.startSet there are tight groupings corresponding to some of the most common replica set sizes used in tests (1, 2, and 3 nodes), and the startup times appear to scale roughly linearly with the number of nodes. The ReplSetTest.stopSet histogram profile is more spread out. I'm not sure exactly why, but tests that write more data may take longer to checkpoint that data when shutting down, which would make shutdown time dependent on the amount of data written. For startup, nodes nearly always start with a clean data directory, so presumably there is less variability in those execution times.
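As a rough illustration, the following Python/matplotlib sketch shows how such a histogram could be produced from the attached CSV. The column names ("metric", "duration_ms", "num_nodes") are assumptions made for this sketch; the attached replica_sets.csv may be laid out differently.

    import matplotlib.pyplot as plt
    import pandas as pd

    # Assumed CSV layout: one row per test with "metric", "duration_ms", and "num_nodes" columns.
    df = pd.read_csv("replica_sets.csv")
    start_set = df[df["metric"] == "startSet"]

    # Cap durations at 10,000 ms and stack the bars by replica set size so each
    # bucket shows how many tests of each node count fell into that duration range.
    bins = range(0, 10001, 250)
    groups = [g["duration_ms"].clip(upper=10000) for _, g in start_set.groupby("num_nodes")]
    labels = ["%d node(s)" % n for n in sorted(start_set["num_nodes"].unique())]

    plt.hist(groups, bins=bins, stacked=True, label=labels)
    plt.xlabel("ReplSetTest.startSet duration (ms)")
    plt.ylabel("Number of tests")
    plt.legend()
    plt.savefig("replica_sets_startSet.png")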

We can see that the histogram profiles are quite consistent across different Evergreen variants. It is worth pointing out that this Evergreen data is compiled from test runs with a reasonably high level of parallelization, i.e. multiple tests running on the same machine at once. This doesn't give an entirely accurate view of single-test latency, but my hope is that it's a reasonable proxy. If a machine is overloaded, I would expect startup/shutdown times to only be slower than on an unloaded machine, so, if anything, these numbers should act as an upper bound on local test execution times. For completeness, I also ran one replica_sets suite on a RHEL 6.2 spawn host with no parallelization (one resmoke job), labeled "enterprise-rhel-62-64-bit-sequential". Even with a single job the histogram profile isn't dramatically different, so I think it's OK to use the metrics from parallelized suites as a reasonable approximation of single-test latencies.

The histogram profile when running on my local Ubuntu workstation is a noticeable anomaly (see the "ubuntu_1610_local_workstation" subplots). Its median duration across all plots is much higher and the profile is much more spread out. My hypothesis is that this may be due to a slower disk on my local machine than on the AWS instances. To test this hypothesis, I also ran the suites locally using a ramdisk for the data directory. This provides a significant performance improvement and brings the performance profile more in line with the Evergreen runs. I did not attempt to run a full sharding suite on my local workstation because it takes a long time to run.

In the sharding data, for which only two Evergreen variants are shown, the prevalence of 3-node replica sets used as config servers is visible. Note that the "count" value for the startupAndInitiate metric counts the number of shards in that test, not the number of nodes across all shards.

There are still some aspects of this data worth investigating further (for example, why the data consistency check profile seems to differ between local runs and Evergreen runs), but I think this data acts as a good baseline, and we can continue to analyze these metrics as we speed up different parts of the fixtures. Below are summary statistics for the recorded metrics shown in the charts; a short sketch of how they can be recomputed from the attached CSVs follows the tables:

replica_sets

Build Variant Metric Mean (ms) Median (ms) SD (ms)
"enterprise-rhel-62-64-bit" "initiateWithNodeZeroAsPrimary" 3897 3362 2360
"enterprise-rhel-62-64-bit" "startSet" 1871 1616 903
"enterprise-rhel-62-64-bit" "stopSetConsistencyChecks" 774 175 3585
"enterprise-rhel-62-64-bit" "stopSetShutdown" 1665 1538 1114
"enterprise-rhel-62-64-bit-sequential" "initiateWithNodeZeroAsPrimary" 3847 3470 2271
"enterprise-rhel-62-64-bit-sequential" "startSet" 1834 1615 808
"enterprise-rhel-62-64-bit-sequential" "stopSetConsistencyChecks" 767 156 3583
"enterprise-rhel-62-64-bit-sequential" "stopSetShutdown" 1599 1529 882
"linux-64-debug" "initiateWithNodeZeroAsPrimary" 3968 3428 2301
"linux-64-debug" "startSet" 1987 1817 1026
"linux-64-debug" "stopSetConsistencyChecks" 853 225 3556
"linux-64-debug" "stopSetShutdown" 1849 1593 1544
"ubuntu_1610_local_workstation" "initiateWithNodeZeroAsPrimary" 8407 9122 3352
"ubuntu_1610_local_workstation" "startSet" 5719 5326 2765
"ubuntu_1610_local_workstation" "stopSetConsistencyChecks" 4821 4434 5631
"ubuntu_1610_local_workstation" "stopSetShutdown" 5414 4180 9141
"ubuntu_1610_local_workstation_ramdisk" "initiateWithNodeZeroAsPrimary" 5532 5922 2509
"ubuntu_1610_local_workstation_ramdisk" "startSet" 2464 2152 1097
"ubuntu_1610_local_workstation_ramdisk" "stopSetConsistencyChecks" 3797 3424 5062
"ubuntu_1610_local_workstation_ramdisk" "stopSetShutdown" 3433 2466 8761

sharding

Build Variant Metric Mean (ms) Median (ms) SD (ms)
"enterprise-rhel-62-64-bit" "startupAndInitiate" 4921 4234 3318
"enterprise-rhel-62-64-bit" "startupAndInitiateConfigServer" 5664 5752 1369
"enterprise-rhel-62-64-bit" "stopShards" 2387 1788 3533
"linux-64-debug" "startupAndInitiate" 4867 4270 3005
"linux-64-debug" "startupAndInitiateConfigServer" 5666 5800 1216
"linux-64-debug" "stopShards" 2514 1901 3572