- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Performance Benchmarking
- Go Drivers
Context
Right now, we compare the results of the benchmarks run in the "perf" task with the "stable region" from a set of previous waterfall builds, and post the diff of those timings on PRs. That comparison has proved to be quite noisy and isn't providing any value. We need to find a way to reduce the noise so that only real performance diffs are called out.
At least one source of noise is that individual benchmark runs often differ significantly from the baseline, but that difference doesn't stay consistent across runs. To make sure the results are valid, we need to collect data from multiple runs (5 or more) of the "perf" task and check whether they form a new stable region. I recommend starting by running the benchmarks in the "perf" task 5 or more times and comparing all of the new results to see whether they form a new stable region.
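A minimal sketch of what that stability check could look like, assuming "stable" means the relative standard deviation across runs stays below some threshold. The function name, the coefficient-of-variation metric, and the 5% cutoff are all illustrative assumptions, not the actual criterion used by our tooling.

```go
package main

import (
	"fmt"
	"math"
)

// isStableRegion reports whether a set of per-run timings is consistent
// enough to treat as a single stable region. Here "stable" is assumed to
// mean the relative standard deviation stays under maxRelStdDev.
func isStableRegion(timings []float64, maxRelStdDev float64) bool {
	if len(timings) < 2 {
		return false
	}
	var sum float64
	for _, t := range timings {
		sum += t
	}
	mean := sum / float64(len(timings))

	var variance float64
	for _, t := range timings {
		variance += (t - mean) * (t - mean)
	}
	stdDev := math.Sqrt(variance / float64(len(timings)))

	return stdDev/mean <= maxRelStdDev
}

func main() {
	// Example: timings (in seconds) from 5 hypothetical "perf" task runs.
	runs := []float64{1.02, 0.98, 1.01, 0.99, 1.03}
	fmt.Println("stable:", isStableRegion(runs, 0.05))
}
```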
Definition of done
- Run the benchmarks in the "perf" task 5 or more times, reporting all results.
  - It's possible the simplest way to implement this is to run the "perf" task itself 5 times, but it's not yet clear how to do that.
- Update the perfcomp tool to fetch stats from multiple benchmark runs per task and compare the waterfall and patch build stat series to see whether there's a new stable region (a rough sketch follows this list).
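A rough sketch of the kind of comparison perfcomp could make once it has both stat series. The types, the mean/standard-deviation comparison, and the factor of 2 on the pooled noise are assumptions for illustration; the real tool may use a different statistical test or threshold.

```go
package main

import (
	"fmt"
	"math"
)

// series holds the timings from multiple runs of one benchmark,
// either from the waterfall baseline or from the patch build.
type series struct {
	name    string
	timings []float64 // one entry per benchmark run
}

// stats returns the mean and standard deviation of a set of timings.
func stats(xs []float64) (mean, stdDev float64) {
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		stdDev += (x - mean) * (x - mean)
	}
	stdDev = math.Sqrt(stdDev / float64(len(xs)))
	return mean, stdDev
}

// significantDiff reports whether the patch series differs from the
// waterfall series by more than the combined run-to-run noise.
// The factor of 2 on the pooled noise is an illustrative choice.
func significantDiff(waterfall, patch series) (diff float64, significant bool) {
	wMean, wStd := stats(waterfall.timings)
	pMean, pStd := stats(patch.timings)
	diff = pMean - wMean
	noise := 2 * math.Hypot(wStd, pStd)
	return diff, math.Abs(diff) > noise
}

func main() {
	waterfall := series{"waterfall", []float64{1.00, 1.02, 0.99, 1.01, 1.00}}
	patch := series{"patch", []float64{1.10, 1.12, 1.09, 1.11, 1.10}}
	diff, sig := significantDiff(waterfall, patch)
	fmt.Printf("diff=%.3fs significant=%v\n", diff, sig)
}
```

The idea is that a diff only gets called out on a PR when it exceeds the noise observed within each stat series, rather than whenever a single run happens to differ from the baseline.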
Pitfalls
What should the implementer watch out for? What are the risks?