- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Server Programmability
- Fully Compatible
- ALL
- Programmability 2025-02-03
- 200
Unfortunately, the way we're measuring performance in BenchmarkWithProfiler is statistically flawed.
The problem stems from a misunderstanding of how the benchmark counters work.
It affects critical benchmarks such as service_entry_point_shard_role_bm that we use for gating commits.
The effects are severe for the cycles-per-iteration metric but minimal for the instructions-per-iteration metric.
The benchmark library heuristically estimates the number of iterations required for stability of the results.
To do this, it runs several iterations as warmups. First with just one iteration, then perhaps with 10, 100, etc. All of these runs are performed using the same instance of the fixture object, but each thread of each run will get a separate State object to own its counters.
From these initial probing runs, the benchmark framework calculates a very particular number (like 13523) for the number of iterations it will use, and then it runs the real benchmark, throwing away the counters from all of its previous calibration runs. This is important! Those runs were not statistically valuable, as the benchmark library is trying to get a measurement of the steady cost of a large number of iterations.
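For contrast, counters routed through that State/Counter API are scoped to a single run, so only the final measured run's values get reported. Here is a minimal sketch of that pattern (illustrative names only, not the actual fixture code):
#include <benchmark/benchmark.h>

#include <cstdint>

// Stand-in for one unit of benchmarked work; returns a fake "cycles spent" value.
static uint64_t doOneIteration() {
    benchmark::ClobberMemory();
    return 1;
}

// Counters set on the State object belong to this run only; the library
// discards the reports from its calibration runs for us.
static void BM_CounterApiSketch(benchmark::State& state) {
    uint64_t cycles = 0;
    for (auto _ : state) {
        cycles += doOneIteration();
    }
    // kAvgIterations divides the reported value by this run's iteration count.
    state.counters["cycles_per_iteration"] = benchmark::Counter(
        static_cast<double>(cycles), benchmark::Counter::kAvgIterations);
}
BENCHMARK(BM_CounterApiSketch)->ThreadRange(1, 32);

BENCHMARK_MAIN();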
For reasons I am struggling to understand, the BenchmarkWithProfiler fixture ignores the benchmark library's Counter API (docs) and instead captures the counters from every thread of every run into integer accumulators owned by the fixture. That means the counters accumulated from the early, invalid calibration runs are NOT thrown away.
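The effect is roughly what the following simplified sketch does (illustrative only, not the actual BenchmarkWithProfiler code): the accumulators outlive each run because they are members of the fixture.
#include <benchmark/benchmark.h>

#include <atomic>
#include <cstdint>

// Simplified illustration of the failure mode: the totals live on the fixture,
// so every calibration run, and every later thread count, keeps adding to them.
class AccumulatingFixture : public benchmark::Fixture {
protected:
    std::atomic<uint64_t> totalCycles{0};      // never reset between runs
    std::atomic<uint64_t> totalIterations{0};  // never reset between runs
};

BENCHMARK_DEFINE_F(AccumulatingFixture, BM_Sketch)(benchmark::State& state) {
    uint64_t cycles = 0;
    uint64_t iterations = 0;
    for (auto _ : state) {
        cycles += 1;  // stand-in for reading a hardware cycle counter
        ++iterations;
    }
    // Folding this run's numbers into fixture-owned totals lets the early,
    // discarded calibration runs pollute the final report.
    totalCycles += cycles;
    totalIterations += iterations;
    state.counters["cycles_per_iteration"] =
        static_cast<double>(totalCycles) / static_cast<double>(totalIterations);
}
BENCHMARK_REGISTER_F(AccumulatingFixture, BM_Sketch)->ThreadRange(1, 32);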
What's worse: these fixture-owned counters are not cleared between runs for different thread counts. Benchmarks like SEP_PING run over a thread range, a power-of-two progression up to physical cores * 2, so the runs happen with threads=1, 2, 4, 8, 16, 32, in that order. By the time we get to threads=32, the counters for cycles, instructions, and iterations are already populated with all of the results from the previous thread-count trials. So we don't really understand the performance at any thread count except the first: threads=1.
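For reference, that thread progression is what the benchmark library's ThreadRange produces (it steps by a factor of two between its bounds); a hypothetical registration for this benchmark would look something like:
// Hypothetical registration: ThreadRange(1, 32) yields runs at
// threads = 1, 2, 4, 8, 16, 32, all sharing the single fixture instance.
BENCHMARK_REGISTER_F(ServiceEntryPointShardRoleBenchmarkFixture, BM_SEP_PING)
    ->ThreadRange(1, 32);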
Here's a sample run, with some diagnostic information printed from the destructor of a BenchmarkWithProfiler instance.
Note that there is only one such object; it is reused for all of the thread counts and for the calibration runs at each thread count.
Run on (16 X 2100 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x16)
L1 Instruction 64 KiB (x16)
L2 Unified 1024 KiB (x16)
L3 Unified 32768 KiB (x1)
Load Average: 6.67, 11.02, 8.63
----------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------
ServiceEntryPointShardRoleBenchmarkFixture/BM_SEP_PING/threads:1 7904 ns 7904 ns 87078 cycles=816.738M cycles_per_iteration=8.31802k cycles_per_iteration_correct=8.29946k instructions=0 instructions_per_iteration=0 instructions_per_iteration_correct=0 iterations_incorrect=98.189k
ServiceEntryPointShardRoleBenchmarkFixture/BM_SEP_PING/threads:2 4109 ns 8218 ns 81616 cycles=1.73257G cycles_per_iteration=8.57591k cycles_per_iteration_correct=8.6597k instructions=0 instructions_per_iteration=0 instructions_per_iteration_correct=0 iterations_incorrect=202.027k
ServiceEntryPointShardRoleBenchmarkFixture/BM_SEP_PING/threads:4 2176 ns 8703 ns 79048 cycles=2.89015G cycles_per_iteration=8.87858k cycles_per_iteration_correct=9.30224k instructions=0 instructions_per_iteration=0 instructions_per_iteration_correct=0 iterations_incorrect=325.519k
ServiceEntryPointShardRoleBenchmarkFixture/BM_SEP_PING/threads:8 1195 ns 9526 ns 70288 cycles=3.71028G cycles_per_iteration=9.1681k cycles_per_iteration_correct=10.2311k instructions=0 instructions_per_iteration=0 instructions_per_iteration_correct=0 iterations_incorrect=404.695k
ServiceEntryPointShardRoleBenchmarkFixture/BM_SEP_PING/threads:16 1047 ns 16127 ns 44688 cycles=4.83899G cycles_per_iteration=10.3583k cycles_per_iteration_correct=17.9449k instructions=0 instructions_per_iteration=0 instructions_per_iteration_correct=0 iterations_incorrect=467.159k
ServiceEntryPointShardRoleBenchmarkFixture/BM_SEP_PING/threads:32 892 ns 15128 ns 44416 cycles=6.49673G cycles_per_iteration=12.6119k cycles_per_iteration_correct=33.8196k instructions=0 instructions_per_iteration=0 instructions_per_iteration_correct=0 iterations_incorrect=515.127k
StatHistory: {
threads: 1, d(instr): 0, d(cyc): 155296, d(iter): 1, cyc_per_iter: 155296.000, naive_estimate:{tot_cyc: 155296, tot_iter: 1, cyc_per_iter: 155296.000}
threads: 1, d(instr): 0, d(cyc): 119417, d(iter): 10, cyc_per_iter: 11941.700, naive_estimate:{tot_cyc: 274713, tot_iter: 11, cyc_per_iter: 24973.909}
threads: 1, d(instr): 0, d(cyc): 844542, d(iter): 100, cyc_per_iter: 8445.420, naive_estimate:{tot_cyc: 1119255, tot_iter: 111, cyc_per_iter: 10083.378}
threads: 1, d(instr): 0, d(cyc): 8508919, d(iter): 1000, cyc_per_iter: 8508.919, naive_estimate:{tot_cyc: 9628174, tot_iter: 1111, cyc_per_iter: 8666.223}
threads: 1, d(instr): 0, d(cyc): 84409364, d(iter): 10000, cyc_per_iter: 8440.936, naive_estimate:{tot_cyc: 94037538, tot_iter: 11111, cyc_per_iter: 8463.463}
threads: 1, d(instr): 0, d(cyc): 722700531, d(iter): 87078, cyc_per_iter: 8299.462, naive_estimate:{tot_cyc: 816738069, tot_iter: 98189, cyc_per_iter: 8318.020}
threads: 2, d(instr): 0, d(cyc): 140505, d(iter): 2, cyc_per_iter: 70252.500, naive_estimate:{tot_cyc: 816878574, tot_iter: 98191, cyc_per_iter: 8319.282}
threads: 2, d(instr): 0, d(cyc): 321117, d(iter): 20, cyc_per_iter: 16055.850, naive_estimate:{tot_cyc: 817199691, tot_iter: 98211, cyc_per_iter: 8320.857}
threads: 2, d(instr): 0, d(cyc): 2254988, d(iter): 200, cyc_per_iter: 11274.940, naive_estimate:{tot_cyc: 819454679, tot_iter: 98411, cyc_per_iter: 8326.861}
threads: 2, d(instr): 0, d(cyc): 22499033, d(iter): 2000, cyc_per_iter: 11249.516, naive_estimate:{tot_cyc: 841953712, tot_iter:100411, cyc_per_iter: 8385.074}
threads: 2, d(instr): 0, d(cyc): 183841487, d(iter): 20000, cyc_per_iter: 9192.074, naive_estimate:{tot_cyc: 1025795199, tot_iter:120411, cyc_per_iter: 8519.115}
threads: 2, d(instr): 0, d(cyc): 706769995, d(iter): 81616, cyc_per_iter: 8659.699, naive_estimate:{tot_cyc: 1732565194, tot_iter:202027, cyc_per_iter: 8575.909}
threads: 4, d(instr): 0, d(cyc): 380444, d(iter): 4, cyc_per_iter: 95111.000, naive_estimate:{tot_cyc: 1732945638, tot_iter:202031, cyc_per_iter: 8577.622}
threads: 4, d(instr): 0, d(cyc): 734612, d(iter): 40, cyc_per_iter: 18365.300, naive_estimate:{tot_cyc: 1733680250, tot_iter:202071, cyc_per_iter: 8579.560}
threads: 4, d(instr): 0, d(cyc): 4671117, d(iter): 400, cyc_per_iter: 11677.792, naive_estimate:{tot_cyc: 1738351367, tot_iter:202471, cyc_per_iter: 8585.681}
threads: 4, d(instr): 0, d(cyc): 39769085, d(iter): 4000, cyc_per_iter: 9942.271, naive_estimate:{tot_cyc: 1778120452, tot_iter:206471, cyc_per_iter: 8611.962}
threads: 4, d(instr): 0, d(cyc): 376701509, d(iter): 40000, cyc_per_iter: 9417.538, naive_estimate:{tot_cyc: 2154821961, tot_iter:246471, cyc_per_iter: 8742.700}
threads: 4, d(instr): 0, d(cyc): 735323203, d(iter): 79048, cyc_per_iter: 9302.237, naive_estimate:{tot_cyc: 2890145164, tot_iter:325519, cyc_per_iter: 8878.576}
threads: 8, d(instr): 0, d(cyc): 985894, d(iter): 8, cyc_per_iter: 123236.750, naive_estimate:{tot_cyc: 2891131058, tot_iter:325527, cyc_per_iter: 8881.386}
threads: 8, d(instr): 0, d(cyc): 1579173, d(iter): 80, cyc_per_iter: 19739.662, naive_estimate:{tot_cyc: 2892710231, tot_iter:325607, cyc_per_iter: 8884.054}
threads: 8, d(instr): 0, d(cyc): 9651483, d(iter): 800, cyc_per_iter: 12064.354, naive_estimate:{tot_cyc: 2902361714, tot_iter:326407, cyc_per_iter: 8891.849}
threads: 8, d(instr): 0, d(cyc): 88797277, d(iter): 8000, cyc_per_iter: 11099.660, naive_estimate:{tot_cyc: 2991158991, tot_iter:334407, cyc_per_iter: 8944.666}
threads: 8, d(instr): 0, d(cyc): 719125504, d(iter): 70288, cyc_per_iter: 10231.128, naive_estimate:{tot_cyc: 3710284495, tot_iter:404695, cyc_per_iter: 9168.101}
threads:16, d(instr): 0, d(cyc): 3007933, d(iter): 16, cyc_per_iter: 187995.812, naive_estimate:{tot_cyc: 3713292428, tot_iter:404711, cyc_per_iter: 9175.170}
threads:16, d(instr): 0, d(cyc): 5565055, d(iter): 160, cyc_per_iter: 34781.594, naive_estimate:{tot_cyc: 3718857483, tot_iter:404871, cyc_per_iter: 9185.290}
threads:16, d(instr): 0, d(cyc): 36101661, d(iter): 1600, cyc_per_iter: 22563.538, naive_estimate:{tot_cyc: 3754959144, tot_iter:406471, cyc_per_iter: 9237.951}
threads:16, d(instr): 0, d(cyc): 282106059, d(iter): 16000, cyc_per_iter: 17631.629, naive_estimate:{tot_cyc: 4037065203, tot_iter:422471, cyc_per_iter: 9555.840}
threads:16, d(instr): 0, d(cyc): 801921020, d(iter): 44688, cyc_per_iter: 17944.885, naive_estimate:{tot_cyc: 4838986223, tot_iter:467159, cyc_per_iter: 10358.328}
threads:32, d(instr): 0, d(cyc): 10685833, d(iter): 32, cyc_per_iter: 333932.281, naive_estimate:{tot_cyc: 4849672056, tot_iter:467191, cyc_per_iter: 10380.491}
threads:32, d(instr): 0, d(cyc): 18319491, d(iter): 320, cyc_per_iter: 57248.409, naive_estimate:{tot_cyc: 4867991547, tot_iter:467511, cyc_per_iter: 10412.571}
threads:32, d(instr): 0, d(cyc): 126611298, d(iter): 3200, cyc_per_iter: 39566.031, naive_estimate:{tot_cyc: 4994602845, tot_iter:470711, cyc_per_iter: 10610.763}
threads:32, d(instr): 0, d(cyc): 1502129367, d(iter): 44416, cyc_per_iter: 33819.555, naive_estimate:{tot_cyc: 6496732212, tot_iter:515127, cyc_per_iter: 12611.904}
}
The "naive" column shows the answer that's reported by the benchmark: for the threads:32 case, it's 12.6k cycles per iteration.
The real number for that thread count is 33.8k cycles per iteration. Underestimating the cost by 63%!
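As a sanity check, both figures can be recomputed from the threads:32 row of the StatHistory above (a standalone sketch; the constants are copied from that output):
#include <cstdio>

int main() {
    // Deltas for the final threads:32 measured run only.
    const double d_cyc = 1502129367.0, d_iter = 44416.0;
    // Fixture-owned running totals, polluted by every calibration run and
    // every earlier thread count.
    const double tot_cyc = 6496732212.0, tot_iter = 515127.0;

    const double correct = d_cyc / d_iter;    // ~33.8k cycles per iteration
    const double naive = tot_cyc / tot_iter;  // ~12.6k cycles per iteration
    std::printf("correct=%.0f naive=%.0f underestimate=%.0f%%\n",
                correct, naive, 100.0 * (1.0 - naive / correct));
    return 0;
}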