- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Server Programmability
- Fully Compatible
- ALL
- Programmability 2025-02-03
- 200
Unfortunately, the way we're measuring performance in BenchmarkWithProfiler is statistically flawed.
The problem stems from a misunderstanding of how the benchmark counters work.
It affects critical benchmarks such as service_entry_point_shard_role_bm that we use for gating commits.
The effects are severe for the cycles-per-iteration metric but minimal for the instructions-per-iteration metric.
The benchmark library heuristically estimates the number of iterations required for stability of the results.
To do this, it runs several iterations as warmups. First with just one iteration, then perhaps with 10, 100, etc. All of these runs are performed using the same instance of the fixture object, but each thread of each run will get a separate State object to own its counters.
From these initial probing runs, the benchmark framework calculates a very particular number (like 13523) for the number of iterations it will use, and then it runs the real benchmark, throwing away the counters from all of its previous calibration runs. This is important! Those runs were not statistically valuable, as the benchmark library is trying to get a measurement of the steady cost of a large number of iterations.
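For contrast, counters routed through that State/Counter API are scoped to a single run, so only the final measured run's values get reported. Here is a minimal sketch of that pattern (illustrative names only, not the actual fixture code):
#include <benchmark/benchmark.h>

#include <cstdint>

// Stand-in for one unit of benchmarked work; returns a fake "cycles spent" value.
static uint64_t doOneIteration() {
    benchmark::ClobberMemory();
    return 1;
}

// Counters set on the State object belong to this run only; the library
// discards the reports from its calibration runs for us.
static void BM_CounterApiSketch(benchmark::State& state) {
    uint64_t cycles = 0;
    for (auto _ : state) {
        cycles += doOneIteration();
    }
    // kAvgIterations divides the reported value by this run's iteration count.
    state.counters["cycles_per_iteration"] = benchmark::Counter(
        static_cast<double>(cycles), benchmark::Counter::kAvgIterations);
}
BENCHMARK(BM_CounterApiSketch)->ThreadRange(1, 32);

BENCHMARK_MAIN();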
For reasons I am struggling to understand, the BenchmarkWithProfiler fixture ignores the benchmark library's Counter API (docs) and instead captures the counters from every thread of every run into integer accumulators owned by the fixture. That means the counters accumulated from the early, invalid calibration runs are NOT thrown away.
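The effect is roughly what the following simplified sketch does (illustrative only, not the actual BenchmarkWithProfiler code): the accumulators outlive each run because they are members of the fixture.
#include <benchmark/benchmark.h>

#include <atomic>
#include <cstdint>

// Simplified illustration of the failure mode: the totals live on the fixture,
// so every calibration run, and every later thread count, keeps adding to them.
class AccumulatingFixture : public benchmark::Fixture {
protected:
    std::atomic<uint64_t> totalCycles{0};      // never reset between runs
    std::atomic<uint64_t> totalIterations{0};  // never reset between runs
};

BENCHMARK_DEFINE_F(AccumulatingFixture, BM_Sketch)(benchmark::State& state) {
    uint64_t cycles = 0;
    uint64_t iterations = 0;
    for (auto _ : state) {
        cycles += 1;  // stand-in for reading a hardware cycle counter
        ++iterations;
    }
    // Folding this run's numbers into fixture-owned totals lets the early,
    // discarded calibration runs pollute the final report.
    totalCycles += cycles;
    totalIterations += iterations;
    state.counters["cycles_per_iteration"] =
        static_cast<double>(totalCycles) / static_cast<double>(totalIterations);
}
BENCHMARK_REGISTER_F(AccumulatingFixture, BM_Sketch)->ThreadRange(1, 32);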
What's worse: these fixture-owned counters are not cleared between runs for different thread counts. Benchmarks like SEP_PING run over a thread range, a power-of-two progression up to physical cores * 2, so the runs happen with threads=1, 2, 4, 8, 16, 32, in that order. By the time we get to threads=32, the counters for cycles, instructions, and iterations are already populated with all of the results from the previous thread-count trials. So we don't really understand the performance at any thread count except the first: threads=1.
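For reference, that thread progression is what the benchmark library's ThreadRange produces (it steps by a factor of two between its bounds); a hypothetical registration for this benchmark would look something like:
// Hypothetical registration: ThreadRange(1, 32) yields runs at
// threads = 1, 2, 4, 8, 16, 32, all sharing the single fixture instance.
BENCHMARK_REGISTER_F(ServiceEntryPointShardRoleBenchmarkFixture, BM_SEP_PING)
    ->ThreadRange(1, 32);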
Here's a sample run, with some diagnostic information printed from the destructor of a BenchmarkWithProfiler instance.
Note that there is only one such object; it is reused for all of the thread counts and for the calibration runs at each thread count.
Run on (16 X 2100 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x16)
L1 Instruction 64 KiB (x16)
L2 Unified 1024 KiB (x16)
L3 Unified 32768 KiB (x1)
Load Average: 6.67, 11.02, 8.63
----------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------
ServiceEntryPointShardRoleBenchmarkFixture/BM_SEP_PING/threads:1 7904 ns 7904 ns 87078 cycles=816.738M cycles_per_iteration=8.31802k cycles_per_iteration_correct=8.29946k instructions=0 instructions_per_iteration=0 instructions_per_iteration_correct=0 iterations_incorrect=98.189k
ServiceEntryPointShardRoleBenchmarkFixture/BM_SEP_PING/threads:2 4109 ns 8218 ns 81616 cycles=1.73257G cycles_per_iteration=8.57591k cycles_per_iteration_correct=8.6597k instructions=0 instructions_per_iteration=0 instructions_per_iteration_correct=0 iterations_incorrect=202.027k
ServiceEntryPointShardRoleBenchmarkFixture/BM_SEP_PING/threads:4 2176 ns 8703 ns 79048 cycles=2.89015G cycles_per_iteration=8.87858k cycles_per_iteration_correct=9.30224k instructions=0 instructions_per_iteration=0 instructions_per_iteration_correct=0 iterations_incorrect=325.519k
ServiceEntryPointShardRoleBenchmarkFixture/BM_SEP_PING/threads:8 1195 ns 9526 ns 70288 cycles=3.71028G cycles_per_iteration=9.1681k cycles_per_iteration_correct=10.2311k instructions=0 instructions_per_iteration=0 instructions_per_iteration_correct=0 iterations_incorrect=404.695k
ServiceEntryPointShardRoleBenchmarkFixture/BM_SEP_PING/threads:16 1047 ns 16127 ns 44688 cycles=4.83899G cycles_per_iteration=10.3583k cycles_per_iteration_correct=17.9449k instructions=0 instructions_per_iteration=0 instructions_per_iteration_correct=0 iterations_incorrect=467.159k
ServiceEntryPointShardRoleBenchmarkFixture/BM_SEP_PING/threads:32 892 ns 15128 ns 44416 cycles=6.49673G cycles_per_iteration=12.6119k cycles_per_iteration_correct=33.8196k instructions=0 instructions_per_iteration=0 instructions_per_iteration_correct=0 iterations_incorrect=515.127k
StatHistory: {
threads: 1, d(instr): 0, d(cyc): 155296, d(iter): 1, cyc_per_iter: 155296.000, naive_estimate:{tot_cyc: 155296, tot_iter: 1, cyc_per_iter: 155296.000}
threads: 1, d(instr): 0, d(cyc): 119417, d(iter): 10, cyc_per_iter: 11941.700, naive_estimate:{tot_cyc: 274713, tot_iter: 11, cyc_per_iter: 24973.909}
threads: 1, d(instr): 0, d(cyc): 844542, d(iter): 100, cyc_per_iter: 8445.420, naive_estimate:{tot_cyc: 1119255, tot_iter: 111, cyc_per_iter: 10083.378}
threads: 1, d(instr): 0, d(cyc): 8508919, d(iter): 1000, cyc_per_iter: 8508.919, naive_estimate:{tot_cyc: 9628174, tot_iter: 1111, cyc_per_iter: 8666.223}
threads: 1, d(instr): 0, d(cyc): 84409364, d(iter): 10000, cyc_per_iter: 8440.936, naive_estimate:{tot_cyc: 94037538, tot_iter: 11111, cyc_per_iter: 8463.463}
threads: 1, d(instr): 0, d(cyc): 722700531, d(iter): 87078, cyc_per_iter: 8299.462, naive_estimate:{tot_cyc: 816738069, tot_iter: 98189, cyc_per_iter: 8318.020}
threads: 2, d(instr): 0, d(cyc): 140505, d(iter): 2, cyc_per_iter: 70252.500, naive_estimate:{tot_cyc: 816878574, tot_iter: 98191, cyc_per_iter: 8319.282}
threads: 2, d(instr): 0, d(cyc): 321117, d(iter): 20, cyc_per_iter: 16055.850, naive_estimate:{tot_cyc: 817199691, tot_iter: 98211, cyc_per_iter: 8320.857}
threads: 2, d(instr): 0, d(cyc): 2254988, d(iter): 200, cyc_per_iter: 11274.940, naive_estimate:{tot_cyc: 819454679, tot_iter: 98411, cyc_per_iter: 8326.861}
threads: 2, d(instr): 0, d(cyc): 22499033, d(iter): 2000, cyc_per_iter: 11249.516, naive_estimate:{tot_cyc: 841953712, tot_iter:100411, cyc_per_iter: 8385.074}
threads: 2, d(instr): 0, d(cyc): 183841487, d(iter): 20000, cyc_per_iter: 9192.074, naive_estimate:{tot_cyc: 1025795199, tot_iter:120411, cyc_per_iter: 8519.115}
threads: 2, d(instr): 0, d(cyc): 706769995, d(iter): 81616, cyc_per_iter: 8659.699, naive_estimate:{tot_cyc: 1732565194, tot_iter:202027, cyc_per_iter: 8575.909}
threads: 4, d(instr): 0, d(cyc): 380444, d(iter): 4, cyc_per_iter: 95111.000, naive_estimate:{tot_cyc: 1732945638, tot_iter:202031, cyc_per_iter: 8577.622}
threads: 4, d(instr): 0, d(cyc): 734612, d(iter): 40, cyc_per_iter: 18365.300, naive_estimate:{tot_cyc: 1733680250, tot_iter:202071, cyc_per_iter: 8579.560}
threads: 4, d(instr): 0, d(cyc): 4671117, d(iter): 400, cyc_per_iter: 11677.792, naive_estimate:{tot_cyc: 1738351367, tot_iter:202471, cyc_per_iter: 8585.681}
threads: 4, d(instr): 0, d(cyc): 39769085, d(iter): 4000, cyc_per_iter: 9942.271, naive_estimate:{tot_cyc: 1778120452, tot_iter:206471, cyc_per_iter: 8611.962}
threads: 4, d(instr): 0, d(cyc): 376701509, d(iter): 40000, cyc_per_iter: 9417.538, naive_estimate:{tot_cyc: 2154821961, tot_iter:246471, cyc_per_iter: 8742.700}
threads: 4, d(instr): 0, d(cyc): 735323203, d(iter): 79048, cyc_per_iter: 9302.237, naive_estimate:{tot_cyc: 2890145164, tot_iter:325519, cyc_per_iter: 8878.576}
threads: 8, d(instr): 0, d(cyc): 985894, d(iter): 8, cyc_per_iter: 123236.750, naive_estimate:{tot_cyc: 2891131058, tot_iter:325527, cyc_per_iter: 8881.386}
threads: 8, d(instr): 0, d(cyc): 1579173, d(iter): 80, cyc_per_iter: 19739.662, naive_estimate:{tot_cyc: 2892710231, tot_iter:325607, cyc_per_iter: 8884.054}
threads: 8, d(instr): 0, d(cyc): 9651483, d(iter): 800, cyc_per_iter: 12064.354, naive_estimate:{tot_cyc: 2902361714, tot_iter:326407, cyc_per_iter: 8891.849}
threads: 8, d(instr): 0, d(cyc): 88797277, d(iter): 8000, cyc_per_iter: 11099.660, naive_estimate:{tot_cyc: 2991158991, tot_iter:334407, cyc_per_iter: 8944.666}
threads: 8, d(instr): 0, d(cyc): 719125504, d(iter): 70288, cyc_per_iter: 10231.128, naive_estimate:{tot_cyc: 3710284495, tot_iter:404695, cyc_per_iter: 9168.101}
threads:16, d(instr): 0, d(cyc): 3007933, d(iter): 16, cyc_per_iter: 187995.812, naive_estimate:{tot_cyc: 3713292428, tot_iter:404711, cyc_per_iter: 9175.170}
threads:16, d(instr): 0, d(cyc): 5565055, d(iter): 160, cyc_per_iter: 34781.594, naive_estimate:{tot_cyc: 3718857483, tot_iter:404871, cyc_per_iter: 9185.290}
threads:16, d(instr): 0, d(cyc): 36101661, d(iter): 1600, cyc_per_iter: 22563.538, naive_estimate:{tot_cyc: 3754959144, tot_iter:406471, cyc_per_iter: 9237.951}
threads:16, d(instr): 0, d(cyc): 282106059, d(iter): 16000, cyc_per_iter: 17631.629, naive_estimate:{tot_cyc: 4037065203, tot_iter:422471, cyc_per_iter: 9555.840}
threads:16, d(instr): 0, d(cyc): 801921020, d(iter): 44688, cyc_per_iter: 17944.885, naive_estimate:{tot_cyc: 4838986223, tot_iter:467159, cyc_per_iter: 10358.328}
threads:32, d(instr): 0, d(cyc): 10685833, d(iter): 32, cyc_per_iter: 333932.281, naive_estimate:{tot_cyc: 4849672056, tot_iter:467191, cyc_per_iter: 10380.491}
threads:32, d(instr): 0, d(cyc): 18319491, d(iter): 320, cyc_per_iter: 57248.409, naive_estimate:{tot_cyc: 4867991547, tot_iter:467511, cyc_per_iter: 10412.571}
threads:32, d(instr): 0, d(cyc): 126611298, d(iter): 3200, cyc_per_iter: 39566.031, naive_estimate:{tot_cyc: 4994602845, tot_iter:470711, cyc_per_iter: 10610.763}
threads:32, d(instr): 0, d(cyc): 1502129367, d(iter): 44416, cyc_per_iter: 33819.555, naive_estimate:{tot_cyc: 6496732212, tot_iter:515127, cyc_per_iter: 12611.904}
}
The "naive" column shows the answer that's reported by the benchmark: for the threads:32 case, it's 12.6k cycles per iteration.
The real number for that thread count is 33.8k cycles per iteration. Underestimating the cost by 63%!
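As a sanity check, both figures can be recomputed from the threads:32 row of the StatHistory above (a standalone sketch; the constants are copied from that output):
#include <cstdio>

int main() {
    // Deltas for the final threads:32 measured run only.
    const double d_cyc = 1502129367.0, d_iter = 44416.0;
    // Fixture-owned running totals, polluted by every calibration run and
    // every earlier thread count.
    const double tot_cyc = 6496732212.0, tot_iter = 515127.0;

    const double correct = d_cyc / d_iter;    // ~33.8k cycles per iteration
    const double naive = tot_cyc / tot_iter;  // ~12.6k cycles per iteration
    std::printf("correct=%.0f naive=%.0f underestimate=%.0f%%\n",
                correct, naive, 100.0 * (1.0 - naive / correct));
    return 0;
}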