Increase sample size in join tests for better plans

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Query Optimization

      Overview

      The query_golden_join_optimization_plan_stability resmoke suite uses the
      default internalJoinPlanSamplingSize=1000 for the join optimizer's
      cardinality-estimation sampler. Several queries in the
      plan_stability_tpch_fuzzed.js input set have outer-side filters with
      selectivity below 0.2%. With the deterministic sequential sample of 1000
      docs, the sampler misses all matching rows for those filters and returns
      CE=0, so the cost terms collapse to 0 across the join graph. The join
      enumerator then falls back to its deterministic ordering
      (first-tried-INLJ-when-an-index-exists) and produces a plan that isn't a
      cost-based choice.
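
      The failure mode can be illustrated with a minimal sketch. It is not the
      server's sampler: the function estimate_ce, the synthetic 20,000-row
      table, and the filter are hypothetical, and the sketch only assumes that
      CE is the match count in a sequential prefix scaled up to the table size.

```python
def estimate_ce(rows, predicate, sample_size):
    """Hypothetical estimator: scale the match count found in the first
    `sample_size` rows (a sequential sample) up to the full table."""
    sample = rows[:sample_size]
    if not sample:
        return 0.0
    matches = sum(1 for row in sample if predicate(row))
    return matches * len(rows) / len(sample)


# Synthetic table: 20,000 rows, 10 of which match (0.05% selectivity).
# The first matching row sits at position 1500, past a 1000-row prefix.
rows = list(range(20_000))


def is_match(row):
    return row % 2_000 == 1_500


ce_default = estimate_ce(rows, is_match, 1_000)   # prefix contains no match
ce_larger = estimate_ce(rows, is_match, 10_000)   # prefix contains 5 matches

print(ce_default, ce_larger)
```

      In this toy setup the 1000-row prefix yields an estimate of zero, while
      the 10000-row prefix recovers the true count of 10. Any cost term that
      multiplies by the zero estimate is zero as well, which is the collapse
      described above.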

      This causes brittle baselines: the recorded "good plan" is in fact an
      under-sampling artifact, and any cost-model change that nudges CE estimates
      even slightly (e.g. SERVER-123070's kMinCE clamp) produces visible
      regressions against the baselines, even when the new plans are actually
      closer to ground truth.

      Proposed Change

      Set internalJoinPlanSamplingSize: 10000 in
      buildscripts/resmokeconfig/suites/query_golden_join_optimization_plan_stability.yml
      and regenerate the golden baselines on master.

      This is a 10x increase. At a sample size of 10000:

      • part (20 K rows): 50% sample. Every realistic outer-side filter is reliably estimated.
      • orders (150 K rows): ~6.7% sample. Even the 0.02% selectivity filters in idx 211 land about 2 sample matches in expectation (0.02% of 10000).
      • lineitem (~600 K rows): ~1.7% sample. Adequate for any non-pathological filter.
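
      The figures above follow from simple arithmetic, sketched below. The
      helper names are hypothetical, the lineitem row count is the approximate
      value quoted above, and the expected-match figure assumes matching rows
      are spread evenly through the sampled range, which a deterministic
      sequential sample does not guarantee.

```python
SAMPLE_SIZE = 10_000

# Row counts as quoted in this ticket (lineitem is approximate).
TABLE_ROWS = {"part": 20_000, "orders": 150_000, "lineitem": 600_000}


def sample_fraction(table_rows, sample_size=SAMPLE_SIZE):
    """Fraction of the table covered by the sample, capped at 100%."""
    return min(1.0, sample_size / table_rows)


def expected_matches(selectivity, sample_size=SAMPLE_SIZE):
    """Expected number of sampled rows passing a filter, assuming the
    sample is representative of the whole table."""
    return selectivity * sample_size


for name, row_count in TABLE_ROWS.items():
    print(f"{name}: {sample_fraction(row_count):.1%} of rows sampled")

# A 0.02%-selective filter at the proposed and at the default sample size.
print(expected_matches(0.0002, 10_000))
print(expected_matches(0.0002, 1_000))
```

      Under the same assumption, the default sample of 1000 expects only a
      fraction of one match for a 0.02%-selective filter, which is consistent
      with the CE=0 results described in the overview.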

      Rationale

      A sweep of internalJoinPlanSamplingSize on clean master against the
      "idx 1" query (5-way TPC-H join with a 0.145%-selective part filter)
      shows the optimizer's plan choice converges quickly and stably:

      Sample size             Plan shape                            keys   docs
      1000 (default)          INLJ + FETCH+IDX_PROBE all the way    132    20132
      3000                    HJ on outer 2, INLJ on lower 3        124    20154
      5000                    HJ on outer 3, INLJ on innermost      116    21146
      10000                   same as 5000                          116    21146
      20000                   same as 5000                          116    21146
      exactCE (ground truth)  same as 5000                          116    21146

      The 10000-sample plan matches what the cost model produces with exact
      cardinalities. The default-sample-size baseline is the only one that
      doesn't.

      Verification

      End-to-end test on SERVER-123070 (the branch this finding came out of):

      • Bumped suite YAML to internalJoinPlanSamplingSize: 10000 on clean master.
      • Regenerated baselines for plan_stability_tpch_fuzzed and plan_stability_subjoin_cardinality.md. plan_stability_tpch_official was unchanged (its queries were already adequately sampled at 1000).
      • Copied YAML and baselines to the SERVER-123070 branch, re-ran all three tests: 0-line diff against new baselines, all pass.

      Cost

      Suite wall-clock at j=1:

      Sample size       Total runtime
      1000 (default)    ~90 s
      10000             ~145 s

      That is about 55 s (roughly 60%) of added wall-clock time, dominated by
      plan_stability_tpch_fuzzed, which runs 222 commands.

      Scope of Work

      • buildscripts/resmokeconfig/suites/query_golden_join_optimization_plan_stability.yml — add internalJoinPlanSamplingSize: 10000 (with comment explaining why).
      • jstests/query_golden/expected_output/plan_stability_tpch_fuzzed — regenerate.
      • jstests/query_golden/expected_output/plan_stability_subjoin_cardinality.md — regenerate.

      Acceptance Criteria

      • Suite YAML carries the explicit sample-size override with a comment.
      • Both regenerated baselines committed.
      • All three tests in the suite pass at the new baselines.
      • SERVER-123070's plan_stability join-opt diff disappears once that branch rebases onto this change.

      Background

      See full analysis attached as a comment: per-query sample-size sweep, why
      the default 1000-sample plan is an under-sampling artifact, and the
      relationship to SERVER-123070.

            Assignee:
            Philip Stoev
            Reporter:
            Timour Katchaounov