Support performance investigation for kernel 6.12 upgrade

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Concurrency
    • Storage Engines - Persistence
    • 128.953
    • SE Persistence backlog

      Summary

      The product performance team is evaluating regressions when upgrading from Linux kernel 6.1 (CFS scheduler) to kernel 6.12 (EEVDF scheduler) on MongoDB Atlas. Two distinct regressions have been identified:

      (a) Write throughput: ycsb_100update −10.7%

      (b) Read latency: mixed-workload read p50 +228–300% (ecommerce FindOneProduct p50 +300%, ycsb 50/50 read p50 +228%). The regression is slice-independent and requires concurrent writes: read-only ycsb_100read on 6.12 is clean or faster.

      Root cause (corrected from original description)

      The yield that tracks the regression is in MongoDB's transport layer: ServiceExecutor::yieldIfAppropriate → std::this_thread::yield() (service_executor.cpp), called 2×/request from SessionWorkflow::Impl (once after send, once before receive). Under thread-per-connection at 128 conns / 8 vCPUs, the guard runningThreads > cores is always true, so every request yields twice.
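
      For readers unfamiliar with the code path, the guard and the two yield points can be sketched roughly as below. This is a simplified illustration, not the actual service_executor.cpp or session_workflow.cpp source: ExecutorSketch, shouldYield and yieldsForOneRequest are hypothetical names, and only the runningThreads > cores condition and the std::this_thread::yield() call are taken from the description above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the executor's bookkeeping; not MongoDB's actual class.
struct ExecutorSketch {
    std::atomic<int> runningThreads{0};
    int cores = 1;

    // The guard described in the ticket: yield only when oversubscribed.
    bool shouldYield() const {
        return runningThreads.load(std::memory_order_relaxed) > cores;
    }

    // Returns true if a yield was actually issued.
    bool yieldIfAppropriate() {
        if (!shouldYield()) return false;
        std::this_thread::yield();  // typically sched_yield() on Linux
        return true;
    }
};

// Counts the yields issued around one request in the 2x/request pattern.
int yieldsForOneRequest(ExecutorSketch& ex) {
    assert(ex.cores > 0);
    int yields = 0;
    // ... response has just been sent ...
    yields += ex.yieldIfAppropriate() ? 1 : 0;  // after-send yield point
    yields += ex.yieldIfAppropriate() ? 1 : 0;  // before-receive yield point
    // ... next request is received here ...
    return yields;
}
```

      With cores = 8 and runningThreads = 128, as in the reported configuration, shouldYield() is true on every call, so each request pays for two yields.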

      On EEVDF, each sched_yield() costs roughly one full base slice (~2.8ms) instead of CFS's cheap requeue. At 2 yields/request, the per-op off-CPU cost rises from 7% to 19.6% of thread time. As a result, the throughput-probing admission controller converges to a smaller write-ticket pool (28→16), which collapses admitted write concurrency (3.35→2.45 active writers) and produces the −10.7% throughput drop in (a).
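
      One way to compare the per-yield cost across the two kernels outside of mongod is a small standalone harness along these lines. It is only a sketch: meanYieldCostNs is a hypothetical helper, the spinner threads are a crude stand-in for oversubscription, and the numbers it reports depend on the host, CPU count and cgroup limits, so they will not necessarily match the figures quoted above.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Mean wall-clock nanoseconds per std::this_thread::yield() call while
// `spinners` busy threads compete for the CPUs. Hypothetical helper.
double meanYieldCostNs(int spinners, int iterations) {
    assert(spinners >= 0 && iterations > 0);
    std::atomic<bool> stop{false};

    // Start CPU-bound competitors so the yielding thread has someone to yield to.
    std::vector<std::thread> busy;
    for (int i = 0; i < spinners; ++i) {
        busy.emplace_back([&stop] {
            while (!stop.load(std::memory_order_relaxed)) {
                // burn CPU
            }
        });
    }

    // Time a batch of yields on the calling thread.
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::this_thread::yield();
    }
    auto end = std::chrono::steady_clock::now();

    stop.store(true);
    for (auto& t : busy) t.join();

    return std::chrono::duration<double, std::nano>(end - start).count() / iterations;
}
```

      Running it with more spinners than vCPUs on a 6.1 host and on a 6.12 host would give a rough side-by-side view of how much wall-clock time one yield gives up under each scheduler.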

      Mitigations evaluated so far

      • base_slice_ns=750µs host knob: partially recovers the write path (−10.7%→−6.5% vs CFS), net-positive over plain 6.12 suite-wide, but does not address (b) and adds small regressions of its own.
      • Patch F: remove the after-send _yieldPointReached() call in session_workflow.cpp (yields 2×/req → 1×/req). This closes the ycsb_100update gap in CPU-bound workloads (−10.7%→−0.3%, ns) but does not fix (b). The before-receive yield (kept by Patch F) is the (b) lever: removing both yields halves read p50 on 50read50 (1259→626µs) at a −9% throughput / +17–22% write-latency cost. The yields were originally added to help tail-latency performance (see BF-27452 / SERVER-125097).
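
      When reproducing the base_slice_ns mitigation, it helps to record the value actually in effect on the host. The sketch below assumes the knob is exposed through debugfs at /sys/kernel/debug/sched/base_slice_ns, which requires debugfs to be mounted and usually root access; readBaseSliceNs is a hypothetical helper, not part of any MongoDB tooling.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Reads the scheduler base slice in nanoseconds from the given file.
// Returns -1 if the file cannot be opened or does not start with a number.
// The default path is an assumption about where the host exposes the knob.
long long readBaseSliceNs(
    const std::string& path = "/sys/kernel/debug/sched/base_slice_ns") {
    std::ifstream in(path);
    long long ns = -1;
    if (!(in >> ns)) {
        return -1;
    }
    return ns;
}
```

      A value of 750000 would correspond to the 750µs setting evaluated above.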

      Ask

      Looking for WT input on whether there is a cheaper cooperative-yield primitive for the transport layer's oversubscription case: something that relinquishes the CPU without incurring a full EEVDF base-slice descheduling.

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Jawwad Asghar
            Votes:
            0
            Watchers:
            7