Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: Concurrency
Storage Engines - Persistence
128.953
SE Persistence backlog
Summary
The product performance team is evaluating regressions when upgrading from Linux kernel 6.1 (CFS scheduler) to kernel 6.12 (EEVDF scheduler) on MongoDB Atlas. Two distinct regressions have been identified:
(a) Write throughput: ycsb_100update −10.7%
(b) Read latency: mixed-workload read p50 +228–300% (ecommerce FindOneProduct p50 +300%, ycsb 50/50 read p50 +228%). This regression is independent of the base-slice setting and requires concurrent writes: read-only ycsb_100read on 6.12 is clean or faster.
Root cause (corrected from original description)
The yield that tracks the regression is in MongoDB's transport layer: ServiceExecutor::yieldIfAppropriate → std::this_thread::yield() (service_executor.cpp), called twice per request from SessionWorkflow::Impl (once after send and once before receive). Under thread-per-connection at 128 connections on 8 vCPUs, the guard runningThreads > cores is always true, so every call yields.
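The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in service_executor.cpp: the `YieldGuard` type and its members are hypothetical names, and only the `runningThreads > cores` comparison and the call to `std::this_thread::yield()` come from the description.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-in for the executor's oversubscription guard.
// Assumption: the real guard compares a running-thread count to the core count.
struct YieldGuard {
    std::atomic<int> runningThreads{0};
    int cores;

    explicit YieldGuard(int coreCount) : cores(coreCount) {}

    // True when more threads are runnable than there are cores.
    bool shouldYield() const {
        return runningThreads.load(std::memory_order_relaxed) > cores;
    }

    // Yields only when oversubscribed; returns whether a yield was issued.
    bool yieldIfAppropriate() {
        if (!shouldYield()) {
            return false;
        }
        std::this_thread::yield();
        return true;
    }
};
```

With 128 running threads and `cores = 8`, `shouldYield()` returns true on every call, which matches the "always true" observation for the benchmark configuration.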
On EEVDF, each sched_yield() costs roughly one full base slice (~2.8ms), whereas on CFS it is a cheap requeue. At 2 yields/request the per-op off-CPU cost rises from 7% to 19.6% of thread time. The throughput-probing admission controller then converges to a smaller write-ticket pool (28→16), which reduces admitted write concurrency (3.35→2.45 active writers) and produces the −10.7% throughput drop in (a).
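One way to sanity-check the per-yield cost on a given kernel is a small micro-benchmark along these lines. It is only a sketch: `meanYieldNanos` is a hypothetical helper, it measures wall-clock time around `std::this_thread::yield()` when every thread does nothing but yield, and it is not expected to reproduce the figures quoted above, which come from a real mixed workload.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: mean wall-clock nanoseconds per yield() call,
// averaged over `threads` threads that each yield `itersPerThread` times.
double meanYieldNanos(int threads, int itersPerThread) {
    std::vector<std::int64_t> totals(threads, 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&totals, t, itersPerThread] {
            auto start = std::chrono::steady_clock::now();
            for (int i = 0; i < itersPerThread; ++i) {
                std::this_thread::yield();
            }
            auto end = std::chrono::steady_clock::now();
            // Each thread writes only its own slot, once, after its loop.
            totals[t] = std::chrono::duration_cast<std::chrono::nanoseconds>(
                            end - start).count();
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    std::int64_t sum = 0;
    for (auto v : totals) {
        sum += v;
    }
    return static_cast<double>(sum) /
           (static_cast<double>(threads) * itersPerThread);
}
```

To approximate the oversubscribed case, `threads` would be set well above the vCPU count (for example 128 on an 8-vCPU host) and the result compared between a 6.1 and a 6.12 kernel.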
Mitigations evaluated so far
- base_slice_ns=750µs host knob: partially recovers the write path (−10.7%→−6.5% vs CFS), net-positive over plain 6.12 suite-wide, but does not address (b) and adds small regressions of its own.
- Patch F (remove the after-send _yieldPointReached() call in session_workflow.cpp, taking yields from 2×/req to 1×/req): closes the ycsb_100update gap in CPU-bound workloads (−10.7%→−0.3%, not significant) but does not fix (b). The before-receive yield, which Patch F keeps, is the lever for (b): removing both yields halves read p50 on 50read50 (1259→626µs) at a cost of −9% throughput and +17–22% write latency. The yields were originally added to help tail-latency performance (see BF-27452 / SERVER-125097).
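The yield-point variants discussed in the Patch F item can be modelled with a toy loop like the one below. `YieldPolicy` and `runRequests` are illustrative names, not MongoDB code; the sketch only counts how many yields each configuration issues per request.

```cpp
#include <functional>

// Hypothetical model of the two per-request yield points.
struct YieldPolicy {
    bool beforeReceive = true;  // kept by Patch F
    bool afterSend = true;      // removed by Patch F
};

// Runs `requests` iterations and returns the number of yields issued.
// `yieldFn` stands in for the real yield call so it can be swapped or counted.
int runRequests(int requests, const YieldPolicy& policy,
                const std::function<void()>& yieldFn) {
    int yields = 0;
    for (int i = 0; i < requests; ++i) {
        if (policy.beforeReceive) {
            yieldFn();
            ++yields;
        }
        // Receive, process, and send would happen here.
        if (policy.afterSend) {
            yieldFn();
            ++yields;
        }
    }
    return yields;
}
```

With the default policy this issues 2 yields per request; setting `afterSend = false` models Patch F (1 per request), and disabling both flags models the "remove both yields" experiment.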
Ask
We are looking for WT input on whether there is a cheaper cooperative-yield primitive for the transport layer's oversubscription case: something that relinquishes the CPU cooperatively without incurring a full EEVDF base-slice descheduling.
- is related to: SERVER-125097 Remove redundant per-request sched_yield in SessionWorkflow (Closed)