Type: Task
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: None
Query Execution
Summary
Run and analyze QS + RL overload experiments when incoming requests exceed QS kill capacity
Description
In dev we see overload cases where QS is actively killing operations, but the successful admission rate remains higher than the QS kill rate, so overload persists. In the current dev config, the RL CPU threshold is 99%, so RL rarely engages and most load shedding is left to QS, which kills operations synchronously and one by one.
This ticket tracks running a focused set of experiments (with both QS and RL enabled) to understand how the two mechanisms behave in this regime.
Scope
- Update dev RL config (experimentally)
  - Use the new RL config Andrew proposed (e.g., 80% CPU threshold + updated rate limiter settings) on a dev cluster where we repro the issue.
  - Ensure RL actually engages in the “more incoming than QS can kill” workload.
- Run combined QS + RL experiments using availability workloads
  - Reuse the same availability workloads currently showing:
    - Overload + QS tuning.
    - Successful admissions > QS kill rate.
  - Run with:
    - Baseline: QS on, RL off (to confirm the “QS saturated” behavior).
    - Treatment: QS on, RL on with the updated config.
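The baseline/treatment matrix above can be sketched as a small run table. This is a minimal illustration only; the keys (e.g., `rl_cpu_threshold_pct`) are hypothetical placeholders, not actual server parameters:

```python
# Hypothetical experiment matrix for the combined QS + RL runs.
# Field names are placeholders for whatever the harness actually uses.
RUNS = [
    {"name": "baseline", "qs_enabled": True, "rl_enabled": False},
    {"name": "treatment", "qs_enabled": True, "rl_enabled": True,
     "rl_cpu_threshold_pct": 80},  # proposed lower CPU threshold
]

def describe(run):
    """Return a one-line summary of a run configuration."""
    parts = [
        run["name"],
        f"QS={'on' if run['qs_enabled'] else 'off'}",
        f"RL={'on' if run['rl_enabled'] else 'off'}",
    ]
    if run.get("rl_cpu_threshold_pct") is not None:
        parts.append(f"RL CPU threshold={run['rl_cpu_threshold_pct']}%")
    return ", ".join(parts)
```

Keeping the matrix explicit makes it easy to confirm each run is labeled correctly before comparing results.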
- Instrumentation / data collection
  For each run, record:
  - Engagement timelines:
    - When RL first enables rate limiting, and when it disengages.
    - When QS first enters tuning, and when it returns to monitoring.
  - Per-mechanism impact:
    - RL: ingressRequestRateLimiter.rejectedAdmissions (or equivalent) over time.
    - QS: mongotune.policy.query-sentinel.killed_operations over time.
  - System context:
    - CPU pressure, dwell time metrics, availability / canary metrics.
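One way to derive the engagement timelines above from the recorded counters is to look at intervals where a mechanism's counter increased. A minimal sketch, assuming metric samples are collected as (timestamp, counter delta) pairs (the sampling format here is an assumption, not the actual collection pipeline):

```python
from datetime import datetime, timedelta

def engagement_window(samples):
    """Given (timestamp, counter_delta) samples for a shedding mechanism
    (e.g., RL rejectedAdmissions or QS killed_operations deltas), return
    (first_engaged, last_engaged), or None if it never engaged.
    A mechanism counts as engaged in an interval when its counter grew."""
    engaged = [t for t, delta in samples if delta > 0]
    if not engaged:
        return None
    return min(engaged), max(engaged)

# Hypothetical per-minute deltas of rejectedAdmissions during one run.
t0 = datetime(2024, 1, 1, 12, 0)
rl_samples = [(t0 + timedelta(minutes=i), d)
              for i, d in enumerate([0, 0, 5, 40, 60, 12, 0])]
window = engagement_window(rl_samples)
```

Running the same function over the QS kill counter gives directly comparable engagement windows for the two mechanisms.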
- Analysis
  - Identify cases where RL alone is sufficient (QS never tunes or tunes minimally).
  - Identify cases where RL + QS together are needed to keep the node stable.
  - For workloads where the incoming rate exceeds the QS kill rate:
    - Show how earlier RL engagement changes overload behavior.
    - Summarize how much of the load shedding each mechanism (RL vs QS) is handling.
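The per-mechanism load-shedding split in the last analysis item can be computed directly from the two counters already being collected. A minimal sketch (the aggregation granularity is an assumption):

```python
def shedding_share(rl_rejected, qs_killed):
    """Fraction of total shed operations handled by each mechanism,
    given run-level totals of RL rejections and QS kills."""
    total = rl_rejected + qs_killed
    if total == 0:
        return {"rl": 0.0, "qs": 0.0}
    return {"rl": rl_rejected / total, "qs": qs_killed / total}
```

Comparing this split between the baseline and treatment runs shows how much shedding shifts from QS kills to RL rejections once RL engages at the lower threshold.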
Deliverables
- Short analysis doc (or section) with:
  - Workload/config details.
  - Plots or tables showing RL vs QS engagement times and rejected vs killed operations.
  - A few representative “RL-only” and “RL+QS” examples.
- Recommendation on whether the 80% CPU + new RL config is appropriate for this overload pattern in dev (and potentially for broader rollout).