Type: Task
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: None
Query Execution
Summary
Run and analyze QS + RL overload experiments when incoming requests exceed QS kill capacity
Description
In dev we see overload cases where QS is actively killing operations, but the successful admission rate remains higher than the QS kill rate, so overload persists. In the current dev config, the RL CPU threshold is 99%, so RL rarely engages and most load shedding is left to QS, which kills operations synchronously and one by one.
This ticket tracks running a focused set of experiments (with both QS and RL enabled) to understand how the two mechanisms behave in this regime.
Scope
- Update dev RL config (experimentally)
  - Use the new RL config Andrew proposed (e.g., 80% CPU threshold + updated rate limiter settings) on a dev cluster where we repro the issue.
  - Ensure RL actually engages in the “more incoming than QS can kill” workload.
- Run combined QS + RL experiments using availability workloads
  - Reuse the same availability workloads currently showing:
    - Overload + QS tuning.
    - Successful admissions > QS kill rate.
  - Run with:
    - Baseline: QS on, RL off (to confirm the “QS saturated” behavior).
    - Treatment: QS on, RL on with the updated config.
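The baseline/treatment matrix above can be sketched as a small run table. This is a minimal illustration only; the keys (e.g., `rl_cpu_threshold_pct`) are hypothetical placeholders, not actual server parameters:

```python
# Hypothetical experiment matrix for the combined QS + RL runs.
# Field names are placeholders for whatever the harness actually uses.
RUNS = [
    {"name": "baseline", "qs_enabled": True, "rl_enabled": False},
    {"name": "treatment", "qs_enabled": True, "rl_enabled": True,
     "rl_cpu_threshold_pct": 80},  # proposed lower CPU threshold
]

def describe(run):
    """Return a one-line summary of a run configuration."""
    parts = [
        run["name"],
        f"QS={'on' if run['qs_enabled'] else 'off'}",
        f"RL={'on' if run['rl_enabled'] else 'off'}",
    ]
    if run.get("rl_cpu_threshold_pct") is not None:
        parts.append(f"RL CPU threshold={run['rl_cpu_threshold_pct']}%")
    return ", ".join(parts)
```

Keeping the matrix explicit makes it easy to confirm each run is labeled correctly before comparing results.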
- Instrumentation / data collection
  For each run, record:
  - Engagement timelines:
    - When RL first enables rate limiting, and when it disengages.
    - When QS first enters tuning, and when it returns to monitoring.
  - Per-mechanism impact:
    - RL: ingressRequestRateLimiter.rejectedAdmissions (or equivalent) over time.
    - QS: mongotune.policy.query-sentinel.killed_operations over time.
  - System context:
    - CPU pressure, dwell time metrics, availability / canary metrics.
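One way to derive the engagement timelines above from the recorded counters is to look at intervals where a mechanism's counter increased. A minimal sketch, assuming metric samples are collected as (timestamp, counter delta) pairs (the sampling format here is an assumption, not the actual collection pipeline):

```python
from datetime import datetime, timedelta

def engagement_window(samples):
    """Given (timestamp, counter_delta) samples for a shedding mechanism
    (e.g., RL rejectedAdmissions or QS killed_operations deltas), return
    (first_engaged, last_engaged), or None if it never engaged.
    A mechanism counts as engaged in an interval when its counter grew."""
    engaged = [t for t, delta in samples if delta > 0]
    if not engaged:
        return None
    return min(engaged), max(engaged)

# Hypothetical per-minute deltas of rejectedAdmissions during one run.
t0 = datetime(2024, 1, 1, 12, 0)
rl_samples = [(t0 + timedelta(minutes=i), d)
              for i, d in enumerate([0, 0, 5, 40, 60, 12, 0])]
window = engagement_window(rl_samples)
```

Running the same function over the QS kill counter gives directly comparable engagement windows for the two mechanisms.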
- Analysis
  - Identify cases where RL alone is sufficient (QS never tunes or tunes minimally).
  - Identify cases where RL + QS together are needed to keep the node stable.
  - For workloads where the incoming rate exceeds the QS kill rate:
    - Show how earlier RL engagement changes overload behavior.
    - Summarize how much of the load shedding each mechanism (RL vs QS) is handling.
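The per-mechanism load-shedding split in the last analysis item can be computed directly from the two counters already being collected. A minimal sketch (the aggregation granularity is an assumption):

```python
def shedding_share(rl_rejected, qs_killed):
    """Fraction of total shed operations handled by each mechanism,
    given run-level totals of RL rejections and QS kills."""
    total = rl_rejected + qs_killed
    if total == 0:
        return {"rl": 0.0, "qs": 0.0}
    return {"rl": rl_rejected / total, "qs": qs_killed / total}
```

Comparing this split between the baseline and treatment runs shows how much shedding shifts from QS kills to RL rejections once RL engages at the lower threshold.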
Deliverables
- Short analysis doc (or section) with:
  - Workload/config details.
  - Plots or tables showing RL vs QS engagement times and rejected vs killed operations.
  - A few representative “RL-only” and “RL+QS” examples.
- Recommendation on whether the 80% CPU + new RL config is appropriate for this overload pattern in dev (and potentially for broader rollout).