Type: Task
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: None
Query Execution
QE 2026-03-30, QE 2026-04-13
Summary
Document the “more incoming than QS can kill” pattern and add a qs_kill_to_admission_ratio metric + alert
Description
We’ve identified a recurring pattern where, during overload, the rate of successful admissions exceeds the rate at which QS kills operations. In this situation QS is saturated, and RL should be the first knob to adjust before we consider making QS more aggressive. We also want a simple metric and alert to flag this condition.
This ticket covers documentation + observability, not experimental runs.
Scope
- Field Guide update
  - Add a new short section to the Query Sentinel Field Guide that:
    - Describes the “more incoming than QS can kill” situation:
      - Overload + QS in tuning.
      - killed_operations increasing.
      - Successful admissions still high; overload persists.
    - Explains the intended roles:
      - RL: primary protection for request storms at admission.
      - QS: backstop that kills expensive in‑flight queries.
    - Recommends that engineers check RL first:
      - Verify RL is enabled and configured.
      - Consider tuning RL thresholds (e.g., CPU, dwell time, rate) before making QS more aggressive.
  - Include an example metric combo or Grafana panel (QS kills vs successful admissions) for recognizing this pattern.
- New metric + alert
  - Define a new derived metric:
    - qs_kill_to_admission_ratio = rate(QS killed operations) / rate(successful admissions)
    - Conditioned on overload / QS in tuning where possible.
  - Work with the relevant observability pipeline (T2 / Grafana / Atlas metrics) to:
    - Surface this ratio on a relevant dashboard (e.g., QS / IWM / RL).
    - Configure an alert when:
      - The node is overloaded (based on existing overload criteria), and
      - qs_kill_to_admission_ratio stays low (e.g., < 0.5 for ~1 minute).
  - Document how to interpret the alert:
    - Low ratio under overload = QS saturated; RL likely under‑tuned or inactive.
    - Suggest checking RL config and dev/production policy.
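Illustrative sketch (not the production implementation): one possible way to evaluate qs_kill_to_admission_ratio and the alert condition from raw counter samples. The `Sample` type, its field names, and both function names are hypothetical; the 0.5 threshold and 60-second hold come from the examples in the scope above, and the overload flag is assumed to be supplied by the existing overload criteria.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Sample:
    """One scrape of a node. All field names are hypothetical."""
    ts: float          # scrape time in seconds
    killed_ops: int    # cumulative count of QS killed operations
    admissions: int    # cumulative count of successful admissions
    overloaded: bool   # outcome of the existing overload criteria


def kill_to_admission_ratio(prev: Sample, cur: Sample) -> Optional[float]:
    """rate(kills) / rate(admissions) over one interval, or None if undefined."""
    d_kill = cur.killed_ops - prev.killed_ops
    d_adm = cur.admissions - prev.admissions
    # Undefined on counter resets, zero admissions, or non-increasing time.
    if cur.ts <= prev.ts or d_kill < 0 or d_adm <= 0:
        return None
    # Both rates share the same interval length, so it cancels out.
    return d_kill / d_adm


def should_alert(samples: Sequence[Sample],
                 threshold: float = 0.5,
                 hold_seconds: float = 60.0) -> bool:
    """True if the node was overloaded with a low ratio for the last hold_seconds."""
    if len(samples) < 2:
        return False
    end = samples[-1].ts
    # Walk backwards; every interval in the window must satisfy the condition.
    for i in range(len(samples) - 1, 0, -1):
        prev, cur = samples[i - 1], samples[i]
        ratio = kill_to_admission_ratio(prev, cur)
        if not cur.overloaded or ratio is None or ratio >= threshold:
            return False
        if end - prev.ts >= hold_seconds:
            return True
    return False  # not enough history to cover the hold window
```

For example, with 15-second scrapes where each interval adds 10 kills and 100 admissions on an overloaded node, the ratio is 0.1 and `should_alert` fires once at least 60 seconds of history satisfy the condition. In this sketch an undefined ratio (for instance, no admissions in the interval) suppresses the alert; whether that is the desired behaviour is a policy choice for the real pipeline.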