Type: Task
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Query Execution
Sprint: QE 2026-03-30, QE 2026-04-13
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Summary
Document the “more incoming than QS can kill” pattern and add a qs_kill_to_admission_ratio metric + alert

Description
We’ve identified a recurring pattern during overload: the rate of successful admissions exceeds the rate at which Query Sentinel (QS) kills operations. In this situation QS is saturated, and RL should be the first knob to adjust before we consider making QS more aggressive. We also want a simple metric and alert to flag this condition.

This ticket covers documentation + observability, not experimental runs.

Scope

Field Guide update

- Add a new short section to the Query Sentinel Field Guide that:
  - Describes the “more incoming than QS can kill” situation:
    - Overload + QS in tuning.
    - killed_operations increasing.
    - Successful admissions still high; overload persists.
  - Explains the intended roles:
    - RL: primary protection for request storms at admission.
    - QS: backstop that kills expensive in‑flight queries.
  - Recommends that engineers check RL first:
    - Verify RL is enabled and configured.
    - Consider tuning RL thresholds (e.g., CPU, dwell time, rate) before making QS more aggressive.
- Include an example metric combo or Grafana panel (QS kills vs successful admissions) for recognizing this pattern.
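
As a starting point for the metric combo, here is a minimal Python sketch of the pattern check. It assumes the two signals are available as monotonically increasing counters scraped at known timestamps. The names `CounterSample`, `per_second_rates`, and `matches_pattern` are hypothetical, and the `overloaded` and `qs_in_tuning` flags are assumed to come from existing overload and QS state signals that this ticket does not define.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One scrape of two monotonically increasing counters (hypothetical names)."""
    timestamp_s: float
    killed_operations: int
    successful_admissions: int


def per_second_rates(prev: CounterSample, curr: CounterSample) -> tuple[float, float]:
    """Return (kill_rate, admission_rate) in operations per second."""
    dt = curr.timestamp_s - prev.timestamp_s
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    kill_rate = (curr.killed_operations - prev.killed_operations) / dt
    admission_rate = (curr.successful_admissions - prev.successful_admissions) / dt
    return kill_rate, admission_rate


def matches_pattern(prev: CounterSample, curr: CounterSample,
                    overloaded: bool, qs_in_tuning: bool) -> bool:
    """True when QS is killing, but admissions still outpace kills under overload."""
    kill_rate, admission_rate = per_second_rates(prev, curr)
    return overloaded and qs_in_tuning and kill_rate > 0 and admission_rate > kill_rate


if __name__ == "__main__":
    prev = CounterSample(timestamp_s=0.0, killed_operations=100, successful_admissions=1000)
    curr = CounterSample(timestamp_s=10.0, killed_operations=150, successful_admissions=3000)
    print(per_second_rates(prev, curr))
    print(matches_pattern(prev, curr, overloaded=True, qs_in_tuning=True))
```

On a dashboard, the same two rates could be plotted as two series on one panel; the condition holds where the admissions series stays above the kills series while the node is overloaded.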

New metric + alert

- Define a new derived metric:
  - qs_kill_to_admission_ratio = rate(QS killed operations) / rate(successful admissions)
  - Conditioned on overload / QS in tuning where possible.
- Work with the relevant observability pipeline (T2 / Grafana / Atlas metrics) to:
  - Surface this ratio on a relevant dashboard (e.g., QS / IWM / RL).
  - Configure an alert when:
    - The node is overloaded (based on existing overload criteria), and
    - qs_kill_to_admission_ratio stays low (e.g., < 0.5 for ~1 minute).
- Document how to interpret the alert:
  - Low ratio under overload = QS saturated; RL likely under‑tuned or inactive.
  - Suggest checking RL config and dev/production policy.
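
The ratio and alert condition above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's actual alert definition: the function names are hypothetical, the 0.5 threshold and 60-second hold come from the example values in this ticket, and returning `None` when there are no admissions is one possible choice for the undefined case.

```python
from typing import Optional, Sequence, Tuple


def qs_kill_to_admission_ratio(kill_rate: float, admission_rate: float) -> Optional[float]:
    """rate(QS killed operations) / rate(successful admissions).

    Returns None when there are no admissions, since the ratio is undefined
    (assumption: treat that case as "no signal" rather than as zero).
    """
    if admission_rate <= 0:
        return None
    return kill_rate / admission_rate


# Each sample is (timestamp_s, overloaded, ratio), in increasing time order.
Sample = Tuple[float, bool, Optional[float]]


def should_alert(samples: Sequence[Sample],
                 threshold: float = 0.5,
                 hold_s: float = 60.0) -> bool:
    """True if the node stays overloaded with ratio < threshold for at least hold_s seconds."""
    low_since: Optional[float] = None
    for timestamp_s, overloaded, ratio in samples:
        if overloaded and ratio is not None and ratio < threshold:
            if low_since is None:
                low_since = timestamp_s  # start of the current low-ratio run
            if timestamp_s - low_since >= hold_s:
                return True
        else:
            low_since = None  # any healthy or undefined sample resets the run
    return False


if __name__ == "__main__":
    ratio = qs_kill_to_admission_ratio(kill_rate=5.0, admission_rate=200.0)
    sustained = [(t, True, ratio) for t in (0.0, 15.0, 30.0, 45.0, 60.0)]
    print(ratio, should_alert(sustained))
```

Requiring the low ratio to hold for the whole window, and resetting on any sample that is not overloaded, keeps short spikes from firing the alert. Whether an undefined ratio should reset the window is a design choice left open here.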

Assignee: Zixuan Zhuang
Reporter: Zixuan Zhuang
Participants: Zixuan Zhuang
Votes: 0
Watchers: 1

Created: Mar 13 2026 09:15:41 PM UTC
Updated: Mar 31 2026 08:24:44 PM UTC
Resolved: Mar 31 2026 08:24:44 PM UTC
