QS Field Guide update + new qs_kill_to_admission_ratio metric/alert

    • Type: Task
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Query Execution
    • QE 2026-03-30, QE 2026-04-13

      Summary
      Document the “more incoming than QS can kill” pattern and add a qs_kill_to_admission_ratio metric + alert

      Description
      We’ve identified a recurring pattern in which, during overload, the rate of successful admissions exceeds the rate at which Query Sentinel (QS) kills operations. In this situation QS is saturated, and RL should be the first knob to tune before we consider making QS more aggressive. We also want a simple metric and alert to flag this condition.

      This ticket covers documentation + observability, not experimental runs.

      Scope

      1. Field Guide update
        • Add a new short section to the Query Sentinel Field Guide that:
          • Describes the “more incoming than QS can kill” situation:
            • Overload + QS in tuning.
            • killed_operations increasing.
            • Successful admissions still high; overload persists.
          • Explains the intended roles:
            • RL: primary protection for request storms at admission.
            • QS: backstop that kills expensive in‑flight queries.
          • Recommends that engineers check RL first:
            • Verify RL is enabled and configured.
            • Consider tuning RL thresholds (e.g., CPU, dwell time, rate) before making QS more aggressive.
        • Include an example metric combo or Grafana panel (QS kills vs successful admissions) for recognizing this pattern.
      2. New metric + alert
        • Define a new derived metric:
          • qs_kill_to_admission_ratio = rate(QS killed operations) / rate(successful admissions)
          • Conditioned on overload / QS in tuning where possible.
        • Work with the relevant observability pipeline (T2 / Grafana / Atlas metrics) to:
          • Surface this ratio on a relevant dashboard (e.g., QS / IWM / RL).
          • Configure an alert when:
            • The node is overloaded (based on existing overload criteria), and
            • qs_kill_to_admission_ratio stays low (e.g., < 0.5 for ~1 minute).
        • Document how to interpret the alert:
          • Low ratio under overload = QS saturated; RL likely under‑tuned or inactive.
          • Suggest checking RL config and dev/production policy.
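
      The metric definition and alert condition above can be sketched as follows. This is only an illustration under stated assumptions, not the actual pipeline implementation: the `Sample` record, its field names, and both function names are hypothetical, and the 0.5 threshold and 60-second window are the example values from this ticket. It assumes both inputs are cumulative counters and that the existing overload criteria are already evaluated per sample.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Sample:
    """One scrape of the inputs (hypothetical field names)."""
    ts: float          # scrape time in seconds
    killed_ops: int    # cumulative QS killed operations
    admissions: int    # cumulative successful admissions
    overloaded: bool   # outcome of the existing overload criteria


def kill_to_admission_ratio(prev: Sample, cur: Sample) -> Optional[float]:
    """rate(killed operations) / rate(successful admissions) over one interval.

    Returns None when the ratio is undefined: non-increasing time, a counter
    reset (negative delta), or no admissions in the interval.
    """
    dt = cur.ts - prev.ts
    d_kill = cur.killed_ops - prev.killed_ops
    d_admit = cur.admissions - prev.admissions
    if dt <= 0 or d_kill < 0 or d_admit <= 0:
        return None
    return (d_kill / dt) / (d_admit / dt)


def should_alert(samples: Sequence[Sample],
                 threshold: float = 0.5,
                 window_s: float = 60.0) -> bool:
    """True if the node was overloaded and the ratio stayed below
    `threshold` for every interval covering the most recent `window_s`."""
    covered = 0.0
    # Walk intervals from newest to oldest until the window is covered.
    for i in range(len(samples) - 1, 0, -1):
        prev, cur = samples[i - 1], samples[i]
        ratio = kill_to_admission_ratio(prev, cur)
        if not (prev.overloaded and cur.overloaded):
            return False
        if ratio is None or ratio >= threshold:
            return False
        covered += cur.ts - prev.ts
        if covered >= window_s:
            return True
    return False  # not enough history to cover the window
```

      Treating an undefined ratio as "do not alert" is a deliberate choice in this sketch: an interval with no successful admissions is not the "more incoming than QS can kill" pattern, so it should not page anyone.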

            Assignee: Zixuan Zhuang
            Reporter: Zixuan Zhuang
