Add serverStatus/FTDC metrics for slow in-progress operations

    • Type: Task
    • Resolution: Unresolved
    • Priority: Critical - P2
    • Affects Version/s: None
    • Component/s: None
    • Query Integration

      Summary

      The slow in-progress query logging feature (SERVER-17942, SERVER-112619) emits log lines when an in-flight operation exceeds slowOpInProgressThreshold, but does not expose any corresponding metrics through serverStatus or FTDC. This means there is no time-series signal in FTDC diagnostics data to identify when operations are stalling — only after-the-fact log grep. Adding lightweight counters/gauges would significantly improve outage triage and availability analysis.

      Proposed Metrics

      1. query.totalSlowInProgressQueryLogs (Counter64)

      Cumulative count of times logLongRunningOperationIfNeeded() fires. Mirrors the existing query.totalSlowQueryLogs pattern. The irate() of this counter in FTDC shows onset and intensity of operation stalls.

      2. query.currentSlowInProgressOps (Gauge)

      Point-in-time count of operations currently past the slowOpInProgressThreshold. Incremented when logLongRunningOperationIfNeeded() fires, decremented in completeAndLogOperation() / ~CurOp(). A per-CurOp flag (_countedAsLongRunning) ensures balanced inc/dec. Value is 0 in steady state and spikes during stalls.

      3. (Stretch) query.maxSlowInProgressDurationMs (Gauge)

      Max elapsed time in milliseconds among currently-running operations that have crossed the threshold. Updated via CAS in logLongRunningOperationIfNeeded(), reset to 0 when currentSlowInProgressOps drops to 0. Distinguishes "many ops slightly over threshold" from "many ops stuck for minutes."
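
      The CAS update described for the stretch metric could look roughly like the sketch below. It uses only the C++ standard library; MaxDurationGauge, observe(), and reset() are hypothetical names for illustration and are not existing server code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the proposed query.maxSlowInProgressDurationMs gauge.
class MaxDurationGauge {
public:
    // Raise the stored maximum to elapsedMs if it is larger; never lowers it.
    void observe(std::int64_t elapsedMs) {
        assert(elapsedMs >= 0);
        std::int64_t current = _maxMs.load(std::memory_order_relaxed);
        // On failure, compare_exchange_weak reloads `current`, so the loop
        // re-checks the condition against the latest stored value.
        while (elapsedMs > current &&
               !_maxMs.compare_exchange_weak(current, elapsedMs,
                                             std::memory_order_relaxed)) {
        }
    }

    // Intended to be called when the count of slow in-progress ops drops to 0.
    void reset() { _maxMs.store(0, std::memory_order_relaxed); }

    std::int64_t get() const { return _maxMs.load(std::memory_order_relaxed); }

private:
    std::atomic<std::int64_t> _maxMs{0};
};
```

      In this sketch a reset() that races with a concurrent observe() could discard a sample. How a real implementation should order the reset against late updates is not specified in this ticket.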

      Implementation Notes

      All three metrics are stable scalars — no schema variability, safe for FTDC compression.

      The counter (#1) is trivial: a single MetricBuilder<Counter64> increment next to the existing LOGV2_OPTIONS(1794200, ...) call in CurOp::logLongRunningOperationIfNeeded().
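
      As a rough illustration of how the counter would be consumed, the sketch below pairs a monotonic counter with the per-second rate between two consecutive samples. It is standalone C++ and does not use the server's MetricBuilder or Counter64 types; SlowInProgressLogCounter and ratePerSecond() are hypothetical names, and the sampling interval is an assumed input.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the proposed query.totalSlowInProgressQueryLogs counter.
class SlowInProgressLogCounter {
public:
    // Would be called next to the slow in-progress log statement.
    void increment() { _count.fetch_add(1, std::memory_order_relaxed); }

    std::int64_t get() const { return _count.load(std::memory_order_relaxed); }

private:
    std::atomic<std::int64_t> _count{0};
};

// Per-second rate between two consecutive samples of a cumulative counter.
// Assumes the counter did not reset between the samples.
double ratePerSecond(std::int64_t prev, std::int64_t curr, double intervalSeconds) {
    assert(intervalSeconds > 0.0);
    assert(curr >= prev);
    return static_cast<double>(curr - prev) / intervalSeconds;
}
```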

      The gauge (#2) requires a small lifecycle change in CurOp: a _countedAsLongRunning bool to pair the increment with a decrement on operation completion or destruction.
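
      A minimal sketch of the paired increment and decrement follows. TrackedOp is a hypothetical stand-in for CurOp, and markLongRunning() and complete() stand in for the points where logLongRunningOperationIfNeeded() fires and where the operation completes. The sketch assumes the flag is only touched by the thread that owns the operation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the proposed query.currentSlowInProgressOps gauge.
using Gauge = std::atomic<std::int64_t>;

// Hypothetical stand-in for CurOp; only the gauge bookkeeping is modeled.
class TrackedOp {
public:
    explicit TrackedOp(Gauge& gauge) : _gauge(gauge) {}
    TrackedOp(const TrackedOp&) = delete;
    TrackedOp& operator=(const TrackedOp&) = delete;

    // May be called every time the op is seen past the threshold;
    // the flag makes sure the gauge is incremented only once per op.
    void markLongRunning() {
        if (!_countedAsLongRunning) {
            _countedAsLongRunning = true;
            _gauge.fetch_add(1, std::memory_order_relaxed);
        }
    }

    // Called on completion; a second call is a no-op because the flag is cleared.
    void complete() {
        if (_countedAsLongRunning) {
            _countedAsLongRunning = false;
            std::int64_t prev = _gauge.fetch_sub(1, std::memory_order_relaxed);
            assert(prev > 0);  // a decrement must always match an earlier increment
            (void)prev;
        }
    }

    // Destruction without an explicit complete() still releases the count.
    ~TrackedOp() { complete(); }

private:
    Gauge& _gauge;
    bool _countedAsLongRunning = false;  // assumed owner-thread access only
};
```

      Routing both the completion path and the destructor through the same flag check is what keeps the gauge from drifting when an operation is torn down without completing normally.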

      Motivation

      During outage characterization, the only current signal for "operations are stuck" in FTDC is indirect: rising globalLock.activeClients, ticket exhaustion, checkpoint duration, etc. A direct "N ops are currently stalled" gauge provides an unambiguous correlation target.

      A clean, dedicated metric in the FTDC time series would also support availability testing and data-science-driven outage clustering.

            Assignee:
            Unassigned
            Reporter:
            Blake Oler