Type: Task
Resolution: Unresolved
Priority: Critical - P2
Affects Version/s: None
Component/s: None
Query Integration
Summary
The slow in-progress query logging feature (SERVER-17942, SERVER-112619) emits log lines when an in-flight operation exceeds slowOpInProgressThreshold, but it does not expose any corresponding metrics through serverStatus or FTDC. As a result, FTDC diagnostic data contains no time-series signal for identifying when operations are stalling; the only option is grepping logs after the fact. Adding lightweight counters and gauges would significantly improve outage triage and availability analysis.
Proposed Metrics
1. query.totalSlowInProgressQueryLogs (Counter64)
Cumulative count of times logLongRunningOperationIfNeeded() fires. Mirrors the existing query.totalSlowQueryLogs pattern. The irate() of this counter in FTDC shows onset and intensity of operation stalls.
2. query.currentSlowInProgressOps (Gauge)
Point-in-time count of operations currently past the slowOpInProgressThreshold. Incremented when logLongRunningOperationIfNeeded() first fires for an operation, decremented in completeAndLogOperation() / ~CurOp(). A per-CurOp flag (_countedAsLongRunning) ensures each operation is incremented and decremented at most once, so the inc/dec stay balanced. Value is 0 in steady state and spikes during stalls.
3. (Stretch) query.maxSlowInProgressDurationMs (Gauge)
Max elapsed time in milliseconds among currently-running operations that have crossed the threshold. Updated via CAS in logLongRunningOperationIfNeeded(), reset to 0 when currentSlowInProgressOps drops to 0. Distinguishes "many ops slightly over threshold" from "many ops stuck for minutes."
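As an illustration of the CAS update and reset described for metric #3, here is a minimal sketch using plain std::atomic rather than the server's metric types. The function names observeSlowOpDuration() and onSlowOpFinished() are hypothetical, and the globals are stand-ins for the proposed gauges, not existing server symbols.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-ins for the proposed gauges; not MongoDB's metric types.
std::atomic<std::int64_t> currentSlowInProgressOps{0};
std::atomic<std::int64_t> maxSlowInProgressDurationMs{0};

// Raise the max gauge to `elapsedMs` if it exceeds the stored value.
void observeSlowOpDuration(std::int64_t elapsedMs) {
    std::int64_t seen = maxSlowInProgressDurationMs.load(std::memory_order_relaxed);
    // compare_exchange_weak reloads `seen` on failure, so the loop re-checks
    // against whatever another thread stored in the meantime.
    while (elapsedMs > seen &&
           !maxSlowInProgressDurationMs.compare_exchange_weak(
               seen, elapsedMs, std::memory_order_relaxed)) {
    }
}

// Called when a counted operation finishes; resets the max once the
// last slow operation is gone.
void onSlowOpFinished() {
    // fetch_sub returns the previous value, so 1 means the gauge is now 0.
    if (currentSlowInProgressOps.fetch_sub(1, std::memory_order_relaxed) == 1) {
        maxSlowInProgressDurationMs.store(0, std::memory_order_relaxed);
    }
}
```

In this sketch the max only falls back when the gauge reaches 0, so it can keep reporting a finished operation's duration while other slow operations are still running. The decrement and the reset are also two separate atomic steps, so a concurrent update could be overwritten by the reset; whether that is acceptable for a diagnostic gauge is a design choice.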
Implementation Notes
All three metrics are stable scalars: their schema does not vary, so they are safe for FTDC compression.
The counter (#1) is trivial: a single MetricBuilder<Counter64> increment next to the existing LOGV2_OPTIONS(1794200, ...) call in CurOp::logLongRunningOperationIfNeeded().
The gauge (#2) requires a small lifecycle change in CurOp: a _countedAsLongRunning bool to pair the increment with a decrement on operation completion or destruction.
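The counter and gauge lifecycle from the notes above could look roughly like the following sketch. TrackedOp is a hypothetical, heavily simplified stand-in for CurOp, the metrics are plain std::atomic values rather than the server's metric builders, and the flag is assumed to be touched by a single thread only.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-ins for the proposed metrics; not MongoDB's metric types.
std::atomic<std::int64_t> totalSlowInProgressQueryLogs{0};
std::atomic<std::int64_t> currentSlowInProgressOps{0};

// Simplified model of an operation; the real CurOp has far more state.
class TrackedOp {
public:
    // Models the point where the slow in-progress log line is emitted.
    void onSlowInProgressLog() {
        // Counter: bumped on every emitted log line.
        totalSlowInProgressQueryLogs.fetch_add(1, std::memory_order_relaxed);
        // Gauge: bumped at most once per operation.
        if (!_countedAsLongRunning) {
            _countedAsLongRunning = true;
            currentSlowInProgressOps.fetch_add(1, std::memory_order_relaxed);
        }
    }

    // Models operation completion; safe to call more than once.
    void complete() {
        if (_countedAsLongRunning) {
            _countedAsLongRunning = false;
            currentSlowInProgressOps.fetch_sub(1, std::memory_order_relaxed);
        }
    }

    // Destruction covers operations that never reach complete().
    ~TrackedOp() { complete(); }

private:
    bool _countedAsLongRunning = false;  // assumed single-threaded access
};
```

Clearing the flag on the first decrement keeps the gauge balanced when both the completion path and the destructor run for the same operation.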
Motivation
During outage characterization, the only current signal for "operations are stuck" in FTDC is indirect: rising globalLock.activeClients, ticket exhaustion, checkpoint duration, etc. A direct "N ops are currently stalled" gauge provides an unambiguous correlation target.
A clean, dedicated metric in the FTDC time series also supports availability testing and data-science-driven outage clustering.
- is related to
  - SERVER-112619 Enable slow in progress query logs by default (Closed)
  - SERVER-17942 Long running operations should be logged periodically (Closed)