Type: Task
Resolution: Unresolved
Priority: Critical - P2
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- qi-observability
- quick-tech-debt

Assigned Teams: Query Integration
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Summary

The slow in-progress query logging feature (SERVER-17942, SERVER-112619) emits a log line when an in-flight operation exceeds slowOpInProgressThreshold, but it exposes no corresponding metrics through serverStatus or FTDC. As a result, FTDC diagnostic data carries no time-series signal showing when operations are stalling; the only option is to grep the logs after the fact. Adding lightweight counters and gauges would give outage triage and availability analysis a direct signal to work from.

Proposed Metrics

1. query.totalSlowInProgressQueryLogs (Counter64)

Cumulative count of times logLongRunningOperationIfNeeded() fires. Mirrors the existing query.totalSlowQueryLogs pattern. The irate() of this counter in FTDC shows onset and intensity of operation stalls.
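As an illustration of how this counter would be read during triage, the rate between two consecutive FTDC samples can be computed roughly as follows. This is a standalone sketch, not server code: CounterSample, RateResult, and instantRate are hypothetical names, and treating a decreasing value as a counter reset (for example after a process restart) is an assumption.

```cpp
#include <cstdint>

// One sample of a cumulative counter (hypothetical shape, for illustration).
struct CounterSample {
    std::int64_t timestampMs;  // sample time in milliseconds
    std::int64_t value;        // cumulative counter value at that time
};

// Result of a rate computation; perSecond is meaningful only when valid is true.
struct RateResult {
    bool valid;
    double perSecond;
};

// Per-second rate between two consecutive samples.
// Returns an invalid result if no time elapsed or the counter went backwards.
RateResult instantRate(const CounterSample& prev, const CounterSample& cur) {
    const std::int64_t dtMs = cur.timestampMs - prev.timestampMs;
    const std::int64_t dv = cur.value - prev.value;
    if (dtMs <= 0 || dv < 0) {
        return RateResult{false, 0.0};
    }
    return RateResult{true, static_cast<double>(dv) * 1000.0 / static_cast<double>(dtMs)};
}
```

A rate that stays at zero and then jumps marks the onset of a stall; the size of the jump reflects how many slow in-progress log lines are being emitted per second.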

2. query.currentSlowInProgressOps (Gauge)

Point-in-time count of operations currently past the slowOpInProgressThreshold. Incremented the first time logLongRunningOperationIfNeeded() fires for an operation, and decremented in completeAndLogOperation() / ~CurOp(). A per-CurOp flag (_countedAsLongRunning) ensures each operation is counted at most once and that every increment is paired with exactly one decrement. Value is 0 in steady state and spikes during stalls.
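The balanced inc/dec pattern could look roughly like the sketch below. It is a standalone illustration, not the CurOp implementation: OpTracker, noteLongRunning(), and markCompleted() are hypothetical stand-ins for CurOp, logLongRunningOperationIfNeeded(), and completeAndLogOperation(), and the plain bool flag assumes it is only touched by the thread that owns the operation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Stand-in for the process-wide gauge (the real metric type is not shown here).
std::atomic<std::int64_t> currentSlowInProgressOps{0};

// Hypothetical stand-in for CurOp, reduced to the gauge bookkeeping.
class OpTracker {
public:
    OpTracker() = default;
    OpTracker(const OpTracker&) = delete;
    OpTracker& operator=(const OpTracker&) = delete;

    // Destruction decrements only if completion has not already done so.
    ~OpTracker() { markCompleted(); }

    // Called every time the slow in-progress log fires for this operation.
    // Only the first call increments the gauge.
    void noteLongRunning() {
        if (!_countedAsLongRunning) {
            _countedAsLongRunning = true;
            currentSlowInProgressOps.fetch_add(1, std::memory_order_relaxed);
        }
    }

    // Called on operation completion. Safe to call more than once.
    void markCompleted() {
        if (_countedAsLongRunning) {
            _countedAsLongRunning = false;
            const std::int64_t before =
                currentSlowInProgressOps.fetch_sub(1, std::memory_order_relaxed);
            assert(before > 0);  // a decrement must always follow an increment
            (void)before;        // silence unused warning when NDEBUG is defined
        }
    }

private:
    bool _countedAsLongRunning = false;
};
```

Because the flag is cleared on the first decrement, an operation that completes normally and is then destroyed still decrements the gauge only once.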

3. (Stretch) query.maxSlowInProgressDurationMs (Gauge)

Max elapsed time in milliseconds among currently-running operations that have crossed the threshold. Updated via CAS in logLongRunningOperationIfNeeded(), reset to 0 when currentSlowInProgressOps drops to 0. Distinguishes "many ops slightly over threshold" from "many ops stuck for minutes."
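The CAS update mentioned above is the usual lock-free "atomic max" loop. A minimal sketch follows, assuming the gauge is backed by a std::atomic<std::int64_t>; raiseMax is a hypothetical helper name, and the reset-to-0 step is left out because it would need to be coordinated with the currentSlowInProgressOps gauge.

```cpp
#include <atomic>
#include <cstdint>

// Raise `gauge` to `candidateMs` if the candidate is larger than the stored value.
// Never lowers the gauge. Safe to call concurrently from multiple threads.
void raiseMax(std::atomic<std::int64_t>& gauge, std::int64_t candidateMs) {
    std::int64_t observed = gauge.load(std::memory_order_relaxed);
    while (candidateMs > observed &&
           !gauge.compare_exchange_weak(observed, candidateMs,
                                        std::memory_order_relaxed)) {
        // On failure, compare_exchange_weak reloads `observed` with the current
        // value, so the loop re-checks whether the candidate is still larger.
    }
}
```

The loop exits either when the candidate has been stored or when another thread has already published an equal or larger value, so the gauge only moves upward between resets.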

Implementation Notes

All three metrics are stable scalars — no schema variability, safe for FTDC compression.

The counter (#1) is trivial: a single MetricBuilder<Counter64> increment next to the existing LOGV2_OPTIONS(1794200, ...) call in CurOp::logLongRunningOperationIfNeeded().

The gauge (#2) requires a small lifecycle change in CurOp: a _countedAsLongRunning bool to pair the increment with a decrement on operation completion or destruction.

Motivation

During outage characterization, the only current signal for "operations are stuck" in FTDC is indirect: rising globalLock.activeClients, ticket exhaustion, checkpoint duration, etc. A direct "N ops are currently stalled" gauge provides an unambiguous correlation target.

A clean, dedicated metric in the FTDC time series also supports availability testing and data-science-driven outage clustering.

is related to

SERVER-112619 Enable slow in progress query logs by default (Closed)
SERVER-17942 Long running operations should be logged periodically (Closed)

Assignee: Unassigned
Reporter: Blake Oler
Participants: Blake Oler
Votes: 0
Watchers: 7

Created: Apr 03 2026 05:55:20 PM UTC
Updated: Apr 16 2026 06:58:09 PM UTC
