Type: Investigation
Resolution: Won't Do
Priority: Major - P3
Fix Version/s: No version
Affects Version/s: None
Component/s: None
Labels: None
Documentation Changes: Not Needed
Assigned Teams: Developer Tools

Original Downstream Change Summary

The change added a number of new serverStatus metrics and a new section under "operation.interrupt". It did not change or remove any existing serverStatus metrics.

Description of Linked Ticket

Operations are expected to check for interrupt periodically. This task was to track when operations have a long duration between two interrupt checks, and to provide metrics about such operations in server status.

Summary of diagnostics added

Under this ticket, a number of diagnostics were added:

Every operation now tracks how many interrupt checks it does.
- At the operation level, numInterruptChecks is reported in slow query logs and $currentOp (SERVER-104009)
- At the query shape level, we intend to report interrupt checks per second and possibly other information as well (SERVER-107647)
- At the process level, we track the total number of interrupt checks across all operations. This can be used for very coarse-grained analysis.
A small fraction (< 1%) of operations will track the maximum time between interrupt checks and the accumulated "overdue" time.
- We attempted to do this for all operations. It incurred a small but measurable performance penalty, so we only do it for a small sample.
- The sampling ratio is controlled by the setParameter overdueInterruptCheckSamplingRate. Code here.
- At the operation level, sampled operations will report:
  - The number of overdue interrupt checks (in addition to the total number of interrupt checks)
  - The maximum time between two interrupt checks
  - The accumulated overdue time between interrupt checks.
- At the process level, we report:
  - The number of sampled operations
  - The number of sampled operations which had at least one overdue interrupt check
  - The total number of interrupt checks by sampled operations
  - The number of overdue interrupt checks by sampled operations
  - The accumulated "time overdue" by sampled operations
  - The maximum time between any two interrupt checks, across all sampled operations
  - From these values we can also derive things like:
    - Lower bound for average time between interrupt checks
      - Lower bound because we'd assume that non-overdue interrupt checks come exactly on time
    - Average time between overdue interrupt checks
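
The derived values above can be sketched as simple ratios over the process-level counters. This is only an illustration: the field names below are hypothetical (the ticket does not list the actual serverStatus field names), and the lower bound uses one reading of the reasoning above, in which every non-overdue check is treated as contributing no time at all.

```python
from dataclasses import dataclass


@dataclass
class SampledInterruptStats:
    """Process-level counters for sampled operations.

    All field names are hypothetical stand-ins, not real serverStatus keys.
    """
    sampled_ops: int               # number of sampled operations
    sampled_ops_with_overdue: int  # sampled operations with >= 1 overdue check
    total_checks: int              # interrupt checks by sampled operations
    overdue_checks: int            # overdue interrupt checks by sampled operations
    overdue_millis: float          # accumulated "time overdue"
    max_gap_millis: float          # maximum time between any two checks


def derive(stats: SampledInterruptStats) -> dict:
    """Compute derived ratios, skipping any whose denominator is zero."""
    out = {}
    if stats.sampled_ops:
        out["fraction_ops_with_overdue"] = (
            stats.sampled_ops_with_overdue / stats.sampled_ops
        )
    if stats.overdue_checks:
        # Average overdue time attributed to each overdue check.
        out["avg_overdue_millis_per_overdue_check"] = (
            stats.overdue_millis / stats.overdue_checks
        )
    if stats.total_checks:
        # Lower bound on the average gap between checks: only the overdue
        # time is counted, so the true average can only be larger.
        out["avg_gap_lower_bound_millis"] = (
            stats.overdue_millis / stats.total_checks
        )
    return out


example = SampledInterruptStats(
    sampled_ops=100,
    sampled_ops_with_overdue=4,
    total_checks=10000,
    overdue_checks=5,
    overdue_millis=250.0,
    max_gap_millis=120.0,
)
derived = derive(example)
```

The example numbers are made up for illustration and do not come from any real deployment.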

Other considerations

Interruptible waits
- Operations are able to do a wait on a condition variable which completes if the condition becomes true OR if the operation is killed.
  - We do not count time spent in an interruptible wait as time between interrupt checks. We should not get false positives due to interruptible waits.
Suboperations
- Operations can "spawn" other operations that run under the same OperationContext (DBDirectClient, bulk inserts, $out?)
- For simplicity, only the top-level operation reports statistics about overdue interrupt checks
- If an operation spawns a sub-operation which is delinquent, the parent operation will be considered delinquent and include all of the metrics about the overdue interrupt checks.
Clock accuracy
- Sampled operations track time between interrupt checks using TickSource.
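
As a rough illustration of the bookkeeping described in this ticket, the sketch below counts every interrupt check, tracks timing only for sampled operations, and restarts the interval after an interruptible wait so that wait time is not counted. It is not the server's implementation (which is C++ and uses TickSource): the class, the method names, the overdue threshold, and the interpretation of the sampling rate as a probability are all assumptions made for the example.

```python
import random
import time


class InterruptCheckTracker:
    """Hypothetical per-operation tracker; not the server's actual class."""

    def __init__(self, sampling_rate, overdue_threshold_s, clock=time.monotonic):
        self.num_checks = 0  # counted for every operation
        # Assumption: the sampling rate is a probability in [0, 1].
        self.sampled = random.random() < sampling_rate
        self._clock = clock
        self._threshold = overdue_threshold_s  # assumed threshold, not from the ticket
        self._last = clock() if self.sampled else None
        self.overdue_checks = 0
        self.overdue_time = 0.0
        self.max_gap = 0.0

    def check_for_interrupt(self):
        """Record one interrupt check; timing is tracked only when sampled."""
        self.num_checks += 1
        if not self.sampled:
            return
        now = self._clock()
        gap = now - self._last
        self._last = now
        self.max_gap = max(self.max_gap, gap)
        if gap > self._threshold:
            self.overdue_checks += 1
            self.overdue_time += gap - self._threshold

    def interruptible_wait_finished(self):
        """Restart the interval so time spent waiting is not counted as a gap."""
        if self.sampled:
            self._last = self._clock()


class FakeClock:
    """Deterministic clock for the usage example below."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
tracker = InterruptCheckTracker(sampling_rate=1.0, overdue_threshold_s=0.010, clock=clock)
clock.advance(0.005)
tracker.check_for_interrupt()          # on time
clock.advance(0.030)
tracker.check_for_interrupt()          # overdue by about 20 ms
clock.advance(5.0)                     # time spent in an interruptible wait
tracker.interruptible_wait_finished()
clock.advance(0.001)
tracker.check_for_interrupt()          # on time; the wait is not counted
```

Restarting the interval at the end of the wait is the simplest way to avoid false positives in a sketch; it also discards any time that elapsed between the previous check and the start of the wait.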

depends on

SERVER-104007 Track delinquent interrupt checks

Closed

Assignee: Unassigned
Reporter: Backlog - Core Eng Program Management Team
Votes: 0
Watchers: 3

Created: Aug 28 2025 06:56:40 PM UTC
Updated: Dec 19 2025 11:26:21 AM UTC
Resolved: Sep 03 2025 01:59:05 PM UTC
