- Type: Task
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Query Execution
- Fully Compatible
- v8.2, v8.0
- 200
Operations are expected to check for interrupt periodically. This task tracks operations that go a long time between two consecutive interrupt checks, and provides metrics about such operations in server status.
Summary of diagnostics added
Under this ticket, a number of diagnostics were added:
- Every operation now tracks how many interrupt checks it does.
  - At the operation level, numInterruptChecks is reported in slow query logs and $currentOp (SERVER-104009).
  - At the query shape level, we intend to report interrupt checks per second and possibly other information as well (SERVER-107647).
  - At the process level, we track the total number of interrupt checks across all operations. This can be used for very coarse-grained analysis.
- A small fraction (< 1%) of operations track the maximum time between interrupt checks and the accumulated "overdue" time.
  - We attempted to do this for all operations. It incurred a small but measurable performance penalty, so we only do it for a small sample.
  - The sampling ratio is controlled by the setParameter overdueInterruptCheckSamplingRate.
  - At the operation level, sampled operations will report:
    - The number of overdue interrupt checks (in addition to the total number of interrupt checks).
    - The maximum time between two interrupt checks.
    - The accumulated overdue time between interrupt checks.
  - At the process level, we report:
    - The number of sampled operations.
    - The number of sampled operations which had at least one overdue interrupt check.
    - The total number of interrupt checks by sampled operations.
    - The number of overdue interrupt checks by sampled operations.
    - The accumulated "time overdue" by sampled operations.
    - The maximum time between any two interrupt checks, across all sampled operations.
  - From these values we can also derive things like:
    - A lower bound for the average time between interrupt checks. It is a lower bound because we assume that non-overdue interrupt checks come exactly on time.
    - The average time between overdue interrupt checks.
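The bookkeeping described above can be sketched roughly as follows. This is an illustrative Python model, not the server's C++ implementation: the class and field names, the 10 ms overdue threshold, and the injected clock and random source are all assumptions made for the example. Only the idea comes from the ticket: every operation counts its interrupt checks, while a sampled subset also measures the time between checks.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical threshold: the ticket does not state when a check counts as overdue.
OVERDUE_THRESHOLD_MS = 10.0


@dataclass
class OpStats:
    sampled: bool
    num_interrupt_checks: int = 0
    overdue_checks: int = 0
    max_gap_ms: float = 0.0
    overdue_ms: float = 0.0


class InterruptCheckTracker:
    """Per-operation bookkeeping. All names here are illustrative."""

    def __init__(self, clock_ms: Callable[[], float], sampling_rate: float,
                 rng: Callable[[], float] = random.random) -> None:
        self._clock_ms = clock_ms
        # Only a small fraction of operations pay the cost of reading the clock.
        self.stats = OpStats(sampled=rng() < sampling_rate)
        self._last_check_ms: Optional[float] = clock_ms() if self.stats.sampled else None

    def check_for_interrupt(self) -> None:
        self.stats.num_interrupt_checks += 1  # cheap counter, kept for every operation
        if not self.stats.sampled:
            return
        now = self._clock_ms()
        gap = now - self._last_check_ms
        self._last_check_ms = now
        self.stats.max_gap_ms = max(self.stats.max_gap_ms, gap)
        if gap > OVERDUE_THRESHOLD_MS:
            self.stats.overdue_checks += 1
            self.stats.overdue_ms += gap - OVERDUE_THRESHOLD_MS


@dataclass
class ProcessStats:
    """Process-level totals, mirroring the list of reported values above."""
    total_checks: int = 0              # across all operations, sampled or not
    sampled_ops: int = 0
    sampled_ops_with_overdue: int = 0
    sampled_checks: int = 0
    overdue_checks: int = 0
    overdue_ms: float = 0.0
    max_gap_ms: float = 0.0

    def record(self, op: OpStats) -> None:
        self.total_checks += op.num_interrupt_checks
        if not op.sampled:
            return
        self.sampled_ops += 1
        self.sampled_ops_with_overdue += 1 if op.overdue_checks > 0 else 0
        self.sampled_checks += op.num_interrupt_checks
        self.overdue_checks += op.overdue_checks
        self.overdue_ms += op.overdue_ms
        self.max_gap_ms = max(self.max_gap_ms, op.max_gap_ms)
```

In this model an unsampled operation only increments a counter, which reflects the trade-off described above: the timing path is the expensive part, so it runs only for the sample.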
Other considerations
- Interruptible waits
  - Operations are able to wait on a condition variable; the wait completes if the condition becomes true OR if the operation is killed.
  - We do not count time spent in an interruptible wait as time between interrupt checks, so interruptible waits should not produce false positives.
- Suboperations
  - Operations can "spawn" other operations that run under the same OperationContext (DBDirectClient, bulk inserts, $out?).
  - For simplicity, only the top-level operation reports statistics about overdue interrupt checks.
  - If an operation spawns a sub-operation which is delinquent, the parent operation will be considered delinquent and include all of the metrics about the overdue interrupt checks.
- Clock accuracy
  - Sampled operations track time between interrupt checks using TickSource.
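One way to keep interruptible waits out of the measured gap is to subtract the time spent waiting before comparing against a threshold. The sketch below is a simplified Python model under that assumption. The class and method names are hypothetical, and the injected clock merely stands in for a tick source; it does not describe how the server implements the exclusion.

```python
from typing import Callable, Optional


class GapTimer:
    """Measures the time between interrupt checks, excluding interruptible waits."""

    def __init__(self, clock_ms: Callable[[], float]) -> None:
        self._clock_ms = clock_ms
        self._last_check_ms = clock_ms()
        self._waited_ms = 0.0
        self._wait_started_ms: Optional[float] = None

    def begin_interruptible_wait(self) -> None:
        # The operation can still be killed while waiting, so this time is not "delinquent".
        self._wait_started_ms = self._clock_ms()

    def end_interruptible_wait(self) -> None:
        self._waited_ms += self._clock_ms() - self._wait_started_ms
        self._wait_started_ms = None

    def check_for_interrupt(self) -> float:
        """Returns the gap since the previous check, minus any time spent waiting."""
        now = self._clock_ms()
        gap = now - self._last_check_ms - self._waited_ms
        self._last_check_ms = now
        self._waited_ms = 0.0
        return gap
```

With this accounting, an operation that runs for 5 ms and blocks for 500 ms in an interruptible wait reports a 5 ms gap rather than a 505 ms one.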
is depended on by
- COMPASS-9770 Investigate changes in SERVER-104007: Track delinquent interrupt checks (Closed)

is related to
- SERVER-107293 Revert SERVER-104007 (Closed)
- SERVER-104008 Record operation delinquency ticket information in QueryStats (Closed)
- SERVER-104009 Record delinquent checkForInterrupt() information in CurOp and slow logging (Closed)
- SERVER-104010 Track delinquent ticket releases (Closed)

related to
- SERVER-106769 A low overhead timer for x86 and AArch64 (Investigating)
- SERVER-107407 Complete TODO listed in SERVER-104007 (Open)
- SERVER-105801 Consider moving OperationContext overdue interrupt information to Interruptible (Backlog)
- SERVER-107647 Record operation delinquency checkForinterrupt information in QueryStats (Closed)