- Type: Investigation
- Resolution: Won't Do
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- None
- Not Needed
- Developer Tools
Original Downstream Change Summary
This added a number of new serverStatus metrics and a new section under "operation.interrupt." This did not change or remove any existing serverStatus metrics.
Description of Linked Ticket
Operations are expected to check for interrupt periodically. This task was to track when operations have a long duration between two interrupt checks, and to provide metrics about such operations in server status.
Summary of diagnostics added
Under this ticket, a number of diagnostics were added:
- Every operation now tracks how many interrupt checks it does.
- At the operation level, numInterruptChecks is reported in slow query logs and $currentOp (SERVER-104009).
- At the query shape level, we intend to report interrupt checks per second and possibly other information as well (SERVER-107647).
- At the process level, we track the total number of interrupt checks across all operations. This can be used for very coarse-grained analysis.
- A small fraction (< 1%) of operations will track the maximum time between interrupt checks and the accumulated "overdue" time.
- We attempted to do this for all operations. It incurred a small but measurable performance penalty, so we only do it for a small sample.
- The sampling ratio is controlled by the setParameter overdueInterruptCheckSamplingRate.
- At the operation level, sampled operations will report:
- The number of overdue interrupt checks (in addition to the total number of interrupt checks)
- The maximum time between two interrupt checks
- The accumulated overdue time between interrupt checks.
- At the process level, we report:
- The number of sampled operations
- The number of sampled operations which had at least one overdue interrupt check
- The total number of interrupt checks by sampled operations
- The number of overdue interrupt checks by sampled operations
- The accumulated "time overdue" by sampled operations
- The maximum time between any two interrupt checks, across all sampled operations
- From these values we can also derive things like:
- Lower bound for average time between interrupt checks
- This is only a lower bound because we assume that non-overdue interrupt checks come exactly on time
- Average time between overdue interrupt checks
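The bookkeeping described in this section can be sketched roughly as follows. This is a minimal Python illustration, not the server's C++ implementation. The class and function names, the `OVERDUE_THRESHOLD_US` value, and the definition of "overdue time" as the part of a gap beyond the threshold are all assumptions made for the example; the ticket does not state the threshold or the exact formulas. The two derived values at the end are one possible reading of the derivations listed above.

```python
import random

# Hypothetical threshold: the ticket does not say when a check counts as overdue.
OVERDUE_THRESHOLD_US = 10_000


def should_sample(sampling_rate, rng=random.random):
    """Decide whether an operation is sampled (rate plays the role of
    overdueInterruptCheckSamplingRate; the decision logic is assumed)."""
    return rng() < sampling_rate


class OperationInterruptStats:
    """Per-operation counters; gap timing is kept only for sampled operations."""

    def __init__(self, now_us, sampled):
        self.sampled = sampled
        self.num_interrupt_checks = 0
        self.overdue_checks = 0
        self.max_gap_us = 0
        self.overdue_time_us = 0
        self._last_check_us = now_us

    def check_for_interrupt(self, now_us):
        self.num_interrupt_checks += 1  # cheap counter, kept for every operation
        if not self.sampled:
            return
        gap = now_us - self._last_check_us
        self._last_check_us = now_us
        self.max_gap_us = max(self.max_gap_us, gap)
        if gap > OVERDUE_THRESHOLD_US:
            self.overdue_checks += 1
            # Assumption: overdue time is the part of the gap beyond the threshold.
            self.overdue_time_us += gap - OVERDUE_THRESHOLD_US


class ProcessInterruptStats:
    """Process-level totals, updated when an operation finishes."""

    def __init__(self):
        self.total_checks = 0
        self.sampled_ops = 0
        self.sampled_ops_with_overdue = 0
        self.sampled_checks = 0
        self.sampled_overdue_checks = 0
        self.sampled_overdue_time_us = 0
        self.max_gap_us = 0

    def record(self, op):
        self.total_checks += op.num_interrupt_checks
        if not op.sampled:
            return
        self.sampled_ops += 1
        self.sampled_ops_with_overdue += 1 if op.overdue_checks > 0 else 0
        self.sampled_checks += op.num_interrupt_checks
        self.sampled_overdue_checks += op.overdue_checks
        self.sampled_overdue_time_us += op.overdue_time_us
        self.max_gap_us = max(self.max_gap_us, op.max_gap_us)

    def avg_overdue_gap_us(self):
        """Average gap preceding an overdue check, under the assumptions above."""
        if self.sampled_overdue_checks == 0:
            return None
        return OVERDUE_THRESHOLD_US + self.sampled_overdue_time_us / self.sampled_overdue_checks

    def avg_gap_lower_bound_us(self):
        """Lower bound on the average gap: non-overdue gaps are counted as zero."""
        if self.sampled_checks == 0:
            return None
        overdue_gap_total = (self.sampled_overdue_time_us
                             + self.sampled_overdue_checks * OVERDUE_THRESHOLD_US)
        return overdue_gap_total / self.sampled_checks
```

Integer microseconds are used here so the arithmetic stays exact. Unsampled operations only pay for a counter increment, which mirrors the motivation for sampling given above.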
Other considerations
- Interruptible waits
- Operations are able to do a wait on a condition variable which completes if the condition becomes true OR if the operation is killed.
- We do not count time spent in an interruptible wait as time between interrupt checks. We should not get false positives due to interruptible waits.
- Suboperations
- Operations can "spawn" other operations that run under the same OperationContext (DBDirectClient, bulk inserts, $out?)
- For simplicity, only the top-level operation reports statistics about overdue interrupt checks
- If an operation spawns a sub-operation which is delinquent, the parent operation will be considered delinquent and include all of the metrics about the overdue interrupt checks.
- Clock accuracy
- Sampled operations track time between interrupt checks using TickSource.
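To illustrate the interruptible-wait and clock points above, here is a small Python sketch of a gap timer that excludes time spent in interruptible waits and reads time from an injectable tick source. `ManualTickSource`, `GapTimer`, and their methods are hypothetical names for this example and do not reflect the server's TickSource interface.

```python
class ManualTickSource:
    """Hypothetical monotonic tick source that is advanced by hand."""

    def __init__(self):
        self._ticks = 0

    def now(self):
        return self._ticks

    def advance(self, ticks):
        self._ticks += ticks


class GapTimer:
    """Measures time between interrupt checks, excluding interruptible waits."""

    def __init__(self, tick_source):
        self._ts = tick_source
        self._last_check = tick_source.now()
        self._wait_started = None
        self._excluded = 0  # ticks spent in interruptible waits since the last check
        self.max_gap = 0

    def begin_interruptible_wait(self):
        self._wait_started = self._ts.now()

    def end_interruptible_wait(self):
        self._excluded += self._ts.now() - self._wait_started
        self._wait_started = None

    def check_for_interrupt(self):
        now = self._ts.now()
        # Waiting time is subtracted so that a long wait is not reported as a long gap.
        gap = now - self._last_check - self._excluded
        self._last_check = now
        self._excluded = 0
        self.max_gap = max(self.max_gap, gap)
        return gap
```

For example, an operation that runs for 3 ticks, waits interruptibly for 1000 ticks, then runs for 2 more ticks before its next check would record a gap of 5 ticks rather than 1005.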
- depends on SERVER-104007 Track delinquent interrupt checks (Closed)