Track delinquent interrupt checks


    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • 8.3.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Query Execution
    • Fully Compatible
    • v8.2, v8.0
    • 200

      Operations are expected to check for interrupts periodically. This task tracks operations that go a long time between two consecutive interrupt checks, and reports metrics about such operations in server status.

       

      Summary of diagnostics added

      Under this ticket, a number of diagnostics were added:

      • Every operation now tracks how many interrupt checks it does.
        • At the operation level, numInterruptChecks is reported in slow query logs and $currentOp (SERVER-104009).
        • At the query shape level, we intend to report interrupt checks per second and possibly other information as well (SERVER-107647).
        • At the process level, we track the total number of interrupt checks across all operations. This can be used for very coarse-grained analysis.
      • A small fraction (< 1%) of operations will track the maximum time between interrupt checks and the accumulated "overdue" time.
        • We attempted to do this for all operations, but it incurred a small but measurable performance penalty, so we only do it for a small sample.
        • The sampling ratio is controlled by the setParameter overdueInterruptCheckSamplingRate.
        • At the operation level, sampled operations will report:
          • The number of overdue interrupt checks (in addition to the total number of interrupt checks)
          • The maximum time between two interrupt checks
          • The accumulated overdue time between interrupt checks
        • At the process level, we report:
          • The number of sampled operations
          • The number of sampled operations which had at least one overdue interrupt check
          • The total number of interrupt checks by sampled operations
          • The number of overdue interrupt checks by sampled operations
          • The accumulated "time overdue" by sampled operations
          • The maximum time between any two interrupt checks, across all sampled operations
          • From these values we can also derive things like:
            • Lower bound for average time between interrupt checks
              • Lower bound because we'd assume that non-overdue interrupt checks come exactly on time
            • Average time between overdue interrupt checks
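The process-level counters listed above, and the values derived from them, can be sketched roughly as follows. This is an illustration in Python, not the server's code. The class names, field names, and the OVERDUE_THRESHOLD_MS value are all hypothetical. The sketch also assumes that "overdue time" means the portion of a gap beyond the threshold, and it reads "average time between overdue interrupt checks" as the average overdue time per overdue check; the ticket does not confirm either reading.

```python
from dataclasses import dataclass

# Hypothetical threshold: a gap longer than this counts as overdue.
OVERDUE_THRESHOLD_MS = 10.0


@dataclass
class SampledOpStats:
    """Numbers one sampled operation reports when it finishes (names are illustrative)."""
    num_checks: int
    num_overdue_checks: int
    overdue_time_ms: float  # assumed: time beyond the threshold, summed over overdue gaps
    max_gap_ms: float


@dataclass
class ProcessStats:
    """Process-wide totals over all sampled operations."""
    sampled_ops: int = 0
    sampled_ops_with_overdue: int = 0
    total_checks: int = 0
    overdue_checks: int = 0
    overdue_time_ms: float = 0.0
    max_gap_ms: float = 0.0

    def record(self, op: SampledOpStats) -> None:
        self.sampled_ops += 1
        if op.num_overdue_checks > 0:
            self.sampled_ops_with_overdue += 1
        self.total_checks += op.num_checks
        self.overdue_checks += op.num_overdue_checks
        self.overdue_time_ms += op.overdue_time_ms
        self.max_gap_ms = max(self.max_gap_ms, op.max_gap_ms)

    def avg_gap_lower_bound_ms(self) -> float:
        # Only the duration of overdue gaps is known. Counting every other gap
        # as zero makes the result a lower bound on the true average.
        if self.total_checks == 0:
            return 0.0
        known_ms = self.overdue_time_ms + self.overdue_checks * OVERDUE_THRESHOLD_MS
        return known_ms / self.total_checks

    def avg_overdue_time_ms(self) -> float:
        # Average overdue time per overdue check.
        if self.overdue_checks == 0:
            return 0.0
        return self.overdue_time_ms / self.overdue_checks


# Two sampled operations: one with two overdue checks, one with none.
stats = ProcessStats()
stats.record(SampledOpStats(num_checks=100, num_overdue_checks=2,
                            overdue_time_ms=30.0, max_gap_ms=35.0))
stats.record(SampledOpStats(num_checks=50, num_overdue_checks=0,
                            overdue_time_ms=0.0, max_gap_ms=4.0))
```

Under these assumptions the lower bound only uses quantities that are already aggregated, so it can be computed from the process-level counters without keeping any per-operation history.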

      Other considerations

      • Interruptible waits
        • Operations are able to do a wait on a condition variable which completes if the condition becomes true OR if the operation is killed.
          • We do not count time spent in an interruptible wait as time between interrupt checks. We should not get false positives due to interruptible waits.
      • Sub-operations
        • Operations can "spawn" other operations that run under the same OperationContext (for example DBDirectClient, bulk inserts, $out?).
        • For simplicity, only the top-level operation reports statistics about overdue interrupt checks.
        • If an operation spawns a sub-operation which is delinquent, the parent operation is considered delinquent, and its reported metrics include the sub-operation's overdue interrupt checks.
      • Clock accuracy
        • Sampled operations track time between interrupt checks using TickSource.
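As a rough illustration of the points above, the following Python sketch shows a per-operation tracker that reads time from an injectable tick source and subtracts time spent in interruptible waits from the gap between checks. It is not the server's implementation. OverdueCheckTracker, FakeClock, should_sample, and the one-second threshold are hypothetical, and treating the sampling rate as a probability between 0 and 1 is an assumption.

```python
import random
import time
from contextlib import contextmanager


class OverdueCheckTracker:
    """Per-operation bookkeeping for interrupt checks (illustrative only)."""

    def __init__(self, threshold_s, tick_source=time.monotonic, sampled=True):
        self._now = tick_source          # injectable clock, so tests can fake time
        self._threshold = threshold_s    # hypothetical overdue threshold
        self.sampled = sampled
        self.num_checks = 0              # counted for every operation
        self.num_overdue_checks = 0      # the rest is tracked only when sampled
        self.overdue_time_s = 0.0
        self.max_gap_s = 0.0
        self._last_check = self._now()
        self._waited = 0.0               # interruptible-wait time since last check

    def check_for_interrupt(self):
        self.num_checks += 1
        if not self.sampled:
            return
        now = self._now()
        # Time spent in interruptible waits does not count towards the gap.
        gap = (now - self._last_check) - self._waited
        self._last_check = now
        self._waited = 0.0
        self.max_gap_s = max(self.max_gap_s, gap)
        if gap > self._threshold:
            self.num_overdue_checks += 1
            self.overdue_time_s += gap - self._threshold

    @contextmanager
    def interruptible_wait(self):
        start = self._now()
        try:
            yield
        finally:
            self._waited += self._now() - start


def should_sample(rate, rng=random.random):
    """Decide whether an operation is sampled; rate is assumed to lie in [0, 1]."""
    return rng() < rate


class FakeClock:
    """Manually advanced clock standing in for a real tick source."""

    def __init__(self):
        self.t = 0.0

    def __call__(self):
        return self.t

    def advance(self, seconds):
        self.t += seconds


clock = FakeClock()
op = OverdueCheckTracker(threshold_s=1.0, tick_source=clock)
clock.advance(0.5)
op.check_for_interrupt()        # gap 0.5 s, on time
clock.advance(0.25)
with op.interruptible_wait():
    clock.advance(4.0)          # waiting, excluded from the gap
clock.advance(0.25)
op.check_for_interrupt()        # gap is 0.5 s after excluding the wait
clock.advance(3.0)
op.check_for_interrupt()        # gap 3.0 s, overdue by 2.0 s
```

Passing the clock in as a parameter keeps the tracker testable without sleeping, and it mirrors the idea of measuring through a dedicated tick source rather than reading wall-clock time directly.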

            Assignee:
            Ian Boros
            Reporter:
            Ian Boros
            Votes:
            0
            Watchers:
            5
