  Core Server / SERVER-86504

Better observability for operations which exceed ticket deadlines

    • Query Execution
    • QO 2024-03-04, QO 2024-03-18, QO 2024-04-01, QO 2024-04-15, QO 2024-04-29, QO 2024-05-13, QO 2024-05-27, QO 2024-06-10, QO 2024-06-24

      Summary of redefined requirements, from kevin.cherkauer@mongodb.com's notes on the 2024-12-11 meeting with matt.broadstone@mongodb.com, martin.neupauer@mongodb.com, amr.elhelw@mongodb.com, and ian.boros@mongodb.com:

      1. Implement this whole project at the Ticket, AdmissionContext, OperationContext, and CurOp layers, not in Query code, using the existing slow query logging mechanism.
      2. Use something like internalQueryExecYieldPeriodMS + 20ms as the delinquent ticket hold threshold. TBD: Can/should this Query-specific knob be used in non-Query layers?
      3. The query shape will be whatever slow query logging already uses.
      4. Include per-query-shape histograms of how long delinquent tickets were held.
      5. Make sure there is a mechanism to prevent log flooding per query shape (one probably already exists).
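      The requirements above can be illustrated with a minimal sketch. It is not server code: the class and constant names, the bucket bounds, and the one-second log interval are all hypothetical, and the 10 ms yield period is only the default mentioned in the original description. It shows a threshold check on ticket release, a per-query-shape histogram of delinquent hold times, and a simple per-shape log rate limit.

```python
import bisect
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

# Assumed values: the ticket only says "something like
# internalQueryExecYieldPeriodMS + 20ms"; 10 ms is the default yield period
# mentioned in the original description.
YIELD_PERIOD_MS = 10
DELINQUENCY_SLACK_MS = 20
DELINQUENT_THRESHOLD_MS = YIELD_PERIOD_MS + DELINQUENCY_SLACK_MS

# Hypothetical histogram bucket upper bounds (ms); the last bucket is open-ended.
BUCKET_BOUNDS_MS = (50, 100, 500, 1000)
# Hypothetical minimum spacing between log lines for one query shape.
MIN_LOG_INTERVAL_MS = 1000


@dataclass
class ShapeStats:
    buckets: list = field(
        default_factory=lambda: [0] * (len(BUCKET_BOUNDS_MS) + 1))
    last_logged_ms: Optional[float] = None


class DelinquentTicketTracker:
    def __init__(self):
        self.stats = defaultdict(ShapeStats)
        self.log_lines = []  # stands in for the slow query log

    def on_ticket_released(self, query_shape, held_ms, now_ms):
        """Record one ticket hold; return True if it was delinquent."""
        if held_ms <= DELINQUENT_THRESHOLD_MS:
            return False
        stats = self.stats[query_shape]
        # Bucket i counts holds in (bounds[i-1], bounds[i]].
        stats.buckets[bisect.bisect_left(BUCKET_BOUNDS_MS, held_ms)] += 1
        # Rate limit: at most one log line per shape per interval.
        if (stats.last_logged_ms is None
                or now_ms - stats.last_logged_ms >= MIN_LOG_INTERVAL_MS):
            stats.last_logged_ms = now_ms
            self.log_lines.append(
                f"delinquent ticket: shape={query_shape} heldMs={held_ms}")
        return True
```

      Keeping the histogram update unconditional while only the log line is rate limited means the per-shape counts stay complete even when logging is suppressed.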

      This Jira ticket is NOT a Query-layer one and will NOT track the intervals between yields or interrupt checks. It is solely about Ticket hold durations and is a Workload Scheduling project, not a Query Execution project.

       

      Below is the original (now obsolete) description from matt.broadstone@mongodb.com 2024-02-12:

      We have a loose requirement in the server that long-running operations should yield every 10ms (this is configurable for the query subsystem); however, nothing about our cooperative scheduling implementation enforces this contract. As we begin to ideate on improvements in this space, it would be useful to see which types of operations currently acquire tickets without yielding in a reasonable time (or at all). I'm imagining a few improvements:

      • Collect aggregated metrics for the number of queries that hold tickets longer than the yielding threshold (default: 10ms). This is not only useful for triage, but could be integrated into a node health statistic for admission policies.
      • As a generalization of the above, it would be valuable to keep a simple histogram of the duration of each ticket held.
      • In addition to the above, it would be valuable to also keep a simple histogram of the number of ticket acquisitions per query.
      • Mark queries (perhaps using the slowms machinery? using query shape?) as being delinquent in ticket retention. This would help server engineers identify pathological cases where tickets are not being released by observing properties of the source query (or the query itself).
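      As a rough illustration of the first three bullets, the sketch below keeps an aggregate count of queries with at least one over-threshold hold, a histogram of all hold durations, and a histogram of acquisitions per query. The class name, method names, and power-of-two bucketing are assumptions made for the example, not anything the server defines.

```python
from collections import Counter

YIELD_THRESHOLD_MS = 10  # default yielding threshold from the description


class TicketMetrics:
    def __init__(self):
        # Queries with at least one hold longer than the threshold.
        self.delinquent_queries = 0
        # Hold durations, keyed by power-of-two bucket upper bound (ms).
        self.hold_ms_histogram = Counter()
        # Keyed by the number of ticket acquisitions a query made.
        self.acquisitions_histogram = Counter()

    @staticmethod
    def _bucket(ms):
        """Return the smallest power of two >= ms (minimum 1)."""
        bound = 1
        while bound < ms:
            bound *= 2
        return bound

    def record_query(self, hold_durations_ms):
        """Record every ticket hold made by one finished query."""
        for held in hold_durations_ms:
            self.hold_ms_histogram[self._bucket(held)] += 1
        if any(held > YIELD_THRESHOLD_MS for held in hold_durations_ms):
            self.delinquent_queries += 1
        self.acquisitions_histogram[len(hold_durations_ms)] += 1


metrics = TicketMetrics()
metrics.record_query([3, 12, 9])  # three acquisitions, one over threshold
metrics.record_query([5])         # one acquisition, none over threshold
```

      After these two calls, `metrics.delinquent_queries` is 1 and the acquisitions histogram holds one query with three acquisitions and one with a single acquisition.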

            Assignee:
            kevin.cherkauer@mongodb.com Kevin Cherkauer
            Reporter:
            matt.broadstone@mongodb.com Matt Broadstone