Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:

Assigned Teams: Query Execution
Sprint: QO 2024-03-04, QO 2024-03-18, QO 2024-04-01, QO 2024-04-15, QO 2024-04-29, QO 2024-05-13, QO 2024-05-27, QO 2024-06-10, QO 2024-06-24
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Summary of redefined requirements from kevin.cherkauer@mongodb.com's notes of the 2024-12-11 meeting with matt.broadstone@mongodb.com, martin.neupauer@mongodb.com, amr.elhelw@mongodb.com, and ian.boros@mongodb.com:

Implement this whole project at the Ticket, AdmissionContext, OperationContext, and CurOp layers, not in Query code, using the existing slow query logging mechanism.
Use something like internalQueryExecYieldPeriodMS + 20ms as the delinquent ticket hold threshold. TBD: Can/should this Query-specific knob be used in non-Query layers?
Query shape will be whatever slow logging already uses for it.
For delinquent tickets, include a histogram per query shape of how long the ticket was held.
Make sure there is a mechanism to prevent log flooding per query shape (probably already exists).
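
The requirements above could be sketched roughly as follows. This is an illustrative Python model, not server code: the class `DelinquentTicketTracker`, its method names, the histogram bucket boundaries, the strict "greater than" comparison, and the one-log-line-per-second-per-shape rate limit are all hypothetical. The 10 ms yield period is the default mentioned in the original description below, the 20 ms margin comes from the notes above, and the query shape is modeled as an opaque string.

```python
import bisect
from collections import defaultdict

# Assumed values: 10 ms default yield period plus the 20 ms margin
# proposed in the meeting notes.
YIELD_PERIOD_MS = 10
MARGIN_MS = 20
DELINQUENT_THRESHOLD_MS = YIELD_PERIOD_MS + MARGIN_MS

# Hypothetical histogram bucket upper bounds, in milliseconds.
BUCKET_BOUNDS_MS = [50, 100, 500, 1000, 5000]


class DelinquentTicketTracker:
    """Toy model: per-query-shape histograms of delinquent ticket holds."""

    def __init__(self, min_log_interval_ms=1000):
        # One counter per bucket, plus an overflow bucket at the end.
        self.histograms = defaultdict(lambda: [0] * (len(BUCKET_BOUNDS_MS) + 1))
        self.last_logged_ms = {}
        self.min_log_interval_ms = min_log_interval_ms
        self.log_lines = []

    def on_ticket_released(self, query_shape, held_ms, now_ms):
        """Record a ticket hold; return True if it counted as delinquent."""
        if held_ms <= DELINQUENT_THRESHOLD_MS:
            return False
        # bisect_left puts a hold equal to a bound into that bound's bucket.
        bucket = bisect.bisect_left(BUCKET_BOUNDS_MS, held_ms)
        self.histograms[query_shape][bucket] += 1
        # Rate-limit log lines per shape so one hot shape cannot flood the log.
        last = self.last_logged_ms.get(query_shape)
        if last is None or now_ms - last >= self.min_log_interval_ms:
            self.last_logged_ms[query_shape] = now_ms
            self.log_lines.append(
                f"delinquent ticket hold: shape={query_shape} heldMs={held_ms}"
            )
        return True
```

In this model a caller would invoke `on_ticket_released` once per ticket release. Holds at or below the threshold are ignored entirely, so the histograms describe only delinquent holds, as the requirement asks.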

This Jira ticket is NOT a Query-layer one and will NOT track the intervals between yields or interrupt checks. It is solely about Ticket hold durations and is a Workload Scheduling project, not a Query Execution project.

Below is the original (now obsolete) description from matt.broadstone@mongodb.com 2024-02-12:

We have a loose requirement in the server that long-running operations should yield every 10ms (this is configurable for the query subsystem); however, nothing about our cooperative scheduling implementation enforces this contract. As we begin to ideate on improvements in this space, it would be useful to see which types of operations currently acquire tickets without yielding in a reasonable time (or at all). I'm imagining a few improvements:

Collect aggregated metrics for the number of queries that hold tickets longer than the yielding threshold (default: 10ms). This is not only useful for triage, but could be integrated into a node health statistic for admission policies.
As a generalization of the above, it would be valuable to keep a simple histogram of the duration of each ticket held.
In addition to the above, it would be valuable to also keep a simple histogram of the number of ticket acquisitions per query.
Mark queries (perhaps using the slowms machinery? using query shape?) as being delinquent in ticket retention. This would help server engineers identify pathological cases where tickets are not being released by observing properties of the source query (or the query itself).
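
As a rough illustration of the first and third items, the aggregated count and the acquisitions-per-query histogram might be computed as in the Python sketch below. The function `summarize_ticket_holds`, its input layout (a hypothetical operation id mapped to a list of hold durations in milliseconds), and the choice to count an operation as delinquent when any single hold exceeds the threshold are all assumptions made for illustration.

```python
from collections import Counter

YIELD_THRESHOLD_MS = 10  # default yielding threshold cited above


def summarize_ticket_holds(holds_by_operation):
    """Summarize ticket holds.

    holds_by_operation maps a hypothetical operation id to the list of
    durations (ms) for which that operation held a ticket.
    """
    # Number of operations with at least one hold over the threshold.
    delinquent_ops = sum(
        1
        for holds in holds_by_operation.values()
        if any(h > YIELD_THRESHOLD_MS for h in holds)
    )
    # Histogram: number of ticket acquisitions -> number of operations.
    acquisitions_histogram = Counter(
        len(holds) for holds in holds_by_operation.values()
    )
    return delinquent_ops, dict(acquisitions_histogram)
```

An operation that never yields shows up here as a single long hold, that is, one acquisition and a large duration, which is the pathological case the last item aims to surface.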

Related to:

SERVER-87365 Execution control doesn't ramp up fast enough for slow yielding queries (Backlog)
SERVER-72258 Audit and add missing checkForInterrupt to SBE stages (Closed)
SERVER-86164 Create a test that catches operations that aren't interruptible for significant periods of time (Backlog)

Assignee: Unassigned
Reporter: Matt Broadstone
Participants: Andy Schwerin, Matt Broadstone
Votes: 0
Watchers: 30

Created: Feb 12 2024 04:00:16 PM UTC
Updated: Jan 29 2025 08:04:36 PM UTC
Confidence Status Last Update: 10/Dec/24 8:52 PM
