[SERVER-63015] Capture metrics about time spent multiplanning Created: 26/Jan/22  Updated: 28/Jul/23

Status: Backlog
Project: Core Server
Component/s: Diagnostics
Affects Version/s: 5.2.0
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Kevin Arhelger Assignee: Backlog - Query Integration
Resolution: Unresolved Votes: 2
Labels: query-offsite
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-68029 Add new metric to measure optimizatio... Open
related to SERVER-63642 Add serverStatus metrics to measure m... Closed
is related to SERVER-61250 Introduce bound inclusions for bounde... Closed
Assigned Teams:
Query Integration
Participants:

 Description   

This request is for an addition to the slow query message. SERVER-63642 implemented a serverStatus metric for this. Currently, the time spent planning a query is unaccounted in both the slow log line and in serverStats. Note that SERVER-68029 tracks very related work for the cost-based optimizer.

Having a metric in server status reflecting the time spent planning queries would give a strong signal if there is something wrong. Perhaps a metric like "timeAcquiringMicros" for the locking statistics would be helpful to determine how efficient planning is.

Reporting the time spent planning a query in the slow query message would greatly assist query diagnostics. Currently a query taking excessive time planning will not reflect in any of the current metrics, hiding any potential impact from planning delays.



 Comments   
Comment by Charlie Swanson [ 24/Mar/23 ]

Moving this out of the current telemetry epic which has been reduced in scope. This work still seems relevant and I think should be tackled in our planned follow up project, so I moved it there. Note that the scope hasn't been drafted/approved yet, so I'm not 100% sure it will include planning time as a telemetry metric.

Comment by Kevin Arhelger [ 31/Jan/22 ]

Here is the general idea.

  • If a query is planned for the first time, this time is reported in the slow query message.
  • If a query is re-planned, the time planning the query is reported in the slow query message.
  • Anytime a query in canceled or completes planning/re-planning, the time spent planning increments a serverStatus counter.

serverStatus example:

metrics: {
	query: {
		planningDurationMillis: 12456,
	}
}

Slow query example:

"attr": {
	"planningDurationMicros": 123,
	"fromMultiPlanner": true,
}

Nice to have:

  • Increment a serverStatus counter at the beginning of planning or re-planning.
  • Increment a separate serverStatus counter at cancelation, or when planning/re-planning completes.

serverStatus example:

metrics: {
	query: {
		planningStarted: 123,
		planningFinished: 123,
	}
}

A delta between these two counters would allow us to determine how many planning events are currently in-flight, and would allow easier diagnostics for abnormally long planning phases.

Nice to have but likely difficult:

  • Periodically increment a separate serverStatus counter with planning duration while planning is running.

This would help with a plan that is running but does not finish before the server terminates (SIGKILL, SIGSEGV, pulling the power plug).

There are a few things that aren’t included here that could be separate requests:

  • Time to parse a query
  • Time running the “old plan” before re-planning triggered included in the slow query long line.

Please let me know if there is any additional color I can add.

Comment by Ana Meza [ 31/Jan/22 ]

Hi kevin.arhelger could you please add some details about the solution are looking for?

Comment by Bruce Lucas (Inactive) [ 26/Jan/22 ]

This sounds useful. The serverStatus/FTDC metric should follow the usual design pattern and be a cumulative time.

Generated at Thu Feb 08 05:56:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.