[SERVER-63642] Add serverStatus metrics to measure multi-planning performance Created: 14/Feb/22  Updated: 04/Jan/24  Resolved: 12/Apr/22

Status: Closed
Project: Core Server
Component/s: Query Execution, Query Planning
Affects Version/s: None
Fix Version/s: 6.0.0-rc0, 5.0.9, 4.4.15

Type: Improvement Priority: Major - P3
Reporter: David Storch Assignee: Jess Balint
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-65271 serverStatus should allow fine-graine... Closed
Documented
is documented by DOCS-15235 [SERVER] Add serverStatus metrics to ... Closed
Problem/Incident
causes SERVER-65984 Move ServerStatusMetricField speciali... Closed
Related
related to SERVER-63641 Improve SBE multi-planning by choosin... Closed
is related to SERVER-62150 SBE Multiplanning can be slow when su... Open
is related to SERVER-31078 Query planning is very slow during mu... Backlog
is related to SERVER-63015 Capture metrics about time spent mult... Backlog
is related to SERVER-62981 Make SBE multi-planner's trial period... Closed
is related to SERVER-63641 Improve SBE multi-planning by choosin... Closed
Backwards Compatibility: Minor Change
Sprint: QE 2022-04-04, QE 2022-02-21, QE 2022-03-07, QE 2022-03-21

 Description   

SERVER-62150 describes a scenario where SBE multi-planning can be slow relative to the classic engine's multi-planning implementation. We implemented SERVER-62981 to mitigate this issue and have also proposed SERVER-63641 as an additional improvement. To confirm that customers are getting good SBE multi-planner performance, we should add metrics to serverStatus. Before implementing this ticket, we need to agree on exactly which metrics to capture and how they will be exposed in serverStatus. The current proposal is to collect two histograms: one of the number of storage reads performed during SBE multi-planning, and one of the overall wall-clock time spent multi-planning.
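
As a rough illustration of the histogram proposal, the following Python sketch counts observations into fixed buckets. The bucket boundaries, the class name, and the "lowerBound"/"count" field names are assumptions made for this example only; they are not the agreed serverStatus format.

```python
from bisect import bisect_right


class BucketHistogram:
    """Counts observations into buckets defined by ascending lower bounds.

    Hypothetical helper for illustration; not the server's implementation.
    """

    def __init__(self, lower_bounds):
        # lower_bounds must be sorted ascending, e.g. [0, 1000, 10000].
        self.lower_bounds = list(lower_bounds)
        self.counts = [0] * len(self.lower_bounds)

    def record(self, value):
        # Find the last bucket whose lower bound is <= value.
        idx = bisect_right(self.lower_bounds, value) - 1
        # Values below the first bound are clamped into the first bucket.
        self.counts[max(idx, 0)] += 1

    def report(self):
        # One entry per bucket, in the assumed reporting shape.
        return [
            {"lowerBound": lb, "count": c}
            for lb, c in zip(self.lower_bounds, self.counts)
        ]


# Assumed bucket boundaries for wall-clock time, in microseconds.
planning_micros = BucketHistogram([0, 1000, 10000])
for sample in (5, 999, 1000, 50000):
    planning_micros.record(sample)
```

The same structure could hold a second instance for storage reads, with one observation recorded per multi-planning invocation.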

We may wish to collect similar information for the classic multi-planner as well as the SBE multi-planner. There are known scenarios in which the classic multi-planner can take a long time to complete. In particular, see SERVER-31078.

The intended audience of these metrics is query engineering and query product management. We want to be able to analyze the performance of multi-planning across the Atlas fleet in order to inform our decision making about future improvements to the server. These metrics would probably also be useful in support scenarios (e.g. checking whether a customer is running many queries that take a long time to multi-plan), but this is not the primary use case.



 Comments   
Comment by Githook User [ 27/Apr/22 ]

Author:

{'name': 'Jess Balint', 'email': 'jbalint@gmail.com', 'username': 'jbalint'}

Message: SERVER-63642 Add serverStatus histogram metrics to measure multi-planning performance

(cherry picked from commit ae996e0249f4f20b4def3a9f81dfc61c81eb4c83)
Branch: v4.4
https://github.com/mongodb/mongo/commit/7214e3aa614dbab8e4dcc94934879c55a50fde4c

Comment by Githook User [ 25/Apr/22 ]

Author:

{'name': 'Jess Balint', 'email': 'jbalint@gmail.com', 'username': 'jbalint'}

Message: SERVER-63642 Add serverStatus histogram metrics to measure multi-planning performance

(cherry picked from commit 43434627e89822b7e19e3a9d3aeb341be331aae6)
Branch: v5.0
https://github.com/mongodb/mongo/commit/ae996e0249f4f20b4def3a9f81dfc61c81eb4c83

Comment by Githook User [ 09/Apr/22 ]

Author:

{'name': 'Jess Balint', 'email': 'jbalint@gmail.com', 'username': 'jbalint'}

Message: SERVER-63642 Add serverStatus histogram metrics to measure multi-planning performance
Branch: master
https://github.com/mongodb/mongo/commit/43434627e89822b7e19e3a9d3aeb341be331aae6

Comment by Bruce Lucas (Inactive) [ 15/Feb/22 ]

We should also consider whether these should go in FTDC, which will be the case if they are included in serverStatus by default. Even though it's not the primary use case, having them in FTDC would be helpful for support. But histograms often have a lot of content, so maybe we could think about a subset that would be especially useful for inclusion in FTDC.

Regarding histograms, I don't know if it's the case here, but we've often found that histograms have limited diagnostic value relative to the FTDC space they require, and that averages are just as useful without overloading FTDC. For example, we don't include query latency histograms in FTDC; instead we include cumulative total query time and cumulative query count, from which t2 can compute average latency over any time period. I wonder if such an approach could be useful here.
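
To make the cumulative-counter approach concrete, here is a minimal Python sketch of how an average over a time period can be derived from two samples of a cumulative total and a cumulative count. The function name and the "total_micros"/"count" keys are hypothetical and chosen for this example; they do not describe actual FTDC fields or how t2 is implemented.

```python
def average_over_interval(earlier, later):
    """Average time per operation between two cumulative samples.

    Each sample is a dict with assumed keys "total_micros" and "count".
    Returns None when no operations completed in the interval.
    """
    delta_count = later["count"] - earlier["count"]
    if delta_count <= 0:
        return None
    delta_total = later["total_micros"] - earlier["total_micros"]
    return delta_total / delta_count


# Two hypothetical samples taken at the start and end of a period.
sample_start = {"total_micros": 1000, "count": 10}
sample_end = {"total_micros": 4000, "count": 20}
avg = average_over_interval(sample_start, sample_end)
```

Because only two numbers are stored per sample, the space cost stays constant regardless of how many operations ran, which is the trade-off against a full histogram.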
