[SERVER-63642] Add serverStatus metrics to measure multi-planning performance Created: 14/Feb/22 Updated: 04/Jan/24 Resolved: 12/Apr/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Query Execution, Query Planning |
| Affects Version/s: | None |
| Fix Version/s: | 6.0.0-rc0, 5.0.9, 4.4.15 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | David Storch | Assignee: | Jess Balint |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Minor Change |
| Sprint: | QE 2022-04-04, QE 2022-02-21, QE 2022-03-07, QE 2022-03-21 |
| Participants: |
| Description |
|
SERVER-62150 describes a scenario where SBE multi-planning can be slow relative to the classic engine's multi-planning implementation. We implemented We may wish to collect similar information for the classic multi-planner as well as the SBE multi-planner. There are known scenarios in which the classic multi-planner can take a long time to complete. In particular, see SERVER-31078.

The intended audience of these metrics is query engineering and query product management. We want to be able to analyze the performance of multi-planning across the Atlas fleet in order to inform our decision making about future improvements to the server. It's probable that these metrics would also be useful in support scenarios (e.g. seeing if a customer is getting a lot of queries which take a long time to multi-plan), but this is not the primary use case. |
| Comments |
| Comment by Githook User [ 27/Apr/22 ] |
|
Author: {'name': 'Jess Balint', 'email': 'jbalint@gmail.com', 'username': 'jbalint'} Message: (cherry picked from commit ae996e0249f4f20b4def3a9f81dfc61c81eb4c83) |
| Comment by Githook User [ 25/Apr/22 ] |
|
Author: {'name': 'Jess Balint', 'email': 'jbalint@gmail.com', 'username': 'jbalint'} Message: (cherry picked from commit 43434627e89822b7e19e3a9d3aeb341be331aae6) |
| Comment by Githook User [ 09/Apr/22 ] |
|
Author: {'name': 'Jess Balint', 'email': 'jbalint@gmail.com', 'username': 'jbalint'} Message: |
| Comment by Bruce Lucas (Inactive) [ 15/Feb/22 ] |
|
We should also consider whether these should go in FTDC, which will be the case if they are included in serverStatus by default. Even though it's not the primary use case, it would be helpful for support if they did. But histograms often have a lot of content, so maybe we could think about a subset that would be especially useful for inclusion in FTDC.

Regarding histograms, I don't know if it's the case here, but we've often found them to have limited diagnostic value relative to the FTDC space they require, and averages are just as useful without overloading FTDC. For example, we don't include query latency histograms in FTDC. Instead we include cumulative total query time and cumulative query count, from which t2 can compute average latency over any time period. I wonder if such an approach could be useful here. |
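
The cumulative-counter approach described in the comment above can be sketched as follows. This is a minimal illustration in Python, not the server's or t2's implementation: the `Sample` type and the field names `total_micros` and `count` are hypothetical stand-ins for a cumulative multi-planning time and a cumulative multi-planning count, and do not correspond to actual serverStatus field names. Given two samples of such counters, the average over the interval between them is the difference in totals divided by the difference in counts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    """One snapshot of two cumulative counters (hypothetical names)."""
    total_micros: int  # cumulative time spent, in microseconds
    count: int         # cumulative number of operations


def window_average(earlier: Sample, later: Sample) -> Optional[float]:
    """Average time per operation between two snapshots.

    Returns None when no operations completed in the window, or when the
    counters went backwards (for example after a process restart).
    """
    delta_count = later.count - earlier.count
    delta_total = later.total_micros - earlier.total_micros
    if delta_count <= 0 or delta_total < 0:
        return None
    return delta_total / delta_count


# Usage: any two snapshots define a window, so one pair of counters
# supports averages over arbitrary time periods.
samples = [Sample(1_000, 10), Sample(4_000, 25), Sample(4_600, 27)]
print(window_average(samples[0], samples[1]))  # first interval only
print(window_average(samples[0], samples[2]))  # whole observed span
```

Only two integers are stored per snapshot, which is the space saving relative to a histogram; the trade-off is that the shape of the distribution (for example a long tail of slow multi-planning runs) is not recoverable from the average alone.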