[DOCS-15235] [SERVER] Add serverStatus metrics to measure multi-planning performance Created: 12/Apr/22  Updated: 13/Nov/23  Resolved: 02/Sep/22

Status: Closed
Project: Documentation
Component/s: manual, Server
Affects Version/s: None
Fix Version/s: 6.0.0-rc0, 5.0.9, 4.4.15, Server_Docs_20231030, Server_Docs_20231106, Server_Docs_20231105, Server_Docs_20231113

Type: Task Priority: Major - P3
Reporter: Backlog - Core Eng Program Management Team Assignee: Dave Cuthbert (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
backported by DOCS-15278 [BACKPORT] [v5.0] Add serverStatus me... Backlog
Documented
documents SERVER-63642 Add serverStatus metrics to measure m... Closed
Participants:
Days since reply: 1 year, 22 weeks, 1 day ago
Epic Link: DOCSP-22649

 Description   

From: SERVER-63642


Original Downstream Change Summary

Adds new multi-planner histograms and aggregate metrics to serverStatus. Only the aggregate metrics are included in FTDC.

Description of Linked Ticket

SERVER-62150 describes a scenario where SBE multi-planning can be slow relative to the classic engine's multi-planning implementation. We implemented SERVER-62981 in order to mitigate this issue, and also have proposed SERVER-63641 as an additional improvement. In order to make sure that customers are experiencing good SBE multi-planner performance, we should add metrics to serverStatus. Before implementing this ticket, we need to agree on exactly what metrics to capture and how they will be exposed in serverStatus. The current proposal is to collect histograms of both the number of storage reads performed during SBE multi-planning and the overall wall clock time spent multi-planning.

We may wish to collect similar information for the classic multi-planner as well as the SBE multi-planner. There are known scenarios in which the classic multi-planner can take a long time to complete. In particular, see SERVER-31078.

The intended audience of these metrics is query engineering and query product management. We want to be able to analyze the performance of multi-planning across the Atlas fleet in order to inform our decision making about future improvements to the server. It's probable that these metrics would also be useful in support scenarios (e.g. seeing if a customer is getting a lot of queries which take a long time to multi-plan), but this is not the primary use case.



 Comments   
Comment by Githook User [ 06/Sep/22 ]

Author:

{'name': 'Dave Cuthbert', 'email': '69165704+davemungo@users.noreply.github.com', 'username': 'davemungo'}

Message: DOCS-15235 add server status metrics v6.0 (#1740)

Comment by Githook User [ 02/Sep/22 ]

Author:

{'name': 'Dave Cuthbert', 'email': '69165704+davemungo@users.noreply.github.com', 'username': 'davemungo'}

Message: DOCS-15235 add server status metrics v6.0 (#1740)

Comment by Jess Balint [ 18/Aug/22 ]

The idea (iirc) is to surface them in last ping data. Here are the descriptions from the code:

/**
 * Aggregation of the total number of microseconds spent (in the classic multiplanner).
 */
CounterMetric classicMicrosTotal("query.multiPlanner.classicMicros");
 
/**
 * Aggregation of the total number of "works" performed (in the classic multiplanner).
 */
CounterMetric classicWorksTotal("query.multiPlanner.classicWorks");
 
/**
 * Aggregation of the total number of invocations (of the classic multiplanner).
 */
CounterMetric classicCount("query.multiPlanner.classicCount");
 
/**
 * An element in this histogram is the number of microseconds spent in an invocation (of the
 * classic multiplanner).
 */
HistogramServerStatusMetric classicMicrosHistogram("query.multiPlanner.histograms.classicMicros",
                                                   HistogramServerStatusMetric::pow(11, 1024, 4));
 
/**
 * An element in this histogram is the number of "works" performed during an invocation (of the
 * classic multiplanner).
 */
HistogramServerStatusMetric classicWorksHistogram("query.multiPlanner.histograms.classicWorks",
                                                  HistogramServerStatusMetric::pow(9, 128, 2));
 
/**
 * An element in this histogram is the number of plans in the candidate set of an invocation (of the
 * classic multiplanner).
 */
HistogramServerStatusMetric classicNumPlansHistogram(
    "query.multiPlanner.histograms.classicNumPlans", HistogramServerStatusMetric::pow(5, 2, 2));

 
/**
 * Aggregation of the total number of microseconds spent (in SBE multiplanner).
 */
CounterMetric sbeMicrosTotal("query.multiPlanner.sbeMicros");
 
/**
 * Aggregation of the total number of reads done (in SBE multiplanner).
 */
CounterMetric sbeNumReadsTotal("query.multiPlanner.sbeNumReads");
 
/**
 * Aggregation of the total number of invocations (of the SBE multiplanner).
 */
CounterMetric sbeCount("query.multiPlanner.sbeCount");
 
/**
 * An element in this histogram is the number of microseconds spent in an invocation (of the SBE
 * multiplanner).
 */
HistogramServerStatusMetric sbeMicrosHistogram("query.multiPlanner.histograms.sbeMicros",
                                               HistogramServerStatusMetric::pow(11, 1024, 4));
 
/**
 * An element in this histogram is the number of reads performed during an invocation (of the SBE
 * multiplanner).
 */
HistogramServerStatusMetric sbeNumReadsHistogram("query.multiPlanner.histograms.sbeNumReads",
                                                 HistogramServerStatusMetric::pow(9, 128, 2));
 
/**
 * An element in this histogram is the number of plans in the candidate set of an invocation (of the
 * SBE multiplanner).
 */
HistogramServerStatusMetric sbeNumPlansHistogram("query.multiPlanner.histograms.sbeNumPlans",
                                                 HistogramServerStatusMetric::pow(5, 2, 2));

Let me know if you need any further clarification.
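
For readers working out the bucket layout implied by the declarations above, here is a minimal sketch. It assumes that `HistogramServerStatusMetric::pow(count, start, base)` produces `count` lower bounds beginning at `start`, each one `base` times the previous. That reading is not confirmed anywhere in this ticket. `powBounds` and `bucketIndex` are hypothetical helpers written for illustration and are not server code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for HistogramServerStatusMetric::pow(count, start, base).
// Assumption: the result is `count` lower bounds: start, start*base, start*base^2, ...
std::vector<std::uint64_t> powBounds(std::size_t count, std::uint64_t start, std::uint64_t base) {
    assert(start >= 1 && base >= 2);  // otherwise the bounds would not be strictly increasing
    std::vector<std::uint64_t> bounds;
    bounds.reserve(count);
    std::uint64_t value = start;
    for (std::size_t i = 0; i < count; ++i) {
        bounds.push_back(value);
        value *= base;
    }
    return bounds;
}

// Returns how many lower bounds are <= sample. A result of 0 means the sample is
// below the first bound (assumed here to land in an implicit first bucket).
std::size_t bucketIndex(const std::vector<std::uint64_t>& bounds, std::uint64_t sample) {
    auto it = std::upper_bound(bounds.begin(), bounds.end(), sample);
    return static_cast<std::size_t>(it - bounds.begin());
}
```

Under that assumption, `powBounds(5, 2, 2)` gives 2, 4, 8, 16, 32 for the number-of-plans histograms, and `powBounds(11, 1024, 4)` ends at 1073741824 microseconds for the timing histograms.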

Comment by Jess Balint [ 18/Aug/22 ]

I didn't find any good reference but I can provide one here. The "query.multiPlanner" is the new sub-object in the metrics.

"query" : {
  "planCacheTotalSizeEstimateBytes" : NumberLong(0),
  "updateOneOpStyleBroadcastWithExactIDCount" : NumberLong(0),
  "multiPlanner" : {
    "classicCount" : NumberLong(0),
    "classicMicrosTotal" : NumberLong(0),
    "classicWorksTotal" : NumberLong(0),
    "sbeCount" : NumberLong(0),
    "sbeMicrosTotal" : NumberLong(0),
    "sbeNumReadsTotal" : NumberLong(0),
    "classicMicros" : {
      "(-inf, 0)" : { "count" : NumberLong(0) },
      "[0, 100)" : { "count" : NumberLong(0) },
      "[100, 1000)" : { "count" : NumberLong(0) },
      "[1000, 10000)" : { "count" : NumberLong(0) },
      "[10000, inf)" : { "count" : NumberLong(0) },
      "totalCount" : NumberLong(0)
    },
    "classicNumPlans" : {
      "(-inf, 0)" : { "count" : NumberLong(0) },
      "[0, 100)" : { "count" : NumberLong(0) },
      "[100, 1000)" : { "count" : NumberLong(0) },
      "[1000, 10000)" : { "count" : NumberLong(0) },
      "[10000, inf)" : { "count" : NumberLong(0) },
      "totalCount" : NumberLong(0)
    },
    "classicWorks" : {
      "(-inf, 0)" : { "count" : NumberLong(0) },
      "[0, 100)" : { "count" : NumberLong(0) },
      "[100, 1000)" : { "count" : NumberLong(0) },
      "[1000, 10000)" : { "count" : NumberLong(0) },
      "[10000, inf)" : { "count" : NumberLong(0) },
      "totalCount" : NumberLong(0)
    },
    "sbeMicros" : {
      "(-inf, 0)" : { "count" : NumberLong(0) },
      "[0, 100)" : { "count" : NumberLong(0) },
      "[100, 1000)" : { "count" : NumberLong(0) },
      "[1000, 10000)" : { "count" : NumberLong(0) },
      "[10000, inf)" : { "count" : NumberLong(0) },
      "totalCount" : NumberLong(0)
    },
    "sbeNumPlans" : {
      "(-inf, 0)" : { "count" : NumberLong(0) },
      "[0, 100)" : { "count" : NumberLong(0) },
      "[100, 1000)" : { "count" : NumberLong(0) },
      "[1000, 10000)" : { "count" : NumberLong(0) },
      "[10000, inf)" : { "count" : NumberLong(0) },
      "totalCount" : NumberLong(0)
    },
    "sbeNumReads" : {
      "(-inf, 0)" : { "count" : NumberLong(0) },
      "[0, 100)" : { "count" : NumberLong(0) },
      "[100, 1000)" : { "count" : NumberLong(0) },
      "[1000, 10000)" : { "count" : NumberLong(0) },
      "[10000, inf)" : { "count" : NumberLong(0) },
      "totalCount" : NumberLong(0)
    }
  },
  "queryExecutionEngine" : {
    "aggregate" : {
      "classicHybrid" : NumberLong(0),
      "classicOnly" : NumberLong(0),
      "sbeHybrid" : NumberLong(10),
      "sbeOnly" : NumberLong(0)
    },
    "find" : {
      "classic" : NumberLong(0),
      "sbe" : NumberLong(0)
    }
  }
},
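
Because the count and total counters in the sample above are cumulative, one way to use them is to difference two samples and divide. The sketch below shows this for mean microseconds per multi-planner invocation. The `MultiPlannerTotals` struct and the function name are hypothetical. How the values are read out of serverStatus is left out, and the field names in the struct are illustrative rather than the exact metric names.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical snapshot of two cumulative counters taken from one serverStatus sample,
// for example the classic or SBE invocation count and its microseconds total.
struct MultiPlannerTotals {
    std::uint64_t count;        // total multi-planner invocations so far
    std::uint64_t microsTotal;  // total microseconds spent multi-planning so far
};

// Mean microseconds per invocation between two samples.
// Returns nullopt when there were no new invocations or when a counter went
// backwards (for example after a restart), since the difference is then meaningless.
std::optional<double> meanMicrosPerInvocation(const MultiPlannerTotals& earlier,
                                              const MultiPlannerTotals& later) {
    if (later.count <= earlier.count || later.microsTotal < earlier.microsTotal) {
        return std::nullopt;
    }
    const double invocations = static_cast<double>(later.count - earlier.count);
    const double micros = static_cast<double>(later.microsTotal - earlier.microsTotal);
    return micros / invocations;
}
```

The same differencing applies to the works and reads totals. The histograms remain the better tool for spotting a small number of very slow invocations, which a mean can hide.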

Comment by Jess Balint [ 10/Aug/22 ]

dave.cuthbert@mongodb.com Sorry for the delay. We could document the new metrics in https://www.mongodb.com/docs/manual/reference/command/serverStatus/

Comment by Education Bot [ 04/May/22 ]

Fix Version updated for upstream SERVER-63642:
6.0.0-rc0, 5.0.9, 4.4.15

Comment by Jess Mokrzecki [ 27/Apr/22 ]

Fix Version updated for upstream SERVER-63642:
4.4.14, 6.0.0-rc0, 5.0.9

Comment by Jess Mokrzecki [ 25/Apr/22 ]

Fix Version updated for upstream SERVER-63642:
6.0.0-rc0, 5.0.9

Comment by Jess Mokrzecki [ 12/Apr/22 ]

Fix Version updated for upstream SERVER-63642:
6.0.0-rc0

Generated at Thu Feb 08 08:12:21 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.