[SERVER-56361] Better FTDC diagnostics of RSM, including server-side isMaster handling Created: 26/Apr/21  Updated: 29/Oct/23  Resolved: 02/Jun/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.0.24
Fix Version/s: 4.0.25

Type: Improvement Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive) Assignee: Andrew Shuvalov (Inactive)
Resolution: Fixed Votes: 0
Labels: post-rc0
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2021-04-27 at 7.53.32 PM.png     PNG File Screen Shot 2021-05-11 at 12.23.16 PM.png    
Issue Links:
Backports
Related
related to SERVER-57449 Better FTDC diagnostics of RSM Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v5.0
Participants:
Linked BF Score: 180

 Description   

We need to corner possible problems from HELP ticket by detecting stuckness in processing topology changes both at the client and server side.

The main target of such diagnostics is to point to the areas where the processing could be substantially delayed with "in flight" counters - counters that increment on code area entry and decrement (possibly with latency attached) when the control exits the block. This is something that cannot be properly addressed with logging without be unnecessary verbose.

The data that FTDC can display easily is: how many threads are blocked waiting to enter a code section related to RSM (both server and client side) and what is the average latency for the threads that already left that section.



 Comments   
Comment by Githook User [ 02/Jun/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-56361: Better FTDC diagnostics of RSM
Branch: v4.0
https://github.com/mongodb/mongo/commit/70842af1543116b91c94072285a33f0384308b71

Comment by Githook User [ 01/Jun/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-56361: revert FTDC diagnostics
Branch: v4.0
https://github.com/mongodb/mongo/commit/8b565a174c9edf5451b6325085cebe0f1afe1e3a

Comment by Andrew Shuvalov (Inactive) [ 28/May/21 ]

However this will require some optional changes in FTDC gui, after 5.0 port is made.

Comment by Githook User [ 28/May/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-56361: Better FTDC diagnostics of RSM
Branch: v4.0
https://github.com/mongodb/mongo/commit/e4faa4d08720df067a62d8e48f66944a2f2b76cc

Comment by Andrew Shuvalov (Inactive) [ 14/May/21 ]

I decreased the period over which max latency is calculated to 1 sec, it was 10 sec before. However we should not be afraid that it will create "flickering" and miss the incident on the chart because the previous max latency will be erased by new data only. If RSM is not doing any new requests for say 10 seconds, the last max latency will "stick" for 10 seconds because there is no new data.

However if RSM is very busy and doing a lot of requests max latency will be erased quickly. But I don't think it's a problem because I also added some extra backup logging.

Comment by Bruce Lucas (Inactive) [ 28/Apr/21 ]

Some general patterns we use:

  • Number of threads doing something
    • Instantaneous counts, e.g. serverStatus.connections.active. This is easy and fast to collect, but suffers from possible sampling errors when charted
    • Cumulative time spent, e.g. the various timeAcquiringMicros for locks, flow control, etc. When differentiated and scaled this tells you the number of threads waiting. To be useful though this must be updated at least several times per second while waiting (given FTDC 1-second sample rate). Since this is a cumulative metric it is less subject to sampling errors when charted.
  • Time spent doing something, i.e. latency - see for example serverStatus.opLatencies. A cumulative count of operations and and cumulative time is recorded in serverStatus, and from this an average latency over time can be computed for charting. This works well for sub-second latencies. If the primary concern is with longer latencies, consider logging events instead using the "durationMillis" attribute.

Distribution percentiles is probably overkill, and also doesn't sample well when charting at various time scales, so we don't generally do it.

There is a serverStatus.repl section that might be a good home for this, although I see there may be some relationship to sharding (e.g. "shard01" metric in your example?) and

I noticed that some of your metrics relate to the "hello" command. We already record some metrics related to this, don't know if sufficient for what you want to do.

Also some of your metrics seem to be per-shard? We can't record per-shard information in FTDC because some customers have a very large number of shards.

Comment by Andrew Shuvalov (Inactive) [ 28/Apr/21 ]

The generic task description: suppose we have a congested block of code that requires access to some limited resource. Some concurrent jobs are trying to execute this blocks - this could be multiple threads or tasks waiting in a queue. We want to collect and plot metrics of two different kind:

  1. The current count of threads of tasks waiting before entering this block
  2. The average and possibly 95% latency of executing this block, which includes both waiting and execution time

The concrete example above - I was instrumenting the code around ReplicaSetMonitor src/mongo/client/replica_set_monitor.cpp to register delays in obtaining the current primary for a given ReplicaSet.

Unfortunately I'm not using the proper pattens as I'm not yet familiar with FTDC code base enough. I was creating my own implementation by subclassing the class FTDCCollectorInterface and registering it with:

    controller->addPeriodicCollector(stdx::make_unique<ReplicaSetMonitorStatsFTDC>());

inside registerMongoSCollectors() in src/mongo/db/ftdc/ftdc_mongos.cpp. I think the proper way should be somehow in ServerStatus, but I can't figure out where exactly ServerStatus is bridged to FTDC and how it's hierarchical structure is bridged to flat FTDC, and how exactly the periodic collectors are fetching data from server status.

The second question is how and where we handle a time-series of latencies. I see that we have few "average" kind of metrics, so we do have some statistical time periods processing somewhere. It's probably not different from any Gaussian or Gaussian-like stream of discrete data, but latency distribution is the most common example. Ideally, I would also collect 95-percentile, assuming the accuracy is not important, the rough algorithm could be reasonably simple:

  1. We collect the histogram of discrete data over 10-seconds window, this will give average and 95%. The baskets are allocated dynamically because we don't know the distribution yet
  2. After 10 seconds we move the histogram to "previous" position and create a new empty histogram. However, as we already know the distribution for the previous 10 seconds we can pre-allocate baskets the most optimal way and have fewer baskets. My hunch is that if we know the distribution in advance, 6-7 polynomial baskets will be sufficient for visual correctness. My previous experience with histograms with >20 baskets tells me this granularity doesn't give much additional information anyway.
Comment by Andrew Shuvalov (Inactive) [ 27/Apr/21 ]

Example of metrics generated with my experimental code (explanation in the next comment):

Comment by Andrew Shuvalov (Inactive) [ 27/Apr/21 ]

bruce.lucas I updated the description, still not keeping it very specific and open for discussion. We got feedback from jonathan.streets:
"1. using the existing serverStatus infrastructure and
2. we should use accumulative counters, not decaying averages. "

So I would like to redo the implementation above to be compliant (the initial version was used in help ticket for troubleshooting). I will work offline with bruce.lucas to get necessary guidance for proper solution.

Comment by Bruce Lucas (Inactive) [ 27/Apr/21 ]

Can you please post a description of the proposed FTDC changes?

Generated at Thu Feb 08 05:39:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.