[SERVER-56361] Better FTDC diagnostics of RSM, including server-side isMaster handling Created: 26/Apr/21 Updated: 29/Oct/23 Resolved: 02/Jun/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.0.24 |
| Fix Version/s: | 4.0.25 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Andrew Shuvalov (Inactive) | Assignee: | Andrew Shuvalov (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | post-rc0 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v5.0 |
| Participants: | |
| Linked BF Score: | 180 |
| Description |
|
We need to corner possible problems from the HELP ticket by detecting stalls in topology-change processing on both the client and server side. The main target of this diagnostics work is to point to the areas where processing could be substantially delayed, using "in flight" counters: counters that increment on entry to a code area and decrement (possibly with latency attached) when control exits the block. This is something that cannot be properly addressed with logging without being unnecessarily verbose. The data that FTDC can display easily is: how many threads are blocked waiting to enter a code section related to RSM (on both the server and client side), and the average latency for the threads that have already left that section. |
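The "in flight" counter pattern described above could be sketched roughly as follows. This is a hypothetical illustration, not the actual server implementation; the names `SectionStats` and `SectionGuard` are invented for this sketch.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical sketch of the "in flight" counter pattern: increment on
// entry to the instrumented section, decrement on exit, and accumulate
// latency so a periodic collector can report both the number of threads
// currently inside the section and the average time spent there.
struct SectionStats {
    std::atomic<int64_t> inFlight{0};            // threads currently inside
    std::atomic<int64_t> completed{0};           // threads that have exited
    std::atomic<int64_t> totalLatencyMicros{0};  // sum of per-thread latencies

    double averageLatencyMicros() const {
        int64_t n = completed.load();
        return n ? static_cast<double>(totalLatencyMicros.load()) / n : 0.0;
    }
};

// RAII guard: the counter is maintained correctly even on early return
// or exception, which is what makes the metric trustworthy.
class SectionGuard {
public:
    explicit SectionGuard(SectionStats& stats)
        : _stats(stats), _start(std::chrono::steady_clock::now()) {
        _stats.inFlight.fetch_add(1);
    }
    ~SectionGuard() {
        auto micros = std::chrono::duration_cast<std::chrono::microseconds>(
                          std::chrono::steady_clock::now() - _start)
                          .count();
        _stats.totalLatencyMicros.fetch_add(micros);
        _stats.inFlight.fetch_sub(1);
        _stats.completed.fetch_add(1);
    }

private:
    SectionStats& _stats;
    std::chrono::steady_clock::time_point _start;
};
```

A periodic collector would then snapshot `inFlight` and `averageLatencyMicros()` at each FTDC sample interval.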
| Comments |
| Comment by Githook User [ 02/Jun/21 ] | |
|
Author: {'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}Message: | |
| Comment by Githook User [ 01/Jun/21 ] | |
|
Author: {'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}Message: | |
| Comment by Andrew Shuvalov (Inactive) [ 28/May/21 ] | |
|
However, this will require some optional changes in the FTDC GUI after the 5.0 port is made. |
| Comment by Githook User [ 28/May/21 ] | |
|
Author: {'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}Message: | |
| Comment by Andrew Shuvalov (Inactive) [ 14/May/21 ] | |
|
I decreased the period over which max latency is calculated from 10 sec to 1 sec. We should not worry that this will create "flickering" and cause the chart to miss an incident, because the previous max latency is erased only by new data. If RSM is not making any new requests for, say, 10 seconds, the last max latency will "stick" for those 10 seconds because there is no new data. If RSM is very busy and making a lot of requests, the max latency will be erased quickly, but I don't think that's a problem because I also added some extra backup logging. | |
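The "erased by new data only" behavior described above could be sketched like this. This is a hypothetical illustration of the windowing semantics, not the actual server code; `WindowedMax` is an invented name.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of a max-latency metric over a 1-second window:
// the stored maximum is overwritten only when a new sample arrives in a
// later window, so with no new requests the last max "sticks" on the
// chart instead of dropping to zero.
class WindowedMax {
public:
    explicit WindowedMax(int64_t windowMillis = 1000) : _window(windowMillis) {}

    void record(int64_t nowMillis, int64_t latencyMicros) {
        if (nowMillis - _windowStart >= _window) {
            // New window: the previous max is erased by this new sample.
            _windowStart = nowMillis;
            _max = latencyMicros;
        } else {
            _max = std::max(_max, latencyMicros);
        }
    }

    // Persists unchanged until new data arrives, however long that takes.
    int64_t max() const { return _max; }

private:
    int64_t _window;
    int64_t _windowStart = 0;
    int64_t _max = 0;
};
```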
| Comment by Bruce Lucas (Inactive) [ 28/Apr/21 ] | |
|
Some general patterns we use:
Distribution percentiles are probably overkill, and also don't sample well when charting at various time scales, so we don't generally do it. There is a serverStatus.repl section that might be a good home for this, although I see there may be some relationship to sharding (e.g. the "shard01" metric in your example?), and I noticed that some of your metrics relate to the "hello" command. We already record some metrics related to this; I don't know if they are sufficient for what you want to do. Also, some of your metrics seem to be per-shard? We can't record per-shard information in FTDC because some customers have a very large number of shards. | |
| Comment by Andrew Shuvalov (Inactive) [ 28/Apr/21 ] | |
|
The generic task description: suppose we have a congested block of code that requires access to some limited resource. Some concurrent jobs are trying to execute this block - these could be multiple threads or tasks waiting in a queue. We want to collect and plot metrics of two different kinds:
The concrete example above - I was instrumenting the code around ReplicaSetMonitor src/mongo/client/replica_set_monitor.cpp to register delays in obtaining the current primary for a given ReplicaSet. Unfortunately I'm not using the proper patterns, as I'm not yet familiar enough with the FTDC code base. I created my own implementation by subclassing FTDCCollectorInterface and registering it with:
inside registerMongoSCollectors() in src/mongo/db/ftdc/ftdc_mongos.cpp. I think the proper way should be somewhere in ServerStatus, but I can't figure out where exactly ServerStatus is bridged to FTDC, how its hierarchical structure is mapped to flat FTDC, and how exactly the periodic collectors fetch data from server status. The second question is how and where we handle a time series of latencies. I see that we have a few "average"-style metrics, so we do have some statistical processing over time periods somewhere. It's probably no different from any Gaussian or Gaussian-like stream of discrete data, but latency distribution is the most common example. Ideally, I would also collect the 95th percentile; assuming accuracy is not important, the rough algorithm could be reasonably simple:
| |
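One "reasonably simple" rough approach, in the spirit of the comment above (the concrete algorithm steps were not preserved in this export), is a bounded buffer of recent samples with a selection step. This is a hypothetical sketch; `ApproxPercentile` is an invented name, and accuracy is deliberately rough.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a simple approximate percentile: keep a bounded
// ring of recent latency samples and select the requested percentile with
// std::nth_element. Accuracy degrades gracefully as old samples are
// overwritten, which is acceptable when exactness is not important.
class ApproxPercentile {
public:
    explicit ApproxPercentile(size_t capacity = 1024) : _capacity(capacity) {}

    void record(int64_t latencyMicros) {
        if (_samples.size() < _capacity) {
            _samples.push_back(latencyMicros);
        } else {
            _samples[_next++ % _capacity] = latencyMicros;  // overwrite oldest
        }
    }

    // p in [0, 1], e.g. 0.95 for the 95th percentile.
    int64_t percentile(double p) const {
        if (_samples.empty())
            return 0;
        std::vector<int64_t> copy = _samples;  // work on a snapshot
        size_t idx = static_cast<size_t>(p * (copy.size() - 1));
        std::nth_element(copy.begin(), copy.begin() + idx, copy.end());
        return copy[idx];
    }

private:
    size_t _capacity;
    size_t _next = 0;
    std::vector<int64_t> _samples;
};
```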
| Comment by Andrew Shuvalov (Inactive) [ 27/Apr/21 ] | |
|
Example of metrics generated with my experimental code (explanation in the next comment): | |
| Comment by Andrew Shuvalov (Inactive) [ 27/Apr/21 ] | |
|
bruce.lucas I updated the description, still keeping it not very specific and open for discussion. We got feedback from jonathan.streets: So I would like to redo the implementation above to be compliant (the initial version was used in the help ticket for troubleshooting). I will work offline with bruce.lucas to get the necessary guidance for a proper solution. |
| Comment by Bruce Lucas (Inactive) [ 27/Apr/21 ] | |
|
Can you please post a description of the proposed FTDC changes? |