[SERVER-60063] Log server discovery times Created: 17/Sep/21 Updated: 29/Oct/23 Resolved: 25/Jan/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 5.3.0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Kevin Arhelger | Assignee: | Rachita Dhawan |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | sharding-nyc-subteam2 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Minor Change | ||||||||
| Sprint: | Sharding 2022-01-24, Sharding 2022-02-07 | ||||||||
| Participants: | |||||||||
| Story Points: | 2 | ||||||||
| Description |
|
For both mongos and mongod, it would be helpful to have an indicator of how long server discovery is taking. For example, after a replica set election, knowing how long it took a mongos to discover the new primary is very useful for diagnosing issues; this could be logged as "It took X milliseconds to find a valid primary". This should measure the time from when an active primary is no longer detected until a new primary is discovered. Ideally, a single log line reports
Acceptance criteria: the server description record in the topology change message includes the amount of time spent in the previous topology state. |
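The acceptance criteria can be illustrated with a minimal Python sketch, assuming per-host tracking of server type. This is not the server's C++ implementation: the class TopologyStateTimer, its method, and the message text "Server description changed" are hypothetical. The server type names (RSPrimary, Unknown) follow the Server Discovery and Monitoring specification. The sketch remembers when each host entered its current state and, on a type change, returns a record whose attr.durationMillis is the time spent in the previous state.

```python
import time
from typing import Callable, Dict, Optional


class TopologyStateTimer:
    """Hypothetical tracker: how long each host stayed in its previous server type."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock                   # returns seconds; injectable for testing
        self._state: Dict[str, str] = {}      # host -> current server type
        self._entered_at: Dict[str, float] = {}  # host -> time it entered that type

    def on_server_description(self, host: str, server_type: str) -> Optional[dict]:
        """Record a new description; return a log record only if the type changed."""
        now = self._clock()
        previous = self._state.get(host)
        if previous == server_type:
            return None  # no topology change, nothing to log
        record = None
        if previous is not None:
            # Time spent in the previous state, in whole milliseconds.
            elapsed_ms = round((now - self._entered_at[host]) * 1000)
            record = {
                "msg": "Server description changed",  # hypothetical message text
                "attr": {
                    "host": host,
                    "previousType": previous,
                    "newType": server_type,
                    "durationMillis": elapsed_ms,
                },
            }
        self._state[host] = server_type
        self._entered_at[host] = now
        return record
```

For example, a host that goes from RSPrimary to Unknown at t = 10 s and back to RSPrimary at t = 12.5 s yields a second record with durationMillis 2500, which is the gap during which that host was not seen as primary.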
| Comments |
| Comment by Githook User [ 25/Jan/22 ] |
|
Author: {'name': 'Rachita Dhawan', 'email': 'rachita.dhawan@gmail.com', 'username': 'racdhawan'}
Message: |
| Comment by Rachita Dhawan [ 16/Jan/22 ] |
| Comment by Eric Sedor [ 01/Oct/21 ] |
|
Triage discussed this and I spoke with Kevin offline. We agree that, at a minimum, it will be very helpful to have log lines with .attr.durationMillis that describe gaps in a node's understanding of another server. I've edited the description to this end. kevin.arhelger, bruce.lucas: can you let me know if I should amend the description further? |
| Comment by Bruce Lucas (Inactive) [ 21/Sep/21 ] |
|
Would this be better as a log line rather than a serverStatus metric? Generally, serverStatus is best for things that happen frequently or repeatedly and can be captured by cumulative counters, but these seem like (mostly) one-time events. Also, a log line for such events should contain a durationMillis attr, which will make it show up prominently on charts of such durations and may make it easy to spot correlations with other issues. |
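Since MongoDB 4.4, server logs are structured: one JSON object per line, with event-specific fields under "attr". The sketch below shows how a tool might pick out entries carrying attr.durationMillis, which is what makes such events easy to chart. The function name slow_events and the threshold parameter are hypothetical, and this is not how any particular charting tool is implemented.

```python
import json
from typing import Iterable, Iterator


def slow_events(lines: Iterable[str], threshold_ms: float) -> Iterator[dict]:
    """Yield structured log entries whose attr.durationMillis >= threshold_ms."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        attr = entry.get("attr")
        if not isinstance(attr, dict):
            continue  # no attributes, so no duration to report
        duration = attr.get("durationMillis")
        if isinstance(duration, (int, float)) and duration >= threshold_ms:
            yield entry
```

Filtering on a single well-known attribute name, rather than parsing message text, is what lets one tool surface durations from many different event types.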