[SERVER-31368] Log time spent waiting for other shards in merge cursors aggregation stage Created: 03/Oct/17 Updated: 30/Oct/23 Resolved: 23/Apr/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework, Diagnostics |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.0-rc5, 4.7.0, 4.2.10 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | David Percy |
| Resolution: | Fixed | Votes: | 30 |
| Labels: | SWDI, qopt-team, storch | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||||||||||||||
| Backport Requested: |
v4.4, v4.2, v4.0
|
||||||||||||||||||||||||||||||||||||||||
| Sprint: | Query 2019-02-11, Query 2019-05-06, Query 2019-11-04, Query 2019-11-18, Query 2019-12-30, Query 2020-01-13, Query 2020-01-27, Query 2020-02-10, Query 2020-02-24, Query 2020-03-09, Query 2020-03-23, Query 2020-04-06, Query 2020-04-20, Query 2020-05-04 | ||||||||||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||||||||||
| Case: | (copied to CRM) | ||||||||||||||||||||||||||||||||||||||||
| Linked BF Score: | 35 | ||||||||||||||||||||||||||||||||||||||||
| Description |
|
Aggregation queries using $mergeCursors communicate with other shards and therefore can be affected by communication issues or issues on other nodes. In the following example the query ran for a very long time but did not spend any time acquiring locks and did not yield. Possibly it was waiting for a reponse from one or more of the other hosts, but there is no indication of that in the log message. It would be helpful for diagnosis if the slow query report indicated how much time was spent waiting for each host.
|
| Comments |
| Comment by Githook User [ 14/Sep/20 ] |
|
Author: {'name': 'David Percy', 'email': 'david.percy@mongodb.com', 'username': 'dpercy'}Message: |
| Comment by Githook User [ 09/Sep/20 ] |
|
Author: {'name': 'David Percy', 'email': 'david.percy@mongodb.com', 'username': 'dpercy'}Message: This was only failing on Windows, so I suspect there was a \r character at |
| Comment by Githook User [ 09/Sep/20 ] |
|
Author: {'name': 'David Percy', 'email': 'david.percy@mongodb.com', 'username': 'dpercy'}Message: |
| Comment by Githook User [ 08/Sep/20 ] |
|
Author: {'name': 'David Percy', 'email': 'david.percy@mongodb.com', 'username': 'dpercy'}Message: |
| Comment by Githook User [ 07/May/20 ] |
|
Author: {'name': 'David Percy', 'email': 'david.percy@mongodb.com', 'username': 'dpercy'}Message: |
| Comment by Githook User [ 23/Apr/20 ] |
|
Author: {'name': 'David Percy', 'email': 'david.percy@mongodb.com', 'username': 'dpercy'}Message: |
| Comment by Bruce Lucas (Inactive) [ 06/May/19 ] |
With a couple exceptions the most used convention in mongod seems to be "...Micros" and "...Millis"; should we follow that here? |
| Comment by Bruce Lucas (Inactive) [ 31/Jan/19 ] |
|
Yes, it should be logged at level 0 so we can do RCAs on events that may not be possible to reproduce. |
| Comment by Bernard Gorman [ 30/Jan/19 ] |
|
david.storch we'll have to do the work to record this metric in either case, so I don't see any disadvantage to making it part of the regular slow query line. Putting it at LOG(1) also means that TSEs would almost certainly need to ask users to increase their logLevel and re-run the aggregation in order to generate the line, which may not be feasible if the aggregation in question was extremely long-running or disruptive to their system. |
| Comment by David Storch [ 30/Jan/19 ] |
|
bernard.gorman will this be part of the regular slow query log line, or will we introduce a separate log line at logLevel 1 with this information? |
| Comment by David Storch [ 06/Oct/17 ] |
|
bruce.lucas, understood. I imagine that when we implement slow query logging on mongos we will have to solve the problem of how to log information about the latencies associated with each shard. The information should be presented in the same way irrespective of whether the optimization logic decides to merge the result set on mongos or on one of the shards. So "Log slow queries on mongos" is really a bit of a misnomer. A more appropriate name would be "Log slow queries in the sharded query execution path". |
| Comment by Bruce Lucas (Inactive) [ 06/Oct/17 ] |
|
I noticed this was made part of "Log slow queries on mongos", but just to be clear, this particular issue is about what we log on mongod, not on mongos. |