[SERVER-31368] Log time spent waiting for other shards in merge cursors aggregation stage Created: 03/Oct/17  Updated: 30/Oct/23  Resolved: 23/Apr/20

Status: Closed
Project: Core Server
Component/s: Aggregation Framework, Diagnostics
Affects Version/s: None
Fix Version/s: 4.4.0-rc5, 4.7.0, 4.2.10

Type: Improvement Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: David Percy
Resolution: Fixed Votes: 30
Labels: SWDI, qopt-team, storch
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Documented
is documented by DOCS-13613 Investigate changes in SERVER-31368: ... Closed
Problem/Incident
causes SERVER-53751 findMatchingLogLines js helper functi... Closed
Related
related to SERVER-48001 Complete TODO listed in SERVER-31368 Closed
related to SERVER-46446 Log time spent waiting for remote ope... Closed
is related to SERVER-29854 Secondary members no more syncing unt... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4, v4.2, v4.0
Sprint: Query 2019-02-11, Query 2019-05-06, Query 2019-11-04, Query 2019-11-18, Query 2019-12-30, Query 2020-01-13, Query 2020-01-27, Query 2020-02-10, Query 2020-02-24, Query 2020-03-09, Query 2020-03-23, Query 2020-04-06, Query 2020-04-20, Query 2020-05-04
Participants:
Case:
Linked BF Score: 35

 Description   

Aggregation queries using $mergeCursors communicate with other shards and can therefore be affected by communication issues or by problems on other nodes. In the following example the query ran for a very long time but did not spend any time acquiring locks and did not yield. Possibly it was waiting for a response from one or more of the other hosts, but there is no indication of that in the log message. It would be helpful for diagnosis if the slow query report indicated how much time was spent waiting for each host.

2017-06-23T06:53:15.357+0200 I COMMAND  [conn44] command ... command: aggregate { aggregate: "...", pipeline: [ { $mergeCursors: [ { host: "...", ns: "...", id: 253733414549 }, ...]}, { $group: ... } keysExamined:0 docsExamined:0 numYields:0 nreturned:0 reslen:17527 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_command 2909221ms
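The fix wraps each blocking wait for a remote shard's response in a timer and reports the accumulated total on the slow-query line (the commits below do this in BlockingResultsMerger). A minimal sketch of that idea, in plain JavaScript rather than the actual server code, with the tracker and field name (`remoteOpWaitMillis`) shown here as illustrative names:

```javascript
// Hypothetical sketch, not the real BlockingResultsMerger implementation:
// time every blocking wait for a remote response and accumulate the total,
// so the slow-query log line can report how long the merge stage was stalled
// on other hosts rather than doing local work.
function makeWaitTracker() {
  let totalWaitMillis = 0;
  return {
    // timedWait() runs fn (standing in for a blocking wait on a remote
    // cursor) and adds its elapsed wall-clock time to the running total.
    timedWait(fn) {
      const start = Date.now();
      try {
        return fn();
      } finally {
        totalWaitMillis += Date.now() - start;
      }
    },
    // remoteOpWaitMillis() returns the accumulated wait time for logging.
    remoteOpWaitMillis() {
      return totalWaitMillis;
    },
  };
}
```

With this in place, a slow-query report can append the accumulated figure next to the existing counters (keysExamined, numYields, etc.), letting a reader distinguish "slow because of local work" from "slow because a remote shard was slow to respond".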



 Comments   
Comment by Githook User [ 14/Sep/20 ]

Author:

{'name': 'David Percy', 'email': 'david.percy@mongodb.com', 'username': 'dpercy'}

Message: SERVER-31368 Tag log_remote_op_wait.js
Branch: v4.2
https://github.com/mongodb/mongo/commit/220d72da13180652f4986bc65a0dd95966973dd0

Comment by Githook User [ 09/Sep/20 ]

Author:

{'name': 'David Percy', 'email': 'david.percy@mongodb.com', 'username': 'dpercy'}

Message: SERVER-31368 Make regex more permissive in log_remote_op_wait.js

This was only failing on Windows, so I suspect there was a \r character at
the end of the line. The important property to test is that the line's last
field says "42ms" or similar, so allowing zero or more whitespace at the end
seems fine.
Branch: v4.2
https://github.com/mongodb/mongo/commit/33887c37750c267f7ddfdeada7096d6abc0bf128
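The fix described in this commit can be illustrated with a small standalone regex (the pattern below is an illustrative sketch, not the exact one in log_remote_op_wait.js): anchor on the final duration field ("42ms" or similar) but allow optional trailing whitespace, which absorbs the \r that Windows leaves at the end of each line.

```javascript
// Permissive match for a slow-query line ending in a duration field.
// \s* allows zero or more trailing whitespace characters, including the
// carriage return present on Windows-formatted log lines.
const durationAtEnd = /(\d+)ms\s*$/;

// Unix-style line: matches.
console.assert(durationAtEnd.test("... protocol:op_msg 2909221ms"));
// Windows-style line with a trailing \r: also matches.
console.assert(durationAtEnd.test("... protocol:op_msg 2909221ms\r"));
```

The important property being tested is only that the line's last field is a duration, so tolerating trailing whitespace does not weaken the assertion.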

Comment by Githook User [ 09/Sep/20 ]

Author:

{'name': 'David Percy', 'email': 'david.percy@mongodb.com', 'username': 'dpercy'}

Message: SERVER-31368 Exclude log_remote_op_wait.js from last_stable
Branch: v4.2
https://github.com/mongodb/mongo/commit/46d96188b11b792f976eabe35b0f8b16d5a6f202

Comment by Githook User [ 08/Sep/20 ]

Author:

{'name': 'David Percy', 'email': 'david.percy@mongodb.com', 'username': 'dpercy'}

Message: SERVER-31368 Log time spent waiting in BlockingResultsMerger
Branch: v4.2
https://github.com/mongodb/mongo/commit/cd92e6054c21f744397d3c2f3fc8c19d91564199

Comment by Githook User [ 07/May/20 ]

Author:

{'name': 'David Percy', 'email': 'david.percy@mongodb.com', 'username': 'dpercy'}

Message: SERVER-31368 Log time spent waiting in BlockingResultsMerger
Branch: v4.4
https://github.com/mongodb/mongo/commit/ed15e4756564769d4ce98ad8910a878ea5cef639

Comment by Githook User [ 23/Apr/20 ]

Author:

{'name': 'David Percy', 'email': 'david.percy@mongodb.com', 'username': 'dpercy'}

Message: SERVER-31368 Log time spent waiting in BlockingResultsMerger
Branch: master
https://github.com/mongodb/mongo/commit/d7c8f19c0980713d38a975d5b546f9b051e19649

Comment by Bruce Lucas (Inactive) [ 06/May/19 ]

"mergeCursorWaitMS"

With a couple of exceptions, the most common convention in mongod seems to be "...Micros" and "...Millis"; should we follow that here?

Comment by Bruce Lucas (Inactive) [ 31/Jan/19 ]

Yes, it should be logged at level 0 so we can do RCAs on events that may not be possible to reproduce.

Comment by Bernard Gorman [ 30/Jan/19 ]

david.storch we'll have to do the work to record this metric in either case, so I don't see any disadvantage to making it part of the regular slow query line. Putting it at LOG(1) also means that TSEs would almost certainly need to ask users to increase their logLevel and re-run the aggregation in order to generate the line, which may not be feasible if the aggregation in question was extremely long-running or disruptive to their system.

Comment by David Storch [ 30/Jan/19 ]

bernard.gorman will this be part of the regular slow query log line, or will we introduce a separate log line at logLevel 1 with this information?

Comment by David Storch [ 06/Oct/17 ]

bruce.lucas, understood. I imagine that when we implement slow query logging on mongos we will have to solve the problem of how to log information about the latencies associated with each shard. The information should be presented in the same way irrespective of whether the optimization logic decides to merge the result set on mongos or on one of the shards. So "Log slow queries on mongos" is really a bit of a misnomer. A more appropriate name would be "Log slow queries in the sharded query execution path".

Comment by Bruce Lucas (Inactive) [ 06/Oct/17 ]

I noticed this was made part of "Log slow queries on mongos", but just to be clear, this particular issue is about what we log on mongod, not on mongos.

Generated at Thu Feb 08 04:26:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.