[DOCS-13613] Investigate changes in SERVER-31368: Log time spent waiting for other shards in merge cursors aggregation stage Created: 23/Apr/20  Updated: 13/Nov/23  Resolved: 10/Aug/21

Status: Closed
Project: Documentation
Component/s: manual, Server
Affects Version/s: None
Fix Version/s: 4.4.0-rc5, 4.7.0, 4.2.10, Server_Docs_20231030, Server_Docs_20231106, Server_Docs_20231105, Server_Docs_20231113

Type: Task Priority: Major - P3
Reporter: Backlog - Core Eng Program Management Team Assignee: Jason Price
Resolution: Done Votes: 1
Labels: None
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Issue Links:
Documented
documents SERVER-31368 Log time spent waiting for other shar... Closed
Sub-Tasks:
Key         Summary              Type      Status  Assignee
DOCS-14726  Backport to 4.4      Sub-task  Closed  Jason Price
DOCS-14727  Backport to 4.2.10   Sub-task  Closed  Jason Price
Participants:
Days since reply: 2 years, 26 weeks, 1 day ago
Epic Link: DOCSP-15042
Story Points: 3

 Description   

Downstream Change Summary

This change adds a new optional field, "remoteOpWaitMillis", to the profiler output and slow query log lines. It reports how much time the node spent waiting for results from other nodes. By comparing this value with the total "durationMillis", you can determine whether the merging node or a shard is responsible for a slow query.

The field only appears when the command is an aggregate or a getMore on an aggregation cursor, and only on the merging node.
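For example, a minimal sketch (using pymongo; the connection string and database name are placeholders, and profiling is assumed to be enabled on the merging node) that compares the remote wait against the total duration recorded by the profiler:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
profile = client["mydb"]["system.profile"]         # hypothetical database name

for op in profile.find({"remoteOpWaitMillis": {"$exists": True}}):
    total = op.get("millis", 0)              # total time the operation took on this node
    remote_wait = op["remoteOpWaitMillis"]   # time spent waiting on results from other shards
    print(f"{op.get('op')}: total={total}ms remoteWait={remote_wait}ms "
          f"localWork={total - remote_wait}ms")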

Description of Linked Ticket

Aggregation queries using $mergeCursors communicate with other shards and can therefore be affected by communication issues or by problems on other nodes. In the following example, the query ran for a very long time but did not spend any time acquiring locks and did not yield. It was possibly waiting for a response from one or more of the other hosts, but there is no indication of that in the log message. It would be helpful for diagnosis if the slow query report indicated how much time was spent waiting for each host.

2017-06-23T06:53:15.357+0200 I COMMAND  [conn44] command ... command: aggregate { aggregate: "...", pipeline: [ { $mergeCursors: [ { host: "...", ns: "...", id: 253733414549 }, ...]}, { $group: ... } keysExamined:0 docsExamined:0 numYields:0 nreturned:0 reslen:17527 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_command 2909221ms
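With such a field in place, a log scan could flag slow aggregations that spent most of their time waiting on other hosts. A rough Python sketch, assuming the 4.4+ structured (JSON) log format and that the new field is reported in "attr" alongside "durationMillis" (an assumption; field placement may vary by server version):

import json

def remote_wait_entries(log_path):
    # Yield (total, remote_wait) pairs for slow-query entries that carry both fields.
    with open(log_path) as f:
        for line in f:
            try:
                entry = json.loads(line)
            except ValueError:
                continue                      # skip any non-JSON lines
            attr = entry.get("attr", {})
            total = attr.get("durationMillis")
            remote = attr.get("remoteOpWaitMillis")
            if total and remote is not None:
                yield total, remote

for total, remote in remote_wait_entries("/var/log/mongodb/mongod.log"):  # hypothetical path
    if remote / total > 0.9:
        print(f"mostly waiting on other shards: {remote}ms of {total}ms")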

Scope of changes

Impact to Other Docs

MVP (Work and Date)

Resources (Scope or Design Docs, Invision, etc.)



 Comments   
Comment by Githook User [ 10/Aug/21 ]

Author:

{'name': 'jason-price-mongodb', 'email': 'jshfjghsdfgjsdjh@aolsdjfhkjsdhfkjsdf.com'}

Message: DOCS-13613 log time spent waiting for other shards
Branch: v4.2
https://github.com/mongodb/docs/commit/7967ebda37a592ffb7434715c7dbe67e191f517e

Comment by Githook User [ 10/Aug/21 ]

Author:

{'name': 'jason-price-mongodb', 'email': 'jshfjghsdfgjsdjh@aolsdjfhkjsdhfkjsdf.com'}

Message: DOCS-13613 Log time spent waiting for shards
Branch: v4.4
https://github.com/mongodb/docs/commit/2d85ff391273d17c08fe6d6475c0a56fe9f8534b

Comment by Githook User [ 09/Aug/21 ]

Author:

{'name': 'jason-price-mongodb', 'email': 'jshfjghsdfgjsdjh@aolsdjfhkjsdhfkjsdf.com'}

Message: DOCS-13613 Log time spent waiting for shards
Branch: master
https://github.com/mongodb/docs/commit/ddd5cefeb88c89d8356ce2c8147da789a5703627

Comment by Jeffrey Allen [ 13/May/21 ]

Hi bhaskarhnarula@gmail.com, this change is reflected in the 4.2.10 release notes. There, you will see an entry stating "SERVER-31368: Log time spent waiting for other shards in merge cursors aggregation stage".

The issue here was caused by upgrading the mongos before upgrading the shards. This procedure explains how to upgrade a sharded cluster to the latest revision of a MongoDB release: https://docs.mongodb.com/v4.2/tutorial/upgrade-revision/#upgrade-sharded-clusters.

Following these steps in the order they are presented ensures that your deployment does not behave unexpectedly when upgrading minor versions.
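One illustrative way to confirm the cluster is on a single revision (hostnames are placeholders and authentication is omitted) is to compare the version reported by the mongos with the version reported by each shard:

from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos.example.net:27017")   # hypothetical mongos address
mongos_version = mongos.admin.command("buildInfo")["version"]
print(f"mongos: {mongos_version}")

for shard in mongos.admin.command("listShards")["shards"]:
    host_field = shard["host"]               # e.g. "rsName/host1:27018,host2:27018"
    if "/" in host_field:
        rs_name, hosts = host_field.split("/", 1)
        uri = f"mongodb://{hosts}/?replicaSet={rs_name}"
    else:
        uri = f"mongodb://{host_field}"
    version = MongoClient(uri).admin.command("buildInfo")["version"]
    flag = "" if version == mongos_version else "  <-- differs from mongos"
    print(f"shard {shard['_id']}: {version}{flag}")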

Comment by Bhaskar Narula [ 13/May/21 ]

Alright, after a day of digging into this I understand that from 4.2.13 onwards, all requests going out of mongos carry recordRemoteOpWaitTime: true. Since the rest of the nodes are on versions lower than 4.2.13, they don't recognise this attribute and therefore error out.

The next hunt is to figure out a way to stop mongos from passing this attribute to the shards. Can someone please help me understand why this isn't mentioned in the release notes of 4.2.13?

Comment by Bhaskar Narula [ 13/May/21 ]

Team,

Need quick help regarding the problem below -

We run a sharded MongoDB database in a production environment. Recently, we deprecated one of the mongos instances, and the new instance spun up with the latest version, MongoDB 4.2.14. The rest of the instances (shards, mongos, and configs) are on 4.2.5.

Soon after this we started facing a problem for which we can find absolutely no documentation on the web. The problem happens only for the node running 4.2.14.

Error - Command failed with error 40415 (Location40415): 'BSON field '$mergeCursors.recordRemoteOpWaitTime' is an unknown field.' on server 10.17.9.84:27017. The full response is {"ok": 0.0, "errmsg": "BSON field '$mergeCursors.recordRemoteOpWaitTime' is an unknown field.", "code": 40415, "codeName": "Location40415", "operationTime": {"$timestamp": {"t": 1620920773, "i": 1}}, "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1620920773, "i": 1}}, "signature": {"hash": {"$binary": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "$type": "00"}, "keyId": {"$numberLong": "0"}}}}

 

We understand this is happening owing to the time taken in communication among the shards, but it wasn't the case with the lower version.

Is there a configuration which we can override? Any leads would be helpful.
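Until every member is on the same revision, one hedged application-side sketch (hypothetical names; it does not change server behavior) is to translate this specific error into a clearer hint:

from pymongo import MongoClient
from pymongo.errors import OperationFailure

client = MongoClient("mongodb://mongos.example.net:27017")   # hypothetical mongos address
coll = client["mydb"]["mycoll"]                              # hypothetical namespace

try:
    list(coll.aggregate([{"$group": {"_id": "$status", "n": {"$sum": 1}}}]))
except OperationFailure as exc:
    if exc.code == 40415 and "recordRemoteOpWaitTime" in str(exc):
        raise RuntimeError(
            "mongos appears to be newer than the shards; bring every member of the "
            "cluster to the same revision before retrying"
        ) from exc
    raise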
