[SERVER-17716] Correlate the parts of distributed ops in db.currentOp() output Created: 24/Mar/15  Updated: 06/Dec/22  Resolved: 08/Dec/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Jon Rangel (Inactive) Assignee: Backlog - Query Execution
Resolution: Won't Do Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-16265 Add query details to getmore entry in... Closed
is related to SERVER-17696 Terminate sharded queries immediately... Closed
Assigned Teams:
Query Execution
Participants:
Case:

 Description   

It would greatly help the debugging of long-running distributed operations in a sharded cluster (such as scatter-gather aggregation queries) if db.currentOp() grouped the parts of each operation together in a logical way. Currently, the operations running on the shards are displayed in the db.currentOp() output returned from a mongos, but it is difficult to tie together all the operations that originated from a single client request.

Provide a cluster-wide db.currentOp() that can be run from a mongos and clearly lists all running operations, including all parts of each op that are distributed to the shards.

For example, for a running aggregation op that is currently at the merge cursors stage, all parts of the op should be clearly displayed together and hierarchically, i.e.:

details of op on mongos
  |----> details of op on primary shard
             |----> details of op on shard 1
             |----> details of op on shard 2
             ... 
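
The hierarchy above could be rendered along the following lines. This is only a minimal sketch of the requested grouping, not existing server behavior: the `parent_opid` field, the `host` and `desc` keys, and the sample op records are all hypothetical, since db.currentOp() exposes no such correlation field (which is the gap this ticket describes).

```python
from collections import defaultdict

# HYPOTHETICAL sample records. 'parent_opid' is an assumed correlation
# field; it is not part of real db.currentOp() output.
EXAMPLE_OPS = [
    {"opid": "mongos:1", "host": "mongos", "desc": "aggregate", "parent_opid": None},
    {"opid": "shard0:10", "host": "primary shard", "desc": "merge cursors", "parent_opid": "mongos:1"},
    {"opid": "shard1:20", "host": "shard 1", "desc": "getmore", "parent_opid": "shard0:10"},
    {"opid": "shard2:30", "host": "shard 2", "desc": "getmore", "parent_opid": "shard0:10"},
]


def render_op_tree(ops):
    """Return the ops as indented lines, children listed under their parent."""
    known = {op["opid"] for op in ops}
    children = defaultdict(list)
    for op in ops:
        parent = op.get("parent_opid")
        # Ops whose parent is missing from the list are treated as roots.
        children[parent if parent in known else None].append(op)

    lines = []

    def walk(parent_id, depth):
        for op in children.get(parent_id, []):
            prefix = "" if depth == 0 else "    " * (depth - 1) + "|----> "
            lines.append(f"{prefix}{op['opid']} on {op['host']}: {op['desc']}")
            walk(op["opid"], depth + 1)

    walk(None, 0)
    return lines


if __name__ == "__main__":
    print("\n".join(render_op_tree(EXAMPLE_OPS)))
```

Given such a correlation field, the same parent-to-children map would also yield the full set of opids belonging to one client request.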



 Comments   
Comment by Kaloian Manassiev [ 15/Nov/21 ]

This seems more of an Explain-related kind of work, so assigning to Query Execution.

Comment by Antoine Hom [ 30/Oct/15 ]

Hello,

Is it possible to get an updated status on this ticket?

We are running long aggregation queries, and we sometimes need to either kill one while it is running or clean up one that has already failed.
We need this cleanup because our cluster is tailored to use all available resources for a single aggregation query, so we run only one aggregation query at a time. When one fails partially (with many ops still running on the shards), it prevents us from launching a new one until those ops time out or fail (the timeout is 1h).
In addition, we would like to be able to kill a query while it is running if it is taking too long and we want to cancel it.

To do this cleanup programmatically and correctly, we need a way to tie all the ops together from the cursor ID (or other information that we have in the client) and then kill them.

This ticket might soon be escalated by our management, as our customers are pushing hard for this.

Regards,
Antoine

Comment by Andy Schwerin [ 30/Mar/15 ]

Recursive tracking like this sounds helpful, but only for operations that last more than a few network roundtrips (such as aggregations).
