[SERVER-5477] when sharded, no need to merge groups if $group _id is the shard key or original document _id Created: 02/Apr/12  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Daniel Pasette (Inactive) Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 11
Labels: asya, neweng, optimization, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-42160 $group stages that use a DISTINCT_SCA... Backlog
depends on SERVER-41750 Refactor renamedPaths() helpers to su... Closed
Duplicate
is duplicated by SERVER-22912 Query Optimizer Closed
Related
related to SERVER-56583 Push $setWindowFields to shards when ... Backlog
is related to SERVER-27115 Track fields renamed by $project in a... Closed
is related to SERVER-28942 sort by shard key or prefix of shard ... Backlog
is related to SERVER-55200 DISTINCT_SCAN not used for $sort+$mat... Backlog
Assigned Teams:
Query Optimization
Backwards Compatibility: Fully Compatible
Sprint: QuInt 8 08/28/15, Query 2019-06-17, Query 2019-07-01, Query 2019-07-15, Query 2019-07-29, Query 2019-08-12
Participants:
Case:

 Description   

Copied from SERVER-4961:

On sharded environment, using early grouping, besides the use of an index, it would be nice that we be able to avoid the mongos regrouping process.

I'll try to explain that:

  * result_node1: [
     {
       id: "value1",
       totalcount: 50
     },
     {
       id: "value2",
       totalcount: 100
     },
   ]
 * result_node2: [
     {
       id: "value1",
       totalcount: 60
     }
   ]

The real results(after mongos regroup) must looks like:

 [
     {
       id: "value1",
       totalcount: 110
     },
     {
       id: "value2",
       totalcount: 100
     },
 ]

But, in some cases, mongos regrouping process is nonsense since the grouping key is same as sharding key. So, never got same group key from different shards.

So, the prior example, now looks like:

  * result_node1: [
     {
       id: "value1",
       totalcount: 110
     }
   ]
 * result_node2: [
     {
       id: "value2",
       totalcount: 100
     }
   ]

The real results must looks like:

 [
     {
       id: "value1",
       totalcount: 110
     },
     {
       id: "value2",
       totalcount: 100
     },
 ]

So, the point is mongos regrouping process is a waste of time when you group using same key as sharding key.



 Comments   
Comment by Charlie Swanson [ 16/Sep/15 ]

We've taken a look at how we might achieve this, and the changes required are non-trivial. After some discussion, we've decided this will not be completed for 3.2, and we will re-prioritize when planning for the next release.

What complicates this is that another stage such as a $project stage could modify the _id or shard key of a document, e.g. the $group stage could not be performed entirely on the shards for the following pipeline.

db.coll.aggregate([
    {$project: {_id: {$literal: 1}}},
    {$group: {_id: '$_id'}}
])

While this is still a very useful optimization, we believe several of the new expressions capable of operating on arrays available in the $project stage should reduce the need for unwinding and then regrouping, in turn reducing the need for this optimization. e.g. SERVER-9625, SERVER-4589, SERVER-8141, SERVER-10626, SERVER-14872.

Comment by Jon Rangel (Inactive) [ 10/Apr/15 ]

This optimization should also support hashed shard keys.

e.g. if sharding on {foo:"hashed"}, grouping on "foo" should employ the optimization.

Comment by Asya Kamsky [ 08/Apr/15 ]

The same optimization can be extended if $group is done on the original document _id field (as would be the case if you $unwind and $group by _id to process some array in the document).

Comment by Samuel García Martínez [ 29/Aug/12 ]

Hi. I submitted a pull request for this issue. I hope it helps.

https://github.com/mongodb/mongo/pull/294

Comment by Ian Whalen (Inactive) [ 24/Aug/12 ]

@samuel, the first step is to fill out the contributor agreement - http://www.10gen.com/contributor - and then open a pull request at https://github.com/mongodb/mongo/pulls

Comment by Samuel García Martínez [ 23/Aug/12 ]

I developed a fix for this issue. Is there any process or prerrequisites to do a pull request on Github with this fix?

To give this more accuracy, $group _id can be a superset of shardkey too.

Comment by Samuel García Martínez [ 16/Apr/12 ]

Since in this case mongos regrouping process isn't needed, mongos shouldn't fetch entire resultset, sending limit/skip to shards (if there is no sort operation).

Generated at Thu Feb 08 03:09:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.