[SERVER-27283] Sharded aggregations that need merging should only consider for merging the shards that have documents to contribute
Created: 05/Dec/16  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Asya Kamsky
Assignee: Backlog - Query Optimization
Resolution: Unresolved
Votes: 0
Labels: optimization, performance, zones
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-18940 Optimise sharded aggregations that ar... Closed
is related to SERVER-24815 Merging aggregation pipeline strategy... Backlog
Assigned Teams:
Query Optimization
Participants:

 Description   

Even though a scatter-gather aggregation is sent to all shards, it may be the case that only a small subset of shards have any results to contribute to the merge stage.

In those cases it would be better for performance if we considered as candidates for the merge stage only the shards that have documents to contribute.

(This would require shards to do more work on the cursor before returning it to mongos for forwarding to the selected merging shard.)
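
The selection rule described above can be sketched roughly as follows. This is only an illustration in Python, not server code: the function name, the shard names, and the idea that mongos has a per-shard document count available are all assumptions made for the example.

```python
from typing import Dict, List


def candidate_merging_shards(docs_per_shard: Dict[str, int]) -> List[str]:
    """Return the shards worth considering as the merging shard.

    docs_per_shard is a hypothetical mapping from shard name to the number
    of documents that shard would send to the merge stage.
    """
    contributors = [shard for shard, n in docs_per_shard.items() if n > 0]
    # If no shard has anything to contribute, the merge is trivial and any
    # shard will do, so fall back to the full list of shards.
    return contributors or list(docs_per_shard)


# Three shards were targeted, but only shardB matched any documents.
print(candidate_merging_shards({"shardA": 0, "shardB": 42, "shardC": 0}))
```

With these inputs only shardB remains a candidate, so the merge would run where the data already is.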



 Comments   
Comment by David Storch [ 26/Oct/17 ]

I see, thanks for the clarification! It makes sense to me that we could always choose as the merger the shard which has the largest number of results to contribute, in order to minimize the amount of data that has to be moved across the network to the merger. I think implementing this optimization properly depends on having statistics which will allow us to estimate the size of the result set on each targeted shard.
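
As a rough illustration of the "largest contributor merges" idea, here is a minimal Python sketch. It assumes per-shard result-size estimates are already available, which is exactly the statistics dependency mentioned above; the function name and shard names are hypothetical.

```python
from typing import Dict, Tuple


def pick_merger(estimated_docs: Dict[str, int]) -> Tuple[str, int]:
    """Pick the shard with the largest estimated result set as the merger.

    Returns the chosen shard and the number of documents that would still
    have to be sent to it over the network from the other shards.
    """
    if not estimated_docs:
        raise ValueError("no shards were targeted")
    # Ties are broken by iteration order of the mapping.
    merger = max(estimated_docs, key=estimated_docs.get)
    moved = sum(n for shard, n in estimated_docs.items() if shard != merger)
    return merger, moved


merger, moved = pick_merger({"shardA": 10, "shardB": 500, "shardC": 0})
print(merger, moved)
```

Choosing shardB here means only shardA's 10 documents cross the network, whereas merging on shardC would move all 510.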

Comment by Asya Kamsky [ 26/Oct/17 ]

To clarify:

A scatter-gather aggregation is sent to all shards. Let's say only one of the shards actually has documents to send to the merging shard, while every other shard has 0 documents (the simplest case is that the $match stage result was 0 on those shards).

In this case, rather than considering every shard as the merge shard, having the merge happen on the single shard that has documents to contribute would be optimal from a performance point of view.

When only two shards have documents to contribute, only one of those two should be considered for being the merge shard, even if all shards were queried (because the others returned 0 documents).

Comment by David Storch [ 26/Oct/17 ]

asya, I don't understand the suggested optimization. What does it mean to only consider certain shards for the merge stage?
