[SERVER-27283] Sharded aggregations that need merging should only consider for merging the shards that have documents to contribute Created: 05/Dec/16 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Asya Kamsky | Assignee: | Backlog - Query Optimization |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | optimization, performance, zones | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: | Query Optimization |
| Description |
|
Even though a scatter-gather aggregation is sent to all shards, it may be that only a small subset of shards has any results to contribute to the merge stage. In those cases it would be better for performance if we considered as candidates for the merge stage only the shards that have documents to contribute. (This would require the shards to do more work on the cursor before returning it to mongos for forwarding to the selected merging shard.) |
| Comments |
| Comment by David Storch [ 26/Oct/17 ] |
|
I see, thanks for the clarification! It makes sense to me that we could always choose as the merger the shard which has the largest number of results to contribute, in order to minimize the amount of data that has to be moved across the network to the merger. I think implementing this optimization properly depends on having statistics which will allow us to estimate the size of the result set on each targeted shard. |
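As a rough illustration of the selection rule described in this comment, the sketch below picks the merger that minimizes the data shipped across the network, which is equivalent to picking the shard with the largest local result set. The per-shard size estimates, the shard names, and the helpers `bytes_moved` and `pick_merger` are all hypothetical; the server does not currently expose such statistics, which is the dependency noted above.

```python
from typing import Dict


def bytes_moved(merger: str, est_bytes: Dict[str, int]) -> int:
    """Bytes shipped over the network if `merger` hosts the merge.

    Every shard except the merger has to send its (estimated) results.
    """
    return sum(size for shard, size in est_bytes.items() if shard != merger)


def pick_merger(est_bytes: Dict[str, int]) -> str:
    """Choose the shard whose selection minimizes network transfer.

    This is the same as choosing the shard with the largest estimate.
    """
    return min(est_bytes, key=lambda shard: bytes_moved(shard, est_bytes))


# Hypothetical estimates of result-set size per targeted shard, in bytes.
estimates = {"shardA": 0, "shardB": 4_000_000, "shardC": 250_000}
chosen = pick_merger(estimates)
```

With these made-up numbers the merge would land on `shardB`, so only the results from `shardC` cross the network.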
| Comment by Asya Kamsky [ 26/Oct/17 ] |
|
To clarify: a scatter-gather aggregation is sent to all shards. Let's say only one of the shards actually has documents to send to the merging shard, while every other shard has 0 documents (the simplest case is that the $match stage matched 0 documents on those shards). In this case, rather than considering every shard as the merging shard, having the merge happen on the single shard that has documents to contribute would be optimal from a performance point of view. When only two shards have documents to contribute, only one of those two should be considered for being the merging shard, even if all shards were queried (because the others returned 0 documents). |
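The candidate filtering described here could be sketched as follows. This is only an illustration under the assumption that mongos somehow knows how many documents each shard will contribute; the `docs_per_shard` mapping, the shard names, and the `merge_candidates` helper are hypothetical and not part of any existing server API.

```python
from typing import Dict, List


def merge_candidates(docs_per_shard: Dict[str, int]) -> List[str]:
    """Return the shards eligible to host the merge.

    Only shards with at least one document to contribute are kept.
    If every shard returned 0 documents, fall back to all queried shards,
    since merging an empty result is equally cheap anywhere.
    """
    contributing = [shard for shard, n in docs_per_shard.items() if n > 0]
    return contributing or list(docs_per_shard)


# Hypothetical per-shard document counts after the shard-side pipeline part.
counts = {"shardA": 0, "shardB": 120, "shardC": 0, "shardD": 45}
candidates = merge_candidates(counts)
```

In this example all four shards were queried, but only `shardB` and `shardD` remain as candidates for the merging shard.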
| Comment by David Storch [ 26/Oct/17 ] |
|
asya, I don't understand the suggested optimization. What does it mean to only consider certain shards for the merge stage? |