[SERVER-28667] Provide a way for the Aggregation framework to query against intervals of a hashed index Created: 07/Apr/17 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | 3.4.3 |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Sylvain Chambon | Assignee: | Backlog - Query Optimization |
| Resolution: | Unresolved | Votes: | 5 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||
| Assigned Teams: |
Query Optimization
|
||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Case: | (copied to CRM) | ||||||||||||||||||||
| Description |
|
The Spark connector uses the Aggregation Framework to create data partitions that are sent to Spark workers. In a sharded cluster it makes sense to align these partitions to chunk boundaries so that each worker's data loading query is targeted to a single shard. This however is impossible when the shard key is a hashed index. A simple find can use $min / $max but there is no comparable facility in the aggregation framework. |
| Comments |
| Comment by Andrew Ryder (Inactive) [ 11/Apr/17 ] |
|
I think SERVER-24274 is a different (though related) use case. Here is my summary: SERVER-24274: Ability to query against a non-sharded collection in a way that makes it behave a lot like it's sharded. $bucketAuto should be a suitable candidate to fulfil these use cases (the cases specifically are not load related, they are about data distribution). SERVER-28667 (this ticket): Ability to constrain query results to being sourced from a particular shard. This permits the ability to manage workloads at the shards caused by distributed processing systems like Spark – these systems would ideally be prescient about the load that is generated at any single shard by their members in combination without actually requiring all members to be constantly negotiating. The MongoDB distribution model has this information, and it can be used in a query, but not in aggregation. That is the only missing piece of the puzzle. |
| Comment by Sylvain Chambon [ 07/Apr/17 ] |
|
Hi anonymous.user, Charlie's $bucketAuto is a good idea for creating partitions in general. Unfortunately it doesn't help with the specific goal of having partitions targeted to a single shard while using a hashed shard key. Using $bucketAuto, each partition would be spread across all shards. |
| Comment by Kelsey Schubert [ 07/Apr/17 ] |
|
Hi sylvain.chambon, I've linked to SERVER-24274, which might provide the functionality that your looking for. Can you take a look and let me know? Specifically, please see Charlie's comment about a possible workaround. Thank you, |