[SERVER-28667] Provide a way for the Aggregation framework to query against intervals of a hashed index Created: 07/Apr/17  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 3.4.3
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Sylvain Chambon Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 5
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
Related
related to SERVER-24274 Create a command to provide query bou... Backlog
is related to SPARK-98 MongoShardedPartitioner and hashed sh... Closed
is related to SERVER-14400 Using $min and $max on shard key does... Backlog
Assigned Teams:
Query Optimization
Participants:
Case:

 Description   

The Spark connector uses the Aggregation Framework to create data partitions that are sent to Spark workers. In a sharded cluster it makes sense to align these partitions to chunk boundaries so that each worker's data loading query is targeted to a single shard.

This however is impossible when the shard key is a hashed index. A simple find can use $min / $max but there is no comparable facility in the aggregation framework.



 Comments   
Comment by Andrew Ryder (Inactive) [ 11/Apr/17 ]

I think SERVER-24274 is a different (though related) use case. Here is my summary:

SERVER-24274: Ability to query against a non-sharded collection in a way that makes it behave a lot like it's sharded. $bucketAuto should be a suitable candidate to fulfil these use cases (the cases specifically are not load related, they are about data distribution).

SERVER-28667 (this ticket): Ability to constrain query results to being sourced from a particular shard. This permits the ability to manage workloads at the shards caused by distributed processing systems like Spark – these systems would ideally be prescient about the load that is generated at any single shard by their members in combination without actually requiring all members to be constantly negotiating. The MongoDB distribution model has this information, and it can be used in a query, but not in aggregation. That is the only missing piece of the puzzle.

Comment by Sylvain Chambon [ 07/Apr/17 ]

Hi anonymous.user, Charlie's $bucketAuto is a good idea for creating partitions in general. Unfortunately it doesn't help with the specific goal of having partitions targeted to a single shard while using a hashed shard key. Using $bucketAuto, each partition would be spread across all shards.

Comment by Kelsey Schubert [ 07/Apr/17 ]

Hi sylvain.chambon,

I've linked to SERVER-24274, which might provide the functionality that your looking for. Can you take a look and let me know? Specifically, please see Charlie's comment about a possible workaround.

Thank you,
Thomas

Generated at Thu Feb 08 04:18:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.