[SERVER-24274] Create a command to provide query bounds for partitioning data in a collection Created: 24/May/16  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Ross Lawley Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-25289 Make it possible to select a subset o... Closed
Related
is related to SERVER-28667 Provide a way for the Aggregation fra... Backlog
is related to SERVER-33998 Remove the parallelCollectionScan com... Closed
Assigned Teams:
Query Execution
Participants:

 Description   

Both the Spark and Hadoop connectors have custom code to partition the data in a collection so that it can be processed externally in parallel.

This currently requires either the splitVector command for non-sharded systems or access to query the config database for sharded systems. The permissions needed to determine the partitions may not be available in a sharded or hosted MongoDB deployment.

Adding a command that returns the min/max query bounds for splitting a collection into multiple parts would allow any external framework to query each partition and process them in parallel.
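To illustrate the idea, here is a hypothetical sketch (no such server command exists; the function name and strategy are illustrative only) of how min/max bounds for N partitions might be derived from a sorted sample of _id values, with open-ended ranges at each end so the partitions together cover the whole collection:

```python
def partition_bounds(sample, num_partitions):
    """Derive (min, max) bound pairs from a sample of _id values.

    None at either end marks an open-ended range, so the partitions
    together cover the entire collection.
    """
    sorted_sample = sorted(sample)
    step = max(1, len(sorted_sample) // num_partitions)
    # Pick evenly spaced split points from the sample.
    splits = [sorted_sample[i]
              for i in range(step, len(sorted_sample), step)][:num_partitions - 1]
    edges = [None] + splits + [None]
    return list(zip(edges, edges[1:]))

# Each (lo, hi) pair would become a range filter such as
# {"_id": {"$gte": lo, "$lt": hi}} executed by a separate worker.
bounds = partition_bounds(range(1000), 4)
```

With a sample of 0..999 and 4 partitions this yields (None, 250), (250, 500), (500, 750), (750, None).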



 Comments   
Comment by Geert Bosch [ 11/Apr/18 ]

One note is that the $sample/$bucketAuto approach will be an index scan + fetch plan, not a collection scan plan. So, the plan may be far slower...

Comment by Andrew Doumaux [ 30/Mar/18 ]

With SERVER-17688 closed as Won't Fix because of the work in SERVER-33998, this ticket should be assigned Fix Version: 3.7 Desired, as SERVER-17688 was.

The ability to process all the data within a collection in parallel is still a needed feature since the migration to WiredTiger.

Comment by Ross Lawley [ 16/Jun/16 ]

charlie.swanson your example looks good; it looks like $sample and $bucketAuto will meet the need for general cases. The only downside I can think of is that on sharded clusters the partitions may span multiple shards. I don't think that's an insurmountable issue, and there may not be that much requirement for it.

Comment by Charlie Swanson [ 14/Jun/16 ]

I have an idea that might provide a workaround in the meantime, although it relies on a feature we haven't built yet.

We're planning to add a $bucketAuto stage in SERVER-24152, which is currently scheduled for the 3.4 release. I think that using this stage in combination with the $sample stage may allow you to get a roughly even distribution of boundaries. Specifically, say you wanted 10 partitions; you could do something like this:

db.collection.aggregate([
    {$sample: {size: 1000}},  // This could be configured to tune the tradeoff between accuracy and performance.
    {$bucketAuto: {groupBy: "$_id", buckets: 10}}
])

This would generate output like so:

{_id: {min: XX, max: YY}, count: 100}
{_id: {min: WW, max: ZZ}, count: 100}
...

Does that sound like a reasonable approach?
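[Editorial note] The bucket documents above can be turned into per-partition range filters for workers to run. A hypothetical client-side sketch (the function name is illustrative; it assumes $bucketAuto's documented boundary behavior, where min is inclusive, max is exclusive, except in the last bucket, which includes its max):

```python
def buckets_to_filters(buckets):
    """Map each bucket's {_id: {min, max}} bounds onto a find() filter.

    All buckets are inclusive of min and exclusive of max, except the
    last bucket, which is also inclusive of its max.
    """
    filters = []
    for i, bucket in enumerate(buckets):
        lo, hi = bucket["_id"]["min"], bucket["_id"]["max"]
        upper = "$lte" if i == len(buckets) - 1 else "$lt"
        filters.append({"_id": {"$gte": lo, upper: hi}})
    return filters

buckets = [
    {"_id": {"min": 0, "max": 100}, "count": 100},
    {"_id": {"min": 100, "max": 199}, "count": 100},
]
filters = buckets_to_filters(buckets)
# Each filter would be passed to a separate worker's find() call.
```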

Generated at Thu Feb 08 04:05:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.