[SERVER-24274] Create a command to provide query bounds for partitioning data in a collection Created: 24/May/16 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Ross Lawley | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 2 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Assigned Teams: | Query Execution |
| Participants: | |
| Description |
|
Both the Spark and Hadoop connectors have custom code to partition the data in a collection so that it can be processed externally in parallel. This requires either the splitVector command on non-sharded systems or access to query the config database on sharded systems. The permissions needed to determine the partitions may not be available in a sharded or hosted MongoDB setup. Adding a command that provides the min and max query bounds for splitting a collection into multiple parts would allow any external framework to query each partition, and process the results, in parallel. |
| Comments |
| Comment by Geert Bosch [ 11/Apr/18 ] | |||||||
|
One note is that the $sample/$bucketAuto approach will be an index scan + fetch plan, not a collection scan plan. So, the plan may be far slower... | |||||||
| Comment by Andrew Doumaux [ 30/Mar/18 ] | |||||||
|
Parallel processing of all the data within a collection is still a needed feature since the migration to WiredTiger. | |||||||
| Comment by Ross Lawley [ 16/Jun/16 ] | |||||||
|
charlie.swanson, your example looks good; it looks like $sample and $bucketAuto will meet the need for the general case. The only downside I can think of is that on sharded clusters a partition may span multiple shards. I don't think that's an insurmountable issue, and there may not be much of a requirement to avoid it. | |||||||
| Comment by Charlie Swanson [ 14/Jun/16 ] | |||||||
|
I have an idea that might provide a workaround in the meantime, although it relies on a feature we haven't built yet. We're planning to add a $bucketAuto stage in
Which would generate some output like so:
Does that sound like a reasonable approach? |
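For readers following the thread, here is a rough sketch of how the $sample plus $bucketAuto idea from this comment could be turned into per-partition query filters. It is not the example originally attached to the comment. The function names `partition_pipeline` and `buckets_to_filters` are hypothetical, and the sketch assumes every document has the partitioning field with values of a single comparable type (for example `_id`). Because the buckets are computed from a sample, the first and last filters are left open-ended so that documents outside the sampled range are still covered.

```python
from typing import Any, Dict, List


def partition_pipeline(field: str, num_partitions: int, sample_size: int) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline that samples documents and buckets them.

    Hypothetical helper: $sample picks random documents, then $bucketAuto
    groups the sampled values of `field` into roughly equal-sized buckets.
    """
    return [
        {"$sample": {"size": sample_size}},
        {"$bucketAuto": {"groupBy": "$" + field, "buckets": num_partitions}},
    ]


def buckets_to_filters(field: str, buckets: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert $bucketAuto output into query filters, one per partition.

    Each bucket document is assumed to look like
    {"_id": {"min": ..., "max": ...}, "count": ...}.
    The min of every bucket after the first is used as a split point.
    """
    splits = [b["_id"]["min"] for b in buckets[1:]]
    if not splits:
        # Zero or one bucket: a single partition covering everything.
        return [{}]

    # First partition is unbounded below, last is unbounded above.
    filters: List[Dict[str, Any]] = [{field: {"$lt": splits[0]}}]
    for lo, hi in zip(splits, splits[1:]):
        filters.append({field: {"$gte": lo, "$lt": hi}})
    filters.append({field: {"$gte": splits[-1]}})
    return filters
```

With PyMongo, the bucket documents could come from something like `list(collection.aggregate(partition_pipeline("_id", 4, 1000)))`, and each resulting filter could then be passed to a separate `find` call by an external worker. Documents that lack the field, or that hold a value of a different BSON type, would not match any of these range filters and would need separate handling.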