- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Atlas Streams
For the ObjectId case, we tentatively have the following partitioning strategies:
1) bucketAuto (being implemented as part of https://jira.mongodb.org/browse/SERVER-102099)
2) Random sampling
3) Use the timestamp embedded in each ObjectId to derive the ranges
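A minimal sketch of strategy (3), assuming `_id` values are standard 12-byte ObjectIds whose first 4 bytes are a big-endian Unix timestamp in seconds. The helper names (`objectid_floor`, `timestamp_ranges`) are hypothetical and not part of any driver. It splits the span between the smallest and largest `_id` into equal-width time windows, which only yields balanced partitions if the insert rate was roughly uniform over time.

```python
def objectid_floor(ts_seconds):
    """Smallest ObjectId (24-char hex) whose embedded timestamp is ts_seconds."""
    if not 0 <= ts_seconds < 2**32:
        raise ValueError("ObjectId timestamps are unsigned 32-bit seconds")
    # 4 timestamp bytes followed by 8 zero bytes (random value + counter).
    return ts_seconds.to_bytes(4, "big").hex() + "00" * 8


def objectid_timestamp(oid_hex):
    """Timestamp (seconds since the Unix epoch) embedded in an ObjectId hex string."""
    return int(oid_hex[:8], 16)


def timestamp_ranges(min_oid_hex, max_oid_hex, num_ranges):
    """Split the _id space into contiguous half-open ranges [lower, upper).

    Boundaries are ObjectId hex strings. The last upper bound is None
    (unbounded) so documents written during the sync are still covered.
    """
    lo = objectid_timestamp(min_oid_hex)
    hi = objectid_timestamp(max_oid_hex) + 1  # exclusive, in seconds
    n = max(1, min(num_ranges, hi - lo))      # at most one range per second
    cuts = [lo + (hi - lo) * i // n for i in range(n)]
    bounds = [objectid_floor(c) for c in cuts] + [None]
    return list(zip(bounds, bounds[1:]))
```

Each `(lower, upper)` pair would translate to a filter such as `_id >= lower and _id < upper` on the collection scan for that partition.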
Using bucketAuto would be simplest. We need to investigate how long the bucketAuto phase takes to finish; if it is "too long", we use one of the other two strategies. To evaluate this, we can set up an M40 cluster with a 100 GB collection and some random ongoing writes into the cluster. Have a stream processor (SP) do an initialSync of this collection and time the bucketAuto phase, which we will be able to do from the Splunk logs. Repeat for collection sizes ranging from 100 GB to 1 TB and see how the time scales. If the time scales sublinearly, or linearly with a projected time that is still reasonable for 50 TB, we can use the bucketAuto strategy.
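One way to judge how the measured times scale is to fit a power law on log-log axes and read off the exponent. The sketch below assumes the timings have already been extracted from the logs as (collection size, seconds) pairs; the function names are hypothetical, and a fit over 100 GB to 1 TB is only a rough guide when extrapolated to 50 TB.

```python
import math


def fit_power_law(sizes_gb, seconds):
    """Least-squares fit of seconds ~= c * size_gb**k in log-log space.

    Returns (c, k). k close to 1 suggests linear scaling, k < 1 sublinear,
    k > 1 superlinear.
    """
    if len(sizes_gb) != len(seconds) or len(sizes_gb) < 2:
        raise ValueError("need at least two (size, time) measurements")
    xs = [math.log(s) for s in sizes_gb]
    ys = [math.log(t) for t in seconds]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct collection sizes")
    k = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    c = math.exp(mean_y - k * mean_x)
    return c, k


def projected_seconds(c, k, size_gb):
    """Extrapolate the fitted curve to another collection size."""
    return c * size_gb ** k
```

Calling `fit_power_law` on the measured pairs and then `projected_seconds(c, k, 50_000)` would give a first estimate for a 50 TB collection, to be compared against whatever duration is considered acceptable.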
For the random sampling approach, we will do the following:
Let M be the parallelism specified by the SP. To minimize the chance of extremely imbalanced buckets, we will create more than M buckets, with the exact number depending on the collection size.
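The sampling idea can be sketched as follows, under the assumption that each sampled key becomes a partition boundary. Plain integers stand in for ObjectIds, the oversampling factor is a placeholder, and all function names are hypothetical. How the sample is drawn from the collection (for example with a `$sample` aggregation stage) is outside this sketch.

```python
import bisect
import random


def cut_points(sampled_keys):
    """Sorted, de-duplicated sampled keys, used as partition boundaries."""
    return sorted(set(sampled_keys))


def ranges_from_cuts(cuts):
    """Contiguous half-open ranges [lower, upper); None means unbounded.

    n distinct cut points produce n + 1 ranges.
    """
    bounds = [None, *cuts, None]
    return list(zip(bounds, bounds[1:]))


def counts_per_range(keys, cuts):
    """Count how many keys fall in each range, to check balance offline."""
    counts = [0] * (len(cuts) + 1)
    for key in keys:
        # A key equal to a cut point belongs to the range starting at it.
        counts[bisect.bisect_right(cuts, key)] += 1
    return counts


# Illustration on synthetic keys; values below are placeholders, not measurements.
rng = random.Random(0)
keys = list(range(100_000))
parallelism, oversample = 4, 8
cuts = cut_points(rng.sample(keys, parallelism * oversample))
counts = counts_per_range(keys, cuts)
print(len(counts), min(counts), max(counts))
```

With more ranges than workers, a worker that finishes a small range can pick up another one, which is what limits the impact of any single oversized range.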
The overall strategy could look like this:
1) For collections smaller than 10 GB, the whole collection is one partition.
2) For collections of at least 10 GB but below some upper threshold (to be selected after running the tests above), we will use bucketAuto.
3) For collections at or above the upper threshold of (2), we will use random sampling: sample 10M times and create 10M partitions.
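The three tiers above could be expressed as a small selection function. This is a sketch under stated assumptions: "10M" is read as 10 × M (M being the SP parallelism), "G" is taken as 10^9 bytes, the upper threshold is left as a parameter because it has not been chosen yet, and the bucket count for bucketAuto is left unspecified. `Plan` and `choose_plan` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

GB = 10**9  # assumption: decimal gigabytes


@dataclass(frozen=True)
class Plan:
    strategy: str               # "single", "bucketAuto", or "random_sampling"
    partitions: Optional[int]   # None where the ticket does not fix a count


def choose_plan(size_bytes, parallelism, upper_threshold_bytes,
                small_threshold_bytes=10 * GB):
    """Pick a partitioning strategy from the collection size."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    if upper_threshold_bytes < small_threshold_bytes:
        raise ValueError("upper threshold must not be below the small threshold")
    if size_bytes < small_threshold_bytes:
        return Plan("single", 1)
    if size_bytes < upper_threshold_bytes:
        return Plan("bucketAuto", None)
    # Assumption: "10M" means ten partitions per unit of parallelism.
    return Plan("random_sampling", 10 * parallelism)
```

Keeping the upper threshold as an argument means the value chosen from the scaling tests can be supplied later without touching the selection logic.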
- depends on: SERVER-102099 Initial plumbing for InitialSync (Closed)