[SERVER-38205] Optimize splitVector for the jumbo-chunk case Created: 19/Nov/18 Updated: 29/Oct/23 Resolved: 08/Feb/19
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.15, 4.0.7, 4.1.8 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Matthew Saltz (Inactive) | Assignee: | Kevin Pulo |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.0, v3.6 |
| Sprint: | Sharding 2018-12-31, Sharding 2019-01-14, Sharding 2019-01-28, Sharding 2019-02-11 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 0 |
| Description |
|
If a chunk contains only a single shard key value (or very few distinct values), it will be marked as jumbo and will not be movable by the balancer. However, the autosplitter will continue to try to split this chunk periodically, even if there is only a single unique key, in which case it can never be split. There are several ways we could optimize for this case (both are sketched below):

1. Detect the single-key case cheaply up front: if the first and last shard key values in the chunk are equal, the chunk holds one distinct key and no split point can exist.
2. When scanning for split points, skip over runs of duplicate keys by seeking directly to the next distinct key instead of visiting every index entry.
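A minimal, self-contained sketch of both options, using std::multiset as a stand-in for the shard key index (the function names are illustrative, not the actual splitVector code):

{code}
#include <iostream>
#include <set>

using KeyIndex = std::multiset<int>;  // stand-in for the shard key index

// Option 1 (illustrative): compare the first and last index entries; if they
// are equal, the chunk holds a single distinct key and can never be split.
bool hasSingleDistinctKey(const KeyIndex& index) {
    return index.empty() || *index.begin() == *index.rbegin();
}

// Option 2 (illustrative): visit distinct keys only, seeking past runs of
// duplicates instead of scanning every index entry.
void forEachDistinctKey(const KeyIndex& index) {
    for (auto it = index.begin(); it != index.end(); it = index.upper_bound(*it))
        std::cout << "candidate split key: " << *it << '\n';
}

int main() {
    KeyIndex index{1, 1, 1, 1, 1, 2, 3};
    std::cout << std::boolalpha << hasSingleDistinctKey(index) << '\n';  // false
    forEachDistinctKey(index);  // visits 1, 2, 3: three seeks over seven entries
}
{code}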
|
| Comments |
| Comment by Githook User [ 16/Sep/19 ] |
|
Author: Kevin Pulo (devkev) &lt;kevin.pulo@mongodb.com&gt;
Message: (cherry picked from commit 4da738debb1aea49524ff8e364254afb5bfda612) |
| Comment by Githook User [ 10/Feb/19 ] |
|
Author: Kevin Pulo (devkev) &lt;kevin.pulo@mongodb.com&gt;
Message: (cherry picked from commit 4da738debb1aea49524ff8e364254afb5bfda612) |
| Comment by Githook User [ 08/Feb/19 ] |
|
Author: Kevin Pulo (devkev) &lt;kevin.pulo@mongodb.com&gt;
Message: |
| Comment by Matthew Saltz (Inactive) [ 29/Nov/18 ] |
|
david.storch, your (1) is indeed the idea for implementing solution 1 in the ticket. Another way to implement option 2 would be, instead of seeking to the next key, to open a new InternalPlanner with a new query for key > currentKeyBeingLookedAt. If that works, I think solution 2 would be the best and most general-purpose approach since, as Kal said, the first is limited if there is a high-cardinality key in the middle of a chunk. |
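For illustration, that re-query variant of option 2 could look like the following sketch, where scanGreaterThan is a hypothetical stand-in for opening a fresh InternalPlanner bounded below (exclusive) by the current key; it is not a real MongoDB API:

{code}
#include <iostream>
#include <optional>
#include <set>

using KeyIndex = std::multiset<int>;  // stand-in for the shard key index

// Hypothetical stand-in for "open a new InternalPlanner with a new query for
// key > current": returns the first key strictly greater than `current`.
std::optional<int> scanGreaterThan(const KeyIndex& index, int current) {
    auto it = index.upper_bound(current);
    if (it == index.end())
        return std::nullopt;
    return *it;
}

int main() {
    KeyIndex index{1, 1, 1, 2, 2, 3};
    // Each iteration simulates closing the current scan and opening a new one
    // with an exclusive lower bound at the last key seen.
    std::optional<int> key = *index.begin();
    while (key) {
        std::cout << "distinct key: " << *key << '\n';
        key = scanGreaterThan(index, *key);
    }
}
{code}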
| Comment by David Storch [ 27/Nov/18 ] |
|
kaloian.manassiev, the ability to perform arbitrary inclusive or exclusive index seeks is definitely supported by the storage subsystem's SortedDataInterface. We take advantage of this to skip keys in the query layer, in particular for the DISTINCT_SCAN stage. However, I don't think this is exposed by the InternalPlanner in a way that splitVector could use. You could either:

1) Use the InternalPlanner to seek forwards (inclusive) from the min, limited to one key. Then use the InternalPlanner again to seek backwards (exclusive) from the max, also limited to one key. I believe this is how you would implement solution 1 above?

2) Circumvent the query layer and use the storage interface directly. One problem with this approach is that you would lose the code responsible for yielding and WriteConflictException handling. |
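A sketch of approach 1), again with std::multiset standing in for the shard key index; seekForwardInclusive and seekBackwardExclusive are hypothetical names for the two limit-one scans, not real InternalPlanner methods:

{code}
#include <iterator>
#include <optional>
#include <set>

using KeyIndex = std::multiset<int>;  // stand-in for the shard key index

// Forward, inclusive, limit-one seek: the first key >= min.
std::optional<int> seekForwardInclusive(const KeyIndex& idx, int min) {
    auto it = idx.lower_bound(min);
    if (it == idx.end())
        return std::nullopt;
    return *it;
}

// Backward, exclusive, limit-one seek: the last key < max.
std::optional<int> seekBackwardExclusive(const KeyIndex& idx, int max) {
    auto it = idx.lower_bound(max);
    if (it == idx.begin())
        return std::nullopt;
    return *std::prev(it);
}

// Solution 1: the chunk range [min, max) holds exactly one distinct key
// iff both limit-one seeks return the same value.
bool singleDistinctKeyInChunk(const KeyIndex& idx, int min, int max) {
    auto first = seekForwardInclusive(idx, min);
    auto last = seekBackwardExclusive(idx, max);
    return first && last && *first == *last;
}
{code}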
| Comment by Kevin Pulo [ 21/Nov/18 ] |
|
There's also a "0" optimisation, which is that splitVector() should short-circuit if minKey == maxKey. Currently, in this situation it will still scan the shard key index entries for this shard key value (pointlessly, because there is no hope of finding any split points), and I can't see any higher-level code that prevents splitVector() from being called with min == max.

This is true, but it doesn't preclude the optimisations: if 0 or 1 were implemented, it would at least allow the workaround of manually splitting immediately around the (manually identified) low-cardinality (high-duplication) shard key value. Future auto-splitter work could try to paint a more detailed picture of the statistical distribution of shard key values encountered during splitVector of a chunk, and so better handle situations like this. |
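The "0" optimisation amounts to an early-out guard at the top of the scan. A sketch only; the real splitVector takes BSONObj bounds and more parameters, with int keys standing in for brevity:

{code}
#include <vector>

// Sketch of the "0" optimisation: bail out before scanning if the requested
// range covers a single shard key value (hypothetical simplified signature).
std::vector<int> splitVectorSketch(int minKey, int maxKey) {
    if (minKey == maxKey) {
        // Only one shard key value in range, so no split point can possibly
        // exist: skip the index scan entirely.
        return {};
    }
    std::vector<int> splitPoints;
    // ... normal path: scan the shard key index and collect split points ...
    return splitPoints;
}
{code}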
| Comment by Kaloian Manassiev [ 20/Nov/18 ] |
|
matthew.saltz, in addition, as we discussed yesterday, we realized that there is a bug in the chunk splitter: if a high-cardinality key is in the middle of a chunk, it is possible that the low-cardinality keys that precede it may never get split away. Having a situation like this would preclude optimization #1, wouldn't it? |
| Comment by Kaloian Manassiev [ 20/Nov/18 ] |
|
Solution 1 sounds like it should be fairly easy to implement without adding undue load to the splitVector command, so that would be my preference. For 2, I am not sure whether the query stage exposes such a capability to "jump over" to the next unique key. david.storch, do you know whether this would be possible with this usage of the IndexScan executor? |