Currently, chunk splits, whether manual or initiated by the auto-splitter, acquire the collection distributed lock. This is problematic for two reasons:
- Even if only a single shard is imbalanced and the balancer is migrating one of its chunks, the chunk splitter will be unable to acquire the distributed lock and will repeatedly fail
- With or without migrations in progress, the dist lock acquisition happens only after we have performed the splitVector scan to determine the split points, which means a good portion of these scans can end up being wasted. This is somewhat mitigated by the fact that the dist lock is taken with the default timeout of 5 seconds, but given that we retry with a 500 ms back-off, there is still some chance that we waste scans on a sufficiently loaded system.
This ticket is to figure out how to remove the dist lock acquisition from splits without causing a regression in the opposite direction, i.e., splits causing the much more expensive chunk moves to start failing because the chunk being moved was split underneath them.
We could achieve this by having the split logic ignore size tracking for, and/or skip splitting, chunks that are currently being moved.
- depends on
SERVER-56779 Do not use the collection distributed lock for chunk merges
- is depended on by
SERVER-57032 Acquire only local distLock for Sharding DDL coordinators
- is duplicated by
SERVER-25359 Create collection-specific ResourceMutex map to allow split/merge/move chunk on different collections to proceed in parallel