Today, chunk migration behaves like any other collection DDL operation and uses an exclusive collection lock for synchronization. Because the lock manager grants collection locks fairly, an X-lock request that is waiting behind a multi-statement transaction also holds up the requests queued after it, so these acquisitions can stall access to the entire collection for up to the transaction timeout (which defaults to 1 minute). Worse, unlike DDL, chunk migration is not user-initiated, and customers have no control over it apart from disabling the balancer.
Since this is not acceptable and may lead to outages, we will cap every migration-related lock acquisition stall at a configurable value that defaults to 500 milliseconds.
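To make the stall concrete, here is a toy model of a strictly FIFO lock queue. It is only a sketch, not the server's lock manager: the Mode, Request, and enqueue names are hypothetical, and the grant rule (a new request is granted only if it is compatible with everything already queued, waiting or granted) is an assumed simplification of fair queueing.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified lock modes: intent locks are compatible with each
// other, an exclusive lock is compatible with nothing.
enum class Mode { Intent, Exclusive };

struct Request {
    std::string who;
    Mode mode;
    bool granted;
};

bool compatible(Mode a, Mode b) {
    return a == Mode::Intent && b == Mode::Intent;
}

// Assumed FIFO policy: a new request is granted immediately only if it is
// compatible with every request already in the queue, including requests
// that are still waiting. This is what makes the queue fair, and also what
// lets one waiting X request hold up everything that arrives after it.
bool enqueue(std::vector<Request>& queue, const std::string& who, Mode mode) {
    bool grant = true;
    for (const auto& r : queue) {
        if (!compatible(r.mode, mode)) {
            grant = false;
        }
    }
    queue.push_back({who, mode, grant});
    return grant;
}
```

In this model, a transaction's intent lock is granted, the migration's exclusive request then waits behind it, and a later intent request also waits even though it is compatible with the lock that is currently held. Bounding how long the exclusive request may wait bounds that knock-on stall.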
On the donor side, there are three state transitions that use the collection X lock to synchronize with the concurrent workload and can therefore cause the stalls described above:
- Starting the clone phase
- Entering the catch-up (read-only) phase of the critical section
- Entering the commit phase of the critical section
As part of this ticket, we should implement the following:
- A startup- and runtime-configurable parameter called migrationLockAcquisitionMaxWaitMS, defaulting to 500 milliseconds
- Change all the usages of AutoGetCollection in the three locations above to pass a deadline of now() + migrationLockAcquisitionMaxWaitMS
- Run the tests multiple times to make sure that they don't fail on slower machines and, if need be, raise the migrationLockAcquisitionMaxWaitMS default to 30 seconds when test commands are enabled.
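The deadline-based acquisition described above can be sketched with standard-library primitives. This is an illustration under stated assumptions, not the server implementation: std::shared_timed_mutex stands in for the collection lock, a plain atomic stands in for the migrationLockAcquisitionMaxWaitMS server parameter, and tryLockExclusiveForMigration and migrationAcquiresWhileReaderHolds are hypothetical helper names. The real change would pass the deadline to AutoGetCollection instead.

```cpp
#include <atomic>
#include <chrono>
#include <future>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <thread>
#include <utility>

// Stand-in for the proposed server parameter (assumption: a plain atomic;
// the real parameter would be settable at startup and at runtime).
std::atomic<int> migrationLockAcquisitionMaxWaitMS{500};

// Stand-in for the collection lock.
using CollectionLock = std::shared_timed_mutex;

// Try to take the lock exclusively, giving up at
// now() + migrationLockAcquisitionMaxWaitMS instead of waiting indefinitely.
std::optional<std::unique_lock<CollectionLock>> tryLockExclusiveForMigration(
    CollectionLock& lock) {
    const auto deadline = std::chrono::steady_clock::now() +
        std::chrono::milliseconds(migrationLockAcquisitionMaxWaitMS.load());
    // This constructor calls lock.try_lock_until(deadline).
    std::unique_lock<CollectionLock> guard(lock, deadline);
    if (!guard.owns_lock()) {
        return std::nullopt;  // Deadline passed; the caller fails or retries the step.
    }
    return std::make_optional(std::move(guard));
}

// Demo: a reader holds the lock in shared mode for readerHold while the
// migration attempts its exclusive acquisition. Returns whether it succeeded.
bool migrationAcquiresWhileReaderHolds(std::chrono::milliseconds readerHold) {
    CollectionLock coll;
    std::promise<void> readerReady;
    std::future<void> ready = readerReady.get_future();
    std::thread reader([&] {
        std::shared_lock<CollectionLock> shared(coll);
        readerReady.set_value();
        std::this_thread::sleep_for(readerHold);
    });
    ready.wait();  // Make sure the reader already holds the lock.
    const bool acquired = tryLockExclusiveForMigration(coll).has_value();
    reader.join();
    return acquired;
}
```

With the cap in place, a reader that holds the lock longer than the configured wait makes the migration step give up at the deadline rather than queue behind it for the reader's full duration. What the migration does after a failed acquisition (abort or retry) is left open here.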