Type: Task
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.1.8
Affects Version/s: None
Component/s: Sharding
Labels:
- newgrad

Backwards Compatibility:
Fully Compatible
Sprint:
Sharding 2018-12-31, Sharding 2019-01-14, Sharding 2019-01-28
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Today, chunk migration behaves like any other collection DDL operation and uses an exclusive (X) collection lock for synchronization. Because the lock manager grants collection locks fairly, a pending X-lock request also holds up the requests queued behind it, so in the presence of multi-statement transactions these X-lock acquisitions can stall access to the entire collection for up to the transaction timeout (which defaults to 1 minute). Worse, unlike DDL, chunk migration is not user-initiated, and customers have no control over it apart from disabling the balancer.

Since this is not acceptable and may lead to outages, we will cap every migration-related lock acquisition stall at a configurable value that defaults to 500 milliseconds.
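To make the effect of the cap concrete, here is a toy model of the stall seen by an operation that arrives while the migration's X-lock request is queued. It is only a sketch under simplifying assumptions: strict FIFO fairness, a single transaction holding the lock, and an X lock that is released immediately once granted. The function name `reader_stall_ms` and its parameters are hypothetical and do not correspond to anything in the server code.

```python
from typing import Optional


def reader_stall_ms(txn_release_ms: int,
                    x_request_ms: int,
                    reader_arrival_ms: int,
                    max_wait_ms: Optional[int] = None) -> int:
    """Return how long a newly arriving operation is stalled (toy model).

    txn_release_ms:    time at which the open transaction releases its lock
    x_request_ms:      time at which the migration enqueues its X request
    reader_arrival_ms: time at which the other operation requests its lock
    max_wait_ms:       cap on the X request's wait; None means no cap
    """
    # The X request leaves the queue when the transaction releases its lock
    # or, if a cap is configured, when the cap expires, whichever is first.
    if max_wait_ms is None:
        x_leaves_queue_ms = txn_release_ms
    else:
        x_leaves_queue_ms = min(txn_release_ms, x_request_ms + max_wait_ms)

    # Operations arriving before the X request, or after it has left the
    # queue, are not stalled by it in this model.
    if reader_arrival_ms < x_request_ms or reader_arrival_ms >= x_leaves_queue_ms:
        return 0
    return x_leaves_queue_ms - reader_arrival_ms


if __name__ == "__main__":
    # A transaction holds its lock for the full 1 minute timeout; the
    # migration asks for X at t=1000 ms and another operation arrives at
    # t=1200 ms.
    print(reader_stall_ms(60_000, 1_000, 1_200))                    # 58800
    print(reader_stall_ms(60_000, 1_000, 1_200, max_wait_ms=500))   # 300
```

In this model the uncapped stall is bounded only by the transaction timeout, while the capped stall can never exceed the configured wait.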

On the donor side, three state transitions use the collection X lock to synchronize with the concurrent workload and can therefore cause the described stalls:

- Starting the clone phase
- Entering the catch-up (read-only) phase of the critical section
- Entering the commit phase of the critical section

As part of this ticket, we should implement the following:

- Add a startup- and runtime-configurable parameter called migrationLockAcquisitionMaxWaitMS, defaulting to 500 ms.
- Change all usages of AutoGetCollection in the three locations above to pass a deadline of now() + migrationLockAcquisitionMaxWaitMS.
- Run the tests multiple times to make sure that they don't fail on slower machines and, if need be, raise the migrationLockAcquisitionMaxWaitMS default to 30 seconds when test commands are enabled.
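The deadline-based acquisition described above can be sketched outside the server as follows. This is an illustration only, not the server's C++ implementation: a plain `threading.Lock` stands in for the collection X lock, and the names `acquire_with_deadline`, `default_max_wait_ms`, `LockTimeout`, and `run_donor_transitions` are hypothetical. The 500 ms and 30 second values are the ones proposed in this ticket.

```python
import threading
import time

DEFAULT_MAX_WAIT_MS = 500    # default proposed in this ticket
TEST_MAX_WAIT_MS = 30_000    # fallback proposed for when test commands are enabled

# The three donor-side transitions listed in this ticket.
DONOR_TRANSITIONS = (
    "start clone phase",
    "enter catch-up (read-only) phase",
    "enter commit phase",
)


class LockTimeout(Exception):
    """Raised when the lock is not acquired before the deadline."""


def default_max_wait_ms(test_commands_enabled: bool) -> int:
    """Pick the default wait, using the larger value under test commands."""
    return TEST_MAX_WAIT_MS if test_commands_enabled else DEFAULT_MAX_WAIT_MS


def acquire_with_deadline(lock: threading.Lock, max_wait_ms: int) -> None:
    """Acquire `lock`, giving up once now() + max_wait_ms has passed."""
    deadline = time.monotonic() + max_wait_ms / 1000.0
    remaining = deadline - time.monotonic()
    if remaining > 0:
        acquired = lock.acquire(timeout=remaining)
    else:
        acquired = lock.acquire(blocking=False)
    if not acquired:
        raise LockTimeout(f"lock not acquired within {max_wait_ms} ms")


def run_donor_transitions(lock: threading.Lock, max_wait_ms: int) -> list:
    """Attempt each transition with its own fresh deadline."""
    completed = []
    for name in DONOR_TRANSITIONS:
        acquire_with_deadline(lock, max_wait_ms)
        try:
            completed.append(name)  # the state change would happen here
        finally:
            lock.release()
    return completed


if __name__ == "__main__":
    collection_lock = threading.Lock()
    print(run_donor_transitions(collection_lock, default_max_wait_ms(False)))
```

Each transition computes its own deadline, so a timeout in one phase fails that attempt quickly instead of leaving a pending request that holds up other operations on the collection.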

depends on

SERVER-39091 startClone can trigger invariant failure on error

Closed

is duplicated by

SERVER-34018 Stashed transaction resources for snapshot reads can block the migration critical section, leading to stalls

Closed

Assignee: Kim Tao
Reporter: Kaloian Manassiev
Participants: Esha Maharishi, Githook User, Kaloian Manassiev, Kim Tao
Votes: 0
Watchers: 5

Created: Nov 15 2018 12:11:54 PM UTC
Updated: Oct 29 2023 10:26:32 PM UTC
Resolved: Jan 24 2019 09:47:09 PM UTC
Confidence Status Last Update: 20/Dec/18 9:56 PM
