[SERVER-38147] Cap donor migration lock acquisition stalls in the presence of active transactions Created: 15/Nov/18  Updated: 29/Oct/23  Resolved: 24/Jan/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.1.8

Type: Task
Priority: Major - P3
Reporter: Kaloian Manassiev
Assignee: Kim Tao
Resolution: Fixed
Votes: 0
Labels: newgrad
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-39091 startClone can trigger invariant fail... Closed
Duplicate
is duplicated by SERVER-34018 Stashed transaction resources for sna... Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 2018-12-31, Sharding 2019-01-14, Sharding 2019-01-28
Participants:

 Description   

Today, chunk migration behaves like any other collection DDL operation and uses an exclusive collection lock for synchronization. Because the lock manager grants collection locks fairly, a pending X-lock request waits behind active multi-statement transactions while newer requests queue behind it, so these acquisitions can stall access to the entire collection for up to the transaction timeout (which defaults to 1 minute). Worse, unlike DDL, chunk migration is not user-initiated, and customers have no control over it apart from disabling the balancer.

Since this is not acceptable and may lead to outages, we will cap every migration-related lock acquisition stall at a configurable parameter value, defaulting to 500 milliseconds.

On the donor side, three state transitions use the collection X lock to synchronize with the concurrent workload and can therefore cause the stalls described above:

  1. Starting the clone phase
  2. Entering the catch-up (read-only) phase of the critical section
  3. Entering the commit phase of the critical section

As part of this ticket, we should implement the following:

  • Add a startup- and runtime-configurable parameter called migrationLockAcquisitionMaxWaitMS, defaulting to 500 ms
  • Change all the usages of AutoGetCollection in the three locations above to pass a deadline of now() + migrationLockAcquisitionMaxWaitMS
  • Run the tests multiple times to make sure that they don't fail on slower machines and, if need be, raise the migrationLockAcquisitionMaxWaitMS default to 30 seconds when test commands are enabled.
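
The deadline-based acquisition described in the list above can be illustrated with a minimal, standalone C++ sketch. It is not the server implementation: std::shared_timed_mutex stands in for the collection lock, a shared hold stands in for an open transaction, and acquireExclusiveWithCap and migrationStepAcquires are hypothetical helper names. Only the parameter name migrationLockAcquisitionMaxWaitMS and its 500 ms default come from this ticket.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <shared_mutex>
#include <thread>

using Clock = std::chrono::steady_clock;
using Milliseconds = std::chrono::milliseconds;

// Hypothetical stand-in for the proposed server parameter (default 500 ms).
// An atomic lets the value be changed at runtime without a restart.
std::atomic<long long> migrationLockAcquisitionMaxWaitMS{500};

// Try to take `collLock` exclusively, waiting no longer than the configured
// cap. The returned lock owns the mutex only if the acquisition succeeded.
std::unique_lock<std::shared_timed_mutex> acquireExclusiveWithCap(
    std::shared_timed_mutex& collLock) {
    const auto maxWait = Milliseconds(migrationLockAcquisitionMaxWaitMS.load());
    assert(maxWait.count() >= 0);
    const auto deadline = Clock::now() + maxWait;  // now() + max wait
    // This constructor calls try_lock_until(deadline) on the mutex.
    return std::unique_lock<std::shared_timed_mutex>(collLock, deadline);
}

// Simulates a transaction that holds the lock in shared mode for `holdFor`
// while a migration step attempts the capped exclusive acquisition.
// Returns true if the migration step obtained the lock before the deadline.
bool migrationStepAcquires(Milliseconds holdFor) {
    std::shared_timed_mutex collLock;
    std::atomic<bool> held{false};

    std::thread txn([&] {
        std::shared_lock<std::shared_timed_mutex> txnLock(collLock);
        held = true;
        std::this_thread::sleep_for(holdFor);
    });

    // Wait until the simulated transaction actually holds the lock.
    while (!held) {
        std::this_thread::yield();
    }

    auto lk = acquireExclusiveWithCap(collLock);
    const bool acquired = lk.owns_lock();
    txn.join();
    return acquired;
}
```

With the cap shorter than the time the simulated transaction holds the lock, the call gives up at the deadline instead of waiting for the transaction to finish. What the caller does on a timeout (for example, failing the migration step so that it can be retried later) is outside the scope of this sketch.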


 Comments   
Comment by Esha Maharishi (Inactive) [ 01/Jun/21 ]

I see, it was because even read-only transactions used MODE_IX ("write") locks (I'm not sure if read-only transactions still use MODE_IX locks). Thank you!

Comment by Kaloian Manassiev [ 31/May/21 ]

At the time I wrote this response, transactions were using MODE_IX locks even if they were read-only. The mode of the global lock is what we use to decide whether to apply the critical section to an operation. This means that a transaction, even one started during the read-only/catch-up phase, will still block behind the CS (or rather, the transaction will be aborted).

Comment by Esha Maharishi (Inactive) [ 25/May/21 ]

There cannot be any active transactions on this collection once we enter the read-only part of the critical section.

kaloian.manassiev, did you mean there cannot be active transactions once we enter the commit phase of the critical section? During the read-only phase (catch-up phase), I think new read-only transactions can still be started.

I still agree that the timeout is not needed on the lock acquisitions in the commit phase, since there cannot be active transactions once in the commit phase.

Comment by Githook User [ 24/Jan/19 ]

Author: Kim Tao (Kimchelly) <kimberly.tao@mongodb.com>

Message: SERVER-38147: cap donor migration lock acquisition stalls in the presence of active transactions
Branch: master
https://github.com/mongodb/mongo/commit/ae2607974156a4141cceec7b682418d57057e89e

Comment by Kaloian Manassiev [ 29/Nov/18 ]

Yes, these two are part of the "cleanup" and that's why I didn't account for them. There cannot be any active transactions on this collection once we enter the read-only part of the critical section.

Comment by Esha Maharishi (Inactive) [ 29/Nov/18 ]

kaloian.manassiev, there are also the two collection X lock acquisitions to refresh the CSS after the migration commit (one if the remote refresh succeeds and one if the remote refresh fails).

However, these acquisitions are inside the critical section - is that why they don't need the deadline?

Note that the "remote refresh succeeds" acquisition is actually inside forceShardFilteringMetadataRefresh() (which we are not going to initially put a timeout on), and the "remote refresh fails" one is under an UninterruptibleLockGuard.
