[SERVER-38147] Cap donor migration lock acquisition stalls in the presence of active transactions Created: 15/Nov/18 Updated: 29/Oct/23 Resolved: 24/Jan/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.8 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Kim Tao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | newgrad | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Sprint: | Sharding 2018-12-31, Sharding 2019-01-14, Sharding 2019-01-28 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Description |
|
Today, chunk migration behaves like any other collection DDL operation and uses an exclusive collection lock (X lock) for synchronization. Because the lock manager grants collection locks fairly, an X-lock request that queues behind a multi-statement transaction also holds up every request that arrives after it, so these acquisitions can stall access to the entire collection for up to the transaction timeout (which defaults to 1 minute). Worse, unlike DDL, chunk migration is not user-initiated, and customers have no control over it apart from disabling the balancer. Since this is not acceptable and may lead to outages, we will cap all migration-related lock acquisition stalls at a configurable parameter value, defaulting to 500 milliseconds. On the donor side, there are 3 state transitions that use the collection X lock to synchronize with the concurrent workload and can therefore lead to the described stalls:
As part of this ticket, we should implement the following:
|
| Comments |
| Comment by Esha Maharishi (Inactive) [ 01/Jun/21 ] |
|
I see, it was because even read-only transactions used MODE_IX ("write") locks (I'm not sure if read-only transactions still use MODE_IX locks). Thank you! |
| Comment by Kaloian Manassiev [ 31/May/21 ] |
|
At the time I wrote this response, transactions were using MODE_IX locks, even if they were read-only transactions. The mode of the global lock is what we use to decide whether to apply the critical section to an operation. This means that a transaction, even one that runs during the read-only/catch-up phase, will still block behind the critical section (CS), or rather, it will be aborted. |
| Comment by Esha Maharishi (Inactive) [ 25/May/21 ] |
kaloian.manassiev, did you mean there cannot be active transactions once we enter the commit phase of the critical section? During the read-only phase (catch-up phase), I think new read-only transactions can still be started. I still agree that the timeout is not needed on the lock acquisitions in the commit phase, since there cannot be active transactions once in the commit phase. |
| Comment by Githook User [ 24/Jan/19 ] |
|
Author: {'email': 'kimberly.tao@mongodb.com', 'name': 'Kim Tao', 'username': 'Kimchelly'}
Message: |
| Comment by Kaloian Manassiev [ 29/Nov/18 ] |
|
Yes, these two are part of the "cleanup" and that's why I didn't account for them. There cannot be any active transactions on this collection once we enter the read-only part of the critical section. |
| Comment by Esha Maharishi (Inactive) [ 29/Nov/18 ] |
|
kaloian.manassiev, there are also the two collection X lock acquisitions to refresh the CSS after the migration commit (one if the remote refresh succeeds and one if it fails). However, these acquisitions are inside the critical section. Is that why they don't need the deadline? Note that the "remote refresh succeeds" acquisition is actually inside forceShardFilteringMetadataRefresh() (which we are not going to initially put a timeout on), and the "remote refresh fails" one is under an UninterruptibleLockGuard. |