Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 6.0.2
Component/s: None
Labels:
None

Operating System:
ALL
Steps To Reproduce:

Create a two-shard cluster. Insert all the data while the collection is unsharded, create 10 indexes, and then shard the collection.

The data will be migrated from one shard to the other. After the 15-minute range deletion timeout, we start seeing critical section releases fail on the donor shard because of a failure to acquire the critical section.

Sprint:
Sharding EMEA 2022-10-17, Sharding EMEA 2022-10-31, Sharding EMEA 2022-11-14
Case:
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

This ticket is to investigate the lock timeouts caused by the range deleter on v6.0. The acquisition of the critical section is timing out while trying to acquire the collection lock, which appears to be held by the range deleter.

This is a problem on 6.0 because of the acquisition of the ScopedRangeDeleterLock before running the deletion. This lock is necessary to prevent the FCV upgrade orphan counter code from setting incorrect counters on the range deletions. However, the problem we are running into here is that the ScopedRangeDeleterLock acquires the DBLock on the config database, which automatically acquires the GlobalLock. But we have already acquired the DBLock on the user database, which has already automatically acquired the GlobalLock. This double acquisition means that we are recursively locking the global lock, and so the yield policy is replaced with NO_YIELD.
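The nesting described above can be illustrated with a small toy model. This is only a sketch: `ToyLocker`, `ToyDbLock`, and `rangeDeleterCanYield` are hypothetical names invented for illustration and are not the server's real lock classes. The one rule it encodes is the one stated above: once the global lock is held more than once, yielding is disabled.

```cpp
#include <string>
#include <utility>

// Toy stand-in for per-operation lock state (hypothetical, not MongoDB's Locker).
struct ToyLocker {
    int globalLockDepth = 0;  // number of times the global lock is currently held

    // Assumption of this model: yielding is only allowed while the global lock
    // is not held recursively.
    bool canYield() const { return globalLockDepth <= 1; }
};

// Toy RAII database lock: taking it implicitly takes the global lock as well.
class ToyDbLock {
public:
    ToyDbLock(ToyLocker& locker, std::string db)
        : _locker(locker), _db(std::move(db)) {
        ++_locker.globalLockDepth;
    }
    ~ToyDbLock() { --_locker.globalLockDepth; }

    ToyDbLock(const ToyDbLock&) = delete;
    ToyDbLock& operator=(const ToyDbLock&) = delete;

private:
    ToyLocker& _locker;
    std::string _db;
};

// Mimics the order described in the ticket: the user database lock first,
// then the config database lock taken by the scoped range deleter lock.
// Returns whether the operation could still yield with both locks held.
bool rangeDeleterCanYield(ToyLocker& locker) {
    ToyDbLock userDbLock(locker, "userDb");
    ToyDbLock configDbLock(locker, "config");
    return locker.canYield();
}
```

In this model a single `ToyDbLock` leaves `canYield()` true, while the nested pair inside `rangeDeleterCanYield` makes it false, which corresponds to the yield policy being replaced with NO_YIELD.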

We also cannot change the ScopedRangeDeleterLock to never acquire the global lock, because some usages of the persistUpdatedNumOrphans function during migrations are not holding the global lock already.

One option would be to change the ScopedRangeDeleterLock to acquire the GlobalLock conditionally based on whether we already hold it, something like replacing this line with the code below.

_configLock(opCtx, NamespaceString::kConfigDb, MODE_IX, Date_t::max(), opCtx->lockState()->isLocked())
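To show the intent of that change, here is a minimal toy sketch. It assumes the last constructor argument acts as a "skip the global lock" flag, as the snippet above suggests. All names (`ToyLocker`, `ToyDbLock`, `ToyScopedRangeDeleterLock`, `globalDepthUnderRangeDeleterLock`) are hypothetical and only model the behavior; they are not the server's classes.

```cpp
// Toy stand-in for per-operation lock state (hypothetical).
struct ToyLocker {
    int globalLockDepth = 0;  // number of times the global lock is currently held
    bool isLocked() const { return globalLockDepth > 0; }
};

// Toy RAII database lock that can optionally skip the implicit global lock.
class ToyDbLock {
public:
    ToyDbLock(ToyLocker& locker, bool skipGlobalLock)
        : _locker(locker), _tookGlobal(!skipGlobalLock) {
        if (_tookGlobal) ++_locker.globalLockDepth;
    }
    ~ToyDbLock() {
        if (_tookGlobal) --_locker.globalLockDepth;
    }

    ToyDbLock(const ToyDbLock&) = delete;
    ToyDbLock& operator=(const ToyDbLock&) = delete;

private:
    ToyLocker& _locker;
    bool _tookGlobal;
};

// Toy scoped range deleter lock: skips the global lock only when the caller
// already holds it, mirroring the isLocked() argument in the proposed line.
class ToyScopedRangeDeleterLock {
public:
    explicit ToyScopedRangeDeleterLock(ToyLocker& locker)
        : _configLock(locker, locker.isLocked()) {}

private:
    ToyDbLock _configLock;
};

// Returns the global lock depth observed while the scoped lock is held.
int globalDepthUnderRangeDeleterLock(ToyLocker& locker, bool alreadyHoldsUserDbLock) {
    if (alreadyHoldsUserDbLock) {
        ToyDbLock userDbLock(locker, /*skipGlobalLock=*/false);
        ToyScopedRangeDeleterLock rangeDeleterLock(locker);
        return locker.globalLockDepth;
    }
    ToyScopedRangeDeleterLock rangeDeleterLock(locker);
    return locker.globalLockDepth;
}
```

In both call paths the depth stays at 1: the range deleter path avoids the recursive acquisition, and a caller that holds no lock yet (such as the persistUpdatedNumOrphans usages mentioned above) still ends up holding the global lock.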

But it may be better to find a more general solution that also considers the general problem of the ScopedRangeDeleterLock in SERVER-70322.

Duplicates: SERVER-70864 Get rid of fine grained scoped range deleter lock (Closed)

Assignee: Tommaso Tocci
Reporter: Allison Easton
Participants: Allison Easton, Andrew Witten, Garaudy Etienne, Tommaso Tocci
Votes: 0
Watchers: 14

Created: Oct 07 2022 02:06:40 PM UTC
Updated: Nov 30 2022 05:52:35 PM UTC
Resolved: Nov 01 2022 02:19:40 AM UTC
