Core Server / SERVER-70340

Range deleter can cause migration lock acquisition timeouts

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 6.0.2
    • Component/s: None
    • ALL

      Create a two-shard cluster. Insert all the data while the collection is unsharded, create 10 indexes, and then shard the collection.

      The data will be migrated from one shard to the other, and after the 15-minute range deletion timeout we will start seeing critical section releases fail on the donor shard because of failures to acquire the critical section.

    • Sharding EMEA 2022-10-17, Sharding EMEA 2022-10-31, Sharding EMEA 2022-11-14

      This ticket is to investigate the lock timeouts caused by the range deleter on v6.0. The release of the critical section is timing out while trying to acquire the collection lock, which appears to be held by the range deleter.

      This is a problem on 6.0 because the ScopedRangeDeleterLock is acquired before running the deletion. This lock is necessary to prevent the FCV upgrade orphan-counter code from setting incorrect counters on the range deletions. The problem we are running into here is that the ScopedRangeDeleterLock acquires the DBLock on the config database, which automatically acquires the GlobalLock; but we have already acquired the DBLock on the user database, which has already automatically acquired the GlobalLock. This double acquisition means that we are locking the GlobalLock recursively, so the yield policy is replaced with NO_YIELD.

      We also cannot change the ScopedRangeDeleterLock to never acquire the GlobalLock, because some callers of the persistUpdatedNumOrphans function during migrations do not already hold it.

      One option would be to change the ScopedRangeDeleterLock to acquire the GlobalLock conditionally, based on whether we already hold it, by replacing this line with the code below.

       

      _configLock(opCtx, NamespaceString::kConfigDb, MODE_IX, Date_t::max(),
                  opCtx->lockState()->isLocked() /* skip the GlobalLock if already held */)
      

      But it may be better to find a more general solution that also addresses the broader problem with the ScopedRangeDeleterLock tracked in SERVER-70322.

       

       

            Assignee:
            Tommaso Tocci (tommaso.tocci@mongodb.com)
            Reporter:
            Allison Easton (allison.easton@mongodb.com)
            Votes:
            0
            Watchers:
            14
