[SERVER-60161] Deadlock between config server stepdown and _configsvrRenameCollectionMetadata command Created: 22/Sep/21  Updated: 29/Oct/23  Resolved: 01/Oct/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.0.0
Fix Version/s: 5.0.4, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Jordi Serra Torrens
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-70003 Alternative client for deleting range... Closed
related to SERVER-70127 Default system operations to be killa... Closed
is related to SERVER-59226 Deadlock when stepping down with a pr... Closed
is related to SERVER-52564 Deadlock between step down and MongoD... Closed
is related to SERVER-55007 Deadlock between step down and MongoD... Closed
is related to SERVER-55573 Deadlock between stepdown and chunk m... Closed
is related to SERVER-58364 ShardServerCatalogCacheLoader::waitFo... Closed
is related to SERVER-58775 Mark ConfigsvrSetAllowMigrationsComma... Closed
is related to SERVER-59329 Make sure that withTemporaryOperation... Closed
is related to SERVER-60521 Deadlock on stepup due to moveChunk c... Closed
is related to SERVER-60958 Avoid server hang in chunk migration ... Closed
is related to SERVER-70873 Stepdown during drop collection can l... Closed
is related to SERVER-70888 ScopedRangeDeleterLock might lead to ... Closed
is related to SERVER-76273 SessionCatalogMigrationDestination is... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Sprint: Sharding EMEA 2021-10-04
Participants:
Linked BF Score: 135

 Description   

The OperationContext for ShardingCatalogManager::renameShardedMetadata() has a logical session checked out while doing an uninterruptible wait on the _kChunkOpLock. If the _kChunkOpLock is currently held (e.g. by a running _configsvrSetAllowMigrations command), then _configsvrRenameCollectionMetadata will block until the _kChunkOpLock is released. In particular, the _configsvrSetAllowMigrations command will acquire the _kChunkOpLock and then attempt to acquire additional LockManager locks such as the RSTL IX lock. If a stepdown occurs on the primary, then the RstlKillOpThread interrupts the OperationContext running ShardingCatalogManager::renameShardedMetadata(). But because the wait is uninterruptible, no attention is given to the kill status. ReplicationCoordinatorImpl::_stepDownFinish() will then block while holding the RSTL X lock, because invalidateSessionsForStepdown() attempts to check out the logical session in order to kill it.

  • _configsvrRenameCollectionMetadata (holding "logical session" resource) -> _kChunkOpLock
  • _configsvrSetAllowMigrations (holding _kChunkOpLock) -> RSTL IX lock
  • Stepdown (holding RSTL X lock) -> acquiring "logical session" resource
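
The three edges above form a cycle in a wait-for graph. The following is a minimal, self-contained sketch of that reasoning and does not use any server code: the node names are just labels taken from this ticket, and WaitForGraph, buildTicketGraph and hasCycleFrom are hypothetical helpers introduced for illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Wait-for graph: each blocked participant maps to the participant that holds
// the resource it is waiting on. Names are labels from the ticket only.
using WaitForGraph = std::map<std::string, std::string>;

// Builds the three edges listed in the ticket description.
WaitForGraph buildTicketGraph() {
    return {
        // Waits for _kChunkOpLock, held by _configsvrSetAllowMigrations.
        {"_configsvrRenameCollectionMetadata", "_configsvrSetAllowMigrations"},
        // Waits for the RSTL IX lock, blocked by the stepdown's RSTL X lock.
        {"_configsvrSetAllowMigrations", "Stepdown"},
        // Waits for the logical session, checked out by the rename command.
        {"Stepdown", "_configsvrRenameCollectionMetadata"},
    };
}

// Returns true if following wait-for edges from `start` leads back to `start`.
bool hasCycleFrom(const WaitForGraph& graph, const std::string& start) {
    assert(!start.empty());  // precondition: a named participant
    std::set<std::string> seen;
    std::string current = start;
    while (true) {
        auto it = graph.find(current);
        if (it == graph.end()) {
            return false;  // `current` is not blocked, so the chain ends here
        }
        current = it->second;
        if (current == start) {
            return true;  // walked back to the starting participant
        }
        if (!seen.insert(current).second) {
            return false;  // looped, but not through `start`
        }
    }
}
```

Calling hasCycleFrom(buildTicketGraph(), "Stepdown") reports the cycle. Erasing the "_configsvrRenameCollectionMetadata" entry models that command giving up its wait when it is killed, after which no cycle remains.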

I think the solution here would be to make the _kChunkOpLock and _kZoneOpLock acquisitions interruptible by using the 3-argument constructor for Lock::ExclusiveLock.

Lock::ExclusiveLock chunkLk(opCtx, opCtx->lockState(), _kChunkOpLock);
Lock::ExclusiveLock zoneLk(opCtx, opCtx->lockState(), _kZoneOpLock);
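
To illustrate the difference between an uninterruptible and an interruptible acquisition in general terms, here is a standalone sketch built only on the C++ standard library. It is not the server's Lock::ExclusiveLock implementation: FakeOpCtx, Interrupted, lockInterruptibly and killedWaiterGivesUp are hypothetical names, and the polling loop is an assumption chosen for brevity rather than a description of how the LockManager waits.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for an operation's kill status (not OperationContext).
struct FakeOpCtx {
    std::atomic<bool> killed{false};
};

// Thrown when the wait is abandoned because the operation was killed.
struct Interrupted : std::runtime_error {
    Interrupted() : std::runtime_error("operation was interrupted") {}
};

// Acquires `mtx`, re-checking the kill flag between short timed attempts.
// An uninterruptible wait would call mtx.lock() and never look at the flag.
std::unique_lock<std::timed_mutex> lockInterruptibly(FakeOpCtx& opCtx,
                                                     std::timed_mutex& mtx) {
    std::unique_lock<std::timed_mutex> lk(mtx, std::defer_lock);
    while (!lk.try_lock_for(std::chrono::milliseconds(10))) {
        if (opCtx.killed.load()) {
            throw Interrupted();
        }
    }
    return lk;
}

// Demonstration: a waiter blocked behind another holder gives up once killed.
bool killedWaiterGivesUp() {
    FakeOpCtx opCtx;
    std::timed_mutex chunkOpLock;  // stand-in for _kChunkOpLock
    std::atomic<bool> held{false};
    std::atomic<bool> release{false};

    // Another thread holds the mutex until told to release it.
    std::thread holder([&] {
        std::lock_guard<std::timed_mutex> guard(chunkOpLock);
        held.store(true);
        while (!release.load()) {
            std::this_thread::yield();
        }
    });
    while (!held.load()) {
        std::this_thread::yield();
    }

    opCtx.killed.store(true);  // models the kill delivered during stepdown
    bool interrupted = false;
    try {
        auto lk = lockInterruptibly(opCtx, chunkOpLock);
        (void)lk;
    } catch (const Interrupted&) {
        interrupted = true;
    }

    release.store(true);
    holder.join();
    assert(held.load());  // the holder really did own the mutex during the wait
    return interrupted;
}
```

In this sketch the killed waiter throws instead of blocking forever, which is the property the ticket asks for: once the rename operation unwinds, its logical session can be released and the stepdown can proceed.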



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 01/Oct/21 ]

Author:

{'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'}

Message: SERVER-60161 Make _kChunkOpLock and _kZoneOpLock acquisitions interruptable

(cherry picked from commit c6884c0b66da1c23499f12f18688c9cd8f0619f1)
Branch: v5.0
https://github.com/mongodb/mongo/commit/aaa04f81be98415e75c146e227803fa4df8bb732

Comment by Githook User [ 01/Oct/21 ]

Author:

{'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'}

Message: SERVER-60161 Make _kChunkOpLock and _kZoneOpLock acquisitions interruptable
Branch: master
https://github.com/mongodb/mongo/commit/c6884c0b66da1c23499f12f18688c9cd8f0619f1
