[SERVER-60161] Deadlock between config server stepdown and _configsvrRenameCollectionMetadata command Created: 22/Sep/21 Updated: 29/Oct/23 Resolved: 01/Oct/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 5.0.0 |
| Fix Version/s: | 5.0.4, 5.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Max Hirschhorn | Assignee: | Jordi Serra Torrens |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Backport Requested: |
v5.0
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Sprint: | Sharding EMEA 2021-10-04 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Linked BF Score: | 135 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
The OperationContext for ShardingCatalogManager::renameShardedMetadata() has a logical session checked out while doing an uninterruptible wait on the _kChunkOpLock. If the _kChunkOpLock is currently held (e.g. from a running _configsvrSetAllowMigrations command), then _configsvrRenameCollectionMetadata will block until the _kChunkOpLock is released. In particular, the _configsvrSetAllowMigrations command will acquire the _kChunkOpLock and then attempt to acquire additional LockManager locks such as the RSTL IX lock. If a stepdown occurs on the primary, then the RstlKillOpThread interrupt the OperationContext running ShardingCatalogManager::renameShardedMetadata(). But the uninterruptible wait means that the no attention is given to the kill status. ReplicationCoordinatorImpl::_stepDownFinish() will then block attempting to check out the logical session to kill it as part of invalidateSessionsForStepdown() while holding the RSTL X lock.
I think the solution here would be to make the _kChunkOpLock and _kZoneOpLock acquisitions interruptible by using the 3-argument constructor for Lock::ExclusiveLock.
|
| Comments |
| Comment by Vivian Ge (Inactive) [ 06/Oct/21 ] |
|
Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you! |
| Comment by Githook User [ 01/Oct/21 ] |
|
Author: {'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'}Message: (cherry picked from commit c6884c0b66da1c23499f12f18688c9cd8f0619f1) |
| Comment by Githook User [ 01/Oct/21 ] |
|
Author: {'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'}Message: |