The OperationContext for ShardingCatalogManager::renameShardedMetadata() has a logical session checked out while doing an uninterruptible wait on the _kChunkOpLock. If the _kChunkOpLock is currently held (e.g. by a running _configsvrSetAllowMigrations command), then _configsvrRenameCollectionMetadata will block until the _kChunkOpLock is released. In particular, the _configsvrSetAllowMigrations command will acquire the _kChunkOpLock and then attempt to acquire additional LockManager locks such as the RSTL IX lock. If a stepdown occurs on the primary, then the RstlKillOpThread will interrupt the OperationContext running ShardingCatalogManager::renameShardedMetadata(). But because the wait is uninterruptible, no attention is given to the kill status. ReplicationCoordinatorImpl::_stepDownFinish() will then block while holding the RSTL X lock, because invalidateSessionsForStepdown() attempts to check out the logical session in order to kill it. The result is a three-way deadlock:
- _configsvrRenameCollectionMetadata (holding "logical session" resource) -> _kChunkOpLock
- _configsvrSetAllowMigrations (holding _kChunkOpLock) -> RSTL IX lock
- Stepdown (holding RSTL X lock) -> acquiring "logical session" resource
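
To make the cycle explicit, the three edges above can be modelled as a tiny wait-for graph. This is only an illustrative sketch: the string labels, the WaitForGraph alias, and the hasCycle() and stepdownScenario() helpers are hypothetical and are not part of the server code.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical model: each waiter maps to the single party it is blocked on.
using WaitForGraph = std::map<std::string, std::string>;

// Follows wait-for edges from `start`; returns true if a party is visited twice.
bool hasCycle(const WaitForGraph& g, const std::string& start) {
    std::set<std::string> seen;
    std::string cur = start;
    while (g.count(cur)) {
        if (!seen.insert(cur).second) {
            return true;  // already visited, so the chain loops back
        }
        cur = g.at(cur);
    }
    return false;  // reached a party that is not waiting on anything
}

// The three edges described in the list above.
WaitForGraph stepdownScenario() {
    return {
        // Holds the logical session, waits for _kChunkOpLock.
        {"renameCollectionMetadata", "setAllowMigrations"},
        // Holds _kChunkOpLock, waits for RSTL IX behind the stepdown's RSTL X.
        {"setAllowMigrations", "stepdown"},
        // Holds RSTL X, waits to check out the logical session.
        {"stepdown", "renameCollectionMetadata"},
    };
}
```

In this model, hasCycle(stepdownScenario(), "stepdown") reports a cycle. Erasing the "renameCollectionMetadata" edge, which corresponds to the rename giving up its wait when killed, leaves a chain with no cycle.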
I think the solution here would be to make the _kChunkOpLock and _kZoneOpLock acquisitions interruptible by using the 3-argument constructor for Lock::ExclusiveLock.
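
As a rough illustration of why an interruptible acquisition breaks the cycle, here is a minimal standalone sketch using only the C++ standard library. It does not use or reproduce the Lock::ExclusiveLock API; OpState, InterruptibleLock, and renameGivesUpWhenKilled() are hypothetical names invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for an operation's kill status (not OperationContext).
struct OpState {
    std::atomic<bool> killed{false};
};

// Hypothetical exclusive lock whose waiters re-check the kill status when woken.
class InterruptibleLock {
public:
    // Returns true once the lock is held, false if `op` was killed first.
    bool lock(const OpState& op) {
        std::unique_lock<std::mutex> lk(_m);
        _cv.wait(lk, [&] { return !_held || op.killed.load(); });
        if (op.killed.load()) {
            return false;  // give up instead of continuing to wait
        }
        _held = true;
        return true;
    }

    void unlock() {
        {
            std::lock_guard<std::mutex> lk(_m);
            _held = false;
        }
        _cv.notify_all();
    }

    // Marks `op` as killed and wakes all waiters so they observe it.
    void kill(OpState& op) {
        {
            std::lock_guard<std::mutex> lk(_m);
            op.killed.store(true);
        }
        _cv.notify_all();
    }

private:
    std::mutex _m;
    std::condition_variable _cv;
    bool _held = false;
};

// One operation holds the lock; a second one waits on it and is then killed.
bool renameGivesUpWhenKilled() {
    InterruptibleLock chunkOpLock;
    OpState setAllowMigrations, rename;

    bool held = chunkOpLock.lock(setAllowMigrations);
    assert(held);
    (void)held;

    bool acquired = true;
    std::thread waiter([&] { acquired = chunkOpLock.lock(rename); });

    // Give the waiter time to block; the result does not depend on this delay.
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    chunkOpLock.kill(rename);  // plays the role of the stepdown's kill
    waiter.join();

    chunkOpLock.unlock();
    return !acquired;  // true: the waiter returned without taking the lock
}
```

Once the waiting operation returns on kill, it can unwind and release whatever it was holding (in the scenario above, the logical session), so the stepdown is no longer blocked behind it.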