Note: this deadlock is similar to SERVER-28688, but it is a different issue.
Note: I observed this deadlock in 3.6.1.
ReplicationCoordinatorExternalStateImpl::shutdown calls _taskExecutor->join() while holding _threadMutex. In most cases the worker threads have no tasks, so _taskExecutor->join() returns immediately. In rare situations, however, DropPendingCollectionReaper has collections to drop, and while these tasks are running the signal processing thread keeps _threadMutex locked. If the replication logic decides to step down at that moment, a deadlock results, because ReplicationCoordinatorExternalStateImpl::startProducerIfStopped tries to acquire _threadMutex while holding the global exclusive lock. Once startProducerIfStopped is waiting for _threadMutex, the drop collection tasks are in turn blocked on the global lock, so _taskExecutor->join() never returns.
The attached file contains the output of the mongodb-waitsfor-graph, mongodb-show-locks, and mongodb-uniqstack commands. In this file:
- thread 2 (signalProcessingThread) owns the _threadMutex lock (acquired in ReplicationCoordinatorExternalStateImpl::shutdown)
and waits for the worker threads to shut down (_taskExecutor->shutdown(); _taskExecutor->join())
- thread 47 ("replexec-9") waits for _threadMutex (owned by thread 2)
it is processing the _stepDownFinish event
which calls _updateMemberStateFromTopologyCoordinator_inlock
which calls startProducerIfStopped
which tries to acquire _threadMutex
- thread 48 (worker thread executing the dropCollection task)
waits for the global lock owned by thread 47
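
The three waits listed above form a cycle. The following is a minimal standalone sketch of that wait-for graph, not MongoDB code: the names WaitsFor, findCycle, and reportedGraph are hypothetical, and the edge from thread 2 to thread 48 is an assumption that models join() as waiting on the worker thread that is still running a task.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Hypothetical model: each blocked thread maps to the one thread it waits on.
using WaitsFor = std::map<int, int>;

// Follows wait-for edges from `start`. Returns the threads that form a cycle,
// or an empty vector if the chain ends at a thread that is not blocked.
std::vector<int> findCycle(const WaitsFor& graph, int start) {
    std::vector<int> path;
    int current = start;
    while (true) {
        // Revisiting a thread already on the path closes a cycle.
        auto seen = std::find(path.begin(), path.end(), current);
        if (seen != path.end()) {
            return std::vector<int>(seen, path.end());
        }
        path.push_back(current);
        auto edge = graph.find(current);
        if (edge == graph.end()) {
            return {};  // `current` is not waiting on anyone
        }
        current = edge->second;
    }
}

// The waits described in this report, using the thread numbers from the
// attached file.
WaitsFor reportedGraph() {
    return {
        {2, 48},   // signalProcessingThread: join() on the worker, holding _threadMutex
        {48, 47},  // dropCollection worker: waits for the global lock held by replexec-9
        {47, 2},   // replexec-9: waits for _threadMutex held by thread 2
    };
}
```

Calling findCycle(reportedGraph(), 2) returns the threads 2, 48, 47, which is the cycle described in this report. Removing any one of the three edges makes the result empty.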