Details
-
Bug
-
Resolution: Fixed
-
Major - P3
-
None
-
None
-
None
-
Fully Compatible
-
ALL
-
Service Arch 2023-04-17, Service Arch 2023-05-01
-
120
Description
NetworkInterfaceTL::_mutex exists at HierarchicalAcquisitionLevel(3), while the executor::ConnectionPool::_mutex exists at HierarchicalAcquisitionLevel(1). When ConnectionPool readies pending requests for connections, it emplaces the connections in promises that correspond to SemiFuture's here: https://github.com/mongodb/mongo/blob/45d2505b998a717c2f2e155fea3e71e9c703803e/src/mongo/executor/connection_pool.cpp#L1201 . This is done while the ConnectionPool mutex is held. However, if those SemiFutures are converted to ExecutorFutures via `.thenRunOn`, the connection-pool is forced to call `OutOfLineExecutor::schedule` after emplacing the value. If the executor is a ThreadPoolTaskExecutor, this can result in an attempt to acquire the NetworkInterfaceTL::_mutex via a call to NetworkInterfaceTL::signalWorkAvailable here. This is a lock-ordering violation and results in an fassert/crash on debug builds.
We haven't seen this yet, because continuations off the futures returned by ConnectionPool::get are only ever scheduled on the reactor, whose scheduling function does not attempt to acquire the NetworkInterfaceTL's mutex. However, the PinnedConnectionTaskExecutor does chain a continuation that is scheduled on a TPTE, so this fassert in debug-mode has become possible.
I do think there is no actual bug/deadlock possible here, for the following reason. The NetworkInterfaceTL::_mutex is only acquired in 5 places that I can identify:
- NetworkInterfaceTL::appendConnectionStats
- NetworkInterfaceTL::startup
- NetworkInterfaceTL::waitForWork
- NetworkInterfaceTL::waitForWorkUntil
- NetworkInterfaceTL::signalWorkAvailable
None of these places attempt to then acquire the ConnectionPool::_mutex while the NetworkInterfaceTL::_mutex is held, so it shouldn't ever deadlock due to the ordering being inverted elsewhere. So I think this is a debug-only/test-only problem, but one we need to fix anyway for test and should consider doing in a a way that lets us keep the lock-hierarchies.
Update: upon investigation, we realized the NetworkInterfaceTL mutex was no longer required and removed it entirely