[SERVER-75830] Fix acquisition-level inversion between NetworkInterfaceTL::_mutex and executor::ConnectionPool::_mutex Created: 07/Apr/23 Updated: 29/Oct/23 Resolved: 26/Apr/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | George Wangensteen | Assignee: | George Wangensteen |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Backwards Compatibility: | Fully Compatible | ||||
| Operating System: | ALL | ||||
| Sprint: | Service Arch 2023-04-17, Service Arch 2023-05-01 | ||||
| Participants: | |||||
| Linked BF Score: | 120 | ||||
| Description |
|
NetworkInterfaceTL::_mutex exists at HierarchicalAcquisitionLevel(3), while the executor::ConnectionPool::_mutex exists at HierarchicalAcquisitionLevel(1). When ConnectionPool readies pending requests for connections, it emplaces the connections in promises that correspond to SemiFuture's here: https://github.com/mongodb/mongo/blob/45d2505b998a717c2f2e155fea3e71e9c703803e/src/mongo/executor/connection_pool.cpp#L1201 . This is done while the ConnectionPool mutex is held. However, if those SemiFutures are converted to ExecutorFutures via `.thenRunOn`, the connection-pool is forced to call `OutOfLineExecutor::schedule` after emplacing the value. If the executor is a ThreadPoolTaskExecutor, this can result in an attempt to acquire the NetworkInterfaceTL::_mutex via a call to NetworkInterfaceTL::signalWorkAvailable here. This is a lock-ordering violation and results in an fassert/crash on debug builds.
We haven't seen this yet, because continuations off the futures returned by ConnectionPool::get are only ever scheduled on the reactor, whose scheduling function does not attempt to acquire the NetworkInterfaceTL's mutex. However, the PinnedConnectionTaskExecutor does chain a continuation that is scheduled on a TPTE, so this fassert in debug-mode has become possible.
I do think there is no actual bug/deadlock possible here, for the following reason. The NetworkInterfaceTL::_mutex is only acquired in 5 places that I can identify:
None of these places attempt to then acquire the ConnectionPool::_mutex while the NetworkInterfaceTL::_mutex is held, so it shouldn't ever deadlock due to the ordering being inverted elsewhere. So I think this is a debug-only/test-only problem, but one we need to fix anyway for test and should consider doing in a a way that lets us keep the lock-hierarchies.
Update: upon investigation, we realized the NetworkInterfaceTL mutex was no longer required and removed it entirely |
| Comments |
| Comment by Githook User [ 26/Apr/23 ] |
|
Author: {'name': 'George Wangensteen', 'email': 'george.wangensteen@mongodb.com', 'username': 'gewa24'}Message: |