Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 7.1.0-rc0
Affects Version/s: None
Component/s: None
Labels:
None

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Sprint:
Service Arch 2023-04-17, Service Arch 2023-05-01
Linked BF Score:
120
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

NetworkInterfaceTL::_mutex exists at HierarchicalAcquisitionLevel(3), while the executor::ConnectionPool::_mutex exists at HierarchicalAcquisitionLevel(1). When ConnectionPool readies pending requests for connections, it emplaces the connections in promises that correspond to SemiFuture's here: https://github.com/mongodb/mongo/blob/45d2505b998a717c2f2e155fea3e71e9c703803e/src/mongo/executor/connection_pool.cpp#L1201 . This is done while the ConnectionPool mutex is held. However, if those SemiFutures are converted to ExecutorFutures via `.thenRunOn`, the connection-pool is forced to call `OutOfLineExecutor::schedule` after emplacing the value. If the executor is a ThreadPoolTaskExecutor, this can result in an attempt to acquire the NetworkInterfaceTL::_mutex via a call to NetworkInterfaceTL::signalWorkAvailable here. This is a lock-ordering violation and results in an fassert/crash on debug builds.

We haven't seen this yet, because continuations off the futures returned by ConnectionPool::get are only ever scheduled on the reactor, whose scheduling function does not attempt to acquire the NetworkInterfaceTL's mutex. However, the PinnedConnectionTaskExecutor does chain a continuation that is scheduled on a TPTE, so this fassert in debug-mode has become possible.

I do think there is no actual bug/deadlock possible here, for the following reason. The NetworkInterfaceTL::_mutex is only acquired in 5 places that I can identify:

NetworkInterfaceTL::appendConnectionStats
NetworkInterfaceTL::startup
NetworkInterfaceTL::waitForWork
NetworkInterfaceTL::waitForWorkUntil
NetworkInterfaceTL::signalWorkAvailable

None of these places attempt to then acquire the ConnectionPool::_mutex while the NetworkInterfaceTL::_mutex is held, so it shouldn't ever deadlock due to the ordering being inverted elsewhere. So I think this is a debug-only/test-only problem, but one we need to fix anyway for test and should consider doing in a a way that lets us keep the lock-hierarchies.

Update: upon investigation, we realized the NetworkInterfaceTL mutex was no longer required and removed it entirely

Assignee:: George Wangensteen (Inactive)
Reporter:: George Wangensteen (Inactive)
Participants:: George Wangensteen, Githook User
Votes:: 0 Vote for this issue
Watchers:: 4 Start watching this issue

Created:: Apr 07 2023 01:52:58 PM UTC
Updated:: Oct 29 2023 09:23:25 PM UTC
Resolved:: Apr 26 2023 05:14:57 PM UTC

Details

Description

Attachments

Activity

People

Dates