Core Server / SERVER-75830

Fix acquisition-level inversion between NetworkInterfaceTL::_mutex and executor::ConnectionPool::_mutex


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 7.1.0-rc0
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: Service Arch 2023-04-17, Service Arch 2023-05-01

    Description

      NetworkInterfaceTL::_mutex exists at HierarchicalAcquisitionLevel(3), while executor::ConnectionPool::_mutex exists at HierarchicalAcquisitionLevel(1). When the ConnectionPool readies pending requests for connections, it emplaces the connections in promises that correspond to SemiFutures here: https://github.com/mongodb/mongo/blob/45d2505b998a717c2f2e155fea3e71e9c703803e/src/mongo/executor/connection_pool.cpp#L1201 . This is done while the ConnectionPool mutex is held. However, if those SemiFutures are converted to ExecutorFutures via `.thenRunOn`, the ConnectionPool is forced to call `OutOfLineExecutor::schedule` after emplacing the value. If the executor is a ThreadPoolTaskExecutor, this can result in an attempt to acquire NetworkInterfaceTL::_mutex via a call to NetworkInterfaceTL::signalWorkAvailable. This is a lock-ordering violation and results in an fassert/crash on debug builds.
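      The following is a minimal, self-contained C++ sketch of the shape of the problem; the mutex and function names are illustrative stand-ins, not the real mongo types. It shows how fulfilling a request while a level-1 pool mutex is held can call into an executor's schedule path that takes the level-3 network-interface mutex, which is the inversion the hierarchy check flags.

        #include <functional>
        #include <iostream>
        #include <mutex>
        #include <utility>

        // Illustrative stand-ins only; these are not the real mongo types.
        std::mutex networkInterfaceMutex;  // plays the role of NetworkInterfaceTL::_mutex (level 3)
        std::mutex connectionPoolMutex;    // plays the role of executor::ConnectionPool::_mutex (level 1)

        // Stand-in for a ThreadPoolTaskExecutor's schedule(): waking the network
        // interface (signalWorkAvailable) takes the network-interface mutex.
        void scheduleOnTaskExecutor(std::function<void()> task) {
            std::lock_guard<std::mutex> lk(networkInterfaceMutex);  // level-3 mutex acquired here
            task();  // run inline for the sake of the sketch
        }

        // Stand-in for the ConnectionPool readying a pending request: the promise is
        // fulfilled while the pool mutex is held, and because the caller turned the
        // SemiFuture into an ExecutorFuture with .thenRunOn(), fulfilling it
        // immediately calls the executor's schedule().
        void fulfillPendingRequest(std::function<void()> continuation) {
            std::lock_guard<std::mutex> lk(connectionPoolMutex);   // level-1 mutex held...
            scheduleOnTaskExecutor(std::move(continuation));       // ...then level 3 taken: inversion
        }

        int main() {
            fulfillPendingRequest([] { std::cout << "continuation ran\n"; });
        }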

       

      We haven't seen this yet, because continuations off the futures returned by ConnectionPool::get are only ever scheduled on the reactor, whose scheduling function does not attempt to acquire the NetworkInterfaceTL's mutex. However, the PinnedConnectionTaskExecutor does chain a continuation that is scheduled on a ThreadPoolTaskExecutor, so this fassert has become possible in debug mode.

       

      I don't think an actual bug/deadlock is possible here, for the following reason. NetworkInterfaceTL::_mutex is only acquired in five places that I can identify:

      • NetworkInterfaceTL::appendConnectionStats
      • NetworkInterfaceTL::startup
      • NetworkInterfaceTL::waitForWork
      • NetworkInterfaceTL::waitForWorkUntil
      • NetworkInterfaceTL::signalWorkAvailable

      None of these places then attempts to acquire ConnectionPool::_mutex while NetworkInterfaceTL::_mutex is held, so the inverted ordering elsewhere should never produce an actual deadlock. I therefore think this is a debug-only/test-only problem, but one we need to fix anyway for testing, and we should consider doing so in a way that lets us keep the lock hierarchies.
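      To make the "no deadlock" argument concrete, here is a small hedged sketch, again with stand-in names rather than the real types. A deadlock needs a cycle in acquisition order; since the only ordering that exists in practice is pool mutex then network-interface mutex, and none of the five call sites above takes the pool mutex while holding the network-interface mutex, no cycle can form.

        #include <mutex>
        #include <thread>

        // Same illustrative stand-ins as in the sketch above.
        std::mutex networkInterfaceMutex;  // NetworkInterfaceTL::_mutex (level 3)
        std::mutex connectionPoolMutex;    // executor::ConnectionPool::_mutex (level 1)

        // The only ordering that occurs in practice: pool mutex first, then the
        // network-interface mutex via the executor's schedule()/signalWorkAvailable.
        void poolThenNetworkInterface() {
            std::lock_guard<std::mutex> pool(connectionPoolMutex);
            std::lock_guard<std::mutex> ni(networkInterfaceMutex);
        }

        // The five NetworkInterfaceTL call sites take only their own mutex and never
        // reach for the pool mutex while holding it, so the reverse ordering
        // (network interface first, then pool) never happens and no cycle can form.
        // The inversion therefore only trips the debug-build hierarchy check.
        void networkInterfaceOnly() {
            std::lock_guard<std::mutex> ni(networkInterfaceMutex);
            // ... appendConnectionStats / waitForWork / signalWorkAvailable work ...
        }

        int main() {
            std::thread a(poolThenNetworkInterface);
            std::thread b(networkInterfaceOnly);
            a.join();
            b.join();
        }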

       

      Update: upon investigation, we realized the NetworkInterfaceTL::_mutex was no longer required and removed it entirely.


      People

        Assignee: George Wangensteen (george.wangensteen@mongodb.com)
        Reporter: George Wangensteen (george.wangensteen@mongodb.com)
        Votes: 0
        Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: