[SERVER-57941] Make it safe to shutdown a ThreadPoolTaskExecutor and backing NetworkInterfaceTL before starting it up Created: 22/Jun/21  Updated: 14/Jan/22  Resolved: 14/Jan/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: George Wangensteen Assignee: Vojislav Stojkovic
Resolution: Won't Fix Votes: 0
Labels: servicearch-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Sprint: Service Arch 2021-11-15, Service Arch 2021-12-13, Service Arch 2022-1-10, Service Arch 2022-1-24
Participants:
Story Points: 4

 Description   

If we call shutdown on a NetworkInterfaceTL before it starts up, we simply move kStopped into its _state member and do nothing else: see here

void NetworkInterfaceTL::shutdown() {
    if (_state.swap(kStopped) != kStarted)
        return;

This becomes a problem if startup is ever called on the same NetworkInterfaceTL before shutdown completes; we'll fail this invariant in startup here:

void NetworkInterfaceTL::startup() {
    stdx::lock_guard<Latch> lk(_mutex);
 
    _ioThread = stdx::thread([this] {
        setThreadName(_instanceName);
        _run();
    });
 
    invariant(_state.swap(kStarted) == kDefault);
}

because the _state will will been kStopped, not kDefault, before this was called.

This is also a problem for the shutdown of ThreadPoolTaskExecutors, as they wrap a NetworkInterfaceTL. On startup, these executors call startup on the backing NetworkInterfaceTL. And if join is called on such an executor, it will call shutdown on the underlying NetworkInterface. So if join is called before startup is on such an executor, we'll encounter the same invariant.

Acceptance criteria: make the invariant an exception/investigate it thoroughly and ensure lifetime semantics of NetworkInterfaceTL is clear



 Comments   
Comment by Vojislav Stojkovic [ 14/Jan/22 ]

After some discussion, with george.wangensteen and amirsaman.memaripour, we agreed that the best course of action right now is to close this ticket without fixing it.

In order to satisfy the acceptance criteria, we would have to change some invariants in NetworkInterfaceTL::startup and ThreadPoolTaskExecutor::startup into {{iassert}}s. This, in turn, could introduce new bugs in the existing code that might be relying on these invariants being invariants. The exception thrown might end up being caught, converted into error code, and passed up to the code that doesn't expect it or know how to deal with it. In order to ensure, this doesn't happen, we would have to audit all the call sites and trace the logic from there outwards, until we're reasonably sure we're not introducing new bugs. A cursory search reveals approximately 46 such call sites.

On the other hand, the failure the current invariant produces depends on a very specific set of conditions – shutting down the process soon enough after starting it, so that the shutdown of NITL occurs before startup – which should not happen in a production deployment under any normal circumstances. In fact, it seems to occur only occasionally in tests and only on a build variant that slows down the code through instrumentation.

Given that there already exists a project (PM-1962) to establish a clear shutdown procedure, the effort needed to implement this fix and be reasonably sure it doesn't break anything seems to outweigh the benefit.

Generated at Thu Feb 08 05:43:10 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.