[SERVER-57941] Make it safe to shutdown a ThreadPoolTaskExecutor and backing NetworkInterfaceTL before starting it up Created: 22/Jun/21 Updated: 14/Jan/22 Resolved: 14/Jan/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | George Wangensteen | Assignee: | Vojislav Stojkovic |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | servicearch-wfbf-day | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Sprint: | Service Arch 2021-11-15, Service Arch 2021-12-13, Service Arch 2022-1-10, Service Arch 2022-1-24 |
| Participants: | |
| Story Points: | 4 |
| Description |
|
If we call shutdown on a NetworkInterfaceTL before it starts up, we simply move kStopped into its _state member and do nothing else: see here
This becomes a problem if startup is ever called on the same NetworkInterfaceTL before shutdown completes; we'll fail this invariant in startup here:
because the _state will will been kStopped, not kDefault, before this was called. This is also a problem for the shutdown of ThreadPoolTaskExecutors, as they wrap a NetworkInterfaceTL. On startup, these executors call startup on the backing NetworkInterfaceTL. And if join is called on such an executor, it will call shutdown on the underlying NetworkInterface. So if join is called before startup is on such an executor, we'll encounter the same invariant. Acceptance criteria: make the invariant an exception/investigate it thoroughly and ensure lifetime semantics of NetworkInterfaceTL is clear |
| Comments |
| Comment by Vojislav Stojkovic [ 14/Jan/22 ] |
|
After some discussion, with george.wangensteen and amirsaman.memaripour, we agreed that the best course of action right now is to close this ticket without fixing it. In order to satisfy the acceptance criteria, we would have to change some invariants in NetworkInterfaceTL::startup and ThreadPoolTaskExecutor::startup into {{iassert}}s. This, in turn, could introduce new bugs in the existing code that might be relying on these invariants being invariants. The exception thrown might end up being caught, converted into error code, and passed up to the code that doesn't expect it or know how to deal with it. In order to ensure, this doesn't happen, we would have to audit all the call sites and trace the logic from there outwards, until we're reasonably sure we're not introducing new bugs. A cursory search reveals approximately 46 such call sites. On the other hand, the failure the current invariant produces depends on a very specific set of conditions – shutting down the process soon enough after starting it, so that the shutdown of NITL occurs before startup – which should not happen in a production deployment under any normal circumstances. In fact, it seems to occur only occasionally in tests and only on a build variant that slows down the code through instrumentation. Given that there already exists a project (PM-1962) to establish a clear shutdown procedure, the effort needed to implement this fix and be reasonably sure it doesn't break anything seems to outweigh the benefit. |