SERVER-54805 describes a case where the replication machinery on replica set secondaries ceases to make progress, with the symptom being that all threads in the replication writer thread pool are idle and the thread driving secondary replication is simultaneously blocked waiting for those writer threads to finish their work.
So far, this behavior has manifested only on systems with glibc versions susceptible to this glibc pthread condition variable bug. While I have not been able to build a minimal reproduction using our ThreadPool type, the scenario proven to exist in this blog post about using TLA+ to model glibc condition variables is perfectly analogous to how replication uses thread pools. In that scenario, a signal delivery lost to the glibc bug leaves unfinished work sitting in the thread pool, with no thread waking up to perform it.
Fortunately, there is a low-risk workaround for this bug as it manifests in the replication system's use of ThreadPool. Setting minThreads to 0, instead of its current value of maxThreads, ensures that every wait performed by a worker thread eventually wakes up when the idle thread timeout expires. A lost signal can then delay queued work, but it can no longer strand it.
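To illustrate why a timed wait is enough, here is a minimal sketch in Python. It is not our ThreadPool and not a reproduction of the glibc bug: it uses a single worker, and it imitates the lost signal by enqueuing a task without ever notifying the condition variable. The function name `lost_signal_is_recovered` and the parameters `idle_timeout` and `deadline` are hypothetical, chosen only for this example. Passing `idle_timeout=None` stands in for the current configuration, where workers wait indefinitely, and a finite value stands in for minThreads set to 0.

```python
import collections
import threading


def lost_signal_is_recovered(idle_timeout, deadline=2.0):
    """Return True if a task queued WITHOUT a notify is still executed.

    idle_timeout=None models workers that wait forever for a signal.
    A finite idle_timeout models workers whose idle wait always expires.
    """
    tasks = collections.deque()
    cond = threading.Condition()
    executed = threading.Event()   # set by the task itself when it runs
    idle = threading.Event()       # set once the worker is about to wait
    shutdown = threading.Event()

    def worker():
        with cond:
            while not shutdown.is_set():
                if tasks:
                    # A real pool would drop the lock while running the task;
                    # the task here is trivial, so the sketch keeps it simple.
                    tasks.popleft()()
                    continue
                idle.set()
                # With a finite timeout this wait always returns eventually,
                # and the loop re-checks the queue even if no signal arrived.
                cond.wait(timeout=idle_timeout)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    # Make sure the worker is parked in wait() before work arrives.
    idle.wait()
    with cond:
        # Imitate the lost signal: enqueue work but never call notify().
        tasks.append(executed.set)

    recovered = executed.wait(timeout=deadline)

    # Clean up so the worker exits in both configurations.
    shutdown.set()
    with cond:
        cond.notify_all()
    thread.join(timeout=deadline)
    return recovered
```

With a finite `idle_timeout`, the worker's wait expires, it re-checks the queue, and the task runs. With `idle_timeout=None`, the task stays queued until something else signals the worker, which is the stuck state described above.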
The task in this ticket is to change the value of minThreads in the writer thread pool used by replication to 0. This will not eliminate every possible failure caused by the glibc bug, but it will eliminate the only one we have seen in practice, and it will remain in place until the bug in glibc is corrected.