In ReplicationCoordinatorImpl::_scheduleNextLivenessUpdate_inlock(), we do not schedule a new liveness update if the nextTimeout would be in the past. This is wrong; we should schedule an immediate liveness update in that case.
One scenario is that we have just run our liveness check and the earliest live member was just barely fresh ("almost stale"), so we do nothing. A small time passes before we schedule the new one, and now that member is stale, so the next timeout period is in the past. We then stop doing liveness checks.