Mitigate the effects of a long repl waiter list for eligible majority writes

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor - P4
    • Affects Version/s: None
    • Component/s: Replication

      A long repl waiter list creates a negative feedback loop: it slows down the system, which further increases the length of the list. The loop is briefly mentioned in the playbook, and a comment in HELP-63535 explains what happened in one customer instance.

      The root cause is that the mutex that serializes most repl behavior is held while traversing the waiter list, which prevents other repl work from making progress. _wakeReadyWaiters holds the lock and calls _doneWaitingForReplication on every waiter in WriteConcernWaiterList::setValueIf (up to the early exit in setValueIf).
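
      For context, here is a minimal, self-contained model of that pattern. This is not the actual mongod code; the class ReplCoordModel and everything in it are stand-ins invented for illustration. The only point it demonstrates is that a single mutex guards both the repl state and the waiter traversal:

{code:cpp}
// Minimal model of the current pattern (NOT the actual mongod code; the class
// and member names below are simplifications for illustration only). One mutex
// guards both the repl state and the waiter traversal, so a long waiter list
// blocks any other repl work that needs the mutex.
#include <cstdio>
#include <functional>
#include <list>
#include <mutex>
#include <utility>

struct OpTime {
    long ts = 0;
};

struct Waiter {
    OpTime opTime;
    std::function<void()> onSatisfied;
};

class ReplCoordModel {
public:
    void addWaiter(OpTime target, std::function<void()> onSatisfied) {
        std::lock_guard<std::mutex> lk(_mutex);
        _waiters.push_back({target, std::move(onSatisfied)});
    }

    // Analogue of _wakeReadyWaiters: the repl mutex is held for the entire pass.
    void wakeReadyWaiters(OpTime committed) {
        std::lock_guard<std::mutex> lk(_mutex);
        _currentCommittedSnapshot = committed;
        // Analogue of WriteConcernWaiterList::setValueIf: visit waiters while
        // still holding _mutex, so other repl work must wait for the whole pass.
        for (auto it = _waiters.begin(); it != _waiters.end();) {
            if (doneWaitingForReplication(it->opTime)) {  // reads state guarded by _mutex
                it->onSatisfied();
                it = _waiters.erase(it);
            } else {
                ++it;  // the real list is sorted, so it can exit early instead
            }
        }
    }

private:
    bool doneWaitingForReplication(OpTime target) const {
        return target.ts <= _currentCommittedSnapshot.ts;
    }

    std::mutex _mutex;                 // stands in for the mutex serializing repl behavior
    OpTime _currentCommittedSnapshot;  // stands in for _currentCommittedSnapshot
    std::list<Waiter> _waiters;        // stands in for WriteConcernWaiterList::_waiters
};

int main() {
    ReplCoordModel coord;
    coord.addWaiter({5}, [] { std::puts("w:majority write at opTime 5 acknowledged"); });
    coord.wakeReadyWaiters({10});  // acknowledges the waiter, but only while holding the mutex
}
{code}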

      The mutex is required because _doneWaitingForReplication uses _topCoord and _currentCommittedSnapshot (and a few others).

      There is already an optimization noted in a code comment: in most cases, checking _currentCommittedSnapshot alone is sufficient for majority writes. We could build a fast path for the majority writes that meet those existing optimization conditions. WriteConcernWaiterList::_waiters and WriteConcernWaiterList::setValueIf would stay unchanged; we would introduce an analogous container just for the eligible majority writes, guarded by its own mutex. To wake those waiters, we would copy the value of _currentCommittedSnapshot under the repl mutex, then pass the copy to the new container and process it outside the mutex.
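
      A rough sketch of what this could look like, under the assumptions above. The names here (EligibleMajorityWaiterList, fulfillUpTo, wakeEligibleMajorityWaiters) are hypothetical and do not exist in the codebase; the sketch only shows the intended locking structure: copy the snapshot under the repl mutex, then fulfill the eligible waiters under the new container's own mutex.

{code:cpp}
#include <functional>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

struct OpTime {
    long ts = 0;
};

// Hypothetical container for majority writes that meet the existing optimization
// conditions. It owns its own mutex, so fulfilling its waiters does not block
// other repl work.
class EligibleMajorityWaiterList {
public:
    void add(OpTime target, std::function<void()> onSatisfied) {
        std::lock_guard<std::mutex> lk(_mutex);
        _waiters.emplace(target.ts, std::move(onSatisfied));
    }

    // Fulfills every waiter whose target opTime is covered by the snapshot copy
    // it is handed. Only this container's mutex is taken.
    void fulfillUpTo(OpTime committedSnapshot) {
        std::vector<std::function<void()>> toNotify;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            auto end = _waiters.upper_bound(committedSnapshot.ts);
            for (auto it = _waiters.begin(); it != end; ++it) {
                toNotify.push_back(std::move(it->second));
            }
            _waiters.erase(_waiters.begin(), end);
        }
        for (auto& notify : toNotify) {
            notify();  // acknowledge the write outside any lock
        }
    }

private:
    std::mutex _mutex;                                    // the container's own mutex
    std::multimap<long, std::function<void()>> _waiters;  // sorted by target opTime
};

// Sketch of the fast path in _wakeReadyWaiters: copy _currentCommittedSnapshot
// under the repl mutex, then drop the mutex before touching the new container.
void wakeEligibleMajorityWaiters(std::mutex& replMutex,
                                 const OpTime& currentCommittedSnapshot,
                                 EligibleMajorityWaiterList& eligibleWaiters) {
    OpTime snapshotCopy;
    {
        std::lock_guard<std::mutex> lk(replMutex);
        snapshotCopy = currentCommittedSnapshot;  // the only work done under the repl mutex
    }
    eligibleWaiters.fulfillUpTo(snapshotCopy);    // other repl work can proceed meanwhile
}

int main() {
    std::mutex replMutex;
    OpTime currentCommittedSnapshot{7};
    EligibleMajorityWaiterList eligibleWaiters;
    eligibleWaiters.add({5}, [] { /* acknowledge the w:majority write at opTime 5 */ });
    wakeEligibleMajorityWaiters(replMutex, currentCommittedSnapshot, eligibleWaiters);
}
{code}

      The important property is that the repl mutex is held only long enough to copy the snapshot; the potentially long traversal happens under the container's own mutex, and the acknowledgements themselves happen outside any lock.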

      Under high w:majority load, this could improve write performance by acknowledging satisfiable writes while allowing other repl functions to proceed concurrently as normal.

      I have a few concerns that we should think through if we try this:

      How would the new container handle state transitions? For example, would it be allowed to acknowledge successful writes during stepdown?

      Would this change the order in which waiters with different writeConcerns are fulfilled? For example, could a later w:<integer> write be satisfied before an earlier (eligible) w:majority write? That can already happen today, because we iterate over lists of waiters sorted by writeConcern: we fulfill all of writeConcern0 first, then all of writeConcern1, and so on.

      Would this change the order in which eligible w:majority writes are fulfilled? No, because even if two threads race to execute setValueIf with different snapshot opTimes, waiters with the same writeConcern are already fulfilled in order. If the thread with the earlier snapshot runs second, it does no work, because the first thread already fulfilled every waiter the second thread could have fulfilled.
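
      A toy illustration of that ordering argument (again, not real code; it just models a waiter list sorted by opTime and two racing wake passes):

{code:cpp}
#include <cstdio>
#include <map>
#include <mutex>

std::mutex listMutex;
std::multimap<long, long> waiters;  // target opTime -> waiter id, kept sorted by opTime

// Fulfills (erases) every waiter whose target is covered by snapshotTs and
// returns how many were fulfilled.
int wake(long snapshotTs) {
    std::lock_guard<std::mutex> lk(listMutex);
    auto end = waiters.upper_bound(snapshotTs);
    int fulfilled = 0;
    for (auto it = waiters.begin(); it != end; it = waiters.erase(it)) {
        ++fulfilled;
    }
    return fulfilled;
}

int main() {
    waiters.insert({5, 1});
    waiters.insert({8, 2});
    waiters.insert({12, 3});

    // Thread A wins the race with the newer snapshot (opTime 10): it fulfills
    // the waiters at 5 and 8, in order.
    std::printf("snapshot 10 fulfilled %d waiters\n", wake(10));  // prints 2
    // Thread B runs second with the older snapshot (opTime 7): nothing at or
    // below 7 is left, so it does no work and cannot reorder anything.
    std::printf("snapshot 7 fulfilled %d waiters\n", wake(7));    // prints 0
}
{code}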

      Is there any case where we could fulfill a waiter that we should not have? It seems unlikely since this would use existing logic, but we might introduce a bug while shifting where it runs.

      Do we have a suitable workload to reproduce a long waiter list?

      Do most writes use the optimized path today? If not, then this would not help. Product may be able to help us get statistics.

            Assignee: Unassigned
            Reporter: Brad Cater
            Votes: 0
            Watchers: 5
