Mitigate the effects of a long repl waiter list for eligible majority writes

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor - P4
    • Affects Version/s: None
    • Component/s: Replication

      A long repl waiter list creates a negative feedback loop: it slows down the system, which further increases the length of the list. The loop is briefly mentioned in the playbook, and a comment in HELP-63535 explains what happened in one customer instance.

      The root cause is that the mutex that serializes most repl behavior is held while traversing the waiter list, which prevents other repl work from making progress. _wakeReadyWaiters holds the lock and calls _doneWaitingForReplication on every waiter in WriteConcernWaiterList::setValueIf (up to the early exit in setValueIf).
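
      For context, here is a minimal, self-contained model of that pattern. This is not the actual mongod code; the class ReplCoordModel and everything in it are stand-ins invented for illustration. The only point it demonstrates is that a single mutex guards both the repl state and the waiter traversal:

{code:cpp}
// Minimal model of the current pattern (NOT the actual mongod code; the class
// and member names below are simplifications for illustration only). One mutex
// guards both the repl state and the waiter traversal, so a long waiter list
// blocks any other repl work that needs the mutex.
#include <cstdio>
#include <functional>
#include <list>
#include <mutex>
#include <utility>

struct OpTime {
    long ts = 0;
};

struct Waiter {
    OpTime opTime;
    std::function<void()> onSatisfied;
};

class ReplCoordModel {
public:
    void addWaiter(OpTime target, std::function<void()> onSatisfied) {
        std::lock_guard<std::mutex> lk(_mutex);
        _waiters.push_back({target, std::move(onSatisfied)});
    }

    // Analogue of _wakeReadyWaiters: the repl mutex is held for the entire pass.
    void wakeReadyWaiters(OpTime committed) {
        std::lock_guard<std::mutex> lk(_mutex);
        _currentCommittedSnapshot = committed;
        // Analogue of WriteConcernWaiterList::setValueIf: visit waiters while
        // still holding _mutex, so other repl work must wait for the whole pass.
        for (auto it = _waiters.begin(); it != _waiters.end();) {
            if (doneWaitingForReplication(it->opTime)) {  // reads state guarded by _mutex
                it->onSatisfied();
                it = _waiters.erase(it);
            } else {
                ++it;  // the real list is sorted, so it can exit early instead
            }
        }
    }

private:
    bool doneWaitingForReplication(OpTime target) const {
        return target.ts <= _currentCommittedSnapshot.ts;
    }

    std::mutex _mutex;                 // stands in for the mutex serializing repl behavior
    OpTime _currentCommittedSnapshot;  // stands in for _currentCommittedSnapshot
    std::list<Waiter> _waiters;        // stands in for WriteConcernWaiterList::_waiters
};

int main() {
    ReplCoordModel coord;
    coord.addWaiter({5}, [] { std::puts("w:majority write at opTime 5 acknowledged"); });
    coord.wakeReadyWaiters({10});  // acknowledges the waiter, but only while holding the mutex
}
{code}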

      The mutex is required because _doneWaitingForReplication uses _topCoord and _currentCommittedSnapshot (and a few others).

      There is already an optimization noted in a code comment: in most cases, checking _currentCommittedSnapshot alone is sufficient for majority writes. We could build a fast path for the majority writes that meet those existing optimization conditions. WriteConcernWaiterList::_waiters and WriteConcernWaiterList::setValueIf would stay unchanged; we would introduce an analogous container just for the eligible majority writes, guarded by its own mutex. To wake those waiters, we would copy the value of _currentCommittedSnapshot under the repl mutex, then pass the copy to the new container and process it outside the mutex.
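
      A rough sketch of what this could look like, under the assumptions above. The names here (EligibleMajorityWaiterList, fulfillUpTo, wakeEligibleMajorityWaiters) are hypothetical and do not exist in the codebase; the sketch only shows the intended locking structure: copy the snapshot under the repl mutex, then fulfill the eligible waiters under the new container's own mutex.

{code:cpp}
#include <functional>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

struct OpTime {
    long ts = 0;
};

// Hypothetical container for majority writes that meet the existing optimization
// conditions. It owns its own mutex, so fulfilling its waiters does not block
// other repl work.
class EligibleMajorityWaiterList {
public:
    void add(OpTime target, std::function<void()> onSatisfied) {
        std::lock_guard<std::mutex> lk(_mutex);
        _waiters.emplace(target.ts, std::move(onSatisfied));
    }

    // Fulfills every waiter whose target opTime is covered by the snapshot copy
    // it is handed. Only this container's mutex is taken.
    void fulfillUpTo(OpTime committedSnapshot) {
        std::vector<std::function<void()>> toNotify;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            auto end = _waiters.upper_bound(committedSnapshot.ts);
            for (auto it = _waiters.begin(); it != end; ++it) {
                toNotify.push_back(std::move(it->second));
            }
            _waiters.erase(_waiters.begin(), end);
        }
        for (auto& notify : toNotify) {
            notify();  // acknowledge the write outside any lock
        }
    }

private:
    std::mutex _mutex;                                    // the container's own mutex
    std::multimap<long, std::function<void()>> _waiters;  // sorted by target opTime
};

// Sketch of the fast path in _wakeReadyWaiters: copy _currentCommittedSnapshot
// under the repl mutex, then drop the mutex before touching the new container.
void wakeEligibleMajorityWaiters(std::mutex& replMutex,
                                 const OpTime& currentCommittedSnapshot,
                                 EligibleMajorityWaiterList& eligibleWaiters) {
    OpTime snapshotCopy;
    {
        std::lock_guard<std::mutex> lk(replMutex);
        snapshotCopy = currentCommittedSnapshot;  // the only work done under the repl mutex
    }
    eligibleWaiters.fulfillUpTo(snapshotCopy);    // other repl work can proceed meanwhile
}

int main() {
    std::mutex replMutex;
    OpTime currentCommittedSnapshot{7};
    EligibleMajorityWaiterList eligibleWaiters;
    eligibleWaiters.add({5}, [] { /* acknowledge the w:majority write at opTime 5 */ });
    wakeEligibleMajorityWaiters(replMutex, currentCommittedSnapshot, eligibleWaiters);
}
{code}

      The important property is that the repl mutex is held only long enough to copy the snapshot; the potentially long traversal happens under the container's own mutex, and the acknowledgements themselves happen outside any lock.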

      Under high w:majority load, this could improve write performance by acknowledging satisfiable writes while allowing other repl functions to proceed concurrently as normal.

      I have a few concerns that we should think through if we try this:

      How would the new container handle state transitions? For example, would it be allowed to acknowledge successful writes during stepdown?

      Would this change the order in which waiters with different writeConcerns are fulfilled? For example, could a later w:<integer> write be satisfied before an earlier (eligible) w:majority write? That can already happen today, because we iterate over lists of waiters sorted by writeConcern: we fulfill all of writeConcern0 first, then all of writeConcern1, and so on.

      Would this change the order in which eligible w:majority writes are fulfilled? No, because even if two threads race to execute setValueIf with different snapshot opTimes, waiters with the same writeConcern are already fulfilled in order. If the thread with the earlier snapshot runs second, it does no work, because the first thread already fulfilled every waiter the second thread could have fulfilled.
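
      A toy illustration of that ordering argument (again, not real code; it just models a waiter list sorted by opTime and two racing wake passes):

{code:cpp}
#include <cstdio>
#include <map>
#include <mutex>

std::mutex listMutex;
std::multimap<long, long> waiters;  // target opTime -> waiter id, kept sorted by opTime

// Fulfills (erases) every waiter whose target is covered by snapshotTs and
// returns how many were fulfilled.
int wake(long snapshotTs) {
    std::lock_guard<std::mutex> lk(listMutex);
    auto end = waiters.upper_bound(snapshotTs);
    int fulfilled = 0;
    for (auto it = waiters.begin(); it != end; it = waiters.erase(it)) {
        ++fulfilled;
    }
    return fulfilled;
}

int main() {
    waiters.insert({5, 1});
    waiters.insert({8, 2});
    waiters.insert({12, 3});

    // Thread A wins the race with the newer snapshot (opTime 10): it fulfills
    // the waiters at 5 and 8, in order.
    std::printf("snapshot 10 fulfilled %d waiters\n", wake(10));  // prints 2
    // Thread B runs second with the older snapshot (opTime 7): nothing at or
    // below 7 is left, so it does no work and cannot reorder anything.
    std::printf("snapshot 7 fulfilled %d waiters\n", wake(7));    // prints 0
}
{code}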

      Is there any case where we could fulfill a waiter that we should not have? It seems unlikely since this would use existing logic, but we might introduce a bug while shifting where it runs.

      Do we have a suitable workload to reproduce a long waiter list?

      Do most writes use the optimized path today? If not, then this would not help. Product may be able to help us get statistics.

            Assignee: Unassigned
            Reporter: Brad Cater
            Votes: 0
            Watchers: 5
