Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Replication
Repl 2025-12-08, Repl 2025-12-22
Following the pattern in SERVER-113363, a proof of concept shows we can get ~10% average read-latency improvement by caching _currentCommittedSnapshot.
The ReplicationCoordinatorImpl maintains a laundry list of private members, most of which are protected by the same mutex, _mutex. This is a source of slowness that customers experience; if ~everything queues on the same mutex, then ~everything is sequential. There’s even a playbook entry for this problem.
We would like to reduce contention on _mutex to improve availability, which could mean taking / holding it in fewer places, less often, or for less time. This ticket is an opportunity to do that by applying a technique that is already used for two members:
When reading a single value like those above without the requirement that the value be "frozen" (unchanged) for some period of time (such as until a caller completes its work), it is sufficient to read a cached copy of that value without holding _mutex.
Consider a caller of getCurrentCommittedSnapshotOpTime (link):
```cpp
OpTime ReplicationCoordinatorImpl::getCurrentCommittedSnapshotOpTime() const {
    stdx::lock_guard lk(_mutex);
    return _getCurrentCommittedSnapshotOpTime(lk);
}

OpTime ReplicationCoordinatorImpl::_getCurrentCommittedSnapshotOpTime(WithLock) const {
    return _currentCommittedSnapshot.value_or(OpTime());
}
```
The guarantee is that getCurrentCommittedSnapshotOpTime() returns the value of _currentCommittedSnapshot (or OpTime(), in the value_or case) at the instant that _mutex is acquired. We know that _currentCommittedSnapshot (_ccs) is set while holding _mutex, so from the time we take _mutex to the time we return, the value does not change. However, immediately upon releasing _mutex, _ccs could (and probably will!) change, and the value that we read will be stale. Therefore, the guarantee is that the returned value was correct at a point in time; it is not guaranteed to remain correct or permanent for any subsequent period of time. Callers must already be resilient to this today. If a caller wishes to read one or more protected values in a way that guarantees they will not change for some defined duration, then the caller must hold _mutex.
Now, let’s consider how we can safely read _ccs without holding _mutex:
```cpp
OpTime ReplicationCoordinatorImpl::getCurrentCommittedSnapshotOpTime() const {
    return _currentCommittedSnapshotCached.value_or(OpTime());
}

bool ReplicationCoordinatorImpl::_updateCommittedSnapshot(WithLock lk,
                                                          const OpTime& newCommittedSnapshot) {
    // ...complicated stuff...
    _currentCommittedSnapshot = newCommittedSnapshot;
    _currentCommittedSnapshotCached.set(newCommittedSnapshot);  // THIS IS NEW.
    _currentCommittedSnapshotCond.notify_all();
    // ...complicated stuff...
}
```
Note that we set _currentCommittedSnapshotCached under _mutex when we set _ccs. This must be true for values cached in this way: we must set them under the same mutex. If we did not, then a subsequent reader who holds _mutex might find that the uncached value and the cached value disagree, which would be impossible to reconcile.
Suppose that _ccs = 5 and that newCcs = 7. As the code is written today, there are 2 (and only 2) possible orderings of read and write: reader before writer, and reader after writer. The use of _mutex prevents them from happening simultaneously. If getCcs (reader) acquires _mutex before _updateCcs (writer), then it will read 5, and 7 will be written sometime later. On the other hand, if getCcs acquires _mutex after _updateCcs, then 7 will be written, and 7 will be read sometime later.
Now consider the possible orderings with the cached value, and again suppose that _ccs = 5 and that newCcs = 7. If getCcs (reader) runs strictly before _updateCcs (writer), then it will read 5; if it runs strictly after, then it will read 7. Nothing has changed with those 2 orderings.
The new possibility is that getCcs (reader) can run after the cached value is set to the new value (from 5 to 7) but before _updateCcs (writer) releases _mutex. This means that a caller who does not hold _mutex could see the new value before a caller who does hold it.
Could an external caller do something dangerous with that value? No. That value is correct at some point in time. Although it first becomes visible before _mutex is released, it is identical to the uncached copy of the value that will be read by the very next caller who acquires _mutex. By the same guarantee that we make today, the cached value is not guaranteed to be permanent for any subsequent period of time, just like the uncached value.
Could an internal caller who does not hold _mutex do something dangerous with the new value? No. By the same logic as for an external caller, no behavior has changed.
Could an internal caller who does hold _mutex do something dangerous with the cached value? No, because the cached value is always set under the same _mutex as the uncached value and is always equal to the uncached value. In practice, once you hold _mutex, there’s no reason to use the cached value.
Is related to:
SERVER-113363 Make ReplicationCoordinatorImpl public accessors lock-free (Closed)