Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Replication
Repl 2025-12-08, Repl 2025-12-22
Following the pattern in SERVER-113363, a proof of concept shows we can get ~10% average read-latency improvement by caching _currentCommittedSnapshot.
The ReplicationCoordinatorImpl maintains a laundry list of private members, most of which are protected by the same mutex, _mutex. This is a source of slowness that customers experience; if ~everything queues on the same mutex, then ~everything is sequential. There’s even a playbook entry for this problem.
We would like to reduce contention on _mutex to improve availability, which could mean taking / holding it in fewer places, less often, or for less time. This ticket is an opportunity to do that by applying a technique that is already used for two members:
When reading a single value like those above without the requirement that the value be "frozen" (unchanged) for some period of time (such as until a caller completes its work), it is sufficient to read a cached copy of that value without holding _mutex.
Consider a caller of getCurrentCommittedSnapshotOpTime (link):
```cpp
OpTime ReplicationCoordinatorImpl::getCurrentCommittedSnapshotOpTime() const {
    stdx::lock_guard lk(_mutex);
    return _getCurrentCommittedSnapshotOpTime(lk);
}

OpTime ReplicationCoordinatorImpl::_getCurrentCommittedSnapshotOpTime(WithLock) const {
    return _currentCommittedSnapshot.value_or(OpTime());
}
```
The guarantee is that getCurrentCommittedSnapshotOpTime() returns the value of _currentCommittedSnapshot (or OpTime(), in the value_or case) at the instant that _mutex is acquired. We know that _currentCommittedSnapshot (_ccs) is set while holding _mutex, so from the time we take _mutex to the time we return, the value does not change. However, immediately upon releasing _mutex, _ccs could (and probably will!) change, and the value that we read will be stale. Therefore, the guarantee is that the returned value was correct at a point in time; it is not guaranteed to remain correct or permanent for any subsequent period of time. Callers must already be resilient to this today. If a caller wishes to read one or more protected values in a way that guarantees they will not change for some defined duration, then the caller must hold _mutex.
Now, let’s consider how we can safely read _ccs without holding _mutex:
```cpp
OpTime ReplicationCoordinatorImpl::getCurrentCommittedSnapshotOpTime() const {
    return _currentCommittedSnapshotCached.value_or(OpTime());
}

bool ReplicationCoordinatorImpl::_updateCommittedSnapshot(WithLock lk,
                                                          const OpTime& newCommittedSnapshot) {
    // ...complicated stuff...
    _currentCommittedSnapshot = newCommittedSnapshot;
    _currentCommittedSnapshotCached.set(newCommittedSnapshot);  // THIS IS NEW.
    _currentCommittedSnapshotCond.notify_all();
    // ...complicated stuff...
}
```
Note that we set _currentCommittedSnapshotCached under _mutex when we set _ccs. This must be true for values cached in this way: we must set them under the same mutex. If we did not, then a subsequent reader who holds _mutex might find that the uncached value and the cached value disagree, which would be impossible to reconcile.
Suppose that _ccs = 5 and that newCcs = 7. As the code is written today, there are 2 (and only 2) possible orderings of read and write: reader before writer, and reader after writer. The use of _mutex prevents them from happening simultaneously. If getCcs (reader) acquires _mutex before _updateCcs (writer), then it will read 5, and 7 will be written sometime later. On the other hand, if getCcs acquires _mutex after _updateCcs, then 7 will be written, and 7 will be read sometime later.
Now consider the possible orderings with the cached value, and again suppose that _ccs = 5 and that newCcs = 7. If getCcs (reader) runs strictly before _updateCcs (writer), then it will read 5; if it runs strictly after, then it will read 7. Nothing has changed with those 2 orderings.
The new possibility is that getCcs (reader) can run after the cached value is set to the new value (from 5 to 7) but before _updateCcs (writer) releases _mutex. This means that a caller who does not hold _mutex could see the new value before a caller who does hold it.
Could an external caller do something dangerous with that value? No. That value is correct at some point in time. Although it first becomes visible before _mutex is released, it is identical to the uncached copy of the value that will be read by the very next caller who acquires _mutex. By the same guarantee that we make today, the cached value is not guaranteed to be permanent for any subsequent period of time, just like the uncached value.
Could an internal caller who does not hold _mutex do something dangerous with the new value? No. By the same logic as for an external caller, no behavior has changed.
Could an internal caller who does hold _mutex do something dangerous with the cached value? No, because the cached value is always set under the same _mutex as the uncached value and is always equal to the uncached value. In practice, once you hold _mutex, there’s no reason to use the cached value.
Is related to:
SERVER-113363 Make ReplicationCoordinatorImpl public accessors lock-free (Closed)