- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- Replication
Background
Sync source selection relies on a heuristic that uses ping time, derived from the round-trip times (RTTs) of heartbeat operations. If the ping time between two nodes is > 5ms, we assume the nodes are not in the same DC. However, the heartbeat operation acquires the replication coordinator mutex in order to read heartbeat data consistently, so time spent waiting on the mutex is included in the measured RTT. When the mutex is under contention, we are therefore prone to false negatives: nodes in the same DC can be treated as if they were not. Moreover, the heartbeat operation is used by several processes, so any change to it must take care not to introduce regressions in those other operations.
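To make the failure mode concrete, here is a minimal sketch of the heuristic, not the server's actual code. The function names and the additive model of ping time (network RTT plus mutex wait on the responder) are assumptions for illustration only. The 5ms threshold is the one described above.

```python
# Hypothetical illustration; names and the additive model are assumptions,
# not taken from the server source.
SAME_DC_THRESHOLD_MS = 5.0  # threshold described in this ticket


def measured_ping_ms(network_rtt_ms: float, mutex_wait_ms: float) -> float:
    """Ping time as the requester sees it: the network round trip plus any
    time the responder spent waiting for the replication coordinator mutex."""
    return network_rtt_ms + mutex_wait_ms


def assumed_same_dc(ping_ms: float) -> bool:
    """Nodes are assumed to be in different DCs when ping time is > 5ms."""
    return not (ping_ms > SAME_DC_THRESHOLD_MS)


if __name__ == "__main__":
    # Same-DC pair with a 1ms network round trip, no contention.
    print(assumed_same_dc(measured_ping_ms(1.0, 0.0)))
    # Same pair, but the responder waited 6ms on the mutex.
    print(assumed_same_dc(measured_ping_ms(1.0, 6.0)))
```

Under this model, the classification depends on mutex wait time as much as on network distance, which is the false negative described above.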
Options
- Introduce a new operation, for the purposes of replacing the in-DC heuristic, which enables the measurement of round-trip time between nodes but does not acquire the replication coordinator lock.
- Determine whether the heartbeat operation implementation can be updated to read the heartbeat data consistently without grabbing the mutex, for example from a snapshot that may be slightly stale. (Is recently stale good enough?)
- Do nothing: perhaps the current behavior helps optimize selecting the sync source.
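As a rough illustration of the second option, the sketch below publishes an immutable snapshot that writers replace under the mutex and that the heartbeat path reads without locking. It is not the server's implementation: the class names, the fields (`term`, `config_version`), and the method names are all hypothetical. It also relies on a single reference read being atomic in CPython; a C++ version would need an explicit atomic pointer or a similar mechanism.

```python
import threading
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HeartbeatSnapshot:
    """Immutable view of the data a heartbeat response needs (hypothetical fields)."""
    term: int
    config_version: int


class Coordinator:
    """Hypothetical stand-in for a component guarded by a single mutex."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._snapshot = HeartbeatSnapshot(term=0, config_version=1)

    def update_term(self, term: int) -> None:
        # Writers still serialize on the mutex and publish a new snapshot.
        with self._mutex:
            self._snapshot = replace(self._snapshot, term=term)

    def heartbeat_response(self) -> HeartbeatSnapshot:
        # No mutex: one reference read returns an internally consistent,
        # possibly slightly stale, snapshot.
        return self._snapshot


if __name__ == "__main__":
    coordinator = Coordinator()
    coordinator.update_term(3)
    print(coordinator.heartbeat_response())
```

The trade-off is the one the option raises: a reader never blocks on the mutex, but it may observe a snapshot that a concurrent writer is about to replace.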
Other Considerations
- Is the heartbeat operation contributing to replication coordinator mutex contention in general?
SERVER-96982
Acceptance Criteria
- We are satisfied with how we use inter-node communication to select a sync source, even in the presence of replication coordinator mutex contention.
- related to SERVER-96982 Reevaluate using 5ms threshold to determine if nodes are in the same DC for sync source selection (Closed)