Core Server / SERVER-97281

Consider separating internal ping time calculations from heartbeat commands

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Replication

      Background

      Sync source selection relies in part on a ping-time heuristic, where ping time is derived from the round-trip times (RTTs) of heartbeat operations. If the ping time between two nodes is > 5ms, we assume the nodes are not in the same data center (DC). However, the heartbeat operation acquires the replication coordinator mutex so that it can read heartbeat data consistently. When that mutex is under contention, the measured RTT is inflated and nodes that share a DC can be classified as remote (a false negative for the in-DC check). The heartbeat operation is also used by several other processes, so any change to it must take care not to introduce regressions in them.
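To make the failure mode concrete, here is a minimal sketch of the heuristic as described above. The 5 ms threshold comes from this ticket; the function names and the additive model (observed RTT = network RTT + time spent waiting for the mutex) are illustrative assumptions, not the server's actual code.

```python
# Threshold stated in the ticket: a ping time above 5 ms means "not in the same DC".
SAME_DC_THRESHOLD_MS = 5.0


def measured_ping_ms(network_rtt_ms: float, mutex_wait_ms: float) -> float:
    """Hypothetical model: the observed heartbeat RTT includes any time the
    remote node spent waiting for the replication coordinator mutex."""
    return network_rtt_ms + mutex_wait_ms


def looks_same_dc(ping_ms: float) -> bool:
    """Apply the in-DC heuristic to an observed ping time."""
    return ping_ms <= SAME_DC_THRESHOLD_MS


# Two nodes in the same DC with a 0.8 ms network RTT (made-up figure).
uncontended = measured_ping_ms(network_rtt_ms=0.8, mutex_wait_ms=0.0)
contended = measured_ping_ms(network_rtt_ms=0.8, mutex_wait_ms=12.0)

print(looks_same_dc(uncontended))  # classified as same DC
print(looks_same_dc(contended))    # misclassified as remote: a false negative
```

Under this model the network distance never changes; only the mutex wait pushes the observed ping time past the threshold.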

      Options

      • Introduce a new operation to replace the heartbeat RTT as the input to the in-DC heuristic. It would measure round-trip time between nodes without acquiring the replication coordinator mutex.
      • Determine whether the heartbeat operation can be changed to read the heartbeat data consistently without acquiring the mutex, for example from a recently published copy. (Is slightly stale data good enough?)
      • Do nothing: perhaps the current behavior helps optimize selecting the sync source.
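The first two options can be sketched side by side. This is a toy model under stated assumptions: `Coordinator`, `HeartbeatSnapshot`, and every method name below are hypothetical, and the lock-free read relies on swapping a reference to an immutable object, which the real server would need to implement with its own primitives.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HeartbeatSnapshot:
    """Immutable copy of the data a heartbeat response needs (hypothetical fields)."""
    term: int
    member_state: str


class Coordinator:
    """Toy stand-in for the replication coordinator; not the real class."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._snapshot = HeartbeatSnapshot(term=0, member_state="SECONDARY")

    def update_state(self, term: int, member_state: str) -> None:
        # Writers still take the mutex, then publish a fresh immutable snapshot.
        with self._mutex:
            self._snapshot = HeartbeatSnapshot(term, member_state)

    def heartbeat_locked(self, timeout_s: float) -> Optional[HeartbeatSnapshot]:
        # Model of the current behavior: the read waits for the mutex.
        if not self._mutex.acquire(timeout=timeout_s):
            return None  # gave up waiting; a real heartbeat would just be slow
        try:
            return self._snapshot
        finally:
            self._mutex.release()

    def heartbeat_snapshot(self) -> HeartbeatSnapshot:
        # Option 2: read the last published snapshot without the mutex.
        # The result is internally consistent but may be slightly stale.
        return self._snapshot

    def ping(self) -> str:
        # Option 1: a dedicated RTT probe that never touches the mutex.
        return "pong"


def demo_under_contention():
    coord = Coordinator()
    coord.update_state(term=3, member_state="PRIMARY")
    with coord._mutex:  # simulate another thread holding the mutex
        locked = coord.heartbeat_locked(timeout_s=0.01)
        lock_free = coord.heartbeat_snapshot()
        pong = coord.ping()
    return locked, lock_free, pong
```

While the mutex is held, the locked read times out, whereas the snapshot read and the ping both return immediately. The trade-off for option 2 is the staleness question raised in the list above.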

      Other Considerations

      • Is the heartbeat operation contributing to replication coordinator mutex contention in general?
      • SERVER-96982

      Acceptance Criteria

      • We are satisfied with how we use inter-node communication to select a sync source, even in the presence of replication coordinator mutex contention.

            Assignee: Unassigned
            Reporter: Austin Miller (austin.miller@mongodb.com)
            Votes: 0
            Watchers: 7
