  Core Server / SERVER-8375

upon clock skew detection, sync directly from a primary

    • Type: Improvement
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 2.4.10, 2.5.3
    • Affects Version/s: 2.2.2, 2.3.2
    • Component/s: Replication
    • None

      Issue Status as of March 30, 2014

      ISSUE SUMMARY
      The replication code has logic to automatically detect clock skew between two replica set members. It prints a warning message in the log file ("replSet error possible failover clock skew issue?") but takes no further action. This can lead to a sync cycle, where two secondary nodes replicate from each other via the chaining mechanism, each assuming the other node is further ahead in the oplog.
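To make the failure mode concrete, the sketch below models each member's sync source as a simple map and follows the links from one member. The function `find_sync_cycle` and the member names are hypothetical illustrations, not MongoDB code: a healthy chain ends at the primary (which has no sync source), while a sync cycle loops between secondaries and never reaches it.

```python
from typing import Dict, List, Optional


def find_sync_cycle(sync_sources: Dict[str, Optional[str]], start: str) -> Optional[List[str]]:
    """Follow sync-source links from `start`; return the members in a cycle, or None.

    `sync_sources` maps each member to the member it replicates from;
    the primary maps to None because it has no sync source.
    """
    path: List[str] = []
    seen: Dict[str, int] = {}  # member -> its position in `path`
    node: Optional[str] = start
    while node is not None:
        if node in seen:
            # We returned to a member already on the path: everything
            # from its first visit onward is the cycle.
            return path[seen[node]:]
        seen[node] = len(path)
        path.append(node)
        node = sync_sources.get(node)
    return None  # reached a member with no sync source (the primary)


# Healthy chain: C syncs from B, B syncs from A (the primary).
healthy = {"A": None, "B": "A", "C": "B"}
# Sync cycle: B and C each chose the other as their sync source.
cyclic = {"A": None, "B": "C", "C": "B"}

print(find_sync_cycle(healthy, "C"))  # None
print(find_sync_cycle(cyclic, "C"))   # ['C', 'B']
```

Neither B nor C in the second map ever reaches A, which is why the writes from the primary stop arriving.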

      USER IMPACT
      A sync cycle (two replica set secondaries syncing from each other) can compromise high availability: the nodes in the cycle no longer receive writes from the primary, so their data becomes increasingly stale. Because this situation may not be detected immediately, the replica set is left vulnerable to failure and, in the worst case, to data loss.

      SOLUTION
      When a node detects clock skew between itself and its sync source, it now switches to the primary node as its sync source to avoid sync cycles.
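A minimal sketch of the selection rule described above, assuming a simplified model in which members are identified by name and the caller already knows whether skew was detected. `choose_sync_source` is a hypothetical helper for illustration only; the server's real sync-source selection considers more factors than shown here.

```python
from typing import Optional


def choose_sync_source(chained_candidate: Optional[str],
                       primary: Optional[str],
                       clock_skew_detected: bool) -> Optional[str]:
    """Return the member to sync from, falling back to the primary after clock skew."""
    if clock_skew_detected:
        # Skew suggests the current source may not really be ahead of us,
        # so stop chaining and replicate straight from the primary.
        return primary
    # No skew: chaining is permitted, so keep the chosen secondary if there is one.
    if chained_candidate is not None:
        return chained_candidate
    return primary


print(choose_sync_source("B", "A", clock_skew_detected=False))  # B
print(choose_sync_source("B", "A", clock_skew_detected=True))   # A
```

Falling back to the primary breaks any potential cycle, because the primary never syncs from a secondary.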

      WORKAROUNDS
      Chaining can be disabled globally for a replica set, which forces all members to sync from the primary. See the chainingAllowed setting (settings.chainingAllowed in the replica set configuration).
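In the mongo shell, this setting is changed by editing the configuration returned by `rs.conf()` and passing it to `rs.reconfig()`. The Python sketch below only shows the document transformation, assuming the configuration has already been fetched as a dict; `disable_chaining` is a hypothetical helper, and submitting the result (for example with the `replSetReconfig` command) is left out. It also bumps `version`, since a reconfiguration must carry a higher version than the current configuration.

```python
import copy
from typing import Any, Dict


def disable_chaining(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a replica set config with chaining turned off."""
    new_config = copy.deepcopy(config)  # leave the caller's config untouched
    # `settings` may be absent from older configs, so create it if needed.
    new_config.setdefault("settings", {})["chainingAllowed"] = False
    # The new config must have a higher version than the current one.
    new_config["version"] = config.get("version", 0) + 1
    return new_config


cfg = {"_id": "rs0", "version": 3,
       "members": [{"_id": 0, "host": "db0.example.net:27017"}]}
new_cfg = disable_chaining(cfg)
print(new_cfg["settings"], new_cfg["version"])  # {'chainingAllowed': False} 4
```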

      AFFECTED VERSIONS
      All recent production release versions up to 2.4.9 are affected.

      PATCHES
      The fix is included in the 2.4.10 production release and the 2.5.3 development version, which will evolve into the 2.6.0 production release.

      Original Description

      When replication detects clock skew (the next applied op on a secondary is not strictly after the previous applied op), it logs an error and continues.
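The check described above amounts to an ordering test on oplog timestamps. A minimal sketch, assuming a simplified `OpTime` made of wall-clock seconds plus an increment that orders ops within the same second (mirroring the BSON timestamp layout); `OpTime` and `is_clock_skew` are hypothetical names, not the server's own.

```python
from typing import NamedTuple


class OpTime(NamedTuple):
    """Simplified oplog timestamp: wall-clock seconds plus an ordinal within that second."""
    secs: int
    inc: int


def is_clock_skew(last_applied: OpTime, next_op: OpTime) -> bool:
    """True if the next op is not strictly after the last applied op."""
    # NamedTuple compares field by field, so ordering is by seconds, then increment.
    return not next_op > last_applied


print(is_clock_skew(OpTime(100, 2), OpTime(100, 3)))  # False: strictly later
print(is_clock_skew(OpTime(100, 2), OpTime(99, 7)))   # True: goes backwards
```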

      Instead, we should force the node to sync only from the primary rather than from another secondary via chaining. This avoids any situation in which a chaining cycle might have formed.

            Assignee:
            Matt Dannenberg
            Reporter:
            Eric Milkie
            Votes:
            0
            Watchers:
            7
