[SERVER-8375] upon clock skew detection, sync directly from a primary
Created: 29/Jan/13  Updated: 11/Jul/16  Resolved: 10/Sep/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.2.2, 2.3.2
Fix Version/s: 2.4.10, 2.5.3

Type: Improvement
Priority: Major - P3
Reporter: Eric Milkie
Assignee: Matt Dannenberg
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

 Description   
Issue Status as of March 30, 2014

ISSUE SUMMARY
The replication code has logic to automatically detect clock skew between two replica set members. It prints a warning message in the log file ("replSet error possible failover clock skew issue?") but takes no further action. This can lead to a sync cycle, where two secondary nodes replicate from each other via the chaining mechanism, each assuming the other node is further ahead in the oplog.

USER IMPACT
A sync cycle (two replica set secondaries syncing from each other) can affect high availability, because the nodes no longer receive writes from the primary node and will eventually contain stale data. This situation may not be detected immediately, leaving the replica set vulnerable to failure and, in the worst case, data loss.

SOLUTION
When a node detects clock skew between itself and its sync source, it now switches to the primary node as its sync source to avoid sync cycles.
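
The decision described above can be sketched roughly as follows. This is only an illustrative model, not the server's actual code: the `Member` type and the `next_sync_source` function are hypothetical names, and returning `None` when no primary is known is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Member:
    """Hypothetical, simplified view of a replica set member."""
    name: str
    is_primary: bool


def next_sync_source(
    current: Member, members: Sequence[Member], skew_detected: bool
) -> Optional[Member]:
    """Keep the current sync source unless clock skew was detected.

    On skew, pick the primary so that two secondaries cannot end up
    syncing from each other. Returning None when no primary is visible
    is an assumption of this sketch, not documented server behavior.
    """
    if not skew_detected:
        return current
    for member in members:
        if member.is_primary:
            return member
    return None
```

For example, with members `a` (primary), `b`, and `c`, a node currently syncing from `b` keeps `b` while no skew is detected and switches to `a` once skew is detected.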

WORKAROUNDS
Chaining can be globally disabled for a replica set, forcing all members to sync from the primary. See the chainingAllowed setting.
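
As a rough sketch of the workaround, the replica set configuration document gains a `settings.chainingAllowed: false` entry. The helper below only builds the modified document; `disable_chaining` is a hypothetical name, and the sample configuration is made up for illustration. Applying the result would still go through the usual reconfiguration path (for example `rs.reconfig()` in the mongo shell). The sketch bumps `version` on the assumption that the reconfig command expects a higher version number.

```python
import copy


def disable_chaining(config: dict) -> dict:
    """Return a copy of a replica set config with chaining disabled.

    The input document is left untouched. The version is incremented
    because a reconfiguration is assumed to require a newer version.
    """
    new_config = copy.deepcopy(config)
    new_config.setdefault("settings", {})["chainingAllowed"] = False
    new_config["version"] = config["version"] + 1
    return new_config


# Made-up example configuration, for illustration only.
example = {
    "_id": "rs0",
    "version": 3,
    "members": [
        {"_id": 0, "host": "node0.example.net:27017"},
        {"_id": 1, "host": "node1.example.net:27017"},
    ],
}
updated = disable_chaining(example)
```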

AFFECTED VERSIONS
All recent production release versions up to and including 2.4.9 are affected.

PATCHES
The fix is included in the 2.4.10 production release and the 2.5.3 development version, which will evolve into the 2.6.0 production release.

Original Description

When replication detects clock skew (the next applied op on a secondary is not strictly after the previous applied op), it logs an error and continues.

Instead, we should force syncing only from the primary, and not attempt to sync from any other secondary via chaining. This will avoid any situations where we might have created a chain cycle.
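
The detection condition described above ("the next applied op is not strictly after the previous applied op") can be illustrated with a small sketch. The `OpTime` type here is a simplified, hypothetical stand-in for an oplog timestamp (seconds plus an ordinal within that second), and `detect_clock_skew` is an illustrative name rather than a function from the server.

```python
from typing import NamedTuple


class OpTime(NamedTuple):
    """Simplified oplog timestamp: seconds, then an ordinal within the second."""
    secs: int
    inc: int


def detect_clock_skew(last_applied: OpTime, next_op: OpTime) -> bool:
    """Report skew when next_op is not strictly after last_applied.

    NamedTuple instances compare like plain tuples, so ordering is by
    seconds first and by the ordinal second.
    """
    return not (next_op > last_applied)
```

Under this model an op with the same timestamp as the previously applied op, or an earlier one, counts as skew; only a strictly later timestamp passes.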



 Comments   
Comment by Githook User [ 09/Mar/14 ]

Author: Dan Pasette <dan@10mongodb.com>

Message: SERVER-8375 sync from primary on clock skew

Manual backport of git commit ebd13ab35a338370a44e3e2891a06d31718f83aa
Branch: v2.4
https://github.com/mongodb/mongo/commit/801d87f5c8d66d5f5a462c5e0daae67e6b848976

Comment by auto [ 10/Sep/13 ]

Author: matt dannenberg (dannenberg) <matt.dannenberg@10gen.com>

Message: SERVER-8375 sync from primary on clock skew

Testing this was difficult, as we couldn't manage to simulate a cycle.
What I ended up doing was:
1. added a fail point to oplog such that, when the fail point is active, the new code will be hit
2. altered sync_passive2.js (because it had chaining in it) to do the chaining portion twice, with the fail point active the second time, to make sure the new code caused us to sync from the primary again
3. https://gist.github.com/dannenberg/dcda9353b637edda6a16 for the full diff
Branch: master
https://github.com/mongodb/mongo/commit/ebd13ab35a338370a44e3e2891a06d31718f83aa

Comment by Scott Hernandez (Inactive) [ 29/Jan/13 ]

If it gets into a cycle, won't there be no new entries to cause this condition?
