-
Type: Bug
-
Resolution: Done
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: Replication
-
None
-
ALL
-
-
Repl 2018-07-30
-
(copied to CRM)
Consider a replica set spread over two DCs with multiple secondaries in each connected by a WAN. (Not an uncommon scenario for many users with main and DR sites.) Under normal conditions, the secondaries will chain such that a single copy of the replicated data will cross the WAN because most secondaries will chain such that lag is minimized.
Now consider what happens when the WAN becomes overloaded. The remote secondary replicating from the main site will start to lag as it cannot pull operations across the WAN fast enough. The other remote secondaries will notice this lag and when it hits 30 seconds (aka maxSyncSourceLagSecs) those secondaries will re-evaluate their sync sources selecting one of the nodes in the main DC. This will put additional strain on the already overloaded WAN and make it more likely that all remote secondaries will fall off the oplog as they fall further and further behind the primary.