[SERVER-35950] Replication storm when replica set members lag over WAN Created: 03/Jul/18 Updated: 26/Jul/18 Resolved: 12/Jul/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | James Kovacs | Assignee: | Tess Avitabile (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Steps To Reproduce: |
|
| Sprint: | Repl 2018-07-30 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
Consider a replica set spread over two DCs connected by a WAN, with multiple secondaries in each DC. (This is not an uncommon scenario for users with main and DR sites.) Under normal conditions, the secondaries will typically chain so that lag is minimized, which means only a single copy of the replicated data crosses the WAN. Now consider what happens when the WAN becomes overloaded. The remote secondary replicating from the main site will start to lag because it cannot pull operations across the WAN fast enough. The other remote secondaries will notice this lag, and when it reaches 30 seconds (the default maxSyncSourceLagSecs), they will re-evaluate their sync sources and select one of the nodes in the main DC. This puts additional strain on the already overloaded WAN and makes it more likely that all remote secondaries will fall off the oplog as they fall further and further behind the primary. |
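The feedback loop described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not the server's actual sync source selection logic: the member names, the `Secondary` class, and the `reevaluate` rule (switch to a main-DC node as soon as the current source lags by more than the threshold) are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

# Default threshold named in the description (maxSyncSourceLagSecs).
MAX_SYNC_SOURCE_LAG_SECS = 30


@dataclass
class Secondary:
    name: str
    dc: str           # "main" or "remote"
    sync_source: str  # name of the member this node pulls the oplog from


def wan_streams(members, dc_of):
    """Count sync links whose two ends sit in different DCs."""
    return sum(1 for m in members if dc_of[m.sync_source] != m.dc)


def reevaluate(members, lag_secs, main_dc_source,
               threshold=MAX_SYNC_SOURCE_LAG_SECS):
    """Toy rule (an assumption): a member whose current source lags by more
    than `threshold` seconds re-points at a node in the main DC."""
    for m in members:
        if lag_secs.get(m.sync_source, 0) > threshold:
            m.sync_source = main_dc_source


def simulate_overload():
    """Return WAN stream counts before and after the remote chain head lags."""
    dc_of = {"P": "main", "m1": "main",
             "r1": "remote", "r2": "remote", "r3": "remote"}
    members = [
        Secondary("m1", "main", "P"),
        Secondary("r1", "remote", "P"),   # the only member syncing across the WAN
        Secondary("r2", "remote", "r1"),  # chained inside the remote DC
        Secondary("r3", "remote", "r1"),
    ]
    before = wan_streams(members, dc_of)
    # The WAN is overloaded: r1 falls 45 seconds behind.
    reevaluate(members, {"r1": 45}, main_dc_source="m1")
    after = wan_streams(members, dc_of)
    return before, after
```

In this model, `simulate_overload()` goes from one oplog stream crossing the WAN to three, which is the amplification the description is concerned about.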
| Comments |
| Comment by Tess Avitabile (Inactive) [ 12/Jul/18 ] |
|
james.kovacs, please reopen this ticket if you have more questions about this issue. |
| Comment by Tess Avitabile (Inactive) [ 11/Jul/18 ] |
|
james.kovacs, is it okay to close this ticket? |
| Comment by Tess Avitabile (Inactive) [ 05/Jul/18 ] |
|
I would recommend increasing maxSyncSourceLagSecs. Otherwise, the remote secondaries will refuse to sync from the lagged secondary, regardless of whether it is suggested as a sync source. If the desired configuration is to have chaining in the remote DC, then maxSyncSourceLagSecs must be increased. Increasing maxSyncSourceLagSecs can be problematic if it causes a majority of nodes to be lagged, so that writes cannot be majority committed. However, if the remote DC contains a minority of nodes in the replica set, then this should not be a problem. |
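The majority/minority distinction can be checked with a short calculation. The following is a rough sketch, assuming every member is a voting member and that each member's lag behind the primary is known. The function name `majority_commit_lag` and the sample lag values are hypothetical.

```python
def majority_commit_lag(lag_secs_by_member):
    """Approximate how far the majority commit point trails the primary.

    `lag_secs_by_member` maps every voting member (including the primary,
    at lag 0) to its lag in seconds. A write is majority committed once a
    majority of members have it, so the commit point trails the primary by
    the lag of the member that completes that majority.
    """
    lags = sorted(lag_secs_by_member.values())
    majority = len(lags) // 2 + 1
    return lags[majority - 1]


# Remote DC holds a minority (2 of 5): the main DC alone forms a majority.
remote_minority = {"P": 0, "m1": 1, "m2": 1, "r1": 120, "r2": 125}

# Remote DC holds a majority (3 of 5): a lagged remote node is needed.
remote_majority = {"P": 0, "m1": 1, "r1": 120, "r2": 125, "r3": 130}
```

With these sample values, `majority_commit_lag(remote_minority)` is 1 second, while `majority_commit_lag(remote_majority)` is 120 seconds, which is the case where a larger maxSyncSourceLagSecs could delay majority-committed writes.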