[SERVER-8235] too-frequent sync source changes cause node to fall behind Created: 18/Jan/13 Updated: 11/Jul/16 Resolved: 16/Feb/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 2.4.0-rc1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Greg Studer | Assignee: | Kristina Chodorow (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Steps To Reproduce: | 1) Set up a replica set with 5 nodes A, B, C, D, E, where A is the primary. The idea here is that the replica set has three lagging nodes: B, C, and D. Because E is 30s ahead of B and C, D keeps trying to change sync sources. However, since B and C are closer in ping time than E, E is never chosen, and D's sync source flips between B and C depending on the stochastic ping delay. This kind of problem might occur in the wild if, for example, the three lagging nodes were in a separate data center. Once a node falls too far behind the others, the sync source swapping pushes it into permanent recovery mode (until one of the other nodes catches up). |
| Description |
|
Given particular replica set configurations and network delays, it's possible for a node to repeatedly change the host it syncs from after every replicated batch. This causes the node to fall behind, and eventually the node ends up in recovery mode. |
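The flapping described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the server's actual sync source selection code: it assumes the lagging node re-evaluates its source after every batch and simply picks the candidate with the lowest measured ping. The node names follow the reproduction steps; the ping values, jitter range, and function name are hypothetical.

```python
import random


def simulate_sync_source_choices(batches=100, seed=0):
    """Simulate which sync source a lagging node picks after each batch.

    Assumption (illustrative only): the node always picks the candidate
    with the lowest measured ping, and B and C have nearly equal pings
    that differ only by random jitter, while E is much farther away.
    """
    rng = random.Random(seed)
    chosen = []
    for _ in range(batches):
        # Hypothetical ping measurements in milliseconds.
        ping_ms = {
            "B": 10 + rng.uniform(-2, 2),
            "C": 10 + rng.uniform(-2, 2),
            "E": 100 + rng.uniform(-2, 2),
        }
        chosen.append(min(ping_ms, key=ping_ms.get))
    # Count how often the choice differs from the previous batch.
    switches = sum(1 for prev, cur in zip(chosen, chosen[1:]) if prev != cur)
    return chosen, switches


if __name__ == "__main__":
    chosen, switches = simulate_sync_source_choices()
    print("sources used:", sorted(set(chosen)))
    print("source changes over", len(chosen), "batches:", switches)
```

Under these assumptions E is never selected, even though it is the most up-to-date candidate, and the choice keeps moving between B and C because only jitter separates them.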
| Comments |
| Comment by auto [ 16/Feb/13 ] |
|
Author: Kristina (kristina@10gen.com) Date: 2013-02-14T22:09:12Z Message: |
| Comment by Greg Studer [ 18/Jan/13 ] |
|
Confirmed reproducible with connection delays alone; there is no need to forbid connections between the primary and the swapping node. |
| Comment by Greg Studer [ 18/Jan/13 ] |
|
Attached two logs with periodic replSetGetStatus output from all nodes before and during the problem period, along with the .js test. |
| Comment by Greg Studer [ 18/Jan/13 ] |
|
Also tested with a slaveDelay of 20s to rule out delays and chaining in general as the cause. Without sync source swapping, all nodes are able to stay up-to-date.
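
For reading periodic replSetGetStatus output like the logs mentioned in the comments, a helper along the following lines could compute how far each member is behind the primary. It is a minimal sketch: it assumes the status document has been loaded into a Python dict whose members carry the `name`, `stateStr`, and `optimeDate` fields, with `optimeDate` already parsed into a `datetime`. The function name, host names, and timestamps are hypothetical and are not taken from the attached logs.

```python
from datetime import datetime


def member_lag_seconds(status):
    """Return {member name: seconds behind the primary}.

    `status` is assumed to be a replSetGetStatus-like dict whose members
    have `name`, `stateStr`, and `optimeDate` (a datetime) fields.
    """
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
    }


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    sample_status = {
        "members": [
            {"name": "a:27017", "stateStr": "PRIMARY",
             "optimeDate": datetime(2013, 1, 18, 12, 0, 50)},
            {"name": "d:27017", "stateStr": "SECONDARY",
             "optimeDate": datetime(2013, 1, 18, 12, 0, 5)},
        ]
    }
    for name, lag in member_lag_seconds(sample_status).items():
        print(name, "lag:", lag, "s")
```

Tracking this value across successive snapshots makes it easier to see whether a node's lag grows steadily during the period in which its sync source keeps changing.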