[SERVER-58988] Avoid sync source selection cycle during primary catchup. Created: 30/Jul/21 Updated: 29/Oct/23 Resolved: 02/Sep/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.2.18, 4.4.10, 5.0.4, 5.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Wenbin Zhu | Assignee: | Wenbin Zhu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v5.0, v4.4, v4.2 |
| Sprint: | Repl 2021-08-23, Repl 2021-09-06 |
| Participants: | |
| Description |
|
During primary catchup, the primary does not necessarily sync from the most up-to-date node; we assume its sync source will itself sync from a more up-to-date node. However, in some cases the primary doing catchup can end up syncing from a node that in turn selects this primary as its sync source. This forms a sync source selection cycle, and if that sync source is not up-to-date from the primary's point of view, the primary can never catch up to the target opTime, which causes a long unavailability window. If this happens, we rely on the catchup timeout (infinite by default), catchup takeover (30s by default) or a manual abort to bail out, but the long unavailability window is undesirable. |
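A minimal standalone model of the stall described above (not server code; the node indices, opTimes, and the `syncSourceOf` map are hypothetical):

```cpp
// Illustrative model only: shows why a sync source cycle stalls primary catchup.
#include <cstdio>
#include <vector>

int main() {
    // opTime each node has applied; node 0 is the old primary and is ahead.
    std::vector<long long> lastApplied = {100, 90, 90};
    // syncSourceOf[i] == j means node i fetches oplog entries from node j
    // (-1 means no sync source). Here node 2 (the new primary, in catchup)
    // syncs from node 1, and node 1 syncs from node 2: a cycle excluding node 0.
    std::vector<int> syncSourceOf = {-1, 2, 1};

    const long long catchupTarget = 100;  // opTime the primary must reach

    // Walk the sync source chain from the catching-up primary (node 2). If the
    // chain loops back without reaching a node that can supply the target
    // opTime, no new entries can ever flow to the primary.
    int cur = 2;
    std::vector<bool> visited(lastApplied.size(), false);
    while (cur != -1 && !visited[cur]) {
        visited[cur] = true;
        if (lastApplied[cur] >= catchupTarget) {
            std::printf("node %d can supply the target opTime\n", cur);
            return 0;
        }
        cur = syncSourceOf[cur];
    }
    std::printf("sync source cycle: primary can never reach opTime %lld\n",
                catchupTarget);
    return 0;
}
```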
| Comments |
| Comment by Vivian Ge (Inactive) [ 06/Oct/21 ] |
|
Updating the fixVersion since branching activities occurred yesterday. This ticket will be in rc0 when it is triggered. For more active release information, please keep an eye on #server-release. Thank you! |
| Comment by Githook User [ 04/Oct/21 ] |
|
Author: Wenbin Zhu <wenbin.zhu@mongodb.com> (WenbinZhu)
Message: (cherry picked from commit b46acdbba8ec51810b6f402dbe18ed7ea98fd13d) |
| Comment by Githook User [ 01/Oct/21 ] |
|
Author: Wenbin Zhu <wenbin.zhu@mongodb.com> (WenbinZhu)
Message: (cherry picked from commit b46acdbba8ec51810b6f402dbe18ed7ea98fd13d) |
| Comment by Githook User [ 01/Oct/21 ] |
|
Author: Wenbin Zhu <wenbin.zhu@mongodb.com> (WenbinZhu)
Message: (cherry picked from commit b46acdbba8ec51810b6f402dbe18ed7ea98fd13d) |
| Comment by Githook User [ 02/Sep/21 ] |
|
Author: Wenbin Zhu <wenbin.zhu@mongodb.com> (WenbinZhu)
Message: |
| Comment by Siyuan Zhou [ 03/Aug/21 ] |
|
I don't think we modeled sync source selection in the TLA+ spec. I remember we wanted to extend the TLA+ spec for that when Samy and Xuerui were working on sync source selection. Happy to brainstorm if you need. |
| Comment by Wenbin Zhu [ 02/Aug/21 ] |
|
judah.schvimer, on second thought, I think this might not be a big problem. The scenario I was describing is: say n2 is the primary-elect, n0 was the old primary and is up-to-date, n1 is syncing from n0, n2 is syncing from n1, and n1 and n2 are now equally up-to-date. If there is a partition between n0 and n1, n1 needs to change its sync source but will not sync from n2, because chaining is enabled and n2 is not ahead of it, so n1 will clear its sync source. When n2 sees that n1 has cleared its sync source, n2 will also select a new sync source, which will be n0, so we won't be stuck; n2 will just spend a bit more time converging. |
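A minimal sketch of the chaining-enabled fallback described in this comment, assuming the rule that a chained sync source must be strictly ahead of us; the function and struct names are illustrative, not the server's:

```cpp
// Hypothetical sketch: with chaining enabled, a node clears its sync source
// when no reachable candidate is strictly ahead, which lets the upstream node
// (here the catching-up primary n2) re-select and converge via n0.
#include <cstdio>
#include <vector>

struct Member {
    long long lastApplied;
    bool reachable;
};

// Returns the most up-to-date reachable candidate strictly ahead of us,
// or -1 to indicate "clear the sync source".
int chooseChainedSyncSource(const std::vector<Member>& members, int self) {
    int best = -1;
    for (int i = 0; i < (int)members.size(); ++i) {
        if (i == self || !members[i].reachable) continue;
        if (members[i].lastApplied <= members[self].lastApplied) continue;
        if (best == -1 || members[i].lastApplied > members[best].lastApplied)
            best = i;
    }
    return best;
}

int main() {
    // From n1's point of view: n0 (ahead) is partitioned away, n2 is only equal.
    std::vector<Member> fromN1 = {{100, false}, {90, true}, {90, true}};
    std::printf("n1 picks: %d (clears sync source)\n",
                chooseChainedSyncSource(fromN1, 1));

    // When n2 sees n1 without a sync source, it re-selects; n0 is reachable
    // from n2 and ahead, so catchup makes progress.
    std::vector<Member> fromN2 = {{100, true}, {90, true}, {90, true}};
    std::printf("n2 picks: %d\n", chooseChainedSyncSource(fromN2, 2));
    return 0;
}
```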
| Comment by Judah Schvimer [ 02/Aug/21 ] |
wenbin.zhu, I'm not following this scenario. I'd expect the above TLA+ spec to catch this case. Can you please elaborate on this scenario? |
| Comment by Judah Schvimer [ 02/Aug/21 ] |
|
I think as part of this ticket, we should model the bug and the fix in the RaftMongo.tla spec. CC siyuan.zhou and jesse who may be interested. |
| Comment by Wenbin Zhu [ 02/Aug/21 ] |
When chaining is disabled, we call `_choosePrimaryAsSyncSource`, where we allow choosing a primary that is only equally up-to-date. That seems to make sense because we can only sync from the primary, but we can rethink that decision. And I think you're right that this problem occurs exclusively when chaining is disabled, because when chaining is enabled we don't allow choosing a sync source that is only equally up-to-date. But there is one case to consider when chaining is enabled: if there is a partition between n0 and n1 and the primary selects n1 as its sync source, we can still get stuck; if we choose n0 in the first place (with the first approach), we avoid this case. That's one of the reasons we initially wanted to combine both approaches as a better solution. |
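A hypothetical condensation of the asymmetry described in this comment (not the actual `_choosePrimaryAsSyncSource` implementation; the function name and parameters are illustrative):

```cpp
// Sketch of the validity rule discussed above: with chaining disabled, an
// equally up-to-date primary is still accepted as a sync source, which is
// what lets the cycle persist.
#include <cstdio>

bool isValidSyncSource(bool chainingEnabled, bool candidateIsPrimary,
                       long long candidateOpTime, long long myOpTime) {
    if (chainingEnabled) {
        // With chaining, a sync source must be strictly ahead of us.
        return candidateOpTime > myOpTime;
    }
    // Without chaining we may only sync from the primary, and an equally
    // up-to-date primary still counts as valid.
    return candidateIsPrimary && candidateOpTime >= myOpTime;
}

int main() {
    // n1's view of n2 (the new primary) once n2 has caught up to n1:
    std::printf("chaining off, equal primary valid: %d\n",
                isValidSyncSource(false, true, 90, 90));  // 1 -> cycle persists
    std::printf("chaining on, equal node valid:     %d\n",
                isValidSyncSource(true, false, 90, 90));  // 0
    return 0;
}
```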
| Comment by Judah Schvimer [ 02/Aug/21 ] |
|
I'd be pretty nervous about the first solution. This problem is exclusively for when chaining is disabled, correct? If so, I'm inclined to make the fix only for when chaining is disabled. When chaining is disabled, why is a node allowed to sync from a primary that's not ahead of it? Nodes syncing from nodes not strictly ahead of them seem prone to cycles. |
| Comment by Wenbin Zhu [ 02/Aug/21 ] |
|
Hey pavithra.vetriselvan, currently when checking shouldChangeSyncSource we know the syncSourceIndex from OplogQueryMetadata. This can serve the purpose of knowing which node is syncing from us, although using syncSourceIndex is not 100% correct, since the index might come from a different config (similar to |
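A simplified stand-in for the check described in this comment; the struct and field names below are illustrative, not the server's rpc metadata types, and the config-mismatch caveat above still applies:

```cpp
// Detect the two-node cycle using the sync source's reported metadata:
// we are primary, we have caught up to our sync source, and the source
// reports *our* member index as its own sync source.
#include <cstdio>

struct OplogQueryMetadataView {
    int syncSourceIndex;     // member index the sync source is syncing from
    long long lastOpApplied;
};

bool syncSourceFormsCycle(bool iAmPrimary, int myMemberIndex,
                          long long myLastApplied,
                          const OplogQueryMetadataView& sourceMetadata) {
    // Caveat: in a real implementation the index may come from a different
    // replica set config and must be validated before use.
    return iAmPrimary &&
           sourceMetadata.syncSourceIndex == myMemberIndex &&
           sourceMetadata.lastOpApplied <= myLastApplied;
}

int main() {
    OplogQueryMetadataView fromN1{/*syncSourceIndex=*/2, /*lastOpApplied=*/90};
    std::printf("cycle detected: %d\n",
                syncSourceFormsCycle(true, /*myMemberIndex=*/2, 90, fromN1));
    return 0;
}
```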
| Comment by Pavithra Vetriselvan [ 02/Aug/21 ] |
|
The second solution feels a little more intuitive to me, but do we currently have a way to track which nodes are syncing from us? Would we have to add that mechanism? |
| Comment by Wenbin Zhu [ 30/Jul/21 ] |
|
1) How exactly can this happen: Say we have 3 nodes and chaining is disabled. n0 is the previous primary and has a newer lastApplied opTime than the other two nodes. n2 was elected the new primary and entered catchup mode with a vote from n1. Now n1 is syncing from n0 and has fetched some new entries from n0, but not the newest ones yet. The primary n2 starts to select its sync source for catchup, and because in this case we choose the sync source based on ping latency, n2 could choose n1. n2 then quickly catches up to n1 and they become equal. At the end of each successful oplog fetcher batch, we check whether we should keep syncing from the current sync source; say at this point n1 has not yet changed its sync source to n2 (since n2 became primary and chaining is disabled, n1 will eventually sync from n2). Since n1 still has a sync source, n2 keeps considering n1 a valid sync source. Now n1 notices that n2 has become primary, so it chooses n2 as its new sync source. The sync source selection cycle is formed, and the problem is that it is not just transient: neither n2 nor n1 will change its sync source after that. For n2, even though the fetcher occasionally times out and returns an empty batch, which lets n2 re-check whether its sync source is valid, it always concludes that n1 is valid because n1 has its own sync source. For n1, since chaining is disabled, n2 is its only valid sync source, and we do not require a sync source to be ahead of us if it is a primary, so n1 won't change its sync source either and the problem persists.

2) How can we improve to avoid it: For now, we think there are two things that can be done to avoid the sync source selection cycle. First, when selecting a sync source during primary catchup, instead of choosing based on ping latency, we can sort the candidates by oplog recency and select the most up-to-date node that is alive. Second, when checking shouldChangeSyncSource after each batch, we can return true if we are primary and have already caught up to the sync source, but the sync source is also syncing from us, thus breaking the cycle. |
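A hedged sketch of the two proposed mitigations; all names below are illustrative and this is not the server's TopologyCoordinator code:

```cpp
// (1) During primary catchup, prefer the most up-to-date live candidate over
//     the nearest one by ping latency.
// (2) After each fetcher batch, change sync source if we are primary, caught
//     up to the source, and the source is syncing from us (cycle break).
#include <algorithm>
#include <cstdio>
#include <vector>

struct Candidate {
    int memberIndex;
    long long lastApplied;
    int pingMillis;
    bool alive;
};

int chooseCatchupSyncSource(std::vector<Candidate> candidates) {
    // Sort by oplog recency instead of ping latency, then take the first
    // candidate that is alive.
    std::sort(candidates.begin(), candidates.end(),
              [](const Candidate& a, const Candidate& b) {
                  return a.lastApplied > b.lastApplied;
              });
    for (const auto& c : candidates)
        if (c.alive) return c.memberIndex;
    return -1;  // no usable sync source
}

bool shouldChangeSyncSourceForCycle(bool iAmPrimary, long long myLastApplied,
                                    long long sourceLastApplied,
                                    bool sourceSyncsFromMe) {
    return iAmPrimary && sourceSyncsFromMe && sourceLastApplied <= myLastApplied;
}

int main() {
    // n0 is ahead but farther away by ping; n1 is closer but behind.
    std::vector<Candidate> candidates = {
        {0, 100, /*ping*/ 50, true},
        {1, 90, /*ping*/ 5, true},
    };
    std::printf("catchup sync source: n%d\n", chooseCatchupSyncSource(candidates));
    std::printf("break cycle with n1: %d\n",
                shouldChangeSyncSourceForCycle(true, 90, 90, true));
    return 0;
}
```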