[SERVER-80258] Failover stuck in catch up mode Created: 08/Aug/23 Updated: 24/Aug/23 Resolved: 24/Aug/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Suraj Narkhede | Assignee: | Huayu Ouyang |
| Resolution: | Done | Votes: | 1 |
| Labels: | Bug | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Sprint: | Repl 2023-09-04 |
| Participants: |
| Description |
|
Hi – We recently encountered a potential bug in Mongo where a failover can get stuck in catch-up mode. Fundamentally, the bug seems to be that sync source selection picks a node with a higher optime, but not necessarily the one with the highest optime, whereas catch-up during failover tries to reach the highest optime in the shard. With replication chaining disabled, this doesn't seem to work. It seems like the bug would trigger in the following scenario:
During the catch up mode –
This would cause the primary to get stuck until `catchUpTimeoutMillis` expires.
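Roughly, assuming a hypothetical layout (node names, optimes, and the helper below are made up for illustration, not taken from the server code), the stuck state looks like this:

```python
# Hypothetical illustration only: node names and optimes are invented,
# and this is not MongoDB server code.
nodes = {
    "P": 100,  # primary-elect, behind after the failover
    "A": 110,  # nearest secondary, ahead of P but not the most up to date
    "B": 120,  # most up-to-date secondary (e.g. in another data center)
}
catchup_target = max(nodes.values())  # primary catch-up waits for this optime

# With chaining disabled, secondaries replicate from the primary, so A and B
# sync from P.  If P picks A (a node ahead of it, but not the most ahead) as
# its catch-up sync source, P can only advance to A's optime, and B's newer
# entries never reach either of them.
sync_source = {"P": "A", "A": "P", "B": "P"}

def stuck(primary: str) -> bool:
    """The primary-elect can at best reach its sync source's optime."""
    return nodes[sync_source[primary]] < catchup_target

print(stuck("P"))  # True -> catch-up stalls until catchUpTimeoutMillis expires
```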
Mongo version: 4.4 |
| Comments |
| Comment by Huayu Ouyang [ 24/Aug/23 ] |
|
surajn.vnit@gmail.com "I was just curious about the trade-offs of doing approach one here." - I believe we decided not to implement approach one (choosing the most up-to-date node as the sync source) because the problem described only occurs when chaining is disabled, and approach two directly fixes the primary catchup problem for that case rather than changing how sync source selection works as a whole. Additionally, choosing the nearest sync source instead of the most up-to-date node is generally faster and cheaper, since the most up-to-date node might be in a different region/data center. It also keeps sync source selection consistent with how it works during steady state replication, instead of having to change sync source selection specifically for primary catchup. |
| Comment by Suraj Narkhede [ 24/Aug/23 ] |
|
We are validating the fix in 4.4.10; I will circle back on that. It seems like it should work. Thanks for the help. Feel free to resolve the ticket. |
| Comment by Suraj Narkhede [ 24/Aug/23 ] |
|
Hi huayu.ouyang@mongodb.com – Thanks for the response. "In general, during primary catchup, the primary-elect might not necessarily sync from the most up-to-date node, but its sync source will sync from a more up-to-date node. This will mean that the primary-elect will still be able to catchup to the most caught up secondary." – This is not true if replication chaining is disabled though, right? As per the comment on the fix in 4.4.10 – "we think there are two things that can be done to avoid the sync source selection cycle. First when selecting sync source when we're in primary catchup, instead of choosing based on ping latency, we can sort the candidates based on whoever is newer in oplog and select the most up-to-date node that is alive. Another thing to do is when checking shouldChangeSyncSource after each batch, we return true if we are primary and already caught up to the sync source, but the sync source is also syncing from us, thus breaking the cycle." I was just curious about the trade-offs of doing approach one here. |
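For reference, a rough Python sketch of the two approaches quoted above (the actual fix lives in the server's C++ code, e.g. the topology coordinator, so the function and field names here are invented for illustration only):

```python
# Hypothetical sketches of the two quoted approaches; not MongoDB server code.

# Approach one: during primary catch-up, pick the most up-to-date live
# candidate instead of the nearest (lowest-ping) node.
def choose_catchup_sync_source(candidates):
    alive = [c for c in candidates if c["alive"]]
    return max(alive, key=lambda c: c["optime"], default=None)

# Approach two: after each batch, request a sync source change (return True)
# if we are the primary, we have already caught up to our sync source, and
# that sync source is itself syncing from us -- which breaks the cycle.
def should_change_sync_source(me, source):
    return (
        me["is_primary"]
        and me["optime"] >= source["optime"]
        and source["sync_source"] == me["name"]
    )
```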
| Comment by Huayu Ouyang [ 22/Aug/23 ] |
|
Hi surajn.vnit@gmail.com, oplogFetcherSteadyStateMaxFetcherRestarts controls the maximum number of times the oplog fetcher will consecutively restart tailing the oplog on non-cancellation errors during steady state replication, so setting it to 0 just means that if there's an error with oplog fetching during primary catchup, the primary-elect will re-do sync source selection for primary catchup. Are you looking for the catchUpTimeoutMillis field? Setting this to 0 will make it so that newly elected primaries will not attempt to catch up, so a node will become primary immediately, even before it is fully caught up. However, in this case, the primary would not roll back; instead, the non-voting node that is ahead of the primary would roll back its oplog entries that are ahead of the primary (and those oplog entries wouldn't have been majority committed by 3/5 of the voting nodes, because if they were, the newly elected primary that is 2nd most caught up would not have been able to be elected by 3/5 voting nodes). I'm not sure what you mean by "What are the trade-offs of selecting the most caught up secondary as sync source directly during catchup phase?", could you elaborate? Thanks! In general, during primary catchup, the primary-elect might not necessarily sync from the most up-to-date node, but its sync source will sync from a more up-to-date node. This means that the primary-elect will still be able to catch up to the most caught-up secondary. |
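For reference, a minimal pymongo sketch of where these two knobs live (the connection string and values below are placeholders, and whether the server parameter is runtime-settable may depend on your version):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # placeholder URI
admin = client.admin

# catchUpTimeoutMillis is part of the replica set configuration: fetch the
# current config, adjust settings.catchUpTimeoutMillis, bump the version,
# and reconfigure.  -1 waits indefinitely; 0 disables primary catch-up.
cfg = admin.command("replSetGetConfig")["config"]
cfg.setdefault("settings", {})["catchUpTimeoutMillis"] = 30000  # e.g. cap at 30s
cfg["version"] += 1
admin.command("replSetReconfig", cfg)

# oplogFetcherSteadyStateMaxFetcherRestarts is a server parameter; the value
# here is only an example.
admin.command("setParameter", 1, oplogFetcherSteadyStateMaxFetcherRestarts=1)
```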
| Comment by Suraj Narkhede [ 21/Aug/23 ] |
|
Hi huayu.ouyang@mongodb.com – Thanks for digging into this. I did see the cycle mentioned in the bug during the incident, so this fix would likely have given us a faster failover. I have a follow-up question – we have oplogFetcherSteadyStateMaxFetcherRestarts = 0. Will this cause the new primary to assume the primary role before it is fully caught up, and cause a rollback on the node that was most ahead in the oplog? What are the trade-offs of selecting the most caught up secondary as sync source directly during catchup phase? |
| Comment by Huayu Ouyang [ 21/Aug/23 ] |
|
Hi surajn.vnit@gmail.com, I can't confirm if this is definitely the cause without seeing the logs, but an issue with sync source selection cycles during primary catchup was fixed in |
| Comment by Suraj Narkhede [ 21/Aug/23 ] |
|
It's 4.4.9. |
| Comment by Huayu Ouyang [ 21/Aug/23 ] |
|
Hi surajn.vnit@gmail.com, do you also know which version of MongoDB this was on (like 4.4.0 vs 4.4.1, etc)? |
| Comment by Huayu Ouyang [ 18/Aug/23 ] |
|
Hi surajn.vnit@gmail.com, just to confirm, we don't have any logs from this incident either? |
| Comment by Suraj Narkhede [ 16/Aug/23 ] |
|
Thanks chris.kelly@mongodb.com. I actually linked an internal code link (not accessible) in the original description. Can you please update the link to https://github.com/mongodb/mongo/blob/v4.4/src/mongo/db/repl/topology_coordinator.cpp#L292-L420. Thanks! |
| Comment by Chris Kelly [ 14/Aug/23 ] |
|
Thanks surajn.vnit@gmail.com, I've passed this on to the relevant team to look into your scenario further, in case we can say something definitive based on it. If you happen to get logs again, please upload them, as they may prove helpful for looking into this behavior further.
|
| Comment by Suraj Narkhede [ 14/Aug/23 ] |
|
Hello chris.kelly@mongodb.com – thanks for your help! Anything I can help with to investigate this issue? |
| Comment by Suraj Narkhede [ 10/Aug/23 ] |
|
Hello chris.kelly@mongodb.com – thanks for checking this. Unfortunately, we no longer have the diagnostic data going back to 08/05 on the shard. I can shed more light here to explain the scenario:
|
| Comment by Chris Kelly [ 10/Aug/23 ] |
|
Thanks for your report. To investigate this further, it'd be helpful to get some diagnostic data covering the time you're observing this happen. I've created a secure upload link for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time. For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:
Chris |