[SERVER-80258] Failover stuck in catch up mode Created: 08/Aug/23  Updated: 24/Aug/23  Resolved: 24/Aug/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Suraj Narkhede Assignee: Huayu Ouyang
Resolution: Done Votes: 1
Labels: Bug
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Sprint: Repl 2023-09-04
Participants:

 Description   

Hi – We recently encountered a potential bug in Mongo where the failover can get stuck in catch up mode.

Fundamentally, the bug seems to be that sync source selection picks a node with a higher optime than the primary-elect, but not necessarily the one with the highest optime, whereas catch up during failover tries to reach the highest optime in the shard. With replication chaining disabled, this doesn't seem to work.

It seems like the bug would trigger in the following scenario:

  1. Replication chaining is disabled.
  2. A shard has 5 voting nodes and 1 non-voting node (a matching configuration is sketched just after this list).
  3. In an unplanned failover, the non-voting node is the most caught up.
  4. In the failover, the 2nd most caught-up voting node is elected as the new primary.
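
For concreteness, a minimal mongosh sketch of a replica set configuration matching this scenario: five voting members plus one hidden, non-voting member, with replication chaining disabled. The set name, hostnames, and ports are placeholders, not from the incident.

```javascript
// Hypothetical 6-member configuration: nodes A-E are voting members
// (votes and priority default to 1), node F is hidden and non-voting,
// and replication chaining is disabled via settings.chainingAllowed.
rs.initiate({
  _id: "shard0",
  members: [
    { _id: 0, host: "nodeA:27017" },
    { _id: 1, host: "nodeB:27017" },
    { _id: 2, host: "nodeC:27017" },
    { _id: 3, host: "nodeD:27017" },
    { _id: 4, host: "nodeE:27017" },
    { _id: 5, host: "nodeF:27017", votes: 0, priority: 0, hidden: true }
  ],
  settings: { chainingAllowed: false }   // replication chaining disabled
})
```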

During the catch-up mode:

  1. The elected primary would determine the optime of the non-voting node as the target optime.
  2. The node would select the most caught-up voting node as the sync source, since non-voting nodes are ignored in the first attempt.

This would cause the primary to get stuck until `catchUpTimeoutMillis` is hit.
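
For reference, the catch-up limit lives in the replica set configuration settings; a minimal mongosh sketch for inspecting and, if desired, capping it (the 60-second value is only an example, not a recommendation):

```javascript
// Show the current catch-up limit: -1 means no time limit (the default on
// recent versions), 0 disables primary catch-up entirely, and a positive
// value is a limit in milliseconds.
rs.conf().settings.catchUpTimeoutMillis

// Cap catch-up so a stuck primary-elect gives up after a bounded time.
cfg = rs.conf()
cfg.settings.catchUpTimeoutMillis = 60000   // example value: 60 seconds
rs.reconfig(cfg)
```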

 

Mongo version: 4.4



 Comments   
Comment by Huayu Ouyang [ 24/Aug/23 ]

surajn.vnit@gmail.com
"This is not true if replication chaining is disabled though, right?" - yes, sorry, in the case when replication chaining is disabled, after the fix from 4.4.10, even if the primary-elect is not necessarily syncing from the most up-to-date node, and that node is syncing from the primary-elect, the primary-elect will re-evaluate its sync source after it catches up with the currently chosen one, detect the cycle, and switch sync sources to eventually choose the most up-to-date node.

"I was just curious about the trade-offs of doing approach one here." - I believe we decided not to implement approach one to choose the most up-to-date node as the sync source because the problem described only occurs when chaining is disabled, and approach two directly fixes the primary catchup problem for when chaining is disabled rather than changing how sync source selection works as a whole. Additionally, choosing the nearest sync source instead of the most up-to-date node is generally faster and cheaper since the most up-to-date node might be in a different region/data center, and it also aligns with how sync source selection works during steady state replication instead of having to change sync source selection specifically for primary catchup.

Comment by Suraj Narkhede [ 24/Aug/23 ]

We are validating the fix in 4.4.10 and will circle back on that. It seems like it should work. Thanks for the help. Feel free to resolve the ticket.

Comment by Suraj Narkhede [ 24/Aug/23 ]

Hi huayu.ouyang@mongodb.com – Thanks for the response. 

"In general, during primary catchup, the primary-elect might not necessarily sync from the most up-to-date node, but its sync source will sync from a more up-to-date node. This will mean that the primary-elect will still be able to catchup to the most caught up secondary." –  This is not true if replication chaining is disabled though, right?

As per the comment on the fix in 4.4.10: "we think there are two things that can be done to avoid the sync source selection cycle. First when selecting sync source when we're in primary catchup, instead of choosing based on ping latency, we can sort the candidates based on whoever is newer in oplog and select the most up-to-date node that is alive. Another thing to do is when checking shouldChangeSyncSource after each batch, we return true if we are primary and already caught up to the sync source, but the sync source is also syncing from us, thus breaking the cycle."

I was just curious about the trade-offs of doing approach one here. 
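
A rough JavaScript-style sketch of the second approach described in the quote above; the real fix lives in the server's C++ shouldChangeSyncSource logic, and all names below are illustrative rather than actual server identifiers:

```javascript
// Pseudocode only: break the sync source cycle during primary catchup.
function shouldChangeSyncSourceDuringCatchup(self, source) {
  // We are the primary-elect, we have replicated everything our current sync
  // source has, yet that sync source is itself syncing from us: a cycle.
  // Returning true forces re-selection, which eventually reaches the most
  // up-to-date node.
  return self.isPrimaryElect &&
         self.lastAppliedOpTime >= source.lastAppliedOpTime &&
         source.syncSourceHost === self.host;
}
```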

Comment by Huayu Ouyang [ 22/Aug/23 ]

Hi surajn.vnit@gmail.com, oplogFetcherSteadyStateMaxFetcherRestarts controls the maximum number of times the oplog fetcher will consecutively restart tailing the oplog on non-cancellation errors during steady state replication, so setting it to 0 just means that if there's an error with oplog fetching during primary catchup, the primary-elect will re-do sync source selection for primary catchup.
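
A small mongosh sketch for reading the parameter back on a running node, assuming it is exposed to getParameter on your build (it is normally set at mongod startup, e.g. --setParameter oplogFetcherSteadyStateMaxFetcherRestarts=0):

```javascript
// Read the current restart limit for the steady-state oplog fetcher.
db.adminCommand({ getParameter: 1, oplogFetcherSteadyStateMaxFetcherRestarts: 1 })
```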

Are you looking for the catchUpTimeoutMillis field? Setting this to 0 means a newly elected primary will not attempt to catch up at all; it will complete its transition to primary immediately, before it is fully caught up. However, in this case the primary would not roll back; instead, the non-voting node that is ahead of the primary would roll back its oplog entries that are ahead of the primary (and those oplog entries couldn't have been majority committed by 3/5 of the voting nodes, because if they had been, the newly elected primary that is 2nd most caught up would not have been able to be elected by 3/5 of the voting nodes).

I'm not sure what you mean by "What are the trade-offs of selecting the most caught up secondary as sync source directly during the catchup phase?", could you elaborate? Thanks! In general, during primary catchup, the primary-elect might not necessarily sync from the most up-to-date node, but its sync source will sync from a more up-to-date node. This means the primary-elect will still be able to catch up to the most caught-up secondary.

Comment by Suraj Narkhede [ 21/Aug/23 ]

Hi huayu.ouyang@mongodb.com – Thanks for digging into this.

I did see the cycle as mentioned in the bug during the incident. So, this fix would have likely helped with the faster failover for us.

I have a follow-up question: we have oplogFetcherSteadyStateMaxFetcherRestarts = 0. Will this cause the new primary to assume the primary role before it's fully caught up, and a rollback on the node that was furthest ahead in the oplog? What are the trade-offs of selecting the most caught-up secondary as the sync source directly during the catchup phase?

Comment by Huayu Ouyang [ 21/Aug/23 ]

Hi surajn.vnit@gmail.com, I can't confirm if this is definitely the cause without seeing the logs, but an issue with sync source selection cycles during primary catchup was fixed in SERVER-58988, which was released in 4.4.10, so it's possible that upgrading to >=4.4.10 would fix your case.

Comment by Suraj Narkhede [ 21/Aug/23 ]

It's 4.4.9. 

Comment by Huayu Ouyang [ 21/Aug/23 ]

Hi surajn.vnit@gmail.com, do you also know which version of MongoDB this was on (like 4.4.0 vs 4.4.1, etc.)?
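
For reference, the exact patch version can be read from the shell:

```javascript
// Either call returns the full version string, e.g. "4.4.9" vs "4.4.10".
db.version()
db.adminCommand({ buildInfo: 1 }).version
```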

Comment by Huayu Ouyang [ 18/Aug/23 ]

Hi surajn.vnit@gmail.com Just to confirm, we don't have any logs from this incident either?

Comment by Suraj Narkhede [ 16/Aug/23 ]

Thanks chris.kelly@mongodb.com. I actually included an internal code link (not accessible) in the original description. Can you please update the link to https://github.com/mongodb/mongo/blob/v4.4/src/mongo/db/repl/topology_coordinator.cpp#L292-L420? Thanks!

Comment by Chris Kelly [ 14/Aug/23 ]

Thanks surajn.vnit@gmail.com,

I've passed this on to the relevant team to look into your scenario further, in case we can say something definitive based off of it. If you happen to get logs again, please upload them (as it may prove helpful to look into this behavior further).

 

Comment by Suraj Narkhede [ 14/Aug/23 ]

Hello chris.kelly@mongodb.com – thanks for your help! Anything I can help with to investigate this issue?

Comment by Suraj Narkhede [ 10/Aug/23 ]

Hello chris.kelly@mongodb.com – thanks for checking this. Unfortunately, we do not have the diagnostic data going back to 08/05 on shard now.

I can shed more light here to explain the scenario:

  • Let's say the shard has 6 nodes (A, B, C, D, E, F), with A being the primary.
  • A, B, C, D, E are hidden=false, priority=1, votes=1.
  • F is hidden=true, priority=0, votes=0.
  • At t1, node A crashed.
  • After A crashed, the nodes in decreasing order of optime are F, B, C, D, E (a way to compare member optimes is sketched after this list).
  • Node C triggered the election, got two votes (from D and E), and elected itself primary.
    • Mongo logs: "Election succeeded, assuming primary role"
  • Node C, via heartbeats, determined the target optime as the optime of F.
    • '{"t":{"$date":"2023-08-05T19:41:58.583+00:00"},"s":"I","c":"REPL","id":21365,"ctx":"ReplCoord-1525","msg":"Heartbeats updated catchup target optime","attr":{"targetOpTime":{"ts":{"$timestamp":{"t":1691264511,"i":983}},"t":102}}}'

  • But Node C chose Node B as the sync source, because sync source selection just looks for a node with a higher optime than the chosen primary.
  • Now C is trying to catch up to the optime of F, but B is behind F and replication chaining is disabled, so the failover is stuck in catch-up mode.
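
A small mongosh sketch that would surface this state while it is happening, listing each member's applied optime and sync source from replSetGetStatus (field names as reported by 4.4-era servers; older versions report syncingTo instead of syncSourceHost). In the scenario above, it would show C syncing from B even though F has the highest optime:

```javascript
// Print state, last applied optime, and sync source for every member.
var status = db.adminCommand({ replSetGetStatus: 1 });
status.members.forEach(function (m) {
  print(m.name + "  state=" + m.stateStr +
        "  optime=" + tojson(m.optime) +
        "  syncSource=" + (m.syncSourceHost || "none"));
});
```
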
Comment by Chris Kelly [ 10/Aug/23 ]

Hi surajn.vnit@gmail.com,

Thanks for your report. To investigate this further, it'd be helpful to get some diagnostic data covering the time you're observing this happen.

I've created a secure upload link for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

For each node in the replica set, covering a time period that includes the incident, would you please archive (tar or zip) and upload the following to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)

Chris
