Core Server / SERVER-80258

Failover stuck in catch up mode

    • Type: Question
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Labels:
    • Repl 2023-09-04

      Hi, we recently encountered a potential bug in MongoDB where a failover can get stuck in catchup mode.

      Fundamentally, the bug seems to be that sync source selection picks a node with a higher optime, but not necessarily the node with the highest optime, whereas catchup during failover tries to reach the highest optime in the shard. With replication chaining disabled, the new primary does not seem able to reach that target.

      It seems like the bug would trigger in the following scenario:

      1. Replication chaining is disabled (`settings.chainingAllowed` is `false`).
      2. A shard has 5 voting nodes and 1 non-voting node.
      3. In an unplanned failover, the non-voting node is the most caught up.
      4. In the failover, the second most caught-up voting node is elected as primary.
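
For reference, the topology in this scenario can be sketched as a replica set configuration document. This is only an illustration: the set name and host names are placeholders that do not come from the report, and the document is built as a plain Python dict rather than read from a live deployment.

```python
# Hypothetical replica set configuration matching the scenario above.
# The set name and host names are placeholders, not taken from the report.
def build_scenario_config():
    # Five voting members.
    members = [
        {"_id": i, "host": f"node{i}.example.net:27017", "votes": 1, "priority": 1}
        for i in range(5)
    ]
    # One non-voting member; a member with votes 0 must also have priority 0.
    members.append(
        {"_id": 5, "host": "node5.example.net:27017", "votes": 0, "priority": 0}
    )
    return {
        "_id": "shard0",
        "version": 1,
        "members": members,
        # Replication chaining disabled.
        "settings": {"chainingAllowed": False},
    }


config = build_scenario_config()
voting = [m for m in config["members"] if m["votes"] > 0]
non_voting = [m for m in config["members"] if m["votes"] == 0]
print(len(voting), len(non_voting))  # 5 1
```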

      During catchup mode:

      1. The elected primary would take the optime of the non-voting node as its target optime.
      2. The primary would select the most caught-up voting node as its sync source, because non-voting nodes are ignored in the first attempt.
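
The mismatch between the two steps can be shown with a toy model. This is a sketch of the behavior as described in this report, not the server's actual selection code: optimes are simplified to integers (real optimes pair a timestamp with a term), and the node names, values, and both helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    optime: int   # simplified: a real optime pairs a timestamp with a term
    voting: bool


def catchup_target(others):
    """Highest optime among all other members, voting or not."""
    return max(n.optime for n in others)


def first_attempt_sync_source(primary, others):
    """Most caught-up voting node that is ahead of the primary, or None."""
    candidates = [n for n in others if n.voting and n.optime > primary.optime]
    return max(candidates, key=lambda n: n.optime, default=None)


# The newly elected primary is the second most caught-up voting node.
primary = Node("v2", optime=98, voting=True)
others = [
    Node("v1", 99, True),    # most caught-up voting node
    Node("v3", 95, True),
    Node("v4", 95, True),
    Node("v5", 94, True),
    Node("nv", 100, False),  # non-voting node, most caught up overall
]

target = catchup_target(others)
source = first_attempt_sync_source(primary, others)
# The target is only reachable if the chosen source has the target optime.
reachable = source is not None and source.optime >= target
print(target, source.name, reachable)  # 100 v1 False
```

In this model the chosen sync source tops out at optime 99 while the target is 100, so syncing from it alone can never satisfy the catchup condition.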

      This would cause the primary to stay stuck in catchup until `catchUpTimeoutMillis` expires, since the selected sync source is itself behind the target optime.
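
Since the timeout is what eventually ends catchup here, one possible way to bound the stall is to set `settings.catchUpTimeoutMillis` explicitly. The sketch below only prepares a modified configuration document; the helper name, the sample document, and the 30-second value are illustrative assumptions, not recommendations from this ticket.

```python
import copy


def with_catchup_timeout(config, millis):
    """Return a copy of `config` with settings.catchUpTimeoutMillis set.

    Hypothetical helper: it only edits the document and does not talk to
    a server.
    """
    new_config = copy.deepcopy(config)
    new_config.setdefault("settings", {})["catchUpTimeoutMillis"] = millis
    # A reconfig is expected to carry a newer version number.
    new_config["version"] = config["version"] + 1
    return new_config


# Placeholder for a configuration document fetched from a replica set.
current = {
    "_id": "shard0",
    "version": 7,
    "members": [],
    "settings": {"chainingAllowed": False},
}
updated = with_catchup_timeout(current, 30_000)
print(updated["version"], updated["settings"]["catchUpTimeoutMillis"])  # 8 30000
```

In practice the current document would typically come from the `replSetGetConfig` command and the result would be submitted with `replSetReconfig`; that wiring is left out here.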

       

      MongoDB version: 4.4

            Assignee: Huayu Ouyang (huayu.ouyang@mongodb.com)
            Reporter: Suraj Narkhede (surajn.vnit@gmail.com)