Type: Question
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- Bug

Sprint: Repl 2023-09-04
Confidence Status: None
Work Order: 3

Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Hi, we recently encountered a potential bug in MongoDB where a failover can get stuck in catch-up mode.

Fundamentally, the bug seems to be that sync source selection picks a node with a higher optime than the syncing node, but not necessarily the node with the highest optime, whereas catch-up during failover tries to reach the highest optime in the shard. With replication chaining disabled, this combination doesn't seem to work.

It seems like the bug would trigger in the following scenario:

Replication chaining is disabled.
A shard has 5 voting nodes and 1 non-voting node.
In an unplanned failover, the non-voting node is the most caught up.
In that failover, the second most caught-up voting node is elected primary.
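
For reference, the topology above could be written down roughly as the following replica set configuration document. This is only a sketch: the host names, the set name `shard0`, and the 30000 ms timeout are made-up placeholders, not values from our deployment. The field names (`votes`, `priority`, `settings.chainingAllowed`, `settings.catchUpTimeoutMillis`) are the ones from the replica set configuration document.

```python
# Five voting members (hypothetical hosts).
voting_members = [
    {"_id": i, "host": f"node{i}.example.net:27017", "votes": 1, "priority": 1}
    for i in range(5)
]

# One non-voting member; a member with votes 0 must also have priority 0.
non_voting_member = {
    "_id": 5,
    "host": "node5.example.net:27017",
    "votes": 0,
    "priority": 0,
}

config = {
    "_id": "shard0",  # placeholder replica set name
    "members": voting_members + [non_voting_member],
    "settings": {
        "chainingAllowed": False,       # replication chaining disabled
        "catchUpTimeoutMillis": 30000,  # illustrative value only
    },
}
```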

During catch-up mode:

The elected primary takes the optime of the non-voting node as its target optime.
The primary selects the most caught-up voting node as its sync source, because non-voting nodes are ignored in the first attempt.

This would cause the primary to stay stuck in catch-up mode until `catchUpTimeoutMillis` expires.
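
To make the mismatch concrete, here is a toy model of the scenario. It is not MongoDB's actual implementation; it only encodes the two rules as described above. The node names, the integer optimes, and the helper functions are all invented for illustration, and the model assumes the chosen sync source never gets ahead of its own optime and that the primary does not pick a different source later.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    name: str
    optime: int   # simplification: one integer stands in for a real optime
    voting: bool


def catchup_target(peers: list[Node]) -> int:
    # Rule 1 (as described in this report): the catch-up target is the
    # highest optime among all peers, voting or not.
    return max(p.optime for p in peers)


def first_attempt_sync_source(primary: Node, peers: list[Node]) -> Optional[Node]:
    # Rule 2 (as described in this report): the first attempt ignores
    # non-voting nodes and only considers nodes ahead of the primary.
    ahead = [p for p in peers if p.voting and p.optime > primary.optime]
    return max(ahead, key=lambda p: p.optime, default=None)


def simulate(primary: Node, peers: list[Node]):
    target = catchup_target(peers)
    source = first_attempt_sync_source(primary, peers)
    # Assumption: the primary can only reach what its sync source already has.
    reachable = source.optime if source else primary.optime
    return target, source, reachable < target


# Hypothetical optimes: the non-voting node h1 is the most caught up,
# and the second most caught-up voting node v2 wins the election.
primary = Node("v2", 98, voting=True)
peers = [
    Node("v1", 100, voting=True),
    Node("v3", 95, voting=True),
    Node("v4", 95, voting=True),
    Node("v5", 90, voting=True),
    Node("h1", 105, voting=False),
]

target, source, stuck = simulate(primary, peers)
print(f"target={target} source={source.name} stuck={stuck}")
# prints: target=105 source=v1 stuck=True
```

In this model the primary can reach optime 100 through v1 but is waiting for 105, which only h1 has, so under these assumptions it never leaves catch-up mode before the timeout.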

MongoDB version: 4.4

Assignee: Huayu Ouyang
Reporter: Suraj Narkhede
Participants: Chris Kelly, Huayu Ouyang, Suraj Narkhede
Votes: 1
Watchers: 12

Created: Aug 08 2023 10:49:12 PM UTC
Updated: Aug 24 2023 08:48:27 PM UTC
Resolved: Aug 24 2023 08:48:27 PM UTC
