  Core Server / SERVER-91764

Election of new primary caused all secondaries to rollback

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 4.4.9
    • Component/s: None

      We are on 4.4.9 (and are in the process of upgrading), but I opened this ticket as I don't think I have seen this issue in any other JIRA tickets.  We have 5 voters + 6 non-voters in the replset.  We disable replication chaining and set oplogFetcherSteadyStateMaxFetcherRestarts=0.
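      For reference, a minimal sketch of how a topology like ours is laid out (placeholder host names; illustrative, not our exact config):

      // 5 voting members + 6 non-voting members, with chaining disabled
      rs.initiate({
        _id: "rs0",
        members: [
          { _id: 0,  host: "voter-0:27017" },
          { _id: 1,  host: "voter-1:27017" },
          { _id: 2,  host: "voter-2:27017" },
          { _id: 3,  host: "voter-3:27017" },
          { _id: 4,  host: "voter-4:27017" },
          { _id: 5,  host: "nonvoter-0:27017", votes: 0, priority: 0 },
          { _id: 6,  host: "nonvoter-1:27017", votes: 0, priority: 0 },
          { _id: 7,  host: "nonvoter-2:27017", votes: 0, priority: 0 },
          { _id: 8,  host: "nonvoter-3:27017", votes: 0, priority: 0 },
          { _id: 9,  host: "nonvoter-4:27017", votes: 0, priority: 0 },
          { _id: 10, host: "nonvoter-5:27017", votes: 0, priority: 0 }
        ],
        settings: { chainingAllowed: false }   // replication chaining disabled
      });

      // the oplog fetcher knob is set on each node at startup, e.g.
      //   mongod --setParameter oplogFetcherSteadyStateMaxFetcherRestarts=0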

      Overview
      We encountered an incident where our replset lost quorum for ~5 minutes.  The old primary was encountering intermittent SSD errors and had stopped heartbeating with only 1 node (the other nodes still had recent heartbeats from it).  That node triggered an election and won ("electionTimeoutPeriodMillis":5000) within ~2ms.  Its `replSetRequestVotes` request carried the following config/term (lastAppliedOpTime was ~9 seconds behind the wall clock - the replset appeared wedged?):

      configVersion: 15424, configTerm: 442, lastAppliedOpTime: { ts: Timestamp(1718661913, 4888), t: 442 }
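      (For reference, election timing details like these can be read back from replSetGetStatus; a sketch, with the output abbreviated:)

      // on the node that called and won the election (the new primary)
      var s = db.adminCommand({ replSetGetStatus: 1 });
      printjson(s.electionCandidateMetrics);    // lastElectionReason, electionTerm, newTermStartDate, ...

      // on the voting members
      printjson(s.electionParticipantMetrics);  // votedForCandidate, electionTerm, voteReason, ...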

      The old primary saw the new term (443) within ~700ms, but only completed the state transition to SECONDARY ~4 seconds after the election.

      ~1 second before the election, every node's OplogFetcher failed with a "Socket operation timed out" and reselected the old primary as its sync source.

      Only 4-5 seconds after the election did the voting secondaries' OplogFetchers (including the old primary's) discover the new primary:

      Choosing new sync source because chaining is disabled and we are aware of a new primary

      However, at that point they all needed to roll back to undo 373 writes, which took 4-5 minutes (using recover-to-stable-timestamp (RTS) rollback).  I was curious that the timestamps matched and only the term was different.  The timestamp is ~4 seconds after the election (which seems to coincide with the old primary's stepDown):

      Our last optime fetched: { ts: ..., t: 442 }. source's GTE: { ts: ..., t: 443 }
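      For illustration, my (possibly wrong) reading of why this forces a rollback - an optime is a (timestamp, term) pair, so the same timestamp written under two different terms is two different points in history (placeholder values, not the server's actual code):

      function sameOpTime(a, b) {
        return a.ts === b.ts && a.t === b.t;
      }

      var ts = "<identical timestamp on both sides>";
      var lastFetched = { ts: ts, t: 442 };  // what our OplogFetcher last got from the old primary
      var sourceGTE   = { ts: ts, t: 443 };  // what the new primary has at that timestamp

      if (!sameOpTime(lastFetched, sourceGTE)) {
        // our last fetched entry is not in the new primary's oplog, so the writes
        // past the common point have to be rolled back before syncing can resume
        print("diverged -> rollback required");
      }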

      Questions
      1/ The rollback took the secondary nodes offline for an additional 4-5 minutes - is that expected?  We have ~13TB of data.

      2/ Is this a known issue, where the existing protocol may need a rollback after unplanned failures like this?

      3/ It struck me as odd that the timestamps of the last optime fetched and of the new sync source (the new primary) match exactly, and only the term is different - is that plausible?

      4/ My understanding is that the election logic and OplogFetcher/bgsync run independently, is that accurate?

      5/ If so, and if this isn't a bug, I'm curious if one of the following makes sense:
      5a/ once a node has voted, it tells the OplogFetcher to stop syncing for <election timeout>, and perform re-discovery
      5b/ before processing each batch that we have fetched, re-check whether a newer term/primary has been seen and, if so, discard the batch and re-select the sync source (sketched below)
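      A sketch of what I mean by 5b/ (hypothetical names, illustration only - not actual server code):

      // before applying a fetched batch, re-check the newest term this node has
      // seen; discard the batch if it was fetched from a primary of an older term
      function shouldApplyBatch(batch, currentKnownTerm) {
        if (batch.termAtFetch < currentKnownTerm) {
          // a newer primary exists; applying this batch risks writes that will
          // have to be rolled back, so drop it and re-select a sync source instead
          return false;
        }
        return true;
      }

      // e.g. a batch fetched from the old primary in term 442, after this node has
      // already seen the term-443 election, would be discarded:
      shouldApplyBatch({ termAtFetch: 442, ops: [] }, 443);  // -> false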

      Details
      Will update this with more details.

            Assignee:
            Unassigned
            Reporter:
            Tim T (tim@eekz.com)
            Votes:
            0
            Watchers:
            13
