- Type: Bug
- Resolution: Duplicate
- Priority: Major - P3
- Affects Version/s: 4.4.9
- Component/s: None
We are on 4.4.9 (and are in the process of upgrading), but I opened this ticket as I don't think I have seen this issue in any other JIRA tickets. We have 5 voters + 6 non-voters in the replset. We disable replication chaining and set oplogFetcherSteadyStateMaxFetcherRestarts=0.
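For context, a rough sketch of the relevant settings; the exact invocations below are illustrative rather than copied from our deployment, and depending on version the server parameter may need to be set at startup via --setParameter instead:
    // Election timeout and chaining are replica set config settings
    cfg = rs.conf()
    cfg.settings.electionTimeoutMillis = 5000
    cfg.settings.chainingAllowed = false       // secondaries sync only from the primary
    rs.reconfig(cfg)
    // Do not restart the steady-state OplogFetcher after a fetch error
    db.adminCommand({ setParameter: 1, oplogFetcherSteadyStateMaxFetcherRestarts: 0 })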
Overview
We encountered an incident where our replset lost quorum for ~5 minutes. The old primary, which was hitting intermittent SSD errors, stopped heartbeating with only 1 node (the other nodes had recent heartbeats from it). That node triggered an election and won within ~2ms ("electionTimeoutPeriodMillis":5000). `replSetRequestVotes` carried the following config/term (lastAppliedOpTime is ~9 seconds behind wall clock; the replset appeared wedged?).
configVersion: 15424, configTerm: 442, lastAppliedOpTime: { ts: Timestamp(1718661913, 4888), t: 442 }
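To make the ~9 second figure concrete: the seconds component of that lastAppliedOpTime converts as below, and the wall clock on the vote-request log line was roughly 9 seconds later (the wall-clock value here is back-derived from the stated lag, not copied from the log).
    new Date(1718661913 * 1000).toISOString()   // 2024-06-17T22:05:13.000Z = lastAppliedOpTime.ts seconds
    // the vote request was logged at roughly 22:05:22Z, i.e. no new writes applied for ~9s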
The old primary saw the new term (443) within ~700ms, but only completed the state transition to SECONDARY ~4 seconds after the election.
~1 second before the election, all nodes' OplogFetchers failed with a "Socket operation timed out" error and reselected the old primary as the sync source.
Only 4-5 seconds after the election did the voting secondaries' OplogFetchers (including the old primary's) discover the new primary:
Choosing new sync source because chaining is disabled and we are aware of a new primary
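(For reference, the sync source a given node's OplogFetcher is pulling from can be read from the standard replSetGetStatus output; a sketch, run against each member:)
    // syncSourceHost names the member this node is currently fetching the oplog from
    db.adminCommand({ replSetGetStatus: 1 }).syncSourceHost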
However, at that point they all needed to roll back to undo 373 writes, which took 4-5 minutes (using RTS). I was curious that the timestamps matched and only the term was different; the timestamp is ~4 seconds after the election (which seems to coincide with the old primary's stepDown).
Our last optime fetched: { ts: ... }. source's GTE: { ts: ... }
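For intuition on question 3 below: an oplog entry is identified by its (ts, t) pair, so two entries can share a timestamp yet belong to different branches of history if their terms differ, and that mismatch is what forces rollback. A minimal sketch of the comparison; the timestamp values are hypothetical, only the terms 442/443 are from this incident:
    function sameOplogEntry(a, b) {
      // entries match only if both the timestamp and the term agree
      return a.ts.t === b.ts.t && a.ts.i === b.ts.i && a.t === b.t
    }
    sameOplogEntry(
      { ts: { t: 1718661917, i: 1 }, t: 442 },   // our last optime fetched (hypothetical ts)
      { ts: { t: 1718661917, i: 1 }, t: 443 }    // source's entry at the same ts, new term
    )   // false -> the local entry is not in the new primary's history, so roll back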
Questions
1/ The rollback took the secondary nodes offline for an additional 4-5 minutes - is that expected? We have ~13TB of data.
2/ Is this a known issue, where the existing protocol may need a rollback after unplanned failures like this?
3/ It struck me as odd that the last optime fetched and the new source's (new primary's) timestamps matched exactly, with only the term differing - is that plausible?
4/ My understanding is that the election logic and the OplogFetcher/bgsync run independently; is that accurate?
5/ If so, and if this isn't a bug, I'm curious whether one of the following makes sense:
5a/ once a node has voted, have it tell the OplogFetcher to stop syncing for <election timeout> and perform re-discovery
5b/ before processing each batch that we have fetched
Details
Will update this with more details.
- is related to: SERVER-95560 Majority secondaries can enter rollback state during a failover where the old primary is frozen momentarily (Closed)