Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-91764

Election of new primary caused all secondaries to rollback

    • Type: Icon: Bug Bug
    • Resolution: Duplicate
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 4.4.9
    • Component/s: None
    • None
    • ALL

      We are on 4.4.9 (and are in the process of upgrading), but I opened this ticket as I don't think I have seen this issue in any other jiras.  We have 5 voters + 6 non-voters in the replset.  We disable replication chaining, and set oplogFetcherSteadyStateMaxFetcherRestarts=0

      Overview
      We encountered an incident where our replset lost quorum for ~5 minutes, when the old primary was encountering intermittent SSD errors, stopped heartbeating with only 1 node (other nodes had recent heartbeats).  That node triggered an election and won ("electionTimeoutPeriodMillis":5000) within ~2ms.  `replSetRequestVotes` has the following config/term (lastAppliedOpTime is ~9 seconds behind wall clock - replset appeared wedged?).

      configVersion: 15424, configTerm: 442, lastAppliedOpTime: { ts: Timestamp(1718661913, 4888), t: 442 }

      The old primary saw the new term (443) within ~700ms, and but only completed the state transition to SECONDARY ~4 seconds after the election.

      ~1 second before the election, all nodes' OplogFetcher failed with a "Socket operation timed out", and reselected the old primary as the sync source.

      Only 4-5 seconds after the election, the voting secondaries' OplogFetchers discovered the new primary (including the old primary).

      Choosing new sync source because chaining is disabled and we are aware of a new primary

      However, at that point, they all needed to rollback, to undo 373 writes, which took 4-5 minutes (using RTS).  I was curious that the timestamps matched, and only the term was different.  Timestamp is ~4 seconds after the election (which seems to coincide with the old primary's stepDown).

      Our last optime fetched:

      Unknown macro: { ts}

      . source's GTE:

      Unknown macro: { ts}

      Questions
      1/ The rollback took the secondary nodes offline for an additional 4-5 minutes - is that expected?  We have ~13TB of data.

      2/ Is this a known issue, where the existing protocol may need a rollback after unplanned failures like this?

      3/ It struck me as odd that the last optime fetched and the new source (new primary)'s timestamps match exactly, and only the term is different - is that plausible?

      4/ My understanding is that the election logic and OplogFetcher/bgsync run independently, is that accurate?

      5/ If so, and if this isn't a bug, I'm curious if one of the following make sense:
      5a/ once a node has voted, it tells the OplogFetcher to stop sync'ing for <election timeout>, and perform re-discovery
      5b/ before processing each batch that we have fetched

      Details
      Will update this with more details.

            Assignee:
            Unassigned Unassigned
            Reporter:
            tim@eekz.com Tim T
            Votes:
            0 Vote for this issue
            Watchers:
            14 Start watching this issue

              Created:
              Updated:
              Resolved: