  1. Drivers
  2. DRIVERS-1954

SDAM should give priority to electionId over setVersion when updating topology

Details

    • Bug
    • Status: Closed
    • Major - P3
    • Resolution: Duplicate
    • None
    • SDAM
    • None
    • Needed
    • Filed DRIVERS-1954 to track drivers change

      • Sync spec changes in 5bd06a8
      • Confirm tests fail with current updateRSFromPrimary implementation
      • Update updateRSFromPrimary to
        • Prioritize electionId before setVersion
        • Handle nullish values for both setVersion and electionId
        • always set maxElectionId and maxSetVersion together (they're a tuple value)
      • Confirm tests pass with changes
      • Update April 1 2022: Minor fixes were made, notably the spec files were using hello, where it should have been helloOk, see commit: 316c650 to pull in the latest.
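The checklist items for updateRSFromPrimary can be sketched roughly as follows. This is a minimal illustration, not the driver's actual code: the names `ElectionTuple`, `compareNullable`, and `checkPrimary` are hypothetical, electionId is modelled as a plain number rather than an ObjectId, and treating a nullish value as lower than any real value is an assumption made for the sketch.

```typescript
type Nullish<T> = T | null | undefined;

// Hypothetical shape for the (electionId, setVersion) tuple.
// Real electionIds are ObjectIds; numbers stand in for them here.
interface ElectionTuple {
  electionId: Nullish<number>;
  setVersion: Nullish<number>;
}

// Assumption: a nullish value sorts below every real value.
// Returns a negative number, zero, or a positive number.
function compareNullable(a: Nullish<number>, b: Nullish<number>): number {
  if (a == null && b == null) return 0;
  if (a == null) return -1;
  if (b == null) return 1;
  return a < b ? -1 : a > b ? 1 : 0;
}

// Compare electionId first and fall back to setVersion only on a tie.
// When the incoming primary is accepted, both maxima are replaced
// together, because they form a single tuple value.
function checkPrimary(
  max: ElectionTuple,
  incoming: ElectionTuple
): { stale: boolean; max: ElectionTuple } {
  const byElection = compareNullable(incoming.electionId, max.electionId);
  const accepted =
    byElection > 0 ||
    (byElection === 0 &&
      compareNullable(incoming.setVersion, max.setVersion) >= 0);

  if (!accepted) {
    return { stale: true, max };
  }
  return {
    stale: false,
    max: { electionId: incoming.electionId, setVersion: incoming.setVersion },
  };
}

// Usage: a primary with a higher electionId but a lower setVersion is accepted.
const accepted = checkPrimary(
  { electionId: 1, setVersion: 2 },
  { electionId: 2, setVersion: 1 }
);
console.log(accepted.stale, accepted.max);
```

Replacing both maxima in one step keeps the stored pair consistent with a single primary's report, instead of mixing the highest electionId from one server with the highest setVersion from another.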

    Description

      in progress...

      Summary

      The SDAM spec specifies that RSM compares the { setVersion, electionId } tuple, in that order, to detect stale primaries. The motivation is that if the protocol version changes (as happened in 3.2.0), electionId values might not be directly comparable, whereas setVersion is guaranteed to increment. Details: https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#using-setversion-and-electionid-to-detect-stale-primaries

      The problem is that if a failover happens before the former primary reaches consensus on a setVersion increment, the new primary reports a lower setVersion together with a higher electionId. The existing SDAM logic treats the new primary as stale, which leads to a full cluster outage and requires manual intervention. Details in SERVER-59409.
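The failure scenario above can be illustrated with a small comparison of the two orderings. This is only a sketch: the function names are hypothetical, the values are made up for illustration, and electionId is modelled as a plain number rather than an ObjectId.

```typescript
// Simplified model of what a primary reports.
// Real electionIds are ObjectIds; numbers stand in for them here.
interface PrimaryReport {
  setVersion: number;
  electionId: number;
}

// Existing ordering: setVersion first, electionId only as a tie-breaker.
function isStaleSetVersionFirst(seen: PrimaryReport, incoming: PrimaryReport): boolean {
  if (incoming.setVersion !== seen.setVersion) {
    return incoming.setVersion < seen.setVersion;
  }
  return incoming.electionId < seen.electionId;
}

// Proposed ordering: electionId first, setVersion only as a tie-breaker.
function isStaleElectionIdFirst(seen: PrimaryReport, incoming: PrimaryReport): boolean {
  if (incoming.electionId !== seen.electionId) {
    return incoming.electionId < seen.electionId;
  }
  return incoming.setVersion < seen.setVersion;
}

// The former primary bumped setVersion to 2 but lost the election before
// the change reached consensus; the new primary still reports setVersion 1.
const formerPrimary: PrimaryReport = { setVersion: 2, electionId: 1 };
const newPrimary: PrimaryReport = { setVersion: 1, electionId: 2 };

const staleUnderOldOrder = isStaleSetVersionFirst(formerPrimary, newPrimary);
const staleUnderNewOrder = isStaleElectionIdFirst(formerPrimary, newPrimary);

console.log({ staleUnderOldOrder, staleUnderNewOrder });
// prints { staleUnderOldOrder: true, staleUnderNewOrder: false }
```

Under the setVersion-first ordering the legitimately elected primary is rejected, which is the outage described above; under the electionId-first ordering it is accepted.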

      Drawback: if a future protocol version is incompatible in a way that makes electionId non-monotonic, an additional contingency plan will be required.

      Tests: the SDAM tests at head have been updated to match the new behavior: https://github.com/mongodb/mongo/tree/master/src/mongo/client/sdam/json_tests/sdam_tests

      Motivation

      Who is the affected end user?

      Who are the stakeholders? Drivers team, server teams.

      How does this affect the end user?

      Full cluster outage is possible.

      How likely is it that this problem or use case will occur?

      It happens in tests all the time.

      If the problem does occur, what are the consequences and how severe are they?

      Outage.

      Is this issue urgent?

      Not urgent but high priority.

      Is this ticket required by a downstream team?

      TBD, might be just normal upgrade path.

      Is this ticket only for tests?

      No.

            People

              neal.beeken@mongodb.com Neal Beeken
              andrew.shuvalov@mongodb.com Andrew Shuvalov (Inactive)