  1. Drivers
  2. DRIVERS-1954

SDAM should give priority to electionId over setVersion when updating topology

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Component/s: SDAM
    • Needed
      Filed DRIVERS-1954 to track drivers change

      • Sync spec changes in 5bd06a8
      • Confirm tests fail with current updateRSFromPrimary implementation
      • Update updateRSFromPrimary to
        • Prioritize electionId before setVersion
        • Handle nullish values for both setVersion and electionId
        • Always set maxElectionId and maxSetVersion together (they form a single tuple value)
      • Confirm tests pass with changes
      • Update (April 1, 2022): Minor fixes were made; notably, the spec files were using hello where they should have used helloOk. See commit 316c650 to pull in the latest.
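
The updateRSFromPrimary changes in the checklist above could be sketched roughly as follows. This is not the driver's actual code: the `TopologyState` class and `update_from_primary` method are hypothetical names, electionId is modeled as a plain integer (real electionIds are ObjectIds), and treating a missing value as lower than any present value is an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def _rank(value: Optional[int]) -> Tuple[bool, int]:
    # Assumption: a missing (None) value sorts below every real value.
    return (value is not None, 0 if value is None else value)


def _key(election_id: Optional[int], set_version: Optional[int]):
    # electionId comes first, so it takes priority over setVersion.
    return (_rank(election_id), _rank(set_version))


@dataclass
class TopologyState:
    """Hypothetical holder for the topology's max tuple."""
    max_election_id: Optional[int] = None
    max_set_version: Optional[int] = None

    def update_from_primary(self, election_id: Optional[int],
                            set_version: Optional[int]) -> bool:
        """Return True if the primary is accepted, False if it looks stale."""
        current = _key(self.max_election_id, self.max_set_version)
        if _key(election_id, set_version) < current:
            return False
        # Always move both values together: they form one tuple.
        self.max_election_id = election_id
        self.max_set_version = set_version
        return True
```

With this ordering, a primary that reports a higher electionId but a lower setVersion is accepted, and both maximums are replaced at once rather than tracked independently.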
      Key            Status/Resolution   FixVersion
      CDRIVER-4203   Duplicate
      CXX-2404       Duplicate
      CSHARP-3934    Duplicate
      GODRIVER-2207  Duplicate
      JAVA-4375      Duplicate
      NODE-3712      Duplicate           4.11.0
      PHPC-2068      Duplicate
      PYTHON-2970    Fixed               4.3
      MOTOR-847      Duplicate
      RUBY-2829      Duplicate
      RUST-1081      Duplicate
      SWIFT-1400     Duplicate

      in progress...

      Summary

      The SDAM spec specifies that the RSM compares the tuple { setVersion, electionId }, in that order, to detect stale primaries. The motivation is that if the protocol version changes (as happened in 3.2.0), the electionId might not be directly comparable, but the setVersion is guaranteed to increment. Details: https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#using-setversion-and-electionid-to-detect-stale-primaries

      The problem is that if a failover happens before the former primary was able to reach consensus on a setVersion increment, the new primary will report a decremented setVersion alongside an incremented electionId. The existing SDAM logic treats this as a stale primary, which leads to a full cluster outage and requires manual intervention. Details in SERVER-59409.
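
To make the failure mode concrete, here is a small sketch that runs both orderings on the failover scenario just described. The numeric values are made up for illustration, the function names are hypothetical, and electionId is modeled as an integer rather than an ObjectId.

```python
from typing import Tuple

# (setVersion, electionId) as reported by a primary.
Report = Tuple[int, int]


def is_stale_set_version_first(seen: Report, new: Report) -> bool:
    # Existing ordering: compare (setVersion, electionId).
    return new < seen


def is_stale_election_id_first(seen: Report, new: Report) -> bool:
    # Proposed ordering: compare (electionId, setVersion).
    return (new[1], new[0]) < (seen[1], seen[0])


# Former primary bumped setVersion to 2 but never reached consensus on it.
former_primary: Report = (2, 10)
# New primary won a later election but still reports setVersion 1.
new_primary: Report = (1, 11)

print(is_stale_set_version_first(former_primary, new_primary))  # True
print(is_stale_election_id_first(former_primary, new_primary))  # False
```

Under the existing ordering the legitimately elected primary is rejected as stale; comparing electionId first accepts it.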

      Drawback: if we need to make incompatible protocol version changes in the future that make the electionId non-monotonic, it will require an additional contingency plan.

      Tests: the SDAM tests at head have been updated to match the new behavior: https://github.com/mongodb/mongo/tree/master/src/mongo/client/sdam/json_tests/sdam_tests

      Motivation

      Who is the affected end user?

      Who are the stakeholders? Drivers team, server teams.

      How does this affect the end user?

      A full cluster outage is possible.

      How likely is it that this problem or use case will occur?

      It happens in tests all the time.

      If the problem does occur, what are the consequences and how severe are they?

      Outage.

      Is this issue urgent?

      Not urgent but high priority.

      Is this ticket required by a downstream team?

      TBD; it might just be the normal upgrade path.

      Is this ticket only for tests?

      No.

            Assignee:
            neal.beeken@mongodb.com Neal Beeken
            Reporter:
            andrew.shuvalov@mongodb.com Andrew Shuvalov (Inactive)
            Votes: 0
            Watchers: 10
