[DRIVERS-1954] SDAM should give priority to electionId over setVersion when updating topology Created: 14/Oct/21  Updated: 07/Oct/22  Resolved: 05/Oct/22

Status: Closed
Project: Drivers
Component/s: SDAM
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive) Assignee: Neal Beeken
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-59409 Race between reconfig replication and... Closed
depends on DRIVERS-2196 Sync SDAM tests from mongo server rep... Closed
Duplicate
duplicates DRIVERS-2412 SDAM should prioritize electionId ove... Implementing
Issue split
split to CDRIVER-4203 SDAM should give priority to election... Closed
split to CSHARP-3934 SDAM should give priority to election... Closed
split to CXX-2404 SDAM should give priority to election... Closed
split to GODRIVER-2207 SDAM should give priority to election... Closed
split to MOTOR-847 SDAM should give priority to election... Closed
split to NODE-3712 SDAM should give priority to election... Closed
split to PHPC-2068 SDAM should give priority to election... Closed
split to PYTHON-2970 SDAM should give priority to election... Closed
split to RUBY-2829 SDAM should give priority to election... Closed
split to RUST-1081 SDAM should give priority to election... Closed
split to JAVA-4375 SDAM should give priority to election... Closed
Related
is related to DRIVERS-2412 SDAM should prioritize electionId ove... Implementing
Driver Changes: Needed
Server Compat: 4.4, 5.0, 5.1, 5.3
Quarter: FY23Q3
Upstream Changes Summary:

Filed DRIVERS-1954 to track drivers change

Downstream Changes Summary:
  • Sync spec changes in 5bd06a8
  • Confirm tests fail with current updateRSFromPrimary implementation
  • Update updateRSFromPrimary to
    • Prioritize electionId before setVersion
    • Handle nullish values for both setVersion and electionId
    • Always set maxElectionId and maxSetVersion together (they're a tuple value)
  • Confirm tests pass with changes
  • Update April 1 2022: Minor fixes were made; notably, the spec files were using hello where it should have been helloOk. See commit 316c650 to pull in the latest.
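
The updateRSFromPrimary changes listed above could be sketched roughly as follows. This is a minimal illustration, not the spec's pseudocode or any driver's implementation: `TopologyState`, `update_rs_from_primary`, and `_key` are hypothetical names, electionId is modeled as a non-negative integer rather than an ObjectId, and treating a missing value as lower than any real value is an assumption made here for simplicity.

```python
from typing import Optional, Tuple


def _key(election_id: Optional[int], set_version: Optional[int]) -> Tuple[int, int]:
    # Assumption: a missing (None) value sorts below any real value.
    # electionId comes first so that it takes priority over setVersion.
    return (
        -1 if election_id is None else election_id,
        -1 if set_version is None else set_version,
    )


class TopologyState:
    """Hypothetical holder for the max (electionId, setVersion) tuple seen so far."""

    def __init__(self) -> None:
        self.max_election_id: Optional[int] = None
        self.max_set_version: Optional[int] = None


def update_rs_from_primary(
    topology: TopologyState,
    election_id: Optional[int],
    set_version: Optional[int],
) -> bool:
    """Return True if the primary is accepted, False if it is considered stale."""
    incoming = _key(election_id, set_version)
    known = _key(topology.max_election_id, topology.max_set_version)
    if incoming < known:
        return False  # stale primary: leave the recorded maximum untouched
    # Both fields are always updated together, since they form one tuple value.
    topology.max_election_id = election_id
    topology.max_set_version = set_version
    return True
```

Keeping the two maxima in a single comparison key is one way to ensure they are never updated independently, which is the point of the "tuple value" note above.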
Driver Compliance:
Key Status/Resolution FixVersion
CDRIVER-4203 Duplicate
CXX-2404 Duplicate
CSHARP-3934 Duplicate
GODRIVER-2207 Duplicate
JAVA-4375 Duplicate
NODE-3712 Duplicate 4.11.0
PHPC-2068 Duplicate
PYTHON-2970 Fixed 4.3
MOTOR-847 Duplicate
RUBY-2829 Duplicate
RUST-1081 Duplicate
SWIFT-1400 Duplicate

 Description   

in progress...

Summary

The SDAM spec specifies that the RSM uses the { setVersion, electionId } tuple, compared in that order, to detect stale primaries. The motivation is that if the protocol version changes (as happened in 3.2.0), the electionId might not be directly comparable, but the setVersion is guaranteed to increment. Details: https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#using-setversion-and-electionid-to-detect-stale-primaries

The problem is that if a failover happens before the former primary has reached consensus on the setVersion increment, the new primary will report a decremented setVersion alongside an incremented electionId. The existing SDAM logic treats this as a stale primary, which leads to a full cluster outage and requires manual intervention. Details in SERVER-59409.
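
To make the failure mode concrete, here is a small sketch comparing the two orderings on the scenario described above. The numbers are made up for illustration, electionId is modeled as a plain integer rather than an ObjectId, and the function names are hypothetical.

```python
# Highest values the client has seen, reported by the former primary
# (setVersion bumped by a reconfig that never reached consensus).
known_max = {"set_version": 2, "election_id": 1}

# Values reported by the new primary after the failover:
# setVersion went back down, electionId went up.
new_primary = {"set_version": 1, "election_id": 2}


def is_stale_set_version_first(known: dict, incoming: dict) -> bool:
    # Existing ordering: compare (setVersion, electionId).
    return (incoming["set_version"], incoming["election_id"]) < (
        known["set_version"],
        known["election_id"],
    )


def is_stale_election_id_first(known: dict, incoming: dict) -> bool:
    # Proposed ordering: compare (electionId, setVersion).
    return (incoming["election_id"], incoming["set_version"]) < (
        known["election_id"],
        known["set_version"],
    )
```

With these inputs, `is_stale_set_version_first(known_max, new_primary)` is `True` (the legitimate new primary is rejected), while `is_stale_election_id_first(known_max, new_primary)` is `False`.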

Drawback: if a future incompatible protocol version makes the electionId non-monotonic, it will require an additional contingency plan.

Tests: the SDAM tests at head have been updated to match the new behavior: https://github.com/mongodb/mongo/tree/master/src/mongo/client/sdam/json_tests/sdam_tests

Motivation

Who is the affected end user?

Who are the stakeholders? Drivers team, server teams.

How does this affect the end user?

Full cluster outage is possible.

How likely is it that this problem or use case will occur?

It happens in tests all the time.

If the problem does occur, what are the consequences and how severe are they?

Outage.

Is this issue urgent?

Not urgent but high priority.

Is this ticket required by a downstream team?

TBD, might be just normal upgrade path.

Is this ticket only for tests?

No.



 Comments   
Comment by Githook User [ 13/Sep/22 ]

Author:

{'name': 'Shane Harvey', 'email': 'shnhrv@gmail.com', 'username': 'ShaneHarvey'}

Message: DRIVERS-1954 Fix hello->helloOk typo + regenerate all json tests (#1306)
Branch: master
https://github.com/mongodb/specifications/commit/133bf0c47dd97f4ea8a748d62482b7371bcb3dc8

Comment by Githook User [ 01/Apr/22 ]

Author:

{'name': 'Boris', 'email': 'boris.dogadov@mongodb.com', 'username': 'BorisDog'}

Message: DRIVERS-1954: Minor fixes
Branch: master
https://github.com/mongodb/specifications/commit/316c6501ea5870b54e2f932e65d2603cfe44e454

Comment by Githook User [ 07/Mar/22 ]

Author:

{'name': 'Neal Beeken', 'email': 'neal.beeken@mongodb.com', 'username': 'nbbeeken'}

Message: DRIVERS-1954: SDAM should give priority to electionId over setVersion when updating topology (#1122)
Branch: master
https://github.com/mongodb/specifications/commit/5bd06a81d97850a1365d73f08bbb7dea562e72b9

Comment by Shane Harvey [ 07/Feb/22 ]

PR: https://github.com/mongodb/specifications/pull/1122

Generated at Thu Feb 08 08:24:22 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.