[SERVER-48776] Remove config version and term check during the reconfig quorum check Created: 15/Jun/20  Updated: 29/Oct/23  Resolved: 15/Jul/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.4.1, 4.7.0

Type: Improvement Priority: Major - P3
Reporter: Pavithra Vetriselvan Assignee: Pavithra Vetriselvan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-47948 Replica set reconfig quorum check sho... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4
Sprint: Repl 2020-06-29, Repl 2020-07-13, Repl 2020-07-27
Participants:
Linked BF Score: 9

 Description   

During the reconfig quorum check, if we learn that another node has a newer config, we fail the reconfig command with NewReplicaSetConfigurationIncompatible.

This extra check seems unnecessary with the safe reconfig protocol.

The error is also confusing in a concurrent stepdown/reconfig scenario:

  • We have a 5 node replica set, with three voting nodes (node0, node2, and node4)
  • The current config is {version: 22, term: 10} and the current primary is node2

  • We step up node0, and it runs for an election in term 11
  • Node2 receives a reconfig command for {version: 23, term: 10}
  • Node2 steps down because it hears of a new term, 11, via a vote request from node0. Note that during stepdown, we do not kill the reconfig command unless we are writing down the config document (which takes a DB X lock).
  • Node0 wins the election (with votes from node2 and node4) and successfully increments the term on step up. The current config is {version: 22, term: 11}
  • Node2 does not install the newer config since it's already in the midst of a reconfig
  • Finally, Node2 fails during its quorum check because Node0 already has a newer config.

If we remove the config version and term check from the quorum check, the reconfig will instead fail later in the protocol. This is still safe and also returns a more accurate error (NotMaster).
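
To make the scenario above concrete, here is a minimal Python sketch of why node0's config counts as "newer" than the one node2 is proposing. It is not the server's code: ConfigId and old_quorum_check are hypothetical names, and the sketch assumes that configs are ordered by term first and then by version, which this ticket does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigId:
    """Hypothetical stand-in for a replica set config's (version, term) pair."""
    version: int
    term: int

    def is_newer_than(self, other: "ConfigId") -> bool:
        # Assumed ordering: compare term first, then version.
        return (self.term, self.version) > (other.term, other.version)


def old_quorum_check(proposed: ConfigId, responses: dict) -> str:
    """Model of the removed check: fail if any responder reports a newer config."""
    for cfg in responses.values():
        if cfg.is_newer_than(proposed):
            return "NewReplicaSetConfigurationIncompatible"
    return "OK"


proposed = ConfigId(version=23, term=10)      # config in node2's reconfig command
node0_config = ConfigId(version=22, term=11)  # node0's config after stepping up

result = old_quorum_check(proposed, {"node0": node0_config})
print(result)
```

Under that assumed ordering, {version: 22, term: 11} outranks {version: 23, term: 10} even though its version is lower, so the modeled check rejects the reconfig before the stepped-down node2 has a chance to report that it is no longer primary.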



 Comments   
Comment by Githook User [ 20/Aug/20 ]

Author:

{'name': 'Pavi Vetriselvan', 'email': 'pavithra.vetriselvan@mongodb.com', 'username': 'pvselvan'}

Message: SERVER-48776 remove config term/version check in quorum checker

(cherry picked from commit 1c3532ee1941e37f934f6c14bdc7786619d6b258)
Branch: v4.4
https://github.com/mongodb/mongo/commit/28d7f0bb30fd99a9be0a317e06f1099433ffe39c

Comment by Pavithra Vetriselvan [ 15/Jul/20 ]

This is causing BFs on the 4.4 branch, so it will need to be backported to 4.4.1. It is not a release blocker.

Comment by Githook User [ 15/Jul/20 ]

Author:

{'name': 'Pavi Vetriselvan', 'email': 'pavithra.vetriselvan@mongodb.com', 'username': 'pvselvan'}

Message: SERVER-48776 remove config term/version check in quorum checker
Branch: master
https://github.com/mongodb/mongo/commit/1c3532ee1941e37f934f6c14bdc7786619d6b258

Comment by William Schultz (Inactive) [ 15/Jun/20 ]

Just adding a note that we should make sure to not re-introduce any bugs addressed in SERVER-47948.
