[SERVER-38366] Replica set nodes updating the term without verifying the config version can lead to unnecessary stepdown. Created: 03/Dec/18  Updated: 06/Dec/22  Resolved: 10/Dec/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.1.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: Backlog - Replication Team
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-35608 Invariant that term from lastAppliedO... Closed
is related to DOCS-12253 Add a comment in "Modify Replica Set ... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

Currently, replica set nodes can learn about a higher term via heartbeats, the oplog fetcher, and commands (such as find and getMore). When the term is learned via the oplog fetcher, it calls ReplicationCoordinatorImpl::_processReplSetMetadata_inlock, which updates the term only if the sync source's config version is the same as the node's own. That config version check is missing from the heartbeat, find, and getMore paths before they update the term.
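
As a rough illustration of the check described above, the following is a minimal sketch of a term update guarded by a config version comparison. The types NodeState and RemoteMetadata and the function maybeUpdateTerm are hypothetical names invented for this example; they are not the server's actual types or the real logic of _processReplSetMetadata_inlock.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins for illustration only;
// these are not the actual MongoDB server types.
struct NodeState {
    std::int64_t term;
    std::int64_t configVersion;
};

struct RemoteMetadata {
    std::int64_t term;
    std::int64_t configVersion;
};

// Returns true if the local term was advanced. On a primary, advancing the
// term is what would lead to a stepdown.
bool maybeUpdateTerm(NodeState& self, const RemoteMetadata& remote) {
    // Ignore terms reported by a node that is on a different config version.
    if (remote.configVersion != self.configVersion) {
        return false;
    }
    // Nothing to learn if the remote term is not higher than ours.
    if (remote.term <= self.term) {
        return false;
    }
    // Terms must only move forward.
    assert(remote.term > self.term);
    self.term = remote.term;
    return true;
}
```

Under these assumptions, a node with {term 0, config version 3} would ignore metadata carrying {term 1, config version 2} and would adopt term 1 only from a sender that is also on config version 3. The version numbers are made up for the example.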

Also note that ReplicationCoordinatorImpl::_handleHeartbeatResponse updates the term in 2 places.

 

Note: This bug was observed during the following upgrade/downgrade sequence (pv1->pv0->pv1), where it led to an unnecessary stepdown.

1) Start a replica set in pv1.

2) Insert some documents in pv1 (term = 1).

3) Downgrade to pv0 while the secondaries are still replicating the documents from the previous pv1 period (term = 1).

4) Upgrade to pv1 before the secondaries have downgraded to pv0.

5) The current primary, which is in term 0, receives heartbeats from the secondaries, which think they are still in term 1 (from step 1).

6) As a result, the current primary updates its term to 1, steps down, and starts a new election for term 2.
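
Steps 5 and 6 can be modeled with a small sketch that contrasts heartbeat handling with and without a config version check. Everything here is a simplified assumption for illustration: Primary, Heartbeat, processHeartbeat, startElection, and the config version numbers are hypothetical and do not correspond to actual server code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of steps 5 and 6; not MongoDB server code.
struct Primary {
    std::int64_t term;
    std::int64_t configVersion;
    bool isPrimary;
};

struct Heartbeat {
    std::int64_t term;
    std::int64_t configVersion;
};

// Processes one heartbeat. With requireMatchingConfig == false this models the
// behavior reported in this ticket; with true it models the missing check.
void processHeartbeat(Primary& node, const Heartbeat& hb, bool requireMatchingConfig) {
    if (requireMatchingConfig && hb.configVersion != node.configVersion) {
        return;  // Sender is on a different config: do not learn its term.
    }
    if (hb.term > node.term) {
        node.term = hb.term;     // Adopt the higher term...
        node.isPrimary = false;  // ...and step down.
    }
}

// Models the node standing for election again after it has stepped down.
std::int64_t startElection(Primary& node) {
    assert(!node.isPrimary);
    node.term += 1;
    return node.term;
}
```

In this model, a primary at {term 0, config version 3} that receives a heartbeat with {term 1, config version 2} and no check moves to term 1, steps down, and then stands for election in term 2, matching step 6. With the check enabled, the same heartbeat leaves it as primary in term 0.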



 Comments   
Comment by Gregory McKeon (Inactive) [ 10/Dec/18 ]

Since this only affects 3.6 and doesn't cause data corruption, we won't fix this.

Generated at Thu Feb 08 04:48:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.