[SERVER-55465] Fix Invariant upon failed request for a vote from the current primary in the election dry-run of catchup takeover Created: 23/Mar/21  Updated: 02/Jan/24  Resolved: 11/Jun/21

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.2.15, 4.4.7, 4.0.26, 4.2.16, 5.1.0-rc0, 5.0.24

Type: Bug
Priority: Major - P3
Reporter: Moustafa Maher
Assignee: Vishnu Kaushik
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
Related
related to SERVER-29502 Require the vote from the current pri... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0, v4.4, v4.2, v4.0, v3.6
Sprint: Repl 2021-06-14
Participants:

 Description   

This invariant can be reached in the following scenario:

1- A follower node starts a dry-run election for a catchup takeover and passes primaryIndex to the voteRequester, which starts scatter_and_gather_runner to send vote requests to all voters.

2- All requests are scheduled with processResponse as their callback.

3- The primary node's response is the last one to be processed, so the _callbacks list is empty afterwards.

4- The request to the primary node failed, so processResponse receives a failed response and returns, leaving _primaryVote unchanged as PrimaryVote::Pending.

5- The call to hasReceivedSufficientResponses therefore returns false even though no responses are outstanding, causing the invariant to fire.
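
The scenario above can be illustrated with a small standalone model. This is only a sketch: DryRunTally, its fields, and the check ordering in hasReceivedSufficientResponsesBeforeFix are hypothetical names and assumptions inferred from the steps in this description, not code copied from the server source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for the vote requester's bookkeeping.
enum class PrimaryVote { Pending, Yes, No };

struct DryRunTally {
    std::size_t targets = 0;             // voters that were sent a request
    std::size_t responsesProcessed = 0;  // responses handled, failed or not
    std::size_t votes = 0;               // votes granted so far
    std::size_t majority = 0;            // votes needed to win
    bool hasPrimary = false;             // true for a catchup takeover dry run
    PrimaryVote primaryVote = PrimaryVote::Pending;

    // A successful response updates the vote count and, if it came from the
    // primary, the primary's vote state.
    void recordVote(bool granted, bool fromPrimary) {
        assert(responsesProcessed < targets);
        ++responsesProcessed;
        if (granted) {
            ++votes;
        }
        if (fromPrimary) {
            primaryVote = granted ? PrimaryVote::Yes : PrimaryVote::No;
        }
    }

    // A failed response is counted but leaves primaryVote untouched (step 4).
    void recordFailedResponse() {
        assert(responsesProcessed < targets);
        ++responsesProcessed;
    }
};

// Assumed ordering: the primary's vote is checked before the
// "all responses processed" condition is ever considered.
bool hasReceivedSufficientResponsesBeforeFix(const DryRunTally& t) {
    if (t.hasPrimary && t.primaryVote == PrimaryVote::No) {
        return true;
    }
    if (t.hasPrimary && t.primaryVote == PrimaryVote::Pending) {
        return false;  // step 5: still "waiting" although nothing is outstanding
    }
    return t.votes >= t.majority || t.responsesProcessed == t.targets;
}
```

With two targets, one granted vote from a secondary followed by a failed response from the primary leaves responsesProcessed equal to targets while the function still returns false, which is the state in which the invariant fires.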

 

Request for additional diagnostic information:
Can we add primaryIndex to that log line, so that it records the last known primaryIndex for this node while it runs the dry-run election for catchup takeover?



 Comments   
Comment by Githook User [ 02/Jan/24 ]

Author:

{'name': 'Vishnu Kaushik', 'email': 'vishnu.kaushik@mongodb.com', 'username': 'kauboy26'}

Message: SERVER-55465 Response from all nodes means sufficient responses have been received even if primary gave bad response in catchup takeover dry run

(cherry picked from commit efec3cc4b253d02fa9e11947ce92d53b727181b0)

GitOrigin-RevId: 26b41a69a0b4f7fe8170158a9f77e63d408d5849
Branch: v5.0
https://github.com/mongodb/mongo/commit/c460c84942fbbbd36948da98c2c25803612379ca

Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 06/Jul/21 ]

Author:

{'name': 'Vishnu Kaushik', 'email': 'vishnu.kaushik@mongodb.com', 'username': 'kauboy26'}

Message: SERVER-55465 Response from all nodes means sufficient responses have been received even if primary gave bad response in catchup takeover dry run
Branch: v4.2
https://github.com/mongodb/mongo/commit/a611900d667b861528d04492c48972be17a00a0a

Comment by Githook User [ 06/Jul/21 ]

Author:

{'name': 'Vishnu Kaushik', 'email': 'vishnu.kaushik@mongodb.com', 'username': 'kauboy26'}

Message: SERVER-55465 Response from all nodes means sufficient responses have been received even if primary gave bad response in catchup takeover dry run
Branch: v4.0
https://github.com/mongodb/mongo/commit/14fb0899e131721ba80a2cc921cdb9b7d6f3dafa

Comment by Githook User [ 29/Jun/21 ]

Author:

{'name': 'Vishnu Kaushik', 'email': 'vishnu.kaushik@mongodb.com', 'username': 'kauboy26'}

Message: SERVER-55465 Response from all nodes means sufficient responses have been received even if primary gave bad response in catchup takeover dry run
Branch: v4.4
https://github.com/mongodb/mongo/commit/ab6c88a7db9b16ff8120a6c2ad63f2fe597bc388

Comment by Githook User [ 10/Jun/21 ]

Author:

{'name': 'Vishnu Kaushik', 'email': 'vishnu.kaushik@mongodb.com', 'username': 'kauboy26'}

Message: SERVER-55465 Response from all nodes means sufficient responses have been received even if primary gave bad response in catchup takeover dry run
Branch: master
https://github.com/mongodb/mongo/commit/efec3cc4b253d02fa9e11947ce92d53b727181b0

Comment by Moustafa Maher [ 24/Mar/21 ]

One proposal is to move this check:

if (_responsesProcessed == static_cast<int>(_targets.size())) {
   return true;
}

so that it becomes the first check in hasReceivedSufficientResponses.
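
A minimal sketch of what this proposal could look like in isolation follows. TallyView and hasReceivedSufficientResponsesProposed are hypothetical names, and the checks that follow the first one are assumptions made for illustration; the actual change is in the commits linked above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified snapshot of the vote requester's state.
enum class PrimaryVote { Pending, Yes, No };

struct TallyView {
    std::size_t targets;             // voters that were sent a request
    std::size_t responsesProcessed;  // responses handled, failed or not
    std::size_t votes;               // votes granted so far
    std::size_t majority;            // votes needed to win
    bool hasPrimary;                 // true for a catchup takeover dry run
    PrimaryVote primaryVote;
};

bool hasReceivedSufficientResponsesProposed(const TallyView& t) {
    assert(t.responsesProcessed <= t.targets);

    // Proposed first check: once every target has responded (or failed),
    // there is nothing left to wait for, whatever the primary's vote state.
    if (t.responsesProcessed == t.targets) {
        return true;
    }
    // Assumed remaining checks: a "no" from the primary ends the dry run
    // early, and a pending primary vote means we keep waiting.
    if (t.hasPrimary && t.primaryVote == PrimaryVote::No) {
        return true;
    }
    if (t.hasPrimary && t.primaryVote == PrimaryVote::Pending) {
        return false;
    }
    return t.votes >= t.majority;
}
```

In this model, the state from the description (all responses processed, primary vote still Pending) now reports sufficient responses, while a run that is still waiting on the primary continues to report false.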
