[SERVER-72225] Failed reconfig can appear to succeed Created: 16/Dec/22  Updated: 29/Oct/23  Resolved: 03/Mar/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.0.0-rc0

Type: Bug
Priority: Major - P3
Reporter: Huayu Ouyang
Assignee: Ali Mir
Resolution: Fixed
Votes: 0
Labels: repl-shortlist
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Assigned Teams:
Replication
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2023-03-06
Participants:
Linked BF Score: 5

 Description   

The following situation occurs:
1. Node0 is disconnected from the rest of the set
2. A reconfig for C1 (1007, 1) is issued on node0 (in a parallel shell)
3. The rest of the set elects a new primary and is reconfiged to have a new config C2 (8, 2)
4. Node0 is reconnected, and upon communicating with the rest of the set, successfully fetches and installs the config C2.
5. Simultaneously, however, in the reconfig for C1, awaitConfigCommitment calls awaitReplication, which ends up calling _doneWaitingForReplication_inlock to check the configs of the other nodes against our own. When we build the config predicate, we only check that the other nodes' config version and term match our current config version and term, not that C1's specific version and term have been committed.

So if the reconfig for C2 on node0 has already succeeded, node0 has the same config version and term (8, 2) as the other nodes. This causes replSetReconfig for C1 to erroneously report success even though the reconfig for C1 has actually failed.
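
The faulty check can be illustrated with a minimal Python sketch. This is a simplified model of the behavior described above, not the server's C++ code: `Node`, `buggy_config_predicate`, and `done_waiting` are hypothetical names, and a config is reduced to a (version, term) tuple.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Config = Tuple[int, int]  # (version, term); simplified stand-in for a replica set config


@dataclass
class Node:
    name: str
    config: Config


def buggy_config_predicate(local: Node) -> Callable[[Node], bool]:
    """Compare each member against the local node's CURRENT config.

    Nothing here remembers which config the reconfig command is waiting for.
    """
    def predicate(member: Node) -> bool:
        return member.config == local.config
    return predicate


def done_waiting(local: Node, members: List[Node]) -> bool:
    # Hypothetical stand-in for the commitment check: all members must match.
    predicate = buggy_config_predicate(local)
    return all(predicate(m) for m in members)


c1: Config = (1007, 1)  # config issued on node0 while it is partitioned
c2: Config = (8, 2)     # config installed by the rest of the set

node0 = Node("node0", c1)
members = [node0, Node("node1", c2), Node("node2", c2)]

# While node0 still holds C1, the wait for C1 is correctly unsatisfied.
before_reconnect = done_waiting(node0, members)

# node0 reconnects and installs C2 while the C1 reconfig is still waiting.
node0.config = c2
after_reconnect = done_waiting(node0, members)

print(before_reconnect, after_reconnect)
```

In this model the second check passes even though no node ever held C1, which mirrors the false success reported by the reconfig for C1.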

A possible solution is to make awaitConfigCommitment check specifically that the new config's own version and term have been committed.
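
A rough Python sketch of that idea follows, assuming the waiter records the (version, term) it was issued for and fails if the local node no longer holds that config. `check_config_commitment` and `ConfigReplacedError` are hypothetical names for illustration; the server's actual change may differ in detail.

```python
from typing import Iterable, Tuple

Config = Tuple[int, int]  # (version, term)


class ConfigReplacedError(Exception):
    """Hypothetical error: a different config was installed while waiting."""


def check_config_commitment(
    target: Config,
    local_config: Config,
    member_configs: Iterable[Config],
) -> bool:
    """Return True once every member holds `target`, False if still waiting.

    Raises ConfigReplacedError when the local node has moved to another
    config, so the wait cannot be satisfied by that other config.
    """
    if local_config != target:
        raise ConfigReplacedError(
            f"waiting for {target}, but local config is now {local_config}"
        )
    return all(cfg == target for cfg in member_configs)


c1: Config = (1007, 1)
c2: Config = (8, 2)

# Scenario from this ticket: node0 installed C2 while waiting on C1.
try:
    check_config_commitment(c1, c2, [c2, c2, c2])
    outcome = "committed"
except ConfigReplacedError:
    outcome = "failed"

print(outcome)
```

The comparison is made against the captured target rather than the node's current config, so agreement on C2 across the set can no longer be mistaken for commitment of C1.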



 Comments   
Comment by Githook User [ 03/Mar/23 ]

Author: Ali Mir (email: ali.mir@mongodb.com, username: ali-mir)

Message: SERVER-72225 Fail waiting for config commitment if new config was installed while waiting
Branch: master
https://github.com/mongodb/mongo/commit/029d05353bf56ee570394c8719816723e90c93bd

Comment by Ali Mir [ 23/Feb/23 ]

Just noting down the exact race that produces this failure. It's a very specific interleaving of events:

thread 1, performing reconfig(1007, 1)                           // thread 2, main thread
                                                                 //  - network split
- reconfig with config (1007, 1)
- awaitReplication is called to wait
  for this config to be committed
                                                                 //  - node reconnects
                                                                 //  - receives heartbeat with higher term
                                                                 //  - starts stepdown
                                                                 //  - schedules the stepdown reconfig for config (8, 2), and completes it
- checks if all the member nodes
  have the same config version
  and term. check succeeds because
  all nodes have (8, 2)
- successfully returns from reconfig
                                                                 //  - stepdown thread starts to kill user ops (this normally would interrupt the reconfig in thread 1)
