- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Replication
- Fully Compatible
- ALL
- Repl 2023-03-06
The following situation occurs:
1. Node0 is disconnected from the rest of the set
2. A reconfig for C1 (version 1007, term 1) is issued on node0 (in a parallel shell)
3. The rest of the set elects a new primary and is reconfigured to a new config C2 (version 8, term 2)
4. Node0 is reconnected, and upon communicating with the rest of the set, successfully fetches and installs the config C2.
5. Simultaneously, in the reconfig for C1, awaitConfigCommitment calls awaitReplication, which ends up calling _doneWaitingForReplication_inlock to check the other nodes' configs against our own. However, when we build the config predicate, we only check that the other nodes' config version and term match our current config version and term, not that C1's specific version and term have been committed.
So if node0 has already installed C2, it has the same config version and term (8, 2) as the other nodes, which causes replSetReconfig for C1 to erroneously report success even though the reconfig for C1 actually failed.
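The race can be illustrated with a small model. The Python sketch below is not the server code: `ConfigId`, `Node`, and `await_commitment_buggy` are hypothetical names that only mirror the behavior described above, where the predicate is built from whatever config the local node has installed at the moment of the check.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ConfigId:
    """Hypothetical stand-in for a config's (version, term) pair."""
    version: int
    term: int


@dataclass
class Node:
    name: str
    config: ConfigId


def await_commitment_buggy(local: Node, others: List[Node]) -> bool:
    """Model of the problematic check: peers are compared against the
    config the local node has installed now, not the one it proposed."""
    current = local.config
    return all(peer.config == current for peer in others)


C1 = ConfigId(version=1007, term=1)
C2 = ConfigId(version=8, term=2)

node0 = Node("node0", C1)                        # step 2: reconfig for C1 issued on node0
others = [Node("node1", C2), Node("node2", C2)]  # step 3: rest of the set moves to C2
node0.config = C2                                # step 4: node0 installs C2

# Step 5: the wait started for C1 now compares C2 against C2.
buggy_result = await_commitment_buggy(node0, others)
print(buggy_result)  # True, although C1 was never committed
```

In this model the wait for C1 reports success purely because node0's current config happens to match its peers, which is the erroneous result described in this ticket.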
A possible solution is to make awaitConfigCommitment check specifically that the new config's own version and term have been committed.
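One way to picture this suggestion is to capture the (version, term) of the config being proposed when the reconfig starts, and have the wait compare against that pair instead of the node's current config. The following Python sketch uses hypothetical names (`await_commitment_for`, a plain tuple for the config id) and requires all peers to match for simplicity; it models the idea only and is not the actual server change.

```python
from typing import List, Tuple

ConfigId = Tuple[int, int]  # hypothetical (version, term) pair


def await_commitment_for(target: ConfigId, local: ConfigId,
                         peers: List[ConfigId]) -> bool:
    """Return True only if the specific target config is installed
    locally and on every peer in this simplified model."""
    if local != target:
        # The local config was replaced, so the target was not committed.
        return False
    return all(peer == target for peer in peers)


C1 = (1007, 1)
C2 = (8, 2)

# node0 proposed C1 but has since installed C2, like the rest of the set.
c1_committed = await_commitment_for(C1, local=C2, peers=[C2, C2])
c2_committed = await_commitment_for(C2, local=C2, peers=[C2, C2])
print(c1_committed, c2_committed)  # False True
```

Because the target is fixed at the time the reconfig is issued, a later install of C2 can no longer make the wait for C1 appear successful.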