We see the following sequence occur when running rollback_set_fcv.js:
- The first test runs successfully and leaves the replica set at FCV 4.9.
- Then, in the second test, the primary first sets the FCV of the entire replica set to 4.4. Since this is a downgrade from 4.9, we issue a reconfig, which waits to be propagated to a majority of the nodes.
- We see a log message indicating that it has been propagated to a majority of nodes:
- The heartbeat requests also suggest that all nodes have in fact updated to the new config (configVersion: 6, configTerm: 2).
- However, on the primary, we only receive heartbeat responses with the new config of configVersion: 6, configTerm: 2 from one of the secondaries:
- After this, the primary is network partitioned from the other nodes, and attempts to set its FCV to the target version of 4.9.
- In order to do this, the primary must first wait for the current replica set config (configVersion: 6, configTerm: 2) to propagate to all nodes. It seems like the primary never updated its local view of one of the secondaries (the node on port 20770), so it still believes that secondary has an old config. Since the primary is partitioned, it can never update its view of the other nodes' configs, so awaitReplication hangs and the primary never sets its FCV to the target version of 4.9.
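
The stale-view condition in the last step can be sketched with a small standalone model. This is only an illustration, not the server's implementation: the three-member layout, the member names, the old config version of 5, and both helper functions are assumptions made for the example.

```javascript
// Hypothetical model of the primary's local view of each member's config.
// Member names, the three-node layout, and the old version (5) are assumptions.
const targetConfig = { version: 6, term: 2 };

const primaryView = [
  { name: "primary", version: 6, term: 2 },
  { name: "secondaryA", version: 6, term: 2 },
  // Stale entry: the primary never recorded a heartbeat response with the new config.
  { name: "secondaryB", version: 5, term: 2 },
];

// True if the recorded config for this member matches the target config.
function hasConfig(member, config) {
  return member.version === config.version && member.term === config.term;
}

// The check the initial downgrade reconfig satisfies: a strict majority.
function majorityHasConfig(view, config) {
  const count = view.filter((m) => hasConfig(m, config)).length;
  return count > view.length / 2;
}

// The check the later FCV change needs: every member.
function allHaveConfig(view, config) {
  return view.every((m) => hasConfig(m, config));
}

console.log(majorityHasConfig(primaryView, targetConfig)); // true
console.log(allHaveConfig(primaryView, targetConfig)); // false
```

In this model the majority check passes while the all-nodes check fails. Once the primary is partitioned, nothing can refresh the stale entry, which matches the hang described above.
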
One possible solution is to make sure the initial downgrade reconfig is propagated to all of the nodes in the replica set, rather than just a majority. This way, the primary's view of the other nodes' configs would be up to date before it tries to set its FCV.
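
A minimal sketch of that idea follows, assuming the test can poll each node's config before partitioning the primary. The helper name `waitForConfigOnAllNodes`, the `fetchConfig` callback, the node names, and the old version of 5 are all hypothetical. In a real test the callback would read the node's actual config (for example through the replSetGetConfig command) and the loop would sleep between polls.

```javascript
// Hypothetical helper: returns true once every node reports the target config,
// or false if that does not happen within maxAttempts polls.
// `fetchConfig` is an assumed callback standing in for however the test reads
// a node's current config.
function waitForConfigOnAllNodes(nodes, target, fetchConfig, maxAttempts = 10) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const lagging = nodes.filter((node) => {
      const cfg = fetchConfig(node);
      return cfg.version !== target.version || cfg.term !== target.term;
    });
    if (lagging.length === 0) {
      return true;
    }
    // A real test would sleep here before polling again.
  }
  return false;
}

// Usage with fake nodes: n3 only reports the new config on its third poll.
const pollCount = { n1: 0, n2: 0, n3: 0 };
const catchUpAfter = { n1: 0, n2: 0, n3: 2 };
const fakeFetch = (node) => {
  const seen = pollCount[node]++;
  return seen >= catchUpAfter[node]
    ? { version: 6, term: 2 }
    : { version: 5, term: 2 };
};

const ok = waitForConfigOnAllNodes(["n1", "n2", "n3"], { version: 6, term: 2 }, fakeFetch);
console.log(ok); // true
```

Waiting on every node rather than a majority trades some test latency for a guarantee that no member is still on the old config when the partition is introduced.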