I believe this bug goes all the way back to the beginning of the setFCV framework. Therefore it will need to be backported.
A setFCV command changes the FCV value twice: first to put the FCV into upgrading / downgrading, then to put the FCV into fully upgraded / fully downgraded. For each of these FCV writes, we wait for majority confirmation before proceeding.
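The two-phase shape can be sketched as a toy simulation (hypothetical names, not the real server code; `wait_for_majority` stands in for awaiting majority write concern):

```python
# Toy sketch of the two-phase setFCV write protocol. An interruption
# inside either wait_for_majority() call is where the corresponding
# FCV write can later be rolled back.

class FCVDocument:
    def __init__(self, version):
        self.version = version   # e.g. "4.4"
        self.target = None       # set while upgrading/downgrading

def set_fcv(doc, target, wait_for_majority):
    # Write 1: enter the transitional (upgrading/downgrading) state.
    doc.target = target
    wait_for_majority()          # interruption here can roll back write 1

    # ... per-version upgrade/downgrade work happens here ...

    # Write 2: enter the fully upgraded/downgraded state.
    doc.version = target
    doc.target = None
    wait_for_majority()          # interruption here can roll back write 2

doc = FCVDocument("4.4")
set_fcv(doc, "5.0", wait_for_majority=lambda: None)
assert doc.version == "5.0" and doc.target is None
```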
However, setFCV can be interrupted while waiting for majority write concern – InterruptedDueToReplStateChange, for example – and a subsequent rollback can undo one step of the FCV change. This manifested in test failures where the in-memory FCV value was found not to match the persisted FCV value: the persisted value had been rolled back, but the in-memory value was left unchanged by the rollback. Recover-to-stable-timestamp rollback wipes out writes back to the stable checkpoint and then replays oplog entries forward to the desired point, so the undone FCV write never even passes through the OpObserver that would keep the in-memory value in sync.
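A toy model of that mismatch, with assumed names (the real mechanism is recover-to-stable-timestamp restoring the on-disk document while nothing refreshes the separate in-memory FCV cache):

```python
# Persisted FCV document and in-memory cache start in sync at the
# stable checkpoint.
checkpoint = {"version": "4.4", "target": None}
persisted = dict(checkpoint)
in_memory = dict(checkpoint)

# setFCV performs write 1; the OpObserver mirrors it into memory.
persisted.update(target="5.0")
in_memory.update(target="5.0")

# The node rolls back before the write is majority-committed:
# recover-to-stable restores the checkpoint, and no surviving oplog
# entry for the FCV write ever reaches the OpObserver.
persisted = dict(checkpoint)

assert persisted != in_memory   # the inconsistency the tests caught
```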
I think it’s okay if rollback moves the FCV from fully upgraded/downgraded back to upgrading/downgrading, because the user can simply rerun setFCV in the same direction and the logic is idempotent. This scenario is the same as the server failing at any point during setFCV and the command being retried, which we know works.
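A self-contained sketch of why the retry converges, assuming (as the setFCV framework intends) that each step is idempotent:

```python
# Toy idempotent setFCV: rerunning it from a transitional state left
# behind by a crash or rollback still reaches the target state.

def set_fcv(doc, target):
    # Write 1: enter the transitional state (a no-op if a rollback
    # already left us here).
    doc["target"] = target
    # ... per-version upgrade/downgrade work, itself idempotent ...
    # Write 2: finish the transition.
    doc["version"] = target
    doc["target"] = None

# Rollback left us mid-upgrade ("upgrading to 5.0"):
doc = {"version": "4.4", "target": "5.0"}
set_fcv(doc, "5.0")   # the user simply reruns the command
assert doc == {"version": "5.0", "target": None}
```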
However, rolling back from upgrading/downgrading to fully downgraded/upgraded requires running the setFCV logic to make sure the rest of the server's settings match the new FCV. And I believe we must finish an in-progress upgrade/downgrade before we can move in the opposite direction. Config servers will be their own special problem, because their setFCV logic involves setting the FCV on the shard servers either first or last (I forget which).