[SERVER-44607] Rollback of an interrupted setFCV cmd can result in the in-memory serverGlobalParams.featureCompatibility diverging from what's written on disk Created: 13/Nov/19  Updated: 10/Mar/20  Resolved: 10/Mar/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Dianna Hohensee (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
duplicates SERVER-46758 setFCV can be interrupted before an F... Closed
Operating System: ALL
Backport Requested:
v4.4, v4.2
Sprint: Execution Team 2020-01-13, Execution Team 2020-01-27, Execution Team 2019-12-30, Execution Team 2020-03-09, Execution Team 2020-03-23
Participants:
Linked BF Score: 23

 Description   

I recommend reloading the feature compatibility version from disk, parsing it, and resetting the in-memory serverGlobalParams.featureCompatibility value after a repl rollback has finished.

Scenario:
1) setFCV does the first write to set "downgrading to 4.2", which is majority committed
2) setFCV does the second write to set "fully downgraded to 4.2", which is not majority committed due to an InterruptedDueToReplStateChange error
3) repl rollback undoes the second write, so the FCV document on disk is back to "downgrading to 4.2"
4) The serverGlobalParams.featureCompatibility value is still set to "fully downgraded to 4.2"
5) A new setFCV(4.2) cmd comes in, sees that the serverGlobalParams.featureCompatibility value is "fully downgraded to 4.2", and exits early without touching the FCV document on disk.
6) When the node is next restarted, it loads "downgrading to 4.2" from disk into serverGlobalParams.featureCompatibility, which exposes the divergence.
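
The six steps above can be replayed against a toy model. This is only an illustrative sketch in Python, not server code: the Node class, its method names, and the "initial (assumed)" starting state are all hypothetical, and the model simply assumes that rollback restores the document on disk without refreshing the in-memory value.

```python
from dataclasses import dataclass

# FCV states named in the scenario; INITIAL is an assumed placeholder.
INITIAL = "initial (assumed)"
DOWNGRADING = "downgrading to 4.2"
DOWNGRADED = "fully downgraded to 4.2"


@dataclass
class Node:
    """Hypothetical model: the FCV document on disk plus the in-memory copy."""
    on_disk_fcv: str = INITIAL
    in_memory_fcv: str = INITIAL

    def write_fcv(self, value: str) -> None:
        # A normal FCV write updates both the document and the in-memory value.
        self.on_disk_fcv = value
        self.in_memory_fcv = value

    def rollback_to(self, value: str) -> None:
        # Assumption modelling the bug: rollback restores the document on disk
        # but leaves the in-memory value untouched.
        self.on_disk_fcv = value

    def set_fcv_42(self) -> str:
        # Exits early when the in-memory value says the downgrade is complete.
        if self.in_memory_fcv == DOWNGRADED:
            return "early exit"
        self.write_fcv(DOWNGRADING)
        self.write_fcv(DOWNGRADED)
        return "completed"

    def restart(self) -> None:
        # Startup loads the in-memory value from the document on disk.
        self.in_memory_fcv = self.on_disk_fcv


node = Node()
node.write_fcv(DOWNGRADING)                        # step 1
node.write_fcv(DOWNGRADED)                         # step 2
node.rollback_to(DOWNGRADING)                      # step 3
diverged = node.in_memory_fcv != node.on_disk_fcv  # step 4
result = node.set_fcv_42()                         # step 5
node.restart()                                     # step 6
```

In this model the retry in step 5 never reaches the writes, so the document stays at "downgrading to 4.2" until something other than the in-memory check forces the downgrade to finish.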

See the associated test failure for further details of how this happened in a test.

This should be backported to at least v4.2, where there's a test failure as well. I haven't checked whether earlier versions have the issue, but it seems likely.



 Comments   
Comment by Dianna Hohensee (Inactive) [ 10/Mar/20 ]

I am closing this in favor of SERVER-46758, which re-describes the larger problem we have since realized.

Comment by Dianna Hohensee (Inactive) [ 10/Mar/20 ]

I was originally considering adding a reloadFromDiskAfterRollback function to the feature_compatibility_version.h/cpp files. I ran this solution by Maria, because she knows the relevant code well, and she raised an interesting question about whether we would need to notify the oplog observer about the FCV change.
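
For illustration only, the idea behind that function could look roughly like the Python sketch below. It is not the server's C++ code and was never confirmed as the fix: FcvState, the notify callback, and the return value are hypothetical, and the callback merely stands in for the open question of whether some observer needs to hear about the change.

```python
from typing import Callable, Optional

DOWNGRADING = "downgrading to 4.2"
DOWNGRADED = "fully downgraded to 4.2"


class FcvState:
    """Hypothetical holder for the FCV document value and its in-memory copy."""

    def __init__(self, on_disk: str, in_memory: str) -> None:
        self.on_disk = on_disk
        self.in_memory = in_memory


def reload_from_disk_after_rollback(
    state: FcvState,
    notify: Optional[Callable[[str], None]] = None,
) -> bool:
    """Reset the in-memory FCV from disk; return True if the value changed."""
    disk_value = state.on_disk
    if state.in_memory == disk_value:
        return False  # nothing was rolled back, so there is nothing to reset
    state.in_memory = disk_value
    if notify is not None:
        # Placeholder for whatever notification (if any) turns out to be needed.
        notify(disk_value)
    return True


# State right after rollback in the scenario: disk was undone, memory was not.
state = FcvState(on_disk=DOWNGRADING, in_memory=DOWNGRADED)
seen = []
changed = reload_from_disk_after_rollback(state, notify=seen.append)
changed_again = reload_from_disk_after_rollback(state, notify=seen.append)
```

With the in-memory value back at "downgrading to 4.2", a retried setFCV(4.2) would no longer take the early-exit path in this model. Whether resetting the value alone is sufficient is exactly what the following paragraph questions.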

I think we now have a bigger question about how to deal with rollback undoing an FCV change, since an FCV change brings all of the associated FCV server changes with it. The setFCV cmd is idempotent for finishing an upgrade or downgrade that has begun. But we can't just change the FCV (roll it back) without running the FCV changes as usual.
