[SERVER-48044] FCV::isVersion should invariant if FCV is unset Created: 08/May/20  Updated: 22/Jul/20  Resolved: 22/Jul/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Lingzhi Deng
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
duplicates SERVER-49070 Add FCV gating helpers Closed
Related
is related to SERVER-48050 FCV should be initialized before atte... Closed
is related to SERVER-32630 Minimize reads of the featureCompatib... Closed
Operating System: ALL
Backport Requested:
v4.4
Sprint: Repl 2020-06-01, Repl 2020-06-15, Repl 2020-07-27
Participants:
Linked BF Score: 6

 Description   

A new way was introduced to read the FCV variable, but it did not carry over the check that the FCV value was initialized.



 Comments   
Comment by Lingzhi Deng [ 22/Jul/20 ]

The reason we might want to always invariant that the FCV is set (i.e. not kUnsetDefault44Behavior) both from clean startup and from restart is because we want to fail loudly if we write (i.e. persist data) based on FCV before having a FCV. By invarianting isVersionInitialized(), we force a crash and force the engineer to think about whether their code is supposed to run before having a FCV. And even if yes, the engineer need to figure out what the default behavior should be. The default behavior is not necessarily 44 (last stable) behavior and should be determine case by case (e.g. creating a collection/index before initializing a node as replSet will follow newer behavior when FCV is kUnsetDefault44Behavior).

In order to make sure callers of isVersionInitialized() are safe, we will probably need to evaluate callers of isVersionInitialized before each release, making sure they are not defaulting to a behavior that would cause unexpected behavior after upgrade/downgrade. This is tracked by SERVER-48106.

After SERVER-49070, isVersion() is gone. And for new FCV helpers being added in SERVER-49070, we will also add an invariant for isVersionInitialized(). So I think we can close this.

Comment by Siyuan Zhou [ 08/Jul/20 ]

Discussed with lingzhi.deng, jason.chan and tess.avitabile offline. I have a better understanding of Lingzhi's proposal now.

The FCV could still be uninitialized (i.e. kUnsetDefault44Behavior) even if the FCV has finished loading.

I thought FCV is only uninitialized during startup, but it's not the case. For example, it remains uninitialized when a node started with --replSet but hasn't received replSetInitiate or finished initial sync.

Because of this subtle difference, Lingzhi is proposing to add a new variable to distinguish startup from uninitialized. The new variable is set during startup and will be checked as an invariant when comparing FCV's.

As mentioned before, kUnsetDefault44Behavior should be treated the same as kFullyDowngradedTo44 in most cases. The new isFCVGreaterThanOrEqualTo() helper encourages developers to follow this pattern.

Comment by Lingzhi Deng [ 02/Jul/20 ]

I dont think we change to kFullyDowngradedTo44 after loading from disk if the FCV has never been set before. isVersionInitialized() still return false in that case, which means FCV == kUnsetDefault44Behavior.
Option 2 is to invariant that we only do FCV comparisons after the FCV has finished loading. Note: The FCV could still be uninitialized (i.e. kUnsetDefault44Behavior) even if the FCV has finished loading. On restart, we have timelines like:

mongod startup -------------- load fcv ------------------------- 
loadedFCV=false                 loadedFCV=true

Option 2 is to add an invariant similar to invariant(loadedFCV) (instead of invariant(isVersionInitialized())) in FCV helpers to make sure we only do FCV comparisons after the FCV has finished loading. As I said, it is not a bug if we do FCV comparisons before FCV is ever initialized, but it is a bug if we do FCV comparisons before FCV is loaded from disk after restart. So option 2 helps to detect the latter.

EDIT: By the way, we probably should stop using the term "uninitialized" to mean both default kUnsetDefault44Behavior and "not yet been loaded after restart" because they are fundamentally different. And we could have kUnsetDefault44Behavior even if we finish loading after restarts (e.g. Start a node with --replSet and restart it before running replSetInitiate). Having a kUnsetDefault44Behavior FCV value is not an issue, the real issue is reading the in-memory FCV value when it has "not yet been loaded after restart".

Comment by Siyuan Zhou [ 02/Jul/20 ]

Nice write-up! I don't follow how different would option 2 be from the current behavior + adding the invariant. IIUC, the FCV changes to kFullyDowngradedTo44 after initialization. In the case of shutdownTask() mentioned above, do we have to add the check of initialization for it?

Comment by Jason Chan [ 01/Jul/20 ]

Discussed this offline with lingzhi.deng, we agree that option 2 would be most ideal because it draws the clearest contract of only checking FCV when the FCV value has been loaded from disk. If there is a codepath that checks FCV before we have loaded it from disk, we will crash always, and it will be up to the developer to decide between the following:

  • move their logic to only run after the FCV has finished loading (we believe this to be a possibility for most cases)
  • add a isVersionInitialized() check along with their FCV helper to gate their feature
    • we believe this case to be extremely rare, and so we think we can eliminate the number of sources that use this pattern. We hope this will decrease the chances of people copy and pasting.
  • decide if there is an alternative to FCV to choose whether to run old/new behaviour
Comment by Lingzhi Deng [ 01/Jul/20 ]

After thinking about this a bit more, I think ideally we should not need to worry about if FCV is initialized in FCV helpers. The reasons is that in case 1 (clean startup) I posted above, we need to allow general comparison against kUnsetDefaultXXBehavior (the smallest FCV constant) so that we can use FCV checks to choose default behaviors before FCV is initialized from a clean startup.

In fact, two (SERVER-34523, SERVER-46791) out of the three tickets Siyuan mentioned as "bugs" due to FCV initialization problems were actually expected to run at clean startup with default old stable FCV. They became bugs because we got false positive from the invariant. If we had a helper like isFCVGreaterThanOrEqualTo() as proposed in this project (without the invariant), SERVER-34523 could had been written as isFCVGreaterThanOrEqualTo(kFullyUpgradedTo40) and SERVER-46791 could had been written as isFCVGreaterThanOrEqualTo(kFullyUpgradedTo44). And this would handle the default behaviors as well.

That said, I do understand that loading the FCV from disk after a restart makes things a bit trickier. And I believe we wanted an invariant at the first place because we wanted to catch bugs where we do FCV comparisons before FCV is loaded (Note: I believe it is not a bug if we do FCV comparisons before FCV is ever initialized, but it is a bug if we do FCV comparisons before FCV is loaded from disk after restart). But I could argue that in an ideal world, we should always load FCV as (one of) the very first thing during restart so we don't need to worry about this problem in subsystem that start up after. Currently, we load FCV in repairDatabasesAndCheckVersion(), so this would only be a problem to start up logic that's run before this line.

So here are the options:

  1. Add invariants on isVersionInitialized() to FCV helpers and check for uninitialized FCV explicitly (This includes having an extra argument to the helpers to control the invariant).
    I think this actually makes FCV helpers harder to use for the reasons I stated above and because we have quite a few common code paths that could be run before FCV is ever set after a clean startup.
  2. Add a global variable indicating whether necessary startup steps has completed and we invariant FCV helpers are only used after startup completes (i.e. after FCV finishes loading).
    This is because, it is fine to have an uninitialized (default) FCV after startup if the FCV has never been set. But this will catch bugs like SERVER-48050 because that compares FCV before FCV is loaded from disk after restart.
  3. Do nothing and ensure that FCV is loaded early enough after restart.

Either 2 or 3 sounds good to me. daniel.gottlieb siyuan.zhou jason.chan what do you think?

Comment by Siyuan Zhou [ 30/Jun/20 ]

Discussed with lingzhi.deng offline. My understanding is that upgrade / downgrade is done in two-phase. 

  1. Support both old and new behaviors by upgrading the binary.
  2. Enable the new behavior actively by using FCV.

Because FCV is only used to enable a new behavior, it's desired to treat uninitialized FCV as the lowest FCV. In other words, both old and new behaviors are accepted, it's just the matter of choice. shutdownTask() and ReplSetHeartbeatArgsV1::addToBSON() are good examples.

SERVER-48050 diverged from the two-phase pattern mentioned above. It uses FCV to decide whether to support the new behavior. We've seen problems in mixed-version testing in the safe reconfig project following this pattern, so I'm wondering whether this use case is prone to similar errors.

lingzhi.deng, do you think it's preferable to add the invariant and check for uninitialized FCV explicit?

Comment by Lingzhi Deng [ 30/Jun/20 ]

Agreed. But I think there are two parts of it.
1. The FCV is not initialized because it is a clean startup. In this case, I would argue that defaulting FCV to kUnsetDefault44Behavior is correct. In the new upgrade/downgrade project, if we are planning to introduce less-than and greater-than operators rather than equality comparison, I think all these will start to make sense and server behaviors will default to last-lts. Most of the current callers of isVersion() fall into this category I believe.
2. The FCV is initialized but not yet loaded after a restart. I believe this is the case that you think that's error-prone. If we use kUnsetDefault44Behavior temporarily before the FCV doc is loaded, then it will be problematic because that doesn't necessarily match what's on-disk and/or what existing clients expect. And I believe SERVER-48050 falls into this category.

What's even more interesting is that, hypothetically, if there is a piece of code which is fine with defaulting to older behaviors in case 1 but would like to avoid flipping the behaviors in case 2, then it seems hard for the same logic to differentiate the two cases. Maybe instructing WT to load FCV earlier like Dan.G mentioned above is the ultimate solution for case 2, but until then, I dont know if it is even doable?

Comment by Siyuan Zhou [ 30/Jun/20 ]

Great auditing! If it's the common pattern to allow uninitialized FCV, how could developers support cross-version behavior? If they have to rely on some other subtle assumptions, like timing, ordering or specific locking, I'm afraid it's error-prone and a major limitation of FCV. Perhaps it's fine to follow the old or new behavior blindly before FCV initialization, then we need to include it into our best practice and design the FCV interface for that.

For a developer, it's hard to know whether their code could run before FCV initialization or not. It would be unfortunate if every developer has to know the implementation details of FCV. In reality, they'd probably wait until it fails in patch builds or more likely in build failures of mixed version testing to realize the problem.

On a side note, if we need to change the interface of FCV checking, I'd prefer to leverage the type system, like adding an argument or defining a class, than smart naming.

Comment by Lingzhi Deng [ 30/Jun/20 ]

Interestingly, currently all callers of FeatureCompatibility::isVersion() kind of "expect" an uninitialized FCV (except 1 which I believe is a bug and 4 on mongos). So I am not sure if there is any value fixing isVersion() at this point as having an uninitialized FCV seems to be a common case here (surprisingly).

  1. IndexDescriptor::compareIndexOptions()
    I believe this one is a bug, see SERVER-48050.
  2. shutdownTask()
    We don't necessarily have an initialized FCV if we shut down before a replica set is initialized.
  3. ReplSetHeartbeatArgsV1::addToBSON()
    We could send heartbeat to fetch config before we initialize the server as a node in a replSet (i.e. before the FCV is initialized).
  4. cleanupTask() on mongos
    This one is probably irrelevant because mongos always initialize FCV to kFullyUpgradedTo46 I think?
Comment by Lingzhi Deng [ 30/Jun/20 ]

Yes, that would work too. The goal here is to make it very explicit that engineers should avoid using FCV checks this way unless they have reasons to. I think we can add that argument for isVersion() for now as an example that we might follow later when we add more helpers in this project. (Though we will need to skip invarianting FCV for SERVER-48050 as a TODO)

Comment by Jason Chan [ 30/Jun/20 ]

lingzhi.deng Thanks for looking into this, I think adding a new helper makes sense. I believe our goal should be to make it more obvious to developers that we should only be checking isVersionInitialized for codepaths that expect to run when FCV is not yet initialized. We can replace the existing callers of isVersionInitialized with this new helper to prevent future copy and pasting.

EDIT: Or instead of adding a "_EXPECT_UNINITIALIZED_FCV" version of each new helper, how about passing in an argument flag that denotes whether we should check that the version is initialized?

Comment by Lingzhi Deng [ 30/Jun/20 ]

I am not sure if there is anything else that needs to be done as part of this ticket. And I am not sure if we can really get rid of this pattern completely even if we do what Dan.G recommended above to load the FCV earlier. In the two cases I posted above, we don't initialize the FCV on purpose because the replica set is not initialized yet. But we allow commands like serverStatus and isMaster to run. If some common code paths like these need to check FCV, they will certainly have to differentiate whether the FCV is initialized.

Even if the isVersion() is going away, new helpers would likely still have similar issues if they are used in code paths that run before the FCV is initialized / loaded. Maybe we can have a special version of these helpers that have a "_EXPECT_UNINITIALIZED_FCV" suffix and hopefully we can minimize the use of them? jason.chan siyuan.zhou any thoughts?

Comment by Lingzhi Deng [ 30/Jun/20 ]

Here are the current callers of isVersionInitialized() on the master branch as of 2d910ef unless we want to audit 4.4 (which I think it's probably not worth doing at this point).
As the master branch is relatively new compared to 4.4, there are only two places where we call isVersionInitialized() with getVersion():

  1. VectorClock::GossipFormat::OnlyGossipOutOnNewFCV::out()
    This one is needed because this is part of how we append operationTime to request responses. If a request (e.g. isMaster) comes before the node is initialized as a replSet (or added as a new node in a replSet), the fcv is not initialized and thus we will need to check isVersionInitialized() before using the getVersion(). An example call stack looks like:

     #4  0x0000557e66129179 in mongo::ServerGlobalParams::FeatureCompatibility::getVersion (this=0x557e6c011cb8 <mongo::serverGlobalParams+472>) at src/mongo/db/server_options.h:214
    #5  0x0000557e693020ac in mongo::VectorClock::GossipFormat::OnlyGossipOutOnNewFCV<mongo::VectorClock::GossipFormat::Plain>::out (this=0x7f1e9b8d5c20, service=0x7f1e9b830620, opCtx=
        0x7f1e9b7a6720, permitRefresh=false, out=0x7f1e8341c460, time=..., component=mongo::VectorClock::Component::ConfigTime) at src/mongo/db/vector_clock.cpp:190
    #6  0x0000557e692ff1b0 in mongo::VectorClock::_gossipOutComponent (this=0x7f1e9b55b7b0, opCtx=0x7f1e9b7a6720, out=0x7f1e8341c460, time=...,
        component=mongo::VectorClock::Component::ConfigTime) at src/mongo/db/vector_clock.cpp:389
    #7  0x0000557e66de7059 in mongo::(anonymous namespace)::VectorClockMongoD::_gossipOutInternal (this=0x7f1e9b55b7b0, opCtx=0x7f1e9b7a6720, out=0x7f1e8341c460, time=...)
        at src/mongo/db/vector_clock_mongod.cpp:114
    #8  0x0000557e692fefb0 in mongo::VectorClock::gossipOut (this=0x7f1e9b55b7b0, opCtx=0x7f1e9b7a6720, outMessage=0x7f1e8341c460, defaultClientSessionTags=0)                                         at src/mongo/db/vector_clock.cpp:361
    #9  0x0000557e66933e6b in mongo::(anonymous namespace)::appendClusterAndOperationTime (opCtx=0x7f1e9b7a6720, commandBodyFieldsBob=0x7f1e8341c460, metadataBob=0x7f1e8341c460, startTime=...)
        at src/mongo/db/service_entry_point_common.cpp:448
    #10 0x0000557e66937227 in mongo::(anonymous namespace)::runCommandImpl (opCtx=0x7f1e9b7a6720, invocation=0x7f1e95118160, request=..., replyBuilder=0x7f1e9b51d020, startOperationTime=...,
        behaviors=warning: RTTI symbol not found for class 'mongo::ServiceEntryPointMongod::Hooks'
    ..., extraFieldsBuilder=0x7f1e8341c800, sessionOptions=...) at src/mongo/db/service_entry_point_common.cpp:887
    #11 0x0000557e66939812 in mongo::(anonymous namespace)::execCommandDatabase (opCtx=0x7f1e9b7a6720, command=0x557e6bfd8280 <mongo::repl::(anonymous namespace)::cmdismaster>, request=...,
        replyBuilder=0x7f1e9b51d020, behaviors=warning: RTTI symbol not found for class 'mongo::ServiceEntryPointMongod::Hooks'
    ...) at src/mongo/db/service_entry_point_common.cpp:1176
    

  2. StorageEngineImpl::supportsResumableIndexBuilds()
    This one is called as part of serverStatus. Thus, similarly, if serverStatus is called (or by FTDC) before the node is initialized, we will need to check isVersionInitialized() before using the getVersion(). An example call stack looks like:

    #4  0x000055a271cf3a5a in mongo::ServerGlobalParams::FeatureCompatibility::getVersion (this=0x55a277be1db8 <mongo::serverGlobalParams+472>) at src/mongo/db/server_options.h:214
    #5  0x000055a272ec121c in mongo::StorageEngineImpl::supportsResumableIndexBuilds (this=0x7fdaeec35560) at src/mongo/db/storage/storage_engine_impl.cpp:857
    #6  0x000055a271faa1d0 in mongo::(anonymous namespace)::StorageSSS::generateSection (this=0x55a277ba2720 <mongo::(anonymous namespace)::storageSSS>, opCtx=0x7fdae8898e20, configElement=...)
        at src/mongo/db/storage/storage_init.cpp:60
    #7  0x000055a271c29206 in mongo::ServerStatusSection::appendSection (this=0x55a277ba2720 <mongo::(anonymous namespace)::storageSSS>, opCtx=0x7fdae8898e20, configElement=...,
        result=0x7fdadf88d480) at src/mongo/db/commands/server_status.h:100
    #8  0x000055a2734c9ba7 in mongo::CmdServerStatus::run (this=0x55a277bc53c0 <mongo::(anonymous namespace)::CmdServerStatusInstantiator::getInstance()::instance>, opCtx=0x7fdae8898e20,
        dbname="", cmdObj=..., result=...) at src/mongo/db/commands/server_status.cpp:127
    #9  0x000055a271ffec52 in mongo::BasicCommand::runWithReplyBuilder (this=0x55a277bc53c0 <mongo::(anonymous namespace)::CmdServerStatusInstantiator::getInstance()::instance>,
        opCtx=0x7fdae8898e20, db="", cmdObj=..., replyBuilder=0x7fdadf88d670) at src/mongo/db/commands.h:807
    #10 0x000055a2736b5b60 in mongo::BasicCommandWithReplyBuilderInterface::Invocation::run (this=0x7fdae8575600, opCtx=0x7fdae8898e20, result=0x7fdadf88d670) at src/mongo/db/commands.cpp:778
    #11 0x000055a2736ac8de in mongo::CommandHelpers::runCommandDirectly (opCtx=0x7fdae8898e20, request=...) at src/mongo/db/commands.cpp:157
    #12 0x000055a2725a8270 in mongo::FTDCServerStatusCommandCollector::collect (this=0x7fdaeed78860, opCtx=0x7fdae8898e20, builder=...) at src/mongo/db/ftdc/ftdc_server.cpp:234
    #13 0x000055a2725ec08e in mongo::FTDCCollectorCollection::collect (this=0x7fdae318e168, client=0x7fdae6c99d40) at src/mongo/db/ftdc/collector.cpp:91
    #14 0x000055a2725f145e in mongo::FTDCController::doLoop (this=0x7fdae318e020) at src/mongo/db/ftdc/controller.cpp:249
    #15 0x000055a2725f0b6d in mongo::FTDCController::<lambda()>::operator()(void) const (__closure=0x7fdaeec0b200) at src/mongo/db/ftdc/controller.cpp:145
    #16 0x000055a2725f2186 in std::__invoke_impl<void, mongo::FTDCController::start()::<lambda()> >(std::__invoke_other, mongo::FTDCController::<lambda()> &&) (__f=...)
        at /usr/include/c++/8/bits/invoke.h:60
    

Comment by Steven Vannelli [ 22/Jun/20 ]

This helper will likely be removed. So we will close this ticket, assuming that is the case.

*Even if this helper goes away, we should still audit whether the FCV is initialized.

Comment by Siyuan Zhou [ 15/Jun/20 ]

Good point, daniel.gottlieb! I'm convinced we should add the invariant and I agree we should audit all of the copy and paste. I'm afraid we won't know whether it's possible to remove them until we do the auditing. However, at this time of the beginning of 4.6, we'd better do it now than later.

As we will be using FCV more often than before with quarterly releases, it may be worth considering a better design/implementation of FCV initialization. I found at least three bugs fixed by Execution, Query and Replication team due to the late initialization: SERVER-34523SERVER-46791 and SERVER-32630.

CC tess.avitabile and jason.chan since they are working on upgrade / downgrade for quarterly versions.

Comment by Daniel Gottlieb (Inactive) [ 26/May/20 ]

I understand the goal is to force developers to keep the initialization in mind, but people tend to copy and paste it. I don't think the caller of FCV should always worry about the initialization of FCV.

The goal is not to force developers to keep initialization in mind, and you're right the caller shouldn't care, which is why only a select few things should ever ask if FCV is initialized and the rest should crash the node if it's not. Those cases can then be better understood by looping in the appropriate stakeholders to figure out what the right solution is. I understand that copy/paste is a problem; this ticket should remove all of those uses as well. But there's always a risk the pattern is reintroduced; FCV and startup seem simple which empowers people to make mistakes. I'd love to fix that with code, but I find that's typically futile. I think we can only hope for education to work.

I feel the fundamental problem is that the order of FCV initialization and the code in development isn't clear or FCV initialization should be earlier

I agree FCV should be set earlier, but there are limits to how early it can be set. FCV has a fundamental design flaw; because it's in a collection, WT has to first startup and the catalog has to be loaded. There are a handful of usages at catalog startup that may not be right, but do legitimately have to consider what to do.

Alternatively we could have storage learn about namespaces and how to find the FCV document (pushing things even earlier, but not early enough for the logging changes with recoverable rollback in 4.0), but I don't think we're at the point of giving that serious consideration.

Comment by Siyuan Zhou [ 26/May/20 ]

getVersion() has the invariant, but I realized almost all the equality comparison of getVersion() also checks FCV is initialized, so we added this new way to avoid the duplication.

    auto isFullyUpgradedTo44 =
        (serverGlobalParams.featureCompatibility.isVersionInitialized() &&
         serverGlobalParams.featureCompatibility.getVersion() ==
             ServerGlobalParams::FeatureCompatibility::Version::kFullyUpgradedTo44);

I understand the goal is to force developers to keep the initialization in mind, but people tend to copy and paste it. I don't think the caller of FCV should always worry about the initialization of FCV.

It's a good idea to enforce this invariant, but I think we also need to address the reason why isVersionInitialized() is almost always used with version comparison. Perhaps most of them are unnecessary. I feel the fundament problem is that the order of FCV initialization and the code in development isn't clear or FCV initialization should be earlier as seen in BF-17214.

Generated at Thu Feb 08 05:15:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.