[SERVER-48044] FCV::isVersion should invariant if FCV is unset Created: 08/May/20 Updated: 22/Jul/20 Resolved: 22/Jul/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Daniel Gottlieb (Inactive) | Assignee: | Lingzhi Deng |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||
| Backport Requested: |
v4.4
|
||||||||||||||||||||||||
| Sprint: | Repl 2020-06-01, Repl 2020-06-15, Repl 2020-07-27 | ||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||
| Linked BF Score: | 6 | ||||||||||||||||||||||||
| Description |
|
A new way was introduced to read the FCV variable, but it did not carry over the check that the FCV value was initialized. |
| Comments |
| Comment by Lingzhi Deng [ 22/Jul/20 ] | ||||||||||||||||||||||||||||||||||
|
The reason we might want to always invariant that the FCV is set (i.e. not kUnsetDefault44Behavior) both from clean startup and from restart is because we want to fail loudly if we write (i.e. persist data) based on FCV before having a FCV. By invarianting isVersionInitialized(), we force a crash and force the engineer to think about whether their code is supposed to run before having a FCV. And even if yes, the engineer need to figure out what the default behavior should be. The default behavior is not necessarily 44 (last stable) behavior and should be determine case by case (e.g. creating a collection/index before initializing a node as replSet will follow newer behavior when FCV is kUnsetDefault44Behavior). In order to make sure callers of isVersionInitialized() are safe, we will probably need to evaluate callers of isVersionInitialized before each release, making sure they are not defaulting to a behavior that would cause unexpected behavior after upgrade/downgrade. This is tracked by After | ||||||||||||||||||||||||||||||||||
| Comment by Siyuan Zhou [ 08/Jul/20 ] | ||||||||||||||||||||||||||||||||||
|
Discussed with lingzhi.deng, jason.chan and tess.avitabile offline. I have a better understanding of Lingzhi's proposal now.
I thought FCV is only uninitialized during startup, but it's not the case. For example, it remains uninitialized when a node started with --replSet but hasn't received replSetInitiate or finished initial sync. Because of this subtle difference, Lingzhi is proposing to add a new variable to distinguish startup from uninitialized. The new variable is set during startup and will be checked as an invariant when comparing FCV's. As mentioned before, kUnsetDefault44Behavior should be treated the same as kFullyDowngradedTo44 in most cases. The new isFCVGreaterThanOrEqualTo() helper encourages developers to follow this pattern. | ||||||||||||||||||||||||||||||||||
| Comment by Lingzhi Deng [ 02/Jul/20 ] | ||||||||||||||||||||||||||||||||||
|
I dont think we change to kFullyDowngradedTo44 after loading from disk if the FCV has never been set before. isVersionInitialized() still return false in that case, which means FCV == kUnsetDefault44Behavior.
Option 2 is to add an invariant similar to invariant(loadedFCV) (instead of invariant(isVersionInitialized())) in FCV helpers to make sure we only do FCV comparisons after the FCV has finished loading. As I said, it is not a bug if we do FCV comparisons before FCV is ever initialized, but it is a bug if we do FCV comparisons before FCV is loaded from disk after restart. So option 2 helps to detect the latter. EDIT: By the way, we probably should stop using the term "uninitialized" to mean both default kUnsetDefault44Behavior and "not yet been loaded after restart" because they are fundamentally different. And we could have kUnsetDefault44Behavior even if we finish loading after restarts (e.g. Start a node with --replSet and restart it before running replSetInitiate). Having a kUnsetDefault44Behavior FCV value is not an issue, the real issue is reading the in-memory FCV value when it has "not yet been loaded after restart". | ||||||||||||||||||||||||||||||||||
| Comment by Siyuan Zhou [ 02/Jul/20 ] | ||||||||||||||||||||||||||||||||||
|
Nice write-up! I don't follow how different would option 2 be from the current behavior + adding the invariant. IIUC, the FCV changes to kFullyDowngradedTo44 after initialization. In the case of shutdownTask() mentioned above, do we have to add the check of initialization for it? | ||||||||||||||||||||||||||||||||||
| Comment by Jason Chan [ 01/Jul/20 ] | ||||||||||||||||||||||||||||||||||
|
Discussed this offline with lingzhi.deng, we agree that option 2 would be most ideal because it draws the clearest contract of only checking FCV when the FCV value has been loaded from disk. If there is a codepath that checks FCV before we have loaded it from disk, we will crash always, and it will be up to the developer to decide between the following:
| ||||||||||||||||||||||||||||||||||
| Comment by Lingzhi Deng [ 01/Jul/20 ] | ||||||||||||||||||||||||||||||||||
|
After thinking about this a bit more, I think ideally we should not need to worry about if FCV is initialized in FCV helpers. The reasons is that in case 1 (clean startup) I posted above, we need to allow general comparison against kUnsetDefaultXXBehavior (the smallest FCV constant) so that we can use FCV checks to choose default behaviors before FCV is initialized from a clean startup. In fact, two ( That said, I do understand that loading the FCV from disk after a restart makes things a bit trickier. And I believe we wanted an invariant at the first place because we wanted to catch bugs where we do FCV comparisons before FCV is loaded (Note: I believe it is not a bug if we do FCV comparisons before FCV is ever initialized, but it is a bug if we do FCV comparisons before FCV is loaded from disk after restart). But I could argue that in an ideal world, we should always load FCV as (one of) the very first thing during restart so we don't need to worry about this problem in subsystem that start up after. Currently, we load FCV in repairDatabasesAndCheckVersion(), so this would only be a problem to start up logic that's run before this line. So here are the options:
Either 2 or 3 sounds good to me. daniel.gottlieb siyuan.zhou jason.chan what do you think? | ||||||||||||||||||||||||||||||||||
| Comment by Siyuan Zhou [ 30/Jun/20 ] | ||||||||||||||||||||||||||||||||||
|
Discussed with lingzhi.deng offline. My understanding is that upgrade / downgrade is done in two-phase.
Because FCV is only used to enable a new behavior, it's desired to treat uninitialized FCV as the lowest FCV. In other words, both old and new behaviors are accepted, it's just the matter of choice. shutdownTask() and ReplSetHeartbeatArgsV1::addToBSON() are good examples.
lingzhi.deng, do you think it's preferable to add the invariant and check for uninitialized FCV explicit? | ||||||||||||||||||||||||||||||||||
| Comment by Lingzhi Deng [ 30/Jun/20 ] | ||||||||||||||||||||||||||||||||||
|
Agreed. But I think there are two parts of it. What's even more interesting is that, hypothetically, if there is a piece of code which is fine with defaulting to older behaviors in case 1 but would like to avoid flipping the behaviors in case 2, then it seems hard for the same logic to differentiate the two cases. Maybe instructing WT to load FCV earlier like Dan.G mentioned above is the ultimate solution for case 2, but until then, I dont know if it is even doable? | ||||||||||||||||||||||||||||||||||
| Comment by Siyuan Zhou [ 30/Jun/20 ] | ||||||||||||||||||||||||||||||||||
|
Great auditing! If it's the common pattern to allow uninitialized FCV, how could developers support cross-version behavior? If they have to rely on some other subtle assumptions, like timing, ordering or specific locking, I'm afraid it's error-prone and a major limitation of FCV. Perhaps it's fine to follow the old or new behavior blindly before FCV initialization, then we need to include it into our best practice and design the FCV interface for that. For a developer, it's hard to know whether their code could run before FCV initialization or not. It would be unfortunate if every developer has to know the implementation details of FCV. In reality, they'd probably wait until it fails in patch builds or more likely in build failures of mixed version testing to realize the problem. On a side note, if we need to change the interface of FCV checking, I'd prefer to leverage the type system, like adding an argument or defining a class, than smart naming. | ||||||||||||||||||||||||||||||||||
| Comment by Lingzhi Deng [ 30/Jun/20 ] | ||||||||||||||||||||||||||||||||||
|
Interestingly, currently all callers of FeatureCompatibility::isVersion() kind of "expect" an uninitialized FCV (except 1 which I believe is a bug and 4 on mongos). So I am not sure if there is any value fixing isVersion() at this point as having an uninitialized FCV seems to be a common case here (surprisingly).
| ||||||||||||||||||||||||||||||||||
| Comment by Lingzhi Deng [ 30/Jun/20 ] | ||||||||||||||||||||||||||||||||||
|
Yes, that would work too. The goal here is to make it very explicit that engineers should avoid using FCV checks this way unless they have reasons to. I think we can add that argument for isVersion() for now as an example that we might follow later when we add more helpers in this project. (Though we will need to skip invarianting FCV for | ||||||||||||||||||||||||||||||||||
| Comment by Jason Chan [ 30/Jun/20 ] | ||||||||||||||||||||||||||||||||||
|
lingzhi.deng Thanks for looking into this, I think adding a new helper makes sense. I believe our goal should be to make it more obvious to developers that we should only be checking isVersionInitialized for codepaths that expect to run when FCV is not yet initialized. We can replace the existing callers of isVersionInitialized with this new helper to prevent future copy and pasting. EDIT: Or instead of adding a "_EXPECT_UNINITIALIZED_FCV" version of each new helper, how about passing in an argument flag that denotes whether we should check that the version is initialized? | ||||||||||||||||||||||||||||||||||
| Comment by Lingzhi Deng [ 30/Jun/20 ] | ||||||||||||||||||||||||||||||||||
|
I am not sure if there is anything else that needs to be done as part of this ticket. And I am not sure if we can really get rid of this pattern completely even if we do what Dan.G recommended above to load the FCV earlier. In the two cases I posted above, we don't initialize the FCV on purpose because the replica set is not initialized yet. But we allow commands like serverStatus and isMaster to run. If some common code paths like these need to check FCV, they will certainly have to differentiate whether the FCV is initialized. Even if the isVersion() is going away, new helpers would likely still have similar issues if they are used in code paths that run before the FCV is initialized / loaded. Maybe we can have a special version of these helpers that have a "_EXPECT_UNINITIALIZED_FCV" suffix and hopefully we can minimize the use of them? jason.chan siyuan.zhou any thoughts? | ||||||||||||||||||||||||||||||||||
| Comment by Lingzhi Deng [ 30/Jun/20 ] | ||||||||||||||||||||||||||||||||||
|
Here are the current callers of isVersionInitialized() on the master branch as of 2d910ef unless we want to audit 4.4 (which I think it's probably not worth doing at this point).
| ||||||||||||||||||||||||||||||||||
| Comment by Steven Vannelli [ 22/Jun/20 ] | ||||||||||||||||||||||||||||||||||
|
This helper will likely be removed. So we will close this ticket, assuming that is the case. *Even if this helper goes away, we should still audit whether the FCV is initialized. | ||||||||||||||||||||||||||||||||||
| Comment by Siyuan Zhou [ 15/Jun/20 ] | ||||||||||||||||||||||||||||||||||
|
Good point, daniel.gottlieb! I'm convinced we should add the invariant and I agree we should audit all of the copy and paste. I'm afraid we won't know whether it's possible to remove them until we do the auditing. However, at this time of the beginning of 4.6, we'd better do it now than later. As we will be using FCV more often than before with quarterly releases, it may be worth considering a better design/implementation of FCV initialization. I found at least three bugs fixed by Execution, Query and Replication team due to the late initialization: CC tess.avitabile and jason.chan since they are working on upgrade / downgrade for quarterly versions. | ||||||||||||||||||||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 26/May/20 ] | ||||||||||||||||||||||||||||||||||
The goal is not to force developers to keep initialization in mind, and you're right the caller shouldn't care, which is why only a select few things should ever ask if FCV is initialized and the rest should crash the node if it's not. Those cases can then be better understood by looping in the appropriate stakeholders to figure out what the right solution is. I understand that copy/paste is a problem; this ticket should remove all of those uses as well. But there's always a risk the pattern is reintroduced; FCV and startup seem simple which empowers people to make mistakes. I'd love to fix that with code, but I find that's typically futile. I think we can only hope for education to work.
I agree FCV should be set earlier, but there are limits to how early it can be set. FCV has a fundamental design flaw; because it's in a collection, WT has to first startup and the catalog has to be loaded. There are a handful of usages at catalog startup that may not be right, but do legitimately have to consider what to do. Alternatively we could have storage learn about namespaces and how to find the FCV document (pushing things even earlier, but not early enough for the logging changes with recoverable rollback in 4.0), but I don't think we're at the point of giving that serious consideration. | ||||||||||||||||||||||||||||||||||
| Comment by Siyuan Zhou [ 26/May/20 ] | ||||||||||||||||||||||||||||||||||
|
getVersion() has the invariant, but I realized almost all the equality comparison of getVersion() also checks FCV is initialized, so we added this new way to avoid the duplication.
I understand the goal is to force developers to keep the initialization in mind, but people tend to copy and paste it. I don't think the caller of FCV should always worry about the initialization of FCV. It's a good idea to enforce this invariant, but I think we also need to address the reason why isVersionInitialized() is almost always used with version comparison. Perhaps most of them are unnecessary. I feel the fundament problem is that the order of FCV initialization and the code in development isn't clear or FCV initialization should be earlier as seen in BF-17214. |