[SERVER-50139] Skip level upgrade with --repair leaves repair in incomplete state Created: 06/Aug/20  Updated: 29/Oct/23  Resolved: 06/Aug/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Bug Priority: Major - P3
Reporter: Lingzhi Deng Assignee: Lingzhi Deng
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-48050 FCV should be initialized before atte... Closed
related to SERVER-49066 Extend generic targeted multiversion ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

diff --git a/jstests/multiVersion/skip_level_upgrade.js b/jstests/multiVersion/skip_level_upgrade.js
index d0ea00222a..0e2f86f4c0 100644
--- a/jstests/multiVersion/skip_level_upgrade.js
+++ b/jstests/multiVersion/skip_level_upgrade.js
@@ -34,7 +34,8 @@ const versions = [
     {binVersion: '3.2', testCollection: 'three_two'},
     {binVersion: '3.4', testCollection: 'three_four'},
     {binVersion: '3.6', testCollection: 'three_six'},
-    {binVersion: '4.0', testCollection: 'four_zero'}
+    {binVersion: '4.0', testCollection: 'four_zero'},
+    {binVersion: '4.2', testCollection: 'four_two'}
 ];

Sprint: Repl 2020-08-10
Participants:

 Description   

After SERVER-48050, a node started with --repair loads the FCV document before marking the repair as done (i.e., before removing the _repair_incomplete file). So if this is a skip-level upgrade to the latest binary, the FCV load fails (i.e., throws) with upgrade errors, leaving the _repair_incomplete file in the data directory. This prevents the node from restarting with the original binary, and it fails the skip_level_upgrade.js test (with the diff in Steps To Reproduce). Should the expected behavior be to no-op the --repair on a skip-level upgrade?

Unfortunately, that test currently lacks coverage for skip-level upgrades from 4.2. This ticket should add that coverage back after fixing the issue.



 Comments   
Comment by Githook User [ 06/Aug/20 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-50139: Abort repair on FCV initialization errors
Branch: master
https://github.com/mongodb/mongo/commit/7dde791c6f42b06886de41a379430ef6ef0ffa25

Comment by Lingzhi Deng [ 06/Aug/20 ]

Thanks, I will assign this to myself then since I have already written the code above. I will open a CR later.

Comment by Louis Williams [ 06/Aug/20 ]

lingzhi.deng, thanks for filing this. That seems like a reasonable solution.

Comment by Lingzhi Deng [ 06/Aug/20 ]

I tried the following fix and it seems to work:

diff --git a/src/mongo/db/startup_recovery.cpp b/src/mongo/db/startup_recovery.cpp
index b2afe36cb1..08c69d42e6 100644
--- a/src/mongo/db/startup_recovery.cpp
+++ b/src/mongo/db/startup_recovery.cpp
@@ -441,6 +441,10 @@ void startupRepair(OperationContext* opCtx, StorageEngine* storageEngine) {
     // FCV-dependent features are rebuilt properly. Note that we don't try to prevent
     // repairDatabase from repairing this collection again, because it only consists of one
     // document.
+    // If we fail to load the FCV document due to upgrade problems, we need to abort the repair in
+    // order to allow downgrading to older binary versions.
+    auto abortRepairOnFCVErrors = makeGuard(
+        [&] { StorageRepairObserver::get(opCtx->getServiceContext())->onRepairDone(opCtx); });
     if (auto fcvColl = CollectionCatalog::get(opCtx).lookupCollectionByNamespace(
             opCtx, NamespaceString::kServerConfigurationNamespace)) {
         auto databaseHolder = DatabaseHolder::get(opCtx);
@@ -451,6 +455,7 @@ void startupRepair(OperationContext* opCtx, StorageEngine* storageEngine) {
     }
     uassertStatusOK(restoreMissingFeatureCompatibilityVersionDocument(opCtx));
     FeatureCompatibilityVersion::initializeForStartup(opCtx);
+    abortRepairOnFCVErrors.dismiss();
 
     // The local database should be repaired before any other replicated collections so we know
     // whether not to rebuild unfinished two-phase index builds if this is a replica set node

louis.williams Do you think this is a reasonable fix? Or do we expect to resume the --repair on the original binary? Thanks.
