[SERVER-47126] Clean dbpath when downgrading arbiters in multiversion tests Created: 26/Mar/20  Updated: 29/Oct/23  Resolved: 01/Apr/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.4.0-rc0

Type: Bug Priority: Major - P3
Reporter: Luke Chen Assignee: Alex Cameron (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File image-2020-03-26-14-47-08-170.png    
Issue Links:
Related
is related to SERVER-47219 Correct downgrade_after_rollback_via_... Closed
is related to WT-5926 __verify_txn_addr_cmp failure in mult... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Storage Engines 2020-04-06
Participants:
Linked BF Score: 47
Story Points: 5

 Description   

The multiversion tests on the mongo v4.4 waterfall page kept failing after the recent v4.2 WiredTiger drop (with WT-5892) and the merge of SERVER-47006. WT-5926 was created to address the WiredTiger error captured by do_upgrade_downgrade.js in the latest commit build.

There are 10+ other tests failing in the same task, which we should investigate.



 Comments   
Comment by Githook User [ 01/Apr/20 ]

Author:

{'name': 'Alex Cameron', 'email': 'alex.cameron@10gen.com', 'username': 'tetsuo-cpp'}

Message: SERVER-47126 Clean dbpath when downgrading arbiters in multiversion tests
Branch: master
https://github.com/mongodb/mongo/commit/d2a8b439fad6e6e926645c484b058b695a41f0a0

Comment by Githook User [ 01/Apr/20 ]

Author:

{'name': 'Alex Cameron', 'email': 'alex.cameron@10gen.com', 'username': 'tetsuo-cpp'}

Message: SERVER-47126 Clean dbpath when downgrading arbiters in multiversion tests
Branch: v4.4
https://github.com/mongodb/mongo/commit/869d19299f9b41ae02f3d71beb2d4462c9d628df

Comment by Daniel Gottlieb (Inactive) [ 01/Apr/20 ]

I've created SERVER-47219 for that test.

Comment by Alexander Gorrod [ 01/Apr/20 ]

Thanks for the update daniel.gottlieb - will you use this same ticket and the associated CR to make any changes? Or should we get the CR pushed and open a new ticket?

Comment by Daniel Gottlieb (Inactive) [ 31/Mar/20 ]

alex.cameron, tess.avitabile and I talked. We're going to try and fix the test to do a clean shutdown in the desired state of rollback via refetch.

Comment by Daniel Gottlieb (Inactive) [ 31/Mar/20 ]

Looking into it

Comment by Alex Cameron (Inactive) [ 31/Mar/20 ]

daniel.gottlieb
Can I leave downgrade_after_rollback_via_refetch with you?

Comment by Alex Cameron (Inactive) [ 27/Mar/20 ]

If I change the test to only use secondaries, the issue goes away, so it's definitely related to arbiters. But I'm missing a piece here...
I tried to fix 3_upgrade_replset.js with the test change below, which blows away the dbpath if we're downgrading (going from some other version to last-stable).

diff --git a/jstests/multiVersion/libs/multi_rs.js b/jstests/multiVersion/libs/multi_rs.js
index 1529353aca..f60ea4c3c2 100644
--- a/jstests/multiVersion/libs/multi_rs.js
+++ b/jstests/multiVersion/libs/multi_rs.js
@@ -10,11 +10,11 @@
 ReplSetTest.prototype.upgradeSet = function(options, user, pwd) {
     let primary = this.getPrimary();
 
-    this.upgradeSecondaries(primary, options, user, pwd);
-    this.upgradeArbiters(primary, options, user, pwd);
+    this.upgradeSecondaries(primary, Object.assign({}, options), user, pwd);
+    this.upgradeArbiters(primary, Object.assign({}, options), user, pwd);
 
     // Upgrade the primary after stepping down.
-    this.upgradePrimary(primary, options, user, pwd);
+    this.upgradePrimary(primary, Object.assign({}, options), user, pwd);
 };
 
 ReplSetTest.prototype.upgradeMembers = function(primary, members, options, user, pwd) {
@@ -38,6 +38,10 @@ ReplSetTest.prototype.upgradeSecondaries = function(primary, options, user, pwd)
 };
 
 ReplSetTest.prototype.upgradeArbiters = function(primary, options, user, pwd) {
+    // We don't support downgrading data files for Arbiters. We need to instead delete the dbpath.
+    if (options && options.binVersion == "last-stable") {
+        options["startClean"] = true;
+    }
     this.upgradeMembers(primary, this.getArbiters(), options, user, pwd);
 };

This doesn't seem to work, though. In fact, I don't even get to the point where this code is exercised: the arbiter appears to get restarted (and fail) before the call to upgradeArbiters. However, if I clear the dbpath when upgrading instead, it seems to work.

// Surprisingly, it's clearing the dbpath on the *upgrade* that makes the
// test pass, even though the arbiter limitation is on the downgrade side.
if (options && options.binVersion == "latest") {
    options["startClean"] = true;
}
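
For what it's worth, the Object.assign({}, options) copies in the diff above guard against an aliasing hazard: all three upgrade* calls share the same options object, so a mutation made for the arbiters would leak into the later upgradePrimary call. A minimal shell-runnable sketch (names are illustrative):

// Without per-call copies, mutating the shared options object leaks
// startClean into every subsequent upgrade* call:
const shared = {binVersion: "last-stable"};
const arbiterOpts = shared;                // aliased, not copied
arbiterOpts["startClean"] = true;          // also sets shared.startClean
assert.eq(shared.startClean, true);        // the primary would start clean too

// With a shallow copy per call, the mutation stays local:
const shared2 = {binVersion: "last-stable"};
const arbiterOpts2 = Object.assign({}, shared2);
arbiterOpts2["startClean"] = true;
assert.eq(shared2.startClean, undefined);  // shared object untouched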

Comment by Daniel Gottlieb (Inactive) [ 27/Mar/20 ]

I can double check, but the server code I deleted might have coped with it.

Comment by Eric Milkie [ 27/Mar/20 ]

I think I prefer the second option, delete the arbiter data files as part of downgrade; this is what we document in the downgrade instructions for users to do.
Why doesn't this test fail similarly in the 4.2-to-4.0 downgrade test on the 4.2 branch?

Comment by Alex Cameron (Inactive) [ 27/Mar/20 ]

milkie daniel.gottlieb
I believe I understand this problem now. I've realized that the MongoD instance that is failing to start up is not the same replica set member that is hitting the downgrade path.

Looking at the test, it's a PSA (primary-secondary-arbiter) configuration, and I found this block in wiredtiger_kv_engine.cpp:

    if (memberState.arbiter()) {
        // SERVER-35361: Arbiters will no longer downgrade their data files. To downgrade
        // binaries, the user must delete the dbpath. It's not particularly expensive for a
        // replica set to re-initialize an arbiter that comes online.
        return false;
    }

So the arbiter doesn't downgrade its data files (and the test doesn't blow away its dbpath). If I change the test to use a PSS configuration, it passes without a problem.

What are people's thoughts on this? There are a few possibilities:

  • Amend the test to use PSS since PSA doesn't support seamlessly downgrading like this without blowing away the dbpath.
  • Blow away the arbiter's dbpath as part of the test's downgrade process (see the sketch after this list).
  • Support arbiters downgrading their data files.
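
A minimal sketch of the second option (untested; the exact hook point in multi_rs.js would still need to be worked out):

// Arbiters never downgrade their data files (SERVER-35361), so restart
// them with a clean dbpath when moving to the last-stable binary.
ReplSetTest.prototype.upgradeArbiters = function(primary, options, user, pwd) {
    if (options && options.binVersion == "last-stable") {
        // Copy first so startClean doesn't leak to the other members.
        options = Object.assign({}, options, {startClean: true});
    }
    this.upgradeMembers(primary, this.getArbiters(), options, user, pwd);
};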

Apologies for the false alarm sue.loverso, there's nothing wrong with the log downgrade path.

Comment by Alex Cameron (Inactive) [ 26/Mar/20 ]

Manually going from 4.2 -> 4.4 -> 4.2 doesn't have any issues for me. I see it hitting the same downgrade path when shutting down 4.4.
As sue.loverso suggested, I'm going to turn up verbosity in 3_upgrade_replset.js test and try to preserve the log files after shutting down 4.4.

Comment by Susan LoVerso [ 26/Mar/20 ]

FYI, milkie says the test really does create its data under 4.2, upgrade to 4.4 and do a write or two, then downgrade back to 4.2.

Comment by Susan LoVerso [ 26/Mar/20 ]

The suggested verbose output should show what 4.2 (i.e. V4) log file it creates and which log files it archives/removes. The point of the downgrade, switching log files and forcing the checkpoint, is to make all "other" version log files removable.

alexander.gorrod mentioned that with durable history we now need a log record before the checkpoint. I don't know if that means something is disallowing archiving of the other log files. So the verbose output plus the copied directory should let us know what's going on.

Comment by Alex Cameron (Inactive) [ 26/Mar/20 ]

Sorry sue.loverso, my comment was missing some information.
milkie is correct, we appear to successfully downgrade. And then when we start back up with 4.2, it sees a log version that it can't handle.
Thanks for the tips, I'll continue looking today.

Comment by Susan LoVerso [ 26/Mar/20 ]

alex.cameron once we understand what MongoDB is doing differently than my WT-only testing, we should make a test that reproduces this. The CONFIG I've been using enables timestamps and a small cache, so the history store (HS) is in play.

Comment by Susan LoVerso [ 26/Mar/20 ]

alex.cameron I have a couple suggestions for debugging this through MongoDB:
First, turn on verbose output for logging immediately prior to the downgrade so that we can see what WiredTiger is doing. You need something like this (untested) diff:

diff --git a/src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp b/src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp
index 714e35a356..831b1c4c45 100644
--- a/src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp
+++ b/src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp
@@ -948,6 +948,7 @@ void WiredTigerKVEngine::cleanShutdown() {
                     1,
                     "Downgrade compatibility configuration: {fileVersion_getDowngradeString}",
                     "fileVersion_getDowngradeString"_attr = _fileVersion.getDowngradeString());
+        invariantWTOK(_conn->reconfigure(_conn, "verbose=(log)"));
         invariantWTOK(_conn->reconfigure(_conn, _fileVersion.getDowngradeString().c_str()));
     }
 

You need to add that as a separate line (as opposed to appending it to the downgrade string) because reconfigure parses the verbose options last. So if it were all part of one reconfigure call, the compatibility work would complete before verbose logging is enabled.

Second, modify <something> (the js test? the resmoke script?) to save a copy of the database directory before the startup that gives the log version error. Then we can look at the verbose output and compare it to the log files as they were after shutdown and maybe get more understanding about the issue.
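
Something like this is what I have in mind (untested; where exactly to hook it in is the open question, and node is assumed to be the connection for the member that fails to restart):

// Snapshot the data directory before the restart that hits the log
// version error, so the verbose output can be compared with the
// on-disk log files afterwards.
const backupPath = MongoRunner.dataPath + "dbpath_before_downgrade_restart";
resetDbpath(backupPath);               // ensure an empty destination
copyDbpath(node.dbpath, backupPath);   // shell helper that copies the dbpath tree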

Comment by Susan LoVerso [ 26/Mar/20 ]

Ok, I found the original error you're seeing in some output:

[js_test:3_upgrade_replset] 2020-03-26T14:45:42.178+0000 d20272| 2020-03-26T14:45:42.177+0000 E  STORAGE  [initandlisten] WiredTiger error (-31802) [1585233942:177917][13961:0x7fbcf2107bc0], connection: __log_open_verify, 926: unsupported WiredTiger file version: this build only supports versions up to 4, and the file is version 5: WT_ERROR: non-specific WiredTiger error

Looking closer, I see this in the log for WiredTiger open.

[js_test:3_upgrade_replset] 2020-03-26T14:45:41.505+0000 d20272| 2020-03-26T14:45:41.505+0000 I  STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=3234M,cache_overflow=(file_max=0M),session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress],compatibility=(release=3.2,require_max="3.3.0"),

Trying that myself it still works for me from the mongodb-4.2 tree:

mongodb-4.2 %  ./wt -h /tmp/RUNDIR -C "compatibility=(release=3.2,require_max=3.3.0),verbose=(log,recovery)" list -v
[1585242709:16980][15512:0x7fff8dc13380], wt, txn-recover: Recovering file:WiredTiger.wt with id 0 @ (39, 768384)
[1585242709:17235][15512:0x7fff8dc13380], wt, txn-recover: Recovering file:WiredTigerHS.wt with id 1 @ (39, 768384)
[1585242709:17264][15512:0x7fff8dc13380], wt, txn-recover: Recovering file:wt.wt with id 2 @ (39, 768384)
[1585242709:17339][15512:0x7fff8dc13380], wt, txn-recover: Set global recovery timestamp: (0, 0)
metadata:
access_pattern_hint=none,allocation_size=4KB,app_metadata=,assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=,cache_resident=false,checksum=uncompressed,collator=,columns=,dictionary=0,encryption=(keyid=,name=),format=btree,huffman_key=,huffman_value=,id=0,ignore_in_memory_cache_size=false,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=S,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=0,log=(enabled=true),memory_page_image_max=0,memory_page_max=5MB,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,prefix_compression_min=4,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,value_format=S,version=(major=1,minor=1),checkpoint=(WiredTigerCheckpoint.6=(addr="018581e4d74a769f8681e4424107b48781e4f27b2c96808080e26fc0cfc0",order=6,time=1585242709,size=8192,newest_durable_ts=0,oldest_start_ts=0,oldest_start_txn=0,newest_stop_ts=-1,newest_stop_txn=-11,write_gen=15)),checkpoint_backup_info=,checkpoint_lsn=(4294967295,2147483647)
...

Comment by Susan LoVerso [ 26/Mar/20 ]

There isn't any information about what errors you're actually seeing in here. I will describe here in detail what I did and what I'm seeing (via WiredTiger only mechanisms). Essentially, a simple printlog and metadata list test works.

I have two trees:

  1. Current checkout of WiredTiger mongodb-4.2 branch, changeset 4c2230d9.
  2. Current checkout of WiredTiger develop, changeset b2e25e5.

The mongodb-4.2 branch is unchanged. In the develop branch, I only modified test/format/wts.c to issue a reconfigure call just before connection close:

diff --git a/test/format/wts.c b/test/format/wts.c
index a01210b88..901b8bbc6 100644
--- a/test/format/wts.c
+++ b/test/format/wts.c
@@ -451,6 +451,7 @@ wts_close(void)
     const char *config;
 
     conn = g.wts_conn;
+    testutil_check(conn->reconfigure(conn, "compatibility=(release=3.3)"));
 
     config = g.c_leak_memory ? "leak_memory" : NULL;
 

Build both trees.

I then run test/format with a select CONFIG, but here's one generated CONFIG that yields a useful directory (a superset of mine with other values randomly generated):

############################################
#  RUN PARAMETERS
############################################
abort=0
alter=0
assert_commit_timestamp=0
assert_read_timestamp=0
auto_throttle=1
backups=1
backup_incremental=log
bitcnt=6
bloom=1
bloom_bit_count=15
bloom_hash_count=13
bloom_oldest=0
cache=50
cache_minimum=0
checkpoints=on
checkpoint_log_size=37
checkpoint_wait=14
checksum=uncompressed
chunk_size=4
compaction=0
compression=none
data_extend=0
data_source=table
delete_pct=23
dictionary=1
direct_io=0
encryption=none
evict_max=2
file_type=row-store
firstfit=0
huffman_key=0
huffman_value=0
independent_thread_rng=1
in_memory=0
insert_pct=0
internal_key_truncation=1
internal_page_max=10
isolation=snapshot
key_gap=1
key_max=118
key_min=21
leaf_page_max=16
leak_memory=0
logging=1
logging_archive=0
logging_compression=none
logging_file_max=1024
logging_prealloc=1
lsm_worker_threads=3
major_timeout=0
memory_page_max=8
merge_max=5
mmap=1
mmap_all=0
modify_pct=69
ops=0
prefix_compression=1
prefix_compression_min=1
prepare=0
quiet=1
random_cursor=0
read_pct=0
rebalance=0
repeat_data_pct=77
reverse=0
rows=230271
runs=1
salvage=0
split_pct=96
statistics=0
statistics_server=0
threads=15
timer=1
timing_stress_aggressive_sweep=0
timing_stress_checkpoint=0
timing_stress_hs_sweep=0
timing_stress_split_1=0
timing_stress_split_2=0
timing_stress_split_3=0
timing_stress_split_4=0
timing_stress_split_5=0
timing_stress_split_6=0
timing_stress_split_7=0
timing_stress_split_8=0
transaction_timestamps=1
transaction-frequency=100
truncate=1
value_max=1003
value_min=6
verify=0
wiredtiger_config=config_base=false,verbose=(log)
write_pct=8
############################################

Here's the verbose logging output from the first line when we do the reconfigure call until the test completes:

0x110b1e800:[1585239796:76782][15029:0x7fff8dc13380], t, WT_CONNECTION.reconfigure: log_remove: remove log ./WiredTigerPreplog.0000000061
0x110b1e800:[1585239796:77108][15029:0x7fff8dc13380], t, WT_CONNECTION.reconfigure: log_remove: remove log ./WiredTigerPreplog.0000000060
0x110b1fb60:[1585239796:77453][15029:0x70000d9d4000], t, log-server: opening log ./WiredTigerTmplog.0000000063
0x110b1fb60:[1585239796:97169][15029:0x70000d9d4000], t, log-server: log_allocfile: rename ./WiredTigerTmplog.0000000063 to ./WiredTigerLog.0000000061
0x110b1fb60:[1585239796:100801][15029:0x70000d9d4000], t, log-server: opening log ./WiredTigerLog.0000000061
0x110b1e800:[1585239796:101151][15029:0x7fff8dc13380], t, WT_CONNECTION.reconfigure: log_printf: COMPATIBILITY: Version now 4
0x110b1fb60:[1585239796:101477][15029:0x70000d9d4000], t, log-server: Missed 1. Now pre-allocating up to 2
0x110b1fb60:[1585239796:101518][15029:0x70000d9d4000], t, log-server: opening log ./WiredTigerTmplog.0000000064
0x110b1fb60:[1585239796:123450][15029:0x70000d9d4000], t, log-server: log_allocfile: rename ./WiredTigerTmplog.0000000064 to ./WiredTigerPreplog.0000000062
0x110b1fb60:[1585239796:123627][15029:0x70000d9d4000], t, log-server: opening log ./WiredTigerTmplog.0000000065
0x110b1fb60:[1585239796:145577][15029:0x70000d9d4000], t, log-server: log_allocfile: rename ./WiredTigerTmplog.0000000065 to ./WiredTigerPreplog.0000000063
0x110b20ec0:[1585239796:206326][15029:0x7fff8dc13380], t, WT_SESSION.checkpoint: log_force_sync: sync directory . to LSN 61/384
0x110b20ec0:[1585239796:214491][15029:0x7fff8dc13380], t, WT_SESSION.checkpoint: opening log ./WiredTigerLog.0000000061
0x110b20ec0:[1585239796:214538][15029:0x7fff8dc13380], t, WT_SESSION.checkpoint: log_force_sync: sync ./WiredTigerLog.0000000061 to LSN 61/384
0x110b20ec0:[1585239796:323461][15029:0x7fff8dc13380], t, compatibility-reconfig: log_truncate_files: Archive once up to 0
0x110b20ec0:[1585239796:323498][15029:0x7fff8dc13380], t, compatibility-reconfig: log_archive: archive to log number 61
0x110b20ec0:[1585239796:323901][15029:0x7fff8dc13380], t, compatibility-reconfig: log_remove: remove log ./WiredTigerLog.0000000038
... [Other log_remove messages deleted for brevity.]
0x110b20ec0:[1585239796:337128][15029:0x7fff8dc13380], t, compatibility-reconfig: log_remove: remove log ./WiredTigerLog.0000000043
0x110b209e8:[1585239796:337943][15029:0x7fff8dc13380], t, close_ckpt: opening log ./WiredTigerLog.0000000061
0x110b209e8:[1585239796:337956][15029:0x7fff8dc13380], t, close_ckpt: log_force_sync: sync ./WiredTigerLog.0000000061 to LSN 61/3584
0x110b1e800:[1585239796:393911][15029:0x7fff8dc13380], t, WT_CONNECTION.close: closing log ./WiredTigerLog.0000000061
0x110b1e800:[1585239796:402255][15029:0x7fff8dc13380], t, WT_CONNECTION.close: closing log directory .
   1: table, row-store (60 seconds)
t: successful run completed

The database directory looks like this:

develop % ls RUNDIR
CONFIG				WiredTiger.wt			WiredTigerPreplog.0000000063
WiredTiger			WiredTigerHS.wt			rand
WiredTiger.lock			WiredTigerLog.0000000061	wt.wt
WiredTiger.turtle		WiredTigerPreplog.0000000062

The log files all look correct to me. The only remaining log file is log version 4, as expected:

develop % hexdump -C RUNDIR/WiredTigerLog.0000000061
00000000  80 00 00 00 7a fa 3c 46  00 00 00 00 00 00 00 00  |....z.<F........|
00000010  64 10 10 00 04 00 00 00  00 00 10 00 00 00 00 00  |d...............| <--- The 04 here is the version.
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000080  80 00 00 00 1d 87 b4 d8  00 00 00 00 00 00 00 00  |................|
00000090  84 88 87 b8 e3 05 ee c0  00 00 00 00 00 00 00 00  |................|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000100  80 00 00 00 c9 54 d8 8c  00 00 00 00 00 00 00 00  |.....T..........|
00000110  83 43 4f 4d 50 41 54 49  42 49 4c 49 54 59 3a 20  |.COMPATIBILITY: |
00000120  56 65 72 73 69 6f 6e 20  6e 6f 77 20 34 00 00 00  |Version now 4...|
00000130  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*

A hexdump of the two WiredTigerPreplog files also shows that both have version 4 in the header.

Now, run the wt command from the mongodb-4.2 tree I built:

mongodb-4.2 % ./wt -h /tmp/RUNDIR printlog
[
  { "lsn" : [61,128],
    "hdr_flags" : "",
    "rec_len" : 128,
    "mem_len" : 128,
    "type" : "system",
    "ops": [
      { "optype": "prev_lsn",
        "prev_lsn": [60, 544256]
      }
    ]
  },
  { "lsn" : [61,256],
    "hdr_flags" : "",
    "rec_len" : 128,
    "mem_len" : 128,
    "type" : "message",
    "message" : "COMPATIBILITY: Version now 4"
  },
  { "lsn" : [61,384],
    "hdr_flags" : "",
    "rec_len" : 128,
    "mem_len" : 128,
    "type" : "system",
    "ops": [
      { "optype": "checkpoint_start",
      }
    ]
  },
...

And listing the metadata also works:

mongodb-4.2 % ./wt -h /tmp/RUNDIR list -v
metadata:
access_pattern_hint=none,allocation_size=4KB,app_metadata=,assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=,cache_resident=false,checksum=uncompressed,collator=,columns=,dictionary=0,encryption=(keyid=,name=),format=btree,huffman_key=,huffman_value=,id=0,ignore_in_memory_cache_size=false,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=S,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=0,log=(enabled=true),memory_page_image_max=0,memory_page_max=5MB,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,prefix_compression_min=4,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,value_format=S,version=(major=1,minor=1),checkpoint=(mine.2=(addr="018581e4f40fd00a8b81e4a5e727028881e4b5a9f07e808080e27fc0cfc0",order=2,time=1585239766,size=8192,newest_durable_ts=0,oldest_start_ts=0,oldest_start_txn=2,newest_stop_ts=-1,newest_stop_txn=-11,write_gen=4),WiredTigerCheckpoint.12=(addr="018081e4fe1f53728181e425dc3ab28381e49d7ef4d98281e46a90d4c0e2cfc0cfc0",order=12,time=1585240078,size=8192,newest_durable_ts=0,oldest_start_ts=0,oldest_start_txn=0,newest_stop_ts=-1,newest_stop_txn=-11,write_gen=30)),checkpoint_backup_info=,checkpoint_lsn=(4294967295,2147483647)
colgroup:wt
app_metadata=,collator=,columns=,source="file:wt.wt",type=file
file:WiredTigerLAS.wt
access_pattern_hint=none,allocation_size=4KB,app_metadata=,assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=snappy,cache_resident=false,checkpoint=,checkpoint_backup_info=,checkpoint_lsn=,checksum=uncompressed,collator=,columns=,dictionary=0,encryption=(keyid=,name=),format=btree,huffman_key=,huffman_value=,id=4,ignore_in_memory_cache_size=false,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=QIQu,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=64MB,log=(enabled=true),memory_page_image_max=0,memory_page_max=5MB,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=true,prefix_compression_min=4,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,value_format=QQQBBu,version=(major=1,minor=1)
file:wt.wt
access_pattern_hint=none,allocation_size=512,app_metadata=,assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=,cache_resident=false,checksum="uncompressed",collator=,columns=,dictionary=203,encryption=(keyid=,name=),format=btree,huffman_key=,huffman_value=,id=2,ignore_in_memory_cache_size=false,internal_item_max=0,internal_key_max=24,internal_key_truncate=true,internal_page_max=1024,key_format=u,key_gap=1,leaf_item_max=0,leaf_key_max=1438,leaf_page_max=65536,leaf_value_max=0,log=(enabled=true),memory_page_image_max=0,memory_page_max=8388608,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,prefix_compression_min=1,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=96,value_format=u,version=(major=1,minor=1),checkpoint=(mine.2=(addr="01be82e49300b726c01284e45ac68481c00888e4a20f7def808080e4033efbc0e401a259c0",order=3,time=1585239766,size=27426304,start_durable_ts=159538,oldest_start_ts=0,oldest_start_txn=0,stop_durable_ts=159532,newest_stop_ts=-1,newest_stop_txn=-11,write_gen=125061),WiredTigerCheckpoint.5=(addr="01c56382e40310b4a7c56584e4b657a835c56d8ee4d84f2cf4c56984e468f8f01ce4042713c0e4017a4bc0",order=5,time=1585239796,size=24801280,start_durable_ts=336282,oldest_start_ts=0,oldest_start_txn=0,stop_durable_ts=336184,newest_stop_ts=-1,newest_stop_txn=-11,write_gen=240975)),checkpoint_backup_info=,checkpoint_lsn=(61,384)
table:wt
app_metadata=,colgroups=,collator=,columns=,key_format=u,value_format=u

So it does appear the basic mechanism is working.

Comment by Eric Milkie [ 26/Mar/20 ]

I think it's the latter: WT does not do all the right things, yet isn't returning an error.

Comment by Susan LoVerso [ 26/Mar/20 ]

> WiredTiger definitely realises that we're in a downgrade situation and attempts to downgrade the log file versions (the version numbers it thinks it is downgrading to and from look correct). I'm assuming that something in that logging machinery is interacting badly with the new durable history changes and we're not doing this successfully.

So are you saying that the reconfigure is returning an error? Or that WiredTiger does not do (all) the right things yet still returns success?

I will take a look at this today. It seems like it should be easy to reproduce with just WiredTiger calls/tests and then trying to start up on an earlier branch. I will let you know what I find.

Comment by Alex Cameron (Inactive) [ 26/Mar/20 ]

vamsi.krishna alexander.gorrod daniel.gottlieb
Just checking in where I'm at with this.

The scenario that is hitting most of the issues here is downgrading from 4.4 to 4.2. I can confirm that MongoD 4.4 reconfigures to compatibility_version=3.3 when shutting down with 4.2 FCV as expected. However, when we restart with MongoD 4.2, WiredTiger detects log files with a version of 5 (which corresponds to 4.4's compatibility_version=10.0) and errors out.

I've mostly been testing with:

buildscripts/resmoke.py --suites=multiversion --storageEngine=wiredTiger --jobs=4 --repeat=1 --shuffleMode=on --storageEngineCacheSizeGB=1 jstests/multiVersion/3_upgrade_replset.js

I've added some logging to trace what is happening when shutting down with 4.2 FCV. The reconfigure call appears to do the right thing: we realise that we're downgrading log versions, go all the way down to __logmgr_force_archive, and appear to force a checkpoint, write a new log file with version number 4, and blow away any old log files. So WiredTiger definitely realises that we're in a downgrade situation and attempts to downgrade the log file versions (the version numbers it thinks it is downgrading to and from look correct). I'm assuming that something in that logging machinery is interacting badly with the new durable history changes and we're not doing this successfully.

I'm going to continue looking at this; however, I don't anticipate that I'll get to the bottom of it tonight, since it will involve learning more about the logging subsystem. sue.loverso, if you get the chance, could you please take a look at this? I suspect that you'll have a much better idea of what is going on.

I also tried just running a MongoD instance manually, but I didn't see any issues.
I started 4.4 and executed:

db.adminCommand({setFeatureCompatibilityVersion: "4.2"})

Shut it down, then successfully started up 4.2 with the same dbpath. I also hacked wt printlog to print the log version number, and it confirms that the version is 4 after shutting down.

./wt -h ~/work/mongo/data/db -C "log=(compressor=snappy)" printlog | less
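
In jstest form, that manual check amounts to something like this (sketch only; dbpath is assumed to point at an existing 4.4 data directory):

// Start latest (4.4) on the existing dbpath, downgrade FCV, shut down
// cleanly, then confirm last-stable (4.2) starts on the same files.
let conn = MongoRunner.runMongod({binVersion: "latest", dbpath: dbpath, noCleanData: true});
assert.commandWorked(
    conn.getDB("admin").runCommand({setFeatureCompatibilityVersion: "4.2"}));
MongoRunner.stopMongod(conn);

conn = MongoRunner.runMongod({binVersion: "last-stable", dbpath: dbpath, noCleanData: true});
assert.neq(null, conn, "4.2 failed to start on the downgraded dbpath");
MongoRunner.stopMongod(conn);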
