This appears to be the cause of the problem reported in HELP-37618. If there is a WiredTiger.backup file in the WT home directory, the WiredTiger.wt file is removed and WiredTiger.backup is used to repopulate it. However, WiredTiger.wt is not actually written to until after WiredTiger.backup is removed. If there is a crash after WiredTiger.backup is removed but before WiredTiger.wt is written and fsynced, the next restart will start with an empty metadata file, thus losing track of all existing data.
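To make the ordering requirement concrete, here is a minimal sketch of a crash-safe sequence: flush the rebuilt metadata file to stable storage first, and only then remove the backup file. This is not WiredTiger's code. The function name `restore_metadata_from_backup` and the helper `_fsync_dir` are hypothetical, and the plain byte copy is only a stand-in for the real repopulation step. It assumes a POSIX system where a directory can be opened and fsynced.

```python
import os


def _fsync_dir(path):
    """Flush directory entries (file creations/removals) to stable storage."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def restore_metadata_from_backup(home,
                                 backup_name="WiredTiger.backup",
                                 meta_name="WiredTiger.wt"):
    """Hypothetical helper: rebuild the metadata file, then drop the backup.

    Returns False if no backup file is present, True otherwise.
    """
    backup_path = os.path.join(home, backup_name)
    meta_path = os.path.join(home, meta_name)
    if not os.path.exists(backup_path):
        return False

    with open(backup_path, "rb") as src:
        contents = src.read()

    # Step 1: rewrite the metadata file and force its data to disk.
    # (Assumption: a byte copy stands in for the real repopulation logic.)
    with open(meta_path, "wb") as dst:
        dst.write(contents)
        dst.flush()
        os.fsync(dst.fileno())
    # Make the new file's directory entry durable as well.
    _fsync_dir(home)

    # Step 2: only now is it safe to remove the backup file. A crash before
    # this point leaves the backup in place, so startup can simply retry.
    os.remove(backup_path)
    _fsync_dir(home)
    return True
```

With this ordering, a crash at any point leaves at least one of the two files intact on disk, which is the property the current startup sequence lacks.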
The test case I have proves a weaker condition: it doesn't actually create a backup, but rather opens a backup cursor, does a directory copy of the WT home, and starts a new connection on the copied directory. That directory should look the same as a backup directory, at least as far as the startup process goes. The relevant strace during the startup is here:
I'll attach the test program and entire strace output below.
In addition to fixing the bug (we're suggesting moving the removal of WiredTiger.backup to later in the process), I think there should be some ad hoc testing to make sure it's doing the right thing. First, we should generate the strace as above and verify that the writes to WiredTiger.wt get to disk before the backup file is removed. Second, we should run the startup from backup in the debugger, breaking and/or single-stepping at various points up until the WiredTiger.wt is removed. At each of these breakpoints, we should copy the WiredTiger directory to a unique saved directory, so that we end up with a set of WT directories. For each one, we should start up WT and make sure we didn't lose any data. Within the scope of this ticket, I think this kind of testing is the best we can hope for. Writing a good automated test for this is more involved and will be done in WT-9932.
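The strace check could be partly mechanized. The following is a rough sketch, not an existing tool: it scans strace output lines and reports whether WiredTiger.backup was unlinked only after a write to WiredTiger.wt had been followed by a successful fsync or fdatasync. The function name `backup_removed_after_sync` is hypothetical, and the regular expressions assume the common strace line layout for `open`/`openat`, `write`/`pwrite64`, `fsync`/`fdatasync`, `close`, and `unlink`/`unlinkat`. Other I/O paths (for example `writev` or mmap) are not handled.

```python
import os
import re

# Patterns for the assumed strace line layout, e.g.
#   openat(AT_FDCWD, "/path/WiredTiger.wt", O_RDWR|O_CREAT, 0644) = 5
OPEN_RE = re.compile(r'open(?:at)?\([^"]*"([^"]*)"[^)]*\)\s*=\s*(\d+)')
WRITE_RE = re.compile(r'(?:pwrite64|pwrite|write)\((\d+),')
SYNC_RE = re.compile(r'(?:fsync|fdatasync)\((\d+)\)\s*=\s*0')
CLOSE_RE = re.compile(r'close\((\d+)\)\s*=\s*0')
UNLINK_RE = re.compile(r'unlink(?:at)?\([^"]*"([^"]*)"[^)]*\)\s*=\s*0')


def backup_removed_after_sync(lines,
                              meta="WiredTiger.wt",
                              backup="WiredTiger.backup"):
    """Return True if the backup was unlinked after the metadata file was
    written and synced, False if it was unlinked earlier, and None if the
    trace never removes the backup at all."""
    meta_fds = set()   # descriptors currently open on the metadata file
    written = False    # has the metadata file been written at all?
    synced = False     # has the latest write been followed by a sync?
    for line in lines:
        m = OPEN_RE.search(line)
        if m:
            if os.path.basename(m.group(1)) == meta:
                meta_fds.add(int(m.group(2)))
            continue
        m = WRITE_RE.search(line)
        if m:
            if int(m.group(1)) in meta_fds:
                written, synced = True, False
            continue
        m = SYNC_RE.search(line)
        if m:
            if int(m.group(1)) in meta_fds and written:
                synced = True
            continue
        m = CLOSE_RE.search(line)
        if m:
            meta_fds.discard(int(m.group(1)))
            continue
        m = UNLINK_RE.search(line)
        if m and os.path.basename(m.group(1)) == backup:
            return written and synced
    return None
```

It could be fed the lines of a captured trace file, for instance `backup_removed_after_sync(open("startup.strace"))`, where `startup.strace` is a placeholder name. A result of `False` would correspond to the ordering described in this ticket.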