Type: Bug
Resolution: Fixed
Priority: Blocker - P1
Fix Version/s: WT11.2.0, WT11.1.0, 6.2.0-rc0, 6.0.3, 6.1.1, 5.0.15, 4.4.20
Affects Version/s: None
Component/s: None
Labels: None
Sprint: Storage Engines - 2022-10-03
Story Points: 5
Case:
Backport Requested: v6.1, v6.0, v5.0, v4.4, v4.2

This appears to be the cause of the problem reported in HELP-37618. If there is a WiredTiger.backup file in the WT home directory, the WiredTiger.wt file is removed and WiredTiger.backup is used to repopulate WiredTiger.wt. However, WiredTiger.wt is not actually written to until after WiredTiger.backup is removed. If there is a crash after WiredTiger.backup is removed and before WiredTiger.wt is written, flushed, and fsync-ed, the next restart will start with an empty metadata file, thus losing track of all existing data.
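To make the ordering problem concrete, here is a minimal Python sketch of the crash-safe sequence: write the new file, sync it, and only then remove the backup. This is not WiredTiger's code. The helper names `repopulate_then_remove_backup` and `_fsync_dir` are hypothetical, the "import" is simplified to a plain byte copy, and the directory fsync assumes a POSIX system.

```python
import os


def repopulate_then_remove_backup(home,
                                  backup_name="WiredTiger.backup",
                                  target_name="WiredTiger.wt"):
    """Hypothetical helper: make the target durable before dropping the backup."""
    backup_path = os.path.join(home, backup_name)
    target_path = os.path.join(home, target_name)

    # Stand-in for the real metadata import: copy the backup bytes verbatim.
    with open(backup_path, "rb") as src:
        payload = src.read()

    # 1. Write the target and force its contents to stable storage.
    with open(target_path, "wb") as dst:
        dst.write(payload)
        dst.flush()
        os.fsync(dst.fileno())

    # 2. Sync the directory so the new directory entry survives a crash.
    _fsync_dir(home)

    # 3. Only now is it safe to remove the backup.
    os.remove(backup_path)
    _fsync_dir(home)


def _fsync_dir(path):
    """Flush directory metadata (assumes POSIX; not supported on Windows)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

With this ordering, a crash at any point leaves either the backup file or a fully synced target on disk, so a restart always has something to recover from.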

The test case I have proves a weaker condition, as it doesn't actually create a backup. Instead, it opens a backup cursor, does a directory copy of the WT home, and starts a new connection on the copied directory. That directory "should" look the same as a backup directory, at least as far as the startup process goes. The relevant strace during the startup is here:

 $ grep -n '.>>>' strace_open.txt
28:>>>> stat("COPYDIR/WiredTiger.backup", {st_mode=S_IFREG|0664, st_size=89532, ...}) = 0
32:>>>> stat("COPYDIR/WiredTiger.wt", {st_mode=S_IFREG|0664, st_size=229376, ...}) = 0
33:>>>> unlink("COPYDIR/WiredTiger.wt")         = 0
34:>>>> stat("COPYDIR/WiredTiger.turtle", {st_mode=S_IFREG|0664, st_size=1485, ...}) = 0
35:>>>> unlink("COPYDIR/WiredTiger.turtle")     = 0
36:>>>>> openat(AT_FDCWD, "COPYDIR/WiredTiger.wt", O_RDWR|O_CREAT|O_EXCL|O_NOATIME|O_CLOEXEC, 0666) = 8
40:>>>> pwrite64(8, "A\330\1\0\1\0\0\0\330\10#\267\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
41:>>>> fdatasync(8)                            = 0
42:>>>> close(8)                                = 0
48:>>>> openat(AT_FDCWD, "COPYDIR/WiredTiger.wt", O_RDWR|O_NOATIME|O_CLOEXEC) = 8
49:>>>> fstat(8, {st_mode=S_IFREG|0664, st_size=4096, ...}) = 0
50:>>>> pread64(8, "A\330\1\0\1\0\0\0\330\10#\267\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
52:>>>> fstat(8, {st_mode=S_IFREG|0664, st_size=4096, ...}) = 0
53:>>>> ftruncate(8, 4096)                      = 0
57:>>>> stat("COPYDIR/WiredTiger.backup", {st_mode=S_IFREG|0664, st_size=89532, ...}) = 0
58:>>>> openat(AT_FDCWD, "COPYDIR/WiredTiger.backup", O_RDWR|O_CLOEXEC) = 9
59:>>>> fstat(9, {st_mode=S_IFREG|0664, st_size=89532, ...}) = 0
60:>>>> pread64(9, "colgroup:test_backup.3\napp_metad"..., 8192, 0) = 8192
384:>>>> stat("COPYDIR/WiredTiger.backup", {st_mode=S_IFREG|0664, st_size=89532, ...}) = 0
385:>>>> unlink("COPYDIR/WiredTiger.backup")     = 0
530:>>>> pwrite64(8, "\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\rl\0\0Z\0\0\0\7\4\0\1\0p\0\0"..., 28672, 4096) = 28672
531:>>>> pwrite64(8, "\0\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\372o\0\0004\0\0\0\7\4\0\1\0p\0\0"..., 28672, 32768) = 28672
532:>>>> pwrite64(8, "\0\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\341?\0\0:\0\0\0\7\4\0\1\0@\0\0"..., 16384, 61440) = 16384
533:>>>> pwrite64(8, "\0\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0\10F\0\0.\0\0\0\7\4\0\1\0P\0\0"..., 20480, 77824) = 20480
535:>>>> pwrite64(8, "\0\0\0\0\0\0\0\0\6\0\0\0\0\0\0\0\246\0\0\0\10\0\0\0\6 \0\1\0\20\0\0"..., 4096, 98304) = 4096
536:>>>> pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0004\0\0\0\f\0\0\0\1\0\0\1\0\20\0\0"..., 4096, 102400) = 4096
537:>>>> pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0.\0\0\0\6\0\0\0\1\0\0\1\0\20\0\0"..., 4096, 106496) = 4096
538:>>>> fdatasync(8)                            = 0

I'll attach the test program and entire strace output below.
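The ordering visible in the excerpt (the unlink of WiredTiger.backup, followed only later by the pwrite64 calls to the WiredTiger.wt descriptor) can be checked mechanically. The following is a rough sketch, assuming strace output in the format shown above. The function name `writes_after_backup_unlink` is hypothetical, and the check is a heuristic for startup traces only, since writes to WiredTiger.wt later in normal operation are not necessarily a bug.

```python
import re

# Patterns for the strace line shapes that appear in the excerpt.
OPEN_RE = re.compile(r'openat\(AT_FDCWD, "([^"]+)"[^)]*\)\s*=\s*(\d+)')
UNLINK_RE = re.compile(r'unlink\("([^"]+)"\)\s*=\s*0')
FD_CALL_RE = re.compile(r'\b(pwrite64|close)\((\d+)')


def writes_after_backup_unlink(lines,
                               target="WiredTiger.wt",
                               backup="WiredTiger.backup"):
    """Return 0-based indexes of writes to `target` issued after `backup` is unlinked."""
    target_fds = set()    # descriptors currently open on the target file
    backup_gone = False   # set once a successful unlink of the backup is seen
    late_writes = []

    for index, line in enumerate(lines):
        opened = OPEN_RE.search(line)
        if opened:
            if opened.group(1).endswith(target):
                target_fds.add(int(opened.group(2)))
            continue

        unlinked = UNLINK_RE.search(line)
        if unlinked:
            if unlinked.group(1).endswith(backup):
                backup_gone = True
            continue

        call = FD_CALL_RE.search(line)
        if not call:
            continue
        name, fd = call.group(1), int(call.group(2))
        if name == "close":
            target_fds.discard(fd)   # the descriptor number may be reused
        elif backup_gone and fd in target_fds:
            late_writes.append(index)

    return late_writes
```

Passing the lines of a startup trace such as the attached strace_open.txt to this function yields an empty list only if every traced write to the target precedes the removal of the backup.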

Completion Criteria

In addition to fixing the bug (we're suggesting moving the removal of WiredTiger.backup to later in the process), I think there should be some ad hoc testing to make sure the fix does the right thing. First, we should generate the strace as above and verify that the writes to WiredTiger.wt reach disk before the backup file is removed. Second, we should run the startup from backup in a debugger, breaking and/or single-stepping at various points until the WiredTiger.wt is removed. At each of these breakpoints, we should copy the WiredTiger directory to a unique saved directory, so that we end up with a set of WT directories. For each one, we should start up WT and make sure we didn't lose any data. Within the scope of this ticket, I think this kind of testing is the best we can hope for. Writing a good automated test for these cases is more involved and will be done in WT-9932.
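The snapshot step of this manual procedure could be scripted along the following lines. This is only a sketch: `snapshot_home` and `failing_snapshots` are hypothetical helper names, and the `verify` callback is an assumption standing in for whatever data check is used (for example, opening the directory with the WiredTiger Python API and reading back the expected tables).

```python
import os
import shutil
import tempfile


def snapshot_home(home, save_root, label):
    """Copy `home` into a new, uniquely named directory under `save_root`."""
    os.makedirs(save_root, exist_ok=True)
    # mkdtemp guarantees a unique directory name for each breakpoint.
    dest = tempfile.mkdtemp(prefix=label + "-", dir=save_root)
    shutil.copytree(home, dest, dirs_exist_ok=True)  # requires Python 3.8+
    return dest


def failing_snapshots(snapshots, verify):
    """Return the snapshot directories for which `verify(path)` is false."""
    return [path for path in snapshots if not verify(path)]
```

Calling `snapshot_home` once per breakpoint while the debugger is stopped builds the set of saved directories, and `failing_snapshots` then reports any directory in which the check found missing data.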

Attachments:
strace_open.txt (47 kB, Sep 28 2022 06:05:50 PM UTC)
test_backup29.py (2 kB, Sep 28 2022 06:05:26 PM UTC)

Related to: WT-9932 Tests needed for crashes during backup restart (Open)

Assignee: Susan LoVerso (Inactive)
Reporter: Donald Anderson
Votes: 0
Watchers: 12

Created: Sep 28 2022 05:52:06 PM UTC
Updated: Oct 29 2023 04:38:56 PM UTC
Resolved: Oct 03 2022 03:01:37 PM UTC
