-
Type:
Bug
-
Resolution: Fixed
-
Priority:
Major - P3
-
Affects Version/s: None
-
Component/s: Live Restore
-
None
-
Environment:GCC 13, Ubuntu 24.04
-
Storage Engines - Persistence
-
175.961
-
None
-
1
During live restore background migration, when determining whether a file is visible, the engine checks if a stop file exists in the destination layer using __dest_has_stop_file().
In src/live_restore/live_restore_fs.c, __dest_has_stop_file queries the underlying OS file system's fs_exist function to search for the stop file, but discards its return value:
lr_fs->os_file_system->fs_exist(
lr_fs->os_file_system, (WT_SESSION *)session, path_marker, existp);
If fs_exist returns an error (such as a transient I/O issue or permission failure), the return value is discarded, and the function returns success with ret = 0.
Why this leads to incorrect results:
Looking at the underlying POSIX implementation of fs_exist in src/os_posix/os_fs.c (__posix_fs_exist):
WT_SYSCALL(stat(name, &sb), ret);
if (ret == 0)
if (ret == ENOENT)
{ *existp = false; return (0); }WT_RET_MSG(session, ret, "%s: file-exist: stat", name);
If the systemic call stat() fails with a systemic error (such as EIO), the function immediately aborts and returns the non-zero error code ret without ever assigning a value to *existp.
Because __dest_has_stop_file discarded this return value, it proceeded to use the unmodified, potentially uninitialized or stale value of *existp. This causes logically-deleted source files to incorrectly reappear as visible to the user, violating consistency invariants.
For Repro:
I added a Catch2 unit test file test_live_restore_dest_has_stop_file.cpp.
Before the fix: When running the test case where fs_exist for a stop-file returns EIO, the system swallowed the error, returning ret == 0 and making the deleted file visible (exists == true).
After the fix: The system correctly propagates the error. The final test(final test is after_fix_test_live_restore_dest_has_stop_file.cpp I attached both before/after) asserts this correct behavior.
I intentionally did not include the test code in the PR; we can discuss whether the test code is appropriate for the PR. I have attached the test code. However, I have implemented the fix in the PR.
Purposed Fix:
Wrap the os_file_system->fs_exist in a WT_ERR(...) macro to check and propagate the return value.