__dest_has_stop_file silently swallows fs_exist return value on error

XMLWordPrintableJSON

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • WT12.0.0, 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: Live Restore
    • None
    • Environment:
      GCC 13, Ubuntu 24.04
    • Storage Engines - Persistence
    • 175.961
    • None
    • 1

      During live restore background migration, when determining whether a file is visible, the engine checks if a stop file exists in the destination layer using __dest_has_stop_file().

      In src/live_restore/live_restore_fs.c, __dest_has_stop_file queries the underlying OS file system's fs_exist function to search for the stop file, but discards its return value:

      lr_fs->os_file_system->fs_exist(
            lr_fs->os_file_system, (WT_SESSION *)session, path_marker, existp);

      If fs_exist returns an error (such as a transient I/O issue or permission failure), the return value is discarded, and the function returns success with ret = 0.

      Why this leads to incorrect results:
      Looking at the underlying POSIX implementation of fs_exist in src/os_posix/os_fs.c (__posix_fs_exist):

      WT_SYSCALL(stat(name, &sb), ret);
          if (ret == 0)

      {         *existp = true;         return (0);     }

          if (ret == ENOENT)

      {         *existp = false;         return (0);     }

          WT_RET_MSG(session, ret, "%s: file-exist: stat", name);

      If the systemic call stat() fails with a systemic error (such as EIO), the function immediately aborts and returns the non-zero error code ret without ever assigning a value to *existp.

      Because __dest_has_stop_file discarded this return value, it proceeded to use the unmodified, potentially uninitialized or stale value of *existp. This causes logically-deleted source files to incorrectly reappear as visible to the user, violating consistency invariants.

      For Repro:
      I added a Catch2 unit test file test_live_restore_dest_has_stop_file.cpp.
      Before the fix: When running the test case where fs_exist for a stop-file returns EIO, the system swallowed the error, returning ret == 0 and making the deleted file visible (exists == true).
      After the fix: The system correctly propagates the error. The final test(final test is after_fix_test_live_restore_dest_has_stop_file.cpp I attached both before/after) asserts this correct behavior.

      I intentionally did not include the test code in the PR; we can discuss whether the test code is appropriate for the PR. I have attached the test code. However, I have implemented the fix in the PR.

      Purposed Fix:
      Wrap the os_file_system->fs_exist in a WT_ERR(...) macro to check and propagate the return value.

            Assignee:
            Etienne Petrel
            Reporter:
            Tuna KARABACAK
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated:
              Resolved: