  1. WiredTiger
  2. WT-9368

Extend WT fault injection testing to support LazyFS

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: WT11.2.0, 7.0.0-rc0
    • Affects Version/s: None
    • Component/s: None

      Summary
      WiredTiger has a variety of tests that simulate failures by halting or killing WT uncleanly.  We should configure and run these tests using LazyFS.

      Background

      LazyFS is a file system implemented using the FUSE framework on Linux. Its intended use case is to verify that an application calls fsync wherever needed to ensure data persistence and consistency.  LazyFS intercepts read, write, and fsync calls and only writes data to the underlying file system if and when an appropriate fsync (or similar) call is made.
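
      To make the behavior concrete, here is a minimal sketch of the write-then-fsync pattern that LazyFS is meant to check. The helper name write_file and its do_sync parameter are hypothetical and are not WiredTiger code; the sketch uses only standard POSIX calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * write_file --
 *     Hypothetical helper (not WiredTiger code): write len bytes to path and,
 *     if do_sync is true, call fsync before closing. Returns 0 on success and
 *     -1 on error.
 */
static int
write_file(const char *path, const void *buf, size_t len, bool do_sync)
{
    const char *p = buf;
    ssize_t n;
    int fd, ret = 0;

    assert(path != NULL && buf != NULL);
    if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0)
        return (-1);

    /* write() may be partial, so loop until everything has been handed over. */
    while (len > 0) {
        if ((n = write(fd, p, len)) < 0) {
            ret = -1;
            break;
        }
        p += n;
        len -= (size_t)n;
    }

    /*
     * On a LazyFS mount, the data written above is expected to reach the
     * underlying file system only once this call (or a similar one) is made.
     */
    if (ret == 0 && do_sync && fsync(fd) != 0)
        ret = -1;
    if (close(fd) != 0)
        ret = -1;
    return (ret);
}
```

      On an ordinary file system both variants usually appear to work, which is why a missing fsync is easy to overlook. On a LazyFS mount, data written with do_sync set to false would be expected to disappear once LazyFS is told to drop its unsynced data.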

      Thus LazyFS can be combined with fault injection to verify that an application such as WiredTiger or MongoDB recovers correctly in the worst case, where only data that was explicitly synced has actually been persisted.

      In WiredTiger we could do this by taking any of our existing fault-injection tests, such as random_abort or timestamp_abort, and running them with LazyFS:

      • Run test in directory backed by LazyFS
      • Test creates failure by killing WT
      • Tell LazyFS to drop/forget all unsynced data
      • Test attempts recovery as normal
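
      The kill step in the list above can be sketched as follows. This is an illustration only, not the code used by random_abort or timestamp_abort: run_and_kill and example_workload are hypothetical names, and the workload is a placeholder for a process writing to a database on the LazyFS mount.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Placeholder workload: a real test would write to the database here. */
static void
example_workload(void)
{
    for (;;)
        pause();
}

/*
 * run_and_kill --
 *     Hypothetical helper: run workload() in a child process, kill it
 *     uncleanly after delay_ms milliseconds, and return the signal that
 *     terminated it (-1 on error or if the child exited on its own).
 */
static int
run_and_kill(void (*workload)(void), long delay_ms)
{
    struct timespec ts;
    pid_t pid;
    int status;

    assert(workload != NULL && delay_ms >= 0);
    if ((pid = fork()) < 0)
        return (-1);
    if (pid == 0) {
        workload(); /* The child runs the test workload. */
        _exit(0);
    }

    ts.tv_sec = delay_ms / 1000;
    ts.tv_nsec = (delay_ms % 1000) * 1000000L;
    (void)nanosleep(&ts, NULL);

    /* SIGKILL gives the child no chance to flush or close anything. */
    if (kill(pid, SIGKILL) != 0)
        return (-1);
    if (waitpid(pid, &status, 0) != pid)
        return (-1);

    /*
     * At this point the test would tell LazyFS to drop unsynced data, then
     * reopen the database and check that recovery succeeds.
     */
    return (WIFSIGNALED(status) ? WTERMSIG(status) : -1);
}
```

      A caller would expect run_and_kill(example_workload, 50) to return SIGKILL. Waiting for the child before touching LazyFS matters: it guarantees the killed process issues no further writes after the unsynced data is dropped.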

      Acceptance Criteria (Definition of Done)

      The goal of this project is to get at least one of the WT fault tests working with LazyFS and to assess how easy or hard it would be to deploy this type of testing as part of our standard Evergreen testing.  If it looks tractable, we should create further tickets to do that and to expand LazyFS testing to other tests where it would be useful.

      Useful information to know would include:

      • What dependencies are required and not currently on the testing platforms?
      • Does the process to set up LazyFS require elevated (root) permissions to complete?
      • Are there any steps that would be difficult to automate?

      Suggested Solution

      1. Spawn a fresh Ubuntu 20.04 host
      2. Install LazyFS as per instructions and mount it. You can verify this has worked correctly by sending a lazyfs::display-cache-usage command to the LazyFS faults fifo and seeing a cache usage report in response.
      3. Create and compile a simple csuite test that also writes a lazyfs::display-cache-usage command to the faults fifo, this time from within the csuite test rather than from the command line. The results should be the same as above.
      4. Once this is working, update random_abort or timestamp_abort such that the database created by these tests is located on the LazyFS volume and the test sends a lazyfs::clear-cache command after WiredTiger has aborted but before recovery. If this test succeeds, either stub or modify __wt_fsync so that it no longer actually calls fsync, and rerun the test. The second test run should fail due to lost data.
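
      Steps 2 to 4 all come down to writing a command line into the LazyFS faults fifo. A minimal sketch of a helper a csuite test could use is shown below. The function name lazyfs_send_command is hypothetical, the fifo path is whatever the LazyFS configuration specifies and is passed in by the caller, and terminating the command with a newline is an assumption based on sending it with echo from the command line.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Command names as given in this ticket. */
#define LAZYFS_DISPLAY_CACHE_USAGE "lazyfs::display-cache-usage"
#define LAZYFS_CLEAR_CACHE "lazyfs::clear-cache"

/*
 * lazyfs_send_command --
 *     Hypothetical helper: write one newline-terminated command to the faults
 *     fifo at fifo_path. Returns 0 on success and -1 on error.
 */
static int
lazyfs_send_command(const char *fifo_path, const char *command)
{
    struct stat sb;
    char line[256];
    ssize_t n;
    int fd, len, ret = 0;

    assert(fifo_path != NULL && command != NULL);

    /* Build the whole line first so it goes out in a single write. */
    len = snprintf(line, sizeof(line), "%s\n", command);
    if (len < 0 || (size_t)len >= sizeof(line))
        return (-1);

    /* Blocks until a reader (normally the LazyFS process) has the fifo open. */
    if ((fd = open(fifo_path, O_WRONLY)) < 0)
        return (-1);

    /* Refuse to write into anything that is not a fifo, such as a stray file. */
    if (fstat(fd, &sb) != 0 || !S_ISFIFO(sb.st_mode))
        ret = -1;
    else if ((n = write(fd, line, (size_t)len)) != (ssize_t)len)
        ret = -1;

    if (close(fd) != 0)
        ret = -1;
    return (ret);
}
```

      With this in place, step 3 would call lazyfs_send_command with LAZYFS_DISPLAY_CACHE_USAGE, and step 4 would call it with LAZYFS_CLEAR_CACHE after the child process has been killed and before the database is reopened. The helper only confirms that the command was written; it does not wait for LazyFS to finish acting on it, so a real test may need some form of synchronization before starting recovery.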

            Assignee:
            peter.macko@mongodb.com Peter Macko
            Reporter:
            keith.smith@mongodb.com Keith Smith
            Votes:
            0
            Watchers:
            9