Type: New Feature
Resolution: Fixed
Priority: Major - P3
Fix Version/s: WT11.2.0, 7.0.0-rc0
Affects Version/s: None
Component/s: None
Labels: dev-prod
Sprint: None
Story Points: None

Summary
WiredTiger has a variety of tests that simulate failures by halting or killing WT uncleanly. We should configure and run these tests using LazyFS.

Background

LazyFS is a file system implemented using the FUSE framework on Linux. Its intended use case is to verify that an application calls fsync as needed to ensure data persistence and consistency. LazyFS intercepts read, write, and fsync calls and only writes data to the underlying file system if and when an appropriate fsync (or similar) call is made.

Thus LazyFS can be combined with fault injection to verify that an application such as WiredTiger or MongoDB recovers correctly in the worst case, where only explicitly synced data has actually been persisted.

In WiredTiger we could do this by taking any of our existing fault-injection tests, such as random_abort or timestamp_abort, and running them with LazyFS:

Run test in directory backed by LazyFS
Test creates failure by killing WT
Tell LazyFS to drop/forget all unsynced data
Test attempts recovery as normal
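
The kill step in this sequence could look roughly like the sketch below, written in C to match the csuite tests. It assumes the workload runs in a forked child process; kill_and_reap is a hypothetical helper name, not an existing WiredTiger function. It sends SIGKILL so the child gets no chance to flush or close anything, then confirms the child really died from a signal before the test goes on to drop unsynced data and run recovery.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * kill_and_reap --
 *     Hypothetical helper: kill the workload process uncleanly and wait for it. Returns the
 *     number of the signal that terminated the child, or -1 on error or if the child exited
 *     normally.
 */
static int
kill_and_reap(pid_t pid)
{
    int status;

    /* SIGKILL cannot be caught, so the child cannot sync or close files on the way out. */
    if (kill(pid, SIGKILL) != 0)
        return (-1);

    /* Reap the child so recovery never races with a process that is still writing. */
    if (waitpid(pid, &status, 0) != pid)
        return (-1);

    return (WIFSIGNALED(status) ? WTERMSIG(status) : -1);
}
```

Checking the wait status guards against a child that exited on its own before the kill arrived, in which case the run did not actually exercise an unclean shutdown.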

Acceptance Criteria (Definition of Done)

The goal of this project is to get at least one of the WT fault tests working with LazyFS and assess how easy or hard it would be to deploy this type of testing as part of our standard Evergreen testing. If it looks tractable we should create further tickets to do that and to expand LazyFS testing to other tests where it would be useful.

Useful information to know would include:

What dependencies are required and not currently on the testing platforms?
Does the process to set up LazyFS require elevated (root) permissions to complete?
Are there any steps that would be difficult to automate?

Suggested Solution

Spawn a fresh Ubuntu 20.04 host
Install LazyFS as per instructions and mount it. You can verify this has worked correctly by sending a lazyfs::display-cache-usage command to the LazyFS faults fifo and seeing a cache usage report in response.
Create and compile a simple csuite test that also writes a lazyfs::display-cache-usage command to the faults fifo, this time issued by the test itself rather than from the command line. The results should be the same as above.
Once this is working, update random_abort or timestamp_abort such that the database created by these tests is located on the LazyFS volume and the test sends a lazyfs::clear-cache command after WiredTiger has aborted but before recovery. If this test succeeds, either stub or modify __wt_fsync so that it no longer successfully calls fsync, and rerun the test. The second test run should fail due to lost data.
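
Sending a command from a csuite test, as the steps above describe, might look like the following C sketch. The helper name lazyfs_command is hypothetical, and the faults fifo path is whatever the LazyFS configuration specifies. The command strings are the ones named in this ticket. The trailing newline is an assumption, chosen to match what echo would send from the command line. The sketch opens the fifo with O_NONBLOCK so that the test fails fast, rather than hanging, if LazyFS is not running and nothing is reading the fifo.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * lazyfs_command --
 *     Hypothetical helper: write one command line to the LazyFS faults fifo. Returns 0 on
 *     success or an errno value on failure.
 */
static int
lazyfs_command(const char *fifo_path, const char *command)
{
    char buf[256];
    ssize_t written;
    int fd, len, ret;

    /* Assumption: commands are newline-terminated, as they would be when sent with echo. */
    len = snprintf(buf, sizeof(buf), "%s\n", command);
    if (len < 0 || (size_t)len >= sizeof(buf))
        return (EINVAL);

    /* With no reader on the fifo, a non-blocking open fails with ENXIO instead of blocking. */
    if ((fd = open(fifo_path, O_WRONLY | O_NONBLOCK)) < 0)
        return (errno);

    ret = 0;
    written = write(fd, buf, (size_t)len);
    if (written < 0)
        ret = errno;
    else if (written != (ssize_t)len)
        ret = EIO;

    if (close(fd) != 0 && ret == 0)
        ret = errno;
    return (ret);
}
```

A test would call lazyfs_command(fifo_path, "lazyfs::clear-cache") after the WiredTiger process has been killed and before reopening the database, and treat any non-zero return as a test failure.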

causes

WT-10625 Failed: make-check-test on macOS 11.00 [WiredTiger (develop) @ fc1a5bf4]

Closed

related to

WT-10591 Automate LazyFS testing using Evergreen

Closed

Assignee: Peter Macko
Reporter: Keith Smith
Votes: 0
Watchers: 9

Created: May 26 2022 08:18:07 PM UTC
Updated: Oct 29 2023 04:39:32 PM UTC
Resolved: Feb 16 2023 03:31:29 PM UTC
