Branch wt-2909-verify-checkpoint-integrity introduces a test that runs a subprogram that does some inserts and periodically checkpoints. During the course of a checkpoint, we cause some file system writes to fail, and we expect the subprogram to fail. The parent program opens a connection to the (failed) home directory and reads what it can.
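For readers unfamiliar with this kind of fault injection, here is a minimal sketch of the idea: a write wrapper that lets a fixed number of writes through and then fails every later one. This is not the actual fail_fs code. The names `fault_injector`, `fault_injector_init` and `faulty_write` are hypothetical, and the choice of EIO as the injected error is an assumption made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fault injector; illustration only, not the fail_fs code. */
struct fault_injector {
    unsigned long writes_seen; /* write calls attempted so far */
    unsigned long fail_after;  /* write calls allowed to succeed */
};

static void
fault_injector_init(struct fault_injector *fi, unsigned long fail_after)
{
    fi->writes_seen = 0;
    fi->fail_after = fail_after;
}

/*
 * Write len bytes to fd, unless the fault threshold has been reached.
 * Returns 0 on success or an errno value on failure. Once the threshold
 * is reached, every later call fails with EIO (an assumed error code)
 * without touching the file.
 */
static int
faulty_write(struct fault_injector *fi, int fd, const void *buf, size_t len)
{
    const char *p = buf;
    ssize_t n;

    if (fi->writes_seen++ >= fi->fail_after)
        return (EIO);

    /* Loop to handle short writes and interrupted system calls. */
    while (len > 0) {
        n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return (errno);
        }
        p += n;
        len -= (size_t)n;
    }
    return (0);
}
```

With `fail_after` set to N, the first N calls reach the file and everything after that is dropped. That is the situation a checkpoint has to survive: some of its writes landed and the rest did not.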
The subprogram inserts into two tables within a single transaction. After the failure, we see one of the tables containing many records and the other containing only 1 (we always do a checkpoint after the 1st record). The test always expects to see the same number of records in each table. Note that there is a long comment in test/csuite/wt2909_checkpoint_integrity/main.c describing the test.
There is a caveat to this JIRA report. We must be sure that there is not an error in the fail_fs code that violates some assumption of the file system code. In particular, fail_fs does not lock or unlock files, nor does it do syncs. That is because fail_fs does not need to be durable in the face of system crashes, only process crashes. Perhaps I missed some other assumption.
To see the failure:
./test_wt2909_checkpoint_integrity -v -o 125
That runs the "top level" test as well as the subtest. To run the subtest only, which populates and uses the fail_fs to inject failures, do:
./test_wt2909_checkpoint_integrity subtest -v -p -o 125 -n 50000
At the moment, I've only verified this failure on OS X, where it is consistently reproducible.
For a stack trace of where the write fault was injected, see WT_TEST.subtest/stdout.txt.