Priority: Major - P3
Affects Version/s: None
Fix Version/s: None
@Keith Bostic and I were looking at a problem that cropped up with a variant version of test_format. It turns out not to be a problem with format itself: fast-delete and rollback-to-stable (RTS) don't work together.
I wrote a Python test (candidate test_rollback_to_stable30.py) that exhibits the problem, which I'm attaching.
It writes out a baseline set of data ("aaaaa") at time 20, then a second set of data ("bbbbb") at time 30, evicts the lot, truncates half the table at time 35 (checking stats to make sure fast-delete happens as expected), checkpoints, and then does RTS to time 25, either by direct call ("runtime") or by simulate_crash_restart ("recover"). Then it checks to make sure it sees all the "aaaaa" values, that is, that both the truncate and the second set of data have been rolled back. For VLCS and FLCS, which don't support fast-delete, this works. For row store, both the runtime and recovery cases produce wrong data.
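To make the expected outcome concrete, here is a toy model of the timeline in plain Python. It does not call WiredTiger and is not the attached test; the table size, the value tagging scheme, and the choice of which half gets truncated are all assumptions for illustration.

```python
# Toy model only: this does not call WiredTiger. Each key maps to a list of
# (commit_timestamp, value) updates; a value of None stands in for a truncate.

NKEYS = 10  # hypothetical table size


def tagged(prefix, key):
    # Assumed tagging scheme: value prefix followed by the key number.
    return prefix + str(key)


def build_history():
    history = {}
    for key in range(1, NKEYS + 1):
        history[key] = [(20, tagged("aaaaa", key)), (30, tagged("bbbbb", key))]
    # Truncate half the table at time 35 (which half is an assumption here).
    for key in range(1, NKEYS // 2 + 1):
        history[key].append((35, None))
    return history


def rollback_to_stable(history, stable_ts):
    # Drop every update committed after the stable timestamp.
    return {k: [(ts, v) for ts, v in ups if ts <= stable_ts]
            for k, ups in history.items()}


def visible(history):
    # Newest surviving update per key; truncated keys are absent.
    result = {}
    for key, updates in history.items():
        if updates and updates[-1][1] is not None:
            result[key] = updates[-1][1]
    return result


before = visible(build_history())
after = visible(rollback_to_stable(build_history(), 25))
expected = {key: tagged("aaaaa", key) for key in range(1, NKEYS + 1)}
```

In this model `after == expected`: rolling back to 25 undoes both the truncate at 35 and the "bbbbb" writes at 30, which is what the test checks for.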
The failure mode for runtime RTS is that it reads "bbbbb" values for some keys. The failure mode for recovery RTS is that it reads "aaaaa" values, but the wrong ones. (Each value is tagged with its key number so that it's easier to see what's happening; otherwise it fails after seeing the wrong number of values.)
The first problem is that the page-walk in RTS visits the children of an internal page before visiting the internal page itself. Visiting the parent page of the fast-deleted leaf pages brings them back to life, but only after the tree-walk has already passed the point where it would have visited them, so RTS never sees or processes them.
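The ordering problem can be shown with a small toy tree in Python. This is a sketch under stated assumptions, not the bt_walk.c logic: the `Leaf` and `Internal` classes, the `deleted` flag standing in for a fast-deleted reference, and the rule that visiting a parent reinstates its deleted children are all simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    name: str
    deleted: bool = False      # stands in for a fast-deleted leaf reference
    rolled_back: bool = False  # set when the walk processes the leaf


@dataclass
class Internal:
    name: str
    children: list = field(default_factory=list)


def visit_internal(page):
    # Toy assumption: touching the parent reinstates fast-deleted leaves.
    for child in page.children:
        if isinstance(child, Leaf):
            child.deleted = False


def walk_children_first(page):
    # Children are handled before their parent; leaves that are still
    # marked deleted at that moment are skipped.
    for child in page.children:
        if isinstance(child, Internal):
            walk_children_first(child)
        elif not child.deleted:
            child.rolled_back = True
    visit_internal(page)


def all_leaves(page):
    for child in page.children:
        if isinstance(child, Internal):
            yield from all_leaves(child)
        else:
            yield child


tree = Internal("root", [
    Internal("int0", [Leaf("leaf0"), Leaf("leaf1", deleted=True)]),
    Internal("int1", [Leaf("leaf2", deleted=True), Leaf("leaf3")]),
])
walk_children_first(tree)
# Leaves that are live again but were never processed by the walk.
missed = [leaf.name for leaf in all_leaves(tree)
          if not leaf.deleted and not leaf.rolled_back]
```

Here `missed` ends up holding the two leaves that started out deleted: they are live again by the end of the walk, but nothing rolled them back.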
In the runtime case this seems to mean that the original page images are still available and they get reconnected, just not rolled back, so the "bbbbb" data is still there. (And in fact, debug logging shows that after the test fails, the shutdown-time RTS pass then visits them all and cleans them up.)
For the recovery case the same thing happens except that the reattached pages are empty, so the cursor scan skips over them and reads an unexpected "aaaaa" value from the last page, which still exists.
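A toy scan in Python shows why the check reports a wrong "aaaaa" value rather than a missing one. The page layout below is hypothetical, and it assumes the truncated half is at the front of the table.

```python
# Toy model only. Pages are lists of (key, value) pairs in key order; the
# reattached pages are empty and only the last page still has content.

def scan(pages):
    for page in pages:
        # Empty pages contribute nothing, so the scan silently skips them.
        yield from page


def first_mismatch(pages, expected_keys):
    # Compare the keys the scan returns with the keys the test expects.
    for expected_key, (key, value) in zip(expected_keys, scan(pages)):
        if key != expected_key:
            return expected_key, key, value
    return None


pages = [[], [], [(5, "aaaaa5"), (6, "aaaaa6")]]  # hypothetical layout
mismatch = first_mismatch(pages, [1, 2, 3, 4, 5, 6])
```

In this layout the first value returned is "aaaaa5" where the value for key 1 was expected, which is the kind of mismatch the key tagging makes visible.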
The runtime RTS version can probably be fixed by changing the order of the tree-walk, though it's not immediately obvious how to do that in bt_walk.c; an expedient solution is to walk all the internal pages first and then do another walk to visit the leaves. I've attached a patch that does this; it makes the runtime case for row-store (scenario 4) work, but doesn't have any immediate effect on the recovery case.
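The two-pass idea can be sketched on a toy tree in Python. This is not the attached patch and does not reflect the real bt_walk.c interfaces; the classes and the rule that visiting a parent reinstates its deleted leaves are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    name: str
    deleted: bool = False      # stands in for a fast-deleted leaf reference
    rolled_back: bool = False  # set when the walk processes the leaf


@dataclass
class Internal:
    name: str
    children: list = field(default_factory=list)


def walk_internal_pages(page):
    # Pass 1: visit internal pages only. Toy assumption: visiting a parent
    # reinstates its fast-deleted leaves.
    for child in page.children:
        if isinstance(child, Leaf):
            child.deleted = False
        else:
            walk_internal_pages(child)


def walk_leaves(page):
    # Pass 2: process every leaf that is live by now.
    for child in page.children:
        if isinstance(child, Internal):
            walk_leaves(child)
        elif not child.deleted:
            child.rolled_back = True


def all_leaves(page):
    for child in page.children:
        if isinstance(child, Internal):
            yield from all_leaves(child)
        else:
            yield child


tree = Internal("root", [
    Internal("int0", [Leaf("leaf0"), Leaf("leaf1", deleted=True)]),
    Internal("int1", [Leaf("leaf2", deleted=True), Leaf("leaf3")]),
])
walk_internal_pages(tree)
walk_leaves(tree)
unprocessed = [leaf.name for leaf in all_leaves(tree) if not leaf.rolled_back]
```

With the internal pages walked first, every leaf is live before the leaf pass starts, so `unprocessed` is empty. The cost is a second traversal of the tree, which is why this is an expedient fix rather than a change to the walk order itself.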
Based on a suggestion by Keith, I tried writing a hack to restore the WT_PAGE_DELETED info for WT_REF_DELETED entries in __inmem_row_int, but this seems to have no effect either, so I'm not including it.
The ultimate problem in the recovery case is that the pages can get thrown away, and once that happens it's too late. It seems like the right thing to do is to write out non-deleted references with stop times in the internal pages until the deletions become stable, but it's not clear whether the time aggregates in internal pages can really represent this state adequately. The existing hooks seem to work differently (and predate the addition of history and timestamping), and I'm not familiar enough with the row-store internal page code to sort it out without taking a long time.