WiredTiger / WT-8401

Update test_inmem01.test_insert_over_delete_replace to avoid a degenerate case in FLCS

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor - P4
    • Fix Version/s: WT10.0.1, 5.2.0
    • Affects Version/s: None
    • Component/s: None
    • Labels:

      One of the subtests of test_inmem01 seems to fail intermittently under load on FLCS (fixed-length column store). This subtest writes data until it gets WT_CACHE_FULL, then deletes a quarter of it, then tries to rewrite the first 1000 rows, retrying on failure (which is tacitly assumed to be WT_CACHE_FULL). Under sufficient load on FLCS, it sometimes keeps retrying without ever succeeding, and the test fails.
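A minimal sketch of the rewrite phase with a bounded retry loop, in Python like the rest of the WiredTiger test suite. It is not the code in test_inmem01: CacheFullError, write_row, and max_attempts are hypothetical names. The sketch retries only on the cache-full error, so any other failure surfaces immediately instead of being silently retried.

```python
from typing import Callable, Iterable


class CacheFullError(Exception):
    """Hypothetical stand-in for an operation failing with WT_CACHE_FULL."""


def rewrite_with_retry(write_row: Callable[[int], None],
                       keys: Iterable[int],
                       max_attempts: int = 100) -> int:
    """Rewrite every key, retrying only on CacheFullError.

    Returns the total number of retries. Raises RuntimeError if some key
    still hits CacheFullError after max_attempts tries, instead of looping
    forever; any other exception propagates immediately.
    """
    retries = 0
    for key in keys:
        for _ in range(max_attempts):
            try:
                write_row(key)
                break  # this key succeeded; move on to the next one
            except CacheFullError:
                retries += 1
        else:
            raise RuntimeError(
                f"key {key}: WT_CACHE_FULL after {max_attempts} attempts")
    return retries


# Usage: a fake writer that reports a full cache on its first three calls.
remaining_failures = [3]


def flaky_write(key):
    if remaining_failures[0] > 0:
        remaining_failures[0] -= 1
        raise CacheFullError()


print(rewrite_with_retry(flaky_write, range(1, 1001)))  # 3
```

In the degenerate FLCS case described here, every attempt hits the cache-full error, so a loop like this ends with an explicit error after max_attempts rather than spinning.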

      First, note that deleting doesn't recover space on FLCS: deleted values are stored as zero, so reconciling the deletions doesn't make any more room in the cache. Reconciling updates, on the other hand, does save a lot of space, because an in-memory update structure is much larger than an on-disk value, which is at most one byte.

      I think what's happening is that under sufficient load, all the pages from the initial write get reconciled during that write, so by the time it stops, all the space that reconciling that data could free has already been freed. The deletions then accomplish nothing, and when the test tries to do more updates there is no free space and none left to reclaim, so it gets WT_CACHE_FULL on every attempt until the test gives up and fails.
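To make the space accounting concrete, here is a toy model, not WiredTiger code, in which the 48-byte update size, the 8-bit values, and the row counts are all illustrative assumptions. Reconciling the initial write frees most of the cache, but reconciling the deletes only gives back the space the delete updates themselves used, leaving the page image exactly as large as before. Under load, the first reconcile has effectively already happened during the write, so nothing is left to reclaim.

```python
from dataclasses import dataclass


@dataclass
class ToyFlcsTable:
    """Toy cache-accounting model of one in-memory FLCS table.

    Not WiredTiger code. Hypothetical assumptions:
      * every un-reconciled update structure costs update_bytes of cache;
      * a reconciled row costs value_bits bits in the page image;
      * a delete writes zero, so reconciling it never shrinks the image.
    """
    update_bytes: int = 48   # illustrative only, not WiredTiger's real size
    value_bits: int = 8
    image_rows: int = 0      # rows materialized in reconciled page images
    pending_new: int = 0     # un-reconciled updates that append rows
    pending_mod: int = 0     # un-reconciled updates/deletes of existing rows

    def cache_bytes(self) -> int:
        image = (self.image_rows * self.value_bits + 7) // 8
        pending = (self.pending_new + self.pending_mod) * self.update_bytes
        return image + pending

    def reconcile(self) -> int:
        """Fold pending updates into the page image; return bytes freed."""
        before = self.cache_bytes()
        self.image_rows += self.pending_new  # overwrites and deletes add no rows
        self.pending_new = self.pending_mod = 0
        return before - self.cache_bytes()


t = ToyFlcsTable()
t.pending_new = 4000      # initial write, nothing reconciled yet
print(t.reconcile())      # 188000: updates shrink to 1-byte values
t.pending_mod = 1000      # delete a quarter of the rows
print(t.reconcile())      # 48000: only the deletes' own update structures
print(t.cache_bytes())    # 4000: same image size as before the deletes
```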

      I don't see any way to fix this, since we have no way to mark pages to keep them from being reconciled. (Otherwise, doing that on the first page of the initial write pass would do the trick.)

      The idea I've come up with is to check how many rows the initial write generates (since the cache size is fixed, this indicates how much reconciliation has already happened). If the count is too high, skip the test at that point; otherwise, continue. This lets the test run much of the time (it succeeded in all the runs while testing WT-8287) while avoiding noise when it is going to get stuck. This seems a little cheesy, but I think it's better than just turning off the test for FLCS, which is probably the only other viable choice.
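A sketch of the proposed guard, assuming a hypothetical insert_row callable that raises a hypothetical CacheFullError, and a hypothetical max_rows threshold (a real cutoff would have to be calibrated against the test's fixed cache size). Raising unittest.SkipTest is what TestCase.skipTest() does, so the run is reported as skipped rather than failed.

```python
import unittest


class CacheFullError(Exception):
    """Hypothetical stand-in for an insert failing with WT_CACHE_FULL."""


def fill_until_cache_full(insert_row) -> int:
    """Insert keys 1, 2, 3, ... until insert_row raises CacheFullError.

    Returns the number of rows written before the cache filled.
    """
    nrows = 0
    while True:
        try:
            insert_row(nrows + 1)
        except CacheFullError:
            return nrows
        nrows += 1


def skip_if_degenerate(nrows: int, max_rows: int) -> None:
    """Skip the rest of the test if the initial write produced too many rows.

    With a fixed cache size, a high row count suggests most pages were
    already reconciled during the write, so the deletes cannot make room.
    """
    if nrows > max_rows:
        raise unittest.SkipTest(
            f"initial write produced {nrows} rows (limit {max_rows}); "
            "FLCS cache is probably already fully reconciled")


# Usage with a fake store that holds 10 rows (hypothetical numbers).
def fake_insert(key):
    if key > 10:
        raise CacheFullError()


nrows = fill_until_cache_full(fake_insert)  # 10
skip_if_degenerate(nrows, max_rows=10)      # does not skip
```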

      Unrelatedly, I noticed that this subtest and one of the others say "... verify removes succeed" but don't actually check that remove succeeds. I propose to fix this while passing through.
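A sketch of checking the remove results instead of discarding them. FakeCursor is a hypothetical stand-in whose remove() returns 0 on success and a hypothetical nonzero code for a missing key; the assumption that a real cursor's remove() reports success as 0 should be checked against the WiredTiger Python API before copying this pattern.

```python
import unittest


class FakeCursor:
    """Hypothetical cursor stand-in; not the WiredTiger Python API.

    remove() returns 0 on success and a hypothetical nonzero code if the
    current key is not present.
    """

    def __init__(self, keys):
        self.rows = set(keys)
        self.key = None

    def set_key(self, key):
        self.key = key

    def remove(self):
        if self.key not in self.rows:
            return 1  # hypothetical "not found" status
        self.rows.remove(self.key)
        return 0


class VerifyRemovesSucceed(unittest.TestCase):
    def test_remove_quarter(self):
        cursor = FakeCursor(range(1, 101))
        for key in range(1, 26):
            cursor.set_key(key)
            # Assert on each result so "verify removes succeed" is enforced.
            self.assertEqual(cursor.remove(), 0)
        self.assertEqual(len(cursor.rows), 75)
```

Run it with `python -m unittest`; a failed remove now fails the test instead of passing unnoticed.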

    • Assignee: Unassigned
    • Reporter: David Holland (dholland+wt@sauclovia.org)