WiredTiger / WT-7806

Failed: many-collection-test on Large scale tests [WiredTiger (develop) @ fe8ecbdd]

    • Type: Build Failure
    • Resolution: Fixed
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage - Ra 2021-07-12

      many-collection-test failed on Large scale tests

      Host: ec2-34-205-41-137.compute-1.amazonaws.com
      Project: WiredTiger (develop)
      Commit: diff: WT-7507 Update salvage for a history store and timestamp world (#6590)

      • Salvage calls reconciliation to handle merged pages, and we were explicitly discarding the
        timestamp information from those pages. Preserve all timestamp information when reconciling
        salvaged pages.

      Row-store leaf page reconciliation:
      Don't copy every cell's timestamp information as we process the cells, just point to
      the current timestamp information.

      Column-store leaf page reconciliation:
      Rename "default_tw" to "clear_tw": there are several places where we need a cleared
      timestamp structure, and it's a better name.

      Don't copy every cell's timestamp information as we process the cells, just point to
      the current timestamp information.

      Don't initialize the "last" timestamp information twice.
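
The "point to the current timestamp information" change above can be illustrated with a minimal sketch. All names here (`struct time_window`, `struct cell`, `cell_time_window`, `max_stop_ts`) are hypothetical stand-ins, not WiredTiger's actual structures or functions; only `clear_tw` and `twp` echo names from the commit message, and the zeroed contents of `clear_tw` are an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cell's timestamp information. */
struct time_window {
    uint64_t start_ts;
    uint64_t stop_ts;
};

/* Hypothetical unpacked cell: some cells carry timestamps, some do not. */
struct cell {
    bool has_tw;
    struct time_window tw;
};

/* One shared, cleared window for cells without timestamps (assumed all-zero here). */
static const struct time_window clear_tw = {0, 0};

/* Return a pointer to the window to use; nothing is copied and the result is never NULL. */
static const struct time_window *
cell_time_window(const struct cell *c)
{
    assert(c != NULL);
    return (c->has_tw ? &c->tw : &clear_tw);
}

/* Find the largest stop timestamp, reading each cell's window through a pointer. */
static uint64_t
max_stop_ts(const struct cell *cells, size_t n)
{
    const struct time_window *twp;
    uint64_t max = 0;
    size_t i;

    for (i = 0; i < n; ++i) {
        twp = cell_time_window(&cells[i]); /* Assigned on every path before use. */
        if (twp->stop_ts > max)
            max = twp->stop_ts;
    }
    return (max);
}
```

Because the pointer is assigned from a single helper on every iteration, a sketch like this also avoids the kind of "may be uninitialized" warning that conditional assignment can trigger.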

      • Fix a problem in salvage where reconciliation may skip a key/value pair (based on timestamps),
        and in that case, if the key/value is an overflow item, reconciliation will free the underlying
        object's backing blocks. That's a problem when merging pages if the key is an overflow item:
        if we're processing a page multiple times to handle overlapping ranges, and if the first build
        and reconcile removes the overflow key, the second build/reconcile will fail when it can't read
        the key. Intercept any attempt by reconciliation to free blocks, and clear our reference to that
        overflow key so it will be discarded when salvage finishes.
      • Fix a comment.
      • Lift the test for no-data-handles to before fetching the key; fetching the key is wasted
        work in that case.

      Clean up some comments and move them so the comments are next to the code being discussed.

      • error: variable 'twp' may be uninitialized when used here [-Wconditional-uninitialized]
      • Cache the HS cursor for the entire page reconciliation, there's no point to doing an open/close
        cycle on every key that requires a HS update.

      Lift the complex test limiting when we update the HS on key removal out of the main loop, it's
      chock full of cache misses, at best.
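
The cursor-caching idea above (open the history store cursor once per page reconciliation rather than once per key) might look roughly like the following. This is a sketch under assumptions: `struct hs_cursor`, `struct reconcile`, and the `rec_*` functions are hypothetical and do not correspond to WiredTiger's internal API; the counters exist only to make the open/insert ratio visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical history store cursor. */
struct hs_cursor {
    bool open;
};

/* Hypothetical per-page reconciliation state. */
struct reconcile {
    struct hs_cursor cursor; /* Cached for the whole page. */
    bool cursor_cached;
    int opens;   /* Number of times the cursor was opened. */
    int inserts; /* Number of history store updates written. */
};

static void
rec_init(struct reconcile *r)
{
    r->cursor.open = false;
    r->cursor_cached = false;
    r->opens = 0;
    r->inserts = 0;
}

/* Open the cursor on first use only; later calls return the cached cursor. */
static struct hs_cursor *
rec_hs_cursor(struct reconcile *r)
{
    if (!r->cursor_cached) {
        r->cursor.open = true;
        r->cursor_cached = true;
        ++r->opens;
    }
    return (&r->cursor);
}

/* Write one history store update for a key, reusing the cached cursor. */
static void
rec_hs_insert(struct reconcile *r, int key)
{
    struct hs_cursor *c = rec_hs_cursor(r);

    (void)key; /* The key is unused in this sketch. */
    assert(c->open);
    ++r->inserts;
}

/* Close the cached cursor once, when the page is done. */
static void
rec_finish(struct reconcile *r)
{
    if (r->cursor_cached) {
        r->cursor.open = false;
        r->cursor_cached = false;
    }
}
```

With this shape, reconciling a page that needs 100 history store updates costs one open and one close instead of 100 of each.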

      • Hook rollback-to-stable in as a second step for the WT_SESSION.salvage API.
      • Use __wt_metadata_search() instead of rolling my own.
      • Minor cleanup, don't assign integral values to a boolean.
      • Skip RTS on fixed-length column-store files, they have no stored timestamp information.
      • Fix a bunch of comments with unexpected trailing whitespace.
      • The overflow count won't be set unless there are overflow items, regardless of page type,
        simplify the test.
      • Free the config memory when leaving the function.
      • Generalize the "skip this object" function to cover all object rollback-to-stable ignores.
      • Fix handle usage for salvage: salvage needs a handle but rollback-to-stable doesn't. Hold the
        checkpoint & schema lock across the salvage and rollback-to-stable calls; if we release in
        the middle, a thread could get in and open handles.
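
A minimal sketch of the two-step salvage described above, with the lock held across both steps. Everything here is hypothetical: `struct conn`, its `locked` flag (standing in for the checkpoint and schema locks), and the `salvage_step`, `rts_step`, and `salvage_then_rts` functions are illustrative names, not WiredTiger code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical connection state; the flag stands in for the checkpoint and schema locks. */
struct conn {
    bool locked;
    int salvage_runs;
    int rts_runs;
    bool fail_salvage; /* Test hook: force the salvage step to fail. */
};

/* Step one: salvage. Must run with the lock held. */
static int
salvage_step(struct conn *c)
{
    assert(c->locked);
    if (c->fail_salvage)
        return (-1);
    ++c->salvage_runs;
    return (0);
}

/* Step two: rollback-to-stable. Must run under the same lock acquisition. */
static int
rts_step(struct conn *c)
{
    assert(c->locked);
    ++c->rts_runs;
    return (0);
}

/* Take the lock once, run both steps, then release; never release between the steps. */
static int
salvage_then_rts(struct conn *c)
{
    int ret;

    assert(!c->locked);
    c->locked = true;
    if ((ret = salvage_step(c)) == 0)
        ret = rts_step(c);
    c->locked = false;
    return (ret);
}
```

The design point is that there is a single acquire/release pair around both calls, so no other thread gets a window in which to open handles between salvage and rollback-to-stable.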

      Remove fixed-length column-store exclusion in rollback-to-stable: we still have to clean up
      the in-memory structures. Don't even check for fixed-length column store, the root will have
      no timestamps so there will be little or no disk image processing.

      • Fix a timestamp type.
      • Don't cache the maximum file ID, read it on demand instead and make it a local variable.
      • Rollback-to-stable doesn't need to cache handles, and it's a serious bug if RTS doesn't have
        exclusive access, flag that as an error.
      • Fixed-length column-store is always stable on disk (it has no timestamps), but still needs to
        initialize the timestamp information for aggregation into the column-store internal address.
      • fix a typo in a comment
      • Close the log recovery cursors before calling rollback-to-stable, that allows an assert of
        exclusive access by rollback-to-stable.
      • Fix spelling typo.
      • Close cursors before running rollback-to-stable.
      • Fix a comment, closing sessions will close cursors, no need to do both.
      • Close cursors before calling rollback-to-stable, RTS requires exclusive access.
      • Don't mix-and-match non-diagnostic and diagnostic code.
      • Rework debugging asserts that we're not discarding an internal page with an active page-split
        generation to consistently check for handle dead and exclusive; exclusive handles cannot be
        in danger of another thread of control accessing a page-index field. (The bug this is fixing
        is that __wt_page_can_evict() could return that the page was evictable because the handle was
        exclusive, but the assert in __wt_evict() didn't check exclusivity and so asserted that the
        page could not be evicted.)
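
The mismatch described above (the eviction check and the eviction assert disagreeing about exclusive handles) is the kind of bug a single shared predicate avoids. The following is a sketch under assumptions: `struct handle`, `struct page`, `page_split_gen_safe`, `page_can_evict`, and `page_evict` are hypothetical, and the "generation is active" test is simplified to a numeric comparison; none of this is the actual `__wt_page_can_evict()` or `__wt_evict()` code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical data handle state. */
struct handle {
    bool dead;
    bool exclusive;
};

/* Hypothetical page state. */
struct page {
    bool internal;
    unsigned long split_gen; /* Generation of the page's last split. */
    bool evicted;
};

/*
 * One predicate used by both the check and the assert. Dead or exclusive handles cannot have
 * another thread reading the page index, so the split generation is irrelevant for them.
 * Assumption: a generation is "active" when it is >= the oldest generation still in use.
 */
static bool
page_split_gen_safe(const struct handle *h, const struct page *p, unsigned long oldest_active_gen)
{
    if (!p->internal)
        return (true);
    if (h->dead || h->exclusive)
        return (true);
    return (p->split_gen < oldest_active_gen);
}

/* The "can we evict" check. */
static bool
page_can_evict(const struct handle *h, const struct page *p, unsigned long oldest_active_gen)
{
    return (page_split_gen_safe(h, p, oldest_active_gen));
}

/* The eviction itself asserts the same predicate, so it cannot disagree with the check. */
static void
page_evict(const struct handle *h, struct page *p, unsigned long oldest_active_gen)
{
    assert(page_split_gen_safe(h, p, oldest_active_gen));
    p->evicted = true;
}
```

If the assert had its own hand-written condition that omitted the exclusivity test, an exclusive handle would pass the check and then trip the assert, which matches the failure mode the commit message describes.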

      Rework rollback-to-stable to protect the page-index only where it's needed, when reviewing
      the internal pages for fast-delete leaf pages. This isn't a performance or correctness issue,
      it's just clarifying when page-generations are interesting and when they're not (tree walk
      handles its own page-generations issues, there's no point in RTS doing it as well).

      • Remove stable_rollback_maxfile, its only purpose was to protect the stable_rollback_bitstring
        overwrite and that code has already been removed.
      • You can't open a file exclusively if there are modifications in the cache; attempting to close
        the already open file handles will fail with EBUSY: see WT-4070 and WT-4414.
      • Exclusive handle operations (salvage & verify in this case), can return EBUSY until able to
        close open handles and flush dirty data from the cache. Loop around checkpoints until the
        operation succeeds.
      • Add a specific error message if rollback-to-stable is unable to acquire a handle to make an
        exclusive handle failure in the field easier to diagnose.
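
The "loop around checkpoints until the operation succeeds" pattern above can be sketched as a small retry helper. This is an illustration only: `retry_exclusive_op`, `struct fake_state`, and the fake operation and checkpoint callbacks are hypothetical, and the attempt limit is an assumption added so the sketch cannot spin forever. In an application the callbacks would presumably wrap calls such as `WT_SESSION::salvage` (or `WT_SESSION::verify`) and `WT_SESSION::checkpoint`.

```c
#include <assert.h>
#include <errno.h>

/* Callback type for both the exclusive operation and the checkpoint. */
typedef int (*op_fn)(void *);

/*
 * Retry an exclusive operation while it returns EBUSY, running a checkpoint between attempts
 * to flush dirty data. Returns the operation's result, a checkpoint error, or EBUSY if the
 * attempt limit is reached.
 */
static int
retry_exclusive_op(op_fn op, op_fn checkpoint, void *arg, int max_attempts)
{
    int i, ret;

    for (i = 0; i < max_attempts; ++i) {
        if ((ret = op(arg)) != EBUSY)
            return (ret); /* Success or a non-retryable error. */
        if ((ret = checkpoint(arg)) != 0)
            return (ret); /* The checkpoint itself failed. */
    }
    return (EBUSY);
}

/* Hypothetical state for exercising the helper. */
struct fake_state {
    int busy_left; /* Checkpoints still needed before the operation can succeed. */
    int checkpoints;
    int op_calls;
};

/* Fake exclusive operation: busy until enough checkpoints have run. */
static int
fake_op(void *arg)
{
    struct fake_state *s = arg;

    ++s->op_calls;
    return (s->busy_left > 0 ? EBUSY : 0);
}

/* Fake checkpoint: each one makes progress toward the operation succeeding. */
static int
fake_checkpoint(void *arg)
{
    struct fake_state *s = arg;

    ++s->checkpoints;
    if (s->busy_left > 0)
        --s->busy_left;
    return (0);
}
```

Bounding the number of attempts is a choice of this sketch; the commit message only says to loop until the operation succeeds.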

      Enhance rollback-to-stable method documentation for clarity on error handling.

      • Change rollback-to-stable to only require exclusive handle use in standalone builds, MongoDB has
        open handles when calling rollback-to-stable.
      • Fix clang-analyzer complaint:
        txn_rollback_to_stable.c:1518:9: warning: Value stored to 'handle_open_flags' is never read

      08 Jul 21 02:07 UTC

      Task Logs (many-collection-test)

            Assignee: Etienne Petrel (etienne.petrel@mongodb.com)
            Reporter: Xgen-Evergreen-User (xgen-evg-user)
            Votes: 0
            Watchers: 2

              Created:
              Updated:
              Resolved: