Type: Build Failure
Resolution: Fixed
Priority: Major - P3
Affects Version/s: None
Component/s: None
Storage - Ra 2021-07-12
many-collection-test failed on Large scale tests
Host: ec2-34-205-41-137.compute-1.amazonaws.com
Project: WiredTiger (develop)
Commit: diff: WT-7507 Update salvage for a history store and timestamp world (#6590)
- Salvage calls reconciliation to handle merged pages, and we were explicitly discarding the
timestamp information from those pages. Preserve all timestamp information when reconciling
salvaged pages.
Row-store leaf page reconciliation:
Don't copy every cell's timestamp information as we process the cells, just point to
the current timestamp information.
Column-store leaf page reconciliation:
Rename "default_tw" to "clear_tw": there are several places where we need a cleared
timestamp structure, and it's a better name.
Don't copy every cell's timestamp information as we process the cells, just point to
the current timestamp information.
Don't initialize the "last" timestamp information twice.
- Fix a problem in salvage where reconciliation may skip a key/value pair (based on timestamps),
and in that case, if the key/value is an overflow item, reconciliation will free the underlying
object's backing blocks. That's a problem when merging pages if the key is an overflow item:
if we're processing a page multiple times to handle overlapping ranges, and if the first build
and reconcile removes the overflow key, the second build/reconcile will fail when it can't read
the key. Intercept any attempt by reconciliation to free blocks, and clear our reference to that
overflow key so it will be discarded when salvage finishes.
- Fix a comment.
- Lift the test for no-data-handles to before going and getting the key, it's wasted work in
that case.
Clean up some comments and move them so the comments are next to the code being discussed.
- error: variable 'twp' may be uninitialized when used here [-Wconditional-uninitialized]
- Cache the HS cursor for the entire page reconciliation, there's no point to doing an open/close
cycle on every key that requires a HS update.
Lift the complex test limiting when we update the HS on key removal out of the main loop, it's
chock full of cache misses, at best.
- Hook rollback-to-stable in as a second step for the WT_SESSION.salvage API.
- Use __wt_metadata_search() instead of rolling my own.
- Minor cleanup, don't assign integral values to a boolean.
- Skip RTS on fixed-length column-store files, they have no stored timestamp information.
- Fix a bunch of comments with unexpected trailing whitespace.
- The overflow count won't be set unless there are overflow items, regardless of page type,
simplify the test.
- Free the config memory when leaving the function.
- Generalize the "skip this object" function to cover all object rollback-to-stable ignores.
- Fix handle usage for salvage: salvage needs a handle but rollback-to-stable doesn't. Hold the
checkpoint & schema lock across the salvage and rollback-to-stable calls; if we release in
the middle, a thread could get in and open handles.
Remove fixed-length column-store exclusion in rollback-to-stable: we still have to clean up
the in-memory structures. Don't even check for fixed-length column store, the root will have
no timestamps so there will be little or no disk image processing.
- Fix a timestamp type.
- Don't cache the maximum file ID, read it on demand instead and make it a local variable.
- Rollback-to-stable doesn't need to cache handles, and it's a serious bug if RTS doesn't have
exclusive access, flag that as an error.
- Fixed-length column-store is always stable on disk (it has no timestamps), but still needs to
initialize the timestamp information for aggregation into the column-store internal address.
- fix a typo in a comment
- Close the log recovery cursors before calling rollback-to-stable, that allows an assert of
exclusive access by rollback-to-stable.
- Fix speling typo.
- Close cursors before running rollback-to-stable.
- Fix a comment, closing sessions will close cursors, no need to do both.
- Close cursors before calling rollback-to-stable, RTS requires exclusive access.
- Don't mix-and-match non-diagnostic and diagnostic code.
- Rework debugging asserts that we're not discarding an internal page with an active page-split
generation to consistently check for handle dead and exclusive; exclusive handles cannot be
in danger of another thread of control accessing a page-index field. (The bug this fixes is
that __wt_page_can_evict() could report the page as evictable because the handle was exclusive,
but the assert in __wt_evict() didn't check exclusivity and so asserted that the page could not
be evicted.)
Rework rollback-to-stable to protect the page-index only where it's needed, when reviewing
the internal pages for fast-delete leaf pages. This isn't a performance or correctness issue,
it's just clarifying when page-generations are interesting and when they're not (tree walk
handles its own page-generations issues, there's no point in RTS doing it as well).
- Remove stable_rollback_maxfile, its only purpose was to protect the stable_rollback_bitstring
overwrite and that code has already been removed.
- You can't open a file exclusively if there are modifications in the cache; attempting to close
the already open file handles will fail with EBUSY: see WT-4070 and WT-4414.
- Exclusive handle operations (salvage & verify in this case), can return EBUSY until able to
close open handles and flush dirty data from the cache. Loop around checkpoints until the
operation succeeds.
- Add a specific error message if rollback-to-stable is unable to acquire a handle to make an
exclusive handle failure in the field easier to diagnose.
Enhance rollback-to-stable method documentation for clarity on error handling.
- Change rollback-to-stable to only require exclusive handle use in standalone builds, MongoDB has
open handles when calling rollback-to-stable.
- Fix clang-analyzer complaint:
txn_rollback_to_stable.c:1518:9: warning: Value stored to 'handle_open_flags' is never read | 08 Jul 21 02:07 UTC
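
The EBUSY handling described in the commit message (loop around checkpoints until an exclusive handle operation succeeds) can be sketched roughly as below. This is an illustration only, not the WiredTiger code: retry_exclusive, mock_state, mock_salvage and mock_checkpoint are hypothetical names, the callbacks stand in for the real salvage/verify and checkpoint calls, and the max_attempts bound is an assumption the commit message does not state.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callback type standing in for an exclusive operation or a checkpoint. */
typedef int (*op_fn)(void *ctx);

/*
 * retry_exclusive --
 *     Run an exclusive-handle operation; while it reports EBUSY, run a checkpoint to flush
 *     dirty data and try again. Gives up after max_attempts tries (an assumed bound).
 */
static int
retry_exclusive(op_fn op, op_fn checkpoint, void *ctx, int max_attempts)
{
    int i, ret;

    assert(max_attempts > 0);

    ret = EBUSY;
    for (i = 0; i < max_attempts; ++i) {
        /* Any result other than EBUSY (success or a real error) ends the loop. */
        if ((ret = op(ctx)) != EBUSY)
            return (ret);

        /* A failed checkpoint is a real error, return it rather than retrying. */
        if ((ret = checkpoint(ctx)) != 0)
            return (ret);
        ret = EBUSY;
    }
    return (ret);
}

/* Mock state for illustration: the operation is busy while dirty data remains. */
struct mock_state {
    int dirty;       /* Units of dirty data still in the cache. */
    int attempts;    /* Number of times the exclusive operation ran. */
    int checkpoints; /* Number of checkpoints taken. */
};

/* Mock exclusive operation: fails with EBUSY until the cache is clean. */
static int
mock_salvage(void *ctx)
{
    struct mock_state *st = ctx;

    ++st->attempts;
    return (st->dirty > 0 ? EBUSY : 0);
}

/* Mock checkpoint: each call flushes one unit of dirty data. */
static int
mock_checkpoint(void *ctx)
{
    struct mock_state *st = ctx;

    ++st->checkpoints;
    if (st->dirty > 0)
        --st->dirty;
    return (0);
}
```

With the mock above, a state that starts with two units of dirty data succeeds on the third attempt after two checkpoints; a state that never becomes clean returns EBUSY once the attempt bound is reached.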