-
Type: Bug
-
Resolution: Duplicate
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
2023-06-27 Lord of the Sprints
Line 219 of block_mgr.c (in bm_close_block) says "You can't close files during a checkpoint." and asserts that the block manager's checkpoint state is either WT_CKPT_NONE or WT_CKPT_PANIC_ON_FAILURE, and not WT_CKPT_INPROGRESS.
I observed this assertion fail in test_checkpoint last night (while running the test_checkpoint_column_sweep_timestamps scenario) under the following circumstances:
1. Someone opens a checkpoint cursor on a tree that's in active use, so both the live tree and a checkpoint tree are open. These are different WT_BTREEs but the same underlying file, so share the same block manager.
2. A checkpoint starts.
3. The sweep server runs.
4. The checkpoint starts syncing the live tree. This sets the block manager's checkpoint state to WT_CKPT_INPROGRESS.
5. The sweep server decides to close the checkpoint tree. This closes the checkpoint tree's reference to the block manager while the checkpoint state is still WT_CKPT_INPROGRESS.
6. Profit^H^H^H^H^H^H Assertion failure.
Both the checkpoint and the sweep server lock the dhandle they are working on, but since these are two different dhandles, neither lock prevents the close from happening.
It is safe to allow this close to go through (all it's going to do is decref and return), so one possible fix is to move the assertion past the code that does the decref, and only assert if the refcount is 1. However, this also largely nerfs the assertion, so I'm wondering if there's a better fix. I don't know the block manager code, so I'm hoping that someone who does will have some ideas.
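To make the proposed reordering concrete, here is a minimal single-threaded sketch of it. This is not the real block manager code: block_mgr_t, ckpt_state_t, block_close and replay_sweep_during_checkpoint are hypothetical stand-ins invented for illustration, and the real code's locking and atomic reference counting are deliberately left out. It only models a reference count shared by the live tree and the checkpoint tree, plus the checkpoint state the assertion checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the WiredTiger definitions. */
typedef enum { CKPT_NONE, CKPT_INPROGRESS, CKPT_PANIC_ON_FAILURE } ckpt_state_t;

typedef struct {
    int ref;                 /* number of open trees sharing this block manager */
    ckpt_state_t ckpt_state; /* checkpoint state of the underlying file */
    bool file_closed;        /* set once the last reference is dropped */
} block_mgr_t;

/*
 * Drop one reference. Returns true if this call closed the file, false if it
 * only decremented the reference count.
 */
bool
block_close(block_mgr_t *bm)
{
    assert(bm->ref > 0);

    /* Not the last reference: decref and return, whatever the checkpoint state. */
    if (bm->ref > 1) {
        --bm->ref;
        return (false);
    }

    /* Last reference (refcount is 1): you can't close files during a checkpoint. */
    assert(bm->ckpt_state == CKPT_NONE || bm->ckpt_state == CKPT_PANIC_ON_FAILURE);

    bm->ref = 0;
    bm->file_closed = true;
    return (true);
}

/* Replay steps 1-5 from the description; returns the remaining refcount. */
int
replay_sweep_during_checkpoint(void)
{
    /* Step 1: the live tree and the checkpoint tree share one block manager. */
    block_mgr_t bm = {2, CKPT_NONE, false};

    /* Step 4: the checkpoint starts syncing the live tree. */
    bm.ckpt_state = CKPT_INPROGRESS;

    /* Step 5: the sweep server closes the checkpoint tree's reference. */
    bool closed = block_close(&bm);
    assert(!closed);

    return (bm.ref);
}
```

With the assertion placed before the refcount check, step 5 would trip it; in this arrangement the close only drops the count from 2 to 1. The cost is the one already noted: the assertion now fires only when the last reference is closed mid-checkpoint.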
My guess is that this will be very hard to reproduce, so I'm going to save the core file in case anyone wants further information.
- duplicates
-
WT-11181 heap-use-after-free during dhandle open
- Closed