Some of our patch builds were failing test_verify.py consistently with the following error:
[2023/11/29 04:25:34.376] [1701231595:894982][149352:0x7f25c4d98800], test_verify.test_verify.test_verify_api_corrupt_first_page, file:test_verify.a.wt, WT_SESSION.verify: [WT_VERB_DEFAULT][ERROR]: __wt_block_read_off, 234: test_verify.a.wt: potential hardware corruption, read checksum error for 28672B block at offset 4096: calculated block checksum of 0xd4b774a doesn't match expected checksum of 0x5ccb8e3e
The test deliberately corrupts the data file and then runs verify.
The patch below fixed the errors we were seeing by skipping prefetch once a corrupted block has been detected:
--- a/src/conn/conn_prefetch.c
+++ b/src/conn/conn_prefetch.c
@@ -108,7 +108,13 @@ __wt_prefetch_thread_run(WT_SESSION_IMPL *session, WT_THREAD *thread)
         __wt_spin_unlock(session, &conn->prefetch_lock);
         locked = false;
 
-        WT_WITH_DHANDLE(session, pe->dhandle, ret = __wt_prefetch_page_in(session, pe));
+        /*
+         * It's a weird case, but if verify is utilizing prefetch and encounters a corrupted
+         * block, stop using prefetch. Some of the guarantees about ref and page freeing are
+         * ignored in that case, which can invalidate entries on the prefetch queue.
+         */
+        if (!F_ISSET(S2C(session), WT_CONN_DATA_CORRUPTION) && pe->ref->page_del != NULL)
+            WT_WITH_DHANDLE(session, pe->dhandle, ret = __wt_prefetch_page_in(session, pe));
         /*
          * It probably isn't strictly necessary to re-acquire the lock to reset the flag, but other
          * flag accesses do need to lock, so it's better to be consistent.
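For anyone reading this without the source open, the idea behind the guard can be sketched in isolation. The following is a minimal standalone model, not WiredTiger code: every name in it (demo_conn, demo_prefetch_entry, DEMO_CONN_DATA_CORRUPTION, demo_mark_corruption, demo_prefetch_one) is a hypothetical stand-in. It models only the connection-level corruption flag check, not the pe->ref->page_del condition that the patch also tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit standing in for a connection-wide "corruption seen" marker. */
#define DEMO_CONN_DATA_CORRUPTION 0x1u

/* Hypothetical connection state: a flag word plus counters for illustration. */
typedef struct {
    uint32_t flags;       /* Connection-wide state bits. */
    int pages_prefetched; /* Queue entries that were actually read in. */
    int pages_skipped;    /* Queue entries dropped because of the flag. */
} demo_conn;

/* Hypothetical prefetch queue entry. */
typedef struct {
    int page_id;
} demo_prefetch_entry;

/* Record that a corrupted block was detected, for example by a checksum mismatch. */
static void
demo_mark_corruption(demo_conn *conn)
{
    assert(conn != NULL);
    conn->flags |= DEMO_CONN_DATA_CORRUPTION;
}

/*
 * Handle one queued entry. Once corruption has been flagged, the entry is
 * skipped rather than read, because it may no longer be valid.
 * Returns true if the page was prefetched, false if it was skipped.
 */
static bool
demo_prefetch_one(demo_conn *conn, const demo_prefetch_entry *pe)
{
    assert(conn != NULL && pe != NULL);
    if ((conn->flags & DEMO_CONN_DATA_CORRUPTION) != 0) {
        conn->pages_skipped++;
        return false;
    }
    conn->pages_prefetched++;
    return true;
}
```

In this model, entries handled before demo_mark_corruption is called are prefetched as usual, and every entry handled afterwards is skipped.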
Related to: WT-12135 Don't re-open connections and sessions with pre-fetching enabled after damaging tables (Closed)