Keith Bostic and I spent a long time on a call today. I have not managed to reproduce it on my AWS box with clang (2+ hours so far).
I'm adding some of his suggestions from Slack to the ticket. These are configuration changes that might make it reproduce faster:
- Try a bigger CONFIG:cache (it may hit faster, or it may make it go away).
- Try more CONFIG:threads.
- Turn off salvage, verify and rebalance in CONFIG to make each iteration go faster.
- Try smaller CONFIG:leaf_page_max.
- Turn off overflow, make CONFIG:value_max smaller.
- Try larger CONFIG:write_pct.
- Try CONFIG:isolation=snapshot so that test/format checks more strictly that it reads back the correct key/value pairs. If we are allocating update structures but not correctly keeping or applying them after reconciliation, then snapshot isolation may detect the problem at the time it happens, instead of the sanitizer reporting it at the end of a run.
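Put together, the suggestions above might look something like the following CONFIG fragment. This is only a sketch: the key names are the ones listed above, but every numeric value is a placeholder I made up rather than a tuned setting, and I am assuming the usual one key=value per line layout with # comments. Check the units test/format expects for each key before using it.

```
# Placeholder values only; adjust each one and try them one at a time.
cache=512            # bigger cache
threads=16           # more threads
salvage=0            # skip salvage so each iteration is faster
verify=0             # skip verify
rebalance=0          # skip rebalance
leaf_page_max=9      # smaller leaf pages
value_max=64         # smaller values, to avoid overflow items
write_pct=90         # more writes
isolation=snapshot   # stricter key/value checking in test/format
```

Changing one setting per run makes it easier to tell which change, if any, affects how quickly it reproduces.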
The first step is getting a reproducer, which is already proving hard (for me). We could record on the page the flags that were passed to wt_reconcile in evict_review, so we know how it was called.
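A minimal sketch of what recording the flags could look like, assuming a small ring buffer kept per page. All names here (recon_trace, recon_trace_record, recon_trace_dump, the caller tags) are hypothetical and are not existing WiredTiger symbols. The real change would hang the structure off the page and call the record function wherever reconciliation is invoked.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical: number of recent reconciliation calls remembered per page. */
#define RECON_TRACE_SLOTS 4

/* Hypothetical caller tags; the real set of call sites may differ. */
enum recon_caller {
    RECON_CALLER_NONE = 0,
    RECON_CALLER_EVICT_REVIEW,
    RECON_CALLER_CHECKPOINT
};

struct recon_trace_entry {
    uint32_t flags; /* flags exactly as passed to reconciliation */
    uint8_t caller; /* one of enum recon_caller */
};

/* Would live on the page structure in a real change. */
struct recon_trace {
    struct recon_trace_entry slot[RECON_TRACE_SLOTS];
    uint32_t next; /* total number of calls recorded so far */
};

/*
 * Record one reconciliation call, overwriting the oldest entry once the
 * buffer is full. Not thread-safe: this assumes the caller already holds
 * whatever serializes reconciliation of the page.
 */
static void
recon_trace_record(struct recon_trace *t, uint32_t flags, enum recon_caller caller)
{
    struct recon_trace_entry *e = &t->slot[t->next % RECON_TRACE_SLOTS];

    e->flags = flags;
    e->caller = (uint8_t)caller;
    ++t->next;
}

/* Print the retained entries, oldest first, for use in a failure report. */
static void
recon_trace_dump(const struct recon_trace *t, FILE *fp)
{
    uint32_t count = t->next < RECON_TRACE_SLOTS ? t->next : RECON_TRACE_SLOTS;
    uint32_t i;

    for (i = t->next - count; i < t->next; ++i) {
        const struct recon_trace_entry *e = &t->slot[i % RECON_TRACE_SLOTS];
        fprintf(fp, "call %u: caller=%u flags=0x%08x\n",
            (unsigned)i, (unsigned)e->caller, (unsigned)e->flags);
    }
}
```

The idea would be to call recon_trace_record just before reconciliation is invoked, and recon_trace_dump from whatever path reports the failure. A fixed-size buffer keeps the per-page cost small and constant, at the price of only seeing the last few calls.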