-
Type: Bug
-
Resolution: Duplicate
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: Cache and Eviction
-
Storage Engines
-
2024-05-28 - FOLLOW ON SPRINT
I found a failure here where test-format runs for more than 15 minutes and fails. From the core dump, cache dump, and stats, I found that checkpoints do not run regularly; they run at intervals of roughly 12-15 minutes. I believe this is because, after each checkpoint, test-format calls wts_verify_checkpoint here, and most of the time is spent verifying the checkpoint. Below are my observations.
- This is a kind of cache-stuck condition: the cache contains only internal pages, and all of them are modified.
- The eviction server sees the internal pages, but it is operating with a clean eviction strategy, so it evicts only unmodified pages (internal or leaf), and there are only a few unmodified (clean) internal pages.
- Eventually, the volume of modified internal pages crosses the threshold, and application threads are pulled in to do eviction. They cannot find anything in the queue, because the server skips the modified internal pages and does not queue them here. The stat "cache eviction server skips pages that we do not want to evict" keeps increasing.
- The application threads see high latency and are stuck for 15 minutes, causing the format tests to fail with the timeout "format run more than 15 minutes past the maximum time".
- The cache is not operating in aggressive mode even though the cache dirty fill ratio is beyond the trigger level.
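As a rough illustration of the observations above, here is a toy Python model of the stuck state. It is not WiredTiger code: `Page`, `queue_for_eviction`, and `checkpoint` are hypothetical names, and the only assumptions are the ones stated in the list (a clean-only eviction pass skips modified pages, and a checkpoint marks dirty pages clean).

```python
from dataclasses import dataclass


@dataclass
class Page:
    internal: bool  # True for an internal page, False for a leaf page
    dirty: bool     # True if the page has been modified since the last checkpoint


def queue_for_eviction(pages, clean_only):
    """One pass of a hypothetical eviction server.

    Returns (queued_pages, skipped_count). In clean-only mode, modified
    pages are skipped instead of being queued.
    """
    queued, skipped = [], 0
    for page in pages:
        if clean_only and page.dirty:
            skipped += 1
            continue
        queued.append(page)
    return queued, skipped


def checkpoint(pages):
    """Toy checkpoint: marks every dirty page clean (assumed behavior)."""
    for page in pages:
        page.dirty = False


# Cache state from the observations: only internal pages, all modified.
cache = [Page(internal=True, dirty=True) for _ in range(8)]

queued, skipped = queue_for_eviction(cache, clean_only=True)
print(len(queued), skipped)  # 0 8: nothing queued, every page skipped

checkpoint(cache)
queued, skipped = queue_for_eviction(cache, clean_only=True)
print(len(queued), skipped)  # 8 0: pages become evictable after a checkpoint
```

In this simplified model the queue stays empty and the skip counter keeps growing on every pass until a checkpoint runs, which matches the shape of the behavior seen in the stats.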
The dirty internal pages in the cache should have been marked clean by a checkpoint, so if checkpoints ran at regular intervals, this issue would not happen.
I ran the same test/format configuration on the develop branch, where checkpoints run at regular intervals of about 2–3 minutes, so this issue will not happen on the develop branch.
- duplicates
-
WT-11372 Test format times out after 15 minutes with cache stuck (clean)
- Open