  1. WiredTiger
  2. WT-12728

test-format times out because checkpoints do not run at regular intervals in 7.0

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Cache and Eviction
    • Storage Engines
    • 2024-05-28 - FOLLOW ON SPRINT

      I found a failure where test-format runs for more than 15 minutes and then fails. From the core dump, cache dump, and stats, I found that the checkpoint does not run regularly; it runs at intervals of ~12-15 minutes. I believe this is because test-format calls wts_verify_checkpoint after each checkpoint, and most of the time is spent verifying the checkpoint. My observations are below.

      • The cache is effectively stuck: it contains only internal pages, and all of them are modified.
      • The eviction server sees the internal pages, but it is operating with a clean eviction strategy, so it evicts only unmodified pages (internal or leaf), and there are only a few unmodified internal pages.
      • Eventually the modified internal pages grow past the threshold, which pulls application threads into eviction. They find nothing in the queue, because the server skips the modified internal pages and does not queue them. The stat "cache eviction server skips pages that we do not want to evict" keeps increasing.
      • The application threads add latency and are stuck for 15 minutes, so the format test fails with the timeout "format run more than 15 minutes past the maximum time".
      • The cache does not switch to aggressive mode even after the cache dirty fill ratio exceeds the trigger level.
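
The observations above can be illustrated with a toy model. This is only a sketch and not WiredTiger code: the `Page` class and `queue_clean_candidates` function are hypothetical names invented for illustration, and the page count of 100 is arbitrary. It shows why a clean-only walk over a cache of modified internal pages leaves the eviction queue empty while the skip counter grows.

```python
from dataclasses import dataclass


@dataclass
class Page:
    internal: bool  # True for an internal page, False for a leaf page
    dirty: bool     # True if the page is modified


def queue_clean_candidates(pages):
    """Toy eviction-server walk under a clean-only strategy (hypothetical).

    Returns (queued, skipped): the pages queued for eviction and the
    number of pages skipped because they are modified.
    """
    queued = []
    skipped = 0
    for page in pages:
        if page.dirty:
            # Modified pages are not queued; only the skip counter moves.
            skipped += 1
        else:
            queued.append(page)
    return queued, skipped


# Cache state described in the report: only internal pages, all modified.
cache = [Page(internal=True, dirty=True) for _ in range(100)]
queued, skipped = queue_clean_candidates(cache)
print(len(queued), skipped)  # prints: 0 100
```

In this model an application thread that is pulled into eviction has nothing to pop from `queued`, which matches the stall described above.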

      The checkpoint should have marked the dirty internal pages in the cache clean, so if checkpoints ran at regular intervals, this issue would not happen.

      I ran the same test/format configuration on the develop branch, where checkpoints run regularly at intervals of ~2–3 minutes, so this issue does not occur on the develop branch.
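
As a rough back-of-the-envelope sketch of why the checkpoint interval matters, assume that pages are dirtied at a constant rate and that only a checkpoint cleans them. The rate and trigger values below are hypothetical, chosen only for illustration; the intervals are taken from the ranges reported above (12 minutes for 7.0, 3 minutes for develop).

```python
def peak_dirty_pages(dirty_per_minute, checkpoint_interval_minutes):
    """Dirty pages accumulated just before a checkpoint cleans them,
    assuming a constant dirtying rate and no dirty eviction in between."""
    return dirty_per_minute * checkpoint_interval_minutes


DIRTY_PER_MINUTE = 10      # hypothetical dirtying rate
DIRTY_TRIGGER_PAGES = 100  # hypothetical dirty trigger, expressed in pages

for branch, interval in (("7.0", 12), ("develop", 3)):
    peak = peak_dirty_pages(DIRTY_PER_MINUTE, interval)
    print(branch, peak, peak > DIRTY_TRIGGER_PAGES)
# prints:
# 7.0 120 True
# develop 30 False
```

Under these assumed numbers, the longer interval lets dirty pages cross the trigger before the next checkpoint, while the shorter interval keeps them well below it.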

        1. 7.0_Branch.png (116 kB)
        2. Develop_Branch.png (107 kB)
        3. Screenshot 2024-05-16 at 5.10.52 PM.png (253 kB)

            Assignee: Ravi Giri (ravi.giri@mongodb.com)
            Reporter: Ravi Giri (ravi.giri@mongodb.com)