  WiredTiger / WT-9121

Eviction server thread always resets the traversal point when exiting -- parts of the B-tree may never be traversed by eviction in some simple cases

    • Component: Storage Engines
    • Story Points: 8
    • Sprints: c(3x10^8)-StorEng - 2023-11-14, 2023-11-28 - Anthill Tiger, 2023-12-12 - Heisenbug, 2024-01-09 - I Grew Tired, StorEng - 2024-01-23, 2024-02-06 tapioooooooooooooca, 2024-02-20_A_near-death_puffin, 2024-03-05 - Claronald, 2024-03-19 - PacificOcean

      Summary
      When the eviction server thread exits at the end of an eviction run, it calls __evict_clear_walk(), which clears btree->evict_ref, the saved position of the current eviction walk through the btree. As a result, the next background eviction pass traverses the btree from the beginning; if that pass also finishes before scanning the whole tree, the pass after it starts from the beginning again, and parts of the tree are never scanned for eviction.
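
      As an illustration, here is a minimal, self-contained toy simulation (not WiredTiger code; NPAGES, BUDGET, and PASSES are invented parameters) of why clearing the walk point on every pass starves the rest of the tree:

          /* Toy model: a "tree" of NPAGES pages is scanned by eviction
           * passes that each examine BUDGET pages. Clearing the walk
           * point after every pass (as __evict_clear_walk() does when
           * the eviction server exits) means only the first BUDGET
           * pages are ever considered for eviction. */
          #include <stdio.h>

          #define NPAGES 1000 /* pages in the toy tree */
          #define BUDGET 100  /* pages examined per eviction pass */
          #define PASSES 50   /* number of eviction passes */

          int main(void) {
              int scanned[NPAGES] = {0};
              int walk_point = 0; /* analogous to btree->evict_ref */

              for (int pass = 0; pass < PASSES; pass++) {
                  for (int i = 0; i < BUDGET; i++)
                      scanned[(walk_point + i) % NPAGES]++;

                  /* Buggy behavior: the walk point is reset when the pass
                   * exits. Keeping walk_point = (walk_point + BUDGET) %
                   * NPAGES instead would spread scans over the whole tree. */
                  walk_point = 0;
              }

              /* Prints: page 0 scanned 50 times, page 999 scanned 0 times. */
              printf("page 0 scanned %d times, page %d scanned %d times\n",
                  scanned[0], NPAGES - 1, scanned[NPAGES - 1]);
              return 0;
          }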

       

      As a result, the current eviction mechanism cannot adapt to certain access patterns under certain (but common) conditions: it misjudges the hotness of data and evicts hot data from the cache. Workloads that focus on the front of the tree are much slower than those that focus on the back, because pages in the front are always scanned and evicted from the cache while pages in the back remain resident.

      Reproduction and Consequence

      This issue occurs in the following setup:

      1. I wrote an application that uses WiredTiger directly.
      2. The database consists of a single B-tree, sequentially filled with 10,000,000 key-value pairs of (16 + 72) bytes each, resulting in a ~1 GB uncompressed database. With compression enabled (snappy), the on-disk file size is ~500 MB.
      3. Total available memory for the application is 1 GB, and the WiredTiger cache size is set to 820000KB (though only ~700 MB is actually used by the application; this may be a separate issue). The workload runs 2,000,000 get requests in a single thread, either uniformly at random, focused (>80% of all requests) on the first half of the keys, or focused on the second half. Total latency differs substantially between these patterns (see the table below), for the reason described above: the pages in the first half of the tree are always chosen for eviction, regardless of the workload.
      4. All parameters other than the cache size, including those related to eviction, are left at their defaults. A configuration sketch follows this list.
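
      For reference, a minimal sketch of this setup using the public WiredTiger C API; "WT_HOME" and "table:repro" are placeholder names, and error handling is elided:

          #include <wiredtiger.h>

          int main(void) {
              WT_CONNECTION *conn;
              WT_SESSION *session;

              /* Cache size as in item 3; all eviction-related parameters
               * are left at their defaults, as in item 4. */
              wiredtiger_open("WT_HOME", NULL,
                  "create,cache_size=820000KB", &conn);
              conn->open_session(conn, NULL, NULL, &session);

              /* A single table with snappy block compression (item 2). */
              session->create(session, "table:repro",
                  "key_format=S,value_format=S,block_compressor=snappy");

              /* ... load 10,000,000 key-value pairs, then run the
               * 2,000,000 single-threaded get requests ... */

              conn->close(conn, NULL);
              return 0;
          }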

      I believe any read workload on a single large btree, with mild cache pressure (so the eviction thread finishes its work before scanning the whole tree) and a heavy cache-miss penalty (e.g., decompression cost), would reproduce similar results.
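
      The three access patterns can be sketched as follows; next_key() is a hypothetical helper, and the key count and ~80/20 split follow the description above (illustration only; assumes RAND_MAX >= NKEYS):

          #include <stdint.h>
          #include <stdlib.h>

          #define NKEYS 10000000

          /* pattern: 0 = uniform, 1 = focus on first half,
           * 2 = focus on second half. */
          static uint64_t next_key(int pattern) {
              uint64_t half = NKEYS / 2;
              switch (pattern) {
              case 1: /* most (~80%) requests hit keys [0, half) */
                  return rand() % 100 < 80 ? rand() % half
                                           : half + rand() % half;
              case 2: /* most (~80%) requests hit keys [half, NKEYS) */
                  return rand() % 100 < 80 ? half + rand() % half
                                           : rand() % half;
              default: /* approximately uniform */
                  return (uint64_t)rand() % NKEYS;
              }
          }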

      I have modified the code and verified that only pages in the first half of the tree are frequently evicted from the cache, regardless of the access pattern.

      This behavior has been confirmed on WiredTiger versions 10.0.0 (the current master branch) and 10.0.2 (the current develop branch).

       

      Workload | Focusing on first half | Uniform random | Focusing on second half
      ---------+------------------------+----------------+------------------------
      Latency  | 47 s                   | 36 s           | 14 s

       

      Attachments:
        1. wt_trace_eviction-8.png (61 kB)
        2. WiredTigerStat.29.06 (4.87 MB)
        3. simple_read.cpp (6 kB)
        4. screenshot-throughput.png (150 kB)
        5. screenshot-latency-95.png (155 kB)
        6. screenshot-latency-50.png (158 kB)
        7. screenshot-9.png (27 kB)
        8. screenshot-8.png (37 kB)
        9. screenshot-7.png (19 kB)
        10. screenshot-6.png (19 kB)
        11. screenshot-5.png (120 kB)
        12. screenshot-4.png (137 kB)
        13. screenshot-3.png (252 kB)
        14. screenshot-2.png (864 kB)
        15. screenshot-11.png (20 kB)
        16. screenshot-10.png (27 kB)
        17. screenshot-1.png (286 kB)
        18. random-rescent-histogram.png (72 kB)
        19. random-descent-timeline.png (26 kB)
        20. random-descent-no-memory-timeline-zoom.png (14 kB)
        21. random-descent-no-memory-timeline.png (37 kB)
        22. random-descent-no-memory-histogram.png (70 kB)
        23. image-2023-11-30-07-07-52-326.png (36 kB)
        24. cxxopts.hpp (44 kB)
        25. 2nd.png (662 kB)
        26. 1st.png (857 kB)

            Assignee: Yury Ershov (y.ershov@mongodb.com)
            Reporter: Yifan Dai (ydai59@wisc.edu)
            Votes: 0
            Watchers: 22

              Created:
              Updated:
              Resolved: