WT-11322: Copy blocks smaller than the chunk size to properly sized buffer

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Block Manager
    • Storage Engines
    • 5
    • Sprint: Joker - StorEng - 2023-10-17, 2024-01-09 - I Grew Tired, StorEng - 2024-01-23, 2024-02-06 tapioooooooooooooca, 2024-02-20_A_near-death_puffin, 2024-03-05 - Claronald, 2024-03-19 - PacificOcean, Megabat - 2024-05-14

      An alternative to the fix made in WT-10831 would be to copy blocks to new, correctly sized buffers during read.

      WT-10831 addressed a bug where WT incorrectly tracked the amount of memory allocated to the cache. Specifically, when reading a page that is smaller than the configured chunk size, WT would read the page into a chunk-sized piece of memory, but only charge the cache for the page size. For example, if we stored a 1KB page in a 4KB block, then during read, WT would allocate 4KB of memory, read the 4KB file chunk into that memory, then increase the internal cache statistics by only 1KB. On workloads with many small blocks of this type, WT would wind up drastically overcommitting the cache, sometimes leading to OOM kills.
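      To make the gap concrete, here is a minimal standalone sketch of the arithmetic; the variable names are illustrative and this is not the actual block-manager code, only the 4KB/1KB sizes from the example above:

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      int main(void) {
          size_t chunk_size = 4096;              /* on-disk allocation (chunk) size */
          size_t page_size = 1024;               /* logical page stored in that chunk */

          /* Pre-WT-10831: the read buffer is chunk sized... */
          unsigned char *buf = malloc(chunk_size);
          memset(buf, 0, chunk_size);            /* stand-in for the actual disk read */

          /* ...but the cache statistics were only increased by the page size. */
          printf("allocated %zu bytes, charged %zu bytes, untracked %zu bytes\n",
              chunk_size, page_size, chunk_size - page_size);

          free(buf);
          return 0;
      }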

      Note that this is only an issue for files that are not compressed. When blocks are compressed, WT reads the blocks from disk, gets the in-memory (i.e., decompressed) page size from the block header, allocates a properly sized buffer, and then decompresses the block into that buffer.

      WT-10831 addressed the problem by properly accounting for the size of the buffer used to read data from disk. So in my example, WT-10831 increases the cache stats by 4KB, rather than 1KB, thus accurately accounting for the memory consumed by the cache. The downside is that this results in less effective use of the cache, as we are (correctly) billing the cache for unused space in these buffers (i.e., 3KB of unused space in the example). This can hurt performance as shown in WT-11320.
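      To put a number on that cache-efficiency cost, a back-of-the-envelope illustration (the 1GB cache size is arbitrary; the 1KB-page-in-4KB-chunk workload is the same example as above):

      #include <stdio.h>

      int main(void) {
          size_t cache_size = 1UL << 30;         /* 1GB cache, arbitrary example */
          size_t page_size = 1024, chunk_size = 4096;

          /* Pages that fit when each is charged at its page size vs. its chunk size. */
          printf("charged at page size:  %zu pages\n", cache_size / page_size);
          printf("charged at chunk size: %zu pages\n", cache_size / chunk_size);
          return 0;
      }

      In this example the stricter accounting means the cache can hold only a quarter as many of these small pages.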

      An alternative approach would be to copy the data to a properly sized buffer when this happens. So in the example, we would read the 4KB block from disk, allocate a 1KB buffer to hold the page, copy the data into that 1KB buffer, free the original 4KB buffer, and increase the cache stats by 1KB. This would result in caching the same amount of data as before WT-10831, but with the overhead of a data copy for each affected block. Since these blocks are typically small, that overhead should be modest compared to the savings (i.e., avoiding future disk I/O by caching more data).
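      A minimal sketch of that read path; cache_bytes_inuse is a hypothetical stand-in for WT's real cache accounting, and a memset stands in for the disk read:

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      int main(void) {
          size_t chunk_size = 4096, page_size = 1024;
          size_t cache_bytes_inuse = 0;              /* hypothetical cache counter */

          unsigned char *chunk = malloc(chunk_size); /* chunk-sized read buffer */
          memset(chunk, 0, chunk_size);              /* stand-in for the disk read */

          unsigned char *page = malloc(page_size);   /* properly sized buffer */
          memcpy(page, chunk, page_size);            /* copy just the page bytes */
          free(chunk);                               /* drop the oversized buffer */

          cache_bytes_inuse += page_size;            /* charge only what is kept */
          printf("cache charged %zu of the %zu bytes read\n",
              cache_bytes_inuse, chunk_size);

          free(page);
          return 0;
      }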

      Because pages are rarely exact multiples of the chunk size, a naïve implementation would wind up copying essentially every uncompressed block. I would suggest adding a threshold so that we only copy if we save more than X bytes of cache space (or Y%?), relying on WT-10831 to keep the accounting accurate for other cases.
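      One possible shape for that threshold check; the X/Y values below are placeholders that would need benchmarking, and the function name is hypothetical:

      #include <stdbool.h>
      #include <stdio.h>

      #define COPY_MIN_SAVINGS_BYTES 1024u   /* "X bytes" of cache space saved */
      #define COPY_MIN_SAVINGS_PCT   25u     /* "Y%" of the chunk left unused */

      /* Copy to an exact-sized buffer only when the savings justify the memcpy. */
      static bool should_copy_to_exact_buffer(size_t chunk_size, size_t page_size) {
          size_t savings = chunk_size - page_size;
          return savings >= COPY_MIN_SAVINGS_BYTES ||
              savings * 100 >= COPY_MIN_SAVINGS_PCT * chunk_size;
      }

      int main(void) {
          /* 1KB page in a 4KB chunk saves 3KB: copy. */
          printf("copy? %d\n", should_copy_to_exact_buffer(4096, 1024));
          /* 4000-byte page in a 4KB chunk saves only 96 bytes: keep the chunk buffer. */
          printf("copy? %d\n", should_copy_to_exact_buffer(4096, 4000));
          return 0;
      }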

       

            Assignee:
            Unassigned
            Reporter:
            Keith Smith (keith.smith@mongodb.com)
            Votes:
            0
            Watchers:
            5
