Improve IO behavior for sequential workloads


    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage Engines

      This is a suggestion for a possible future project. Some of it is based on speculation about how WT behaves. This speculation should be confirmed empirically as a first step before embarking on the larger proposal.

      Many I/O systems perform better with larger read/write sizes. EBS volumes, for example, provide a target IOPS level, where each I/O can be any size up to 256KB (for SSD-backed volumes). So a workload that reads a GB/s in 32KB reads would need 8x as many IOPS as a workload that reads the same amount of data in 256KB reads. Since EBS charges more for volumes with a higher IOPS capacity, generating fewer, larger I/Os may reduce cost.
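      To make the arithmetic concrete, here is a standalone illustration (not WiredTiger code) of the IOPS required to sustain a fixed throughput at the two I/O sizes:

        #include <stdio.h>

        int main(void) {
            const double bytes_per_sec = 1024.0 * 1024 * 1024; /* 1 GB/s target */
            const double small_io = 32.0 * 1024;               /* 32KB reads */
            const double large_io = 256.0 * 1024;              /* 256KB reads */

            /* Required IOPS = throughput / I/O size. */
            printf("32KB reads:  %6.0f IOPS\n", bytes_per_sec / small_io); /* 32768 */
            printf("256KB reads: %6.0f IOPS\n", bytes_per_sec / large_io); /*  4096 */
            return 0;
        }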

      Similarly, GCP storage engineers have said that GCP's cloud volumes can handle up to 256 concurrent requests of up to 256KB each. So with larger requests we can achieve higher throughput to their volumes.

      WiredTiger I/Os are typically governed by the configured leaf page size, which defaults to 32KB for MongoDB collections. There is an inherent trade-off with changing the leaf-page size, however. Larger page sizes are good for workloads and tables that have a lot of scans (accesses to sequential keys), but small page sizes are better for workloads with a lot of point queries. So increasing the leaf page size would increase our typical I/O size at the cost of less efficient point reads.
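      For reference, the leaf page size is a per-table setting supplied at creation time. A minimal sketch using the public WiredTiger API (the table name and key/value formats are illustrative, and error handling is omitted):

        #include <wiredtiger.h>

        void create_example_table(WT_SESSION *session) {
            /* leaf_page_max caps the on-disk leaf page size; MongoDB
             * configures 32KB for collections. */
            (void)session->create(session, "table:example",
                "key_format=S,value_format=S,leaf_page_max=32KB");
        }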

      But maybe we can leverage read ahead (WT-10716) to get the best of both worlds.

      By way of background, operating systems and file systems will coalesce adjacent read/write requests into larger requests to the underlying storage. So if we issue separate reads for blocks #10, #11, and #12, the operating system will typically group them into a single request to read three blocks starting at #10. So if we can arrange for sequential pages in a btree to land in sequential disk blocks, we could leverage this to have WT read/write smaller pages but get the benefit of larger I/O sizes when performing scans.
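      A minimal sketch of that access pattern in plain POSIX (illustrative, not WiredTiger code); whether the merge actually happens depends on the kernel and its I/O scheduler, but adjacency is the property that makes it possible:

        #include <unistd.h>

        #define BLOCK_SIZE 4096

        /* Read blocks #10, #11, and #12 with separate system calls. Because
         * the offsets are contiguous, kernel readahead and block-layer
         * request merging can turn these into one larger read from storage. */
        void read_adjacent(int fd, char buf[3][BLOCK_SIZE]) {
            (void)pread(fd, buf[0], BLOCK_SIZE, 10 * BLOCK_SIZE);
            (void)pread(fd, buf[1], BLOCK_SIZE, 11 * BLOCK_SIZE);
            (void)pread(fd, buf[2], BLOCK_SIZE, 12 * BLOCK_SIZE);
        }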

      Note that file systems try to place sequential blocks within a file on sequential blocks on storage. So we should generally be able to get this benefit if we can arrange for sequential btree pages to get written to sequential locations in a file.

      Today, there are times when WT writes pages that are sequential in the btree. For example, when we split a page, we generate multiple separate pages that are sequential. Currently, each page is written to the block manager separately, resulting in multiple independent allocations. If we could arrange for these writes to land in the file sequentially, then when the operating system flushes the pages to disk (i.e., at fsync), they would get coalesced.

      There are several ways we might make this happen:

      • Change the allocation algorithm: instead of best-fit, using next-fit (i.e., first-fit starting from the last place we did an allocation) may be more likely to place blocks written at the same time in larger spans of free space (see the sketch after this list).
      • With more work, we could capture this sequentiality at the block layer by writing sets of blocks. When we split a page, we would generate multiple block images in memory and pass them to the block manager in a single request, allowing the block manager to find an appropriately sized free space in the file to write them sequentially.
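      A minimal sketch of the next-fit idea over a circular free-extent list sorted by offset (the data structures and names here are illustrative, not WiredTiger's actual block manager):

        #include <stdint.h>

        struct extent {
            uint64_t off, len;   /* file offset and length of free space */
            struct extent *next; /* circular list, sorted by offset */
        };

        /* Next-fit: resume the search where the previous allocation left
         * off (*cursorp) instead of scanning for the best-sized hole.
         * Back-to-back allocations are then likely to be satisfied from
         * the same region, so pages written together tend to land at
         * adjacent offsets in the file. */
        uint64_t alloc_next_fit(struct extent **cursorp, uint64_t size) {
            struct extent *start = *cursorp, *e = start;
            do {
                if (e->len >= size) {
                    uint64_t off = e->off;
                    e->off += size; /* carve the allocation off the front */
                    e->len -= size;
                    *cursorp = e;   /* remember where to resume next time */
                    return off;
                }
                e = e->next;
            } while (e != start);
            return UINT64_MAX;      /* no extent large enough */
        }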

      In addition to page splits, we might also find sequential writes opportunistically by teaching eviction to look for logically sequential dirty pages. That is, when eviction selects a page to evict, it might also look at the pages immediately before and after it in the btree and check whether they are also plausible eviction candidates.
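      Sketched below with hypothetical helper names (none of these are real WiredTiger functions), the neighbor check might look like:

        #include <stdbool.h>
        #include <stddef.h>

        struct page; /* opaque in-memory btree page (illustrative) */

        /* Hypothetical helpers, for the sketch only. */
        extern struct page *btree_prev_leaf(struct page *);
        extern struct page *btree_next_leaf(struct page *);
        extern bool page_is_dirty(struct page *);
        extern bool page_is_evictable(struct page *);
        extern void queue_for_eviction(struct page *);

        /* When eviction picks a victim, also consider its logical
         * neighbors, so their reconciled images can be handed to the
         * block manager together and written to adjacent file offsets. */
        void evict_with_neighbors(struct page *victim) {
            struct page *prev = btree_prev_leaf(victim);
            struct page *next = btree_next_leaf(victim);

            if (prev != NULL && page_is_dirty(prev) && page_is_evictable(prev))
                queue_for_eviction(prev);
            queue_for_eviction(victim);
            if (next != NULL && page_is_dirty(next) && page_is_evictable(next))
                queue_for_eviction(next);
        }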

      If we succeed in writing logically sequential blocks to physically sequential storage locations, then read-ahead would let us get a similar benefit on the read side. When the application scans a table, we will generate multiple read requests for upcoming pages in the scan. By queuing multiple such requests to the operating system, we will provide the opportunity for the operating system to coalesce those into a single larger read to the storage system.
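      One way to queue those requests is to hand the kernel advance hints; a minimal sketch with posix_fadvise (illustrative; the actual mechanism in WT-10716 may differ):

        #define _XOPEN_SOURCE 600
        #include <fcntl.h>
        #include <sys/types.h>

        #define PAGE_SIZE (32 * 1024)

        /* Hint that the next 'n' pages of the scan will be needed soon.
         * If the pages sit at contiguous file offsets, the kernel is free
         * to satisfy the hints with one or a few large device reads
         * instead of n small ones. */
        void hint_readahead(int fd, off_t first_off, int n) {
            for (int i = 0; i < n; i++)
                (void)posix_fadvise(fd, first_off + (off_t)i * PAGE_SIZE,
                    PAGE_SIZE, POSIX_FADV_WILLNEED);
        }

      (A single WILLNEED hint covering the whole range would also work; the per-page form mirrors the per-page read requests described above.)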

      If this works well, it would also open the possibility of decreasing the page size to optimize for point queries, with the expectation that multi-page sequential scans would still perform well. The downside, however, would be a loss of compression efficiency from using smaller pages.

        1. ckpt.gif (19 kB)
        2. evict.gif (65 kB)
        3. ckpt.zoom.gif (101 kB)
        4. all.firstfit.gif (34 kB)

            Assignee: [DO NOT USE] Backlog - Storage Engines Team
            Reporter: Keith Smith
            Votes: 0
            Watchers: 14
