Structure Layout Optimization POC in WiredTiger

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Cursors, Performance
    • Storage Engines, Storage Engines - Foundations

      This project developed a methodology for making evidence-based choices about field layout in WiredTiger's hot internal structs, and demonstrated it on WT_SESSION_IMPL and WT_CURSOR_BTREE. The methodology uses:

      • AI-assisted analysis to identify which fields are touched together on hot paths
      • rearrangement of struct fields, guided by that analysis and also by taste, since we want to preserve logical groupings
      • WT_STRUCT_LAYOUT() macros that document and enforce cache-line group boundaries via compiler directives
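
      As an illustration of the last point, here is a minimal sketch of how a marker macro can both document a cache-line group and enforce its boundary. It is not the WT_STRUCT_LAYOUT() implementation: DEMO_LAYOUT_GROUP, DEMO_CACHE_LINE, struct demo_cursor, and all field names are hypothetical, and the 64-byte line size is an assumption that varies by CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64 /* assumed cache-line size; varies by CPU */

/*
 * Hypothetical marker: the member that follows starts a new cache-line group.
 * The name argument is documentation only; the alignment is what the
 * compiler enforces.
 */
#define DEMO_LAYOUT_GROUP(name) _Alignas(DEMO_CACHE_LINE)

struct demo_cursor {
    /* Group "search": fields assumed to be read together on a lookup. */
    DEMO_LAYOUT_GROUP(search) void *page;
    uint32_t slot;
    int compare;

    /* Group "update": fields assumed to be written together on a modify. */
    DEMO_LAYOUT_GROUP(update) void *modify_list;
    uint64_t txn_id;
};

/* Compile-time checks that the boundaries are where the comments say. */
_Static_assert(offsetof(struct demo_cursor, page) == 0,
    "search group must start the struct");
_Static_assert(offsetof(struct demo_cursor, modify_list) == DEMO_CACHE_LINE,
    "update group must start on the second cache line");
```

      The trade-off is size: padding each group out to a line boundary grows the struct, which is cheap for a handful of large per-session structs and expensive for small, numerous ones.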

      This POC shows several wins in targeted workloads, such as a consistent 2+% improvement in ecommerce workloads and a consistent 1.7% improvement across 5 sub-workloads of mixed_workloads_locust. More importantly, targeted analysis of workload regressions, followed by struct adjustments, has reduced these regressions substantially. Working on regressions in this way is a powerful focusing strategy, because the resulting changes generally help other workloads as well. This POC has ended due to time constraints, but there are strong indications that most or all of the stable regressions can be reduced to noise levels.

      Even more exciting, further gains are possible. WT_DATA_HANDLE, WT_BTREE, and WT_CONNECTION_IMPL are all candidates for reordering. Btree internal structs should also be examined for potential wins, although gains on smaller structs are likely to be harder to achieve without growing their size.

      Another part of future work is preserving these hard-won gains. It is all too easy for any WT PR to insert a new field in the middle of a well-crafted layout. There are straightforward ways to prevent or detect this, but the POC was too short to develop them.
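
      One possible detection mechanism, sketched here under assumptions and not developed in this POC, is a compile-time check that a named group of fields still starts on a cache line and still fits within one. DEMO_SPAN, DEMO_ASSERT_GROUP, and struct demo_session are hypothetical names, and the 64-byte line size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64 /* assumed cache-line size */

/* Bytes covered by the members first..last of type, inclusive. */
#define DEMO_SPAN(type, first, last) \
    (offsetof(type, last) + sizeof(((type *)0)->last) - offsetof(type, first))

/* Fail the build if the group moves off a line start or outgrows one line. */
#define DEMO_ASSERT_GROUP(type, first, last)                      \
    _Static_assert(offsetof(type, first) % DEMO_CACHE_LINE == 0,  \
        "group must start on a cache line");                      \
    _Static_assert(DEMO_SPAN(type, first, last) <= DEMO_CACHE_LINE, \
        "group no longer fits in one cache line")

struct demo_session {
    /* Hot group: illustrative fields only. */
    _Alignas(DEMO_CACHE_LINE) uint64_t txn_id;
    uint64_t snapshot_min;
    uint32_t flags;
    uint32_t api_depth;

    /* Cold group. */
    _Alignas(DEMO_CACHE_LINE) char name[32];
};

/* A PR that pushes the hot group past 64 bytes now breaks the build. */
DEMO_ASSERT_GROUP(struct demo_session, txn_id, api_depth);
```

      A check like this catches growth and misalignment, but not a cold field inserted into a group that still fits in its line; catching that would need review tooling or a comparison of the field order against a recorded layout.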

      A few layout strategies were used, and are relatively easy to apply given the AI analysis:

      • Once we know which fields are used together at around the same time, we can place them next to each other to take advantage of locality in the L1 cache.
      • Once we know which fields are generally hot, we can group them in their own hot cache line(s).
      • Once we know which fields are shared among multiple threads, we can group them with cold fields, away from the hot lines. This can eliminate the false-sharing anti-pattern.
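
      The third strategy can be sketched as follows. The struct, its fields, and the helper demo_same_line() are hypothetical and chosen only for illustration; a 64-byte cache line is assumed. The thread-shared counter sits on the cold line, so writes to it from other threads do not invalidate the line holding the hot, read-mostly fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64 /* assumed cache-line size */

struct demo_handle {
    /* Hot line: assumed to be read on nearly every operation. */
    _Alignas(DEMO_CACHE_LINE) void *root;
    uint32_t flags;
    uint32_t key_format;

    /*
     * Cold line: rarely read fields, plus the counter that many threads
     * write. Invalidations caused by the counter only hit cold data.
     */
    _Alignas(DEMO_CACHE_LINE) uint64_t shared_refcount;
    char name[40];
    uint64_t create_time;
};

/*
 * Return nonzero if two member offsets fall in the same cache line. This
 * holds for a real object only when the object itself is line-aligned,
 * which _Alignas gives for static and automatic storage; heap objects
 * need an aligned allocator.
 */
static inline int
demo_same_line(size_t off_a, size_t off_b)
{
    return (off_a / DEMO_CACHE_LINE == off_b / DEMO_CACHE_LINE);
}
```

      For example, demo_same_line(offsetof(struct demo_handle, root), offsetof(struct demo_handle, shared_refcount)) evaluates to 0, confirming that the shared counter and the hot fields occupy different lines.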

      A summary of 6 validation patches run at the same time (3 baseline, 3 with the changes) will be attached in the comments, as well as code differences.

            Assignee:
            Donald Anderson
            Reporter:
            Donald Anderson
