WiredTiger / WT-3577

Create a page describing LSM implementation in architecture guide

    • Type: Improvement
    • Resolution: Won't Fix
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: None
    • Component/s: LSM
    • Labels:

      We don't describe the architecture of the WiredTiger LSM tree implementation anywhere. A possible starting point is:

      The WiredTiger LSM tree implementation varies significantly from the pure LSM tree presented in the original paper (as I suspect the RocksDB LSM tree now does too). The main architectural points of interest are:

      • WiredTiger uses a pool of utility threads to manage background maintenance in LSM trees. The worker threads handle several different types of work: creating a new active chunk once the previous one is full (which is time critical), flushing old active chunks to disk, performing small and large merges, and several other operations. Each operation has a priority, and higher-priority operations take precedence.
      • The single pool of LSM workers is shared amongst all LSM trees in a WiredTiger connection (the LevelDB benchmark uses only a single LSM tree); a configuration sketch for the pool follows this list.
      • There is a throttling scheme to avoid cases where inserts into the in-memory chunk outpace how fast content can be flushed to disk.
      • There is a similar throttling scheme to ensure that the ingest rate doesn't overwhelm background merging; a purely hypothetical sketch of the shape of such a throttle follows this list.
      • WiredTiger doesn't break older (larger) chunks into ranges of data as LevelDB does. The WiredTiger merge algorithm is similar to the RocksDB universal compaction algorithm.
      • WiredTiger chooses sets of chunks to merge together based on a number of heuristics, some of which can be configured at tree create time (for example merge_min and merge_max; see the create-time example after this list). The heuristics attempt to merge chunks of similar age (and therefore similar size).
      • WiredTiger supports multi-key snapshot isolation transactions, which means an update may conflict with an update that was made in a previous chunk; we have a scheme for detecting those conflicts (an example of handling such a conflict follows this list).
      • There is an optimization for single-threaded in-order inserts, where the inserts are treated similarly to a btree bulk load: all content in one of these bulk loads goes into a single chunk of the tree, which is flushed when the bulk load completes (illustrated after this list). The LevelDB benchmark's fillseq phase uses this bulk load mode.
      • WiredTiger implements bloom filters to make queries more efficient; there are several options to configure the space/effectiveness trade-off for bloom filters (a create-time example follows this list).
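
      The sketches below are hedged illustrations in C against the public WiredTiger API, not excerpts from the implementation; table names, sizes and counts are invented for the example, and the fragments assume <wiredtiger.h> (and <stdio.h>) are included. First, sizing the shared worker pool at connection open time via the lsm_manager configuration:

        #include <stdio.h>
        #include <stdlib.h>
        #include <wiredtiger.h>

        int
        main(void)
        {
            WT_CONNECTION *conn;
            int ret;

            /*
             * All LSM trees on a connection share one pool of utility
             * threads: worker_thread_max bounds the pool size and
             * merge=true lets workers perform background merges.
             */
            if ((ret = wiredtiger_open("WT_HOME", NULL,
                "create,lsm_manager=(worker_thread_max=6,merge=true)",
                &conn)) != 0) {
                fprintf(stderr, "wiredtiger_open: %s\n",
                    wiredtiger_strerror(ret));
                return (EXIT_FAILURE);
            }

            return (conn->close(conn, NULL) == 0 ?
                EXIT_SUCCESS : EXIT_FAILURE);
        }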
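
      The throttle itself is internal to WiredTiger; purely as an illustration of its shape, a per-insert delay that grows with the flush/merge backlog might look like the following (every name here is hypothetical, none are WiredTiger internals):

        #include <stdint.h>

        /* Hypothetical ingest-throttle state; not WiredTiger internals. */
        struct lsm_throttle_state {
            uint64_t in_memory_bytes; /* bytes in the active chunk */
            uint64_t chunk_size;      /* configured chunk size */
            uint64_t backlog_chunks;  /* chunks awaiting flush or merge */
        };

        /* Microseconds an inserting thread should sleep before writing. */
        static uint64_t
        lsm_throttle_usecs(const struct lsm_throttle_state *s)
        {
            /* No throttling until the active chunk is half full. */
            if (s->in_memory_bytes * 2 < s->chunk_size)
                return (0);

            /* Back off harder as the flush/merge backlog grows. */
            return (s->backlog_chunks * 100);
        }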
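
      Creating an LSM tree with the merge heuristics configured; merge_min, merge_max and chunk_size are real create-time configuration keys, the values are only examples:

        /* Continuing from the connection opened above. */
        static int
        create_lsm_table(WT_CONNECTION *conn)
        {
            WT_SESSION *session;
            int ret;

            if ((ret = conn->open_session(conn, NULL, NULL, &session)) != 0)
                return (ret);

            /*
             * merge_min/merge_max bound how many chunks a single merge
             * combines; chunk_size controls when the active chunk is
             * considered full and a new one is created.
             */
            return (session->create(session, "table:docs",
                "type=lsm,key_format=S,value_format=S,"
                "lsm=(chunk_size=20MB,merge_min=4,merge_max=15)"));
        }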
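
      A sketch of the caller-visible side of conflict detection: an update that loses a snapshot-isolation conflict (including a conflict with an update in a previous chunk) returns WT_ROLLBACK and the transaction should be rolled back and retried:

        /* Update one key; the caller retries on WT_ROLLBACK. */
        static int
        update_one(WT_SESSION *session, WT_CURSOR *cursor,
            const char *key, const char *value)
        {
            int ret;

            if ((ret = session->begin_transaction(session, NULL)) != 0)
                return (ret);

            cursor->set_key(cursor, key);
            cursor->set_value(cursor, value);
            ret = cursor->update(cursor);
            if (ret == WT_ROLLBACK) {
                /* Conflict, possibly with an update in an older chunk. */
                (void)session->rollback_transaction(session, NULL);
                return (WT_ROLLBACK);
            }
            if (ret != 0) {
                (void)session->rollback_transaction(session, NULL);
                return (ret);
            }
            return (session->commit_transaction(session, NULL));
        }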
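
      The in-order insert pattern the bulk-load optimization targets: a single thread inserting monotonically increasing keys, as the LevelDB fillseq phase does (key format and count are illustrative):

        /* Insert ascending keys from a single thread. */
        static int
        fillseq(WT_SESSION *session)
        {
            WT_CURSOR *cursor;
            char key[16];
            int i, ret;

            if ((ret = session->open_cursor(session,
                "table:docs", NULL, NULL, &cursor)) != 0)
                return (ret);

            for (i = 0; i < 1000000 && ret == 0; ++i) {
                /* Zero-padded keys sort in insertion order. */
                (void)snprintf(key, sizeof(key), "%012d", i);
                cursor->set_key(cursor, key);
                cursor->set_value(cursor, "value");
                ret = cursor->insert(cursor);
            }

            if (ret != 0)
                (void)cursor->close(cursor);
            else
                ret = cursor->close(cursor);
            return (ret);
        }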
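
      Bloom filter configuration at create time; bloom, bloom_bit_count and bloom_hash_count are the relevant keys, trading space against filter effectiveness (the values shown are only examples):

        /* Create an LSM tree with explicitly tuned bloom filters. */
        static int
        create_bloom_table(WT_SESSION *session)
        {
            return (session->create(session, "table:docs_bloom",
                "type=lsm,key_format=S,value_format=S,"
                "lsm=(bloom=true,bloom_bit_count=16,bloom_hash_count=8)"));
        }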

            Assignee:
            backlog-server-storage-engines [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Alexander Gorrod
            Votes:
            0
            Watchers:
            5
