  1. WiredTiger
  2. WT-11404

Do not create tiered table's local file until first write

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None

      This is a suggestion for tiered tables: create the local files, with their needed metadata, on demand.  That is, when first creating a table configured with tiering, we do not create the empty foo-0000000001.wtobj file.  This file, and its metadata entry, would be created either on the first write or as a side effect of the first eviction or checkpoint.  The same rule would apply when we complete a flush_tier of a table: a new empty file with the next number is not created immediately.
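
      A minimal sketch of the on-demand rule, as a toy Python model rather than WiredTiger code.  The class and method names (LazyTieredTable, _ensure_local_object) and the dict standing in for metadata are hypothetical; only the foo-0000000001.wtobj naming and the flush_tier behavior come from the description above.

```python
import os
import tempfile


class LazyTieredTable:
    """Toy model: a tiered table that defers creating its local object file."""

    def __init__(self, directory, name):
        self.directory = directory
        self.name = name
        self.next_object = 1      # number the next local object will get
        self.local_path = None    # no local file exists yet
        self.metadata = {}        # hypothetical stand-in for metadata entries

    def _ensure_local_object(self):
        # Create the file and its metadata entry only when first needed.
        if self.local_path is None:
            filename = f"{self.name}-{self.next_object:010d}.wtobj"
            self.local_path = os.path.join(self.directory, filename)
            open(self.local_path, "wb").close()
            self.metadata[filename] = {"object": self.next_object}
        return self.local_path

    def write(self, data):
        # First write; eviction or checkpoint would go through the same path.
        with open(self._ensure_local_object(), "ab") as f:
            f.write(data)

    def flush_tier(self):
        # Nothing written since the last flush: no work, and no new file.
        if self.local_path is None:
            return None
        flushed = self.local_path
        self.local_path = None    # the next object is NOT created eagerly
        self.next_object += 1
        return flushed


def demo():
    """Return directory listings before a write, after it, and after a flush."""
    with tempfile.TemporaryDirectory() as d:
        table = LazyTieredTable(d, "foo")
        before = sorted(os.listdir(d))
        table.write(b"x")
        after_write = sorted(os.listdir(d))
        table.flush_tier()
        after_flush = sorted(os.listdir(d))
        return before, after_write, after_flush
```

      In this model a table that is created and never written leaves nothing in the directory, and a flush_tier does not leave a new empty object behind.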

      The advantages become clear when we talk about sharing for tiered store.  In that world, we have a union table and a separate, plain-old local btree collecting all the write traffic.  The tiered table stays idle (and its local component completely empty) until a flush_tier.  Then, and only if anything has been written to the table since the last flush_tier, a write is made to the local file.  Compare that to our current system: there's generally less in the file system, and there doesn't have to be an extra file handle open.

      On an accepting node it's more dramatic.  If we're "up to" version 7, e.g. foo-07.wtobj in the cloud, then on the accepting node, we've already created foo-08.wtobj locally.  If we're notified of a new version 8, then we must remove our foo-08.wtobj, fetch that object from the cloud, and create foo-09.wtobj.  A bit wasteful.  Again, that is potentially an extra file handle open per table.
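
      The difference on the accepting node can be written out as a list of file operations.  This is only an illustration of the scenario above; the function name and the tuple format are made up for the example, and the two-digit file names follow the shorthand used in this ticket.

```python
def accept_new_version(current, eager):
    """List the file operations an accepting node performs for version current + 1.

    eager=True models today's behavior, where the next empty local file
    already exists; eager=False models the proposal.
    """
    new = current + 1
    incoming = f"foo-{new:02d}.wtobj"
    ops = []
    if eager:
        # The pre-created empty file collides with the incoming object.
        ops.append(("remove", incoming))
    ops.append(("fetch", incoming))
    if eager:
        # Eagerly create the next empty local file.
        ops.append(("create", f"foo-{new + 1:02d}.wtobj"))
    return ops
```

      With current=7, the eager case yields remove foo-08, fetch foo-08, create foo-09; the lazy case yields only the fetch.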

      Thinking about restoring a backup or initial sync, one step to starting up a system with thousands of tiered tables in an empty directory would be to create thousands of local files.  This might be wasteful, especially when considering the possibility that some of these tables may never be written to again.  Also think open file handles.

      Someday, we may have the notion of readonly tables in the cloud.  That's another use case where we don't want or need a local file to be ready to go.

      Obviously, there is a downside when an idling system is hit with a lot of write traffic and requires eviction on a large number of files at the same time.  There could be a fair amount of contention on writing metadata.  So an alternate approach is to create the file (and the extra metadata) not on the first file write, but on the first cursor modification operation.  That could potentially spread out the contention.
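
      To make the contention argument concrete, here is a small back-of-the-envelope model.  It assumes time is divided into ticks, that every table is first modified before a single eviction burst, and that each file creation costs one metadata write; the function and parameter names are hypothetical.

```python
from collections import Counter


def metadata_writes_per_tick(first_mod_ticks, eviction_tick, policy):
    """Count file creations (metadata writes) that land on each tick.

    first_mod_ticks: tick of the first cursor modification for each table.
    eviction_tick:   tick at which eviction first writes all of those tables
                     (assumed to be later than every first modification).
    policy:          "first_write" or "first_cursor_mod".
    """
    writes = Counter()
    for tick in first_mod_ticks:
        if policy == "first_cursor_mod":
            writes[tick] += 1            # created when the table is first modified
        elif policy == "first_write":
            writes[eviction_tick] += 1   # created when eviction first writes it
        else:
            raise ValueError(f"unknown policy: {policy}")
    return dict(writes)
```

      For four tables first modified at ticks 1, 2, 2 and 5 with eviction at tick 6, the first-write policy puts all four metadata writes on tick 6, while the cursor-modification policy spreads them over ticks 1, 2 and 5.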

      For extra credit (or a future skunkworks project), building on creating the file/metadata on the first cursor modification: we don't do the create synchronously, but queue it to be done asynchronously.  The creations would have to stay ahead of eviction or checkpoint, which want to write to the file.  So maybe the eviction threads own the task of working that queue.
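
      One possible shape for that queue, sketched single-threaded so the ordering is easy to follow.  Everything here is an assumption for illustration: the AsyncCreator class, its method names, and the use of a set to stand in for "file and metadata exist".  A real version would need locking and would run drain() on the eviction threads.

```python
from collections import deque


class AsyncCreator:
    """Toy model: queue file creations at cursor-modify time, drain them later."""

    def __init__(self):
        self.pending = deque()   # tables waiting for their local file
        self.queued = set()      # tables currently in the queue
        self.created = set()     # stand-in for "file and metadata exist"

    def on_cursor_modify(self, table):
        # Cheap path for the application thread: enqueue at most once.
        if table not in self.created and table not in self.queued:
            self.queued.add(table)
            self.pending.append(table)

    def drain(self, limit=None):
        # Work the queue; an eviction thread could call this between evictions.
        done = 0
        while self.pending and (limit is None or done < limit):
            table = self.pending.popleft()
            self.queued.discard(table)
            self.created.add(table)
            done += 1
        return done

    def evict(self, table):
        # Creation must stay ahead of eviction: catch up before writing.
        if table not in self.created:
            self.drain()
        if table not in self.created:
            raise RuntimeError(f"{table} was never queued for creation")
        return f"wrote {table}"
```

      The point of the sketch is the invariant in evict(): a write never proceeds until the queued creation for that table has been completed.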

            Assignee:
            backlog-server-storage-engines [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            donald.anderson@mongodb.com Donald Anderson
            Votes:
            0
            Watchers:
            3
