WiredTiger / WT-8836

Tiered Tables can only copy files to object storage after a new checkpoint completes

      For Tiered Tables, we need a way to determine when WiredTiger can safely copy a file to object storage. 

      In particular, we can't do this copy while there are still writes outstanding to the file.


      Normally, a tiered table has one local file, where all writes happen, and zero or more read-only objects that hold older data. Periodically, we start using a new local file and flush the old one to object storage. At the time of this transition there may be writes in progress to the file we will flush to object storage (e.g., eviction may be writing to the file). So we need to wait until these writes have finished before performing the copy.

      Suggested Approach

      We would prefer to avoid new locking or reference counting, as that would slow performance and add code complexity. Instead we can take advantage of the fact that a checkpoint of a table is effectively a barrier in the sequence of evictions to that table. In other words, once a checkpoint of a table completes, all evictions to that table that started before the checkpoint must also be complete.

      Thus if we switch writing from an old file to a new file, we know that all in-flight writes to the old file must be complete after a checkpoint writing to the new file completes. At that time it will be safe to copy the old file to object storage. Or, to rephrase this the way we will probably code it, when a checkpoint completes, we know that there are no outstanding writes to files older than the one the checkpoint wrote to.

      Note that if we flush more often than we checkpoint there may be multiple old files that we can copy to object storage when the checkpoint completes.
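
      The rule above can be sketched as a small helper. This is only an illustration, not WiredTiger code: it assumes objects are identified by consecutive, increasing integer IDs, and the function name and parameters are hypothetical. Given the oldest object not yet copied and the object the just-completed checkpoint wrote to, it lists every older object that is now safe to copy, which may be more than one if we flushed several times between checkpoints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * objects_safe_to_copy --
 *     Hypothetical helper: collect the IDs of all objects older than the one
 *     the completed checkpoint wrote to. Assumes object IDs are consecutive
 *     and increasing. Returns the number of IDs stored in "out".
 */
static size_t
objects_safe_to_copy(
  uint32_t oldest_uncopied, uint32_t ckpt_object, uint32_t *out, size_t out_max)
{
    size_t n = 0;

    assert(out != NULL || out_max == 0);

    /* Only objects strictly older than the checkpoint's object qualify. */
    for (uint32_t id = oldest_uncopied; id < ckpt_object && n < out_max; id++)
        out[n++] = id;
    return (n);
}
```

      For example, if objects 3, 4, and 5 have not been copied and a checkpoint that wrote to object 6 completes, all three become eligible at once; object 6 itself is not, since it may still have writes in flight.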

      Possible Implementation

      Based on a conversation with keith.bostic, sue.loverso, and donald.anderson, here's a possible way to implement this:

      Currently, when we switch to a new object in a Tiered Table, we enqueue a work item for the tiered server thread to copy the old object to object storage. We can add a field to the work structure, WT_TIERED_WORK_UNIT, and set it to the current checkpoint generation when we enqueue the item. Then the tiered server thread would only flush items where the current checkpoint generation is greater than the one stored in the item.

      The work items will be queued with monotonically increasing checkpoint generation values. So the tiered thread can simply check the checkpoint generation on the first item and skip processing if the current checkpoint generation hasn't advanced past that item's value yet.
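
      A minimal sketch of this queueing scheme follows. It is not the actual WiredTiger implementation: struct work_unit is a hypothetical stand-in for WT_TIERED_WORK_UNIT with only the fields needed here, the fixed-size FIFO and the function names are assumptions, and the caller is assumed to supply the current checkpoint generation. Because several flushes can happen between checkpoints, the sketch allows equal generations on adjacent items (non-decreasing rather than strictly increasing).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_MAX 16

/* Hypothetical stand-in for WT_TIERED_WORK_UNIT. */
struct work_unit {
    uint32_t object_id; /* Object to copy to object storage. */
    uint64_t ckpt_gen;  /* Checkpoint generation when the item was enqueued. */
};

/* Simple FIFO standing in for the tiered server's work queue. */
struct work_queue {
    struct work_unit items[QUEUE_MAX];
    size_t head, count;
};

/* Enqueue a flush request, stamping it with the current checkpoint generation. */
static bool
queue_push(struct work_queue *q, uint32_t object_id, uint64_t current_ckpt_gen)
{
    size_t last, tail;

    if (q->count == QUEUE_MAX)
        return (false);

    /* Generations must never go backwards along the queue. */
    if (q->count > 0) {
        last = (q->head + q->count - 1) % QUEUE_MAX;
        assert(q->items[last].ckpt_gen <= current_ckpt_gen);
    }

    tail = (q->head + q->count) % QUEUE_MAX;
    q->items[tail].object_id = object_id;
    q->items[tail].ckpt_gen = current_ckpt_gen;
    q->count++;
    return (true);
}

/*
 * Dequeue every item whose stored generation is older than the current one,
 * recording its object ID in "flushed". Stops at the first item that is not
 * yet safe; ordering guarantees no later item is safe either.
 */
static size_t
queue_flush_ready(
  struct work_queue *q, uint64_t current_ckpt_gen, uint32_t *flushed, size_t flushed_max)
{
    size_t n = 0;

    while (q->count > 0 && n < flushed_max) {
        struct work_unit *w = &q->items[q->head];
        if (current_ckpt_gen <= w->ckpt_gen)
            break; /* No checkpoint has completed since this was queued. */
        flushed[n++] = w->object_id;
        q->head = (q->head + 1) % QUEUE_MAX;
        q->count--;
    }
    return (n);
}
```

      Checking only the head of the queue keeps the server thread's fast path cheap: when no checkpoint has completed since the oldest item was queued, the loop exits after a single comparison.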



            sue.loverso@mongodb.com Susan LoVerso
            keith.smith@mongodb.com Keith Smith