WiredTiger / WT-12000

Test race between checkpoint and tiered_server copying to object store

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Checkpoints
    • StorEng - Defined Pipeline

      WT-10989 builds upon earlier code to synchronise a checkpoint with the tiered server. The checkpoint queues the entries that the tiered_server processes. Each entry identifies a file from a flush_tier checkpoint call: the older file left behind by a tier switch, which needs to be uploaded to the object store.

      The switch of an underlying file by a checkpoint is done in several steps (a simplified sketch in code follows the list):

      1. Checkpoint sets conn->flush_checkpoint_complete to false
      2. Checkpoint prepare (__checkpoint_flush_tier) identifies that a tiered table needs to switch the underlying file. It schedules the block manager to do so in the later phase of the checkpoint. Then it pushes an entry into the work queue shared with the tiered server. This entry identifies the older file from the switch, which is to be copied to the next tier / object store.
      3. During the actual checkpoint of individual files, the block manager switches the underlying file, and syncs and finalizes the older file in separate steps.
      4. After the completion of the checkpoint, conn->flush_checkpoint_complete is set to true, and the tiered server is signalled to start processing the queued entries.
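
      To make these steps concrete, below is a minimal sketch of the checkpoint-side sequence. CONN, WORK_ENTRY, checkpoint_flush_tier() and the object name are simplified, hypothetical stand-ins rather than the real WiredTiger structures and functions; only the flush_checkpoint_complete flag corresponds to a field named above.

{code:c}
/* Hypothetical stand-ins for the real WiredTiger structures; error handling elided. */
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct work_entry {
    char *old_object_name;        /* older file left behind by the tier switch */
    struct work_entry *next;
} WORK_ENTRY;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t tiered_cond;   /* wakes the tiered server */
    WORK_ENTRY *work_head;        /* work queue shared with the tiered server */
    bool flush_checkpoint_complete;
} CONN;

/* Step 2 (second half): queue the older file from the switch for the tiered server. */
static void
checkpoint_flush_tier(CONN *conn, const char *old_object_name)
{
    WORK_ENTRY *entry = calloc(1, sizeof(*entry));

    entry->old_object_name = strdup(old_object_name);
    pthread_mutex_lock(&conn->lock);
    entry->next = conn->work_head;
    conn->work_head = entry;
    pthread_mutex_unlock(&conn->lock);
}

static void
checkpoint(CONN *conn)
{
    /* Step 1: the flush is not complete while the checkpoint runs. */
    pthread_mutex_lock(&conn->lock);
    conn->flush_checkpoint_complete = false;
    pthread_mutex_unlock(&conn->lock);

    /* Step 2: schedule the block manager to switch and queue the old file. */
    checkpoint_flush_tier(conn, "example-0000000001.wtobj");

    /*
     * Step 3: per-file checkpoint; the block manager switches the underlying
     * file, then syncs and finalizes the older file (elided here).
     */

    /* Step 4: mark the flush complete and wake the tiered server. */
    pthread_mutex_lock(&conn->lock);
    conn->flush_checkpoint_complete = true;
    pthread_cond_signal(&conn->tiered_cond);
    pthread_mutex_unlock(&conn->lock);
}
{code}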

      Besides being woken by the checkpoint-completion signal, the tiered_server also wakes up periodically to check whether entries in the queue need to be processed.

      tiered_server calls __tier_storage_copy() to copy the files to the next tier. Although it checks conn->flush_checkpoint_complete, a checkpoint can always race with it and start after that check (see the sketch below).
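
      Sketched below is the corresponding server-side loop, continuing the hypothetical CONN/WORK_ENTRY stand-ins from the sketch above; tier_storage_copy() here is a toy stand-in for __tier_storage_copy(). The comment marks the window in which a checkpoint can begin after the flag check.

{code:c}
#include <time.h>

/* Toy stand-in for __tier_storage_copy(): drain the queue and "copy" each file. */
static void
tier_storage_copy(CONN *conn)
{
    WORK_ENTRY *entry;

    for (;;) {
        pthread_mutex_lock(&conn->lock);
        entry = conn->work_head;
        if (entry != NULL)
            conn->work_head = entry->next;
        pthread_mutex_unlock(&conn->lock);
        if (entry == NULL)
            break;

        /* Copy entry->old_object_name to the next tier / object store (elided). */
        free(entry->old_object_name);
        free(entry);
    }
}

static void *
tiered_server(void *arg)
{
    CONN *conn = arg;
    struct timespec deadline;
    bool may_copy;

    for (;;) {
        pthread_mutex_lock(&conn->lock);
        /* Wake on the checkpoint-completion signal or after a periodic timeout. */
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 30;
        (void)pthread_cond_timedwait(&conn->tiered_cond, &conn->lock, &deadline);
        may_copy = conn->flush_checkpoint_complete;
        pthread_mutex_unlock(&conn->lock);

        /*
         * RACE WINDOW: a new checkpoint can start here, clear
         * flush_checkpoint_complete and queue fresh entries before the
         * call below runs.
         */
        if (may_copy)
            tier_storage_copy(conn);
    }
    return (NULL);
}
{code}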

      In WT-10989, I added a means to tag each entry in the tiered server's work queue with the generation number of the checkpoint that created the work item. I also added code in __tier_storage_copy() so that, even if it races with a checkpoint, it will not process entries from any checkpoint that has yet to finish.
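
      As a rough illustration of that scheme (the field and function names here are invented for the sketch, not the ones WT-10989 actually adds): each queued entry carries the generation of the checkpoint that created it, and the copy path only processes entries whose checkpoint has completed.

{code:c}
#include <stdbool.h>
#include <stdint.h>

/* Invented field names for illustration; the real ones come from WT-10989. */
typedef struct tagged_work_entry {
    char *old_object_name;
    uint64_t ckpt_gen;                 /* generation of the checkpoint that queued this */
    struct tagged_work_entry *next;
} TAGGED_WORK_ENTRY;

/*
 * In the copy path, an entry is only processed once the checkpoint that created
 * it has completed, so a racing in-progress checkpoint's entries are skipped.
 */
static bool
entry_is_ready(const TAGGED_WORK_ENTRY *entry, uint64_t last_completed_ckpt_gen)
{
    return (entry->ckpt_gen <= last_completed_ckpt_gen);
}
{code}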

      This race between __tier_storage_copy() and the checkpoint is currently untested. This ticket should add appropriate testing for the system added as part of WT-10989.

      donald.anderson@mongodb.com has some ideas on how this can be done.
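
      One possible shape for such a test, using the stand-ins from the sketches above: hold the server between its flush_checkpoint_complete check and the copy call (the pthread barrier below stands in for whatever timing-stress hook or test-only synchronization point is actually used), start a checkpoint inside that window, and then assert that the new checkpoint's entries were not copied. This is only an outline of the interleaving, not the actual test harness.

{code:c}
#include <pthread.h>
#include <stdbool.h>

/* Barrier standing in for a test-only synchronization point in the race window. */
static pthread_barrier_t race_point;

static void *
server_thread(void *arg)
{
    CONN *conn = arg;
    bool may_copy;

    pthread_mutex_lock(&conn->lock);
    may_copy = conn->flush_checkpoint_complete;   /* passes: an earlier flush finished */
    pthread_mutex_unlock(&conn->lock);

    pthread_barrier_wait(&race_point);            /* hold the server in the race window */

    if (may_copy)
        tier_storage_copy(conn);                  /* now overlaps the new checkpoint */
    return (NULL);
}

static void *
checkpoint_thread(void *arg)
{
    CONN *conn = arg;

    pthread_barrier_wait(&race_point);            /* released once the server passed its check */
    checkpoint(conn);                             /* clears the flag, queues new entries */
    return (NULL);
}

int
main(void)
{
    static CONN conn = {
        .lock = PTHREAD_MUTEX_INITIALIZER, .tiered_cond = PTHREAD_COND_INITIALIZER};
    pthread_t server, ckpt;

    conn.flush_checkpoint_complete = true;        /* a previous flush_tier completed */

    pthread_barrier_init(&race_point, NULL, 2);
    pthread_create(&server, NULL, server_thread, &conn);
    pthread_create(&ckpt, NULL, checkpoint_thread, &conn);
    pthread_join(server, NULL);
    pthread_join(ckpt, NULL);

    /*
     * With the WT-10989 generation check in place, entries queued by the
     * still-racing checkpoint must not have been copied; the real test would
     * assert exactly that here.
     */
    return (0);
}
{code}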

      Another case to consider in the co-ordination between tiered_server and a checkpoint is the following:

      Imagine checkpoint 1 at timestamp 100 switches a file. The task to flush the old file is put on the queue for the tiered server. Before the server gets to it, checkpoint 2 at timestamp 200 switches the file again and another flush entry is queued for the tiered server. Now, when the server processes the entry from checkpoint 1 and calls tier_flush_meta, it obtains the flush timestamp for the last flush from btree->flush_most_recent_ts, which will be 200 instead of 100. So we need more co-ordination here: each flush task should be marked with the metadata specific to that particular flush (see the sketch below).
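
      One hedged way to address this, again with invented field names: capture the flush timestamp (and any other per-switch metadata) on the work item when it is queued, so the server writes the value belonging to that particular switch instead of reading btree->flush_most_recent_ts after a later switch may have advanced it.

{code:c}
#include <stdint.h>

/* Invented names for illustration: per-item flush metadata captured at enqueue time. */
typedef struct flush_work_entry {
    char *old_object_name;
    uint64_t ckpt_gen;         /* checkpoint generation, as in the earlier sketch */
    uint64_t flush_ts;         /* flush timestamp recorded when this entry was queued */
    struct flush_work_entry *next;
} FLUSH_WORK_ENTRY;

/*
 * When the server processes the entry, it would write entry->flush_ts into the
 * metadata rather than reading btree->flush_most_recent_ts, which a later switch
 * (checkpoint 2 at timestamp 200 in the example above) may already have changed.
 */
{code}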

      cc. keith.smith@mongodb.com

            Assignee: backlog-server-storage-engines [DO NOT USE] Backlog - Storage Engines Team
            Reporter: sulabh.mahajan@mongodb.com Sulabh Mahajan
            Votes: 0
            Watchers: 2
