Major - P3
The goal of this ticket is to decide whether and how flush_tier and checkpoint operations should be linked in the WiredTiger API. This will provide a place to discuss the options and preserve the discussion. Whatever we decide will be implemented in follow-on tickets.
We have two related operations in WiredTiger today:
- WT_SESSION::checkpoint() takes a checkpoint of the database.
- WT_SESSION::flush_tier() flushes the state of tiered tables to object storage.
Today, these API calls are independent: the application can call either whenever it chooses. Internally, however, there are dependencies between these operations:
- After a successful flush_tier(), the state in object storage reflects the most recent checkpoint prior to the flush_tier() operation. I.e., if we were to perform an initial sync or restore from the cloud objects, that checkpoint is the point WiredTiger would restore to. (MongoDB might subsequently advance the state by replaying the OpLog.)
- A call to flush_tier() won't actually flush any files to object storage until after a subsequent checkpoint completes. This is how we ensure that any in-flight eviction completes before copying a file to object storage. See WT-8836 for a longer explanation.
The dependency between flush_tier() and a subsequent checkpoint seems like an issue that will be hard to explain and therefore lead to confusion and/or errors.
In a discussion today with email@example.com, firstname.lastname@example.org, and email@example.com, we considered some alternatives.
In particular, the desired behavior seems to be for the application to tell us when it wants to flush data, and to expect the flush to complete without further action on its part (i.e., without additional checkpoint calls). Likewise, the application would like to minimize the time lag in a couple of ways:
- Minimize the time between when flush is called and the time of the checkpoint that gets flushed. I.e., if a flush at time T writes a checkpoint taken at T - X to object storage, then we don't want X to be too big. (Note that I am deliberately being vague about whether T is the time when we call flush_tier() or the time when it completes.)
- Minimize the time it takes for flush_tier() to complete. I.e., if flush_tier() can't complete until a subsequent checkpoint, then we don't want to wait a long time for that checkpoint to happen.
We discussed a couple ways of making the connection between checkpoints and flushes more explicit:
- Eliminate the flush_tier() call and replace it with a checkpoint option that says, take a checkpoint and then flush it.
- Incorporate a following checkpoint into flush_tier(). I.e., after flush_tier() finishes the work it does today, it will trigger a checkpoint rather than waiting for one. Today we have a debug option that does this. So this would make the debug option the normal behavior.
- Incorporate a preceding checkpoint into flush_tier(). Instead of flush capturing whatever the most recent checkpoint was, flush_tier() will trigger a checkpoint and then immediately follow it with a flush. Given that checkpoints sometimes run a long time, this should be smart enough to notice a checkpoint that is already in progress and piggy-back on that instead of triggering yet another checkpoint.
- Some combination of the previous two options (a following and a preceding checkpoint).
- Leave things as they are today.
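At the API level, the first three alternatives might look roughly like this. The config keys shown here are illustrative sketches only, not committed syntax:

```c
/* Option 1: fold the flush into checkpoint (hypothetical config key). */
ret = session->checkpoint(session, "flush_tier=(enabled)");

/* Option 2: flush_tier() triggers a following checkpoint itself, as the
 * existing debug option does today (option name illustrative). */
ret = session->flush_tier(session, "force_checkpoint=true");

/* Option 3: flush_tier() first takes (or joins an in-progress) checkpoint,
 * then flushes it (option name illustrative). */
ret = session->flush_tier(session, "checkpoint_first=true");
```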
One constraint is that the design for sharing tiered tables in a replica set assumes that we can pass the results of flush_tier() – a cookie of some sort representing the newly flushed state – to the other replicas. Currently we expect this to happen by having flush_tier() return the cookie to the caller in the server, which can then forward the cookie to other replicas via the OpLog.
So any change to the API will need to provide a way for the server to get the result of a flush.