- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Storage Engines
We should add a new API to WT_SESSION for publishing schema changes, and in the first pass, just implement it for table creation and table drop on the leader. The function signature should be:
WT_SESSION::publish(WT_SESSION *session, const char *uri, wt_timestamp_t epoch)
We need to agree on an appropriate name. If publish feels too simple or unspecific, maybe we could choose something like publish_at_schema_epoch instead.
Specifically:
- Implement a new function that determines whether schema operations should wait to publish their changes or publish them automatically without epochs (needed until the server is ready for the new API, and later for testing purposes), e.g., __wt_disagg_has_schema_epochs or __wt_disagg_publish_automatically. We still need to figure out exactly how this should work; one easy option is to require that stable_disaggregated_schema_epoch be set first.
- In __checkpoint_prepare, get the value of the stable schema epoch and save it to a new field checkpoint_disaggregated_schema_epoch in WT_TXN_GLOBAL, similarly to how the stable timestamp gets saved to checkpoint_timestamp. Make sure to do this while txn_global->rwlock is acquired as a write lock.
- Add a schema_epoch field to WT_DISAGG_METADATA_OP (defined in connection.h).
- Add a schema_epoch argument to __wt_disagg_enqueue_metadata_operation. Set it to 0 when creating and dropping tables. Set it to the checkpoint's schema epoch (checkpoint_disaggregated_schema_epoch) in __block_disagg_checkpoint_resolve, which is intended to write to the current checkpoint.
- Add a check to __wt_disagg_shared_metadata_queue_process to skip an operation if (1) the operation's schema_epoch is not yet stable, and (2) we are not publishing schema operations automatically. Make sure that the existing defer mechanism continues to work as it does today; this probably means that the new check should go after the defer check.
- Add the new WT_SESSION::publish API. It should go through the queue of pending metadata updates (conn->disaggregated_storage.shared_metadata_qh), and for each update that matches the supplied table name extracted from the URI, (1) fail if the operation's schema_epoch is larger than the supplied new epoch, (2) modify the operation's schema_epoch if it is 0, and (3) ensure that the epoch is non-decreasing for the same table.
- At this point, we might want to restrict the publish API so that it is enabled only on leaders.
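The skip check described in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual WiredTiger code: the sketch_* types and functions are hypothetical stand-ins for WT_DISAGG_METADATA_OP and the connection state, the "publish automatically until a stable schema epoch has been set" policy is just the easy option mentioned above, and treating an epoch of 0 as "not yet published" and an epoch equal to the stable epoch as "stable" are assumptions that still need to be agreed on.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant fields of WT_DISAGG_METADATA_OP. */
typedef struct {
    uint64_t schema_epoch; /* Assumption: 0 means "not yet assigned by publish". */
    bool deferred;         /* Stand-in for the existing defer mechanism. */
} sketch_metadata_op;

/* Hypothetical stand-in for the connection-level disaggregated storage state. */
typedef struct {
    uint64_t stable_schema_epoch; /* Assumption: 0 means "never set". */
} sketch_disagg_state;

/* One possible policy: publish automatically until a stable schema epoch has been set. */
static bool
sketch_publish_automatically(const sketch_disagg_state *st)
{
    return (st->stable_schema_epoch == 0);
}

/* Return true if queue processing should skip the operation for now. */
static bool
sketch_should_skip(const sketch_disagg_state *st, const sketch_metadata_op *op)
{
    /* The existing defer check runs first, so its behavior does not change. */
    if (op->deferred)
        return (true);

    /* Without schema epochs, everything that is not deferred gets published. */
    if (sketch_publish_automatically(st))
        return (false);

    /* Assumption: an unassigned epoch has not been published, so it cannot be stable. */
    if (op->schema_epoch == 0)
        return (true);

    /* Assumption: an epoch is stable once it is at or below the stable schema epoch. */
    return (op->schema_epoch > st->stable_schema_epoch);
}
```

Putting the new condition after the defer check keeps the existing defer behavior intact, which is the ordering suggested above.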
Note that there could be more than one pending operation for a table, for example because we currently add table creation operations in both __create_layered and __create_table. The function call should fail if it does not find any pending update. Also note that this linear scan is not particularly efficient, but because it is a purely in-memory operation, we can afford to fix it at a later stage if it becomes a problem.
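As a rough sketch of the queue walk that publish would perform, assuming the table name has already been extracted from the URI: the sketch_op list below is a hypothetical stand-in for conn->disaggregated_storage.shared_metadata_qh, and EINVAL and ENOENT are placeholders for whatever error codes we settle on. Validating in a first pass and assigning epochs in a second pass is one possible design choice, not a requirement; it is used here so that a failed call leaves the queue unmodified.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one entry in the pending metadata queue. */
typedef struct sketch_op {
    const char *table;      /* Table name the operation applies to. */
    uint64_t schema_epoch;  /* 0 means "not yet assigned". */
    struct sketch_op *next; /* Next pending operation, in queue order. */
} sketch_op;

/* Assign the epoch to every pending operation for the table, or fail without changes. */
static int
sketch_publish(sketch_op *head, const char *table, uint64_t epoch)
{
    sketch_op *op;
    uint64_t effective, prev;
    int found;

    /* Pass 1: validate only, so that a failure leaves the queue untouched. */
    found = 0;
    prev = 0;
    for (op = head; op != NULL; op = op->next) {
        if (strcmp(op->table, table) != 0)
            continue;
        found = 1;

        /* (1) Fail if the operation already has an epoch larger than the new one. */
        if (op->schema_epoch > epoch)
            return (EINVAL);

        /* (3) The epochs for the same table must be non-decreasing in queue order. */
        effective = op->schema_epoch == 0 ? epoch : op->schema_epoch;
        if (effective < prev)
            return (EINVAL);
        prev = effective;
    }

    /* Fail if there is no pending update for this table. */
    if (!found)
        return (ENOENT);

    /* Pass 2: (2) set the epoch on operations that do not have one yet. */
    for (op = head; op != NULL; op = op->next)
        if (strcmp(op->table, table) == 0 && op->schema_epoch == 0)
            op->schema_epoch = epoch;

    return (0);
}
```

With two pending creates for the same table (as can happen today with __create_layered and __create_table), a single call assigns the same epoch to both entries.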
It would most likely be easiest to implement this for both table create and drop in the same ticket. Support for table drop could be split out into another ticket if need be, provided that implementing table creation alone would not result in interesting bugs due to schema operations being reordered between create and drop. If the work needs to be split, you can do so by considering the type of the metadata operation: create (WT_SHARED_METADATA_CREATE) or update (WT_SHARED_METADATA_UPDATE), vs. remove (WT_SHARED_METADATA_REMOVE).