- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Storage Execution
- Fully Compatible
- ALL
- Storage Execution 2025-09-29
- 200
With the following sequence of events, a table can exist on a secondary before the oplog entry that should have created it is applied:
- Some ordinary write happens with timestamp N.
- Primary begins to create a collection or index and creates the table, performing an untimestamped write to WT's metadata.
- Primary begins taking a checkpoint, which performs a snapshot-isolated read at timestamp N. Because writes to the metadata are untimestamped, this snapshot read includes the table that was just created.
- Primary writes the collection/index to the catalog with timestamp N+1.
- Secondary applies oplog up to timestamp N. Secondary does not have the table.
- Secondary installs checkpoint for timestamp N. Secondary now has the table.
- Secondary attempts to apply the N+1 createCollection oplog entry and discovers that the table ident it's trying to create already exists.
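
The sequence above can be replayed in a small toy model. This is only an illustrative sketch, not the server's or WT's actual code: `ToyStorage`, its methods, the ident `collection-7`, and the value of `N` are all hypothetical. The one property it is meant to capture is that table metadata is untimestamped (every checkpoint sees it) while catalog entries are timestamped (a checkpoint at N sees only entries written at or before N).

```python
class ToyStorage:
    """Hypothetical model of a node: untimestamped table metadata plus a timestamped catalog."""

    def __init__(self):
        self.tables = set()   # untimestamped metadata: visible to any snapshot
        self.catalog = {}     # ident -> timestamp of the catalog write

    def create_table(self, ident):
        # Mirrors the assumption that a create must not find an existing ident.
        if ident in self.tables:
            raise FileExistsError(ident)
        self.tables.add(ident)

    def write_catalog(self, ident, ts):
        self.catalog[ident] = ts

    def checkpoint(self, ts):
        # The snapshot read at `ts` filters timestamped catalog writes,
        # but cannot filter the untimestamped table metadata.
        visible = {i: t for i, t in self.catalog.items() if t <= ts}
        return {"tables": set(self.tables), "catalog": visible}

    def install_checkpoint(self, ckpt):
        self.tables = set(ckpt["tables"])
        self.catalog = dict(ckpt["catalog"])


def replay_race(n=10, ident="collection-7"):
    primary, secondary = ToyStorage(), ToyStorage()
    primary.create_table(ident)          # untimestamped metadata write
    ckpt = primary.checkpoint(n)         # checkpoint begins, reading at N
    primary.write_catalog(ident, n + 1)  # catalog write at N+1
    secondary.install_checkpoint(ckpt)   # secondary now has the table
    try:
        secondary.create_table(ident)    # applying the N+1 create entry
        outcome = "created"
    except FileExistsError:
        outcome = "ident already exists"
    return outcome, ident in secondary.catalog
```

In this model, `replay_race()` ends with the secondary holding the table but no catalog entry for it, so the create fails on the existing ident.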
Currently we assume that the only way for a table to exist before the oplog entry that creates it has been applied is a rollback or a bug. Advancing to a checkpoint can also produce this state, since we don't reload the catalog and clean up unexpected idents. We need to update the collection and index creation process to allow for this.
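
One possible shape for a creation path that tolerates this state is sketched below. It is an assumption for illustration, not the fix that was actually implemented: `apply_create`, `IdentConflict`, and the choice to adopt the pre-existing table (rather than, say, drop and recreate it) are all hypothetical. The sketch distinguishes an ident that exists without a catalog entry, which a checkpoint can legitimately leave behind, from an ident already owned by a catalog entry, which is still treated as an error.

```python
class IdentConflict(Exception):
    """The ident is already referenced by a catalog entry (hypothetical error type)."""


def apply_create(tables, catalog, ident, ts):
    """Apply a create oplog entry against a toy node state.

    tables:  set of idents present in storage (untimestamped metadata)
    catalog: dict mapping ident -> timestamp of its catalog entry
    Returns "created" if the table was made now, "adopted" if it already existed.
    """
    if ident in catalog:
        # A catalog entry already owns this ident: a real conflict.
        raise IdentConflict(ident)
    if ident in tables:
        # Table exists with no catalog entry, e.g. it arrived via a checkpoint.
        outcome = "adopted"
    else:
        tables.add(ident)
        outcome = "created"
    catalog[ident] = ts
    return outcome
```

For example, with `tables = {"collection-7"}` and an empty catalog, `apply_create(tables, catalog, "collection-7", 11)` adopts the existing table and records the catalog entry instead of failing.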
Note that inverting the order of creating the table and writing to the catalog would merely replace this bug with the inverse case, where a checkpoint is missing a table it should contain. It may be possible to avoid this scenario entirely by introducing some form of locking that prevents a checkpoint from starting between the two operations involved in creating collections and indexes (though we can't block collection creation for the full duration of a checkpoint).