- Type: Task
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication
- Fully Compatible
- Repl 2018-02-26
KVStorageEngine implementations persist their catalog as "yet another" record store, named `_mdb_catalog`. For storage engines that support `recoverToStableTimestamp`, this table is not journaled, so catalog changes are persisted only when a stable checkpoint is taken, or are reconstructed when create-collection oplog entries are replayed during replication recovery at startup.
Replication, naturally, does not replicate its own internal collections, so creating them produces no oplog entry to replay. This can lead to the following sequence:
- Exit initial sync at time T. T is also the stable timestamp.
- Node becomes a secondary.
- Create the `oplogTruncateAfterPoint` collection.
- Begin processing a batch, performing a write to `oplogTruncateAfterPoint`.
- The node crashes. The `oplogTruncateAfterPoint` document is required to correctly recover.
- Node restarts.
- MongoDB sees a storage engine table without a corresponding MongoDB collection, so the table gets removed.
- Replication recovery runs and assumes there was no `oplogTruncateAfterPoint`, resulting in data corruption.
Explicitly creating `oplogTruncateAfterPoint` before coming out of initial sync is sufficient to guarantee that if a node starts up and decides it has completed initial sync, then the `oplogTruncateAfterPoint` collection will exist.
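The failure sequence and the fix can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not MongoDB's actual code: `ToyNode`, `run`, and the value `42` are hypothetical, table writes are assumed to be durable immediately, and the catalog is assumed to survive a crash only as of the last stable checkpoint.

```python
TRUNCATE_COLL = "oplogTruncateAfterPoint"


class ToyNode:
    """Hypothetical model: tables are durable, the catalog only at checkpoints."""

    def __init__(self):
        self.tables = {}                  # storage-engine tables: name -> list of docs
        self.catalog = set()              # live catalog (stand-in for _mdb_catalog)
        self.checkpoint_catalog = set()   # catalog as of the last stable checkpoint

    def create_collection(self, name):
        # Creates the table and registers it in the (unjournaled) catalog.
        self.tables.setdefault(name, [])
        self.catalog.add(name)

    def write(self, name, doc):
        self.tables[name].append(doc)

    def take_stable_checkpoint(self):
        self.checkpoint_catalog = set(self.catalog)

    def crash_and_restart(self):
        # The catalog reverts to the last stable checkpoint.
        self.catalog = set(self.checkpoint_catalog)
        # Tables without a matching catalog entry are removed at startup.
        for name in list(self.tables):
            if name not in self.catalog:
                del self.tables[name]

    def truncate_after_point(self):
        # What replication recovery would see after restart.
        docs = self.tables.get(TRUNCATE_COLL)
        return docs[-1] if docs else None


def run(create_before_exiting_initial_sync):
    node = ToyNode()
    if create_before_exiting_initial_sync:
        node.create_collection(TRUNCATE_COLL)      # the fix: create it early
    node.take_stable_checkpoint()                  # exit initial sync at time T
    if not create_before_exiting_initial_sync:
        node.create_collection(TRUNCATE_COLL)      # created later, as a secondary
    node.write(TRUNCATE_COLL, {"oplogTruncateAfterPoint": 42})  # start of a batch
    node.crash_and_restart()
    return node.truncate_after_point()


if __name__ == "__main__":
    print("created after initial sync: ", run(False))
    print("created before initial sync:", run(True))
```

In this model, `run(False)` loses the document because the collection was never part of a checkpointed catalog, while `run(True)` keeps it because the collection already existed when the stable checkpoint at time T was taken.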