The replication rollback project is running into a problem when replication's recovery rolls forward from a point earlier than the durable timestamp (i.e., the timestamp used by the last checkpoint).
For example, replaying a dropDatabase during roll-forward gets confused by collections that were created after the dropDatabase executed but before the final checkpoint completed.
MongoDB currently tracks the effective checkpoint timestamp by writing a document containing the stable_timestamp before calling WT_SESSION::checkpoint. This has several problems:
- there are situations (such as shutdown) where writing a document is difficult (e.g., because the Global lock is held exclusively); and
- there is a race between writing the document and the checkpoint choosing which stable_timestamp to use. We deliberately do not want to pin stable_timestamp in place for the whole checkpoint operation, even where that would be possible: freezing it would defeat I/O optimizations in WiredTiger (aka "scrubbing"), where dirty data is flushed from cache by multiple threads before the critical section of the checkpoint starts.
If instead WiredTiger stored the stable_timestamp chosen by the checkpoint as part of the metadata (specifically the metadata for WiredTiger.wt, aka the turtle file), and allowed it to be queried after a restart via WT_CONNECTION::query_timestamp with "get=durable_timestamp", then MongoDB could skip all of this work and there would be no possibility of a race. Replication could use this timestamp to roll the oplog forward from exactly the point corresponding to the checkpoint.