Sorry for the poor title, but we now have a repro case along with an understanding of how data files got into a specific state.
Consider MongoDB starting up on existing data files and setting the oldest timestamp = stable timestamp = WT recovery timestamp = 100.
Consider a WT transaction that performs the following MongoDB writes under the Recover to a Timestamp table logging rules:
- Writes to a collection table at time 110 (not WT journaled)
- Writes to the _id index table at time 110 (not WT journaled)
- Writes to the oplog at time 110 (journaled by WT)
MongoDB advances the stable timestamp to 120. (A sketch of this sequence against the WT C API follows below.)
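A minimal sketch of the above against the WiredTiger C API, assuming placeholder table names, simplified key/value formats, and omitted error handling (not MongoDB's actual schema, which uses key_format=q, value_format=u):

{code:c}
#include <wiredtiger.h>

/* Hypothetical repro sketch; names and data are illustrative only. */
void repro(WT_CONNECTION *conn)
{
    WT_SESSION *session;
    WT_CURSOR *coll, *idx, *oplog;

    /* Startup state: oldest = stable = recovery timestamp = 100 (0x64). */
    conn->set_timestamp(conn, "oldest_timestamp=64,stable_timestamp=64");

    conn->open_session(conn, NULL, NULL, &session);

    /* Recover-to-a-timestamp logging rules: the collection and _id index
     * tables are not WT-journaled; the oplog is. */
    session->create(session, "table:collection", "key_format=q,value_format=S,log=(enabled=false)");
    session->create(session, "table:index-id", "key_format=q,value_format=S,log=(enabled=false)");
    session->create(session, "table:oplog", "key_format=q,value_format=S,log=(enabled=true)");

    session->open_cursor(session, "table:collection", NULL, NULL, &coll);
    session->open_cursor(session, "table:index-id", NULL, NULL, &idx);
    session->open_cursor(session, "table:oplog", NULL, NULL, &oplog);

    /* One WT transaction writing all three tables, committed at time 110 (0x6e). */
    session->begin_transaction(session, NULL);
    coll->set_key(coll, (int64_t)1);
    coll->set_value(coll, "document");
    coll->insert(coll);
    idx->set_key(idx, (int64_t)1);
    idx->set_value(idx, "_id index entry");
    idx->insert(idx);
    oplog->set_key(oplog, (int64_t)1);
    oplog->set_value(oplog, "oplog entry");
    oplog->insert(oplog);
    session->commit_transaction(session, "commit_timestamp=6e");

    /* MongoDB advances the stable timestamp to 120 (0x78). */
    conn->set_timestamp(conn, "stable_timestamp=78");
}
{code}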
WT comes under cache pressure and chooses to evict a page on which all writes are visible; it picks the collection table's page containing the write at 110. Eviction writes a new block on disk that is otherwise unreferenced by the previous WT checkpoint, and it writes a file_sync journal entry outside of any checkpoint log records.
The node crashes without ever taking a checkpoint. When the node restarts, it first runs WT recovery. WT creates a checkpoint that updates the collection table to a root node that now includes the write at 110. The index table does not include the corresponding write at 110. The WT system checkpoint time remains at 100.
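A sketch of what that restart looks like from the WT API, assuming a placeholder dbpath: wiredtiger_open runs recovery (taking the checkpoint described above), yet the recovery timestamp it reports is still 100, which is where MongoDB's oplog recovery will begin replaying:

{code:c}
#include <stdio.h>
#include <wiredtiger.h>

int main(void)
{
    WT_CONNECTION *conn;
    char ts[2 * sizeof(uint64_t) + 1];

    /* Reopening the existing data files runs WT recovery, which takes a
     * new checkpoint; "dbpath" is a placeholder home directory. */
    wiredtiger_open("dbpath", NULL, "log=(enabled=true,recover=on)", &conn);

    /* Still reports 64 (decimal 100), even though the collection table's
     * new checkpoint already contains the write at 110 from the evicted page. */
    conn->query_timestamp(conn, ts, "get=recovery");
    printf("recovery timestamp: %s\n", ts);

    conn->close(conn, NULL);
    return 0;
}
{code}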
This creates a problem when MongoDB runs replication/oplog recovery. Because the corresponding _id index entry was never persisted, the replayed insert at 110 is not detected as a duplicate, and it creates a second copy of the document in the collection table.
Shown below is the WT table metadata from before recovery is run and after. It demonstrates that a checkpoint was taken and, based on the size, that new data is included. The new checkpoint_lsn matches the file_sync log record's LSN.
Before:
access_pattern_hint=none, allocation_size=4KB, app_metadata=(formatVersion=1), assert=(commit_timestamp=none,read_timestamp=none), block_allocation=best, block_compressor=snappy, cache_resident=false, checkpoint=( WiredTigerCheckpoint.105=( addr="01c00781e4ecd10c96c01781e48fc4f005c08481e4d61a7687808080e31a5fc0e3193fc0", order=105, time=1524851546, size=1667072, write_gen=226)), checkpoint_lsn=(11,1252224), checksum=on, collator=,columns=,dictionary=0,encryption=(keyid=,name=), format=btree, huffman_key=,huffman_value=, id=65, ignore_in_memory_cache_size=false, internal_item_max=0, internal_key_max=0, internal_key_truncate=true, internal_page_max=4KB, key_format=q, key_gap=10, leaf_item_max=0, leaf_key_max=0, leaf_page_max=32KB, leaf_value_max=64MB, log=(enabled=false), memory_page_max=10m, os_cache_dirty_max=0, os_cache_max=0, prefix_compression=false, prefix_compression_min=4, split_deepen_min_child=0, split_deepen_per_child=0, split_pct=90, value_format=u, version=(major=1,minor=1)
After:
access_pattern_hint=none, allocation_size=4KB, app_metadata=(formatVersion=1), assert=(commit_timestamp=none,read_timestamp=none), block_allocation=best, block_compressor=snappy, cache_resident=false, checkpoint=( WiredTigerCheckpoint.106=( addr="01c0ba81e4fc2781bbc11a81e437359414c11b81e411671792808080e31ebfc0e31dffc0", order=106, time=1524851624, size=1978368, write_gen=244)), checkpoint_lsn=(12,21533056), checksum=on, collator=,columns=,dictionary=0,encryption=(keyid=,name=), format=btree, huffman_key=,huffman_value=, id=65, ignore_in_memory_cache_size=false, internal_item_max=0, internal_key_max=0, internal_key_truncate=true, internal_page_max=4KB, key_format=q, key_gap=10, leaf_item_max=0, leaf_key_max=0, leaf_page_max=32KB, leaf_value_max=64MB, log=(enabled=false), memory_page_max=10m, os_cache_dirty_max=0, os_cache_max=0, prefix_compression=false, prefix_compression_min=4, split_deepen_min_child=0, split_deepen_per_child=0, split_pct=90, value_format=u, version=(major=1,minor=1)
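For reference, configuration strings like the two above can be read back through a WT metadata cursor; a minimal sketch, with the file URI as a placeholder for the collection's actual .wt file:

{code:c}
#include <stdio.h>
#include <wiredtiger.h>

/* Print a table's stored configuration, including the checkpoint= and
 * checkpoint_lsn= fields, from the metadata table. */
void dump_file_config(WT_SESSION *session, const char *uri)
{
    WT_CURSOR *meta;
    const char *config;

    session->open_cursor(session, "metadata:", NULL, NULL, &meta);
    meta->set_key(meta, uri); /* e.g. "file:collection-....wt" (placeholder) */
    if (meta->search(meta) == 0) {
        meta->get_value(meta, &config);
        printf("%s\n%s\n", uri, config);
    }
    meta->close(meta);
}
{code}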
- related to WT-8695 Remove file_close_sync config and disallow single file checkpoint (Closed)