Details
Description
Sorry for the poor title, but we now have a repro case along with an understanding of how data files got into a specific state.
Consider MongoDB starting up on existing data files and setting the oldest timestamp = stable timestamp = WT recovery timestamp = 100
Consider a WT transaction that performs the following MongoDB writes with Recover to a Timestamp table logging rules:
- Writes to a collection table at time 110 (not WT journaled)
- Writes to the _id index table at time 110 (not WT journaled)
- Writes to the oplog at time 110 (journaled by WT)
MongoDB advances the stable timestamp to 120.
WT senses cache pressure and chooses a page where all writes are visible to evict. It chooses to evict the page for the collection table including the write at 110. This becomes a new block on disk that is otherwise unreferenced by the previous WT checkpoint. This writes a file_sync journal entry outside of any checkpoint log records.
The node crashes without ever taking a checkpoint. When the node restarts, it first runs WT recovery. WT creates a checkpoint that updates the collection table to a root node that now includes the write at 110. The index table does not include the corresponding write at 110. The WT system checkpoint time remains at 100.
This creates a problem when MongoDB runs replication/oplog recovery. The insert at 110 will now create a duplicate document in the collection table.
Shown below are the WT table metadata from before recovery is ran and after. It demonstrates a checkpoint was taken and, based on the size, new data is included. The new checkpoint_lsn matches the file_sync log record lsn.
access_pattern_hint=none,
|
allocation_size=4KB,
|
app_metadata=(formatVersion=1),
|
assert=(commit_timestamp=none,read_timestamp=none),
|
block_allocation=best,
|
block_compressor=snappy,
|
cache_resident=false,
|
checkpoint=(
|
WiredTigerCheckpoint.105=(
|
addr="01c00781e4ecd10c96c01781e48fc4f005c08481e4d61a7687808080e31a5fc0e3193fc0",
|
order=105,
|
time=1524851546,
|
size=1667072,
|
write_gen=226)),
|
checkpoint_lsn=(11,1252224),
|
checksum=on,
|
collator=,columns=,dictionary=0,encryption=(keyid=,name=),
|
format=btree,
|
huffman_key=,huffman_value=,
|
id=65,
|
ignore_in_memory_cache_size=false,
|
internal_item_max=0,
|
internal_key_max=0,
|
internal_key_truncate=true,
|
internal_page_max=4KB,
|
key_format=q,
|
key_gap=10,
|
leaf_item_max=0,
|
leaf_key_max=0,
|
leaf_page_max=32KB,
|
leaf_value_max=64MB,
|
log=(enabled=false),
|
memory_page_max=10m,
|
os_cache_dirty_max=0,
|
os_cache_max=0,
|
prefix_compression=false,
|
prefix_compression_min=4,
|
split_deepen_min_child=0,
|
split_deepen_per_child=0,
|
split_pct=90,
|
value_format=u,
|
version=(major=1,minor=1)
|
After:
access_pattern_hint=none,
|
allocation_size=4KB,
|
app_metadata=(formatVersion=1),
|
assert=(commit_timestamp=none,read_timestamp=none),
|
block_allocation=best,
|
block_compressor=snappy,
|
cache_resident=false,
|
checkpoint=(
|
WiredTigerCheckpoint.106=(
|
addr="01c0ba81e4fc2781bbc11a81e437359414c11b81e411671792808080e31ebfc0e31dffc0",
|
order=106,
|
time=1524851624,
|
size=1978368,
|
write_gen=244)),
|
checkpoint_lsn=(12,21533056),
|
checksum=on,
|
collator=,columns=,dictionary=0,encryption=(keyid=,name=),
|
format=btree,
|
huffman_key=,huffman_value=,
|
id=65,
|
ignore_in_memory_cache_size=false,
|
internal_item_max=0,
|
internal_key_max=0,
|
internal_key_truncate=true,
|
internal_page_max=4KB,
|
key_format=q,
|
key_gap=10,
|
leaf_item_max=0,
|
leaf_key_max=0,
|
leaf_page_max=32KB,
|
leaf_value_max=64MB,
|
log=(enabled=false),
|
memory_page_max=10m,
|
os_cache_dirty_max=0,
|
os_cache_max=0,
|
prefix_compression=false,
|
prefix_compression_min=4,
|
split_deepen_min_child=0,
|
split_deepen_per_child=0,
|
split_pct=90,
|
value_format=u,
|
version=(major=1,minor=1)
|
Attachments
Issue Links
- related to
-
WT-8695 Remove file_close_sync config and disallow single file checkpoint
-
- Closed
-