Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-4070

WiredTiger recovery can checkpoint data referencing pieces of a transaction

    • Type: Icon: Bug Bug
    • Resolution: Fixed
    • Priority: Icon: Major - P3 Major - P3
    • 3.6.6, 4.0.0-rc0, WT3.1.0
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None
    • Storage Non-NYC 2018-05-07

      Sorry for the poor title, but we now have a repro case along with an understanding of how data files got into a specific state.

      Consider MongoDB starting up on existing data files and setting the oldest timestamp = stable timestamp = WT recovery timestamp = 100

      Consider a WT transaction that performs the following MongoDB writes with Recover to a Timestamp table logging rules:

      1. Writes to a collection table at time 110 (not WT journaled)
      2. Writes to the _id index table at time 110 (not WT journaled)
      3. Writes to the oplog at time 110 (journaled by WT)

      MongoDB advances the stable timestamp to 120.

      WT senses cache pressure and chooses a page where all writes are visible to evict. It chooses to evict the page for the collection table including the write at 110. This becomes a new block on disk that is otherwise unreferenced by the previous WT checkpoint. This writes a file_sync journal entry outside of any checkpoint log records.

      The node crashes without ever taking a checkpoint. When the node restarts, it first runs WT recovery. WT creates a checkpoint that updates the collection table to a root node that now includes the write at 110. The index table does not include the corresponding write at 110. The WT system checkpoint time remains at 100.

      This creates a problem when MongoDB runs replication/oplog recovery. The insert at 110 will now create a duplicate document in the collection table.

      Shown below are the WT table metadata from before recovery is ran and after. It demonstrates a checkpoint was taken and, based on the size, new data is included. The new checkpoint_lsn matches the file_sync log record lsn.

      access_pattern_hint=none,
      allocation_size=4KB,
      app_metadata=(formatVersion=1),
      assert=(commit_timestamp=none,read_timestamp=none),
      block_allocation=best,
      block_compressor=snappy,
      cache_resident=false,
      checkpoint=(
          WiredTigerCheckpoint.105=(
      	addr="01c00781e4ecd10c96c01781e48fc4f005c08481e4d61a7687808080e31a5fc0e3193fc0",
      	order=105,
      	time=1524851546,
      	size=1667072,
      	write_gen=226)),
      checkpoint_lsn=(11,1252224),
      checksum=on,
      collator=,columns=,dictionary=0,encryption=(keyid=,name=),
      format=btree,
      huffman_key=,huffman_value=,
      id=65,
      ignore_in_memory_cache_size=false,
      internal_item_max=0,
      internal_key_max=0,
      internal_key_truncate=true,
      internal_page_max=4KB,
      key_format=q,
      key_gap=10,
      leaf_item_max=0,
      leaf_key_max=0,
      leaf_page_max=32KB,
      leaf_value_max=64MB,
      log=(enabled=false),
      memory_page_max=10m,
      os_cache_dirty_max=0,
      os_cache_max=0,
      prefix_compression=false,
      prefix_compression_min=4,
      split_deepen_min_child=0,
      split_deepen_per_child=0,
      split_pct=90,
      value_format=u,
      version=(major=1,minor=1)
      

      After:

      access_pattern_hint=none,
      allocation_size=4KB,
      app_metadata=(formatVersion=1),
      assert=(commit_timestamp=none,read_timestamp=none),
      block_allocation=best,
      block_compressor=snappy,
      cache_resident=false,
      checkpoint=(
          WiredTigerCheckpoint.106=(
      	addr="01c0ba81e4fc2781bbc11a81e437359414c11b81e411671792808080e31ebfc0e31dffc0",
      	order=106,
      	time=1524851624,
      	size=1978368,
      	write_gen=244)),
      checkpoint_lsn=(12,21533056),
      checksum=on,
      collator=,columns=,dictionary=0,encryption=(keyid=,name=),
      format=btree,
      huffman_key=,huffman_value=,
      id=65,
      ignore_in_memory_cache_size=false,
      internal_item_max=0,
      internal_key_max=0,
      internal_key_truncate=true,
      internal_page_max=4KB,
      key_format=q,
      key_gap=10,
      leaf_item_max=0,
      leaf_key_max=0,
      leaf_page_max=32KB,
      leaf_value_max=64MB,
      log=(enabled=false),
      memory_page_max=10m,
      os_cache_dirty_max=0,
      os_cache_max=0,
      prefix_compression=false,
      prefix_compression_min=4,
      split_deepen_min_child=0,
      split_deepen_per_child=0,
      split_pct=90,
      value_format=u,
      version=(major=1,minor=1)
      

            Assignee:
            sue.loverso@mongodb.com Susan LoVerso
            Reporter:
            daniel.gottlieb@mongodb.com Daniel Gottlieb (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

              Created:
              Updated:
              Resolved: