  WiredTiger / WT-7995

Fix the global visibility so that it cannot go beyond checkpoint visibility

    • Storage Engines
    • 3
    • Storage - Ra 2021-09-06
    • v5.0, v4.4

      Issue Status as of Jan 13, 2021

      ISSUE DESCRIPTION AND AFFECTED VERSIONS
      This issue in MongoDB 4.4.2-4.4.8 and 5.0.0-5.0.2 causes a checkpoint thread to read and persist an inconsistent version of data to disk. Data in memory remains correct unless the server crashes or experiences an unclean shutdown. Then, the inconsistent checkpoint is used for recovery and introduces corruption.

      The bug is triggered on cache pages that receive multiple writes during a running checkpoint and which are evicted twice or more during the checkpoint. These events must occur within a window of vulnerability that varies by version:

      • In 4.4, this requires that a checkpoint take longer than 5 seconds.
      • In 5.0, this requires that a checkpoint take longer than 5 minutes (by default), making impact on 5.0 extremely unlikely unless a shorter minSnapshotHistoryWindowInSeconds has been configured.
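The version ranges and windows above can be turned into a quick triage check. The following is a minimal sketch, assuming plain `x.y.z` version strings; every name in it is hypothetical and none of it is a MongoDB API. A checkpoint that exceeds the window is a necessary condition only: the page must also receive multiple writes and be evicted at least twice during that checkpoint.

```python
# Hypothetical triage helper built only from the ranges and windows quoted above.
AFFECTED_RANGES = [((4, 4, 2), (4, 4, 8)), ((5, 0, 0), (5, 0, 2))]

# Window of vulnerability in seconds: 5 s on 4.4, 5 minutes by default on 5.0.
DEFAULT_WINDOW_SECONDS = {(4, 4): 5, (5, 0): 300}


def parse_version(text):
    """Turn 'x.y.z' into a tuple of ints (assumes no suffix such as '-rc0')."""
    return tuple(int(part) for part in text.split(".")[:3])


def is_affected(version):
    """True if the version falls in 4.4.2-4.4.8 or 5.0.0-5.0.2."""
    v = parse_version(version)
    return any(low <= v <= high for low, high in AFFECTED_RANGES)


def checkpoint_exceeds_window(version, checkpoint_seconds,
                              min_snapshot_history_window=None):
    """True if an affected version ran a checkpoint longer than its window.

    min_snapshot_history_window stands in for a configured
    minSnapshotHistoryWindowInSeconds value on 5.0.
    """
    if not is_affected(version):
        return False
    series = parse_version(version)[:2]
    window = DEFAULT_WINDOW_SECONDS[series]
    if series == (5, 0) and min_snapshot_history_window is not None:
        window = min_snapshot_history_window
    return checkpoint_seconds > window
```

For instance, `checkpoint_exceeds_window("5.0.1", 60)` is false with the default window, but becomes true when `min_snapshot_history_window=30` is passed.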

      DIAGNOSIS AND IMPACT
      The bug can cause a Duplicate Key error on startup and prevent the node from starting.

      The validate command reveals the impact by reporting on the inconsistencies created between documents and indexes, in the form of:

      • extra index entries (including duplicate entries in unique indexes)
      • missing index entries

      After an unclean shutdown, inconsistent writes can lead to the inability to restart an impacted node due to a Duplicate Key error during startup. However, nodes can also start successfully and still be impacted.

      If a node starts successfully, it may still have been impacted by:

      • Data inconsistency within documents - specific field values may not correctly reflect writes that were acknowledged to the application prior to the unclean shutdown. Documents that should have been deleted may also still exist.
      • Incomplete query results - lost or inaccurate index entries may cause incomplete query results for queries that use impacted indexes.
      • Missing documents - documents may be lost on impacted nodes.

      REMEDIATION AND WORKAROUNDS

      This issue is fixed in MongoDB 4.4.9+ or 5.0.3+.

      Important: If you are on MongoDB 4.4.3 or 4.4.4, do not upgrade directly to MongoDB 4.4.8-4.4.10 or 5.0.2-5.0.5, as this upgrade path is vulnerable to another critical issue, WT-8395. Instead, upgrade directly to 4.4.11+ or 5.0.6+.
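The upgrade guidance above combines two rules: the target must contain the WT-7995 fix, and certain source versions must skip the targets exposed to WT-8395. A minimal sketch of that decision follows, assuming plain `x.y.z` version strings and covering only the 4.4 and 5.0 series named here. The function and constant names are hypothetical.

```python
# Hypothetical upgrade-path check derived from the version numbers quoted above.
FIRST_FIXED = {(4, 4): (4, 4, 9), (5, 0): (5, 0, 3)}       # WT-7995 fixed from here
RISKY_SOURCES = {(4, 4, 3), (4, 4, 4)}                      # sources exposed to WT-8395
RISKY_TARGETS = [((4, 4, 8), (4, 4, 10)), ((5, 0, 2), (5, 0, 5))]


def parse_version(text):
    """Turn 'x.y.z' into a tuple of ints (assumes no suffix such as '-rc0')."""
    return tuple(int(part) for part in text.split(".")[:3])


def is_safe_upgrade(current, target):
    """True if upgrading current -> target is consistent with the guidance above."""
    cur, tgt = parse_version(current), parse_version(target)
    series = tgt[:2]
    if series not in FIRST_FIXED:
        raise ValueError("only the 4.4 and 5.0 series are covered by this sketch")
    if tgt < FIRST_FIXED[series]:
        return False  # target still contains WT-7995
    if cur in RISKY_SOURCES and any(low <= tgt <= high for low, high in RISKY_TARGETS):
        return False  # direct upgrade path is vulnerable to WT-8395
    return True
```

Under these rules, `is_safe_upgrade("4.4.3", "4.4.9")` is false even though 4.4.9 fixes WT-7995, while `is_safe_upgrade("4.4.3", "4.4.11")` is true.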

      Once you upgrade to a fixed version to prevent further exposure to this issue, run the validate command on each collection on each node of your replica set.

      If validate reports any failures, resync the impacted node from an unaffected node. If an unaffected node cannot be readily identified, these scripts can assist with remediation of this bug.
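Running validate on every collection can be scripted. The sketch below is one possible approach and is not the remediation scripts mentioned above. It assumes `db` behaves like a PyMongo `Database` handle, using its `list_collection_names` and `command` methods, and that the validate result carries `valid` and `errors` fields. The function name is hypothetical.

```python
def find_invalid_collections(db):
    """Run the validate command on each collection of one database.

    `db` is assumed to behave like a PyMongo Database handle.
    Returns {collection_name: errors} for collections that fail validation.
    """
    failures = {}
    # Views cannot be validated, so only real collections are listed.
    for name in db.list_collection_names(filter={"type": "collection"}):
        result = db.command("validate", name)
        if not result.get("valid", False):
            failures[name] = result.get("errors", [])
    return failures
```

Because validation must run on each node, this would be called once per database against a direct connection to every member of the replica set, for example a `MongoClient` opened with `directConnection=True`. A non-empty result marks that node as a candidate for resync.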

      Original description

      Global visibility can differ between the time the checkpoint visits a btree and the time it finishes processing the history store. When the oldest timestamp moves ahead of the checkpoint timestamp during that window, this difference causes wrong data to be written to disk.

      Consider the following scenario:
      1. The oldest timestamp is 10 and the stable timestamp is 10.
      2. Page A has a key (1000) with an update at timestamp 20.
      3. A checkpoint starts at stable timestamp 10.
      4. The checkpoint finishes with page A and writes the key to disk with timestamp 20.
      5. Page A is then modified again: another key (2000) is updated at timestamp 30.
      6. The oldest and stable timestamps move to 30.
      7. Eviction runs on page A and writes a new image to disk. The update to key 1000 at timestamp 20 is rewritten with no timestamp, because 20 is less than 30 and the update is treated as globally visible.
      8. Key 1000 is updated again at timestamp 50.
      9. Eviction runs on the page again. It writes the update at timestamp 50 to the data store and the update from timestamp 20 to the history store. That history store entry carries no timestamp, because the timestamp was cleared due to global visibility.
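The scenario can be reduced to a toy model of when a timestamp is cleared. This is a sketch for illustration only and is not WiredTiger code: the function names are hypothetical, and treating an update exactly at the cutoff as globally visible is an assumption. The capped variant follows the intent stated in the issue title, that global visibility must not go beyond checkpoint visibility.

```python
def visibility_cutoff(oldest_ts, checkpoint_ts=None, cap_at_checkpoint=False):
    """Timestamp at or below which an update counts as globally visible."""
    if cap_at_checkpoint and checkpoint_ts is not None:
        # Intended behavior: a running checkpoint holds the cutoff back.
        return min(oldest_ts, checkpoint_ts)
    # Buggy behavior: only the oldest timestamp is consulted.
    return oldest_ts


def timestamp_written(update_ts, oldest_ts, checkpoint_ts=None,
                      cap_at_checkpoint=False):
    """Start timestamp persisted for an update; 0 means it was cleared."""
    cutoff = visibility_cutoff(oldest_ts, checkpoint_ts, cap_at_checkpoint)
    return 0 if update_ts <= cutoff else update_ts


# Step 7 of the scenario: update at 20, oldest moved to 30, checkpoint at 10.
buggy = timestamp_written(20, oldest_ts=30, checkpoint_ts=10)
capped = timestamp_written(20, oldest_ts=30, checkpoint_ts=10,
                           cap_at_checkpoint=True)
```

In this model `buggy` is 0, so the update loses its timestamp, while `capped` stays at 20 because the cutoff is held at the checkpoint timestamp of 10.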

      For example, in the dump below the checkpoint stable timestamp is 939, yet the same update is written to the history store with a start timestamp of zero because of the problem described above.

              K {18828}
              value: len 53, start: (0, 940)/(0, 940)/0 stop: (0, 0)/(4294967295, 4294967295)/18446744073709551605
              V {0000000000000000000000000000000000000000000000046732}
              hs-update: start: (0, 0)/(0, 0)/0 stop: (0, 976)/(0, 976)/0
              V {0000000000000000000000000000000000000000000000046732}
      

      If rollback to stable (RTS) runs on these checkpoint data files, it restores a key that it should not.
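The effect on rollback to stable can be illustrated with a simplified model of how one key is resolved. This is a sketch and not WiredTiger's RTS implementation. The function name is hypothetical, and it assumes the history store copy should have carried the same start timestamp, 940, as the on-disk value, since the example describes it as the same update.

```python
def rts_resolve(on_disk, history, stable_ts):
    """Simplified rollback to stable for a single key.

    on_disk: (start_ts, value) currently in the data store.
    history: list of (start_ts, value) history store versions, newest first.
    Returns the value kept after rollback, or None if the key is removed.
    """
    start_ts, value = on_disk
    if start_ts <= stable_ts:
        return value  # on-disk value is stable, nothing to roll back
    for hs_start, hs_value in history:
        if hs_start <= stable_ts:
            return hs_value  # newest history version that looks stable
    return None  # no stable version exists, so the key is removed


STABLE_TS = 939
on_disk = (940, "46732")

# History store entry as dumped above: start timestamp cleared to 0.
restored_with_bug = rts_resolve(on_disk, [(0, "46732")], STABLE_TS)

# Same entry if it had kept start timestamp 940 (assumption, see lead-in).
restored_if_kept = rts_resolve(on_disk, [(940, "46732")], STABLE_TS)
```

With the cleared timestamp the model restores the value even though it is newer than the stable timestamp. With the timestamp preserved, nothing qualifies and the key is removed.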

            Assignee:
            haribabu.kommi@mongodb.com Haribabu Kommi
            Reporter:
            haribabu.kommi@mongodb.com Haribabu Kommi
            Votes:
            0
            Watchers:
            31