WiredTiger / WT-7995

Fix global visibility so that it cannot go beyond checkpoint visibility


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: WT10.0.1, 5.1.0, 4.4.9, 5.0.3
    • Component/s: None
    • Labels:
    • Case:
    • Story Points:
      3
    • Sprint:
      Storage - Ra 2021-09-06
    • Backport Requested:
      v5.0, v4.4

      Description

      Issue Status as of Sept 22, 2021

      ISSUE DESCRIPTION AND AFFECTED VERSIONS
      This issue, present in MongoDB 4.4.2-4.4.8 and 5.0.0-5.0.2, causes a checkpoint thread to read and persist an inconsistent version of data to disk. Data in memory remains correct unless the server crashes or experiences an unclean shutdown. In that case, the inconsistent checkpoint is used for recovery and introduces corruption.

      The bug is triggered on cache pages that receive multiple writes during a running checkpoint and which are evicted twice or more during the checkpoint. These events must occur within a window of vulnerability that varies by version:

      • In 4.4, this requires that a checkpoint takes longer than 5 seconds.
      • In 5.0, this requires that a checkpoint take longer than 5 minutes (by default), making impact on 5.0 extremely unlikely unless a shorter minSnapshotHistoryWindowInSeconds has been configured.

      DIAGNOSIS AND IMPACT
      The bug can cause a Duplicate Key error on startup and prevent the node from starting.

      The validate command reveals the impact by reporting inconsistencies between documents and indexes, in the form of:

      • extra index entries (including duplicate entries in unique indexes)
      • missing index entries

      After an unclean shutdown, inconsistent writes can lead to the inability to restart an impacted node due to a Duplicate Key error during startup. However, nodes can also start successfully and still be impacted.

      If a node starts successfully, it may still have been impacted by:

      • Data inconsistency within documents - specific field values may not correctly reflect writes that were acknowledged to the application prior to the unclean shutdown. Documents may also still exist that should have been deleted.
      • Incomplete query results - lost or inaccurate index entries may cause incomplete query results for queries that use impacted indexes.
      • Missing documents - documents may be lost on impacted nodes.

      REMEDIATION AND WORKAROUNDS
      First, upgrade to a fixed version (MongoDB 4.4.9 or 5.0.3). Impact can be remediated on earlier versions but could recur.

      Then, run the validate command on each collection on each node of your replica set.

      If validate reports any failures, resync the impacted node from an unaffected node.
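As a sketch of the remediation check, here is a hypothetical Python helper (the function name and structure are assumptions, not part of the ticket or a MongoDB API) that decides from a collection's validate result document whether the node needs a resync; with PyMongo, each result would come from `db.command("validate", collection_name)`:

```python
def needs_resync(validate_result):
    """Hypothetical helper: return True if a collection's validate
    output reports failures, meaning the node should be resynced
    from an unaffected node. Not a MongoDB API."""
    # The validate command sets "valid": false when it finds
    # inconsistencies such as extra or missing index entries.
    return not validate_result.get("valid", False)

# A clean result: no resync needed.
assert not needs_resync({"valid": True, "errors": []})
# Extra/missing index entries surface as a failed validation.
assert needs_resync({"valid": False, "errors": ["index inconsistency"]})
```

Run this check across every collection on every node; a single failing collection is enough to warrant a resync of that node.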

      Original description

      The difference in global visibility between when the checkpoint visits a btree and when it finishes with the history store leads to wrong data being written to disk when the oldest timestamp moves ahead of the checkpoint timestamp.

      Consider the following scenario:
      1. The oldest timestamp is 10 and the stable timestamp is 10.
      2. Page A has a key (1000) from timestamp 20.
      3. A checkpoint is started at stable timestamp 10.
      4. The checkpoint finishes on page A and writes the keys to disk with timestamp 20.
      5. Later, page A is modified again for another key (2000) at timestamp 30.
      6. The oldest and stable timestamps are moved to 30.
      7. Later, eviction is triggered on page A and writes a new image to disk; the key (1000) at timestamp 20 is rewritten with no timestamp because 20 is less than the oldest timestamp 30.
      8. The key (1000) is updated again with another update at timestamp 50.
      9. Eviction is triggered on this page again; it writes the update at timestamp 50 to the data store and the update at timestamp 20 to the history store. Note that the timestamp was cleared due to global visibility.
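The core of the scenario can be modeled in a short Python sketch (a hypothetical model of WiredTiger's global-visibility check during reconciliation, not actual WiredTiger code), showing how the same update is written with timestamp 20 by the checkpoint but with a cleared timestamp by the later eviction:

```python
def reconcile_start_ts(update_ts, oldest_ts):
    """Hypothetical model: if an update's timestamp is globally
    visible (older than the oldest timestamp), the on-disk copy
    is written with no timestamp (0)."""
    return 0 if update_ts < oldest_ts else update_ts

# Step 4: checkpoint runs while oldest=10, so key(1000)'s
# timestamp 20 is preserved in the checkpoint's image.
assert reconcile_start_ts(20, oldest_ts=10) == 20

# Steps 6-7: oldest moves to 30, eviction rewrites the page, and
# the same update's timestamp 20 is now globally visible and cleared.
assert reconcile_start_ts(20, oldest_ts=30) == 0
# The still-running checkpoint (taken at stable=10) can now persist
# two inconsistent images of the same update: one with timestamp 20,
# one with timestamp 0.
```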

      For example:
      The checkpoint stable timestamp is 939, but the same update is written to the history store with a start timestamp of zero due to the problem described above.

              K {18828}
              value: len 53, start: (0, 940)/(0, 940)/0 stop: (0, 0)/(4294967295, 4294967295)/18446744073709551605
              V {0000000000000000000000000000000000000000000000046732}
              hs-update: start: (0, 0)/(0, 0)/0 stop: (0, 976)/(0, 976)/0
              V {0000000000000000000000000000000000000000000000046732}
      

      On these checkpoint data files, if rollback to stable (RTS) runs, it restores a key that it should not.
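Why RTS misbehaves can be seen in a minimal sketch (a hypothetical model of the rollback-to-stable filter, not actual WiredTiger code): RTS discards updates whose start timestamp is newer than the stable timestamp, so an update whose real timestamp was cleared to zero looks stable and survives:

```python
def rts_keeps(start_ts, stable_ts):
    """Hypothetical model: rollback to stable keeps an update only
    if its start timestamp is at or before the stable timestamp."""
    return start_ts <= stable_ts

stable_ts = 939
# Correctly timestamped copy of the update (start 940): rolled back.
assert not rts_keeps(940, stable_ts)
# Same update written with start timestamp 0 (cleared by the bug):
# RTS treats it as stable and wrongly restores it.
assert rts_keeps(0, stable_ts)
```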

              People

              Assignee:
              haribabu.kommi Haribabu Kommi
              Reporter:
              haribabu.kommi Haribabu Kommi
