WiredTiger / WT-7995

Fix global visibility so that it cannot go beyond checkpoint visibility


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: WT10.0.1, 5.1.0, 4.4.9, 5.0.3
    • Component/s: None
    • Labels:
    • Case:
    • Story Points:
      3
    • Sprint:
      Storage - Ra 2021-09-06
    • Backport Requested:
      v5.0, v4.4

      Description

      Issue Status as of Sept 22, 2021

      ISSUE DESCRIPTION AND AFFECTED VERSIONS
      This issue, present in MongoDB 4.4.2-4.4.8 and 5.0.0-5.0.2, causes a checkpoint thread to read and persist an inconsistent version of data to disk. Data in memory remains correct unless the server crashes or experiences an unclean shutdown. In that case, the inconsistent checkpoint is used for recovery and introduces corruption.

      The bug is triggered on cache pages that receive multiple writes during a running checkpoint and which are evicted twice or more during the checkpoint. These events must occur within a window of vulnerability that varies by version:

      • In 4.4, this requires that a checkpoint takes longer than 5 seconds.
      • In 5.0, this requires that a checkpoint take longer than 5 minutes (by default), making impact on 5.0 extremely unlikely unless a shorter minSnapshotHistoryWindowInSeconds has been configured.

      DIAGNOSIS AND IMPACT
      The bug can cause a Duplicate Key error on startup and prevent the node from starting.

      The validate command reveals the impact by reporting inconsistencies between documents and indexes, in the form of:

      • extra index entries (including duplicate entries in unique indexes)
      • missing index entries

      After an unclean shutdown, inconsistent writes can lead to the inability to restart an impacted node due to a Duplicate Key error during startup. However, nodes can also start successfully and still be impacted.

      If a node starts successfully, it may still have been impacted by:

      • Data inconsistency within documents - specific field values may not correctly reflect writes that were acknowledged to the application prior to the unclean shutdown. Documents may also still exist that should have been deleted.
      • Incomplete query results - lost or inaccurate index entries may cause incomplete query results for queries that use impacted indexes.
      • Missing documents - documents may be lost on impacted nodes.

      REMEDIATION AND WORKAROUNDS
      First, upgrade to a fixed version (MongoDB 4.4.9 or 5.0.3). Impact can be remediated on earlier versions but could recur.

      Then, run the validate command on each collection on each node of your replica set.

      If validate reports any failures, resync the impacted node from an unaffected node.
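As a sketch of the remediation check, here is a hypothetical Python helper (the function name and structure are assumptions, not part of the ticket or a MongoDB API) that decides from a collection's validate result document whether the node needs a resync; with PyMongo, each result would come from `db.command("validate", collection_name)`:

```python
def needs_resync(validate_result):
    """Hypothetical helper: return True if a collection's validate
    output reports failures, meaning the node should be resynced
    from an unaffected node. Not a MongoDB API."""
    # The validate command sets "valid": false when it finds
    # inconsistencies such as extra or missing index entries.
    return not validate_result.get("valid", False)

# A clean result: no resync needed.
assert not needs_resync({"valid": True, "errors": []})
# Extra/missing index entries surface as a failed validation.
assert needs_resync({"valid": False, "errors": ["index inconsistency"]})
```

Run this check across every collection on every node; a single failing collection is enough to warrant a resync of that node.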

      Original description

      The difference in global visibility between when the checkpoint visits a btree and when it finishes with the history store leads to wrong data being written to disk when the oldest timestamp moves ahead of the checkpoint timestamp.

      Consider the following scenario:
      1. The oldest timestamp is 10 and the stable timestamp is 10.
      2. Page A has a key (1000) from timestamp 20.
      3. A checkpoint is started at stable timestamp 10.
      4. The checkpoint finishes on page A and writes the keys to disk with timestamp 20.
      5. Later, page A is modified again for another key (2000) at timestamp 30.
      6. The oldest and stable timestamps are moved to 30.
      7. Later, eviction is triggered on page A and writes a new image to disk; the key (1000) at timestamp 20 is rewritten with no timestamp because 20 is less than the oldest timestamp 30.
      8. The key (1000) is updated again with another update at timestamp 50.
      9. Eviction is triggered on this page again; it writes the update at timestamp 50 to the data store and the update at timestamp 20 to the history store. Note that the timestamp was cleared due to global visibility.
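The core of the scenario can be modeled in a short Python sketch (a hypothetical model of WiredTiger's global-visibility check during reconciliation, not actual WiredTiger code), showing how the same update is written with timestamp 20 by the checkpoint but with a cleared timestamp by the later eviction:

```python
def reconcile_start_ts(update_ts, oldest_ts):
    """Hypothetical model: if an update's timestamp is globally
    visible (older than the oldest timestamp), the on-disk copy
    is written with no timestamp (0)."""
    return 0 if update_ts < oldest_ts else update_ts

# Step 4: checkpoint runs while oldest=10, so key(1000)'s
# timestamp 20 is preserved in the checkpoint's image.
assert reconcile_start_ts(20, oldest_ts=10) == 20

# Steps 6-7: oldest moves to 30, eviction rewrites the page, and
# the same update's timestamp 20 is now globally visible and cleared.
assert reconcile_start_ts(20, oldest_ts=30) == 0
# The still-running checkpoint (taken at stable=10) can now persist
# two inconsistent images of the same update: one with timestamp 20,
# one with timestamp 0.
```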

      For example:
      The checkpoint stable timestamp is 939, but the same update is written to the history store with a start timestamp of zero due to the problem described above.

              K {18828}
              value: len 53, start: (0, 940)/(0, 940)/0 stop: (0, 0)/(4294967295, 4294967295)/18446744073709551605
              V {0000000000000000000000000000000000000000000000046732}
              hs-update: start: (0, 0)/(0, 0)/0 stop: (0, 976)/(0, 976)/0
              V {0000000000000000000000000000000000000000000000046732}
      

      On these checkpoint data files, if rollback to stable (RTS) runs, it restores a key that it should not.
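Why RTS misbehaves can be seen in a minimal sketch (a hypothetical model of the rollback-to-stable filter, not actual WiredTiger code): RTS discards updates whose start timestamp is newer than the stable timestamp, so an update whose real timestamp was cleared to zero looks stable and survives:

```python
def rts_keeps(start_ts, stable_ts):
    """Hypothetical model: rollback to stable keeps an update only
    if its start timestamp is at or before the stable timestamp."""
    return start_ts <= stable_ts

stable_ts = 939
# Correctly timestamped copy of the update (start 940): rolled back.
assert not rts_keeps(940, stable_ts)
# Same update written with start timestamp 0 (cleared by the bug):
# RTS treats it as stable and wrongly restores it.
assert rts_keeps(0, stable_ts)
```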

              People

              Assignee:
              haribabu.kommi Haribabu Kommi
              Reporter:
              haribabu.kommi Haribabu Kommi
