Type: Bug
Resolution: Fixed
Priority: Critical - P2
Fix Version/s: WT12.0.0, 8.3.0-rc0, 8.2.2, 8.0.18
Affects Version/s: None
Component/s: Transactions
Labels: dc
Assigned Teams: Storage Engines, Storage Engines - Transactions
Sprint: SE Transactions - 2025-08-01
Story Points: 3
Backport Requested: v8.2, v8.1, v8.0, v7.0, v6.0

    /*
     * Release our snapshot in case it is keeping data pinned (this is particularly important for
     * checkpoints). Before releasing our snapshot, copy values into any positioned cursors so they
     * don't point to updates that could be freed once we don't have a snapshot. If this transaction
     * is prepared, then copying values would have been done during prepare.
     */
    if (session->ncursors > 0 && !prepare) {
        WT_DIAGNOSTIC_YIELD;
        WT_ERR(__wt_session_copy_values(session));
    }
    __wt_txn_release_snapshot(session);

In the txn commit code, we release the snapshot at the start of the function, before marking the updates as committed. This can lead to failed repeated reads if the commit uses a timestamp. Here's an example:

Transaction A commits with timestamp 200.
We release the snapshot and context switch.
Another session starts a read transaction with read timestamp 100.
It reads the update written by transaction A. The update still has timestamp 0 because the commit hasn't finished, so the update is visible to the read transaction.
Transaction A resumes commit and finishes marking the updates with timestamp 200.
The read transaction reads the same update again. This time it is not visible because the update now has a timestamp 200 which is larger than its read timestamp.
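The interleaving above can be replayed with a toy model. This is only a sketch under stated assumptions: `toy_update`, `toy_visible_by_ts`, and `toy_repeated_read_diverges` are hypothetical names, not WiredTiger code, and the model checks timestamps only. It deliberately ignores the transaction-ID check, which the Update section of this ticket identifies as what keeps the update hidden in practice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: hypothetical type, not a WiredTiger structure. */
typedef struct {
    uint64_t commit_ts; /* Stays 0 until the committing thread stamps the update. */
} toy_update;

/*
 * Timestamp-only visibility: an update is visible when its commit timestamp
 * is not newer than the reader's read timestamp.
 */
static bool
toy_visible_by_ts(const toy_update *upd, uint64_t read_ts)
{
    return (upd->commit_ts <= read_ts);
}

/* Replay the steps in order; return true if the two reads disagree. */
static bool
toy_repeated_read_diverges(void)
{
    toy_update upd = {0};         /* Transaction A's update, not yet stamped. */
    const uint64_t read_ts = 100; /* The reader's read timestamp. */
    bool first, second;

    first = toy_visible_by_ts(&upd, read_ts);  /* 0 <= 100: visible. */
    upd.commit_ts = 200;                       /* Transaction A finishes its commit. */
    second = toy_visible_by_ts(&upd, read_ts); /* 200 > 100: not visible. */

    return (first && !second);
}
```

In this model `toy_repeated_read_diverges()` returns true: the same reader sees the update on the first read and not on the second.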

We should only release the snapshot early if the transaction is not timestamped, such as the checkpoint transaction described in the comment. We should also ensure that the transaction can no longer be rolled back after we release the snapshot. Otherwise, repeated reads may still fail even if the transaction is not timestamped.

This can also lead to data corruption or a server crash if the updates are evicted or checkpointed before they are marked as committed. Here's the scenario for data corruption:

Transaction A has done a set of updates.
We start to commit transaction A.
We release transaction A's snapshot and context switch.

Checkpoint writes some updates of the transaction. (If the update is evicted then the following rollback may crash because of freed memory.)
Transaction A resumes commit. However, it hits an error and decides to roll back.

In this case, we may write to disk some updates that should have been reverted. This may explain some of the inconsistent indices we see in the field.

Update
We found that releasing the snapshot by itself does not make the transaction visible to other threads. The transaction must also be removed from the global transaction tables before other threads can see it. Therefore, the situation described above is not a problem. The main issue is that the commit can fail with an out-of-order timestamp error after we have already logged the transaction to the write-ahead log (WAL). The transaction is then aborted in the running database but committed in the WAL. If we crash afterwards and restart, we will wrongly recover the aborted transaction. In the context of MongoDB, we will attempt to recover a transaction that was aborted because of an out-of-order timestamp. The oplog replay will certainly fail again because of the same out-of-order timestamp failure, leading to a crash loop.
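The ordering problem can be illustrated with a small model. This is a sketch under stated assumptions, not the actual fix: all names (`toy_txn_state`, `toy_validate_ts`, `toy_commit_log_first`, `toy_commit_validate_first`, `toy_recovery_diverges`) are hypothetical, and "out of order" is assumed here to mean a commit timestamp older than the newest timestamp already on the key. Validating before logging is shown as one possible ordering that avoids the divergence; the ticket does not say this is how the fix was implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: hypothetical type, not a WiredTiger structure. */
typedef struct {
    bool in_wal;    /* A commit record for the transaction exists in the log. */
    bool committed; /* The transaction is committed in the running database. */
} toy_txn_state;

/* Assumed rule: fail if the commit timestamp is older than the key's newest timestamp. */
static int
toy_validate_ts(uint64_t commit_ts, uint64_t newest_ts_on_key)
{
    return (commit_ts >= newest_ts_on_key ? 0 : -1);
}

/* Problematic ordering: log first, then validate. */
static toy_txn_state
toy_commit_log_first(uint64_t commit_ts, uint64_t newest_ts_on_key)
{
    toy_txn_state st = {false, false};

    st.in_wal = true; /* The commit record reaches the log. */
    if (toy_validate_ts(commit_ts, newest_ts_on_key) != 0)
        return (st); /* Rolled back in memory, but the log record remains. */
    st.committed = true;
    return (st);
}

/* Alternative ordering: validate first, log only if validation passed. */
static toy_txn_state
toy_commit_validate_first(uint64_t commit_ts, uint64_t newest_ts_on_key)
{
    toy_txn_state st = {false, false};

    if (toy_validate_ts(commit_ts, newest_ts_on_key) != 0)
        return (st); /* Nothing was logged, so recovery has nothing to replay. */
    st.in_wal = true;
    st.committed = true;
    return (st);
}

/* Recovery would replay a transaction that the running database aborted. */
static bool
toy_recovery_diverges(toy_txn_state st)
{
    return (st.in_wal && !st.committed);
}
```

With a commit timestamp of 100 against a key whose newest timestamp is 200, `toy_commit_log_first` leaves a state where `toy_recovery_diverges` is true, while `toy_commit_validate_first` does not.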

is caused by

WT-6861 Add the ability to log messages about unexpected timestamp usage

Closed

WT-8170 Reduce complexity of timestamp usage assertion API code

Closed

is related to

WT-15608 Aggregated timestamp validation can fail with a 0 timestamped page deleted structure

Closed

SERVER-16790 Lengthy pauses associated with checkpoints under WiredTiger

Closed

WT-15094 Fix - Update the assert to check if the btree leaf is delta enabled.

Closed

WT-15106 Make cache walk heuristic reset specific to disaggregated storage

Closed

WT-15803 Fix ref not unlocked in error cases

Closed

WT-15210 Change eviction to scrub eviction when the cache usage is less than eviction target

Closed

WT-15563 Investigate making cache tolerant to change app step-wise eviction to incremental eviction

Closed

WT-15548 Disable checkpoint_cleanup config in test/format if all related options are off

Closed

WT-15455 Don't skip prepared update pages during cursor walk

Closed

related to

WT-14392 wt utility overwrites passed config when given -m option

Closed

WT-14750 Failure in __wt_page_inmem: "encountered an illegal file format or internal value: 0x0"

Closed

WT-14686 Disagg testing: add switch mode to test/format

Closed

WT-15005 Revisit configuration options related to page deltas and make changes as necessary

Closed

WT-15033 Optimize cursor->reset calls during search() and search_near() in DSC

Closed

WT-14837 Add metric to measure execution time of block_first_srch()

Closed

WT-12974 Delete unneeded loop in __rec_hs_wrapup()

Closed

WT-14034 Fix resolving the prepared key multiple times because of reserved updates

Closed

WT-14267 test/format assert detects evicting an accessible internal page with an active split generation

Closed


Assignee: Chenhao Qu
Reporter: Chenhao Qu
Votes: 0
Watchers: 6

Created: Jul 26 2025 08:11:09 AM UTC
Updated: Dec 18 2025 04:48:57 AM UTC
Resolved: Jul 30 2025 02:11:40 AM UTC
