- Type: Bug
- Resolution: Fixed
- Priority: Critical - P2
- Affects Version/s: None
- Component/s: Transactions
- Storage Engines, Storage Engines - Transactions
- SE Transactions - 2025-08-01
- 3
- v8.2, v8.1, v8.0, v7.0, v6.0
    /*
     * Release our snapshot in case it is keeping data pinned (this is particularly important for
     * checkpoints). Before releasing our snapshot, copy values into any positioned cursors so they
     * don't point to updates that could be freed once we don't have a snapshot. If this transaction
     * is prepared, then copying values would have been done during prepare.
     */
    if (session->ncursors > 0 && !prepare) {
        WT_DIAGNOSTIC_YIELD;
        WT_ERR(__wt_session_copy_values(session));
    }
    __wt_txn_release_snapshot(session);
In the transaction commit code, we release the snapshot at the start of the function, before marking the updates as committed. This can lead to failed repeated reads if the commit uses a timestamp. Here's an example:
- Transaction A commits with timestamp 200.
- We release the snapshot and context switch.
- Another session starts a read transaction with read timestamp 100.
- It reads the update that was written by transaction A. The update still has timestamp 0 because the commit hasn't finished, so the update is visible to the read transaction.
- Transaction A resumes commit and finishes marking the updates with timestamp 200.
- The read transaction reads the same update again. This time it is not visible because the update now has timestamp 200, which is larger than the read timestamp.
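The interleaving above can be replayed with a toy model. This is a minimal sketch, not WiredTiger code: `toy_update`, `toy_visible`, and `toy_repeated_read_fails` are hypothetical names, and the check looks only at timestamps, assuming the update has already passed any transaction-ID visibility check (which the Update further down in this ticket says does not actually happen).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified update record; this is not WiredTiger's WT_UPDATE. */
typedef struct {
    uint64_t commit_ts; /* Stays 0 until the commit stamps the update. */
} toy_update;

/* Timestamp-only visibility: an update is visible if its timestamp is not newer than the read. */
static bool
toy_visible(const toy_update *upd, uint64_t read_ts)
{
    return (upd->commit_ts <= read_ts);
}

/*
 * Replay the steps listed above and report whether the reader's two reads disagree.
 * Assumption: the reader can already see the update between the two commit steps.
 */
static bool
toy_repeated_read_fails(uint64_t commit_ts, uint64_t read_ts)
{
    toy_update upd = {0}; /* Commit has started, update not yet stamped. */
    bool first, second;

    assert(commit_ts != 0); /* A timestamped commit uses a non-zero timestamp. */

    first = toy_visible(&upd, read_ts);  /* Reader's first read, timestamp still 0. */
    upd.commit_ts = commit_ts;           /* Commit resumes and stamps the update. */
    second = toy_visible(&upd, read_ts); /* Reader's second read. */

    return (first != second);
}
```

With the values from the example, `toy_repeated_read_fails(200, 100)` reports a mismatch, while a commit timestamp at or below the read timestamp does not.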
We should release the snapshot early only if the transaction is not timestamped, such as the checkpoint transaction described in the comment. We should also ensure that the transaction can no longer be rolled back after we release the snapshot. Otherwise, repeated reads may still fail even if the transaction is not timestamped.
This can also lead to data corruption or a server crash if the updates are evicted or checkpointed before they are marked as committed. Here's the scenario for data corruption:
- Transaction A has done a set of updates.
- We start to commit transaction A.
- We release transaction A's snapshot and context switch.
- Checkpoint writes some of the transaction's updates. (If an update is evicted instead, the following rollback may crash because it accesses freed memory.)
- Transaction A resumes commit. However, it hits an error and decides to roll back.
In this case, we may write to disk some updates that should have been reverted. This may explain some of the inconsistent indices we see in the field.
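The hypothesized corruption sequence can also be sketched as a toy model. Everything here is an assumption made for illustration: `toy_upd`, `toy_checkpoint`, `toy_rollback`, and the `snapshot_released` flag are hypothetical, and the model takes it as given that pending updates become visible to checkpoint once the snapshot is released (the Update in this ticket later concludes that they do not).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOY_PENDING, TOY_COMMITTED, TOY_ABORTED } toy_state;

/* Hypothetical update record: its in-memory state and whether checkpoint wrote it. */
typedef struct {
    toy_state state;
    bool on_disk;
} toy_upd;

/* Checkpoint writes committed updates, plus pending ones if they are (wrongly) visible. */
static void
toy_checkpoint(toy_upd *upds, size_t n, bool snapshot_released)
{
    assert(upds != NULL);
    for (size_t i = 0; i < n; i++)
        if (upds[i].state == TOY_COMMITTED ||
          (upds[i].state == TOY_PENDING && snapshot_released))
            upds[i].on_disk = true;
}

/* Rollback aborts every pending update in memory; it cannot undo what is already on disk. */
static void
toy_rollback(toy_upd *upds, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (upds[i].state == TOY_PENDING)
            upds[i].state = TOY_ABORTED;
}

/* Run the scenario and count aborted updates that still reached disk. */
static size_t
toy_corruption_scenario(bool early_release)
{
    toy_upd upds[3] = {{TOY_PENDING, false}, {TOY_PENDING, false}, {TOY_PENDING, false}};
    size_t bad = 0;

    toy_checkpoint(upds, 3, early_release); /* Context switch: checkpoint runs mid-commit. */
    toy_rollback(upds, 3);                  /* Commit resumes, hits an error, rolls back. */

    for (size_t i = 0; i < 3; i++)
        if (upds[i].state == TOY_ABORTED && upds[i].on_disk)
            bad++;
    return (bad);
}
```

In this model, the early-release path leaves all three aborted updates on disk, and the path without early release leaves none.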
Update
We found that releasing the snapshot by itself does not make the transaction visible to other threads. The transaction must also be removed from the global transaction tables before other threads can see it. Therefore, the situations described above are not a problem. The main issue is that the commit may fail, due to an out-of-order timestamp failure, after we have already logged the transaction to the write-ahead log (WAL). The transaction is then aborted in the running database but committed in the WAL. If we crash afterwards and restart, we will wrongly recover the aborted transaction. In the context of MongoDB, we will attempt to recover a transaction that was aborted because of an out-of-order timestamp. The oplog replay will fail again because of the same out-of-order timestamp failure, leading to a crash loop.
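The log-then-fail ordering can be illustrated with a small model. This is a sketch under stated assumptions, not the WiredTiger or MongoDB implementation: `toy_db`, `toy_commit`, and `toy_recover` are hypothetical, "out of order" is simplified to "commit timestamp older than the last applied one", and recovery is reduced to replaying the log with the same check. The `validate_first` flag is included only for contrast; the ticket does not state what the actual fix was.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_LOG_MAX 8

/* Hypothetical database: an in-memory timestamp plus a write-ahead log of commit timestamps. */
typedef struct {
    uint64_t log[TOY_LOG_MAX]; /* Commit timestamps that reached the WAL. */
    size_t nlog;
    uint64_t last_ts; /* Newest commit timestamp applied in memory. */
} toy_db;

/* Simplified out-of-order check: a commit may not be older than the last applied commit. */
static bool
toy_ts_ok(uint64_t last_ts, uint64_t ts)
{
    return (ts >= last_ts);
}

/* Commit one transaction; returns false if it was aborted in the running database. */
static bool
toy_commit(toy_db *db, uint64_t ts, bool validate_first)
{
    if (validate_first && !toy_ts_ok(db->last_ts, ts))
        return (false); /* Fails before anything is logged. */

    assert(db->nlog < TOY_LOG_MAX);
    db->log[db->nlog++] = ts; /* Transaction is now durable in the WAL. */

    if (!toy_ts_ok(db->last_ts, ts))
        return (false); /* Aborted in memory, but already committed in the WAL. */

    db->last_ts = ts;
    return (true);
}

/* Recovery replays the log and fails if a replayed record is out of order. */
static bool
toy_recover(const toy_db *db)
{
    uint64_t last_ts = 0;

    for (size_t i = 0; i < db->nlog; i++) {
        if (!toy_ts_ok(last_ts, db->log[i]))
            return (false); /* The same failure repeats on every restart. */
        last_ts = db->log[i];
    }
    return (true);
}

/* Commit at timestamp 200, then attempt an out-of-order commit at 100, then recover. */
static bool
toy_scenario_recovers(bool validate_first)
{
    toy_db db = {{0}, 0, 0};

    (void)toy_commit(&db, 200, validate_first);
    (void)toy_commit(&db, 100, validate_first); /* Out of order: aborted in memory. */
    return (toy_recover(&db));
}
```

When the check runs only after logging, the aborted transaction is in the log and replay fails each time, which mirrors the crash loop described above. When the same check also runs before logging, the aborted transaction never reaches the log and replay succeeds.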
is caused by
- WT-6861 Add the ability to log messages about unexpected timestamp usage (Closed)
- WT-8170 Reduce complexity of timestamp usage assertion API code (Closed)
is related to
- SERVER-16790 Lengthy pauses associated with checkpoints under WiredTiger (Closed)
- WT-15094 Fix - Update the assert to check if the btree leaf is delta enabled. (Closed)
- WT-15106 Make cache walk heuristic reset specific to disaggregated storage (Closed)
related to
- WT-14686 Disagg testing: add switch mode to test/format (Closed)
- WT-15005 Revisit configuration options related to page deltas and make changes as necessary (Closed)
- WT-15033 Optimize cursor->reset calls during search() and search_near() in DSC (Closed)