Commit may be rolled back after we have logged the transaction


    • Storage Engines, Storage Engines - Transactions
    • SE Transactions - 2025-08-01
    • 3
    • v8.2, v8.1, v8.0, v7.0, v6.0

          /*
           * Release our snapshot in case it is keeping data pinned (this is particularly important for
           * checkpoints). Before releasing our snapshot, copy values into any positioned cursors so they
           * don't point to updates that could be freed once we don't have a snapshot. If this transaction
           * is prepared, then copying values would have been done during prepare.
           */
          if (session->ncursors > 0 && !prepare) {
              WT_DIAGNOSTIC_YIELD;
              WT_ERR(__wt_session_copy_values(session));
          }
          __wt_txn_release_snapshot(session);
      

      In the transaction commit code, we release the snapshot at the start of the function, before marking the updates as committed. This can lead to failed repeated reads if the commit uses a timestamp. Here's an example:

      • Transaction A commits with timestamp 200.
      • We release the snapshot and context switch.
      • Another session starts a read transaction with read timestamp 100.
      • It reads the update that was written by transaction A. The update still has timestamp 0 because the commit hasn't finished, so the update is visible to the read transaction.
      • Transaction A resumes commit and finishes marking the updates with timestamp 200.
      • The read transaction reads the same update again. This time it is not visible because the update now has timestamp 200, which is larger than the read timestamp.
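
      The interleaving above can be replayed in a few lines of C. This is only a toy model under stated assumptions: `toy_update`, `toy_visible` and `toy_repeated_read_fails` are hypothetical names, not WiredTiger structures or functions, and the model checks timestamps only, deliberately ignoring transaction-ID visibility.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy type; this is not a WiredTiger structure. */
typedef struct {
    uint64_t commit_ts; /* 0 until the commit stamps the update */
} toy_update;

/* Timestamp-only visibility: an update is visible if its timestamp <= the read timestamp. */
static bool
toy_visible(const toy_update *upd, uint64_t read_ts)
{
    return (upd->commit_ts <= read_ts);
}

/*
 * Replay the interleaving: the reader looks at the update once before the committer stamps it
 * and once after. Returns true if the two reads disagree, that is, the repeated read fails.
 */
static bool
toy_repeated_read_fails(uint64_t commit_ts, uint64_t read_ts)
{
    toy_update upd = {0}; /* written by transaction A, not yet stamped */
    bool first, second;

    assert(commit_ts != 0); /* the model is about timestamped commits */

    first = toy_visible(&upd, read_ts);  /* first read: timestamp is still 0 */
    upd.commit_ts = commit_ts;           /* transaction A resumes and stamps the update */
    second = toy_visible(&upd, read_ts); /* second read of the same update */

    return (first != second);
}
```

      In this model, a commit timestamp of 200 against a read timestamp of 100 makes the two reads disagree, while a commit timestamp at or below the read timestamp does not.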

      We should only release the snapshot early if the transaction is not timestamped, as is the case for the checkpoint transaction described in the comment. We should also ensure that the transaction can no longer be rolled back after we release the snapshot. Otherwise, repeated reads may still fail even if the transaction is not timestamped.

      This can also lead to data corruption or a server crash if the updates are evicted or checkpointed before they are marked as committed. Here's the scenario for data corruption:

      • Transaction A has done a set of updates.
      • We start to commit transaction A.
      • We release transaction A's snapshot and context switch.
      • Checkpoint writes some of the transaction's updates. (If an update is evicted instead, the following rollback may crash because it accesses freed memory.)
      • Transaction A resumes the commit. However, it hits an error and decides to roll back.

      In this case, we may write to disk some updates that should have been reverted. This may explain some of the inconsistent indexes we see in the field.

      Update
      We found that releasing the snapshot by itself does not make the transaction visible to other threads. The transaction must also be removed from the global transaction tables before other threads can see it. Therefore, the situation described above is not a problem. The main issue is that the commit may fail, due to an out-of-order timestamp failure, after we have already logged the transaction to the write-ahead log (WAL). This means the transaction is aborted in the running database but committed in the WAL. If we crash afterwards and restart, we will wrongly recover the aborted transaction. In the context of MongoDB, we will attempt to recover a transaction that was aborted because of an out-of-order timestamp. The oplog replay will certainly fail again because of the same out-of-order timestamp failure, leading to a crash loop.
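
      The ordering problem can be sketched as a toy model in C. Everything here is an assumption made for illustration: `toy_txn_state`, `toy_timestamp_ok`, `toy_commit_log_first`, `toy_commit_validate_first` and `toy_recovery_diverges` are hypothetical names, and the timestamp check is a simplified stand-in for the real out-of-order timestamp validation. The validate-first variant is shown only for contrast and is not necessarily the fix adopted for this ticket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy state; this is not a WiredTiger structure. */
typedef struct {
    bool logged;    /* a commit record exists in the toy WAL */
    bool committed; /* the transaction committed in the running database */
} toy_txn_state;

/* Simplified stand-in for the out-of-order timestamp check. */
static bool
toy_timestamp_ok(uint64_t commit_ts, uint64_t prev_ts)
{
    return (commit_ts >= prev_ts);
}

/* Problematic ordering: write the log record first, validate the timestamp afterwards. */
static toy_txn_state
toy_commit_log_first(uint64_t commit_ts, uint64_t prev_ts)
{
    toy_txn_state st = {false, false};

    st.logged = true; /* the commit record reaches the WAL */
    if (!toy_timestamp_ok(commit_ts, prev_ts))
        return (st); /* rolled back in memory, but the WAL record remains */
    st.committed = true;
    return (st);
}

/* Contrasting ordering: validate everything that can fail before logging. */
static toy_txn_state
toy_commit_validate_first(uint64_t commit_ts, uint64_t prev_ts)
{
    toy_txn_state st = {false, false};

    if (!toy_timestamp_ok(commit_ts, prev_ts))
        return (st); /* rolled back before anything is logged */
    st.logged = true;
    st.committed = true;
    return (st);
}

/* Recovery replays every logged transaction, so it diverges if one was logged but aborted. */
static bool
toy_recovery_diverges(toy_txn_state st)
{
    assert(st.logged || !st.committed); /* in this model a commit is always logged */
    return (st.logged && !st.committed);
}
```

      With the log-first ordering, a commit that fails the timestamp check leaves a record that recovery would replay even though the running database aborted the transaction. With the validate-first ordering, the same failure leaves nothing in the log.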

              Assignee:
              Chenhao Qu
              Reporter:
              Chenhao Qu
              Votes:
              0
              Watchers:
              6
