Type: Bug
Resolution: Works as Designed
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: repl-transactions
Assigned Teams: Replication
Operating System: ALL
Sprint: Repl 2025-05-12, Repl 2025-05-26, Repl 2025-06-09, Repl 2025-06-23
Case:
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

While a prepared transaction is held open, the stable_timestamp is allowed to advance. As a result, at commit time, the commit_timestamp chosen for a prepared transaction across all shards may be earlier than the current stable_timestamp on a particular node. If a node takes a checkpoint at that stable_timestamp, it’s possible that the transaction operations are reflected in the checkpoint while the corresponding oplog entry and config.transactions update are not. Note that WiredTiger (WT) normally does not allow a transaction's commit_timestamp to be less than the stable_timestamp. However, we make an exception for prepared transactions here.

When a prepared transaction is committed, the node commits two storage transactions: one that includes all of the transaction's operations, and another that writes the CommitTransaction oplog entry together with the update to the config.transactions table. Notably, the commit_timestamp (the WT timestamp) of the first transaction is set to the maximum prepareTimestamp across all shards, while the commit_timestamp of the second transaction, covering the oplog entry and the transactions table, is set via WiredTiger. Because of its commit_timestamp, the first transaction might be included in a checkpoint while the second is excluded. This can happen because there is no guarantee that the oplog entry was successfully journaled when the checkpoint was taken. The result is an inconsistent state: the transaction's operations are visible, but its durable commit log entry does not exist.
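To make the suspected sequence concrete, here is a minimal Python sketch. It is a toy model, not WiredTiger or server code: the `StorageTxn` class, the `in_checkpoint_by_commit_ts` rule, the shard names, and every timestamp value are assumptions made up for illustration. It models only the hypothesis described above, in which checkpoint inclusion is decided by commit_timestamp alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageTxn:
    """Hypothetical stand-in for one storage transaction."""
    name: str
    commit_ts: int


def in_checkpoint_by_commit_ts(txn: StorageTxn, stable_ts: int) -> bool:
    # Assumed (simplified) rule: a checkpoint taken at stable_ts contains
    # every transaction whose commit timestamp is not after stable_ts.
    return txn.commit_ts <= stable_ts


# Illustrative prepare timestamps reported by two shards.
prepare_ts_by_shard = {"shardA": 10, "shardB": 12}

# First storage transaction: the prepared operations, committed at the
# maximum prepareTimestamp across all shards.
data_txn = StorageTxn("prepared operations", max(prepare_ts_by_shard.values()))

# The stable timestamp advanced while the transaction was held open.
stable_ts = 15

# Second storage transaction: oplog entry plus config.transactions update,
# committed at a later timestamp (value chosen for illustration).
metadata_txn = StorageTxn("oplog entry + config.transactions", 20)

included = [
    t.name
    for t in (data_txn, metadata_txn)
    if in_checkpoint_by_commit_ts(t, stable_ts)
]
print(included)
```

Under this assumed rule, only the data transaction lands in the checkpoint, which is the split state the description is concerned about.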

During startup, replication will attempt to reconstruct prepared transactions. This involves looking at the config.transactions table to find open transactions in the prepared state, and applying the individual operations in each transaction. In the case above, the transaction metadata will indicate that it is still in the prepared state, because the config.transactions write was not made durable. However, the data already reflects the outcome of the transaction, so re-applying the operations hits WriteConflicts. At startup this causes a crash loop, degrading cluster availability.
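The startup failure mode described above can be sketched as follows. This is a deliberately simplified illustration, not the server's recovery code: `WriteConflict`, `reconstruct_prepared`, the dictionary-based tables, and the rule that re-applying an already-present key conflicts are all hypothetical stand-ins.

```python
class WriteConflict(Exception):
    """Hypothetical stand-in for the storage engine's write conflict error."""


def reconstruct_prepared(txn_table, txn_ops, data):
    """Re-apply the operations of every transaction still marked prepared.

    txn_table: txn id -> state (toy model of config.transactions)
    txn_ops:   txn id -> list of (key, value) operations
    data:      key -> value (toy model of the checkpointed data)
    """
    for txn_id, state in txn_table.items():
        if state != "prepared":
            continue
        for key, value in txn_ops[txn_id]:
            # Assumed conflict rule: the write was already applied.
            if key in data:
                raise WriteConflict(f"{txn_id}: {key!r} already written")
            data[key] = value


# Metadata still says "prepared" because the config.transactions
# update never became durable.
txn_table = {"txn-1": "prepared"}
txn_ops = {"txn-1": [("doc1", "v1")]}

# The checkpoint, however, already contains the transaction's write.
data = {"doc1": "v1"}

try:
    reconstruct_prepared(txn_table, txn_ops, data)
    startup_ok = True
except WriteConflict:
    # In the scenario described, this would repeat on every restart.
    startup_ok = False

print(startup_ok)
```

With a consistent checkpoint (empty `data`), the same call would succeed and leave the write staged, which is the expected recovery path.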

We have verified that the issue described in this ticket cannot occur because, during a rollback to stable, WiredTiger removes any updates where the durable timestamp is greater than the stable timestamp.
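The rule in the comment above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Update` class, the `rollback_to_stable` function, and the timestamp values are hypothetical and do not reflect WiredTiger's actual data structures or API. It shows why an update whose commit timestamp is behind the stable timestamp is still discarded when its durable timestamp is ahead of it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """Hypothetical update carrying both timestamps discussed in the ticket."""
    key: str
    commit_ts: int
    durable_ts: int


def rollback_to_stable(updates, stable_ts):
    # Rule as stated in the ticket: remove any update whose durable
    # timestamp is greater than the stable timestamp.
    return [u for u in updates if u.durable_ts <= stable_ts]


stable_ts = 15

updates = [
    # Ordinary write, fully behind the stable timestamp.
    Update("doc0", commit_ts=5, durable_ts=5),
    # Prepared transaction's write: commit timestamp behind stable,
    # durable timestamp ahead of it (illustrative values).
    Update("doc1", commit_ts=12, durable_ts=20),
]

survivors = [u.key for u in rollback_to_stable(updates, stable_ts)]
print(survivors)
```

In this model the prepared transaction's write is removed along with its not-yet-durable metadata, so the data and the transaction state stay consistent.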

Assignee: Moustafa Maher
Reporter: Ali Mir
Participants: Ali Mir, Moustafa Maher
Votes: 2
Watchers: 32

Created: Apr 11 2025 01:53:38 PM UTC
Updated: Aug 07 2025 07:46:58 PM UTC
Resolved: Jun 10 2025 08:04:24 PM UTC
