  1. Core Server
  2. SERVER-103717

Committed cross-shard transactions can be re-prepared during replication startup leading to WriteConflicts

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Replication
    • ALL
    • Repl 2025-05-12

      While a prepared transaction is held open, the stable_timestamp is allowed to advance. As a result, at commit time, the commit_timestamp chosen for a prepared transaction across all shards may be earlier than the current stable_timestamp on a particular node. If that node takes a checkpoint at that stable_timestamp, the checkpoint may reflect the transaction’s operations but not the corresponding oplog entry and config.transactions update. Normally WiredTiger (WT) does not allow a transaction’s commit_timestamp to be less than the stable_timestamp; prepared transactions are an exception.

      When a prepared transaction is committed, the node commits two storage transactions: one that includes all of the transaction’s operations, and another that writes the CommitTransaction oplog entry and the update to the config.transactions table. Notably, the commit_timestamp (the WT timestamp) of the first storage transaction is set to the maximum prepareTimestamp across all shards. The commit_timestamp for the subsequent transaction for the oplog entry and transactions table is set via WiredTiger. Because of its commit_timestamp, the first transaction may be included in a checkpoint while the second is excluded. This can happen because there is no guarantee that the oplog entry was successfully journaled when the checkpoint was taken. The result is an inconsistent state: the transaction’s operations are visible, but its durable commit log entry does not exist.
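
      The split between the two storage transactions can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not WiredTiger or server code: the `Write` type, the `checkpoint` function, the rule that a checkpoint keeps exactly the writes with commit_ts <= stable_ts, and all timestamp values are hypothetical. It also assumes the commit-time metadata write never reached the journal, so it does not survive the restart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    table: str
    key: str
    value: str
    commit_ts: int


def checkpoint(writes, stable_ts):
    """Toy rule (assumption): keep only writes with commit_ts <= stable_ts."""
    return [w for w in writes if w.commit_ts <= stable_ts]


# Hypothetical timestamps for one node.
local_prepare_ts = 8   # when this node recorded the transaction as prepared
max_prepare_ts = 10    # maximum prepareTimestamp across all shards
stable_ts = 20         # stable_timestamp advanced while the txn stayed prepared
commit_entry_ts = 25   # timestamp of the CommitTransaction oplog entry

writes = [
    # Metadata written at prepare time.
    Write("config.transactions", "txn-1", "prepared", local_prepare_ts),
    # Storage transaction 1: the transaction's operations.
    Write("user.data", "doc-1", "new-value", max_prepare_ts),
    # Storage transaction 2: the commit-time metadata update.
    Write("config.transactions", "txn-1", "committed", commit_entry_ts),
]

snapshot = checkpoint(writes, stable_ts)
data_visible = any(w.table == "user.data" for w in snapshot)
txn_state = [w.value for w in snapshot if w.table == "config.transactions"][-1]

print(data_visible, txn_state)  # prints: True prepared
```

      In this model the checkpoint holds the transaction’s data while the metadata still says "prepared", which is the inconsistent state described above.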

      During startup, replication attempts to reconstruct prepared transactions. This involves scanning the config.transactions table for open transactions in the prepared state and applying the individual operations of each one. In the case above, the transaction metadata indicates the transaction is still prepared, because the config.transactions write was not made durable. However, the data already reflects the outcome of the transaction, so re-applying its operations hits WriteConflicts. This puts the node into a crash loop at startup, degrading cluster availability.
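
      The startup failure can be sketched in the same spirit. Again, this is a hypothetical illustration rather than the server’s reconstruction logic: `reconstruct_prepared_transactions`, the `written_by` marker, and the conflict rule are invented stand-ins, and the real conflict detection in the storage engine is more involved.

```python
class WriteConflict(Exception):
    """Toy stand-in for the storage engine's write-conflict error."""


def reconstruct_prepared_transactions(txn_table, txn_ops, data):
    """Re-apply the operations of every transaction recorded as 'prepared'."""
    for txn_id, state in txn_table.items():
        if state != "prepared":
            continue
        for key, value in txn_ops[txn_id]:
            existing = data.get(key)
            # Toy rule (assumption): conflict if the recovered data already
            # carries this transaction's write.
            if existing is not None and existing["written_by"] == txn_id:
                raise WriteConflict(f"{txn_id} already applied to {key}")
            data[key] = {"value": value, "written_by": txn_id}


# Hypothetical state recovered after restart in the scenario above.
txn_table = {"txn-1": "prepared"}  # the commit update was not durable
txn_ops = {"txn-1": [("doc-1", "new-value")]}
data = {"doc-1": {"value": "new-value", "written_by": "txn-1"}}  # already committed

try:
    reconstruct_prepared_transactions(txn_table, txn_ops, data)
    outcome = "ok"
except WriteConflict as exc:
    outcome = f"WriteConflict: {exc}"

print(outcome)  # prints: WriteConflict: txn-1 already applied to doc-1
```

      Because the recovered state is the same on every restart, the same conflict recurs each time, which matches the crash loop described above.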

            Assignee:
            m.maher@mongodb.com Moustafa Maher
            Reporter:
            ali.mir@mongodb.com Ali Mir
            Votes:
            2
            Watchers:
            29
