  1. Core Server
  2. SERVER-66110

Downgrading FCV can cause the active txnNumber on TransactionParticipant to change between session yielding and unyielding

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 6.0.0-rc5, 6.1.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport Requested: v6.0
    • Sprint: Sharding NYC 2022-05-16

      Consider an external session with latest txnNumber=5 where the txnNumber corresponds to a retryable write that was executed using an internal transaction.

      1. client0 starts downgrading the FCV to 5.0.
      2. Right after the setFCV thread finishes aborting unprepared transactions and waiting for prepared transactions to complete, client1 starts a transaction with txnNumber=6 and runs a write statement inside it. That statement is executed via the transaction API. To hand off the transaction to the opCtx created by the transaction API, the original opCtx for the transaction yields the TransactionParticipant. The transaction API checks out the session, executes the write statement and checks the session back in. It does not commit the transaction, since the transaction is owned by the external client.
      3. The setFCV thread sets “txnNum” in the config.transactions entry for the external session to 5 and its “lastWriteTime” to {t: 1, ts: Timestamp(1, 0)}. The direct write to config.transactions causes the TransactionParticipant for the external session to be invalidated.
      4. The original opCtx for the transaction unyields the TransactionParticipant. Upon checking out the session, it refreshes the TransactionParticipant from disk. After refresh, the active txnNumber on TransactionParticipant is 5 instead of 6. As described in SERVER-66000, setFCV doesn’t set or unset the “state” field of the config.transactions entry for the external session so there are two cases here:
        • [A] If the txnNumber before txnNumber=5 corresponds to a transaction, the refreshed TransactionParticipant would have state “committed”, so unstashing would fail with TransactionCommitted.
        • [B] If the txnNumber before txnNumber=5 corresponds to a retryable write, the refreshed TransactionParticipant would have state “none”. In that state unstashing returns early, so it does not fail.
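
      The sequence above can be sketched as a toy model. This is only an illustration in Python, not server code: ToyParticipant, set_fcv_direct_write, unyield and run_scenario are hypothetical names, the on-disk entry is a plain dict, and the starting txnNum of 4 is an assumed value for "the txnNumber before txnNumber=5". The one behavior taken from the description is that the direct write changes “txnNum” and “lastWriteTime” but leaves “state” as it was.

```python
from dataclasses import dataclass
from typing import Optional


class TransactionCommitted(Exception):
    """Toy stand-in for the server's TransactionCommitted error."""


@dataclass
class ToyParticipant:
    """Hypothetical in-memory model, not the server's TransactionParticipant."""
    active_txn_number: int
    state: str  # "none", "inProgress" or "committed"
    valid: bool = True


def set_fcv_direct_write(entry: dict, participant: ToyParticipant) -> None:
    """Step 3: rewrite txnNum and lastWriteTime, leave "state" untouched."""
    entry["txnNum"] = 5
    entry["lastWriteTime"] = {"t": 1, "ts": (1, 0)}
    participant.valid = False  # the direct write invalidates the in-memory copy


def unyield(participant: ToyParticipant, entry: dict) -> int:
    """Step 4: refresh from disk if invalidated, then try to unstash."""
    if not participant.valid:
        participant.active_txn_number = entry["txnNum"]
        participant.state = entry.get("state", "none")
        participant.valid = True
    if participant.state == "committed":
        raise TransactionCommitted("case [A]: stale state read back from disk")
    return participant.active_txn_number  # case [B]: no error is raised


def run_scenario(stale_state: Optional[str]) -> int:
    entry = {"txnNum": 4}  # assumed earlier txnNumber on disk (hypothetical)
    if stale_state is not None:
        entry["state"] = stale_state
    # Step 2: client1's transaction with txnNumber=6 is active, then yielded.
    participant = ToyParticipant(active_txn_number=6, state="inProgress")
    set_fcv_direct_write(entry, participant)
    return unyield(participant, entry)
```

      In this model run_scenario(None) returns 5 rather than 6, which is case [B], and run_scenario("committed") raises TransactionCommitted, which is case [A].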

      [A] is problematic: TransactionCommitted is not among the errors that drivers handle, so the error would be returned to the external client, and it is misleading because the transaction has not been committed yet.

      [B] shouldn’t cause any issues: when the client sends a command containing additional statements (without startTransaction) or a commit/abortTransaction command, that command would fail with NoSuchTransaction, which is a transient transaction error, so the drivers would retry the transaction with a higher txnNumber.
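
      The difference between the two outcomes on the client side can be illustrated with a small retry loop. This is a hedged sketch, not any driver's actual implementation: ToyServerError and run_with_retry are hypothetical names, and the only assumption carried over from the description is that NoSuchTransaction is treated as a transient transaction error (modeled here with the "TransientTransactionError" label) while TransactionCommitted is not.

```python
class ToyServerError(Exception):
    """Hypothetical error type carrying a code name and error labels."""

    def __init__(self, code_name, labels=()):
        super().__init__(code_name)
        self.code_name = code_name
        self.labels = frozenset(labels)


def run_with_retry(attempt, first_txn_number, max_attempts=3):
    """Retry the whole transaction with a higher txnNumber on transient errors."""
    txn_number = first_txn_number
    for _ in range(max_attempts):
        try:
            return attempt(txn_number)
        except ToyServerError as err:
            if "TransientTransactionError" not in err.labels:
                raise  # e.g. TransactionCommitted reaches the application
            txn_number += 1  # start over with the next txnNumber
    raise RuntimeError("gave up after %d attempts" % max_attempts)


def case_b(txn_number):
    # txnNumber=6 is no longer the active one after the refresh.
    if txn_number == 6:
        raise ToyServerError("NoSuchTransaction", ["TransientTransactionError"])
    return txn_number


def case_a(txn_number):
    raise ToyServerError("TransactionCommitted")
```

      Here run_with_retry(case_b, 6) succeeds on the second attempt with txnNumber=7, whereas run_with_retry(case_a, 6) re-raises the TransactionCommitted error on the first attempt.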

      Similar issues also exist for internal transactions for retryable writes, although the steps differ:

      1. client0 runs a retryable write statement with txnNumber=6. To hand off the retryable write to the opCtx created by the transaction API, the original opCtx for the retryable write yields the TransactionParticipant. The transaction API checks out an internal session for the retryable write, executes the write statement in a transaction in that session, commits the transaction and checks the session back in.
      2. client1 runs setFCV: 5.0 to completion.
      3. The opCtx for the retryable write unyields the TransactionParticipant. As in the first scenario, this either fails with TransactionCommitted or does not fail, for the same reason (the active txnNumber has changed).

            Assignee:
            cheahuychou.mao@mongodb.com Cheahuychou Mao
            Reporter:
            cheahuychou.mao@mongodb.com Cheahuychou Mao