  1. Core Server
  2. SERVER-66993

TransactionParticipants can be refreshed incorrectly after failover/restart

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 6.0.0-rc9, 6.1.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Labels:
    • Fully Compatible
    • ALL
    • v6.0
    • Sharding NYC 2022-06-13

      To reduce the number of matching session ids in the covered query on "parentLsid" during a TransactionParticipant refresh (see the comment below for details on how TransactionParticipant refreshes work), we look up the highest txnNumber known in memory and use it in the find filter. This txnNumber is currently computed as the max of the lsid.txnNumber for the sessions that have been checked out from the SessionCatalog. Here is how this optimization can cause TransactionParticipants to get refreshed incorrectly after failover or restart.

      1. mongos0 runs an internal transaction with session id {id, uid, 5, txnUUID} until it has entered prepare.
      2. The replica set shuts down. Upon restart, each node reconstructs the prepared internal transaction from step 1. This only sets up the TransactionParticipant for the internal transaction session (via MongoDOperationContextSessionWithoutRefresh) and so the TransactionParticipant for the parent session is not marked valid. 
      3. mongos1 starts an internal transaction with session id {id, uid, 6, txnUUID}. Checking out the session causes the TransactionParticipant to get refreshed. However, from scanning the SessionCatalog, it thinks that the highest txnNumber is 6, so it only looks for session ids with lsid.txnNumber >= 6. It finds no matching session ids and concludes that the txnNumber to recover is 6. As a result, it finds no other TransactionParticipants to refresh (i.e. the only TransactionParticipant in RetryableWriteTransactionParticipantCatalog is the one for {id, uid, 6, txnUUID}), and the new transaction is able to start without getting blocked.
      4. The transaction from step 1 is stuck in prepare.
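
      Steps 1 to 3 can be modeled with a small standalone sketch. This is not server code: ChildSession, findChildSessions, highestTxnNumberFromLsids and sessionsFoundAfterRestart are hypothetical names, the parent {id, uid} is reduced to an int, and the txnUUID is omitted. It only illustrates how a lower bound taken from the in-memory session ids filters out the prepared transaction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified child session id: {parent (id, uid), txnNumber}.
// The txnUUID is omitted because it plays no role in the filter.
struct ChildSession {
    int parent;              // stands in for the parent session's {id, uid}
    std::int64_t txnNumber;  // lsid.txnNumber
    bool prepared;           // true if the transaction is in prepare
};

// Models the covered query on "parentLsid": same parent and
// lsid.txnNumber >= lowerBound.
std::vector<ChildSession> findChildSessions(const std::vector<ChildSession>& persisted,
                                            int parent,
                                            std::int64_t lowerBound) {
    std::vector<ChildSession> matches;
    for (const auto& s : persisted) {
        if (s.parent == parent && s.txnNumber >= lowerBound) {
            matches.push_back(s);
        }
    }
    return matches;
}

// The problematic inference: max lsid.txnNumber over the session ids in memory.
std::int64_t highestTxnNumberFromLsids(const std::vector<ChildSession>& inMemory,
                                       int parent) {
    std::int64_t highest = -1;  // -1 means "no session seen" in this sketch
    for (const auto& s : inMemory) {
        if (s.parent == parent) {
            highest = std::max(highest, s.txnNumber);
        }
    }
    return highest;
}

// Replays steps 1-3 and returns how many persisted sessions the refresh finds.
std::size_t sessionsFoundAfterRestart() {
    const int parent = 1;
    // Step 1: the transaction with txnNumber 5 is persisted in prepare.
    std::vector<ChildSession> persisted{{parent, 5, true}};
    // Step 2: restart reconstructs the prepared child session in memory.
    std::vector<ChildSession> inMemory{{parent, 5, true}};
    // Step 3: mongos1 checks out a session with txnNumber 6 (not yet persisted).
    inMemory.push_back({parent, 6, false});

    const std::int64_t bound = highestTxnNumberFromLsids(inMemory, parent);
    assert(bound == 6);  // the mere presence of the new session id raises the bound
    return findChildSessions(persisted, parent, bound).size();
}
```

      In this model sessionsFoundAfterRestart() returns 0: the prepared session with txnNumber 5 falls below the inferred bound, so nothing is refreshed for it.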

      So the issue here is that it is incorrect to infer the active txnNumber from the session ids. SERVER-66852 introduced a method named OperationContextSession::observeNewTxnNumberStarted() which is called when a new txnNumber starts. This is the correct number to use as the highest txnNumber known in memory.
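
      A minimal sketch of the proposed direction, under assumptions: ParentSessionState, its members, and the -1 sentinel are hypothetical, and the real signature of observeNewTxnNumberStarted() is not given in this ticket. The idea shown is only that the lower bound advances when a txnNumber is observed to start, not when a session id carrying a higher txnNumber is checked out.

```cpp
#include <cstdint>

// Hypothetical per-parent-session state; not the server's actual class.
class ParentSessionState {
public:
    // Assumed analogue of observeNewTxnNumberStarted(): records that a
    // transaction with this txnNumber has actually started.
    void observeNewTxnNumberStarted(std::int64_t txnNumber) {
        if (txnNumber > _highestStarted) {
            _highestStarted = txnNumber;
        }
    }

    // Whether a persisted child session with this lsid.txnNumber passes the
    // refresh filter. With nothing observed yet, no lower bound is applied.
    bool matchesRefreshFilter(std::int64_t lsidTxnNumber) const {
        return _highestStarted == kNoneObserved || lsidTxnNumber >= _highestStarted;
    }

    std::int64_t highestStarted() const {
        return _highestStarted;
    }

private:
    static constexpr std::int64_t kNoneObserved = -1;  // sentinel for this sketch
    std::int64_t _highestStarted = kNoneObserved;
};
```

      With this model, checking out a session id with txnNumber 6 after a restart leaves the bound unchanged, so the refresh filter still matches the prepared session with txnNumber 5. The bound moves to 6 only once txnNumber 6 is observed to start.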

            jack.mulrow@mongodb.com Jack Mulrow
            cheahuychou.mao@mongodb.com Cheahuychou Mao