[SERVER-66993] TransactionParticipants can be refreshed incorrectly after failover/restart Created: 03/Jun/22  Updated: 29/Oct/23  Resolved: 07/Jun/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.0.0-rc9, 6.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Cheahuychou Mao Assignee: Jack Mulrow
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.0
Sprint: Sharding NYC 2022-06-13
Participants:

 Description   

To reduce the number of matching session ids in the covered query on "parentLsid" during a TransactionParticipant refresh (see the comment below for details on how TransactionParticipant refreshes work), we look up the highest txnNumber known in memory and use it in the find filter. This txnNumber is currently computed as the max of lsid.txnNumber over the sessions that have been checked out from the SessionCatalog. Here is how this optimization can cause TransactionParticipants to be refreshed incorrectly after a failover or restart.

  1. mongos0 runs an internal transaction with session id {id, uid, 5, txnUUID} until it has entered prepare.
  2. The replica set shuts down. Upon restart, each node reconstructs the prepared internal transaction from step 1. This only sets up the TransactionParticipant for the internal transaction session (via MongoDOperationContextSessionWithoutRefresh) and so the TransactionParticipant for the parent session is not marked valid. 
  3. mongos1 starts an internal transaction with session id {id, uid, 6, txnUUID}. Checking out the session causes the TransactionParticipant to get refreshed. However, from scanning the SessionCatalog, it thinks that the highest txnNumber is 6, so it only looks for session ids with lsid.txnNumber >= 6. It finds no matching session ids and concludes that the txnNumber to recover is 6. As a result, it finds no other TransactionParticipants to refresh (i.e. the only TransactionParticipant in RetryableWriteTransactionParticipantCatalog is the one for {id, uid, 6, txnUUID}), and the new transaction is able to start without getting blocked.
  4. The transaction from step 1 is stuck in prepare.

So the issue here is that it is incorrect to infer the active txnNumber from the session ids. SERVER-66852 introduced a method named OperationContextSession::observeNewTxnNumberStarted(), which is called when a new txnNumber starts. This is the correct number to use as the highest txnNumber known in memory.
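
The scenario in steps 1-4 can be sketched as a toy model. This is not server code: the names ChildLsid, find_children and highest_checked_out are hypothetical, the on-disk collection is a plain list, and the model assumes that after a restart no txnNumber has yet been observed as started for the parent session, so the corrected lookup applies no lower bound.

```python
from typing import List, NamedTuple, Optional


class ChildLsid(NamedTuple):
    """Hypothetical stand-in for an internal transaction session id."""
    parent_id: str
    txn_number: int
    txn_uuid: str


def find_children(on_disk: List[ChildLsid], parent_id: str,
                  min_txn_number: Optional[int]) -> List[ChildLsid]:
    """Toy version of the find on "parentLsid" with an optional txnNumber bound."""
    return [s for s in on_disk
            if s.parent_id == parent_id
            and (min_txn_number is None or s.txn_number >= min_txn_number)]


def highest_checked_out(checked_out: List[ChildLsid],
                        parent_id: str) -> Optional[int]:
    """The problematic bound: max txnNumber over checked-out session ids."""
    nums = [s.txn_number for s in checked_out if s.parent_id == parent_id]
    return max(nums) if nums else None


# Step 1-2: the prepared transaction with txnNumber 5 survives the restart.
prepared = ChildLsid("id", 5, "uuid-a")
on_disk = [prepared]

# Step 3: a new session with txnNumber 6 is checked out before the refresh.
incoming = ChildLsid("id", 6, "uuid-b")
checked_out = [prepared, incoming]

# Bound inferred from session ids: 6, so the prepared session is filtered out.
buggy_bound = highest_checked_out(checked_out, "id")
buggy_matches = find_children(on_disk, "id", buggy_bound)

# Assumption: nothing has been observed as started since the restart,
# so the lookup is unbounded and the prepared session is found.
observed_started: Optional[int] = None
fixed_matches = find_children(on_disk, "id", observed_started)

print(buggy_bound, buggy_matches, fixed_matches)
```

In this model the bound derived from checked-out session ids hides the prepared transaction from the refresh, while a bound based only on txnNumbers actually observed as started does not.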



 Comments   
Comment by Githook User [ 07/Jun/22 ]

Author: Jack Mulrow <jack.mulrow@mongodb.com> (jsmulrow)

Message: SERVER-66993 Stop tracking highest child session txnNumber

(cherry picked from commit f0a44617983763b0fdc1f37bdc35e391184ea63d)
Branch: v6.0
https://github.com/mongodb/mongo/commit/90cccd16effc6e4056fd9d297ffc1e6d60b9d01c

Comment by Githook User [ 07/Jun/22 ]

Author: Jack Mulrow <jack.mulrow@mongodb.com> (jsmulrow)

Message: SERVER-66993 Stop tracking highest child session txnNumber
Branch: master
https://github.com/mongodb/mongo/commit/f0a44617983763b0fdc1f37bdc35e391184ea63d
