[SERVER-41556] Must handle failure to reacquire locks and ticket when unstashing transaction Created: 05/Jun/19  Updated: 29/Oct/23  Resolved: 15/Jul/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.2.0, 4.3.1

Type: Bug Priority: Major - P3
Reporter: Matthew Russotto Assignee: Suganthi Mani
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-42381 A failed unstashTransactionResources ... Closed
is related to SERVER-41881 Stashing the lock resources for prepa... Closed
is related to SERVER-41883 UnConditional step down should not se... Closed
is related to SERVER-41976 Server should not attach TransientTra... Closed
is related to SERVER-41980 Non-transactional commands can deadlo... Closed
is related to SERVER-42388 Release the transaction lock resource... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2
Sprint: Repl 2019-07-01, Repl 2019-07-15, Repl 2019-07-29
Linked BF Score: 17

 Description   

In TxnParticipant::TxnResources::release(OperationContext*), either _locker->restoreWriteUnitOfWorkAndLock() or _locker->reacquireTicket(opCtx) can fail. If _locker->reacquireTicket() fails, the locker may be left in an inconsistent state (holding locks but not the ticket). Further, because of the swap() we do in TransactionParticipant::Participant::_releaseTransactionResourcesToOpCtx, if either call fails we lose the TxnResources object entirely. If the transaction was prepared, it is now in a prepared state without a stash, which results in a crash the next time it is used. If it was not prepared, it is now effectively aborted, though not marked as such.
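A minimal C++ sketch of the hazard described above, assuming simplified stand-in types. StashedResources, Participant and unstashWithSwapFirst are hypothetical names, not the server's real classes, and release() simply throws when failOnRestore is set, to simulate restoreWriteUnitOfWorkAndLock() or reacquireTicket() failing. Because the stash is swapped into a temporary before release() runs, an exception destroys the only copy.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for a transaction's stashed lock state.
// Not MongoDB's TxnResources; names are illustrative only.
struct StashedResources {
    bool failOnRestore = false;  // simulates a failed lock/ticket reacquisition

    // Hypothetical analogue of TxnResources::release(): may throw.
    void release() {
        if (failOnRestore) {
            throw std::runtime_error("failed to reacquire locks or ticket");
        }
    }
};

// Hypothetical participant owning an optional stash.
struct Participant {
    std::optional<StashedResources> stash;
    bool prepared = false;
};

// Problematic pattern: the stash is moved into a temporary *before*
// release() runs, so an exception from release() loses it for good.
void unstashWithSwapFirst(Participant& p) {
    std::optional<StashedResources> temp;
    std::swap(temp, p.stash);  // p.stash is now empty
    if (!temp) {
        return;  // nothing was stashed
    }
    temp->release();  // if this throws, temp (the only copy) is destroyed
}
```

For example, calling unstashWithSwapFirst() on a prepared Participant whose stash has failOnRestore set throws std::runtime_error, and afterwards p.stash is empty while p.prepared is still true: a prepared transaction with no stash.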

For prepared transactions, we need to ensure that a failed release() leaves the transaction as-is. This would work for other operations as well, but since we usually abort on transaction errors, we may want to force an abort in that case too.
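One way to get this behavior, sketched in C++ with the same kind of hypothetical stand-in types (this is not necessarily what the eventual commit does): call release() while the participant still owns the stash, and drop the stash only after release() succeeds. On failure, a prepared transaction keeps its stash untouched, while an unprepared one is marked aborted, reflecting the optional forced abort mentioned above. The aborted flag and unstashKeepOnFailure are illustrative names.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>

// Hypothetical stand-in for a transaction's stashed lock state.
struct StashedResources {
    bool failOnRestore = false;  // simulates a failed lock/ticket reacquisition

    void release() {
        if (failOnRestore) {
            throw std::runtime_error("failed to reacquire locks or ticket");
        }
    }
};

// Hypothetical participant; 'aborted' models the optional forced abort.
struct Participant {
    std::optional<StashedResources> stash;
    bool prepared = false;
    bool aborted = false;
};

// Attempt the risky step while the participant still owns the stash,
// and give up ownership only once release() has succeeded.
void unstashKeepOnFailure(Participant& p) {
    if (!p.stash) {
        return;  // nothing was stashed
    }
    try {
        p.stash->release();
    } catch (...) {
        if (!p.prepared) {
            // Assumed policy for unprepared transactions: abort explicitly
            // instead of leaving them silently half-aborted.
            p.stash.reset();
            p.aborted = true;
        }
        // Prepared transactions keep their stash exactly as it was.
        throw;
    }
    // Success: the resources now belong to the operation (not modeled here).
    p.stash.reset();
}
```

Performing the fallible step before transferring ownership gives a strong exception guarantee for the prepared case; a scope guard that swaps the stash back on failure would be an alternative way to achieve the same effect.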



 Comments   
Comment by Githook User [ 25/Jul/19 ]

Author:

{'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'}

Message: SERVER-41881 Unstashing the transaction lock resources should ignore the saved value of maxLockTimeout and explicitly set the maxLockTimeout based on node's state.
SERVER-41883 Replication state transition reacquires locks and tickets of a prepared transaction with no lock timeout.
SERVER-41556 Handles failure to reacquire locks and ticket when unstashing transactions.

(cherry picked from commit 2ff54098b19ebc2b4bbf5516de6e6befb46f9fe7)
Branch: v4.2
https://github.com/mongodb/mongo/commit/7be60542714d7cd17a6ecf5e43a47bb7aef9a0d5

Comment by Githook User [ 25/Jul/19 ]

Author:

{'name': 'Suganthi Mani', 'username': 'smani87', 'email': 'suganthi.mani@mongodb.com'}

Message: SERVER-41881 Unstashing the transaction lock resources should ignore the saved value of maxLockTimeout and explicitly set the maxLockTimeout based on node's state.
SERVER-41883 Replication state transition reacquires locks and tickets of a prepared transaction with no lock timeout.
SERVER-41556 Handles failure to reacquire locks and ticket when unstashing transactions.
Branch: master
https://github.com/mongodb/mongo/commit/2ff54098b19ebc2b4bbf5516de6e6befb46f9fe7

Comment by Suganthi Mani [ 24/Jul/19 ]

Author:

{'name': 'Ian Boros', 'email': 'puppyofkosh@gmail.com', 'username': 'puppyofkosh'}

Message: Revert "SERVER-41881 Unstashing the transaction lock resources should ignore the saved value of maxLockTimeout and explicitly set the maxLockTimeout based on node's state."

This reverts commit e707fd09ef0dadbb33510249732fd38c654da914.
Branch: v4.2
https://github.com/mongodb/mongo/commit/65a1db06e9a88e7d96e1359662f5480f939c0e5b

Comment by Suganthi Mani [ 24/Jul/19 ]

Author:

{'name': 'Ian Boros', 'email': 'puppyofkosh@gmail.com', 'username': 'puppyofkosh'}

Message: Revert "SERVER-41881 Unstashing the transaction lock resources should ignore the saved value of maxLockTimeout and explicitly set the maxLockTimeout based on node's state."

This reverts commit b7cec5064fb03f1e1f9bd39af35e495facfdcdc9.
Branch: master
https://github.com/mongodb/mongo/commit/38e92ef5b9fe645cd73fec3742c0fde9caea0cb2

Comment by Githook User [ 15/Jul/19 ]

Author:

{'name': 'Suganthi Mani', 'username': 'smani87', 'email': 'suganthi.mani@mongodb.com'}

Message: SERVER-41881 Unstashing the transaction lock resources should ignore the saved value of maxLockTimeout and explicitly set the maxLockTimeout based on node's state.
SERVER-41883 Replication state transition reacquires locks and tickets of a prepared transaction with no lock timeout.
SERVER-41556 Handles failure to reacquire locks and ticket when unstashing transactions.

(cherry picked from commit b7cec5064fb03f1e1f9bd39af35e495facfdcdc9)
Branch: v4.2
https://github.com/mongodb/mongo/commit/e707fd09ef0dadbb33510249732fd38c654da914

Comment by Githook User [ 15/Jul/19 ]

Author:

{'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'}

Message: SERVER-41881 Unstashing the transaction lock resources should ignore the saved value of maxLockTimeout and explicitly set the maxLockTimeout based on node's state.
SERVER-41883 Replication state transition reacquires locks and tickets of a prepared transaction with no lock timeout.
SERVER-41556 Handles failure to reacquire locks and ticket when unstashing transactions.
Branch: master
https://github.com/mongodb/mongo/commit/b7cec5064fb03f1e1f9bd39af35e495facfdcdc9

Comment by Judah Schvimer [ 06/Jun/19 ]

Since this is a 4.2 bug, we should do this next iteration.

Comment by Matthew Russotto [ 06/Jun/19 ]

The relevant code for 4.0 is

https://github.com/mongodb/mongo/blob/37c22a95d498d079022322b2b9ea6c2713f3d1f5/src/mongo/db/session.cpp#L782

and

https://github.com/mongodb/mongo/blob/37c22a95d498d079022322b2b9ea6c2713f3d1f5/src/mongo/db/session.cpp#L661

Since we don't restore yielded locks and don't use a temporary to hold the TxnResources, this bug isn't in 4.0.

Comment by Judah Schvimer [ 05/Jun/19 ]

matthew.russotto, can this happen on 4.0 as well?

Generated at Thu Feb 08 04:58:04 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.