[SERVER-38257] TransactionParticipant::abortArbitraryTransaction can cause incorrect transaction metrics to be recorded Created: 26/Nov/18  Updated: 29/Oct/23  Resolved: 01/Apr/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.1.5
Fix Version/s: 4.1.10

Type: Bug Priority: Minor - P4
Reporter: Esha Maharishi (Inactive) Assignee: Lingzhi Deng
Resolution: Fixed Votes: 0
Labels: open_todo_in_code, prepare_diagnostics
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2019-04-08
Participants:
Linked BF Score: 0

 Description   

Based on my conversation with tess.avitabile, I think the following bug exists in transaction metrics active and inactive counts (the bug may affect other metrics as well, I'm not sure):

If a transaction is aborted, whether the number of active or inactive transactions is decremented depends on whether the TxnResources were stashed at the time of the abort:

This usually works because in a transaction request's flow:

  • First, TransactionParticipant::beginOrContinue calls TransactionMetricsObserver::onStart, which increments the number of currently inactive transactions
  • Later, TransactionParticipant::unstashTransactionResources calls TransactionMetricsObserver::onUnstash (both if the TxnResources already exist, or if they were just created), which increments the number of currently active transactions and decrements the number of currently inactive transactions.

However, TransactionParticipant::abortArbitraryTransaction can be called outside of a checked out Session, and so the following sequence can happen, which causes the metrics to be incorrect for a short period:

// inactive: 0
// active: 0
 
// Starts *new* transaction; increments inactive count
Thread 1: TransactionParticipant::beginOrContinue
 
// inactive: 1
// active: 0
 
// TxnResources have not been created, so _txnResourceStash is boost::none; interprets this as meaning the transaction is active and decrements active count
Thread 2: TransactionParticipant::abortArbitraryTransaction
 
// inactive: 1 <----- metric incorrect
// active: -1 <----- metric incorrect
 
Thread 1: TransactionParticipant::unstashTransactionResources
 
// inactive: 0 <----- metric remedied
// active: 0 <----- metric remedied

However, if TransactionParticipant::unstashTransactionResources throws before calling TransactionMetricsObserver:onUnstash, for example by timing out waiting to acquire the GlobalLock, then the inactive and active counts may remain permanently incorrect. 



 Comments   
Comment by Githook User [ 01/Apr/19 ]

Author:

{'email': 'lingzhi.deng@mongodb.com', 'name': 'Lingzhi Deng', 'username': 'ldennis'}

Message: SERVER-38257: Make transactionMetricsObserver.onAbort self-contained

Comment by Tess Avitabile (Inactive) [ 20/Feb/19 ]

Got it, that sounds fine! Same for the other ticket.

Comment by Judah Schvimer [ 20/Feb/19 ]

I wanted to signal (mostly to myself) that this was lower priority in the prepare epic, but still required. This seemed like the best way. I can change it back if you'd prefer not to (somewhat) overload the meaning of the field.

Comment by Tess Avitabile (Inactive) [ 20/Feb/19 ]

judah.schvimer, why has the priority been changed to minor?

Comment by Githook User [ 11/Jan/19 ]

Author:

{'username': 'tessavitabile', 'email': 'tess.avitabile@mongodb.com', 'name': 'Tess Avitabile'}

Message: SERVER-38257 Disable allCountsNonNegative invariant in multi_statement_transaction_atomicity_isolation_metrics_test.js
Branch: master
https://github.com/mongodb/mongo/commit/eaa91b0822c4ad59cb72e9274b18f4d32ce2eacf

Comment by Tess Avitabile (Inactive) [ 12/Dec/18 ]

That seems like a reasonable solution to me to have two in-progress states. If we don't like how it looks, we could consult the metrics tracker to check the active/inactive state.

That seems quite plausible that this is the cause of BF-11283.

Comment by William Schultz (Inactive) [ 12/Dec/18 ]

tess.avitabile At this point could we consider just making kInProgressActive and kInProgressInactive states in TransactionState? They would be sub-states of kInProgress. I think this would be a clear way to resolve the ambiguity around how to check whether a transaction is active or inactive. We initially used the presence/absence of _txnResourceStash to check this because it seemed appropriate, but it is becoming clear that this doesn't quite work in all cases. We could use the metrics tracker to check the active/inactive state, but we already have TransactionState, which is a nice state machine mechanism for validating state transitions. I think this approach would have also allowed us to solve the issue from SERVER-37189 in a cleaner manner.

On a related note, I wonder if BF-11283 is a manifestation of this issue. I have not investigated thoroughly, but the symptoms look like what is described in this ticket description i.e. the metrics end up with currentInactive=1, currentActive=-1. It has occurred very frequently, and recently i.e. BFG-115312.

Comment by Tess Avitabile (Inactive) [ 12/Dec/18 ]

inMultiDocumentTransaction() just checks whether the state is kInProgress. This does not distinguish between "active" (an operation has unstashed resources) and "inactive". A transaction remains in the kInProgress state until it is prepared, committed, or aborted.

Comment by Kaloian Manassiev [ 12/Dec/18 ]

What about using inMultiDocumentTransaction for this check instead? That's what I was planning to use for SERVER-38456.

Comment by Tess Avitabile (Inactive) [ 11/Dec/18 ]

I believe there is still a bug here, even after it's no longer possible to modify a TransactionParticipant without checking out the session.

1) Operation runs unstashTransactionResources(), which throws. The transaction is in state kInProgress, _txnResourceStash is empty, and the transaction is considered "inactive" in the metrics.

2) killSessions, transaction expiration thread, or stepdown/shutdown checks out the session and runs abortArbitraryTransaction(). Since _txnResourceStash is empty, this calls _transactionMetricsObserver.onAbortActive(). (I don't think abortArbitraryTransaction() will go away–it will just no longer run concurrently with transaction operations.)

We should not rely on _txnResourceStash to determine if a transaction is active. One option is that abortArbitraryTransaction() could now always call _transactionMetricsObserver.onAbortInactive(), since it no longer can run concurrently with a transaction operation. Another option is to go back to the 4.0 model, where we consult the metrics-tracker about whether the transaction is in the active or inactive state.

Comment by Judah Schvimer [ 11/Dec/18 ]

kaloian.manassiev, I don't see how this could happen without abortArbitraryTransaction still, since that was how we got the incorrect metrics in the first place. Am I missing anything? esha.maharishi, thoughts?

Comment by Kaloian Manassiev [ 10/Dec/18 ]

After SERVER-37923, transaction abort will always check-out the session (aborting the current owner operation context if necessary) so there should never be a case where unstashTransactionResources threw because the participant became invalidated due to a concurrent abort. However, looking at the code there are other situations where it can throw, for example where read concern is passed after the first statement. Not sure what the effect is of those.

Comment by Gregory McKeon (Inactive) [ 03/Dec/18 ]

Since kaloian.manassiev is removing TransactionParticipant::abortArbitraryTransaction, we believe this will go away. We should check that this isn't a bug in 4.0 and mark as dependent on the Session catalog refactor for 4.1.

Generated at Thu Feb 08 04:48:25 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.