[SERVER-38257] TransactionParticipant::abortArbitraryTransaction can cause incorrect transaction metrics to be recorded Created: 26/Nov/18 Updated: 29/Oct/23 Resolved: 01/Apr/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 4.1.5 |
| Fix Version/s: | 4.1.10 |
| Type: | Bug | Priority: | Minor - P4 |
| Reporter: | Esha Maharishi (Inactive) | Assignee: | Lingzhi Deng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | open_todo_in_code, prepare_diagnostics | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Sprint: | Repl 2019-04-08 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 0 | ||||||||
| Description |
|
Based on my conversation with tess.avitabile, I think the following bug exists in transaction metrics active and inactive counts (the bug may affect other metrics as well, I'm not sure): — If a transaction is aborted, whether the number of active or inactive transactions is decremented depends on whether the TxnResources were stashed at the time of the abort:
This usually works because in a transaction request's flow:
However, TransactionParticipant::abortArbitraryTransaction can be called outside of a checked out Session, and so the following sequence can happen, which causes the metrics to be incorrect for a short period:
However, if TransactionParticipant::unstashTransactionResources throws before calling TransactionMetricsObserver:onUnstash, for example by timing out waiting to acquire the GlobalLock, then the inactive and active counts may remain permanently incorrect. |
| Comments |
| Comment by Githook User [ 01/Apr/19 ] |
|
Author: {'email': 'lingzhi.deng@mongodb.com', 'name': 'Lingzhi Deng', 'username': 'ldennis'}Message:
|
| Comment by Tess Avitabile (Inactive) [ 20/Feb/19 ] |
|
Got it, that sounds fine! Same for the other ticket. |
| Comment by Judah Schvimer [ 20/Feb/19 ] |
|
I wanted to signal (mostly to myself) that this was lower priority in the prepare epic, but still required. This seemed like the best way. I can change it back if you'd prefer not to (somewhat) overload the meaning of the field. |
| Comment by Tess Avitabile (Inactive) [ 20/Feb/19 ] |
|
judah.schvimer, why has the priority been changed to minor? |
| Comment by Githook User [ 11/Jan/19 ] |
|
Author: {'username': 'tessavitabile', 'email': 'tess.avitabile@mongodb.com', 'name': 'Tess Avitabile'}Message: |
| Comment by Tess Avitabile (Inactive) [ 12/Dec/18 ] |
|
That seems like a reasonable solution to me to have two in-progress states. If we don't like how it looks, we could consult the metrics tracker to check the active/inactive state. That seems quite plausible that this is the cause of BF-11283. |
| Comment by William Schultz (Inactive) [ 12/Dec/18 ] |
|
tess.avitabile At this point could we consider just making kInProgressActive and kInProgressInactive states in TransactionState? They would be sub-states of kInProgress. I think this would be a clear way to resolve the ambiguity around how to check whether a transaction is active or inactive. We initially used the presence/absence of _txnResourceStash to check this because it seemed appropriate, but it is becoming clear that this doesn't quite work in all cases. We could use the metrics tracker to check the active/inactive state, but we already have TransactionState, which is a nice state machine mechanism for validating state transitions. I think this approach would have also allowed us to solve the issue from On a related note, I wonder if BF-11283 is a manifestation of this issue. I have not investigated thoroughly, but the symptoms look like what is described in this ticket description i.e. the metrics end up with currentInactive=1, currentActive=-1. It has occurred very frequently, and recently i.e. BFG-115312. |
| Comment by Tess Avitabile (Inactive) [ 12/Dec/18 ] |
|
inMultiDocumentTransaction() just checks whether the state is kInProgress. This does not distinguish between "active" (an operation has unstashed resources) and "inactive". A transaction remains in the kInProgress state until it is prepared, committed, or aborted. |
| Comment by Kaloian Manassiev [ 12/Dec/18 ] |
|
What about using inMultiDocumentTransaction for this check instead? That's what I was planning to use for |
| Comment by Tess Avitabile (Inactive) [ 11/Dec/18 ] |
|
I believe there is still a bug here, even after it's no longer possible to modify a TransactionParticipant without checking out the session. 1) Operation runs unstashTransactionResources(), which throws. The transaction is in state kInProgress, _txnResourceStash is empty, and the transaction is considered "inactive" in the metrics. 2) killSessions, transaction expiration thread, or stepdown/shutdown checks out the session and runs abortArbitraryTransaction(). Since _txnResourceStash is empty, this calls _transactionMetricsObserver.onAbortActive(). (I don't think abortArbitraryTransaction() will go away–it will just no longer run concurrently with transaction operations.) We should not rely on _txnResourceStash to determine if a transaction is active. One option is that abortArbitraryTransaction() could now always call _transactionMetricsObserver.onAbortInactive(), since it no longer can run concurrently with a transaction operation. Another option is to go back to the 4.0 model, where we consult the metrics-tracker about whether the transaction is in the active or inactive state. |
| Comment by Judah Schvimer [ 11/Dec/18 ] |
|
kaloian.manassiev, I don't see how this could happen without abortArbitraryTransaction still, since that was how we got the incorrect metrics in the first place. Am I missing anything? esha.maharishi, thoughts? |
| Comment by Kaloian Manassiev [ 10/Dec/18 ] |
|
After |
| Comment by Gregory McKeon (Inactive) [ 03/Dec/18 ] |
|
Since kaloian.manassiev is removing TransactionParticipant::abortArbitraryTransaction, we believe this will go away. We should check that this isn't a bug in 4.0 and mark as dependent on the Session catalog refactor for 4.1. |