[SERVER-36295] Transaction metrics not updated on TransientTransactionError Created: 26/Jul/18 Updated: 29/Oct/23 Resolved: 14/Aug/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Diagnostics |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.3 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | William Schultz (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Sprint: | Repl 2018-08-13, Repl 2018-08-27 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Linked BF Score: | 18 | ||||||||||||||||
| Description |
|
Same test as
At A 6 of the 10 threads got a TransientTransactionError, apparently (based on timing in chart above) because the in-progress transaction reaper job prevented them from acquiring a lock within 5ms:
The 10 threads continue, but the open and active counters are now stuck ranging from 6-16 instead of 0-10. Apparently the counters have not been decremented on the error. At B I killed the mongo shell. It appears there was still one transaction open at that point because open is 7 and inactive is 1. At C after that transaction times out we remain stuck indefinitely with open at 6 and active at 6. |
| Comments |
| Comment by William Schultz (Inactive) [ 14/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Should be fixed by | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by William Schultz (Inactive) [ 09/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Assigning to siyuan.zhou to close when he completes the dependent ticket. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by William Schultz (Inactive) [ 02/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Ok, so the database X lock is acquired on the "config" database when the reaper tries to run createIndexes on the logical sessions namespace (config.system.sessions). It turns out createIndexes takes an exclusive database lock for foreground index builds. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by William Schultz (Inactive) [ 02/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Also, still trying to figure out what exactly was holding an exclusive lock on the "config" database so as to conflict with the intent lock held by the commitTransaction write to the transactions table. At first glance, it looks like the TransactionReaper only grabs an intent lock. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by William Schultz (Inactive) [ 02/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
And the simpler repro. Shows the issue after running for a few seconds:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by William Schultz (Inactive) [ 02/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
bruce.lucas The underlying issue is described in | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by William Schultz (Inactive) [ 02/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I was also able to repro this with just 2 threads. One thread runs and commits a simple transaction repeatedly, and the other runs the logical session cache reaper repeatedly. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by William Schultz (Inactive) [ 02/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I think I have figured out what is going on here. I believe that the LockTimeout errors we are seeing are from the commitTransaction command failing, not the update inside the transaction. When we run commitTransaction, we will call theĀ Session::commitUnpreparedTransaction method. Inside this method, we will first transition to state kCommittingWithoutPrepare, and then trigger the onTransactionCommit OpObserver. Inside that OpObserver call is where we will do a write to update the config.transactions table. If the the onTransactionCommit method throws an exception, then the commitTransaction command will fail, and we will have left the transaction state in kCommittingWithoutPrepare. When a command running inside a transaction throws an exception, we will trigger this block, to abort the transaction if necessary. In the case described, we would call Session::abortActiveTransaction while _txnState==kCommittingWithoutPrepare. Since we are not in one of the expected states passed to _abortActiveTransaction, we will not execute the _abortTransactionOnSession method, which is what actually updates the various metadata about the transaction, to indicate that it is aborted. We will, however, clean up the transaction resources that live on the OperationContext. So, even though we called abortActiveTransaction, we never actually transitioned to state kAborted. The issue then persists, because the transaction has been left in the kCommittingWithoutPrepare state. For example, when we try to run another commit command, we will get an error because the transaction is no longer marked as in-progress. The same error will also be thrown if we try to run abort. One way to get the transaction out of this "limbo" state is to start a new transaction with a higher transaction number on the same session. This will work as a way to clear out the old transaction state, but it still won't trigger an actual call to _abortTransactionOnSession for the previous transaction. When we start a new transaction when one is already running, we will only abort the old transaction if there is one in progress. This means we would start the new transaction without ever explicitly calling the abort method internally. This, I think, it why the transactions metrics appeared incorrect i.e. we effectively aborted a transaction but we never called _abortTransactionOnSession, which is where we update global transaction metrics counts. Note: The reason this manifested in the repro shown is because the logical session cache reaper was running concurrently with a transaction. If it removes entries from the transactions table, it would need to take a lock on the config.transactions table, and that is what would cause a lock conflict with the commit transaction command, leading to the scenario described above. Theoretically, this could occur for anything that tries to take a similar lock. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 01/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
That was a nightly build from July 24. Log says
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by William Schultz (Inactive) [ 01/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
bruce.lucas What git revision did you test this against? Have you been working against master? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 01/Aug/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The following seems to reliably reproduce the problem:
It starts up 100 threads doing transactions, waits a few seconds, then runs the reapLogicalSessionCacheNow command (test commands must be enabled). Output:
Note that open and active transactions are now stuck at 32. Interestingly, totalAborted hasn't been bumped even though (I think) the transactions that got the errors have been aborted. |