[SERVER-39187] Rerunning commitTransaction on a new mongos blocks forever
Created: 25/Jan/19 | Updated: 29/Oct/23 | Resolved: 30/Jan/19

| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 4.1.7 |
| Fix Version/s: | 4.1.8 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Shane Harvey | Assignee: | Matthew Saltz (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Sharding 2019-02-11 |
| Description |
Rerunning commitTransaction on a new mongos, with the recoveryToken included, blocks forever.

To reproduce, start a sharded cluster with at least two mongoses (my cluster has one config server and a one-node shard). Run the repro script: reproHangingCommit.js

db.currentOp() reports an ongoing coordinateCommitTransaction command that never ends. I've attached an example currentOp output at the bottom of the repro script.
| Comments |
| Comment by Matthew Saltz (Inactive) [ 30/Jan/19 ] |
shane.harvey: Note that even with the fix, if the transaction does not go through two-phase commit, the recovery mongos has to wait for the coordinator to expire. Recovering the commit decision will therefore still take 60 seconds by default (the value of the transactionLifetimeLimitSeconds parameter).
| Comment by Githook User [ 30/Jan/19 ] |
Author: {'username': 'saltzm', 'email': 'matthew.saltz@mongodb.com', 'name': 'Matthew Saltz'}
Message:
| Comment by Esha Maharishi (Inactive) [ 29/Jan/19 ] |
Yes, this is correct.
| Comment by Shane Harvey [ 29/Jan/19 ] |
Why does the coordinator time out at all? The transaction has already committed on the first attempt. Shouldn't the second commit immediately return success? Is this a consequence of the behavior described in this comment:
| Comment by Matthew Saltz (Inactive) [ 29/Jan/19 ] |
I also reproduced this in a unit test. I'm not sure why BrokenPromise isn't thrown when the coordinator is destroyed after get() is called, but we aren't supposed to rely on that behavior anyway, so I'll fix it by setting an error when cancelIfCommitNotYetStarted is called.
| Comment by Esha Maharishi (Inactive) [ 29/Jan/19 ] |
I think the hang occurs because the TransactionCoordinator's _finalDecisionPromise is never set when the coordinator times out waiting to hear the participant list. The promise is set only when a decision is made, or in _handleCompletionStatus when coordinating a commit (or continuing to coordinate one) fails with an error. Because coordinateCommitTransaction waits for _finalDecisionPromise to be signaled, the command hangs.

Running the repro with the shards' verbosity at 3, I see the following log lines under the TXN component:
and the coordinator shard primary (d20000) hanging here:

I think one fix would be to call _handleCompletionStatus in cancelIfCommitNotYetStarted, where currently only _transitionToDone is called. However, I remember matthew.saltz wanted to do a small refactor to simplify _handleCompletionStatus and _transitionToDone. Matthew, do you want to do that refactor under this ticket while also fixing this bug?
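
The failure mode described in this comment can be illustrated with a minimal, self-contained sketch. It is not the server code: ToyCoordinator, cancelWithoutSignal, cancelWithError, and waitTimesOut are hypothetical names, and the sketch uses the standard library's std::promise in place of the server's own promise and future types, whose semantics may differ. It only shows why a waiter blocks when cancellation marks the coordinator done without fulfilling the decision promise, and why setting an error on the promise unblocks it.

```cpp
#include <cassert>
#include <chrono>
#include <exception>
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a coordinator that owns a "final decision" promise.
class ToyCoordinator {
public:
    ToyCoordinator() : _decision(_promise.get_future().share()) {}

    // Waiters (the analogue of coordinateCommitTransaction) block on this future.
    std::shared_future<std::string> getDecision() const { return _decision; }

    // Analogue of the bug: the coordinator is marked done, but the promise
    // is never fulfilled, so waiters are never woken.
    void cancelWithoutSignal() {
        assert(!_done);
        _done = true;
    }

    // Analogue of the proposed fix: cancellation also sets an error on the
    // promise, so waiters wake up and observe the failure.
    void cancelWithError() {
        assert(!_done);
        _done = true;
        _promise.set_exception(std::make_exception_ptr(
            std::runtime_error("coordinator canceled before commit started")));
    }

    bool isDone() const { return _done; }

private:
    std::promise<std::string> _promise;         // declared before _decision on purpose
    std::shared_future<std::string> _decision;  // initialized from _promise
    bool _done = false;
};

// Returns true if the decision is still not available after `timeout`.
// A real waiter without a timeout would block indefinitely in that case.
inline bool waitTimesOut(const std::shared_future<std::string>& decision,
                         std::chrono::milliseconds timeout) {
    return decision.wait_for(timeout) == std::future_status::timeout;
}
```

With cancelWithoutSignal the future never becomes ready, so an untimed wait would hang, which matches the never-ending coordinateCommitTransaction seen in currentOp. With cancelWithError the waiter returns promptly and get() rethrows the error.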