[SERVER-39726] Recovering the state of an uncommitted transaction should not block
Created: 21/Feb/19  Updated: 02/May/19  Resolved: 02/May/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.1.8
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Shane Harvey
Assignee: Randolph Tan
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-39349 Recovering the state of a completed s... Closed
Related
related to SERVER-37344 Implement recovery token for retrying... Closed
related to SERVER-39349 Recovering the state of a completed s... Closed
Operating System: ALL
Sprint: Sharding 2019-04-08, Sharding 2019-04-22, Sharding 2019-05-06
Participants:

 Description   

SERVER-37344's ticket description says:

a shard that receives 'recoverTransaction' returns NoSuchTransaction if the shard does not know about the transaction. otherwise, if the decision has been made, returns the decision; *if the decision has not been made, decides to abort.*

And the server design also says:

If the client is unable to reach the original router after having attempted to send commitTransaction to the original router, the client can send commitTransaction to a different router.
Doing so will never initiate committing the transaction. *Instead, the recovery token in the request will be used to try to abort the transaction if a decision to commit has not already been made*, otherwise to recover the transaction's outcome.

However, the implementation of SERVER-37344 says:

commit recovery is best effort. If coordinateCommit was never sent to the coordinator, the recovery commit will timeout waiting for it.

So I think the current implementation is incomplete. The abort optimization is important because it prevents applications from blocking for 60 seconds (or transactionLifetimeLimitSeconds) when the original commit attempt is lost.
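
To make the expected behavior concrete, here is a minimal sketch of the three-way rule quoted from SERVER-37344 above. It is an illustration only, not the server's implementation: `RecoveryShard`, `Outcome`, `begin` and `recover_transaction` are hypothetical names, and the shard's transaction state is modeled as a plain in-memory dictionary.

```python
from enum import Enum


class Outcome(Enum):
    COMMITTED = "committed"
    ABORTED = "aborted"
    NO_SUCH_TRANSACTION = "NoSuchTransaction"


class RecoveryShard:
    """Hypothetical in-memory model of a shard handling recoverTransaction."""

    def __init__(self):
        # txn_id -> Outcome, or None while no decision has been made yet
        self.decisions = {}

    def begin(self, txn_id):
        # The shard learns about the transaction; no decision exists yet.
        self.decisions[txn_id] = None

    def recover_transaction(self, txn_id):
        if txn_id not in self.decisions:
            # The shard does not know about the transaction.
            return Outcome.NO_SUCH_TRANSACTION
        if self.decisions[txn_id] is None:
            # No decision yet: decide to abort instead of waiting for a timeout.
            self.decisions[txn_id] = Outcome.ABORTED
        # Otherwise (or after aborting) return the recorded decision.
        return self.decisions[txn_id]
```

In this model a recovery attempt never blocks: it either reports an existing decision or records an abort immediately, which is the behavior this ticket asks for. The model ignores the coordination with the transaction coordinator that a real implementation would need.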

CC: renctan.



 Comments   
Comment by Randolph Tan [ 02/May/19 ]

Changes in SERVER-39349 made it such that commit recovery will abort the transaction coordinator if it has not yet started instead of trying to wait for it to time out.

Comment by Shane Harvey [ 25/Feb/19 ]

"I think the distinction is that commitTransaction against a recovery router will abort the uncommitted transaction on the recovery shard, which guarantees the transaction will never commit. This is done so that NoSuchTransaction can be safely returned to the client."

Yes, this is the behavior I would like to see implemented by this ticket. I linked SERVER-39692 because if commitTransaction can abort the uncommitted transaction on the recovery shard, then abortTransaction should also be able to abort the uncommitted transaction on the recovery shard.

Comment by Esha Maharishi (Inactive) [ 25/Feb/19 ]

shane.harvey, I think the distinction is that commitTransaction against a recovery router will abort the uncommitted transaction on the recovery shard, which guarantees the transaction will never commit. This is done so that NoSuchTransaction can be safely returned to the client. However, it does not synchronously abort the transaction on all participant shards, since the recoveryToken does not include the participant list.

Comment by Shane Harvey [ 21/Feb/19 ]

Linking SERVER-39692 because these two tickets are closely related. If commitTransaction can cause an uncommitted transaction to abort, then abortTransaction should also be able to do the same.

Comment by Shane Harvey [ 21/Feb/19 ]

Interesting... can you explain why the recovery commitTransaction attempt cannot communicate with the coordinator to abort the transaction? What exactly is the race condition?

Comment by Randolph Tan [ 21/Feb/19 ]

Note: design doc is not up to date. I don't think the abort was meant to be an optimization and it is also possible that it won't be used to get around this quirk. The issue is that making decisions without involving the transaction coordinator is racy.
