[SERVER-61816] cancel_coordinate_txn_commit_with_tickets_exhausted.js can hang forever due to race condition between transaction reaper and transaction coordinator Created: 30/Nov/21  Updated: 29/Oct/23  Resolved: 02/Dec/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.2.0, 5.1.2, 5.0.6, 4.4.11, 4.2.19

Type: Bug Priority: Major - P3
Reporter: Luis Osta (Inactive) Assignee: Luis Osta (Inactive)
Resolution: Fixed Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
is caused by SERVER-60685 TransactionCoordinator may interrupt ... Closed
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.1, v5.0, v4.4, v4.2
Steps To Reproduce:

diff --git a/src/mongo/db/periodic_runner_job_abort_expired_transactions.cpp b/src/mongo/db/periodic_runner_job_abort_expired_transactions.cpp
index 52568382913..9b140b57c82 100644
--- a/src/mongo/db/periodic_runner_job_abort_expired_transactions.cpp
+++ b/src/mongo/db/periodic_runner_job_abort_expired_transactions.cpp
@@ -113,7 +113,7 @@ void PeriodicThreadToAbortExpiredTransactions::_init(ServiceContext* serviceCont
                 LOGV2_DEBUG(4684101, 2, "Periodic job canceled", "{reason}"_attr = ex.reason());
             }
         },
-        getPeriod(gTransactionLifetimeLimitSeconds.load()));
+        Milliseconds(1));
 
     _anchor = std::make_shared<PeriodicJobAnchor>(periodicRunner->makeJob(std::move(job)));
 

Sprint: Sharding 2021-12-13
Participants:
Linked BF Score: 163
Story Points: 2

 Description   

Context

This issue occurs when the TransactionCoordinator is also a participant. The local transaction reaper is triggered before the TransactionCoordinator sends the abortTransaction command to the local transaction (an abort that is also due to a timeout). The coordinator sends the abort command to all of the participants, but since the coordinator is also a participant, it uses handleRequest to abort the local transaction.

The underlying function that handles the request has special logic for the case where the coordinator is also a participant: instead of going through the network, it calls handleRequest directly. This is where the handleRequest stack frame originates.

That call to handleRequest gets stuck because the ServiceEntryPoint attempts a no-op write after the abortTransaction command fails with a NoSuchTransaction error.
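
The ordering described above can be illustrated with a toy model. This is only a sketch and not server code: TicketPool, LocalParticipant, reaperAbort, handleAbortRequest and simulate are hypothetical names, and it assumes, as the test name suggests, that the no-op write is what waits on an exhausted ticket pool.

```javascript
// Toy model only: none of these names are real server APIs.
class TicketPool {
    constructor(available) {
        this.available = available;
    }
    // Returns false instead of blocking so the model can report a hang.
    tryAcquire() {
        if (this.available === 0) {
            return false;
        }
        this.available -= 1;
        return true;
    }
}

class LocalParticipant {
    constructor(tickets) {
        this.tickets = tickets;
        this.txnOpen = true;
    }
    // Stands in for the transaction reaper aborting an expired transaction.
    reaperAbort() {
        this.txnOpen = false;
    }
    // Stands in for the coordinator's direct (non-network) abort request.
    handleAbortRequest() {
        if (this.txnOpen) {
            this.txnOpen = false;
            return "aborted";
        }
        // NoSuchTransaction path: a no-op write is attempted (assumed to need a ticket).
        return this.tickets.tryAcquire() ? "noSuchTransaction" : "blockedOnNoopWrite";
    }
}

// Runs one ordering of the race with an exhausted ticket pool.
function simulate(reaperRunsFirst) {
    const participant = new LocalParticipant(new TicketPool(0));
    if (reaperRunsFirst) {
        participant.reaperAbort();
    }
    return participant.handleAbortRequest();
}
```

In this model, simulate(false) aborts the transaction normally, while simulate(true) reports the blocked no-op write that corresponds to the hang.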

Proposal

The fix required to make the test work as expected is for the test's assert.soon on the transaction coordinator to accept the coordinator being in any step equal to or past writingDecision. The new assert.soon call, which checks the transaction coordinator's server status, should look something like this:

let twoPhaseCommitCoordinatorServerStatus;
assert.soon(
    () => {
        twoPhaseCommitCoordinatorServerStatus =
            txnCoordinator.getDB(dbName).serverStatus().twoPhaseCommitCoordinator;
        const deletingCoordinatorDoc =
            twoPhaseCommitCoordinatorServerStatus.currentInSteps.deletingCoordinatorDoc;
        const waitingForDecisionAcks =
            twoPhaseCommitCoordinatorServerStatus.currentInSteps.waitingForDecisionAcks;
        const writingDecision = twoPhaseCommitCoordinatorServerStatus.currentInSteps.writingDecision;
        // Accept any step at or past writingDecision.
        return writingDecision.toNumber() === 1 || waitingForDecisionAcks.toNumber() === 1 ||
            deletingCoordinatorDoc.toNumber() === 1;
    },
    () => `Failed to find 1 total transactions in the writingDecision, waitingForDecisionAcks, or deletingCoordinatorDoc state: ${
        tojson(twoPhaseCommitCoordinatorServerStatus)}`);
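
The three-way check above could also be factored into a small helper, which makes the "any step equal to or past writingDecision" rule explicit. This is only a sketch: stepsAtOrPastWritingDecision, toPlainNumber and coordinatorAtOrPastWritingDecision are hypothetical names that are not part of the test, and the helper assumes each counter is either a plain number or an object exposing toNumber(), as in the snippet above.

```javascript
// Step names taken from the currentInSteps fields used in the check above.
const stepsAtOrPastWritingDecision =
    ["writingDecision", "waitingForDecisionAcks", "deletingCoordinatorDoc"];

// Hypothetical helper: accepts a plain number or an object with toNumber().
function toPlainNumber(value) {
    return typeof value === "number" ? value : value.toNumber();
}

// Returns true if exactly one transaction is in any of the accepted steps.
function coordinatorAtOrPastWritingDecision(currentInSteps) {
    return stepsAtOrPastWritingDecision.some(
        (step) => toPlainNumber(currentInSteps[step]) === 1);
}
```

Inside the assert.soon callback, the return statement would then become coordinatorAtOrPastWritingDecision(twoPhaseCommitCoordinatorServerStatus.currentInSteps).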



 Comments   
Comment by Githook User [ 10/Jan/22 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-61816 Add steps past kWaitingForVotes to assert.soon

(cherry picked from commit 6d1b572c7ddbba652ffa49dc3783fbd27cec9714)

SERVER-60685 Add Interruption category to 'TransactionCoordinatorReachedAbortDecision'

(cherry picked from commit 78ab98a46b53582a5e69424bbb92f25c483fec0a)
Branch: v4.2
https://github.com/mongodb/mongo/commit/f2fc34bdec288cf3b90bf926d6a6c77631f4fa10

Comment by Githook User [ 13/Dec/21 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-61816 Add steps past kWaitingForVotes to assert.soon

(cherry picked from commit 6d1b572c7ddbba652ffa49dc3783fbd27cec9714)

SERVER-60685 Add Interruption category to 'TransactionCoordinatorReachedAbortDecision'

(cherry picked from commit 78ab98a46b53582a5e69424bbb92f25c483fec0a)
Branch: v4.4
https://github.com/mongodb/mongo/commit/172eefd07a93db431d762e1ab017b21db9d10f9f

Comment by Githook User [ 08/Dec/21 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-61816 Add steps past kWaitingForVotes to assert.soon

(cherry picked from commit 6d1b572c7ddbba652ffa49dc3783fbd27cec9714)

SERVER-60685 Add Interruption category to 'TransactionCoordinatorReachedAbortDecision'

(cherry picked from commit 78ab98a46b53582a5e69424bbb92f25c483fec0a)
(cherry picked from commit 7634ffa5d056aa5efcc12079d00da898e6f258fb)
Branch: v5.0
https://github.com/mongodb/mongo/commit/f33c1dac76e2799d50b3453eaf14d771dc9646ab

Comment by Githook User [ 07/Dec/21 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-61816 Add steps past kWaitingForVotes to assert.soon

(cherry picked from commit 6d1b572c7ddbba652ffa49dc3783fbd27cec9714)

SERVER-60685 Add Interruption category to 'TransactionCoordinatorReachedAbortDecision'

(cherry picked from commit 78ab98a46b53582a5e69424bbb92f25c483fec0a)
Branch: v5.1
https://github.com/mongodb/mongo/commit/7634ffa5d056aa5efcc12079d00da898e6f258fb

Comment by Githook User [ 02/Dec/21 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-61816 Add steps past kWaitingForVotes to assert.soon
Branch: master
https://github.com/mongodb/mongo/commit/be3181ae244732d15e09571f2dc0488800bba26b

Comment by Githook User [ 02/Dec/21 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-61816 Add steps past kWaitingForVotes to assert.soon
Branch: master
https://github.com/mongodb/mongo/commit/6d1b572c7ddbba652ffa49dc3783fbd27cec9714
