SERVER-61816 (Core Server)

cancel_coordinate_txn_commit_with_tickets_exhausted.js can hang forever due to race condition between transaction reaper and transaction coordinator

    • Fully Compatible
    • ALL
    • v5.1, v5.0, v4.4, v4.2
      diff --git a/src/mongo/db/periodic_runner_job_abort_expired_transactions.cpp b/src/mongo/db/periodic_runner_job_abort_expired_transactions.cpp
      index 52568382913..9b140b57c82 100644
      --- a/src/mongo/db/periodic_runner_job_abort_expired_transactions.cpp
      +++ b/src/mongo/db/periodic_runner_job_abort_expired_transactions.cpp
      @@ -113,7 +113,7 @@ void PeriodicThreadToAbortExpiredTransactions::_init(ServiceContext* serviceCont
                       LOGV2_DEBUG(4684101, 2, "Periodic job canceled", "{reason}"_attr = ex.reason());
                   }
               },
      -        getPeriod(gTransactionLifetimeLimitSeconds.load()));
      +        Milliseconds(1));
       
           _anchor = std::make_shared<PeriodicJobAnchor>(periodicRunner->makeJob(std::move(job)));
       
      
    • Sharding 2021-12-13
    • 163
    • 2

      Context

      This issue occurs when the TransactionCoordinator is also a participant. The local transaction reaper is triggered before the TransactionCoordinator sends the abortTransaction command to the local transaction (also due to a timeout). The coordinator sends the abort command to all of the participants, but since the coordinator is also a participant, it uses handleRequest to abort the local transaction.

      The underlying function that handles the request has special logic for the case where the coordinator is also a participant: instead of going through the network, it calls handleRequest directly. This is the origin of that stack frame above.

      That call to handleRequest gets stuck because the ServiceEntryPoint attempts a no-op write after the abortTransaction command fails with a NoSuchTransaction error.
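
      The ordering described above can be sketched as a toy model. This is only an illustration of the two possible interleavings, not MongoDB code: simulateAbortOrdering, its event names, and the noopWriteCanProceed flag are hypothetical. The flag stands in for whatever prevents the no-op write from completing, which the description above does not spell out.

```javascript
// Toy model of the race; none of these names are MongoDB APIs.
// events: ordered list of "reaper" and "coordinatorAbort".
// noopWriteCanProceed: hypothetical flag, false models the stuck no-op write.
function simulateAbortOrdering(events, noopWriteCanProceed = false) {
    let localTxn = "open";
    let coordinatorAbort = "not sent";
    for (const event of events) {
        if (event === "reaper") {
            // The local reaper aborts the transaction if it is still open.
            if (localTxn === "open") {
                localTxn = "aborted by reaper";
            }
        } else if (event === "coordinatorAbort") {
            if (localTxn === "open") {
                // Normal path: the coordinator's abort finds the transaction.
                localTxn = "aborted by coordinator";
                coordinatorAbort = "ok";
            } else {
                // The transaction is already gone, so the abort fails with
                // NoSuchTransaction and a no-op write is attempted.
                coordinatorAbort = noopWriteCanProceed
                    ? "NoSuchTransaction, no-op write done"
                    : "NoSuchTransaction, stuck in no-op write";
            }
        } else {
            throw new Error("unknown event: " + event);
        }
    }
    return {localTxn, coordinatorAbort};
}

// Problematic interleaving: the reaper wins the race.
const reaperFirst = simulateAbortOrdering(["reaper", "coordinatorAbort"]);
// Expected interleaving: the coordinator's abort arrives first.
const coordinatorFirst = simulateAbortOrdering(["coordinatorAbort", "reaper"]);
console.log(reaperFirst, coordinatorFirst);
```

      In this model only the reaper-first ordering reaches the stuck state, which matches the sequence described above.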

      Proposal

      To make the test work as expected, the transaction coordinator's assert.soon should accept the coordinator being in any step at or past writingDecision. The new assert.soon call that checks the transaction coordinator's server status should look something like this:

      let twoPhaseCommitCoordinatorServerStatus;
      assert.soon(
          () => {
              // Re-read the coordinator's server status on every retry.
              twoPhaseCommitCoordinatorServerStatus =
                  txnCoordinator.getDB(dbName).serverStatus().twoPhaseCommitCoordinator;
              const currentInSteps = twoPhaseCommitCoordinatorServerStatus.currentInSteps;
              // Accept any step at or past writingDecision.
              return currentInSteps.writingDecision.toNumber() === 1 ||
                  currentInSteps.waitingForDecisionAcks.toNumber() === 1 ||
                  currentInSteps.deletingCoordinatorDoc.toNumber() === 1;
          },
          () => `Failed to find 1 total transactions in the writingDecision, ` +
              `waitingForDecisionAcks, or deletingCoordinatorDoc state: ${
                  tojson(twoPhaseCommitCoordinatorServerStatus)}`);
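
      The step check in the proposal could also be factored into a small predicate so the list of accepted steps lives in one place. The following is a minimal sketch in plain JavaScript, assuming the three steps named in the snippet above are the only ones to accept. coordinatorAtOrPastWritingDecision, asNumber, and ACCEPTED_STEPS are hypothetical names, not part of the MongoDB test libraries, and the status objects at the bottom are hand-written stand-ins rather than real serverStatus output.

```javascript
// Hypothetical helper; not part of the MongoDB test libraries.
// Steps taken from the proposal: writingDecision and the steps after it.
const ACCEPTED_STEPS = ["writingDecision", "waitingForDecisionAcks", "deletingCoordinatorDoc"];

// Counters may be plain numbers or NumberLong-like objects exposing toNumber().
function asNumber(value) {
    if (value !== null && typeof value === "object" && typeof value.toNumber === "function") {
        return value.toNumber();
    }
    return value === undefined ? 0 : Number(value);
}

// Returns true when exactly one transaction is counted in any accepted step.
function coordinatorAtOrPastWritingDecision(coordinatorStatus) {
    const steps = (coordinatorStatus && coordinatorStatus.currentInSteps) || {};
    return ACCEPTED_STEPS.some((step) => asNumber(steps[step]) === 1);
}

// Stand-in status objects for illustration only.
const inWaitingForAcks = {
    currentInSteps: {writingDecision: 0, waitingForDecisionAcks: 1, deletingCoordinatorDoc: 0},
};
const beforeDecision = {
    currentInSteps: {writingDecision: 0, waitingForDecisionAcks: 0, deletingCoordinatorDoc: 0},
};
console.log(coordinatorAtOrPastWritingDecision(inWaitingForAcks));
console.log(coordinatorAtOrPastWritingDecision(beforeDecision));
```

      In the test itself, a predicate like this would be called from the first callback passed to assert.soon, with the second callback still reporting the full status on failure.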
      

            Assignee: Luis Osta (Inactive), luis.osta@mongodb.com
            Reporter: Luis Osta (Inactive), luis.osta@mongodb.com
            Votes: 0
            Watchers: 3
