  1. Core Server
  2. SERVER-61816

cancel_coordinate_txn_commit_with_tickets_exhausted.js can hang forever due to race condition between transaction reaper and transaction coordinator


Details

    • Fully Compatible
    • ALL
    • v5.1, v5.0, v4.4, v4.2

      diff --git a/src/mongo/db/periodic_runner_job_abort_expired_transactions.cpp b/src/mongo/db/periodic_runner_job_abort_expired_transactions.cpp
      index 52568382913..9b140b57c82 100644
      --- a/src/mongo/db/periodic_runner_job_abort_expired_transactions.cpp
      +++ b/src/mongo/db/periodic_runner_job_abort_expired_transactions.cpp
      @@ -113,7 +113,7 @@ void PeriodicThreadToAbortExpiredTransactions::_init(ServiceContext* serviceCont
                       LOGV2_DEBUG(4684101, 2, "Periodic job canceled", "{reason}"_attr = ex.reason());
                   }
               },
      -        getPeriod(gTransactionLifetimeLimitSeconds.load()));
      +        Milliseconds(1));
       
           _anchor = std::make_shared<PeriodicJobAnchor>(periodicRunner->makeJob(std::move(job)));
       
      

    • Sharding 2021-12-13
    • 163
    • 2

    Description

      Context

      This issue occurs when the TransactionCoordinator is also a participant. The local transaction reaper is triggered before the TransactionCoordinator sends the abortTransaction command to the local transaction (also due to a timeout). The coordinator sends the abort command to all of the participants, but since the coordinator is also a participant, it uses handleRequest to abort the local transaction.

      The underlying function that handles the request has special logic for the case where the coordinator is also a participant: instead of going through the network, it calls handleRequest directly. This is the origin of that stack frame above.

      That call to handleRequest gets stuck because the ServiceEntryPoint attempts a no-op write after the abortTransaction command fails with a NoSuchTransaction error.
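
The ordering described above can be sketched with a toy model. This is only an illustration of the sequence of events, not server code: LocalParticipant, reapExpired, sendAbort, and the sendOverNetwork callback are hypothetical names, and the model stops at the NoSuchTransaction reply rather than reproducing the subsequent no-op write that hangs.

```javascript
// Toy model only: none of these names are MongoDB APIs.
class LocalParticipant {
    constructor() {
        this.openTxns = new Set();
    }

    begin(txnId) {
        this.openTxns.add(txnId);
    }

    // Models the periodic reaper aborting an expired transaction.
    reapExpired(txnId) {
        this.openTxns.delete(txnId);
    }

    // Models participant-side handling of abortTransaction.
    handleRequest(cmd) {
        // Set.delete returns false when the transaction is already gone.
        if (!this.openTxns.delete(cmd.txnId)) {
            return {ok: 0, codeName: "NoSuchTransaction"};
        }
        return {ok: 1};
    }
}

// Models the coordinator's dispatch: a local target bypasses the network.
function sendAbort(coordinatorId, targetId, participant, txnId, sendOverNetwork) {
    const cmd = {abortTransaction: 1, txnId};
    if (targetId === coordinatorId) {
        return participant.handleRequest(cmd);
    }
    return sendOverNetwork(targetId, cmd);
}
```

If reapExpired runs before sendAbort for the same transaction, the local handleRequest call sees no open transaction and replies with NoSuchTransaction, which is the state the ticket describes just before the hang.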

      Proposal

      The fix required to make the test work as expected is for the transaction coordinator's assert.soon to accept the coordinator being in any step at or past writingDecision. The new assert.soon call that checks the server status of the transaction coordinator should look something like this:

      let twoPhaseCommitCoordinatorServerStatus;
      assert.soon(
          () => {
              twoPhaseCommitCoordinatorServerStatus =
                  txnCoordinator.getDB(dbName).serverStatus().twoPhaseCommitCoordinator;
              const currentInSteps = twoPhaseCommitCoordinatorServerStatus.currentInSteps;
              // Accept the coordinator in any step at or past writingDecision.
              return currentInSteps.writingDecision.toNumber() === 1 ||
                  currentInSteps.waitingForDecisionAcks.toNumber() === 1 ||
                  currentInSteps.deletingCoordinatorDoc.toNumber() === 1;
          },
          () => `Failed to find 1 transaction in the writingDecision, waitingForDecisionAcks, or ` +
              `deletingCoordinatorDoc step: ${tojson(twoPhaseCommitCoordinatorServerStatus)}`);
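
The step check above could also be factored into a small predicate so that the list of accepted steps lives in one place. The following is a minimal sketch under stated assumptions: coordinatorReachedDecision and toPlainNumber are hypothetical helper names that are not part of the test or the server, and the counters are assumed to be either plain numbers or shell values exposing toNumber(), as in the snippet above.

```javascript
// The three steps checked in the ticket's snippet (at or past writingDecision).
const STEPS_AT_OR_PAST_WRITING_DECISION = [
    "writingDecision",
    "waitingForDecisionAcks",
    "deletingCoordinatorDoc",
];

// Convert a counter to a plain number. Values exposing toNumber() are unwrapped;
// missing counters are treated as 0.
function toPlainNumber(value) {
    if (value === undefined || value === null) {
        return 0;
    }
    return typeof value.toNumber === "function" ? value.toNumber() : Number(value);
}

// Hypothetical helper: true when any of the listed steps reports exactly one coordinator.
function coordinatorReachedDecision(currentInSteps) {
    return STEPS_AT_OR_PAST_WRITING_DECISION.some(
        (step) => toPlainNumber(currentInSteps[step]) === 1);
}
```

Inside the assert.soon callback, the return statement would then reduce to coordinatorReachedDecision(twoPhaseCommitCoordinatorServerStatus.currentInSteps).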
      

            People

              luis.osta@mongodb.com Luis Osta (Inactive)