Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-97629

Replace invariant with tassert when yielding

    • Type: Icon: Bug Bug
    • Resolution: Fixed
    • Priority: Icon: Major - P3 Major - P3
    • 8.1.0-rc0, 8.0.5
    • Affects Version/s: 8.1.0-rc0, 7.0.15, 8.0.4
    • Component/s: Query Execution
    • None
    • Query Execution
    • Fully Compatible
    • ALL
    • v8.0, v7.0

      The following invariant failure occurred in AF-1175 in 7.0.15:
       

      {"c":"ASSERT","id":23079,"ctx":"conn3061","msg":"Invariant failure","attr":{"expr":"yieldable","file":"src/mongo/db/query/plan_yield_policy.cpp","line":155}

      The invariant failure was triggered from a find command:

      mongo::invariantFailed(char const*, char const*, unsigned int)
      mongo::PlanYieldPolicy::yieldOrInterrupt(mongo::OperationContext*, std::function<void ()>) [clone .cold]
      mongo::PlanExecutorImpl::_getNextImpl(mongo::Snapshotted<mongo::Document>*, mongo::RecordId*)
      mongo::PlanExecutorImpl::getNextDocument(mongo::Document*, mongo::RecordId*)
      mongo::PlanExecutorImpl::getNext(mongo::BSONObj*, mongo::RecordId*)
      mongo::(anonymous namespace)::FindCmd::Invocation::run(mongo::OperationContext*, mongo::rpc::ReplyBuilderInterface*)
      ...

      I have looked a bit at PlanYieldPolicy::yieldOrInterrupt, and I believe there is at least one issue with it. I cannot yet say if that issue occurred on Atlas or if it is unrelated, but at least it is something to investigate.

      Here is a simplified version of the method from v7.0, with irrelevant lines excluded:

         101  Status PlanYieldPolicy::yieldOrInterrupt(OperationContext* opCtx,
         102                                           std::function<void()> whileYieldingFn) {
             ...
         122      for (int attempt = 1; true; attempt++) {
         123          try {
         124              // Saving and restoring can modify '_yieldable', so we make a copy before we start.
         125              const Yieldable* yieldable = _yieldable;
         126
         127              try {
         128                  saveState(opCtx);
         129              } catch (const StorageUnavailableException&) {
         130                  MONGO_UNREACHABLE;
         131              }
             ...
         149              if (getPolicy() == PlanYieldPolicy::YieldPolicy::WRITE_CONFLICT_RETRY_ONLY) {
         150                  // This yield policy doesn't release locks, but it does relinquish our storage
         151                  // snapshot.
         152                  invariant(!opCtx->isLockFreeReadsOp());
         153                  opCtx->recoveryUnit()->abandonSnapshot();
         154              } else {
         155                  invariant(yieldable);
         156                  performYield(opCtx, *yieldable, whileYieldingFn);
         157              }
         158
         159              restoreState(opCtx, yieldable);
         160              return Status::OK();
         161          } catch (const StorageUnavailableException&) {
         162              if (_callbacks) {
         163                  _callbacks->handledWriteConflict(opCtx);
         164              }
         165              logWriteConflictAndBackoff(attempt, "query yield", ""_sd);
         166              // Retry the yielding process.
         167          } catch (...) {
         168              // Errors other than write conflicts don't get retried, and should instead result in
         169              // the PlanExecutor dying. We propagate all such errors as status codes.
         170              return exceptionToStatus();
         171          }
         172      }

      In line 125, we are storing whatever is in the instance variable _yieldable in a local variable yieldable.
      In line 128, we call saveState, which sets _yieldable to a nullptr.
      When we get to line 159, we restore the value of the local variable yieldable in _yieldable and everything is fine.

      However, when we get to line 161 and catch a StorageUnavailableException, we are only logging a debug message and retry the loop. We are not restoring _yieldable to a non-nullptr. In the next iteration of the loop, yieldable and _yieldable will both contain nullptrs, so should we get to line 155 we will run into the invariant failure.
      I cannot say if this sequence of events happened in the Atlas cluster, because we don't have debug log messages from it. So right now it is only a possibility that this sequence of events has happened.

      It looks like that the catch block on lines 161 to 166 should also restore the state so that _yieldable is valid for the next iteration. 

            Assignee:
            jan.steemann@mongodb.com Jan Steemann
            Reporter:
            jan.steemann@mongodb.com Jan Steemann
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: