- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 8.1.0-rc0, 7.0.15, 8.0.4
- Component/s: Query Execution
- Backwards Compatibility: Fully Compatible
- Operating System: ALL
- Backport Requested: v8.0, v7.0
The following invariant failure occurred in AF-1175 in 7.0.15:
{"c":"ASSERT","id":23079,"ctx":"conn3061","msg":"Invariant failure","attr":{"expr":"yieldable","file":"src/mongo/db/query/plan_yield_policy.cpp","line":155}
The invariant failure was triggered from a find command:
mongo::invariantFailed(char const*, char const*, unsigned int)
mongo::PlanYieldPolicy::yieldOrInterrupt(mongo::OperationContext*, std::function<void ()>) [clone .cold]
mongo::PlanExecutorImpl::_getNextImpl(mongo::Snapshotted<mongo::Document>*, mongo::RecordId*)
mongo::PlanExecutorImpl::getNextDocument(mongo::Document*, mongo::RecordId*)
mongo::PlanExecutorImpl::getNext(mongo::BSONObj*, mongo::RecordId*)
mongo::(anonymous namespace)::FindCmd::Invocation::run(mongo::OperationContext*, mongo::rpc::ReplyBuilderInterface*)
...
I have looked a bit at PlanYieldPolicy::yieldOrInterrupt, and I believe there is at least one issue with it. I cannot yet say whether that issue is what occurred on Atlas or whether it is unrelated, but it is worth investigating.
Here is a simplified version of the method from v7.0, with irrelevant lines excluded:
101 Status PlanYieldPolicy::yieldOrInterrupt(OperationContext* opCtx,
102                                          std::function<void()> whileYieldingFn) {
...
122     for (int attempt = 1; true; attempt++) {
123         try {
124             // Saving and restoring can modify '_yieldable', so we make a copy before we start.
125             const Yieldable* yieldable = _yieldable;
126
127             try {
128                 saveState(opCtx);
129             } catch (const StorageUnavailableException&) {
130                 MONGO_UNREACHABLE;
131             }
...
149             if (getPolicy() == PlanYieldPolicy::YieldPolicy::WRITE_CONFLICT_RETRY_ONLY) {
150                 // This yield policy doesn't release locks, but it does relinquish our storage
151                 // snapshot.
152                 invariant(!opCtx->isLockFreeReadsOp());
153                 opCtx->recoveryUnit()->abandonSnapshot();
154             } else {
155                 invariant(yieldable);
156                 performYield(opCtx, *yieldable, whileYieldingFn);
157             }
158
159             restoreState(opCtx, yieldable);
160             return Status::OK();
161         } catch (const StorageUnavailableException&) {
162             if (_callbacks) {
163                 _callbacks->handledWriteConflict(opCtx);
164             }
165             logWriteConflictAndBackoff(attempt, "query yield", ""_sd);
166             // Retry the yielding process.
167         } catch (...) {
168             // Errors other than write conflicts don't get retried, and should instead result in
169             // the PlanExecutor dying. We propagate all such errors as status codes.
170             return exceptionToStatus();
171         }
172     }
On line 125, we store the current value of the member variable _yieldable in the local variable yieldable.
On line 128, we call saveState, which sets _yieldable to nullptr.
When we reach line 159, restoreState puts the value of the local variable yieldable back into _yieldable, and everything is fine.
However, when we reach line 161 and catch a StorageUnavailableException, we only log a debug message and retry the loop; we never restore _yieldable to a non-null value. On the next iteration, line 125 copies that nullptr into the local yieldable, so both variables are null, and if we reach line 155 we run into the invariant failure.
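To make the sequence concrete, here is a minimal standalone C++ sketch of that retry pattern. None of this is MongoDB code: Yieldable, saveState, performYield, and StorageUnavailable are simplified stand-ins for the names in the excerpt, and the assert plays the role of the invariant on line 155. Run it and the assert fires on the second iteration.

#include <cassert>
#include <cstdio>
#include <stdexcept>

struct Yieldable {};  // simplified stand-in, not mongo::Yieldable

struct StorageUnavailable : std::runtime_error {
    StorageUnavailable() : std::runtime_error("storage unavailable") {}
};

class YieldPolicy {
public:
    explicit YieldPolicy(const Yieldable* y) : _yieldable(y) {}

    void yieldOrInterrupt() {
        for (int attempt = 1; true; attempt++) {
            try {
                // Copy the member before saveState() clears it (line 125 in the excerpt).
                const Yieldable* yieldable = _yieldable;
                saveState();              // sets _yieldable to nullptr (line 128)

                assert(yieldable);        // plays the role of the invariant on line 155
                performYield(attempt);    // throws on the first attempt

                restoreState(yieldable);  // only reached on success (line 159)
                return;
            } catch (const StorageUnavailable&) {
                // The bug: we retry without putting _yieldable back, so the copy
                // taken at the top of the next iteration is nullptr.
                std::printf("attempt %d: caught StorageUnavailable, retrying\n", attempt);
            }
        }
    }

private:
    void saveState() { _yieldable = nullptr; }
    void restoreState(const Yieldable* y) { _yieldable = y; }
    static void performYield(int attempt) {
        if (attempt == 1) throw StorageUnavailable();
    }

    const Yieldable* _yieldable;
};

int main() {
    Yieldable y;
    YieldPolicy policy(&y);
    policy.yieldOrInterrupt();  // the assert fires on the second iteration
}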
I cannot say whether this sequence of events actually happened on the Atlas cluster, because we don't have debug log messages from it. For now it is only one possible explanation.
It looks like the catch block on lines 161 to 166 should also restore the state, so that _yieldable is valid for the next iteration.
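A hypothetical sketch of what that could look like, written against the simplified excerpt above (not necessarily what the actual fix in SERVER-98293 does): hoist the local copy out of the inner try block so the retry handler can put _yieldable back before looping.

for (int attempt = 1; true; attempt++) {
    // Saving and restoring can modify '_yieldable', so we make a copy before
    // we start; hoisted out of the try so the catch block below can use it.
    const Yieldable* yieldable = _yieldable;
    try {
        saveState(opCtx);
        // ... abandon the snapshot or performYield(), as before ...
        restoreState(opCtx, yieldable);
        return Status::OK();
    } catch (const StorageUnavailableException&) {
        if (_callbacks) {
            _callbacks->handledWriteConflict(opCtx);
        }
        logWriteConflictAndBackoff(attempt, "query yield", ""_sd);
        // Put the saved value back so '_yieldable' is non-null when the next
        // iteration takes its copy. (A fuller fix may also need to restore
        // more of the executor's state, depending on where the exception
        // was thrown from.)
        _yieldable = yieldable;
        // Retry the yielding process.
    } catch (...) {
        return exceptionToStatus();
    }
}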
- related to: SERVER-98293 Make PlanYieldPolicy::yieldOrInterrupt() safer (Closed)