- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 8.1.0-rc0, 7.0.15, 8.0.4
- Component/s: Query Execution
- Backwards Compatibility: Fully Compatible
- Operating System: ALL
- Backport Requested: v8.0, v7.0
The following invariant failure occurred in AF-1175 in 7.0.15:
{"c":"ASSERT","id":23079,"ctx":"conn3061","msg":"Invariant failure","attr":{"expr":"yieldable","file":"src/mongo/db/query/plan_yield_policy.cpp","line":155}
The invariant failure was triggered from a find command:
mongo::invariantFailed(char const*, char const*, unsigned int)
mongo::PlanYieldPolicy::yieldOrInterrupt(mongo::OperationContext*, std::function<void ()>) [clone .cold]
mongo::PlanExecutorImpl::_getNextImpl(mongo::Snapshotted<mongo::Document>*, mongo::RecordId*)
mongo::PlanExecutorImpl::getNextDocument(mongo::Document*, mongo::RecordId*)
mongo::PlanExecutorImpl::getNext(mongo::BSONObj*, mongo::RecordId*)
mongo::(anonymous namespace)::FindCmd::Invocation::run(mongo::OperationContext*, mongo::rpc::ReplyBuilderInterface*)
...
I have looked a bit at PlanYieldPolicy::yieldOrInterrupt, and I believe there is at least one issue with it. I cannot yet say whether that issue is what occurred on Atlas or whether it is unrelated, but it is worth investigating.
Here is a simplified version of the method from v7.0, with irrelevant lines excluded:
101 Status PlanYieldPolicy::yieldOrInterrupt(OperationContext* opCtx,
102                                          std::function<void()> whileYieldingFn) {
...
122     for (int attempt = 1; true; attempt++) {
123         try {
124             // Saving and restoring can modify '_yieldable', so we make a copy before we start.
125             const Yieldable* yieldable = _yieldable;
126
127             try {
128                 saveState(opCtx);
129             } catch (const StorageUnavailableException&) {
130                 MONGO_UNREACHABLE;
131             }
...
149             if (getPolicy() == PlanYieldPolicy::YieldPolicy::WRITE_CONFLICT_RETRY_ONLY) {
150                 // This yield policy doesn't release locks, but it does relinquish our storage
151                 // snapshot.
152                 invariant(!opCtx->isLockFreeReadsOp());
153                 opCtx->recoveryUnit()->abandonSnapshot();
154             } else {
155                 invariant(yieldable);
156                 performYield(opCtx, *yieldable, whileYieldingFn);
157             }
158
159             restoreState(opCtx, yieldable);
160             return Status::OK();
161         } catch (const StorageUnavailableException&) {
162             if (_callbacks) {
163                 _callbacks->handledWriteConflict(opCtx);
164             }
165             logWriteConflictAndBackoff(attempt, "query yield", ""_sd);
166             // Retry the yielding process.
167         } catch (...) {
168             // Errors other than write conflicts don't get retried, and should instead result in
169             // the PlanExecutor dying. We propagate all such errors as status codes.
170             return exceptionToStatus();
171         }
172     }
On line 125, we store the current value of the member variable _yieldable in the local variable yieldable.
On line 128, we call saveState, which sets _yieldable to nullptr.
When we reach line 159, restoreState puts the value of the local variable yieldable back into _yieldable, and everything is fine.
However, when we reach line 161 and catch a StorageUnavailableException, we only log a debug message and retry the loop; we never restore _yieldable to a non-null value. On the next iteration, line 125 copies that nullptr into the local yieldable, so both variables are null, and if we reach line 155 we run into the invariant failure.
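To make the sequence concrete, here is a minimal standalone C++ sketch of that retry pattern. None of this is MongoDB code: Yieldable, saveState, performYield, and StorageUnavailable are simplified stand-ins for the names in the excerpt, and the assert plays the role of the invariant on line 155. Run it and the assert fires on the second iteration.

#include <cassert>
#include <cstdio>
#include <stdexcept>

struct Yieldable {};  // simplified stand-in, not mongo::Yieldable

struct StorageUnavailable : std::runtime_error {
    StorageUnavailable() : std::runtime_error("storage unavailable") {}
};

class YieldPolicy {
public:
    explicit YieldPolicy(const Yieldable* y) : _yieldable(y) {}

    void yieldOrInterrupt() {
        for (int attempt = 1; true; attempt++) {
            try {
                // Copy the member before saveState() clears it (line 125 in the excerpt).
                const Yieldable* yieldable = _yieldable;
                saveState();              // sets _yieldable to nullptr (line 128)

                assert(yieldable);        // plays the role of the invariant on line 155
                performYield(attempt);    // throws on the first attempt

                restoreState(yieldable);  // only reached on success (line 159)
                return;
            } catch (const StorageUnavailable&) {
                // The bug: we retry without putting _yieldable back, so the copy
                // taken at the top of the next iteration is nullptr.
                std::printf("attempt %d: caught StorageUnavailable, retrying\n", attempt);
            }
        }
    }

private:
    void saveState() { _yieldable = nullptr; }
    void restoreState(const Yieldable* y) { _yieldable = y; }
    static void performYield(int attempt) {
        if (attempt == 1) throw StorageUnavailable();
    }

    const Yieldable* _yieldable;
};

int main() {
    Yieldable y;
    YieldPolicy policy(&y);
    policy.yieldOrInterrupt();  // the assert fires on the second iteration
}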
I cannot say whether this sequence of events actually happened on the Atlas cluster, because we don't have debug log messages from it. For now it is only one possible explanation.
It looks like the catch block on lines 161 to 166 should also restore the state, so that _yieldable is valid for the next iteration.
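A hypothetical sketch of what that could look like, written against the simplified excerpt above (not necessarily what the actual fix in SERVER-98293 does): hoist the local copy out of the inner try block so the retry handler can put _yieldable back before looping.

for (int attempt = 1; true; attempt++) {
    // Saving and restoring can modify '_yieldable', so we make a copy before
    // we start; hoisted out of the try so the catch block below can use it.
    const Yieldable* yieldable = _yieldable;
    try {
        saveState(opCtx);
        // ... abandon the snapshot or performYield(), as before ...
        restoreState(opCtx, yieldable);
        return Status::OK();
    } catch (const StorageUnavailableException&) {
        if (_callbacks) {
            _callbacks->handledWriteConflict(opCtx);
        }
        logWriteConflictAndBackoff(attempt, "query yield", ""_sd);
        // Put the saved value back so '_yieldable' is non-null when the next
        // iteration takes its copy. (A fuller fix may also need to restore
        // more of the executor's state, depending on where the exception
        // was thrown from.)
        _yieldable = yieldable;
        // Retry the yielding process.
    } catch (...) {
        return exceptionToStatus();
    }
}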
- related to: SERVER-98293 Make PlanYieldPolicy::yieldOrInterrupt() safer (Closed)