Type: Bug
Resolution: Fixed
Priority: Major - P3
Affects Version/s: 8.1.0-rc0, 7.0.15, 8.0.4
Component/s: Query Execution
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v8.0, v7.0
The following invariant failure occurred in AF-1175 in 7.0.15:
{"c":"ASSERT","id":23079,"ctx":"conn3061","msg":"Invariant failure","attr":{"expr":"yieldable","file":"src/mongo/db/query/plan_yield_policy.cpp","line":155}
The invariant failure was triggered from a find command:
mongo::invariantFailed(char const*, char const*, unsigned int)
mongo::PlanYieldPolicy::yieldOrInterrupt(mongo::OperationContext*, std::function<void ()>) [clone .cold]
mongo::PlanExecutorImpl::_getNextImpl(mongo::Snapshotted<mongo::Document>*, mongo::RecordId*)
mongo::PlanExecutorImpl::getNextDocument(mongo::Document*, mongo::RecordId*)
mongo::PlanExecutorImpl::getNext(mongo::BSONObj*, mongo::RecordId*)
mongo::(anonymous namespace)::FindCmd::Invocation::run(mongo::OperationContext*, mongo::rpc::ReplyBuilderInterface*)
...
I have looked a bit at PlanYieldPolicy::yieldOrInterrupt, and I believe there is at least one issue with it. I cannot yet say whether that issue is what occurred on Atlas or whether it is unrelated, but it is at least something to investigate.
Here is a simplified version of the method from v7.0, with irrelevant lines excluded:
101 Status PlanYieldPolicy::yieldOrInterrupt(OperationContext* opCtx,
102 std::function<void()> whileYieldingFn) {
...
122 for (int attempt = 1; true; attempt++) {
123 try {
124 // Saving and restoring can modify '_yieldable', so we make a copy before we start.
125 const Yieldable* yieldable = _yieldable;
126
127 try {
128 saveState(opCtx);
129 } catch (const StorageUnavailableException&) {
130 MONGO_UNREACHABLE;
131 }
...
149 if (getPolicy() == PlanYieldPolicy::YieldPolicy::WRITE_CONFLICT_RETRY_ONLY) {
150 // This yield policy doesn't release locks, but it does relinquish our storage
151 // snapshot.
152 invariant(!opCtx->isLockFreeReadsOp());
153 opCtx->recoveryUnit()->abandonSnapshot();
154 } else {
155 invariant(yieldable);
156 performYield(opCtx, *yieldable, whileYieldingFn);
157 }
158
159 restoreState(opCtx, yieldable);
160 return Status::OK();
161 } catch (const StorageUnavailableException&) {
162 if (_callbacks) {
163 _callbacks->handledWriteConflict(opCtx);
164 }
165 logWriteConflictAndBackoff(attempt, "query yield", ""_sd);
166 // Retry the yielding process.
167 } catch (...) {
168 // Errors other than write conflicts don't get retried, and should instead result in
169 // the PlanExecutor dying. We propagate all such errors as status codes.
170 return exceptionToStatus();
171 }
172 }
173 }
On line 125, we store the current value of the instance variable _yieldable in the local variable yieldable.
On line 128, we call saveState(), which sets _yieldable to nullptr.
When we reach line 159, restoreState() writes the local variable yieldable back into _yieldable, and everything is fine.
However, if we reach line 161 by catching a StorageUnavailableException, we only log a debug message and retry the loop; we never restore _yieldable to a non-null value. In the next iteration of the loop, yieldable and _yieldable are therefore both nullptr, so if we get to line 155 we run into the invariant failure.
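To make that sequence concrete, here is a minimal, self-contained C++ sketch that reproduces the same pattern. All names in it are invented for illustration; this is not MongoDB code, just the same control flow: the member pointer is copied to a local, cleared by the save step, restored only on the success path, and therefore null on the retry iteration.

#include <cassert>
#include <stdexcept>

struct Yieldable {};

// Hypothetical stand-in for PlanYieldPolicy, mirroring the listing above.
class YieldPolicyLike {
public:
    explicit YieldPolicyLike(const Yieldable* y) : _yieldable(y) {}

    void yieldOrInterrupt() {
        for (int attempt = 1; true; attempt++) {
            // Copy the member before saving (cf. line 125).
            const Yieldable* yieldable = _yieldable;

            // saveState() clears the member (cf. line 128).
            _yieldable = nullptr;

            try {
                assert(yieldable);       // cf. the invariant on line 155
                performYield();          // may throw (cf. line 156)
                _yieldable = yieldable;  // restoreState() on the success path (cf. line 159)
                return;
            } catch (const std::runtime_error&) {
                // cf. lines 161-166: retry without restoring '_yieldable'.
                // On the next iteration both pointers are null, so the
                // assert above fires, mirroring the invariant failure.
            }
        }
    }

private:
    void performYield() {
        if (_firstCall) {
            _firstCall = false;
            throw std::runtime_error("storage unavailable");  // first attempt fails
        }
    }

    const Yieldable* _yieldable;
    bool _firstCall = true;
};

int main() {
    Yieldable y;
    YieldPolicyLike policy(&y);
    policy.yieldOrInterrupt();  // aborts on the second loop iteration
}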
I cannot say whether this sequence of events happened on the Atlas cluster, because we don't have debug log messages from it. So for now this sequence is only a possibility.
It looks like the catch block on lines 161 to 166 should also restore the state so that _yieldable is valid again for the next iteration.
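For illustration, here is a minimal sketch of that idea against the simplified listing above. This is only one possible shape of a fix, and not necessarily what SERVER-98293 ultimately implemented:

} catch (const StorageUnavailableException&) {
    if (_callbacks) {
        _callbacks->handledWriteConflict(opCtx);
    }
    logWriteConflictAndBackoff(attempt, "query yield", ""_sd);
    // Put the executor back into a restored state so that '_yieldable'
    // is valid again when the next iteration copies it on line 125.
    restoreState(opCtx, yieldable);
    // Retry the yielding process.
}

One complication: restoreState() may itself throw a StorageUnavailableException, so a robust fix would also need to handle a failure in the restore step.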
Related to: SERVER-98293 Make PlanYieldPolicy::yieldOrInterrupt() safer (Closed)