- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Query Execution
- ALL
- v8.0, v7.3
- QE 2023-10-02, QE 2023-10-16, QE 2024-02-19, QE 2024-03-04, QE 2024-03-18, QE 2024-04-01
- 160
I am trying to keep this ticket generic, although I believe that most of the related issues and BF tickets stem from the optimization performed in SERVER-77427.
Why didn't we see these errors before?
Mostly due to SERVER-81007. The FSM framework was swallowing some exceptions. This was addressed in 7.1.
Affected versions
The bug that caused those errors to be swallowed (SERVER-79735) was introduced in 7.1. Thus, if any previous version had suffered from these failures, I would expect our test coverage to have caught them. What I believe is the conflicting commit (SERVER-77427) was also introduced in 7.1, which matches the timeline.
About the issue
In the query framework there is at least one place where, if we detect that an operation that was going to be sent over the network involves only the current shard, we execute it locally instead. This local execution is performed without creating a fresh operation context (i.e. it reuses the existing one), which can be problematic, especially for components like the OperationShardingState that assume that a pair of <OpCtx, nss> can only be associated with one shard version. Note that this is just one example; other components could be making similar assumptions.
The problematic interleaving is the following:
1. The router sends an operation with SV1 and nss to a certain shard.
2. The shard starts executing it, registering the <OpCtx, nss> pair with SV1.
3. A DDL operation bumps the shardVersion of that shard to SV2.
4. The shard continues executing the operation and reaches a point at which, in theory, it has to send a request to a set of shards. This set only includes itself, so it ends up executing the command directly.
5. We try to register the pair <OpCtx, nss> with SV2 on the OperationShardingState, but because of step 2 the registration fails.
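The interleaving above can be reproduced with a small toy model. This is only a sketch of the invariant, not the server code: `ToyOpCtx`, `ToyOperationShardingState`, `set_shard_version` and `ShardVersionConflict` are hypothetical names invented for illustration, and shard versions are represented as plain strings.

```python
class ShardVersionConflict(Exception):
    """Raised when one (op_ctx, nss) pair is given two different shard versions."""


class ToyOperationShardingState:
    """Hypothetical stand-in for the per-operation sharding state."""

    def __init__(self):
        # nss -> the single shard version this operation may use for it
        self._versions = {}

    def set_shard_version(self, nss, version):
        previous = self._versions.get(nss)
        if previous is not None and previous != version:
            raise ShardVersionConflict(
                f"{nss}: already registered {previous}, got {version}"
            )
        self._versions[nss] = version


class ToyOpCtx:
    """Hypothetical operation context owning one sharding state."""

    def __init__(self):
        self.sharding_state = ToyOperationShardingState()


def run_interleaving():
    op_ctx = ToyOpCtx()
    nss = "test.coll"

    # Steps 1-2: the router's request arrives with SV1 and is registered.
    op_ctx.sharding_state.set_shard_version(nss, "SV1")

    # Step 3: a DDL operation bumps the shard version while we are running.
    current_shard_version = "SV2"

    # Steps 4-5: the shard targets only itself and executes locally,
    # reusing the same operation context with the new version.
    try:
        op_ctx.sharding_state.set_shard_version(nss, current_shard_version)
    except ShardVersionConflict as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(run_interleaving())
```

The second registration fails purely because the operation context is reused; neither version is wrong on its own.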
Update
It even looks like we don't really need a bump of the shard version (step 3): some kinds of query operations (I believe the ones that use $mergeCursors) initially attach ChunkVersion::IGNORED, so if such an operation needs to perform some targeting and ends up deciding to execute locally, it will also hit the same failure (see BF-30079).
How should we fix it?
I asked service-arch about the semantics of command re-entrance. They mentioned that ideally we should create a fresh operation context in these situations, to avoid violating assumptions like the one explained above.
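A minimal sketch of the "fresh operation context" idea, using the same kind of toy model. All names here (`ToyOpCtx`, `set_shard_version`, `execute_locally`) are hypothetical and do not correspond to server APIs; how a fresh context would actually be created in the server is not covered by this ticket.

```python
class ToyOpCtx:
    """Hypothetical operation context: one shard version per namespace."""

    def __init__(self):
        self.shard_versions = {}

    def set_shard_version(self, nss, version):
        # setdefault stores the version on first use and returns the stored one.
        previous = self.shard_versions.setdefault(nss, version)
        if previous != version:
            raise RuntimeError(
                f"{nss}: already registered {previous}, got {version}"
            )


def execute_locally(nss, version, command):
    """Run `command` under a fresh context instead of reusing the caller's."""
    child_op_ctx = ToyOpCtx()
    child_op_ctx.set_shard_version(nss, version)
    return command(child_op_ctx)


if __name__ == "__main__":
    parent = ToyOpCtx()
    parent.set_shard_version("test.coll", "SV1")

    # The local execution registers SV2 on its own context, so the
    # parent's <OpCtx, nss> -> SV1 association is never touched.
    result = execute_locally(
        "test.coll", "SV2", lambda ctx: ctx.shard_versions["test.coll"]
    )
    print(result, parent.shard_versions["test.coll"])
```

The point of the design is that each context keeps its own one-version-per-namespace invariant, so the nested execution cannot conflict with what the outer operation already registered.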
Another thing we could consider is reverting SERVER-77427 if it ends up being the root cause of the failures.
- depends on
  - SERVER-79580 Remove HostTypeRequirement::kPrimaryShard from $lookup (Closed)
  - SERVER-83056 Confirm $lookup local read optimization doesn't miss data, migrated during the read (Closed)
- is caused by
  - SERVER-77427 Avoid going through the network when a shard is targeting only itself for a $lookup subpipeline (Closed)
- is depended on by
  - SERVER-81234 analyze_shard_key FSM workload fails after changing shard version (Closed)
  - SERVER-81235 agg_graph_lookup FSM workload not failing with expected QueryPlanKilled error (Closed)
- is related to
  - SERVER-81007 FSM workloads no longer fail when $config.states functions throw exceptions (Closed)