|
I got a chance to return to working on this ticket, and while I don't yet have a proposed fix I have a pretty good understanding of the root cause. The $lookup stage is forwarded from the mongos to the primary shard which owns the outer collection. On the primary shard, we first attempt to run the query in order to gather execution stats, since this is an "executionStats" verbosity explain command. The query fails unexpectedly on this shard when attempting to execute the $lookup. By instrumenting the server code, I was able to collect this backtrace at the point of failure:
(gdb) bt
|
#0 0x00007ffff5b23817 in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
|
#1 0x000055556a532b4f in mongo::breakpoint () at src/mongo/util/debugger.cpp:72
|
#2 0x000055556a4d414d in mongo::invariantFailed (expr=0x55556122293f "false", file=0x555561222560 "src/mongo/db/pipeline/variables.cpp", line=361)
|
at src/mongo/util/assert_util.cpp:141
|
#3 0x0000555564416df0 in mongo::invariantWithLocation<bool> (testOK=@0x7fffb9bec6e0: false, expr=0x55556122293f "false", file=0x555561222560 "src/mongo/db/pipeline/variables.cpp",
|
line=361) at src/mongo/util/assert_util_core.h:74
|
#4 0x00005555685b44fb in mongo::VariablesParseState::getVariable (this=0x7fffc5508628, name=...) at src/mongo/db/pipeline/variables.cpp:361
|
#5 0x00005555684f0953 in mongo::ExpressionFieldPath::parse (expCtx=0x7fffc5508520, raw=..., vps=...) at src/mongo/db/pipeline/expression.cpp:2418
|
#6 0x00005555684da0ac in mongo::Expression::parseOperand (expCtx=0x7fffc5508520, exprElement=..., vps=...) at src/mongo/db/pipeline/expression.cpp:252
|
#7 0x00005555684d9eef in mongo::ExpressionNary::parseArguments (expCtx=0x7fffc5508520, exprElement=..., vps=...) at src/mongo/db/pipeline/expression.cpp:236
|
#8 0x00005555684e118d in mongo::ExpressionCompare::parse (expCtx=0x7fffc5508520, bsonExpr=..., vps=..., op=mongo::ExpressionCompare::EQ) at src/mongo/db/pipeline/expression.cpp:999
|
#9 0x00005555684e090f in mongo::(anonymous namespace)::BoundOp::operator() (this=0x7ffff047ed78, expCtx=0x7fffc5508520, bsonExpr=..., vps=...)
|
at src/mongo/db/pipeline/expression.cpp:981
|
#10 0x000055556853182e in std::__invoke_impl<boost::intrusive_ptr<mongo::Expression>, mongo::(anonymous namespace)::BoundOp&, mongo::ExpressionContext*, mongo::BSONElement, mongo::VariablesParseState const&> (__f=...) at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/invoke.h:61
|
#11 0x000055556852dfb1 in std::__invoke_r<boost::intrusive_ptr<mongo::Expression>, mongo::(anonymous namespace)::BoundOp&, mongo::ExpressionContext*, mongo::BSONElement, mongo::VariablesParseState const&> (__fn=...) at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/invoke.h:116
|
#12 0x000055556852c1da in std::_Function_handler<boost::intrusive_ptr<mongo::Expression>(mongo::ExpressionContext*, mongo::BSONElement, const mongo::VariablesParseState&), mongo::(anonymous namespace)::BoundOp>::_M_invoke(const std::_Any_data &, mongo::ExpressionContext *&&, mongo::BSONElement &&, const mongo::VariablesParseState &) (__functor=...,
|
__args#0=@0x7fffb9becb98: 0x7fffc5508520, __args#1=..., __args#2=...)
|
at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/std_function.h:291
|
#13 0x0000555568563840 in std::function<boost::intrusive_ptr<mongo::Expression> (mongo::ExpressionContext*, mongo::BSONElement, mongo::VariablesParseState const&)>::operator()(mongo::ExpressionContext*, mongo::BSONElement, mongo::VariablesParseState const&) const (this=0x7ffff047ed78, __args#0=0x7fffc5508520, __args#1=..., __args#2=...)
|
at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/std_function.h:590
|
#14 0x00005555684d9dc8 in mongo::Expression::parseExpression (expCtx=0x7fffc5508520, obj=..., vps=...) at src/mongo/db/pipeline/expression.cpp:227
|
#15 0x00005555684d8e60 in mongo::Expression::parseObject (expCtx=0x7fffc5508520, obj=..., vps=...) at src/mongo/db/pipeline/expression.cpp:111
|
#16 0x00005555684da110 in mongo::Expression::parseOperand (expCtx=0x7fffc5508520, exprElement=..., vps=...) at src/mongo/db/pipeline/expression.cpp:254
|
#17 0x000055556842f379 in mongo::ExprMatchExpression::ExprMatchExpression (this=0x7fffe5325d40, elem=..., expCtx=..., annotation=...) at src/mongo/db/matcher/expression_expr.cpp:50
|
#18 0x000055556846c647 in std::make_unique<mongo::ExprMatchExpression, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, std::unique_ptr<mongo::MatchExpression::ErrorAnnotation, std::default_delete<mongo::MatchExpression::ErrorAnnotation> > > ()
|
at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/unique_ptr.h:962
|
#19 0x000055556844d79c in mongo::(anonymous namespace)::parseExpr (name=..., elem=..., expCtx=..., extensionsCallback=0x7fffb9bed680, allowedFeatures=18446744073709551615,
|
currentLevel=mongo::DocumentParseLevel::kPredicateTopLevel) at src/mongo/db/matcher/expression_parser.cpp:536
|
#20 0x000055556847f003 in std::__invoke_impl<mongo::StatusWith<std::unique_ptr<mongo::MatchExpression, std::default_delete<mongo::MatchExpression> > >, mongo::StatusWith<std::unique_ptr<mongo::MatchExpression, std::default_delete<mongo::MatchExpression> > > (*&)(mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel), mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel> (
|
__f=@0x7fffb9bed370: 0x55556844d5df <mongo::(anonymous namespace)::parseExpr(mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, mongo::MatchExpressionParser::AllowedFeatureSet, mongo::DocumentParseLevel)>)
|
at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/invoke.h:61
|
#21 0x000055556847c451 in std::__invoke_r<mongo::StatusWith<std::unique_ptr<mongo::MatchExpression, std::default_delete<mongo::MatchExpression> > >, mongo::StatusWith<std::unique_ptr<mongo::MatchExpression, std::default_delete<mongo::MatchExpression> > > (*&)(mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel), mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel> (
|
__fn=@0x7fffb9bed370: 0x55556844d5df <mongo::(anonymous namespace)::parseExpr(mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, mongo::MatchExpressionParser::AllowedFeatureSet, mongo::DocumentParseLevel)>)
|
at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/invoke.h:116
|
#22 0x000055556847871b in std::_Function_handler<mongo::StatusWith<std::unique_ptr<mongo::MatchExpression, std::default_delete<mongo::MatchExpression> > > (mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel), mongo::StatusWith<std::unique_ptr<mongo::MatchExpression, std::default_delete<mongo::MatchExpression> > > (*)(mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel)>::_M_invoke(std::_Any_data const&, mongo::StringData&&, mongo::BSONElement&&, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*&&, unsigned long long&&, mongo::DocumentParseLevel&&) (__functor=..., __args#0=..., __args#1=..., __args#2=...,
|
__args#3=@0x7fffb9bed1a0: 0x7fffb9bed680, __args#4=@0x7fffb9bed240: 18446744073709551615, __args#5=@0x7fffb9bed248: mongo::DocumentParseLevel::kPredicateTopLevel)
|
at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/std_function.h:291
|
#23 0x000055556846c01d in std::function<mongo::StatusWith<std::unique_ptr<mongo::MatchExpression, std::default_delete<mongo::MatchExpression> > > (mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel)>::operator()(mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel) const (this=0x7fffb9bed370,
|
__args#0=..., __args#1=..., __args#2=..., __args#3=0x7fffb9bed680, __args#4=18446744073709551615, __args#5=mongo::DocumentParseLevel::kPredicateTopLevel)
|
at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/std_function.h:590
|
#24 0x000055556844bb24 in mongo::(anonymous namespace)::parse (obj=..., expCtx=..., extensionsCallback=0x7fffb9bed680, allowedFeatures=18446744073709551615,
|
currentLevel=mongo::DocumentParseLevel::kPredicateTopLevel) at src/mongo/db/matcher/expression_parser.cpp:309
|
#25 0x000055556845a8c1 in mongo::MatchExpressionParser::parse (obj=..., expCtx=..., extensionsCallback=..., allowedFeatures=18446744073709551615)
|
at src/mongo/db/matcher/expression_parser.cpp:2098
|
#26 0x000055556753aa94 in operator() (__closure=0x7fffb9bed550) at src/mongo/db/query/canonical_query.cpp:123
|
#27 0x000055556753afb7 in mongo::CanonicalQuery::canonicalize (opCtx=0x7fffc55356a0, findCommand=..., explain=false, expCtx=..., extensionsCallback=...,
|
allowedFeatures=18446744073709551615, projectionPolicies=..., pipeline=..., isCountLike=false) at src/mongo/db/query/canonical_query.cpp:125
|
#28 0x000055556718299b in mongo::getShardIdsForQuery (expCtx=..., query=..., collation=..., cm=..., shardIds=0x7fffb9bed840, chunkRanges=0x0, targetMinKeyToMaxKey=0x0)
|
at src/mongo/s/shard_key_pattern_query_util.cpp:426
|
#29 0x0000555567167699 in mongo::(anonymous namespace)::buildVersionedRequestsForTargetedShards (opCtx=0x7fffc55356a0, nss=..., cri=..., shardsToSkip=..., cmdObj=..., query=...,
|
collation=..., eligibleForSampling=false) at src/mongo/s/cluster_commands_helpers.cpp:183
|
#30 0x000055556716907e in mongo::scatterGatherVersionedTargetByRoutingTable (opCtx=0x7fffc55356a0, dbName=..., nss=..., cri=..., cmdObj=..., readPref=...,
|
retryPolicy=mongo::Shard::RetryPolicy::kIdempotent, query=..., collation=..., eligibleForSampling=false) at src/mongo/s/cluster_commands_helpers.cpp:431
|
#31 0x000055556711ab3f in mongo::sharded_agg_helpers::dispatchShardPipeline (serializedCommand=..., hasChangeStream=false, startsWithDocuments=false, eligibleForSampling=false,
|
pipeline=..., shardTargetingPolicy=mongo::ShardTargetingPolicy::kAllowed, readConcern=...) at src/mongo/db/pipeline/sharded_agg_helpers.cpp:1165
|
#32 0x0000555567117ad0 in mongo::sharded_agg_helpers::targetShardsAndAddMergeCursors (expCtx=..., targetRequest=..., shardCursorsSortSpec=...,
|
shardTargetingPolicy=mongo::ShardTargetingPolicy::kAllowed, readConcern=...) at src/mongo/db/pipeline/sharded_agg_helpers.cpp:788
|
#33 0x000055556711e849 in operator() (__closure=0x7fffb9beef70, opCtx=0x7fffc55356a0, cri=...) at src/mongo/db/pipeline/sharded_agg_helpers.cpp:1639
|
#34 0x000055556711eb0e in mongo::sharding::router::CollectionRouter::route<mongo::sharded_agg_helpers::attachCursorToPipeline(mongo::Pipeline*, mongo::ShardTargetingPolicy, boost::optional<mongo::BSONObj>)::<lambda(mongo::OperationContext*, const mongo::CollectionRoutingInfo&)> >(mongo::OperationContext *, mongo::StringData, struct {...} &&) (
|
this=0x7fffb9bef000, opCtx=0x7fffc55356a0, comment=..., callbackFn=...) at src/mongo/s/router.h:105
|
#35 0x000055556711f046 in mongo::sharded_agg_helpers::attachCursorToPipeline (ownedPipeline=0x7ffff0086820, shardTargetingPolicy=mongo::ShardTargetingPolicy::kAllowed,
|
--Type <RET> for more, q to quit, c to continue without paging--c
|
readConcern=...) at src/mongo/db/pipeline/sharded_agg_helpers.cpp:1640
|
#36 0x000055556636735e in mongo::ShardServerProcessInterface::attachCursorSourceToPipeline (this=0x7ffff0466b10, ownedPipeline=0x7ffff0086820, shardTargetingPolicy=mongo::ShardTargetingPolicy::kAllowed, readConcern=...) at src/mongo/db/pipeline/process_interface/shardsvr_process_interface.cpp:387
|
#37 0x000055556779f751 in mongo::DocumentSourceLookUp::buildPipeline (this=0x7fffc55e8920, inputDoc=...) at src/mongo/db/pipeline/document_source_lookup.cpp:598
|
#38 0x000055556779e8a4 in mongo::DocumentSourceLookUp::doGetNext (this=0x7fffc55e8920) at src/mongo/db/pipeline/document_source_lookup.cpp:451
|
#39 0x000055556485913e in mongo::DocumentSource::getNext (this=0x7fffc55e8920) at src/mongo/db/pipeline/document_source.h:370
|
#40 0x00005555678698af in mongo::Pipeline::getNext (this=0x7ffff00a61a0) at src/mongo/db/pipeline/pipeline.cpp:499
|
#41 0x0000555566c52dde in mongo::PlanExecutorPipeline::_tryGetNext (this=0x7fffc5601820) at src/mongo/db/pipeline/plan_executor_pipeline.cpp:131
|
#42 0x0000555566c52c49 in mongo::PlanExecutorPipeline::_getNext (this=0x7fffc5601820) at src/mongo/db/pipeline/plan_executor_pipeline.cpp:119
|
#43 0x0000555566c52b2e in mongo::PlanExecutorPipeline::getNextDocument (this=0x7fffc5601820, docOut=0x7fffb9bef778, recordIdOut=0x0) at src/mongo/db/pipeline/plan_executor_pipeline.cpp:101
|
#44 0x0000555566c529b3 in mongo::PlanExecutorPipeline::getNext (this=0x7fffc5601820, objOut=0x7fffb9bef7d0, recordIdOut=0x0) at src/mongo/db/pipeline/plan_executor_pipeline.cpp:80
|
#45 0x0000555566cd719d in mongo::(anonymous namespace)::executePlan (exec=0x7fffc5601820) at src/mongo/db/query/explain.cpp:314
|
#46 0x0000555566cd765c in mongo::Explain::explainPipeline (exec=0x7fffc5601820, executePipeline=true, verbosity=mongo::explain::VerbosityEnum::kExecStats, command=..., out=0x7fffb9befba0) at src/mongo/db/query/explain.cpp:378
|
#47 0x00005555662ae343 in mongo::runAggregate (opCtx=0x7fffc55356a0, origNss=..., request=..., liteParsedPipeline=..., cmdObj=..., privileges=..., result=0x7ffff0061520, externalDataSourceGuard=...) at src/mongo/db/commands/run_aggregate.cpp:1119
|
#48 0x000055556629a526 in mongo::(anonymous namespace)::PipelineCommand::Invocation::explain (this=0x7fffbbc8db20, opCtx=0x7fffc55356a0, verbosity=mongo::explain::VerbosityEnum::kExecStats, result=0x7ffff0061520) at src/mongo/db/commands/pipeline_command.cpp:246
|
#49 0x000055556624566c in mongo::(anonymous namespace)::CmdExplain::Invocation::run (this=0x7fffc54f6920, opCtx=0x7fffc55356a0, result=0x7ffff0061520) at src/mongo/db/commands/explain_cmd.cpp:122
|
#50 0x000055556835e98b in mongo::CommandHelpers::runCommandInvocation (opCtx=0x7fffc55356a0, request=..., invocation=0x7fffc54f6920, response=0x7ffff0061520) at src/mongo/db/commands.cpp:188
|
#51 0x000055556835e6f3 in operator() (__closure=0x7ff
|
Here, the DocumentSouceLookUp is trying to target the query for the inner side to the necessary shards. However, this query is not a regular aggregate command, but rather an explain command. Because it is an explain command, we follow a codepath which allocates its own ExpressionContext. This fresh ExpressionContext does not have the original VariablesParseState object which is used to resolve variable names at parse time. Later on, we try to use this ExpressionContext to parse the subquery on the inner side of the $lookup. This re-parsing is done in order to do shard targeting. However, the parsing fails with a "Use of undefined variable" error because the aforementioned freshly allocated ExpressionContext does not have the correct VariablesParseState.
Zooming out, there is a more fundamental problem here. Explain in "executionStats" or "allPlansExecution" mode currently is not designed to work across network boundaries. In order for this to work, the system would have to be designed to carry runtime stats alongside the actual query results; that way, the merging node could both collect the stats and continue executing the query in order to have an accurate picture of the runtime stats on the merging side. Furthermore, the explain output for the inner side of $lookup is quite limited in its current form (see SERVER-22622). For these reasons, I don't think we should be attempting to run an explain command across all targeted shards on the inner side of a $lookup into a sharded collection. The original symptom of this ticket is that the query can fail spuriously while the system is attempting to do so, but we probably shouldn't be doing so in the first place.
A related observation is that the execution stats reported by $lookup are not correct when the collection on the inner side is sharded. I modified the repro script above slightly so that the explain succeeds, and saw output like this:
[js_test:repro] {
|
[js_test:repro] "$lookup" : {
|
[js_test:repro] "from" : "foreign",
|
[js_test:repro] "as" : "as",
|
[js_test:repro] "let" : {
|
[js_test:repro]
|
[js_test:repro] },
|
[js_test:repro] "pipeline" : [
|
[js_test:repro] {
|
[js_test:repro] "$match" : {
|
[js_test:repro] "$expr" : {
|
[js_test:repro] "$eq" : [
|
[js_test:repro] "$x",
|
[js_test:repro] "foo"
|
[js_test:repro] ]
|
[js_test:repro] }
|
[js_test:repro] }
|
[js_test:repro] }
|
[js_test:repro] ]
|
[js_test:repro] },
|
[js_test:repro] "totalDocsExamined" : NumberLong(0),
|
[js_test:repro] "totalKeysExamined" : NumberLong(0),
|
[js_test:repro] "collectionScans" : NumberLong(0),
|
[js_test:repro] "indexesUsed" : [ ],
|
[js_test:repro] "nReturned" : NumberLong(1),
|
[js_test:repro] "executionTimeMillisEstimate" : NumberLong(20)
|
[js_test:repro] }
|
The fact that "totalDocsExamined" and "collectionScans" are both zero is incorrect. The explain outputs are gathered from both shards for the subpipeline, but any collection scans, indexes used, or docs/keys examined are being discarded rather than incorporated into the runtime stats. Since, as I explained above, the explain implementation can't easily report accurate runtime statistics in sharded scenarios right now, I think the best course of action would be to omit these stats entirely when the inner collection is sharded.
As a next step, I'm going to investigate possible fixes for both the fact that explain can fail spuriously and the incorrect $lookup exec stats problem.
|