[SERVER-71636] Explain executionStats does not work for $lookup on sharded collection Created: 28/Nov/22  Updated: 29/Oct/23  Resolved: 23/Mar/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 6.0.2
Fix Version/s: 7.0.0-rc0, 6.0.6, 6.3.1

Type: Bug Priority: Major - P3
Reporter: Wernfried Domscheit Assignee: David Storch
Resolution: Fixed Votes: 0
Labels: explain-plan, sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File reproduce.js    
Issue Links:
Backports
Related
related to SERVER-75356 explain command for a find with $expr... Closed
related to SERVER-76037 Write commands using updateOne withou... Closed
related to SERVER-75108 Revisit implementation of explain for... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.3, v6.0
Steps To Reproduce:

Sample data:

db.getCollection('tdube04mipmed0.332-0750.long').insertOne(
   {
      "_id": ObjectId("638468a8c6ac4d7abf8035d5"),
      "data": [
         {
            "_id": ObjectId("6384689b60dc7b330284cab6"),
            "t0": ISODate("2022-11-18T00:00:00.000+0100"),
            "t": ISODate("2022-11-18T23:59:59.999+0100"),
            "a": "periodical",
            "di": "72.163.4.333",
            "tsp": NumberInt(62160),
            "tsi": "888.141.227.4",
            "dp": NumberInt(443),
            "f": "ALTS01TBMFW03",
            "n": "snat",
            "p": "tcp",
            "si": "10.999.999.1",
            "sp": NumberInt(42680),
            "tt": NumberInt(10),
            "h": NumberInt(102),
            "ts": [ISODate("2022-11-14T08:31:00.000+0100"), ISODate("2022-11-28T08:21:00.000+0100")],
            "si_subnet": "10.999.960.0/18"
         }
      ],
      "target": "sessions.20221118"
   }
)
 
db.getCollection('sessions.20221118').insertOne(
   {
      "_id": ObjectId("638468a8c6ac4d7abf803842"),
      "si": "10.999.999.1",
      "tsi": "888.141.227.4",
      "di": "72.163.4.333",
      "dp": NumberInt(443),
      "h": NumberInt(102),
      "n": "snat",
      "sp": NumberInt(42680),
      "t": ISODate("2022-11-18T23:59:59.999+0100"),
      "t0": ISODate("2022-11-18T00:00:00.000+0100"),
      "ts": [ISODate("2022-11-14T08:31:00.000+0100"), ISODate("2022-11-28T08:21:00.000+0100")],
      "tsp": NumberInt(62160)
   }
)
 
db.runCommand({ createIndexes: 'sessions.20221118', indexes: [{ name: "tsi_si", key: { tsi: 1, si: 1 } }] })
 
sh.shardCollection(`${db.getName()}.sessions.20221118`, { tsi: 1, si: 1 })
 

Sprint: QE 2023-02-06, QE 2023-02-20, QE 2023-03-06, QE 2023-03-20, QE 2023-04-03
Participants:

 Description   

I like to get detailed execution plan for this aggregation pipeline:

db.getSiblingDB('data').getCollection('tdube04mipmed0.332-0750.long').aggregate([
   { "$match": { "_id": ObjectId("638468a8c6ac4d7abf8035d5") } },
   { "$unwind": "$data" },
   { "$replaceWith": "$data" },
   {
      "$lookup": {
         "from": "sessions.20221118",
         "let": { "si": "$si", "sp": "$sp", "di": "$di", "dp": "$dp", "n": "$n", "t0": "$t0", "t": "$t", "ts": "$ts", "tsp": "$tsp", "tsi": "$tsi" },
         "pipeline": [
            {
               "$match": {
                  "n": "snat",
                  "ts": { "$exists": true },
                  "h": { "$in": [102, 103] }
               }
            },
            {
               "$match": {
                  "$expr": {
                     "$and": [
                        { "$eq": ["$si", "$$si"] },
                        { "$eq": ["$sp", "$$sp"] },
                        { "$eq": ["$di", "$$di"] },
                        { "$eq": ["$dp", "$$dp"] },
                        { "$eq": ["$n", "$$n"] },
                        { "$gte": ["$t", "$$t"] },
                        { "$eq": ["$tsp", "$$tsp"] },
                        { "$eq": ["$tsi", "$$tsi"] }]
                  }
               }
            },
            { "$sort": { "last": -1 } },
            { "$limit": 1 }],
         "as": "longdata"
      }
   },
   { "$set": { "longdata": { "$first": "$longdata" } } },
   {
      "$set": {
         "_id": { "$ifNull": ["$longdata._id", "$$REMOVE"] },
         "longdata": "$$REMOVE",
         "ts": [
            { "$min": [{ "$first": "$ts" }, { "$first": { "$ifNull": ["$longdata.ts", "$ts"] } }] },
            { "$max": [{ "$last": "$ts" }, { "$last": { "$ifNull": ["$longdata.ts", "$ts"] } }] }
         ],
         "tp": { "$cond": ["$longdata._id", { "$concatArrays": [{ "$ifNull": ["$longdata.tp", []] }, [{ "$last": { "$ifNull": ["$longdata.ts", "$ts"] } }]] }, "$$REMOVE"] }
      }
   },
   { "$unset": ["tt", "si_subnet", "f", "a", "p", "rb", "sb"] },
   { "$merge": { "into": { "db": "data", "coll": "sessions.20221118" } } }
], { explain: 'executionStats' }) 

But I get this error:

MongoServerError: Use of undefined variable: si
    at Connection.onMessage (C:\Programs\MongoDB\Server\bin\mongosh.exe:72962:20)
    at MessageStream.<anonymous> (C:\Programs\MongoDB\Server\bin\mongosh.exe:72777:56)
    at MessageStream.emit (node:events:513:28)
    at MessageStream.emit (node:domain:552:15)
    at processIncomingData (C:\Programs\MongoDB\Server\bin\mongosh.exe:72586:14)
    at MessageStream._write (C:\Programs\MongoDB\Server\bin\mongosh.exe:72462:5)
    at writeOrBuffer (node:internal/streams/writable:391:12)
    at _write (node:internal/streams/writable:332:10)
    at MessageStream.Writable.write (node:internal/streams/writable:336:10)
    at Socket.ondata (node:internal/streams/readable:754:22)
 

Same applies for {{

{explain: 'allPlansExecution' }

}}

It works fine on unsharded collections, but when lookup table is sharded then I get this error.



 Comments   
Comment by Githook User [ 12/Apr/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-71636 Fix explain of $lookup when inner collection is sharded

Before the fix, the system would erroneously dispatch
explain commands across the wire when executing the inner
side and throw away the results. After the fix, the system
will actually run the agg command itself against the inner
side. This improves the runtime stats that we expose in the
final explain output, and avoids the parsing-related error
originally reported under this ticket.

(cherry picked from commit 1ff4d25cae73aa0c26e7a2f09da363401ba48d33)
Branch: v6.3
https://github.com/mongodb/mongo/commit/976ed21bb2cf49bebce1e5eb8965f9bb00ecd50c

Comment by Githook User [ 10/Apr/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-71636 Fix explain of $lookup when inner collection is sharded

Before the fix, the system would erroneously dispatch
explain commands across the wire when executing the inner
side and throw away the results. After the fix, the system
will actually run the agg command itself against the inner
side. This improves the runtime stats that we expose in the
final explain output, and avoids the parsing-related error
originally reported under this ticket.

(cherry picked from commit 1ff4d25cae73aa0c26e7a2f09da363401ba48d33)
Branch: v6.0
https://github.com/mongodb/mongo/commit/19dba1d5461b8f057e9e5dd78a918c7291f2be43

Comment by Githook User [ 23/Mar/23 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-71636 Fix explain of $lookup when inner collection is sharded

Before the fix, the system would erroneously dispatch
explain commands across the wire when executing the inner
side and throw away the results. After the fix, the system
will actually run the agg command itself against the inner
side. This improves the runtime stats that we expose in the
final explain output, and avoids the parsing-related error
originally reported under this ticket.
Branch: master
https://github.com/mongodb/mongo/commit/1ff4d25cae73aa0c26e7a2f09da363401ba48d33

Comment by David Storch [ 09/Feb/23 ]

I got a chance to return to working on this ticket, and while I don't yet have a proposed fix I have a pretty good understanding of the root cause. The $lookup stage is forwarded from the mongos to the primary shard which owns the outer collection. On the primary shard, we first attempt to run the query in order to gather execution stats, since this is an "executionStats" verbosity explain command. The query fails unexpectedly on this shard when attempting to execute the $lookup. By instrumenting the server code, I was able to collect this backtrace at the point of failure:

(gdb) bt
#0  0x00007ffff5b23817 in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000055556a532b4f in mongo::breakpoint () at src/mongo/util/debugger.cpp:72
#2  0x000055556a4d414d in mongo::invariantFailed (expr=0x55556122293f "false", file=0x555561222560 "src/mongo/db/pipeline/variables.cpp", line=361)
    at src/mongo/util/assert_util.cpp:141
#3  0x0000555564416df0 in mongo::invariantWithLocation<bool> (testOK=@0x7fffb9bec6e0: false, expr=0x55556122293f "false", file=0x555561222560 "src/mongo/db/pipeline/variables.cpp",
    line=361) at src/mongo/util/assert_util_core.h:74
#4  0x00005555685b44fb in mongo::VariablesParseState::getVariable (this=0x7fffc5508628, name=...) at src/mongo/db/pipeline/variables.cpp:361
#5  0x00005555684f0953 in mongo::ExpressionFieldPath::parse (expCtx=0x7fffc5508520, raw=..., vps=...) at src/mongo/db/pipeline/expression.cpp:2418
#6  0x00005555684da0ac in mongo::Expression::parseOperand (expCtx=0x7fffc5508520, exprElement=..., vps=...) at src/mongo/db/pipeline/expression.cpp:252
#7  0x00005555684d9eef in mongo::ExpressionNary::parseArguments (expCtx=0x7fffc5508520, exprElement=..., vps=...) at src/mongo/db/pipeline/expression.cpp:236
#8  0x00005555684e118d in mongo::ExpressionCompare::parse (expCtx=0x7fffc5508520, bsonExpr=..., vps=..., op=mongo::ExpressionCompare::EQ) at src/mongo/db/pipeline/expression.cpp:999
#9  0x00005555684e090f in mongo::(anonymous namespace)::BoundOp::operator() (this=0x7ffff047ed78, expCtx=0x7fffc5508520, bsonExpr=..., vps=...)
    at src/mongo/db/pipeline/expression.cpp:981
#10 0x000055556853182e in std::__invoke_impl<boost::intrusive_ptr<mongo::Expression>, mongo::(anonymous namespace)::BoundOp&, mongo::ExpressionContext*, mongo::BSONElement, mongo::VariablesParseState const&> (__f=...) at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/invoke.h:61
#11 0x000055556852dfb1 in std::__invoke_r<boost::intrusive_ptr<mongo::Expression>, mongo::(anonymous namespace)::BoundOp&, mongo::ExpressionContext*, mongo::BSONElement, mongo::VariablesParseState const&> (__fn=...) at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/invoke.h:116
#12 0x000055556852c1da in std::_Function_handler<boost::intrusive_ptr<mongo::Expression>(mongo::ExpressionContext*, mongo::BSONElement, const mongo::VariablesParseState&), mongo::(anonymous namespace)::BoundOp>::_M_invoke(const std::_Any_data &, mongo::ExpressionContext *&&, mongo::BSONElement &&, const mongo::VariablesParseState &) (__functor=...,
    __args#0=@0x7fffb9becb98: 0x7fffc5508520, __args#1=..., __args#2=...)
    at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/std_function.h:291
#13 0x0000555568563840 in std::function<boost::intrusive_ptr<mongo::Expression> (mongo::ExpressionContext*, mongo::BSONElement, mongo::VariablesParseState const&)>::operator()(mongo::ExpressionContext*, mongo::BSONElement, mongo::VariablesParseState const&) const (this=0x7ffff047ed78, __args#0=0x7fffc5508520, __args#1=..., __args#2=...)
    at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/std_function.h:590
#14 0x00005555684d9dc8 in mongo::Expression::parseExpression (expCtx=0x7fffc5508520, obj=..., vps=...) at src/mongo/db/pipeline/expression.cpp:227
#15 0x00005555684d8e60 in mongo::Expression::parseObject (expCtx=0x7fffc5508520, obj=..., vps=...) at src/mongo/db/pipeline/expression.cpp:111
#16 0x00005555684da110 in mongo::Expression::parseOperand (expCtx=0x7fffc5508520, exprElement=..., vps=...) at src/mongo/db/pipeline/expression.cpp:254
#17 0x000055556842f379 in mongo::ExprMatchExpression::ExprMatchExpression (this=0x7fffe5325d40, elem=..., expCtx=..., annotation=...) at src/mongo/db/matcher/expression_expr.cpp:50
#18 0x000055556846c647 in std::make_unique<mongo::ExprMatchExpression, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, std::unique_ptr<mongo::MatchExpression::ErrorAnnotation, std::default_delete<mongo::MatchExpression::ErrorAnnotation> > > ()
    at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/unique_ptr.h:962
#19 0x000055556844d79c in mongo::(anonymous namespace)::parseExpr (name=..., elem=..., expCtx=..., extensionsCallback=0x7fffb9bed680, allowedFeatures=18446744073709551615,
    currentLevel=mongo::DocumentParseLevel::kPredicateTopLevel) at src/mongo/db/matcher/expression_parser.cpp:536
#20 0x000055556847f003 in std::__invoke_impl<mongo::StatusWith<std::unique_ptr<mongo::MatchExpression, std::default_delete<mongo::MatchExpression> > >, mongo::StatusWith<std::unique_ptr<mongo::MatchExpression, std::default_delete<mongo::MatchExpression> > > (*&)(mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel), mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel> (
    __f=@0x7fffb9bed370: 0x55556844d5df <mongo::(anonymous namespace)::parseExpr(mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, mongo::MatchExpressionParser::AllowedFeatureSet, mongo::DocumentParseLevel)>)
    at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/invoke.h:61
#21 0x000055556847c451 in std::__invoke_r<mongo::StatusWith<std::unique_ptr<mongo::MatchExpression, std::default_delete<mongo::MatchExpression> > >, mongo::StatusWith<std::unique_ptr<mongo::MatchExpression, std::default_delete<mongo::MatchExpression> > > (*&)(mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel), mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel> (
    __fn=@0x7fffb9bed370: 0x55556844d5df <mongo::(anonymous namespace)::parseExpr(mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, mongo::MatchExpressionParser::AllowedFeatureSet, mongo::DocumentParseLevel)>)
    at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/invoke.h:116
#22 0x000055556847871b in std::_Function_handler<mongo::StatusWith<std::unique_ptr<mongo::MatchExpression, std::default_delete<mongo::MatchExpression> > > (mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel), mongo::StatusWith<std::unique_ptr<mongo::MatchExpression, std::default_delete<mongo::MatchExpression> > > (*)(mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel)>::_M_invoke(std::_Any_data const&, mongo::StringData&&, mongo::BSONElement&&, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*&&, unsigned long long&&, mongo::DocumentParseLevel&&) (__functor=..., __args#0=..., __args#1=..., __args#2=...,
    __args#3=@0x7fffb9bed1a0: 0x7fffb9bed680, __args#4=@0x7fffb9bed240: 18446744073709551615, __args#5=@0x7fffb9bed248: mongo::DocumentParseLevel::kPredicateTopLevel)
    at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/std_function.h:291
#23 0x000055556846c01d in std::function<mongo::StatusWith<std::unique_ptr<mongo::MatchExpression, std::default_delete<mongo::MatchExpression> > > (mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel)>::operator()(mongo::StringData, mongo::BSONElement, boost::intrusive_ptr<mongo::ExpressionContext> const&, mongo::ExtensionsCallback const*, unsigned long long, mongo::DocumentParseLevel) const (this=0x7fffb9bed370,
    __args#0=..., __args#1=..., __args#2=..., __args#3=0x7fffb9bed680, __args#4=18446744073709551615, __args#5=mongo::DocumentParseLevel::kPredicateTopLevel)
    at /opt/mongodbtoolchain/revisions/c5425838d9b449be0d64b60bf75f08fab326b616/stow/gcc-v4.g6s/include/c++/11.3.0/bits/std_function.h:590
#24 0x000055556844bb24 in mongo::(anonymous namespace)::parse (obj=..., expCtx=..., extensionsCallback=0x7fffb9bed680, allowedFeatures=18446744073709551615,
    currentLevel=mongo::DocumentParseLevel::kPredicateTopLevel) at src/mongo/db/matcher/expression_parser.cpp:309
#25 0x000055556845a8c1 in mongo::MatchExpressionParser::parse (obj=..., expCtx=..., extensionsCallback=..., allowedFeatures=18446744073709551615)
    at src/mongo/db/matcher/expression_parser.cpp:2098
#26 0x000055556753aa94 in operator() (__closure=0x7fffb9bed550) at src/mongo/db/query/canonical_query.cpp:123
#27 0x000055556753afb7 in mongo::CanonicalQuery::canonicalize (opCtx=0x7fffc55356a0, findCommand=..., explain=false, expCtx=..., extensionsCallback=...,
    allowedFeatures=18446744073709551615, projectionPolicies=..., pipeline=..., isCountLike=false) at src/mongo/db/query/canonical_query.cpp:125
#28 0x000055556718299b in mongo::getShardIdsForQuery (expCtx=..., query=..., collation=..., cm=..., shardIds=0x7fffb9bed840, chunkRanges=0x0, targetMinKeyToMaxKey=0x0)
    at src/mongo/s/shard_key_pattern_query_util.cpp:426
#29 0x0000555567167699 in mongo::(anonymous namespace)::buildVersionedRequestsForTargetedShards (opCtx=0x7fffc55356a0, nss=..., cri=..., shardsToSkip=..., cmdObj=..., query=...,
    collation=..., eligibleForSampling=false) at src/mongo/s/cluster_commands_helpers.cpp:183
#30 0x000055556716907e in mongo::scatterGatherVersionedTargetByRoutingTable (opCtx=0x7fffc55356a0, dbName=..., nss=..., cri=..., cmdObj=..., readPref=...,
    retryPolicy=mongo::Shard::RetryPolicy::kIdempotent, query=..., collation=..., eligibleForSampling=false) at src/mongo/s/cluster_commands_helpers.cpp:431
#31 0x000055556711ab3f in mongo::sharded_agg_helpers::dispatchShardPipeline (serializedCommand=..., hasChangeStream=false, startsWithDocuments=false, eligibleForSampling=false,
    pipeline=..., shardTargetingPolicy=mongo::ShardTargetingPolicy::kAllowed, readConcern=...) at src/mongo/db/pipeline/sharded_agg_helpers.cpp:1165
#32 0x0000555567117ad0 in mongo::sharded_agg_helpers::targetShardsAndAddMergeCursors (expCtx=..., targetRequest=..., shardCursorsSortSpec=...,
    shardTargetingPolicy=mongo::ShardTargetingPolicy::kAllowed, readConcern=...) at src/mongo/db/pipeline/sharded_agg_helpers.cpp:788
#33 0x000055556711e849 in operator() (__closure=0x7fffb9beef70, opCtx=0x7fffc55356a0, cri=...) at src/mongo/db/pipeline/sharded_agg_helpers.cpp:1639
#34 0x000055556711eb0e in mongo::sharding::router::CollectionRouter::route<mongo::sharded_agg_helpers::attachCursorToPipeline(mongo::Pipeline*, mongo::ShardTargetingPolicy, boost::optional<mongo::BSONObj>)::<lambda(mongo::OperationContext*, const mongo::CollectionRoutingInfo&)> >(mongo::OperationContext *, mongo::StringData, struct {...} &&) (
    this=0x7fffb9bef000, opCtx=0x7fffc55356a0, comment=..., callbackFn=...) at src/mongo/s/router.h:105
#35 0x000055556711f046 in mongo::sharded_agg_helpers::attachCursorToPipeline (ownedPipeline=0x7ffff0086820, shardTargetingPolicy=mongo::ShardTargetingPolicy::kAllowed,
--Type <RET> for more, q to quit, c to continue without paging--c
    readConcern=...) at src/mongo/db/pipeline/sharded_agg_helpers.cpp:1640
#36 0x000055556636735e in mongo::ShardServerProcessInterface::attachCursorSourceToPipeline (this=0x7ffff0466b10, ownedPipeline=0x7ffff0086820, shardTargetingPolicy=mongo::ShardTargetingPolicy::kAllowed, readConcern=...) at src/mongo/db/pipeline/process_interface/shardsvr_process_interface.cpp:387
#37 0x000055556779f751 in mongo::DocumentSourceLookUp::buildPipeline (this=0x7fffc55e8920, inputDoc=...) at src/mongo/db/pipeline/document_source_lookup.cpp:598
#38 0x000055556779e8a4 in mongo::DocumentSourceLookUp::doGetNext (this=0x7fffc55e8920) at src/mongo/db/pipeline/document_source_lookup.cpp:451
#39 0x000055556485913e in mongo::DocumentSource::getNext (this=0x7fffc55e8920) at src/mongo/db/pipeline/document_source.h:370
#40 0x00005555678698af in mongo::Pipeline::getNext (this=0x7ffff00a61a0) at src/mongo/db/pipeline/pipeline.cpp:499
#41 0x0000555566c52dde in mongo::PlanExecutorPipeline::_tryGetNext (this=0x7fffc5601820) at src/mongo/db/pipeline/plan_executor_pipeline.cpp:131
#42 0x0000555566c52c49 in mongo::PlanExecutorPipeline::_getNext (this=0x7fffc5601820) at src/mongo/db/pipeline/plan_executor_pipeline.cpp:119
#43 0x0000555566c52b2e in mongo::PlanExecutorPipeline::getNextDocument (this=0x7fffc5601820, docOut=0x7fffb9bef778, recordIdOut=0x0) at src/mongo/db/pipeline/plan_executor_pipeline.cpp:101
#44 0x0000555566c529b3 in mongo::PlanExecutorPipeline::getNext (this=0x7fffc5601820, objOut=0x7fffb9bef7d0, recordIdOut=0x0) at src/mongo/db/pipeline/plan_executor_pipeline.cpp:80
#45 0x0000555566cd719d in mongo::(anonymous namespace)::executePlan (exec=0x7fffc5601820) at src/mongo/db/query/explain.cpp:314
#46 0x0000555566cd765c in mongo::Explain::explainPipeline (exec=0x7fffc5601820, executePipeline=true, verbosity=mongo::explain::VerbosityEnum::kExecStats, command=..., out=0x7fffb9befba0) at src/mongo/db/query/explain.cpp:378
#47 0x00005555662ae343 in mongo::runAggregate (opCtx=0x7fffc55356a0, origNss=..., request=..., liteParsedPipeline=..., cmdObj=..., privileges=..., result=0x7ffff0061520, externalDataSourceGuard=...) at src/mongo/db/commands/run_aggregate.cpp:1119
#48 0x000055556629a526 in mongo::(anonymous namespace)::PipelineCommand::Invocation::explain (this=0x7fffbbc8db20, opCtx=0x7fffc55356a0, verbosity=mongo::explain::VerbosityEnum::kExecStats, result=0x7ffff0061520) at src/mongo/db/commands/pipeline_command.cpp:246
#49 0x000055556624566c in mongo::(anonymous namespace)::CmdExplain::Invocation::run (this=0x7fffc54f6920, opCtx=0x7fffc55356a0, result=0x7ffff0061520) at src/mongo/db/commands/explain_cmd.cpp:122
#50 0x000055556835e98b in mongo::CommandHelpers::runCommandInvocation (opCtx=0x7fffc55356a0, request=..., invocation=0x7fffc54f6920, response=0x7ffff0061520) at src/mongo/db/commands.cpp:188
#51 0x000055556835e6f3 in operator() (__closure=0x7ff

Here, the DocumentSouceLookUp is trying to target the query for the inner side to the necessary shards. However, this query is not a regular aggregate command, but rather an explain command. Because it is an explain command, we follow a codepath which allocates its own ExpressionContext. This fresh ExpressionContext does not have the original VariablesParseState object which is used to resolve variable names at parse time. Later on, we try to use this ExpressionContext to parse the subquery on the inner side of the $lookup. This re-parsing is done in order to do shard targeting. However, the parsing fails with a "Use of undefined variable" error because the aforementioned freshly allocated ExpressionContext does not have the correct VariablesParseState.

Zooming out, there is a more fundamental problem here. Explain in "executionStats" or "allPlansExecution" mode currently is not designed to work across network boundaries. In order for this to work, the system would have to be designed to carry runtime stats alongside the actual query results; that way, the merging node could both collect the stats and continue executing the query in order to have an accurate picture of the runtime stats on the merging side. Furthermore, the explain output for the inner side of $lookup is quite limited in its current form (see SERVER-22622). For these reasons, I don't think we should be attempting to run an explain command across all targeted shards on the inner side of a $lookup into a sharded collection. The original symptom of this ticket is that the query can fail spuriously while the system is attempting to do so, but we probably shouldn't be doing so in the first place.

A related observation is that the execution stats reported by $lookup are not correct when the collection on the inner side is sharded. I modified the repro script above slightly so that the explain succeeds, and saw output like this:

[js_test:repro]                                 {
[js_test:repro]                                         "$lookup" : {
[js_test:repro]                                                 "from" : "foreign",
[js_test:repro]                                                 "as" : "as",
[js_test:repro]                                                 "let" : {
[js_test:repro]
[js_test:repro]                                                 },
[js_test:repro]                                                 "pipeline" : [
[js_test:repro]                                                         {
[js_test:repro]                                                                 "$match" : {
[js_test:repro]                                                                         "$expr" : {
[js_test:repro]                                                                                 "$eq" : [
[js_test:repro]                                                                                         "$x",
[js_test:repro]                                                                                         "foo"
[js_test:repro]                                                                                 ]
[js_test:repro]                                                                         }
[js_test:repro]                                                                 }
[js_test:repro]                                                         }
[js_test:repro]                                                 ]
[js_test:repro]                                         },
[js_test:repro]                                         "totalDocsExamined" : NumberLong(0),
[js_test:repro]                                         "totalKeysExamined" : NumberLong(0),
[js_test:repro]                                         "collectionScans" : NumberLong(0),
[js_test:repro]                                         "indexesUsed" : [ ],
[js_test:repro]                                         "nReturned" : NumberLong(1),
[js_test:repro]                                         "executionTimeMillisEstimate" : NumberLong(20)
[js_test:repro]                                 }

The fact that "totalDocsExamined" and "collectionScans" are both zero is incorrect. The explain outputs are gathered from both shards for the subpipeline, but any collection scans, indexes used, or docs/keys examined are being discarded rather than incorporated into the runtime stats. Since, as I explained above, the explain implementation can't easily report accurate runtime statistics in sharded scenarios right now, I think the best course of action would be to omit these stats entirely when the inner collection is sharded.

As a next step, I'm going to investigate possible fixes for both the fact that explain can fail spuriously and the incorrect $lookup exec stats problem.

Comment by David Storch [ 22/Dec/22 ]

Thanks for reporting this issue wernfried.domscheit@sunrise.net and thanks for your investigation so far yuan.fang@mongodb.com! I was able to reproduce the bug based on the provided information. I created a minimal reproduction script that can be run in the server's integration test environment with an invocation like this:

 python3 buildscripts/resmoke.py run --installDir=build/install/bin --suites=sharding repro.js

Here's the full repro script:

(function() {
"use strict";
 
const dbName = "test";
 
const st = new ShardingTest({shards: 2});
const db = st.s.getDB(dbName);
 
const localColl = db["local"];
const foreignColl = db["foreign"];
localColl.drop();
foreignColl.drop();
 
assert.commandWorked(localColl.insertOne({
    x: "foo",
}));
assert.commandWorked(foreignColl.insertOne({
    x: "foo",
    y: "bar",
}));
assert.commandWorked(foreignColl.createIndex({y: 1, x: 1}));
 
assert.commandWorked(st.s.adminCommand({enableSharding: dbName}));
assert.commandWorked(
    st.s.adminCommand({shardCollection: foreignColl.getFullName(), key: {y: 1, x: 1}}));
 
let explain = localColl.explain("executionStats").aggregate([
    {
        $lookup: {
            from: foreignColl.getName(),
            let: {
                myX: "$x",
            },
            pipeline: [
                {$match: {$expr: {$eq: ["$x", "$$myX"]}}},
            ],
            as: "as"
        }
    },
])
printjson(explain);
 
st.stop();
}());

When run against a recent build of the master branch of the server, this test fails with the following error message:

[js_test:repro] uncaught exception: Error: explain failed: {
[js_test:repro]         "ok" : 0,
[js_test:repro]         "errmsg" : "Use of undefined variable: myX",
[js_test:repro]         "code" : 17276,
[js_test:repro]         "codeName" : "Location17276",
[js_test:repro]         "$clusterTime" : {
[js_test:repro]                 "clusterTime" : Timestamp(1671723711, 38),
[js_test:repro]                 "signature" : {
[js_test:repro]                         "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
[js_test:repro]                         "keyId" : NumberLong(0)
[js_test:repro]                 }
[js_test:repro]         },
[js_test:repro]         "operationTime" : Timestamp(1671723711, 38)
[js_test:repro] } :
[js_test:repro] _getErrorWithCode@src/mongo/shell/utils.js:24:13
[js_test:repro] throwOrReturn@src/mongo/shell/explainable.js:25:19
[js_test:repro] constructor/this.aggregate@src/mongo/shell/explainable.js:125:24
[js_test:repro] @repro.js:27:51
[js_test:repro] @repro.js:42:2
[js_test:repro] failed to load: repro.js

Since the bug only reproduces if the $lookup's foreign collection is sharded, I suspect that this is a regression that was introduced in version 5.1 as part of the project to support $lookup into a sharded collection. The 5.x branches have been EOL'ed, so this would mean that the fix will need to be backported to 6.2 and to 6.0 after it lands in master.

I haven't dug in yet to determine the details of the root cause, but the symptom suggests that the inner pipeline is being targeted to one of the shards without the necessary values for the myX variable shipped over the wire. Given the current status of explain for $lookup and the complexity of sharded $lookup, the system probably shouldn't be attempting to gather explain execution stats for the inner pipeline at all.

I am going to mark this ticket as "In Progress" given that I have started looking into it, but I won't be able to continue the investigation for a few weeks while I am away for the winter holidays.

Comment by Yuan Fang [ 29/Nov/22 ]

Hi wernfried.domscheit@sunrise.net,

Thank you for your report. I have reproduced the same error on my end. I will assign this ticket to our query execution team to look at this issue closely.

Regards,
Yuan

Generated at Thu Feb 08 06:19:34 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.