Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Unresolved
Priority: Minor - P4
Fix Version/s: None
Affects Version/s: None
Component/s: Distributed Query Execution
Labels:
None

Assigned Teams:

Query Execution
Operating System:
ALL
Steps To Reproduce:
Hide

This script can be run with resmoke.py using --suites=sharding. It prints the misleading explain output.

import {ShardingTest} from "jstests/libs/shardingtest.js"; let st = new ShardingTest({name: "lookup_stats_test", shards: 2, mongos: 1}); var db = st.getDB("test"); st.adminCommand({enableSharding: "test"}); for (let collName of ["outer", "inner"]) { assert.commandWorked(db[collName].insert({_id: 1, key: 1, x: 1})); assert.commandWorked(db[collName].insert({_id: 2, key: 2, x: 2})); assert.commandWorked(db[collName].createIndex({key: 1})); assert.commandWorked(db[collName].createIndex({x: 1})); let namespace = "test." + collName; st.adminCommand({shardCollection: namespace, key: {key: 1}}); } // Create an empty chunk for the inner collection and move it to the non-primary shard. This will // force the $lookup to target both shards even though all data resides on the primary shard. st.adminCommand({split: "test.inner", middle: {key: 0}}); st.adminCommand({ moveChunk: "test.inner", find: {key: -1}, to: st.getOther(st.getPrimaryShard("test")).name, _waitForDelete: true, }); let pipeline = [ // Prohibit the $lookup from being pushed down to SBE. {$_internalInhibitOptimization: {}}, // The join is _not_ on the shard key, so this $lookup must target both shards. { $lookup: { from: "inner", localField: "x", foreignField: "x", as: "joined", }, }, ]; let explain = db.outer.explain("executionStats").aggregate(pipeline); printjson(explain); st.stop();
Show
This script can be run with resmoke.py using --suites=sharding . It prints the misleading explain output. import {ShardingTest} from "jstests/libs/shardingtest.js" ; let st = new ShardingTest({name: "lookup_stats_test" , shards: 2, mongos: 1}); var db = st.getDB( "test" ); st.adminCommand({enableSharding: "test" }); for (let collName of [ " outer " , " inner " ]) { assert .commandWorked(db[collName].insert({_id: 1, key: 1, x: 1})); assert .commandWorked(db[collName].insert({_id: 2, key: 2, x: 2})); assert .commandWorked(db[collName].createIndex({key: 1})); assert .commandWorked(db[collName].createIndex({x: 1})); let namespace = "test." + collName; st.adminCommand({shardCollection: namespace, key: {key: 1}}); } // Create an empty chunk for the inner collection and move it to the non-primary shard. This will // force the $lookup to target both shards even though all data resides on the primary shard. st.adminCommand({split: "test. inner " , middle: {key: 0}}); st.adminCommand({ moveChunk: "test. inner " , find: {key: -1}, to: st.getOther(st.getPrimaryShard( "test" )).name, _waitForDelete: true , }); let pipeline = [ // Prohibit the $lookup from being pushed down to SBE. {$_internalInhibitOptimization: {}}, // The join is _not_ on the shard key, so this $lookup must target both shards. { $lookup: { from: " inner " , localField: "x" , foreignField: "x" , as: "joined" , }, }, ]; let explain = db. outer .explain( "executionStats" ).aggregate(pipeline); printjson(explain); st.stop();
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

When explaining a $lookup query (and the $lookup is not pushed down to SBE) in "executionStats" verbosity mode or higher, we report several runtime statistics pertaining to the execution on the inner side:

totalDocsExamined
totalKeysExamined
collectionScans
indexesUsed

The inner collection may be sharded, and in the general case this means that we may have to target multiple shards. When this happens, the system is unable to gather runtime statistics on the inner side. This is because we currently lack a mechanism for transmitting statistics (alongside the intermediate result set) between shards when running a sharded explain. When the system is unable to gather runtime statistics for a sharded inner collection, we end up with something like the following:

{
    "$lookup" : {
                "from" : "inner",
                "as" : "joined",
                "localField" : "x",
                "foreignField" : "x"
        },
        "nReturned" : NumberLong(2),
        "executionTimeMillisEstimate" : NumberLong(23),
        "totalDocsExamined" : NumberLong(0),
        "totalKeysExamined" : NumberLong(0),
        "collectionScans" : NumberLong(0),
        "indexesUsed" : [ ]
}

Taken at face value, this makes it seem like the $lookup returned 2 documents in 23 milliseconds, but examined no keys or documents on the inner side and performed neither any collection scans or index scans. In reality, 2 docs and 2 keys were examined on the inner side, and 2 index scans were performed against the {key: 1} index. Please see the "Steps to Reproduce" for a detailed script that produces the erroneous output. What makes this even more confusing is that in my example, all of the data for both the inner and outer collections happens to reside on the primary shard – but the reporting issue still arises because the routing table for the inner collection is such that the $lookup needs to target both collections.

Ideally we would improve the system so that these statistics can be accurately reported in a sharded cluster. This is likely easier said than done because it requires communicating statistics and intermediate results alongside each other across the wire. A stopgap solution would be to change the output to somehow make it clear that statistics are known to be missing. Perhaps we could count the number of times that we had to target multiple shards on the inner side, with a note that when this happens the runtime stats can be under-reported?

is related to

SERVER-79702 $lookup subpipeline exectionStats output all 0

Closed

Assignee:: [DO NOT USE] Backlog - Query Execution
Reporter:: David Storch
Participants:: [DO NOT USE] Backlog - Query Execution, David Storch, Mihai Andrei
Votes:: 0 Vote for this issue
Watchers:: 4 Start watching this issue

Created:: Oct 23 2025 05:52:00 PM UTC
Updated:: Oct 27 2025 03:43:21 PM UTC

Details

Description

Attachments

Issue Links

Activity

People

Dates