$lookup explain reports runtime stats in a misleading way when targeting multiple shards on the inner side

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor - P4
    • None
    • Affects Version/s: None
    • None
    • Query Execution
    • ALL
    • Hide

      This script can be run with resmoke.py using --suites=sharding. It prints the misleading explain output.

      import {ShardingTest} from "jstests/libs/shardingtest.js";
      
      let st = new ShardingTest({name: "lookup_stats_test", shards: 2, mongos: 1});
      var db = st.getDB("test");
      st.adminCommand({enableSharding: "test"});
      
      for (let collName of ["outer", "inner"]) {
          assert.commandWorked(db[collName].insert({_id: 1, key: 1, x: 1}));
          assert.commandWorked(db[collName].insert({_id: 2, key: 2, x: 2}));
          assert.commandWorked(db[collName].createIndex({key: 1}));
          assert.commandWorked(db[collName].createIndex({x: 1}));
      
          let namespace = "test." + collName;
          st.adminCommand({shardCollection: namespace, key: {key: 1}});
      }
      
      // Create an empty chunk for the inner collection and move it to the non-primary shard. This will
      // force the $lookup to target both shards even though all data resides on the primary shard.
      st.adminCommand({split: "test.inner", middle: {key: 0}});
      st.adminCommand({
          moveChunk: "test.inner",
          find: {key: -1},
          to: st.getOther(st.getPrimaryShard("test")).name,
          _waitForDelete: true,
      });
      
      let pipeline = [
          // Prohibit the $lookup from being pushed down to SBE.
          {$_internalInhibitOptimization: {}},
          // The join is _not_ on the shard key, so this $lookup must target both shards.
          {
              $lookup: {
                  from: "inner",
                  localField: "x",
                  foreignField: "x",
                  as: "joined",
              },
          },
      ];
      
      let explain = db.outer.explain("executionStats").aggregate(pipeline);
      printjson(explain);
      
      st.stop();
      
      Show
      This script can be run with resmoke.py using --suites=sharding . It prints the misleading explain output. import {ShardingTest} from "jstests/libs/shardingtest.js" ; let st = new ShardingTest({name: "lookup_stats_test" , shards: 2, mongos: 1}); var db = st.getDB( "test" ); st.adminCommand({enableSharding: "test" }); for (let collName of [ " outer " , " inner " ]) { assert .commandWorked(db[collName].insert({_id: 1, key: 1, x: 1})); assert .commandWorked(db[collName].insert({_id: 2, key: 2, x: 2})); assert .commandWorked(db[collName].createIndex({key: 1})); assert .commandWorked(db[collName].createIndex({x: 1})); let namespace = "test." + collName; st.adminCommand({shardCollection: namespace, key: {key: 1}}); } // Create an empty chunk for the inner collection and move it to the non-primary shard. This will // force the $lookup to target both shards even though all data resides on the primary shard. st.adminCommand({split: "test. inner " , middle: {key: 0}}); st.adminCommand({ moveChunk: "test. inner " , find: {key: -1}, to: st.getOther(st.getPrimaryShard( "test" )).name, _waitForDelete: true , }); let pipeline = [ // Prohibit the $lookup from being pushed down to SBE. {$_internalInhibitOptimization: {}}, // The join is _not_ on the shard key, so this $lookup must target both shards. { $lookup: { from: " inner " , localField: "x" , foreignField: "x" , as: "joined" , }, }, ]; let explain = db. outer .explain( "executionStats" ).aggregate(pipeline); printjson(explain); st.stop();
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      When explaining a $lookup query (and the $lookup is not pushed down to SBE) in "executionStats" verbosity mode or higher, we report several runtime statistics pertaining to the execution on the inner side:

      • totalDocsExamined
      • totalKeysExamined
      • collectionScans
      • indexesUsed

      The inner collection may be sharded, and in the general case this means that we may have to target multiple shards. When this happens, the system is unable to gather runtime statistics on the inner side. This is because we currently lack a mechanism for transmitting statistics (alongside the intermediate result set) between shards when running a sharded explain. When the system is unable to gather runtime statistics for a sharded inner collection, we end up with something like the following:

      {
          "$lookup" : {
                      "from" : "inner",
                      "as" : "joined",
                      "localField" : "x",
                      "foreignField" : "x"
              },
              "nReturned" : NumberLong(2),
              "executionTimeMillisEstimate" : NumberLong(23),
              "totalDocsExamined" : NumberLong(0),
              "totalKeysExamined" : NumberLong(0),
              "collectionScans" : NumberLong(0),
              "indexesUsed" : [ ]
      }
      

      Taken at face value, this makes it seem like the $lookup returned 2 documents in 23 milliseconds, but examined no keys or documents on the inner side and performed neither any collection scans or index scans. In reality, 2 docs and 2 keys were examined on the inner side, and 2 index scans were performed against the {key: 1} index. Please see the "Steps to Reproduce" for a detailed script that produces the erroneous output. What makes this even more confusing is that in my example, all of the data for both the inner and outer collections happens to reside on the primary shard – but the reporting issue still arises because the routing table for the inner collection is such that the $lookup needs to target both collections.

      Ideally we would improve the system so that these statistics can be accurately reported in a sharded cluster. This is likely easier said than done because it requires communicating statistics and intermediate results alongside each other across the wire. A stopgap solution would be to change the output to somehow make it clear that statistics are known to be missing. Perhaps we could count the number of times that we had to target multiple shards on the inner side, with a note that when this happens the runtime stats can be under-reported?

            Assignee:
            [DO NOT USE] Backlog - Query Execution
            Reporter:
            David Storch
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated: