Join optimization: cardinality estimate for join is off by 10^2 (overestimation)

XMLWordPrintableJSON

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical - P2
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • Query Optimization
    • ALL
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      In the query below, the cardinality estimate for the top-level join is larger than the actual cardinality by two orders of magnitude:

      pipeline = EJSON.deserialize(
          [{
                  "$match": {
                      "p_comment": {
                          "$regex": "^ r"
                      }
                  }
              },
              {
                  "$lookup": {
                      "from": "lineitem",
                      "localField": "p_partkey",
                      "foreignField": "l_partkey",
                      "as": "lineitem",            }
              },
              {
                  "$unwind": "$lineitem"
              },
              {
                  "$lookup": {
                      "from": "partsupp",
                      "localField": "p_partkey",
                      "foreignField": "ps_partkey",
                      "as": "partsupp",
                  }
              },
              {
                  "$unwind": "$partsupp"
              }
          ]
      );
      
      nterprise subjoin_cardinality> db.part.aggregate(pipeline).toArray().length;
      24344
      Enterprise subjoin_cardinality> db.part.aggregate(pipeline).explain().queryPlanner.winningPlan.queryPlan.cardinalityEstimate;
      1695352.9477173113
      Enterprise subjoin_cardinality> db.part.aggregate(pipeline).explain().queryPlanner.winningPlan.queryPlan;
      {
        stage: 'HASH_JOIN_EMBEDDING',
        planNodeId: 6,
        costEstimate: 1405.4084090308088,
        cardinalityEstimate: 1695352.9477173113,
        estimatesMetadata: { ceSource: 'Sampling' },
        leftEmbeddingField: 'none',
        rightEmbeddingField: 'partsupp',
        joinPredicates: [ 'p_partkey = ps_partkey', 'lineitem.l_partkey = ps_partkey' ],
        inputStages: [
          {
            stage: 'INDEXED_NESTED_LOOP_JOIN_EMBEDDING',
            planNodeId: 4,
            costEstimate: 183.26585501196288,
            cardinalityEstimate: 5405.148,
            estimatesMetadata: { ceSource: 'Metadata' },
            leftEmbeddingField: 'none',
            rightEmbeddingField: 'lineitem',
            joinPredicates: [ 'p_partkey = l_partkey' ],
            inputStages: [
              {
                stage: 'COLLSCAN',
                planNodeId: 1,
                costEstimate: 20.949809859962873,
                cardinalityEstimate: 180,
                estimatesMetadata: { ceSource: 'Code' },
                filter: { p_comment: { '$regex': '^ r' } },
                nss: 'subjoin_cardinality.part',
                direction: 'forward'
              },
              {
                stage: 'FETCH',
                planNodeId: 3,
                nss: 'subjoin_cardinality.lineitem',
                inputStage: {
                  stage: 'INDEX_PROBE_NODE',
                  planNodeId: 2,
                  nss: 'subjoin_cardinality.lineitem',
                  keyPattern: { l_partkey: 1 },
                  indexName: 'l_partkey_1',
                  isMultiKey: false,
                  isUnique: false,
                  isSparse: false,
                  isPartial: false,
                  indexVersion: 2
                }
              }
            ]
          },
          {
            stage: 'COLLSCAN',
            planNodeId: 5,
            costEstimate: 132.31859943985148,
            cardinalityEstimate: 80000,
            estimatesMetadata: { ceSource: 'Code' },
            filter: {},
            nss: 'subjoin_cardinality.partsupp',
            direction: 'forward'
          }
        ]
      }

      The estimate for the two-table join between part and lineitem is reasonably correct. 

      Git revision a5de0b02d614d7d5bb756bb71ee082ddca8849a1

            Assignee:
            Unassigned
            Reporter:
            Philip Stoev
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated: