[SERVER-80576] Microbenchmarks - investigate regression in $in queries Created: 31/Aug/23  Updated: 31/Jan/24

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Anton Korshunov Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 0
Labels: M7
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-85436 ABT Constant value pooling for effici... Backlog
related to SERVER-80735 [CQF] Lazy simplification Backlog
Assigned Teams:
Query Optimization
Participants:

 Description   

Specifically, in the following workloads:

Q.UnindexedVeryLargeInSortedMatching
Q.UnindexedVeryLargeInSortedMatchingBigCollection
Q.UnindexedVeryLargeInUnsortedMatching
Q.UnindexedVeryLargeInUnsortedMatchingBigCollection
Q.UnindexedLargeInNonMatching
Q.UnindexedLargeInMatching
Q.UnindexedLargeInUnsortedMatching
Q.UnindexedLargeInUnsortedNonMatching
Q.UnindexedLargeInMatchingBigCollection
Q.UnindexedLargeInUnsortedMatchingBigCollection



 Comments   
Comment by Timour Katchaounov [ 19/Jan/24 ]

The problem with slow optimization is solved by constant pooling, as described in SERVER-85436 and the design doc.

 

The problem with slow query execution is solved by plan cache parameterization. It is not clear why the non-parameterized plan is so much slower.

Comment by Timour Katchaounov [ 14/Dec/23 ]

This is superseded by the work on skipping the SargableNode rewrite.

Comment by Timour Katchaounov [ 12/Oct/23 ]

Profiling and analysis of the remaining overhead in addition to the Filter->SargableNode conversion. These benchmarks and profiles were run after disabling the Filter->SargableNode rewrite.

  1. Queries.UnindexedVeryLargeInSortedMatching
    1. classic planner: 600 qps
    2. Bonsai/master: 6.5 qps
    3. Bonsai/disabled filter->sargable: 335 qps
  2. Queries.ComplexExpressionQuery_CollScan_MixedBoolExpr2
    1. classic planner: 551 qps
    2. Bonsai/master:    69 qps
    3. Bonsai/disabled filter->sargable: 148 qps

Linux perf/FlameGraph analysis of the performance profile without the filter->sargable conversion, based on running the single test "Queries.UnindexedVeryLargeInSortedMatching":

 

- getSBEExecutorViaCascadesOptimizer: 70% (vs 99% master)
  - OptPhaseManager::optimizeNoAssert: 36%
    - OptPhaseManager::runStructuralPhase: 10%
    - OptPhaseManager::runMemoRewritePhases: 23%
      - OptPhaseManager::runMemoLogicalRewrite: 16.4%
        - LogicalRewriter::addRootNode: 13%
          - MemoIntegrator::addNode: 12%
            - Constant::Constant: 11%
              - sbe::value::copyValue: 11%
                - sbe::value::Array::Array: 10%
        - extractLatestPlan: 3%
      - OptPhaseManager::runMemoPhysicalRewrite: 6.4%
  - translateCanonicalQueryToABT: 28%
    - decomposeToFilterNodes: 4%
    - generateMatchExpression: 22%
      - ABTMatchExpressionVisitor::visit: 22%
        - sbe::value::MakeValue: 15%
  - SBENodeLowering: 3%

 

Intel VTune analysis based on running a unit test with 1000 sequential (sorted) $in elements, directly generated as an ABT.

 

- OptPhaseManager::optimizeNoAssert: 50% (0.03 sec)
  - OptPhaseManager::runMemoRewritePhases: 50%
    - OptPhaseManager::runMemoLogicalRewrite: 13.3%
      - Memo::integrate: 13.3%
        - MemoIntegrator::addNode[s]: 13.3%
          - 'ABT forMemo = n;': 13.3% (26.6% of total optimization time)
    - OptPhaseManager::runMemoPhysicalRewrite: 36.7%
      - PhysicalRewriter::optimizeGroup - costAndRetainBestNode - optimizeChildren - optimizeGroup - addImplementers -
        implementationVisitor - collectVariableReferences - NodeVariableTracker::walk - NodeVariableTracker::extractFromABT: 20%
        - VariableEnvironment::walkVariables: 20%
      - extractPhysicalPlans: 16.7%
        - MemoPhysicalPlanExtractor::extract: 16.7%
          - properties::removeProperty: 16.7%
            ...
            - PolyValue::destroy: 16.7%

 

General conclusions:

The Filter->SargableNode conversion (mostly interval simplification) is the major performance problem for large $in. However, even if it were infinitely fast, there would still be substantial overhead (roughly 50% lower qps than the classic planner).

This additional overhead is due to copying the $in list a few extra times:

  • Generation of the input ABT, because it copies the $in list from its BSON representation into an ABT.
  • Shredding the query into the Memo, because 'MemoIntegrator::addNode' executes 'ABT forMemo = n;', which performs a deep copy of the ABT and thus copies the whole $in list (see the sketch after this list).
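
To make the copy cost concrete, below is a minimal self-contained C++ sketch (toy types, not the actual mongo::optimizer ABT classes) contrasting a value-semantics tree whose copy constructor deep-copies an embedded $in constant with a pooled variant that shares one immutable array, along the lines of the constant pooling proposed in SERVER-85436:

{code:cpp}
#include <cstdint>
#include <iostream>
#include <memory>
#include <vector>

// Toy stand-in for an ABT node with value semantics: copying the tree
// deep-copies every node, including the large $in constant payload.
struct Node {
    std::vector<int64_t> inList;                  // large $in constant
    std::vector<std::unique_ptr<Node>> children;  // owned subtrees

    Node() = default;
    Node(const Node& other) : inList(other.inList) {  // O(|inList|) per copy
        for (const auto& c : other.children)
            children.push_back(std::make_unique<Node>(*c));
    }
};

// Pooled variant: the constant is an immutable shared payload, so copying
// the tree only bumps a reference count instead of duplicating the array.
struct PooledNode {
    std::shared_ptr<const std::vector<int64_t>> inList;
    std::vector<std::unique_ptr<PooledNode>> children;

    PooledNode() = default;
    PooledNode(const PooledNode& other) : inList(other.inList) {  // O(1) for the constant
        for (const auto& c : other.children)
            children.push_back(std::make_unique<PooledNode>(*c));
    }
};

int main() {
    Node n;
    n.inList.assign(1000, 42);
    Node forMemo = n;  // like 'ABT forMemo = n;': re-copies all 1000 elements
    std::cout << forMemo.inList.size() << "\n";

    PooledNode p;
    p.inList = std::make_shared<const std::vector<int64_t>>(1000, 42);
    PooledNode pooled = p;  // shares the constant; copies no elements
    std::cout << pooled.inList->size() << "\n";
}
{code}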

In addition, there seems to be extra overhead in OptPhaseManager::runMemoPhysicalRewrite, which calls:

  • VariableEnvironment::walkVariables (not investigated further; it is unclear why it takes so much time), and
  • MemoPhysicalPlanExtractor::extract, which calls PolyValue::destroy (probably freeing the $in list; not verified yet).

It is not clear how large this overhead is, because some profiling approaches do not show it as a primary problem.

Depending on the profiling method, a different stage appears to be the primary CPU consumer.

  • According to callgrind, memo shredding, which calls addNode and copies the $in array, is the biggest consumer.
  • OptPhaseManager::runMemoPhysicalRewrite shows up in only one profile; perf and callgrind do not show it as significant.
Comment by Timour Katchaounov [ 05/Oct/23 ]

A performance investigation based on a unit test of an $in with 1000 elements in increasing order shows that the majority of the time is spent in the function unionDNFIntervals, as follows:

  • unionDNFIntervals: 67%
    • unionTwoIntervals: 66%
      • ConstEval::constFold: 63%
        • ConstEval::optimize: 33%
        • VariableEnvironment::build: 29%

The analysis was performed using the Intel VTune profiler.

 

The reasons for these high processing costs are:

  • Intervals resulting from $in queries are optimized as generic disjunctions, without taking into account the specifics of $in: all disjuncts are equalities on the same field, and all duplicates have already been removed by this optimization phase.
  • The implementation of unionTwoIntervals is too generic: it is designed to handle variable bounds, and for that purpose it constructs fairly complex ABT expressions that reduce to the merged interval bounds once constant-folded (see the sketch after this list). Note, however, that in the current optimizer this function is called only with constant bounds, as guaranteed by unionDNFIntervals.
  • Constant folding walks the whole ABT looking for variables, which is unnecessary for ABT expressions built purely over constants.
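
The second and third points can be illustrated with a toy example (hypothetical types, not the real ABT/ConstEval API): the generic path allocates an If/Lte expression per pair of bounds and then walks it to fold, whereas for constant bounds a single comparison suffices:

{code:cpp}
#include <algorithm>
#include <cassert>
#include <memory>

// Toy expression tree standing in for an ABT fragment over constants.
struct Expr {
    enum class Op { Const, Lte, If } op = Op::Const;
    long value = 0;  // used when op == Const
    std::unique_ptr<Expr> a, b, c;
};

std::unique_ptr<Expr> makeConst(long v) {
    auto e = std::make_unique<Expr>();
    e->value = v;
    return e;
}

// Builds If(x <= y, x, y): the kind of shape a generic bound-merging
// rewrite emits so that later constant folding reduces it to min(x, y).
std::unique_ptr<Expr> makeMinExpr(long x, long y) {
    auto cond = std::make_unique<Expr>();
    cond->op = Expr::Op::Lte;
    cond->a = makeConst(x);
    cond->b = makeConst(y);
    auto e = std::make_unique<Expr>();
    e->op = Expr::Op::If;
    e->a = std::move(cond);
    e->b = makeConst(x);
    e->c = makeConst(y);
    return e;
}

// Recursive fold: visits every node, analogous to ConstEval walking the
// tree (plus, in the real optimizer, rebuilding the VariableEnvironment).
long fold(const Expr& e) {
    switch (e.op) {
        case Expr::Op::Const: return e.value;
        case Expr::Op::Lte:   return fold(*e.a) <= fold(*e.b) ? 1 : 0;
        case Expr::Op::If:    return fold(*e.a) ? fold(*e.b) : fold(*e.c);
    }
    return 0;
}

int main() {
    // Generic path: six node allocations plus a full tree walk.
    assert(fold(*makeMinExpr(3, 7)) == 3);
    // Constant-only path: one comparison, no allocations.
    assert(std::min(3L, 7L) == 3);
}
{code}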

There are several approaches to improve the performance of large $in queries:

  1. Take into account the specifics of $in predicates, and do not perform interval simplification over them as if they are generic disjunctions. This is the most promising route.
  2. For constant intervals, implement interval merging via explicit logic that does not use constant folding (see the sketch after this list).
    A quick test that removes just two of the calls to constant folding doubles the performance.
  3. Extend constant folding to not invoke reference tracking on constant trees. For this we need to know in advance whether a tree is constant; that information could be collected during reference tracking.
  4. Extend reference tracking to not descend into subtrees that contain no variables. For this we need to collect, store, and update per-node information in advance. One logical phase to do that is reference tracking (since we collect the info at least once anyway).
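
As a sketch of option 2 (hypothetical helper, not the server's actual interval machinery), constant closed intervals can be unioned by sorting on the low bound and merging overlaps with plain comparisons, with no ABT construction and no constant folding. For $in, every interval is a point [v, v], so this degenerates to sort-plus-dedup:

{code:cpp}
#include <algorithm>
#include <iostream>
#include <vector>

// Closed interval with constant bounds (guaranteed by the caller, just as
// unionDNFIntervals guarantees constant bounds today).
struct ConstInterval {
    long lo, hi;
};

// Unions constant intervals via sort + linear merge: O(n log n) comparisons,
// no expression trees, no constant folding.
std::vector<ConstInterval> unionConstIntervals(std::vector<ConstInterval> in) {
    if (in.empty())
        return in;
    std::sort(in.begin(), in.end(), [](const ConstInterval& a, const ConstInterval& b) {
        return a.lo < b.lo;
    });
    std::vector<ConstInterval> out{in.front()};
    for (size_t i = 1; i < in.size(); ++i) {
        if (in[i].lo <= out.back().hi)  // overlaps the previous interval
            out.back().hi = std::max(out.back().hi, in[i].hi);
        else
            out.push_back(in[i]);
    }
    return out;
}

int main() {
    // Point intervals as produced by {a: {$in: [5, 1, 3, 3]}}.
    auto merged = unionConstIntervals({{5, 5}, {1, 1}, {3, 3}, {3, 3}});
    for (const auto& iv : merged)
        std::cout << "[" << iv.lo << ", " << iv.hi << "] ";
    std::cout << "\n";  // prints: [1, 1] [3, 3] [5, 5]
}
{code}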