[SERVER-80581] Microbenchmarks - investigate regression in single top-level field query benchmark Created: 31/Aug/23  Updated: 19/Dec/23  Resolved: 19/Dec/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.3.0-rc0

Type: Task Priority: Major - P3
Reporter: Anton Korshunov Assignee: Henri Nikku
Resolution: Fixed Votes: 0
Labels: M5
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File Queries_NoMatch_heuristic_1.svg    
Issue Links:
Depends
depends on SERVER-81973 Implement generic logic to configure ... Closed
depends on SERVER-82171 Translate queries with single predica... Closed
Assigned Teams:
Query Optimization
Backwards Compatibility: Fully Compatible
Sprint: QO 2023-10-02, QO 2023-10-16, QO 2023-10-30, QO 2023-11-13, QO 2023-11-27, QO 2023-12-11, QO 2023-12-25
Participants:

 Description   

Specific benchmarks to investigate:

Q.NoMatch



 Comments   
Comment by Henri Nikku [ 22/Nov/23 ]

I looked into using query shape from query stats for checking whether a query is eligible for a fast path. For query predicates, this essentially means just normalizing the predicate (see here) using kRepresentativeQueryShapeSerializeOptions, and then hashing/comparing the normalized BSON. Using the full query shape would be unnecessary and impractical, as it would require us to take into account all possible values of all command options (e.g. oplogReplay, allowDiskUse) when defining the shapes that can take a fast path. 

Here is a quick and dirty implementation I used for comparing performance against the custom BSON comparator (all relevant changes are in cqf_fast_paths.cpp).
https://github.com/10gen/mongo/compare/SERVER-83215...fast-path-query-shape
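
As a rough illustration of the lookup described above (all names below are hypothetical stand-ins, not the actual cqf_fast_paths.cpp implementation), eligibility checking amounts to normalizing the predicate into a shape key and probing a registry:

```cpp
#include <string>
#include <unordered_map>

// Hypothetical sketch of shape-keyed fast-path lookup; none of these names
// come from the server codebase. The prototype normalizes the predicate BSON
// with kRepresentativeQueryShapeSerializeOptions and hashes the result; here
// a plain string stands in for the normalized shape.
using FastPathFn = const char* (*)();

struct FastPathRegistry {
    // Key: normalized predicate shape, e.g. "{a: {$eq: <int>}}".
    std::unordered_map<std::string, FastPathFn> shapes;

    // Returns the registered fast path for this shape, or nullptr if the
    // query is not eligible.
    FastPathFn lookup(const std::string& normalizedShape) const {
        auto it = shapes.find(normalizedShape);
        return it == shapes.end() ? nullptr : it->second;
    }
};
```

Note that the normalization step itself is the extra cost discussed below: it has to run before the map probe, even for queries that turn out to be ineligible.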

Queries.NoMatch:
  - original.log: 4720.00584674564
  - queryshape.log: 4649.5737870539615
Queries.HalfMatchEqInt100:
  - original.log: 4431.509126787467
  - queryshape.log: 4284.9385786790535
Queries.HalfMatchLtInt100:
  - original.log: 4370.701228066477
  - queryshape.log: 4247.263332343162
Queries.HalfMatchEqNaN100:
  - original.log: 4424.0532583151335
  - queryshape.log: 4323.825544992392
Queries.HalfMatchLteNaN100:
  - original.log: 4412.137001112542
  - queryshape.log: 4317.505076215942

The custom BSON comparator (original.log) seems to be marginally faster (a few %) most of the time. I think there are two explanations for this:

  • For query shape, we need to construct a normalized version of the predicate before doing a lookup on the map. This probably doesn't matter much for small queries (like the ones above), but it is extra work that would have to be done even for queries with large predicates that are obviously not eligible for a fast path.
  • The size of the fast path map is larger when using query shape, because a separate key is needed for each BSON type and each fast path definition.

Using the query shape also leads to more convoluted code when defining the fast paths, as we would need to define each fast path "shape" separately for every BSON type. Hence I think keeping the current approach (custom BSON comparator) is a better option for our current needs.

Comment by Henri Nikku [ 27/Oct/23 ]

I added some benchmarks that exercise the shortcut. As I mentioned in the previous comment, the shortcut consistently beats classic when the queried field doesn't exist. However, when querying a field that exists, the shortcut loses on larger collections. The reason for this is that the shortcut currently uses 'cmp3w' for comparisons the same way bonsai does, whereas classic uses 'less', 'eq', 'isNaN' etc. directly.

So it seems like the shortcut should produce the same plans as classic. There is some code in 'stage_builder_filter' that I think could be reused for handling edge cases like NaN. However I recall anton.korshunov@mongodb.com mentioning that we might not want to use stage builder code for this. Did I remember correctly?
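
To make the distinction concrete, here is a simplified, non-server sketch of the two comparison styles (plain doubles instead of BSON values; function names are illustrative):

```cpp
#include <cmath>

// Simplified sketch, not server code: the shortcut currently lowers $eq to a
// three-way comparison (cmp3w-style) whose result is then tested against
// zero, while classic's direct primitives skip the intermediate ordering.
inline int cmp3w(double a, double b) {
    if (a < b) return -1;
    if (a > b) return 1;
    if (a == b) return 0;
    return 2;  // NaN involved: unordered
}

inline bool eqViaCmp3w(double a, double b) {
    return cmp3w(a, b) == 0;  // full ordering computed, then sign-tested
}

inline bool eqDirect(double a, double b) {
    return a == b;  // single direct 'eq' primitive
}
```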

Queries.NoMatch:
  - mongod-bonsai.log: 3692.2471275808384
  - mongod-classic.log: 3527.7281434182614
Queries.NoMatchLarge10K:
  - mongod-bonsai.log: 81.30758656141698
  - mongod-classic.log: 77.60521454496732
Queries.NoMatchLarge100K:
  - mongod-bonsai.log: 6.366124453680356
  - mongod-classic.log: 5.964700953806083
Queries.HalfMatchEqInt100:
  - mongod-bonsai.log: 3471.059305820819
  - mongod-classic.log: 3336.8729965522048
Queries.HalfMatchEqInt10K:
  - mongod-bonsai.log: 172.16837272238848
  - mongod-classic.log: 176.06752888518074
Queries.HalfMatchEqInt100K:
  - mongod-bonsai.log: 17.666194392924776
  - mongod-classic.log: 18.196965464097122
Queries.HalfMatchLtInt100:
  - mongod-bonsai.log: 3342.765869611925
  - mongod-classic.log: 3302.271984125427
Queries.HalfMatchLtInt10K:
  - mongod-bonsai.log: 161.1216173871396
  - mongod-classic.log: 176.43353479226727
Queries.HalfMatchLtInt100K:
  - mongod-bonsai.log: 16.380288616152274
  - mongod-classic.log: 18.105359828376088
Queries.HalfMatchEqNaN100:
  - mongod-bonsai.log: 3420.8665560323907
  - mongod-classic.log: 3378.521845289624
Queries.HalfMatchEqNaN10K:
  - mongod-bonsai.log: 170.94780323773955
  - mongod-classic.log: 182.24889442972807
Queries.HalfMatchEqNaN100K:
  - mongod-bonsai.log: 17.13499749363818
  - mongod-classic.log: 18.473821863730596
Queries.HalfMatchLteNaN100:
  - mongod-bonsai.log: 3305.217379393508
  - mongod-classic.log: 3439.8268181196813
Queries.HalfMatchLteNaN10K:
  - mongod-bonsai.log: 160.33506845149228
  - mongod-classic.log: 182.16747950034255
Queries.HalfMatchLteNaN100K:
  - mongod-bonsai.log: 16.131093428165098
  - mongod-classic.log: 18.399805155069274

Comment by Henri Nikku [ 23/Oct/23 ]

I got rid of the transports as discussed during standup. I also changed the logic so that we exit as early as possible in getSBEExecutorViaCascadesOptimizer. These changes got us to around 3800-3850 qps, which beats the classic optimizer (~3700 qps). So it looks like we can fix the regression with this approach.
 
I will still have to (at least) add support for type bracketed $lt and $gt, and add/run more benchmarks (e.g. queries that match multiple objects in the collection).
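
For reference, "type bracketing" here means that a comparison like {$lt: 5} only considers values in the same canonical type class as the query constant. A minimal sketch (types and names are illustrative, not the server's):

```cpp
enum class TypeClass { Number, String, Other };

struct Value {
    TypeClass type;
    double num;  // meaningful only when type == Number
};

// Type-bracketed $lt against a numeric bound: values outside the numeric
// type class never match, regardless of their position in the total BSON
// order. Illustrative only, not the server's implementation.
inline bool typeBracketedLt(const Value& v, double bound) {
    if (v.type != TypeClass::Number)
        return false;
    return v.num < bound;
}
```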

Comment by Henri Nikku [ 18/Oct/23 ]

Draft of the SBE construction ticket:

https://github.com/10gen/mongo/pull/16317

 

I ran the benchmark with a hard-coded SBE plan, and got around 3500 qps vs 3700 qps on classic. However I constructed the SBE plan here, which means that some unnecessary work was done prior to SBE construction (e.g. building VariableEnvironment, I think). I'll try to run another benchmark tomorrow with an earlier exit and report the results here.

 

Edit: we get around 3650 qps by constructing the SBE plan before building VariableEnvironment.

 

Comment by Henri Nikku [ 10/Oct/23 ]

The "time taken by bonsai before lowering" measurement also included the preparations done before optimizing, for example populating metadata. Below is a more detailed rundown of the query processing time for bonsai without the substitution phase. Note that there is some overhead (around 20-40 μs) from doing more printing than in the measurements above.

 

Operation                                               Time (μs)
getSBEExecutorViaCascadesOptimizer before optimization       74
env::build in optimizeNoAssert                               16
ConstEvalPre, PathFuse                                       16
runMemoRewritePhases                                        135
PathLower                                                    22
ConstEvalPost                                                37
optimizeNoAssert                                            269
getSBEExecutorViaCascadesOptimizer after optimization        63
find execution plan (in total)                              444
Total query processing time                                 636

 

Comment by Henri Nikku [ 10/Oct/23 ]

It seems that for queries with just one predicate, it's safe to skip the substitution phase altogether.

  • Classic: 3612 ops per sec
  • Bonsai master: 1784 ops per sec
  • Bonsai without FilterNode->Sargable conversion: 2046 ops per sec
  • Bonsai without substitution phase: 2208 ops per sec
  • Skipping const eval and path fusion results in at most 1-2% improvement.

Below is a comparison of query processing and optimization times for classic and bonsai:

  • Classic:
    Time taken to find execution plan in total: 132 μs
    Total query processing time: 321 μs
  • Bonsai master:
    Time taken by bonsai before lowering: 442 μs
    Time taken by bonsai total: 530 μs
    Time taken to find execution plan in total: 547 μs
    Total query processing time: 736 μs
  • Bonsai without substitution phase:
    Time taken by exploration phase + physical rewrites: 128 μs
    Time taken by bonsai before lowering: 323 μs
    Time taken by bonsai total: 395 μs
    Time taken to find execution plan in total: 412 μs
    Total query processing time: 595 μs

While skipping the substitution phase definitely helps, it doesn't get us even close to the performance of the classic optimizer. Even without all memo rewrite phases we are still not quite there. 

Therefore I think we need a shortcut for these kinds of queries. I think we could either

  • Implement a phase to transform the initial ABT to a physical (ABT) plan without performing cascades-style optimizations.
    • If we detect a trivial query, we could configure OptPhaseManager to run just this phase.
    • Or we could include a phase like this in the default phases to run and exit early from the optimization if that phase found a trivial plan for a trivial query. This way we would still always run ConstEval and PathFuse phases that simplify the query before deciding how to optimize it.
  • Do something similar to the above, but produce the SBE plan directly. This could be faster but I'm not sure if this is reasonable to do. My understanding of SBE is limited, but I suspect that even for queries with just one predicate, this might get overly complicated.
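
The first option could be sketched roughly as follows (all names are hypothetical; the real eligibility check would inspect the ABT rather than a struct like this):

```cpp
#include <string>

// Hypothetical sketch of the "shortcut phase" idea above, not server code.
struct Query {
    int numPredicates;
    bool hasProjection;
};

struct Plan {
    std::string desc;
};

// Cheap structural eligibility check for the shortcut.
inline bool isTrivial(const Query& q) {
    return q.numPredicates == 1 && !q.hasProjection;
}

// Direct logical-to-physical translation, no cascades rewrites (stand-in).
inline Plan translateDirectly(const Query&) {
    return {"collscan+filter"};
}

// Full optimizer pipeline (stand-in).
inline Plan runCascades(const Query&) {
    return {"cascades-optimized"};
}

inline Plan optimize(const Query& q) {
    if (isTrivial(q))
        return translateDirectly(q);  // skip memo rewrites entirely
    return runCascades(q);
}
```
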
Comment by Henri Nikku [ 05/Oct/23 ]

https://github.com/10gen/mongo/compare/SERVER-80581-scratch

The above branch implements logic that skips the FilterNode->SargableNode conversion if the FilterNode contains only one predicate. Note that the patch is probably insufficient but works for NoMatch and similar simple queries. I ran the benchmark a few times locally and observed a nice perf improvement:

  • Bonsai master: 1784 ops per sec
  • Bonsai with changes: 2046 ops per sec
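
A minimal stand-in for the gating check in that branch (hypothetical types; the real patch inspects the FilterNode's ABT):

```cpp
#include <string>
#include <vector>

// Illustrative stand-in for the scratch branch's check, not the actual patch:
// the filter's conjuncts are counted, and the FilterNode->SargableNode
// substitution only runs when there is more than one predicate.
struct Predicate {
    std::string path;
    std::string op;  // e.g. "$eq"
};

inline bool shouldConvertToSargable(const std::vector<Predicate>& conjuncts) {
    return conjuncts.size() > 1;
}
```
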
Comment by Henri Nikku [ 03/Oct/23 ]

I tried to measure the effects of David Percy's suggestion (see below) by commenting out all logical rewrites that are not applicable to the query.

I think one area of waste is how we enqueue and attempt every rewrite whenever a group changes. Instead we could index rewrites by the node type they apply to.

The change did not yield notable improvements. The overhead from checking if rewrites can be applied seems very small at around 1-2% of the total query processing time. I think that for the purposes of improving this particular regression, this approach is not worth investigating further.
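
For context, the suggestion amounts to something like the following (node types and rewrite names other than FilterSubstitute are hypothetical; as measured above, the enqueue-everything overhead this would remove was only 1-2%):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of indexing rewrites by the node type they apply to,
// so a group change only enqueues applicable rewrites. Names illustrative.
enum class NodeType { Filter, Scan, Evaluation };

using RewriteName = std::string;

// Hash support for the enum key (portable across standard-library versions).
struct NodeTypeHash {
    std::size_t operator()(NodeType t) const {
        return static_cast<std::size_t>(t);
    }
};

using RewriteIndex =
    std::unordered_map<NodeType, std::vector<RewriteName>, NodeTypeHash>;

// When a group containing nodes of type `changed` is updated, only these
// rewrites would be enqueued, instead of attempting every rewrite.
inline const std::vector<RewriteName>& applicableRewrites(
    const RewriteIndex& index, NodeType changed) {
    static const std::vector<RewriteName> kNone;
    auto it = index.find(changed);
    return it == index.end() ? kNone : it->second;
}
```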

As of now, the two biggest bottlenecks seem to be:

  • Memo::integrate (called when we add plans to the memo) seems like the single biggest bottleneck (10-12%).
    • The memo integration logic seems complex out of necessity.
    • It is not called in many places; I think the only way to reduce calls to this is to find more ways to exit early in rewrites (see below).
  • FilterSubstitute rewrite (6-8%)
    • Most of the time is spent converting FilterNode to SargableNode (here).
    • Somehow exiting early here would help with both bottlenecks.
    • I think there might not be a sensible way to exit early here (unless we have information about the schema). This seems to be a very crucial rewrite that we want to do for queries of this shape, but it just doesn't happen to help us with this query.
Comment by Henri Nikku [ 26/Sep/23 ]

Bonsai and classic produce different but (at least to my eye) similar plans:

  • bonsai:

     "slotBasedPlan" : {
            "slots" : "$$RESULT=s1 env: {  }",
            "stages" : "[1] filter {(traverseF(s2, lambda(l101.0) { ((move(l101.0) <=> 5L) == 0ll) }, false) ?: false)} \n[0] scan s1 none none none none none none none lowPriority [s2 = nonexistent] @\"6fb7540e-132b-4654-b996-bc278e01922c\" true false "
        }
    

  • classic:

    "slotBasedPlan" : {
            "slots" : "$$RESULT=s2 env: {  }",
            "stages" : "[1] filter {traverseF(s1, lambda(l101.0) { ((move(l101.0) == 5L) ?: false) }, false)} \n[1] scan s2 s3 none none none none none none lowPriority [s1 = nonexistent] @\"6fb7540e-132b-4654-b996-bc278e01922c\" true false "
        }
    

No significant difference in execution time. Optimization is considerably faster in classic, taking ~20% of total query processing time vs ~50-55% in bonsai.

In bonsai, most of the optimization time is spent executing runMemoRewritePhases. The substitution phase seems like the best choice for seeking performance improvements:

runMemoRewritePhases 31.06%
	runMemoLogicalRewrite 22.22%
		substitution 15.97%
			addRootNode ~5% (~10% total)
			rewriteToFixPoint 9.19%
		exploration 6.25%
			addRootNode ~5% (~10% total)
	runMemoPhysicalRewrite 8.68%

Generated at Thu Feb 08 06:43:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.