[SERVER-80581] Microbenchmarks - investigate regression in single top-level field query benchmark Created: 31/Aug/23 Updated: 19/Dec/23 Resolved: 19/Dec/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.3.0-rc0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Anton Korshunov | Assignee: | Henri Nikku |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | M5 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Assigned Teams: |
Query Optimization
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Sprint: | QO 2023-10-02, QO 2023-10-16, QO 2023-10-30, QO 2023-11-13, QO 2023-11-27, QO 2023-12-11, QO 2023-12-25 | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
Specific benchmarks to investigate: Q.NoMatch |
| Comments |
| Comment by Henri Nikku [ 22/Nov/23 ] | |||||||||||||||||||||||||||||||||||||||||||||
|
I looked into using query shape from query stats for checking whether a query is eligible for a fast path. For query predicates, this essentially means just normalizing the predicate (see here) using kRepresentativeQueryShapeSerializeOptions, and then hashing/comparing the normalized BSON. Using the full query shape would be unnecessary and impractical, as it would require us to take into account all possible values of all command options (e.g. oplogReplay, allowDiskUse) when defining the shapes that can take a fast path. Here is a quick and dirty implementation I used for comparing performance against the custom BSON comparator (all relevant changes are in cqf_fast_paths.cpp).
The custom BSON comparator (original.log) seems to be marginally faster (a few %) most of the time. I think there are two explanations for this:
Using the query shape also leads to more convoluted code when defining the fast paths, as we need define each fast path "shape" separately for every BSON type. Hence I think keeping the current approach (custom BSON comparator) is a better option for our current needs. | |||||||||||||||||||||||||||||||||||||||||||||
| Comment by Henri Nikku [ 27/Oct/23 ] | |||||||||||||||||||||||||||||||||||||||||||||
|
I added some benchmarks that exercise the shortcut. As I mentioned in the previous comment, the shortcut consistently beats classic the field doesn't exist. However when querying a field that exists, the shortcut loses on larger collections. The reason for this is that the shortcut currently uses 'cmp3w' for comparisons the same way bonsai does, whereas classic uses 'less', 'eq', 'isNaN' etc directly. So it seems like the shortcut should produce the same plans as classic. There is some code in 'stage_builder_filter' that I think could be reused for handling edge cases like NaN. However I recall anton.korshunov@mongodb.com mentioning that we might not want to use stage builder code for this. Did I remember correctly?
| |||||||||||||||||||||||||||||||||||||||||||||
| Comment by Henri Nikku [ 23/Oct/23 ] | |||||||||||||||||||||||||||||||||||||||||||||
|
I got rid of the transports as discussed during standup. Also changed the logic so that we exit as early as possible in | |||||||||||||||||||||||||||||||||||||||||||||
| Comment by Henri Nikku [ 18/Oct/23 ] | |||||||||||||||||||||||||||||||||||||||||||||
|
Draft of the SBE construction ticket: https://github.com/10gen/mongo/pull/16317
I ran the benchmark with a hard coded SBE, and got around 3500 qps vs 3700 qps on classic. However I constructed the SBE here, which means that some unnecessary work was done prior to SBE construction (e.g. building VariableEnvironment, I think). I'll try to run another benchmark tomorrow with an earlier exit and report the results here.
Edit: we get around 3650 qps by constructing the SBE plan before building VariableEnvironment.
| |||||||||||||||||||||||||||||||||||||||||||||
| Comment by Henri Nikku [ 10/Oct/23 ] | |||||||||||||||||||||||||||||||||||||||||||||
|
The "time taken by bonsai before lowering" measurement also included the preparations done before optimizing, for example populating metadata etc. Below is a more detailed rundown of the query processing time for bonsai without substitution phase. Note that there is some overhead (around 20-40 μs) from doing more printing than in the measurements above.
| |||||||||||||||||||||||||||||||||||||||||||||
| Comment by Henri Nikku [ 10/Oct/23 ] | |||||||||||||||||||||||||||||||||||||||||||||
|
It seems that for queries with just one predicate, it's safe to skip the substitution phase altogether.
Below is a comparison query processing and optimization time for classic and bonsai:
While skipping the substitution phase definitely helps, it doesn't get us even close to the performance of the classic optimizer. Even without all memo rewrite phases we are still not quite there. Therefore I think we need a shortcut for these kinds of queries. I think we could either
| |||||||||||||||||||||||||||||||||||||||||||||
| Comment by Henri Nikku [ 05/Oct/23 ] | |||||||||||||||||||||||||||||||||||||||||||||
|
https://github.com/10gen/mongo/compare/SERVER-80581-scratch The above branch implements logic that skips the FilterNode->SargableNode conversion if the FilterNode contains only one predicate. Note that the patch is probably insufficient but works for NoMatch and similar simple queries. I ran the benchmark a few times locally and observed a nice perf improvement:
| |||||||||||||||||||||||||||||||||||||||||||||
| Comment by Henri Nikku [ 03/Oct/23 ] | |||||||||||||||||||||||||||||||||||||||||||||
|
Tried to measure the effects of David Percy's suggestion (see below) by commenting out all logical rewrites that are not applicable to the query.
The change did not yield notable improvements. The overhead from checking if rewrites can be applied seems very small at around 1-2% of the total query processing time. I think that for the purposes of improving this particular regression, this approach is not worth investigating further. As of now, the two biggest bottlenecks seem to be:
| |||||||||||||||||||||||||||||||||||||||||||||
| Comment by Henri Nikku [ 26/Sep/23 ] | |||||||||||||||||||||||||||||||||||||||||||||
|
Bonsai and classic produce different but (at least to my eye) similar plans:
No significant difference in execution time. Optimization is considerably faster in classic, taking ~20% of total query processing time vs ~50-55% in bonsai. In bonsai, most of the optimization time is spent executing runMemoRewritePhases. The substitution phase seems like the best choice for seeking performance improvements:
|