[SERVER-64185] Investigate performance regression of $lookup and $graphLookup in genny workloads Created: 03/Mar/22  Updated: 27/Oct/23  Resolved: 15/Sep/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Hana Pearlman Assignee: David Storch
Resolution: Gone away Votes: 0
Labels: pm2697-m3, sbe
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2022-09-15 at 6.39.25 PM.png    
Issue Links:
Depends
Sprint: QE 2022-09-19
Participants:
Linked BF Score: 27

 Description   

There is a 15-35% regression in $lookup and $graphLookup genny workloads running with unsharded collections that was first seen between v5.0 and v5.1. The workloads in question are here: $lookup and $graphLookup. The regression can be seen in the linked BF or by looking at the sys-perf waterfall for v5.0 and v5.1 (select average latency for "RunGraphLookups.GraphLookupUnshardedToUnshardedOneToMany", for example).

Some ideas have been proposed as to why the regression occurred and what can be done to address it. For example, it may have something to do with slow collection scans (the workloads in question use small collections). It may be that the plan cache project, particularly SERVER-61421, could improve the performance of these workloads, since the subpipelines run by these $lookups and $graphLookups are all simple match queries with the same shape.
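
To illustrate what "same shape" means for the subpipelines, here is a minimal Python sketch. It assumes a simplified notion of query shape (field names and operators kept, constant values replaced by a placeholder); this is not the server's actual plan cache key computation. The `joinKey` field and both helper functions are hypothetical and not taken from the genny workloads.

```python
from typing import Any, Dict, List


def query_shape(value: Any) -> Any:
    """Replace every leaf value with "?" while keeping field names and operators.

    Simplified illustration only; the server's real shape/plan cache key
    logic is more involved.
    """
    if isinstance(value, dict):
        return {key: query_shape(sub) for key, sub in value.items()}
    if isinstance(value, list):
        return [query_shape(item) for item in value]
    return "?"


def lookup_subpipeline(local_value: Any) -> List[Dict[str, Any]]:
    """Build the kind of simple match subpipeline a $lookup might run per outer document.

    "joinKey" is a hypothetical field name.
    """
    return [{"$match": {"joinKey": {"$eq": local_value}}}]


# Two outer documents produce subpipelines that differ only in the constant.
first = lookup_subpipeline(1)
second = lookup_subpipeline(42)

# Under this simplified definition, both subpipelines share one shape,
# so a cached plan for the first could in principle be reused for the second.
same_shape = query_shape(first) == query_shape(second)
```

Because only the constant differs between executions, a plan cached for one subpipeline execution would be a candidate for reuse by the next, which is why the plan cache work is mentioned as a possible improvement.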

A more detailed write-up can be found in the comments.



 Comments   
Comment by David Storch [ 13/Sep/22 ]

It looks like the reason for the AutoRun problem I mentioned above may be that schedule_patch_auto_tasks and schedule_variant_auto_tasks are not specified for the "linux-1-node-replSet-classic-query-engine" build variant. I'm testing that the following patch, when combined with the in-progress changes from SERVER-69650, has the desired effect:

commit 471088f4c982a5285b844ace276182092462596e (HEAD -> SERVER-69650)
Author: David Storch <david.storch@mongodb.com>
Date:   Tue Sep 13 17:42:54 2022 -0400
 
    SERVER-69650 fix sys-perf single replica classic/SBE variants to work with AutoRun
 
diff --git a/etc/system_perf.yml b/etc/system_perf.yml
index 972d118bc20..ee5a1ed7a5d 100755
--- a/etc/system_perf.yml
+++ b/etc/system_perf.yml
@@ -1288,6 +1288,8 @@ buildvariants:
       - "rhel70-perf-single"
     depends_on: *_compile_amazon2
     tasks: &classic_engine_1nodereplset_tasks
+      - name: schedule_patch_auto_tasks
+      - name: schedule_variant_auto_tasks
       - name: linkbench
       - name: linkbench2

Comment by David Storch [ 13/Sep/22 ]

I took a brief look using the Evergreen UI at a recent run of these benchmarks in master compared to a recent run in 5.0. In both cases I used the "release configuration" which means that we should be using the classic engine. This assumes that the $lookup queries in these benchmarks are not eligible for SBE pushdown, which indeed appears to be the case because they specify the pipeline option. Also note that using SBE on the inner side of a $lookup or $graphLookup is no longer permitted in the unsharded case due to the changes in SERVER-69103. As expected, 5.0 and 6.0 now exhibit similar performance, since we are using the classic engine in both versions.
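
For reference, the distinction between the two $lookup forms can be sketched in Python as below. This only illustrates the observation above that the benchmark queries specify the pipeline option; the helper `uses_pipeline_form` is hypothetical and is not the server's eligibility check, and the collection and field names are made up.

```python
from typing import Any, Dict


def uses_pipeline_form(stage: Dict[str, Any]) -> bool:
    """Return True if the stage is a $lookup that specifies the "pipeline" option.

    Hypothetical helper for illustration; the real pushdown eligibility
    rules live in the server and consider more than this one field.
    """
    spec = stage.get("$lookup")
    return isinstance(spec, dict) and "pipeline" in spec


# $lookup using localField/foreignField (names are made up).
basic_lookup = {
    "$lookup": {
        "from": "foreign",
        "localField": "a",
        "foreignField": "b",
        "as": "matches",
    }
}

# $lookup using the pipeline option, as the benchmark queries do.
pipeline_lookup = {
    "$lookup": {
        "from": "foreign",
        "let": {"a": "$a"},
        "pipeline": [{"$match": {"$expr": {"$eq": ["$b", "$$a"]}}}],
        "as": "matches",
    }
}

basic_result = uses_pipeline_form(basic_lookup)
pipeline_result = uses_pipeline_form(pipeline_lookup)
```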

As a final step, I think we should verify that these benchmarks do not regress when featureFlagSbeFull is enabled. We have a system which automatically generates an SBE vs. classic performance comparison every Thursday, but unfortunately the data for these benchmarks appears to be missing from the data set. It looks like this is because the benchmarks are not currently running in either the "all feature flags" or "classic engine" build variants. I'll have to look into why the AutoRun configuration for the workload is not behaving as I would expect it to.
