[SERVER-33162] Make the update fuzzer run faster Created: 07/Feb/18 Updated: 06/Dec/22 Resolved: 03/Dec/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | 3.6.2 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Robert Guo (Inactive) | Assignee: | Backlog - Server Tooling and Methods (STM) (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | stm |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Server Tooling & Methods |
| Participants: |
| Description |
|
The update fuzzer runs up to 15x slower on the 3.6 branch compared with master. The slowdown is especially pronounced on the OS X and Code Coverage variants (Enterprise OS X MMAP1, SSL OS X, RHEL 6.2 DEBUG Code Coverage), causing timeouts in the update_fuzzer_replication suite on those variants. The other variants suffer a 2x to 5x slowdown, but not enough to trigger a timeout. The slowdown is likely caused by the additional work of resetting the database and running more JavaScript in the shell to handle blacklisting. The 3.7 branch has not hit a similar problem so far, but it should see a similar slowdown as more known differences between 3.7 and 3.6 are discovered. It might be a good idea to proactively reduce the number of fuzzer-generated files and the size of each file to prevent sporadic failures in the future, especially for the update_fuzzer_replication suite. |
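One way to act on this proactively is to make the generated-file count an Evergreen expansion and shrink it on the slow build variants only. The sketch below is illustrative, not the actual etc/evergreen.yml: the expansion name num_generated_files, the default of 75 files, the "run jstestfuzz" function signature, and the variant names are all assumed.

```yaml
# Illustrative sketch only; names and values are assumed, not copied from
# the real etc/evergreen.yml.
tasks:
- name: update_fuzzer_replication
  commands:
  - func: "run jstestfuzz"        # assumed function name
    vars:
      # ${expansion|default}: use 75 files unless a build variant overrides it.
      num_generated_files: ${num_generated_files|75}

buildvariants:
- name: enterprise-osx            # placeholder variant names
  expansions:
    # Slower hosts generate fewer files so the task finishes before
    # its Evergreen exec timeout.
    num_generated_files: 15
  tasks:
  - name: update_fuzzer_replication
- name: linux-64-debug-code-coverage
  expansions:
    num_generated_files: 15
  tasks:
  - name: update_fuzzer_replication
```

Because per-variant expansions override the default, only the variants that are timing out need an entry; everything else keeps the full file count.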
| Comments |
| Comment by Brooke Miller [ 03/Dec/21 ] |
|
STM doesn't have bandwidth to pick this up, so we're closing it as Won't Fix. However, if other teams have capacity and would like to pursue this, please feel free to do so. |
| Comment by Max Hirschhorn [ 08/Feb/18 ] |
Sorry, I managed to type continue_on_failure=true when looking to see what defaults are set via the &jstestfuzz_config_vars anchor. What I really meant to say is "Should we start out with setting run_multiple_jobs=true for the update fuzzer because each of the tests starts its own MongoDB deployment?" Even when the Evergreen task times out, the code coverage generated thus far is still uploaded to S3. I don't think there's a lot of value in running with continue_on_failure=true: any failure in one of the generated files is a signal that something has gone wrong, and knowing that five more generated files would have failed as well doesn't add much signal. I think Evergreen's stepback should be "good enough" to help identify the problematic commit.
I was thinking of just having an expansion that defines a fixed number of generated files to create and using it in each of the task definitions. We could also define a ratio, but I'm not sure it is worth introducing complexity equivalent to how the number of resmoke.py jobs is determined by the "run tests" function. |
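To make the fixed-count idea concrete, the count could live in one anchored block and be merged into each update fuzzer task definition, mirroring the existing &jstestfuzz_config_vars pattern mentioned above. This is a minimal sketch under assumed names: the anchor name, the num_generated_files var, the value 50, and the task layout are placeholders rather than the real project config.

```yaml
variables:
# Placeholder anchor and value; the real defaults live in &jstestfuzz_config_vars.
- &update_fuzzer_config_vars
  num_generated_files: 50   # one fixed count shared by every update fuzzer task

tasks:
- name: update_fuzzer
  commands:
  - func: "run jstestfuzz"    # assumed function name
    vars:
      <<: *update_fuzzer_config_vars
- name: update_fuzzer_replication
  commands:
  - func: "run jstestfuzz"
    vars:
      <<: *update_fuzzer_config_vars
```

Compared with a ratio, a fixed count keeps the YAML declarative: nothing is computed at runtime, which is what makes the resmoke.py jobs calculation in the "run tests" function comparatively complex.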
| Comment by Robert Guo (Inactive) [ 08/Feb/18 ] |
|
max.hirschhorn I didn't think of setting continue_on_failure=true; is the goal to ensure we have some coverage data even if the task times out? Lowering the number of generated files for OS X and ARM sounds good, but we'd probably have to introduce a new "setMultiplier" on the fuzzer's side, since the number of generated files differs between suites. |
| Comment by Max Hirschhorn [ 07/Feb/18 ] |
|
robert.guo, should we start out with setting continue_on_failure=true for the update fuzzer because each of the tests starts its own MongoDB deployment? It won't address any issues with the OS X machines, but I've been debating whether we should lower the number of generated files we run on the OS X and ARM machines altogether (i.e., even for the jstestfuzz* Evergreen tasks). |