[SERVER-51003] Some replication unittests take much longer to run than others Created: 17/Sep/20 Updated: 29/Oct/23 Resolved: 08/Oct/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Build, Replication, Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 4.9.0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Kevin Pulo | Assignee: | Kevin Pulo |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Sharding 2020-10-19 |
| Participants: |
| Description |
|
I noticed this when running the full unittest suite locally, but it should also benefit the unittests in the commit queue and patch builds. The db_repl_test binary, and to a lesser extent, util_test, take much longer to run than the rest. When running a debug build locally on 40 cores, I still normally spend several minutes at the end waiting for just db_repl_test. Using timings from a recent commit queue patch shows:
Looking into the suites shows just a few that run much longer than the others:
Looking more closely at RandomizedIdempotencyTest shows that CheckUpdateSequencesAreIdempotent takes ~3 secs, and CheckUpdateSequencesAreIdempotentV2 takes the remaining ~26 secs:
So it might be good to split at least this test (CheckUpdateSequencesAreIdempotentV2) or suite (RandomizedIdempotencyTest), and maybe some of the others, into separate binaries, so that they can better parallelize onto multiple cores. (Sending this to SDP, since the ticket is primarily about improving overall (parallel) unittest runtime by avoiding individual long-running tests/suites/binaries. But it could of course be redirected to Replication to fix these particular ones.) |
| Comments |
| Comment by Githook User [ 08/Oct/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {'name': 'Kevin Pulo', 'email': 'kevin.pulo@mongodb.com', 'username': 'devkev'}Message: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Tess Avitabile (Inactive) [ 07/Oct/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks for doing that investigation, kevin.pulo! I no longer have concerns about this change. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kevin Pulo [ 07/Oct/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The push to merge unittests into fewer binaries is less relevant these days, see this comment from Drew. On my workstation it adds < 1 sec at link time, and ~ 3 secs at dynamic link time (ie. runtime):
So it's likely still not feasible to completely split up the unittests, but splitting off a single long-running suite has clear benefits (see below) for a very small cost. I used this POC of splitting idempotency_test.cpp into its own binary (db_repl_idempotency_test):
With this, the previous results from a release build on my workstation become: Top binary durations in seconds (from -j1):
Runtimes in seconds:
So it gets my runs down from 4+ mins to ~3 mins. There are some other long-ish running unittest binaries that might be good to individually target for splitting up, but none of them have a single long-running suite like RandomizedIdempotencyTest, which is the lowest-hanging fruit here. And while it'd be great if unittests had their own -j flag to run suites in parallel (to parallelise suites without paying the dynamic linking penalty on each one, as would happen if each suite was split into its own binary, or if resmoke tried to run subsets of the suites in parallel using --suite or --filter), that's a more general solution that I don't think is worth the effort, so this ticket isn't asking for that. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Cristopher Stauffer [ 06/Oct/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Looping in robert.guo as well to get his opinion. I'm all for reducing makespan but the concern is the gains will be lost in longer build times. Is it simple enough to put a patch up and SDP can review the stats and decide then? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kaloian Manassiev [ 30/Sep/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I also want to add to Kev's comments above that db_repl_test holds up sanity tests I run on my local machine, to ensure that I at least didn't break the unit-tests before submitting an Evergreen patch. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kevin Pulo [ 30/Sep/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I use debug builds during development because I've been burned by dassert in the past. That's a separate issue, but I don't understand why I would want to use a release build during development — I'm going to debugging routinely so surely a debug build would be more useful? Also I do often use filter, but there are times when running the whole suite is more useful/appropriate (eg. when making invasive changes to low-level code). Anyway, the debug/release build aspect is irrelevant, because it affects the runtime of all tests equally (more or less). The point is that db_repl_test (and specifically CheckUpdateSequencesAreIdempotentV2) take much longer than other tests, and this affects the ability to usefully parallelise the unittests. The original commit queue tasks I linked to are release builds, and I re-ran with a release build to properly confirm that it's equally affected: Top 6 binary durations in seconds (from -j1):
Runtimes in seconds:
If RandomizedIdempotencyTest is linked into a separate binary of its own, then it could spend its 2 minutes running alongside the rest of db_repl_test. Since this is a randomised test I imagine developers won't miss it if they're locally running db_repl_test (and would probably appreciate its runtime being halved), and it'll still be included in the full suite, and still accessible via ninja +randomized_idempotency_test. (It's currently inside idempotency_test.cpp, which contains a non-randomised test UpdateTwoFields that I expect is fast and has actual benefit staying in db_repl_test, so I'm assuming the RandomizedIdemopotencyTest tests should be split out into their own randomized_idempotency_test.cpp). | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Tess Avitabile (Inactive) [ 28/Sep/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
kevin.pulo, as Siyuan commented above, we recommend using a release build for testing, or using the filter argument with debug builds when debugging a particular unit test. We don't consider the performance of unit tests with debug builds to be a priority. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Siyuan Zhou [ 23/Sep/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The initial idea of merging unit tests to a large one is to speed the compile time. I'm curious to hear acm's opinion on this. On a side note, kevin.pulo, I usually compile with debug off since debug build could be much slower in both unit tests and JS tests. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 23/Sep/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Speeding up the OplogFetcher based tests feels especially worthwhile since we're using them for tenant migrations (CC siyuan.zhou). Splitting the repl tests up into multiple binaries seems like a small lift and a worthwhile win. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Brooke Miller [ 22/Sep/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
In triaging, we discussed that there may be an opportunity to split up db_repl_test. Robert mentioned that since it's the only test that runs in its job, there's nothing resmoke can do to cut down the runtime further. As a result, we're sending over to Replication for additional investigations or suggestions. Feel free to send it back to STM if there's a systematic improvement you think we can do to the testing framework. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kevin Pulo [ 18/Sep/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Sorry, I forgot to mention. resmoke is already parallelising the execution of the unittest binaries, over 16 cores on Evergreen. The problem is that if one binary takes much longer than the others, then towards the end it's the only thing left running, and the other cores are mostly idle. The overall unittests task took ~8 minutes, of which the first ~6 mins was compiling. The resmoke run took 128 secs, which means the 108 secs for db_repl_test is a pretty large fraction of that walltime (85%), ie. the overall resmoke runtime is mostly being dominated by the single db_repl_unit binary. If it was instead broken into, say, 3 pieces of ~30 secs each, that would help to reduce the overall resmoke (wallclock) runtime. It could be possible for resmoke to do that sort of splitting by running the unittest binaries with - Again, this isn't so bad on Evergreen since the overall execution is only 2 minutes. But on my workstation (with a debug build) the full runtime of up to 9 minutes is completely dominated by db_repl_test with 98% of walltime (the next nearest being 4 binaries around 50-40% of walltime each):
This means even though the machine has 40 cores, the runtime doesn't decrease when I increase beyond about -j8 or 16. In this case, splitting up db_repl_test would definitely have a direct benefit to the runtime (eg. the above run could be roughly halved, down to as low as 4.5 mins (277 secs)). This would help a lot when needing to run the full unittest suite during iterative development. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by April Schoffer [ 17/Sep/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Sending this over to STM to review whether there's parallelization that can be done in resmoke.py |