[SERVER-51003] Some replication unittests take much longer to run than others Created: 17/Sep/20  Updated: 29/Oct/23  Resolved: 08/Oct/20

Status: Closed
Project: Core Server
Component/s: Build, Replication, Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Improvement Priority: Major - P3
Reporter: Kevin Pulo Assignee: Kevin Pulo
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Sprint: Sharding 2020-10-19
Participants:

 Description   

I noticed this when running the full unittest suite locally, but it should also benefit the unittests in the commit queue and patch builds. The db_repl_test binary, and to a lesser extent, util_test, take much longer to run than the rest. When running a debug build locally on 40 cores, I still normally spend several minutes at the end waiting for just db_repl_test.

Using timings from a recent commit queue patch shows:

secs binary
108.61 db_repl_test
48.16 util_test
33.11 db_catalog_test
31.68 db_unittests
31.20 db_storage_test
20.94 storage_ephemeral_for_test_test
19.99 db_repl_cloners_test
... ...

Looking into the suites shows just a few that run much longer than the others:

millis binary and suite num tests
29368 db_repl_test RandomizedIdempotencyTest 2
19508 db_repl_test TenantOplogApplierTest 19
15077 db_repl_test OplogFetcherTest 73
11347 db_repl_test RSRollbackTest 62
8746 util_test Future 71
8277 util_test Future_Void 61
7635 util_test Future_MoveOnly 60
5012 util_test FailPointStress 1
4648 db_repl_test IdempotencyTestTxns 20
4600 db_repl_test IdempotencyTest 21
4254 db_repl_test InitialSyncerTest 87
3125 util_test InvariantTerminationTest 13
2880 util_test Future_EdgeCases 10
2418 util_test RegistryList 2
2256 util_test SharedFuture 16
2159 db_repl_test PrimaryOnlyServiceTest 13
2072 db_repl_test OplogBufferCollectionTest 41
2043 db_repl_test ReplicationRecoveryTest 57
1577 db_repl_test RollbackImplTest 42
1406 db_repl_test OplogApplierImplTest 35
1011 db_repl_test TenantOplogBatcherTest 11
1000 util_test BackgroundJobBasic 3

Looking more closely at RandomizedIdempotencyTest shows that CheckUpdateSequencesAreIdempotent takes ~3 secs, and CheckUpdateSequencesAreIdempotentV2 takes the remaining ~26 secs:

2020-09-16T12:52:35.435Z I  TEST     23063   [main] "Running","attr":{"suite":"RandomizedIdempotencyTest"}
2020-09-16T12:52:35.435Z I  TEST     23059   [main] "Running","attr":{"test":"CheckUpdateSequencesAreIdempotent","rep":1,"reps":1}
...
2020-09-16T12:52:38.925Z I  TEST     23059   [main] "Running","attr":{"test":"CheckUpdateSequencesAreIdempotentV2","rep":1,"reps":1}
...
2020-09-16T12:53:04.803Z I  TEST     23060   [main] "Done running tests"

So it might be good to split at least this test (CheckUpdateSequencesAreIdempotentV2) or suite (RandomizedIdempotencyTest), and maybe some of the others, into separate binaries, so that they can better parallelize onto multiple cores.

(Sending this to SDP, since the ticket is primarily about improving overall (parallel) unittest runtime by avoiding individual long-running tests/suites/binaries. But it could of course be redirected to Replication to fix these particular ones.)



 Comments   
Comment by Githook User [ 08/Oct/20 ]

Author:

{'name': 'Kevin Pulo', 'email': 'kevin.pulo@mongodb.com', 'username': 'devkev'}

Message: SERVER-51003 split long-running randomized repl unittests into a separate db_repl_idempotency_test binary
Branch: master
https://github.com/mongodb/mongo/commit/b2429fa83457d26135a10a2d70a2c554c61922f7

Comment by Tess Avitabile (Inactive) [ 07/Oct/20 ]

Thanks for doing that investigation, kevin.pulo! I no longer have concerns about this change.

Comment by Kevin Pulo [ 07/Oct/20 ]

The push to merge unittests into fewer binaries is less relevant these days, see this comment from Drew. On my workstation it adds < 1 sec at link time, and ~ 3 secs at dynamic link time (ie. runtime):

$ rm build/install/bin/db_repl_idempotency_test build/7a67446c/mongo/db/repl/db_repl_idempotency_test
$ ninja -j1 install-all
[ 3 / 3 / 3 ] [100%] [0.895s]  Compiling test list: build/unittests.txt
$ time build/install/bin/db_repl_idempotency_test --help
{"t":{"$date":"2020-10-07T04:59:54.792Z"},"s":"I",  "c":"NETWORK",  "id":4915701, "ctx":"main","msg":"Initialized wire specification","attr":{"spec":{"incomingExternalClient":{"minWireVersion":0,"maxWireVersion":11},"incomingInternalClient":{"minWireVersion":0,"maxWireVersion":11},"outgoing":{"minWireVersion":0,"maxWireVersion":11},"isInternalClient":false}}}
  --list                List all test suites in this unit test.
  --suite arg           Test suite name. Specify --suite more than once to run
                        multiple suites.
  --filter arg          Test case name filter. Specify the substring of the
                        test names.
  --fileNameFilter arg  Filter test cases by source file name.
  --repeat arg (=1)     Specifies the number of runs for each test.
  --verbose [=arg(=v)]  Log more verbose output.  Specify one or more 'v's to
                        increase verbosity.
 
real    0m3.207s
user    0m3.117s
sys     0m0.073s

So it's likely still not feasible to completely split up the unittests, but splitting off a single long-running suite has clear benefits (see below) for a very small cost.

I used this POC of splitting idempotency_test.cpp into its own binary (db_repl_idempotency_test):

diff --git a/src/mongo/db/repl/SConscript b/src/mongo/db/repl/SConscript
index 83f414b05cf..bdc8c0ccdac 100644
--- a/src/mongo/db/repl/SConscript
+++ b/src/mongo/db/repl/SConscript
@@ -1429,7 +1429,6 @@ env.CppUnitTest(
         'check_quorum_for_config_change_test.cpp',
         'drop_pending_collection_reaper_test.cpp',
         'idempotency_document_structure_test.cpp',
-        'idempotency_test.cpp',
         'idempotency_update_sequence_test.cpp',
         'initial_syncer_test.cpp',
         'isself_test.cpp',
@@ -1631,6 +1630,18 @@ env.CppUnitTest(
     ],
 )
 
+env.CppUnitTest(
+    target='db_repl_idempotency_test',
+    source=[
+        'idempotency_test.cpp',
+    ],
+    LIBDEPS=[
+        '$BUILD_DIR/mongo/db/update/update_test_helpers',
+        'idempotency_test_fixture',
+        'idempotency_test_util',
+    ],
+)
+
 env.Library(
     target='replication_metrics',
     source=[

With this, the previous results from a release build on my workstation become:

Top binary durations in seconds (from -j1):

Before After
229 db_repl_test 137 db_storage_test
139 db_storage_test 135 db_repl_idempotency_test
91 db_unittests 104 db_repl_test
82 util_test 89 db_unittests
42 db_pipeline_test 82 util_test
35 db_repl_coordinator_test 42 db_pipeline_test
  36 db_repl_coordinator_test

Runtimes in seconds:

  full suite full suite full suite db_repl_test db_repl_test db_repl_idempotency_test
-j before after benefit before after after
1 1161 1171   229 104 135
4 425 325 100 (23%) 231 118 127
8 254 207 47 (18%) 228 114 129
20 268 180 88 (32%) 229 126 128
40 247 177 70 (28%) 240 156 177

So it gets my runs down from 4+ mins to ~3 mins.

There are some other long-ish running unittest binaries that might be good to individually target for splitting up, but none of them have a single long-running suite like RandomizedIdempotencyTest, which is the lowest-hanging fruit here.

And while it'd be great if unittests had their own -j flag to run suites in parallel (to parallelise suites without paying the dynamic linking penalty on each one, as would happen if each suite was split into its own binary, or if resmoke tried to run subsets of the suites in parallel using --suite or --filter), that's a more general solution that I don't think is worth the effort, so this ticket isn't asking for that.

Comment by Cristopher Stauffer [ 06/Oct/20 ]

Looping in robert.guo as well to get his opinion. I'm all for reducing makespan but the concern is the gains will be lost in longer build times. Is it simple enough to put a patch up and SDP can review the stats and decide then?

Comment by Kaloian Manassiev [ 30/Sep/20 ]

I also want to add to Kev's comments above that db_repl_test holds up sanity tests I run on my local machine, to ensure that I at least didn't break the unit-tests before submitting an Evergreen patch.

Comment by Kevin Pulo [ 30/Sep/20 ]

I use debug builds during development because I've been burned by dassert in the past. That's a separate issue, but I don't understand why I would want to use a release build during development — I'm going to debugging routinely so surely a debug build would be more useful? Also I do often use filter, but there are times when running the whole suite is more useful/appropriate (eg. when making invasive changes to low-level code).

Anyway, the debug/release build aspect is irrelevant, because it affects the runtime of all tests equally (more or less). The point is that db_repl_test (and specifically CheckUpdateSequencesAreIdempotentV2) take much longer than other tests, and this affects the ability to usefully parallelise the unittests.

The original commit queue tasks I linked to are release builds, and I re-ran with a release build to properly confirm that it's equally affected:

Top 6 binary durations in seconds (from -j1):

229 db_repl_test
139 db_storage_test
91 db_unittests
82 util_test
42 db_pipeline_test
35 db_repl_coordinator_test

Runtimes in seconds:

-j full suite util_test db_repl_test RandomizedIdempotencyTest CheckUpdateSequencesAreIdempotentV2
1 1161 82 229 124 127
4 425 82 231 126 115
8 254 84 228 109 100
20 268 80 229 126 112
40 247 87 240 119 110

If RandomizedIdempotencyTest is linked into a separate binary of its own, then it could spend its 2 minutes running alongside the rest of db_repl_test. Since this is a randomised test I imagine developers won't miss it if they're locally running db_repl_test (and would probably appreciate its runtime being halved), and it'll still be included in the full suite, and still accessible via ninja +randomized_idempotency_test. (It's currently inside idempotency_test.cpp, which contains a non-randomised test UpdateTwoFields that I expect is fast and has actual benefit staying in db_repl_test, so I'm assuming the RandomizedIdemopotencyTest tests should be split out into their own randomized_idempotency_test.cpp).

Comment by Tess Avitabile (Inactive) [ 28/Sep/20 ]

kevin.pulo, as Siyuan commented above, we recommend using a release build for testing, or using the filter argument with debug builds when debugging a particular unit test. We don't consider the performance of unit tests with debug builds to be a priority.

Comment by Siyuan Zhou [ 23/Sep/20 ]

The initial idea of merging unit tests to a large one is to speed the compile time. I'm curious to hear acm's opinion on this.

On a side note, kevin.pulo, I usually compile with debug off since debug build could be much slower in both unit tests and JS tests.

Comment by Judah Schvimer [ 23/Sep/20 ]

Speeding up the OplogFetcher based tests feels especially worthwhile since we're using them for tenant migrations (CC siyuan.zhou). Splitting the repl tests up into multiple binaries seems like a small lift and a worthwhile win.

Comment by Brooke Miller [ 22/Sep/20 ]

In triaging, we discussed that there may be an opportunity to split up db_repl_test. Robert mentioned that since it's the only test that runs in its job, there's nothing resmoke can do to cut down the runtime further. As a result, we're sending over to Replication for additional investigations or suggestions. Feel free to send it back to STM if there's a systematic improvement you think we can do to the testing framework.

Comment by Kevin Pulo [ 18/Sep/20 ]

Sorry, I forgot to mention. resmoke is already parallelising the execution of the unittest binaries, over 16 cores on Evergreen. The problem is that if one binary takes much longer than the others, then towards the end it's the only thing left running, and the other cores are mostly idle.

The overall unittests task took ~8 minutes, of which the first ~6 mins was compiling. The resmoke run took 128 secs, which means the 108 secs for db_repl_test is a pretty large fraction of that walltime (85%), ie. the overall resmoke runtime is mostly being dominated by the single db_repl_unit binary. If it was instead broken into, say, 3 pieces of ~30 secs each, that would help to reduce the overall resmoke (wallclock) runtime.

It could be possible for resmoke to do that sort of splitting by running the unittest binaries with -suite/-filter, but that seems a lot harder than what I was originally thinking, of just rearranging how the tests are linked into binaries (ie. just moving the long-running tests/suites out of db_repl_test and into their own binaries).

Again, this isn't so bad on Evergreen since the overall execution is only 2 minutes. But on my workstation (with a debug build) the full runtime of up to 9 minutes is completely dominated by db_repl_test with 98% of walltime (the next nearest being 4 binaries around 50-40% of walltime each):

secs test
 530 Total
 523 db_repl_test
 277 storage_wiredtiger_record_store_and_index_test
 236 db_unittests
 232 storage_wiredtiger_test
 220 storage_wiredtiger_prefixed_record_store_and_index_test
 191 db_storage_test
 114 db_catalog_test
  98 util_test
  97 mongo_embedded_test
  69 db_pipeline_test
...

This means even though the machine has 40 cores, the runtime doesn't decrease when I increase beyond about -j8 or 16. In this case, splitting up db_repl_test would definitely have a direct benefit to the runtime (eg. the above run could be roughly halved, down to as low as 4.5 mins (277 secs)). This would help a lot when needing to run the full unittest suite during iterative development.

Comment by April Schoffer [ 17/Sep/20 ]

Sending this over to STM to review whether there's parallelization that can be done in resmoke.py

Generated at Thu Feb 08 05:24:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.