Core Server / SERVER-51003

Some replication unittests take much longer to run than others



    • Backwards Compatibility: Fully Compatible
    • Sprint: Sharding 2020-10-19


      I noticed this when running the full unittest suite locally, but fixing it should also benefit the unittests in the commit queue and patch builds. The db_repl_test binary and, to a lesser extent, util_test take much longer to run than the rest. When running a debug build locally on 40 cores, I still normally spend several minutes at the end waiting for db_repl_test alone.

      Timings from a recent commit queue patch show:

      secs binary
      108.61 db_repl_test
      48.16 util_test
      33.11 db_catalog_test
      31.68 db_unittests
      31.20 db_storage_test
      20.94 storage_ephemeral_for_test_test
      19.99 db_repl_cloners_test
      ... ...

      Looking into the suites shows just a few that run much longer than the others:

      millis binary suite num_tests
      29368 db_repl_test RandomizedIdempotencyTest 2
      19508 db_repl_test TenantOplogApplierTest 19
      15077 db_repl_test OplogFetcherTest 73
      11347 db_repl_test RSRollbackTest 62
      8746 util_test Future 71
      8277 util_test Future_Void 61
      7635 util_test Future_MoveOnly 60
      5012 util_test FailPointStress 1
      4648 db_repl_test IdempotencyTestTxns 20
      4600 db_repl_test IdempotencyTest 21
      4254 db_repl_test InitialSyncerTest 87
      3125 util_test InvariantTerminationTest 13
      2880 util_test Future_EdgeCases 10
      2418 util_test RegistryList 2
      2256 util_test SharedFuture 16
      2159 db_repl_test PrimaryOnlyServiceTest 13
      2072 db_repl_test OplogBufferCollectionTest 41
      2043 db_repl_test ReplicationRecoveryTest 57
      1577 db_repl_test RollbackImplTest 42
      1406 db_repl_test OplogApplierImplTest 35
      1011 db_repl_test TenantOplogBatcherTest 11
      1000 util_test BackgroundJobBasic 3

      Looking more closely at RandomizedIdempotencyTest shows that CheckUpdateSequencesAreIdempotent takes ~3.5 secs, and CheckUpdateSequencesAreIdempotentV2 takes the remaining ~26 secs:

      2020-09-16T12:52:35.435Z I  TEST     23063   [main] "Running","attr":{"suite":"RandomizedIdempotencyTest"}
      2020-09-16T12:52:35.435Z I  TEST     23059   [main] "Running","attr":{"test":"CheckUpdateSequencesAreIdempotent","rep":1,"reps":1}
      2020-09-16T12:52:38.925Z I  TEST     23059   [main] "Running","attr":{"test":"CheckUpdateSequencesAreIdempotentV2","rep":1,"reps":1}
      2020-09-16T12:53:04.803Z I  TEST     23060   [main] "Done running tests"

      So it might be good to split at least this test (CheckUpdateSequencesAreIdempotentV2) or suite (RandomizedIdempotencyTest), and maybe some of the others, into separate binaries, so that they parallelize better across multiple cores.
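
To illustrate why splitting helps, here is a rough model of the parallel run, not a description of how the actual test runner schedules binaries. It assumes each binary occupies one core for its whole runtime, that binaries are handed out longest-first to the least-loaded core, and that only the seven binaries listed above exist (the elided rows are ignored). The split of RandomizedIdempotencyTest out of db_repl_test is hypothetical and assumes suite times simply subtract, with no extra startup cost for the new binary.

```python
import heapq


def makespan(durations, workers):
    """Wall time under longest-first greedy assignment to the least-loaded worker."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for secs in sorted(durations, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + secs)
    return max(loads)


# Binary timings (seconds) copied from the commit queue table in this ticket.
BEFORE = {
    "db_repl_test": 108.61,
    "util_test": 48.16,
    "db_catalog_test": 33.11,
    "db_unittests": 31.68,
    "db_storage_test": 31.20,
    "storage_ephemeral_for_test_test": 20.94,
    "db_repl_cloners_test": 19.99,
}

# Hypothetical split: move RandomizedIdempotencyTest (29.368 s) into its own binary.
AFTER = dict(BEFORE)
AFTER["db_repl_test"] = BEFORE["db_repl_test"] - 29.368
AFTER["db_repl_randomized_idempotency_test"] = 29.368  # made-up binary name

if __name__ == "__main__":
    cores = 40
    print("before split:", makespan(BEFORE.values(), cores))
    print("after split: ", makespan(AFTER.values(), cores))
```

With more cores than binaries, the modelled wall time is simply the longest single binary, so it only drops when that binary gets shorter.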

      (Sending this to SDP, since the ticket is primarily about improving overall (parallel) unittest runtime by avoiding individual long-running tests/suites/binaries. But it could of course be redirected to Replication to fix these particular ones.)



