Shard Django test suite across parallel CI jobs to cut run time from 60–90 min to ~10 min


    • Type: Task
    • Resolution: Unresolved
    • Priority: Unknown
    • Affects Version/s: None
    • Component/s: None
    • Python Drivers

      Background

      Django's test suite contains ~146 test apps. Running them sequentially (as django-mongodb-backend currently does) takes 60–90 minutes on CI. The django-zodb-backend project has demonstrated that splitting the app list across a GitHub Actions matrix of parallel shards cuts wall-clock time to ~8–12 minutes with no loss of coverage.

      Reference implementation: https://github.com/django-zodb/django/blob/58d04cdaddf024f626daeadce98482d7ab9a161d/.github/workflows/rebase-upstream.yml and .github/workflows/runtests.py

      Proposal

      Adopt the same matrix-sharding pattern used in django-zodb-backend:

      1. Wrapper script (runtests_.py or similar) that:
        • Holds the full ordered list of Django test apps.
        • Reads two env vars — DJANGO_TEST_PART (1-based shard index) and DJANGO_TEST_PARTS (total shards).
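        • Selects its subset via round-robin modulo (i % part_count == part_index, where i is the app's position in the list and part_index = DJANGO_TEST_PART - 1, since the env var is 1-based).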
        • Passes all apps in the shard to a single runtests.py invocation (one Django startup per shard, not one per app).
      2. GitHub Actions matrix that fans out across N shards (e.g. 8):
           strategy:
             fail-fast: false
             matrix:
               part: [1, 2, 3, 4, 5, 6, 7, 8]
      3. Summary job (tests-complete) that acts as the required status check, passing only when all shards pass.
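      A minimal Python sketch of the wrapper script described above, under stated assumptions: it sits next to Django's tests/runtests.py, TEST_APPS is a placeholder eight-app subset of the real list, and the helper names (select_shard, build_command, main) and the test_settings module are hypothetical, not taken from the django-zodb-backend script. It relies only on runtests.py accepting --settings, --parallel, and app labels as positional arguments.

```python
# Hypothetical shard wrapper (e.g. runtests_.py next to Django's
# tests/runtests.py). A sketch only, not the django-zodb-backend script.
import os
import subprocess
import sys

# Placeholder subset; the real list would hold all ~146 Django test apps
# in one fixed order, so every shard computes the same split.
TEST_APPS = [
    "admin_changelist",
    "aggregation",
    "basic",
    "expressions",
    "lookup",
    "migrations",
    "model_fields",
    "queries",
]


def select_shard(apps, part, parts):
    """Return the apps for 1-based shard `part` of `parts`, round-robin."""
    if not 1 <= part <= parts:
        raise ValueError(f"need 1 <= part <= parts, got part={part}, parts={parts}")
    part_index = part - 1  # DJANGO_TEST_PART is 1-based; the modulo is 0-based
    return [app for i, app in enumerate(apps) if i % parts == part_index]


def build_command(shard):
    """One runtests.py invocation (one Django startup) for the whole shard."""
    return [
        sys.executable,
        "runtests.py",
        "--settings", "test_settings",  # hypothetical settings module name
        "--parallel", "1",  # single process, for backends that are not fork-safe
        *shard,
    ]


def main(environ=os.environ):
    part = int(environ["DJANGO_TEST_PART"])
    parts = int(environ["DJANGO_TEST_PARTS"])
    shard = select_shard(TEST_APPS, part, parts)
    if not shard:  # can happen when there are more shards than apps
        print(f"Shard {part}/{parts}: no apps assigned")
        return 0
    cmd = build_command(shard)
    print(f"Shard {part}/{parts}:", " ".join(cmd))
    return subprocess.call(cmd)  # runtests.py's exit code becomes the shard's result
```

      In CI, each matrix job would set DJANGO_TEST_PART to ${{ matrix.part }} and DJANGO_TEST_PARTS to 8, then run the file with a final if __name__ == "__main__": sys.exit(main()) guard (omitted here), so a failing shard exits non-zero and fails its job. The fixed order of TEST_APPS matters: each shard computes its subset independently, and the shards cover every app exactly once only if they all see the same list.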

      Key implementation notes

      • Do not use --parallel if the storage backend is not fork-safe (e.g. ZODB's MappingStorage breaks across os.fork()). For fork-safe backends, --parallel can add a further speed-up within each shard.
      • Round-robin distribution (rather than contiguous slicing) keeps shard run times balanced: apps that are adjacent in the list go to different shards, so a cluster of slow apps is spread across shards instead of landing in one.
      • The shard count (8) is a tunable parameter; start with 4–8 depending on the number of test apps and typical app run time.
      • A fail-fast: false policy lets all shards complete so the full failure picture is visible even when one shard fails.
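      To illustrate the balancing argument, this Python sketch splits made-up per-app durations (illustrative numbers, not measurements from Django's suite) in which the slow apps are clustered, once round-robin and once in contiguous slices. The slice items[k::parts] selects exactly the items where i % parts == k.

```python
# Illustrative only: made-up per-app durations (minutes) in which the
# slow apps happen to sit next to each other in the ordered list.
DURATIONS = [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 1, 1]


def round_robin(items, parts):
    """Shard k gets every item at position i with i % parts == k."""
    return [items[k::parts] for k in range(parts)]


def contiguous(items, parts):
    """Shard k gets the k-th consecutive slice of the list."""
    size = -(-len(items) // parts)  # ceiling division
    return [items[k * size:(k + 1) * size] for k in range(parts)]


def shard_totals(shards):
    return [sum(shard) for shard in shards]


rr = shard_totals(round_robin(DURATIONS, 4))
cs = shard_totals(contiguous(DURATIONS, 4))
print("round-robin:", rr, "slowest shard:", max(rr))
print("contiguous:", cs, "slowest shard:", max(cs))
```

      With these numbers every round-robin shard totals 11 minutes, while the contiguous split leaves one shard at 27 minutes, and the workflow has to wait for its slowest shard. Round-robin is a heuristic rather than a guarantee: slow apps spaced exactly `parts` positions apart would still land in the same shard, so uneven shard times in CI would be a signal to revisit the ordering or the shard count.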

      Expected outcome

      Metric         Current                With sharding
      CI run time    60–90 min              ~8–12 min
      Parallelism    Sequential             8 parallel runners
      Coverage       All apps               All apps (identical)
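      A back-of-the-envelope check on the ~8–12 min figure, assuming N = 8 perfectly balanced shards; T_seq (current sequential run time) and T_setup (per-job overhead such as checkout, dependency install, and Django startup) are symbols introduced here for illustration:

```latex
\begin{aligned}
T_{\text{wall}} &\approx \frac{T_{\text{seq}}}{N} + T_{\text{setup}} \\
\frac{60\ \text{min}}{8} &= 7.5\ \text{min}, \qquad
\frac{90\ \text{min}}{8} = 11.25\ \text{min}
\end{aligned}
```

      The ideal split alone gives 7.5–11.25 min, so the estimate leaves little room for setup overhead or imbalance. Because the workflow finishes only when its slowest shard does, T_wall is really set by the largest shard time, and a trial run would give the actual figure.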

      References

      • django-zodb-backend workflow: .github/workflows/tests.yml
      • django-zodb-backend runner: .github/workflows/runtests.py

            Assignee:
            Unassigned
            Reporter:
            Alex Clark
            Votes:
            0
            Watchers:
            2
