Continue to find all test file exclusions for disagg_replica_sets.yml


    • Type: Engineering Test
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Query Integration

      In SERVER-118741 we created the Evergreen suite disagg_replica_sets, which runs the same tests as replica_sets but in a disagg cluster.

      As with most, if not all, disagg suites, many existing tests cannot run successfully (they hang), either because the disagg architecture is fundamentally incompatible with them or because the disagg testing infrastructure does not support the feature yet.

      disagg_replica_sets is a large suite (680 jstests) that needs a correspondingly large number of exclusions. The original definition of the suite already includes many exclusions, but more are needed before the suite can pass without timing out. The assignee of this ticket should take the existing suite definition and find all the remaining exclusions.

      Because the suite is large, this process has turned out to be very time-consuming. To help, I have developed (with the help of the Cursor agent) a way to find the exclusions in parallel. Take the attached scripts and place them at the following paths:

      • buildscripts/modules/atlas/disagg_shard_restart_watcher.py
      • buildscripts/modules/atlas/monitor_disagg_replica_sets_shards.sh
      • buildscripts/modules/atlas/run_disagg_replica_sets_shards.sh
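Before starting the loop, it may help to confirm that the three scripts landed at the paths listed above. The following is a minimal sketch, not part of the attached tooling: the helper name `missing_scripts` is hypothetical, and it assumes it is run from the root of the repository checkout.

```python
from pathlib import Path

# Paths copied from the list above; the helper itself is illustrative only.
SCRIPT_PATHS = [
    "buildscripts/modules/atlas/disagg_shard_restart_watcher.py",
    "buildscripts/modules/atlas/monitor_disagg_replica_sets_shards.sh",
    "buildscripts/modules/atlas/run_disagg_replica_sets_shards.sh",
]


def missing_scripts(repo_root="."):
    """Return the expected script paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in SCRIPT_PATHS if not (root / p).is_file()]


if __name__ == "__main__":
    # Assumption: the current working directory is the repository root.
    missing = missing_scripts()
    if missing:
        print("Missing scripts:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All three scripts are in place.")
```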

       

      Then start (or restart) the automated loop with the following commands:

      ./buildscripts/modules/atlas/run_disagg_replica_sets_shards.sh --stop
      ./buildscripts/modules/atlas/run_disagg_replica_sets_shards.sh \
        --shards 4 --restart-dead-shards --restart-docker-cleanup
      ./buildscripts/modules/atlas/monitor_disagg_replica_sets_shards.sh

      Within 1-2 hours of starting these commands, every shard should be hung on a specific test; those tests are the ones that will become exclusions. You can then ask the Cursor agent something like "please analyze the state of all testing shards, and if hung on a test, analyze the reason, and if appropriate add it as a test file exclusion to disagg_replica_sets.yml with the reason".
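To keep exclusions and their reasons together, each hung test can be written as a list item preceded by a comment. The sketch below only formats such entries as text. It assumes the suite file lists exclusions as a YAML sequence (resmoke suite files commonly use `selector.exclude_files`, but check the actual layout of disagg_replica_sets.yml), and the test paths, reasons, and four-space indent are placeholders rather than real findings.

```python
def render_exclusions(hung_tests, indent="    "):
    """Format (test_path, reason) pairs as YAML list items with reason comments.

    Entries are sorted by test path so repeated runs give a stable order.
    """
    lines = []
    for test_path, reason in sorted(hung_tests):
        lines.append(f"{indent}# {reason}")
        lines.append(f"{indent}- {test_path}")
    return "\n".join(lines)


# Hypothetical input: these paths and reasons are placeholders only.
example = [
    ("jstests/replsets/example_b.js", "placeholder reason B"),
    ("jstests/replsets/example_a.js", "placeholder reason A"),
]
print(render_exclusions(example))
```

The rendered lines can then be reviewed and pasted into the exclusion list by hand, which keeps the final decision about each test with the person doing the analysis.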

      Repeating this loop should catch a new group of hung tests on each pass and eventually surface all the remaining exclusions, likely within a few days.

       

      disagg_shard_restart_watcher.py

      monitor_disagg_replica_sets_shards.sh

      run_disagg_replica_sets_shards.sh

            Assignee:
            Vijendra Purohit
            Reporter:
            Joe Shalabi