Context:
The following run of the format-abort-recovery-stress-test timed out: https://evergreen.mongodb.com/task/wiredtiger_ubuntu2004_stress_tests_format_abort_recovery_stress_test_1f7ecde9df1d952a2495dbca9020b9b5575539fb_21_09_14_23_32_01
By default, evergreen runs of format-abort-recovery-stress-test bound the format.sh script run to 30 mins (after which the test shuts down the running format instances and returns success). The above failure is a situation where the default evergreen test timeout (41 mins) occurs, in turn invoking the hang analyzer, before the format.sh script gets a chance to close out the run at the 30 min mark.
Some notes about the above run's failing config:
- The above failing test config has 22 worker threads
- Uses zlib (the slowest compression method)
- Is a diagnostic build, performing a heavy amount of yielding during the test run
- Uses the stress-split configuration, which adds extra latency (additional sleeps and yields).
- The evergreen instance the test runs on is an ubuntu20.04 small, which has only 4 cores.
Further investigation in WT-8048 determined that this particular test format run wasn't actually hanging, but rather just taking longer to execute (~30 mins for a single test format run).
Given the limited compute resources (4 cores), running 4 test format instances (one of which uses 22 threads) could suggest the instance may simply have stalled for lack of compute time, and hence missed the 30 min mark at which it should have shut down.
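To make the numbers above concrete, here is a rough back-of-the-envelope sketch in Python. It uses only figures quoted in this ticket; the function names are hypothetical, and the thread counts of the other three format instances are unknown, so the oversubscription figure is a lower bound.

```python
# Figures quoted in the ticket text.
CORES = 4                     # ubuntu20.04 small
WORKER_THREADS = 22           # worker threads in the failing format config
FORMAT_BOUND_MINS = 30        # format.sh run bound
EVERGREEN_TIMEOUT_MINS = 41   # default evergreen test timeout


def threads_per_core(threads, cores):
    """Runnable worker threads competing for each core (lower bound)."""
    return threads / cores


def shutdown_margin_mins(bound_mins, timeout_mins):
    """Minutes format.sh has to shut down after its bound, before the timeout."""
    return timeout_mins - bound_mins


print(threads_per_core(WORKER_THREADS, CORES))  # 5.5
print(shutdown_margin_mins(FORMAT_BOUND_MINS, EVERGREEN_TIMEOUT_MINS))  # 11
```

So the 22-thread instance alone oversubscribes the host by more than 5x, and format.sh has roughly 11 minutes after the 30 min mark to finish shutting down before evergreen times the task out.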
Definition of Done:
- Investigate and confirm the above scenario.
- Investigate and, if required, amend the format-abort-recovery-stress-test run to avoid timing out in the aforementioned scenario, e.g. update the evergreen test timeout, possibly use an ubuntu20.04 large instance, or configure scheduling priority so that format.sh can reliably exit at 30 mins.
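
As an illustration of the scheduling-priority option, below is a minimal Python sketch of a supervisor that starts its children at lowered priority and stops them at a deadline. This is not how format.sh is implemented; `run_bounded` and its parameters are hypothetical names, and the sketch assumes a POSIX host where `os.nice` and `preexec_fn` are available.

```python
import os
import subprocess
import time


def run_bounded(commands, deadline_secs, niceness=10, grace_secs=5.0):
    """Run commands in parallel at lowered priority; stop them at the deadline.

    Returns the list of exit codes (negative means killed by that signal).
    """
    # Each child raises its own niceness before exec, so the supervisor keeps
    # normal priority and is more likely to get CPU time when the deadline hits.
    procs = [
        subprocess.Popen(cmd, preexec_fn=lambda: os.nice(niceness))
        for cmd in commands
    ]

    # Poll until the deadline passes or every child has exited on its own.
    end = time.monotonic() + deadline_secs
    while time.monotonic() < end and any(p.poll() is None for p in procs):
        time.sleep(0.1)

    # Ask the remaining children to exit, then force-kill any that do not.
    for p in procs:
        if p.poll() is None:
            p.terminate()
    for p in procs:
        try:
            p.wait(timeout=grace_secs)
        except subprocess.TimeoutExpired:
            p.kill()
            p.wait()
    return [p.returncode for p in procs]
```

For example, `run_bounded([["sleep", "3600"]] * 4, deadline_secs=30 * 60)` would stop four long-running children at the 30 minute mark. Lowering only the children's priority is one possible design; whether it is sufficient on a 4-core host under this load would still need to be confirmed by testing.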