- Type: Improvement
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Testing Infrastructure
- Fully Compatible
- v4.4
- STM 2020-03-23, STM 2020-04-20, STM 2020-05-04
- 1
resmoke.py ordinarily checks that a test didn't cause the server to crash by calling self.fixture.is_running() after the test finishes. However, due to the stepdown thread and the job thread only being synchronized by calling ContinuousStepdown.after_test(), it isn't safe to check whether the fixture is still running immediately after the test finishes.
    # Don't check fixture.is_running() when using the ContinuousStepdown hook, which kills
    # and restarts the primary. Even if the fixture is still running as expected, there is a
    # race where fixture.is_running() could fail if called after the primary was killed but
    # before it was restarted.
    self._check_if_fixture_running = not any(
        isinstance(hook, stepdown.ContinuousStepdown) for hook in self.hooks)
Skipping this check causes resmoke.py to continue running the other data consistency checks, even when a process in the MongoDB cluster has crashed. Besides misleading Server engineers into clicking the "wrong" link in Evergreen for the task failure, this also has a severe negative impact on our automated log extraction tool by preventing it from finding relevant information.
We should ensure that process crashes in test suites using the ContinuousStepdown hook prevent other tests and hooks from running. I suspect that having _StepdownThread.pause() check that the fixture is still running as the last thing it does would accomplish this.
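A minimal sketch of the suggestion above, not the actual resmoke.py code. It assumes pause() can first wait for any in-flight kill/restart cycle to finish, at which point a fixture that is not running can only mean a real crash. FakeFixture, ServerCrashed, StepdownThreadSketch and the _is_idle_evt attribute are hypothetical names used only for illustration.

```python
import threading


class FakeFixture:
    """Hypothetical stand-in for a resmoke.py fixture."""

    def __init__(self):
        self.running = True

    def is_running(self):
        return self.running


class ServerCrashed(Exception):
    """Hypothetical error raised when the fixture is found dead."""


class StepdownThreadSketch:
    """Illustrative model of the pause handshake; not the real _StepdownThread."""

    def __init__(self, fixture):
        self._fixture = fixture
        # Assumed to be set whenever no kill/restart cycle is in progress.
        self._is_idle_evt = threading.Event()
        self._is_idle_evt.set()

    def pause(self):
        # Wait until any in-flight kill/restart cycle has completed, so the
        # primary is no longer expected to be down.
        self._is_idle_evt.wait()
        # Last step: with stepdowns quiesced, a fixture that is not running
        # indicates a genuine crash rather than an intentional kill.
        if not self._fixture.is_running():
            raise ServerCrashed("fixture is not running after pausing stepdowns")
```

Under this assumption, an exception raised from pause() would surface through ContinuousStepdown.after_test(), giving the job a point at which it could stop before running further tests and hooks.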