Core Server / SERVER-46842

resmoke.py shouldn't run data consistency checks in stepdown suites if a process has crashed

    • Fully Compatible
    • v4.4
    • STM 2020-03-23, STM 2020-04-20, STM 2020-05-04
    • 1

      resmoke.py ordinarily checks that a test didn't cause the server to crash by calling self.fixture.is_running() after the test finishes. However, the stepdown thread and the job thread are synchronized only through ContinuousStepdown.after_test(), so it isn't safe to check whether the fixture is still running immediately after the test finishes. The check is therefore skipped for suites that use the hook:

      # Don't check fixture.is_running() when using the ContinuousStepdown hook, which kills
      # and restarts the primary. Even if the fixture is still running as expected, there is a
      # race where fixture.is_running() could fail if called after the primary was killed but
      # before it was restarted.
      self._check_if_fixture_running = not any(
          isinstance(hook, stepdown.ContinuousStepdown) for hook in self.hooks)
      

      Skipping this check causes resmoke.py to continue running the other data consistency checks even when a process in the MongoDB cluster has crashed. This is misleading for Server engineers, who end up clicking the "wrong" link in Evergreen for the task failure, and it also has a severe negative impact on our automated log extraction tool by preventing it from finding relevant information. We should ensure that process crashes in test suites using the ContinuousStepdown hook prevent other tests and hooks from running. I suspect having _StepdownThread.pause() check that the fixture is still running as the last thing it does would accomplish this.
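
      A minimal sketch of that suggestion, not the actual resmoke.py code: FakeFixture, StepdownThreadSketch, ServerFailure, and the _is_idle_evt attribute are hypothetical stand-ins introduced only for illustration, and no real background thread is started. The idea is that pause() first waits for any in-flight kill-and-restart cycle to finish and only then asks the fixture whether it is running, so a negative answer can no longer be explained by the restart race.

```python
import threading


class ServerFailure(Exception):
    """Hypothetical error type signalling that a server process has crashed."""


class FakeFixture:
    """Hypothetical stand-in for a resmoke.py fixture; only is_running() is modelled."""

    def __init__(self):
        self.running = True

    def is_running(self):
        return self.running


class StepdownThreadSketch:
    """Simplified, hypothetical stand-in for _StepdownThread."""

    def __init__(self, fixture):
        self._fixture = fixture
        # Assumed convention: the event is set whenever no kill/restart
        # cycle is in progress.
        self._is_idle_evt = threading.Event()
        self._is_idle_evt.set()

    def pause(self):
        # Wait until any in-progress stepdown (kill + restart) has finished.
        self._is_idle_evt.wait()
        # Last step: with stepdowns quiesced, is_running() no longer races
        # with a primary restart, so False is treated as a real crash.
        if not self._fixture.is_running():
            raise ServerFailure(
                "fixture is no longer running after pausing stepdowns")


fixture = FakeFixture()
stepdown_thread = StepdownThreadSketch(fixture)
stepdown_thread.pause()  # Healthy fixture: returns without raising.

fixture.running = False  # Simulate a crashed process.
try:
    stepdown_thread.pause()
except ServerFailure as err:
    print("would stop before data consistency checks:", err)
```

      Raising from pause() is one possible design choice here; the point is only that the failure surfaces before any later hooks or tests get a chance to run.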

            Assignee:
            mikhail.shchatko@mongodb.com Mikhail Shchatko
            Reporter:
            max.hirschhorn@mongodb.com Max Hirschhorn