Core Server / SERVER-40702

resmoke.py should wait for subprocesses it spawned to exit on KeyboardInterrupt

    • Fully Compatible
    • v4.2
    • STM 2019-07-15

      resmoke.py doesn't wait for the job threads running tests to exit when the user interrupts it. Instead, it relies on every process in the process group receiving the SIGINT and exiting quickly on its own. This makes resmoke.py exit faster, which may make users less likely to interrupt it multiple times, but it also means that processes spawned by resmoke.py may outlive the resmoke.py Python process. This behavior has caused failures in the backup_restore*.js tests, each of which spawns its own resmoke.py subprocess in order to run FSM workloads against a ReplSetTest instance.

      We should call thr.join() even after a KeyboardInterrupt exception occurs. To keep this convenient for users, we should also log a message (say, after 2 seconds of waiting for the thread) explaining how to force the processes to exit: on Linux, ctrl-\ sends a SIGQUIT to all of the processes; on Windows, pressing ctrl-c again works because the Job object has JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE set. Sending a SIGQUIT is an easy way to ensure resmoke.py exits even if the mongod process is hung.

      def _run_tests(self, test_queue, setup_flag, teardown_flag):
          """Start a thread for each Job instance and block until all of the tests are run.
          Returns a (combined report, user interrupted) pair, where the
          report contains the status and timing information of tests run
          by all of the threads.
          """
      
          threads = []
          interrupt_flag = threading.Event()
          user_interrupted = False
          try:
              # Run each Job instance in its own thread.
              for job in self._jobs:
                  thr = threading.Thread(
                      target=job, args=(test_queue, interrupt_flag), kwargs=dict(
                          setup_flag=setup_flag, teardown_flag=teardown_flag))
                  # Do not wait for tests to finish executing if interrupted by the user.
                  thr.daemon = True
                  thr.start()
                  threads.append(thr)
                  # SERVER-24729 Need to stagger when jobs start to reduce I/O load if there
                  # are many of them.  Both the 5 and the 10 are arbitrary.
                  # Currently only enabled on Evergreen.
                  if _config.STAGGER_JOBS and len(threads) >= 5:
                      time.sleep(10)
      
              joined = False
              while not joined:
                  # Need to pass a timeout to join() so that KeyboardInterrupt exceptions
                  # are propagated.
                  joined = test_queue.join(TestSuiteExecutor._TIMEOUT)
          except (KeyboardInterrupt, SystemExit):
              interrupt_flag.set()
              user_interrupted = True
          else:
              # Only wait for all the Job instances if not interrupted by the user.
              self.logger.debug("Waiting for threads to complete")
              for thr in threads:
                  thr.join()
              self.logger.debug("Threads are completed!")
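
      The proposal above (join the threads even after an interrupt, and log a hint if the wait exceeds roughly 2 seconds) could be sketched as follows. This is only an illustration under stated assumptions, not the change that was actually made to resmoke.py: the helper name join_after_interrupt, its parameters, and the wording of the hint message are all hypothetical.

```python
import logging
import sys
import threading
import time

# Taken from the ticket's "say after 2 seconds" suggestion.
HINT_AFTER_SECS = 2.0


def join_after_interrupt(threads, logger, hint_after_secs=HINT_AFTER_SECS, poll_secs=0.1):
    """Wait for every thread to exit, logging a one-time hint if the wait is slow.

    Hypothetical helper (not part of resmoke.py). Returns True if the hint
    was logged, False if all threads exited before the hint deadline.
    """
    deadline = time.monotonic() + hint_after_secs
    hinted = False
    for thr in threads:
        while thr.is_alive():
            # Join with a finite timeout so the deadline is re-checked between waits.
            thr.join(poll_secs)
            if not hinted and thr.is_alive() and time.monotonic() >= deadline:
                if sys.platform == "win32":
                    # Hypothetical message wording.
                    logger.info("Still waiting for tests to stop;"
                                " press ctrl-c again to force an exit.")
                else:
                    logger.info("Still waiting for tests to stop;"
                                " press ctrl-\\ to send SIGQUIT to all processes.")
                hinted = True
    return hinted
```

      In _run_tests, such a helper would presumably be called from the except (KeyboardInterrupt, SystemExit) branch, right after interrupt_flag.set(), so that the job threads get a chance to tear down the processes they spawned before resmoke.py itself exits.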
      

            Assignee:
            robert.guo@mongodb.com Robert Guo (Inactive)
            Reporter:
            max.hirschhorn@mongodb.com Max Hirschhorn