Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-40514

Race between BufferedHandler.close() and BufferedHandler.flush() leads to resmoke.py hanging with --log=buildlogger

    • Fully Compatible
    • ALL
    • v4.0, v3.6, v3.4
    • Dev Tools 2019-04-08, Dev Tools 2019-04-22
    • 5

      The changes from a4d2c52 as part of SERVER-28991 made it so a dedicated "flush thread" is used to do the actually sending of logs to logkeeper. The implementation was rewritten from using a custom AlarmClock thread for each logging handler to instead use a sched.scheduler instance to manage when to flush logs to logkeeper after every 10 seconds of the test running.

      The change from AlarmClock.dismiss() to sched.scheduler.cancel() was flawed in that canceling the next flush event may fail if the BufferedHandler.flush() method is being concurrently called. This can be seen in how the BufferedHandler.close() method attempts to cancel self.__flush_event but ignores the return value.

      def close(self):
          """Flush the buffer and tidies up any resources used by this handler."""
      
          with self.__emit_lock:
              if self.__flush_event is not None:
                  flush.cancel(self.__flush_event)
      

      As mathew.robinson and I saw when debugging together, it is therefore possible for a concurrent BufferedHandler.flush() call to schedule another flush event after sched.scheduler.cancel() fails to cancel the current one which leads to BufferedHandler.flush() being called perpetually.

      def flush(self):
          """Ensure all logging output has been flushed."""
      
          self.__flush(close_called=False)
      
          with self.__emit_lock:
              if self.__flush_event is not None:
                  # We cancel 'self.__flush_event' in case flush() was called by someone other than
                  # the flush thread to avoid having multiple flush() events scheduled.
                  flush.cancel(self.__flush_event)
                  self.__flush_event = flush.flush_after(self, delay=self.interval_secs)
                  self.__flush_scheduled_by_emit = False
      

      A straightforward fix is to have BufferedHandler.close() record it was called and to have BufferedHandler.flush() not continuing to schedule itself again in that case.

            Assignee:
            mathew.robinson@mongodb.com Mathew Robinson (Inactive)
            Reporter:
            max.hirschhorn@mongodb.com Max Hirschhorn
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: