[SERVER-26445] resmoke.py's LoggerPipe threads can stop consuming output from their pipes for long periods Created: 04/Oct/16  Updated: 02/Dec/22  Resolved: 05/Dec/16

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 3.2.12, 3.4.1, 3.5.1

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Max Hirschhorn
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-33504 Write performance tests for resmoke.p... Closed
Backwards Compatibility: Fully Compatible
Backport Completed:
Sprint: TIG 2016-12-12
Participants:
Linked BF Score: 0

 Description   

The LoggerPipe thread calls BufferedHandler.flush_with_lock() when the buffer's size exceeds its capacity. In the case of the buildlogger handlers (i.e. the _BaseBuildloggerHandler class), this is a long-running, blocking network operation that prevents the LoggerPipe thread from continuing to consume output from its pipe. While this is consistent with smoke.py's behavior, it is undesirable to have log message queue up in the pipe's buffer because the timestamps used for ordering log messages from different processes are then skewed.

Instead, we should offload the sending of all POST requests to logkeeper to the BufferedHandler's AlarmClock thread. This can be done by changing BufferedHandler.flush() to not require holding the lock during the network call and signaling the thread to wake up sooner than its scheduled sleep interval. Doing so will help ensure the LoggerPipe thread is available to read from its pipe and prevent it from becoming blocked when logging to its logging.Logger instance.

Steps to reproduce

Apply the following patch to simulate a delay when calling the flush_with_lock() method on the fixture's logger and run a test through resmoke.py.

diff --git a/buildscripts/resmokeconfig/loggers/console.yml b/buildscripts/resmokeconfig/loggers/console.yml
index b233de4..d4c4b6a 100644
--- a/buildscripts/resmokeconfig/loggers/console.yml
+++ b/buildscripts/resmokeconfig/loggers/console.yml
@@ -4,9 +4,9 @@ logging:
     handlers:
       - class: logging.StreamHandler
   fixture:
-    format: '[%(name)s] %(message)s'
+    format: '[%(name)s] %(asctime)s %(message)s'
     handlers:
-      - class: logging.StreamHandler
+      - class: delayed
   tests:
     format: '[%(name)s] %(asctime)s %(message)s'
     handlers:
diff --git a/buildscripts/resmokelib/logging/config.py b/buildscripts/resmokelib/logging/config.py
index c3960bb..9cf2f94 100644
--- a/buildscripts/resmokelib/logging/config.py
+++ b/buildscripts/resmokelib/logging/config.py
@@ -6,9 +6,11 @@ from __future__ import absolute_import
 
 import logging
 import sys
+import time
 
 from . import buildlogger
 from . import formatters
+from . import handlers
 from . import loggers
 
 
@@ -110,6 +112,26 @@ def apply_buildlogger_test_handler(logger,
     logger.addHandler(handler)
 
 
+class DelayedStreamHandler(handlers.BufferedHandler):
+
+    def __init__(self, handler):
+        super(DelayedStreamHandler, self).__init__(5, 1)
+        self.handler = handler
+
+    def setFormatter(self, formatter):
+        super(DelayedStreamHandler, self).setFormatter(formatter)
+        self.handler.setFormatter(formatter)
+
+    def flush_with_lock(self, close_called):
+        for record in self.buffer:
+            self.handler.handle(record)
+
+        # Simulate a long network delay.
+        time.sleep(2)
+
+        self.buffer = []
+
+
 def _configure_logger(logger, logger_info):
     """
     Adds the handlers specified by the configuration to 'logger'.
@@ -127,6 +149,8 @@ def _configure_logger(logger, logger_info):
             handler = logging.NullHandler()
         elif handler_class == "logging.StreamHandler":
             handler = logging.StreamHandler(sys.stdout)
+        elif handler_class == "delayed":
+            handler = DelayedStreamHandler(logging.StreamHandler(sys.stdout))
         elif handler_class == "buildlogger":
             continue  # Buildlogger handlers are applied when running tests.
         else:

Output

Observe that the timestamp assigned by resmoke.py to the message from the output pipe drifts from the timestamp recorded by the mongod.

[MongoDFixture:job0] 2016-10-04T01:10:13.799-0400 2016-10-04T01:10:13.798-0400 I CONTROL  [initandlisten] MongoDB starting : pid=69918 port=20000 dbpath=/data/db/job0/resmoke 64-bit host=gopher-blue
[MongoDFixture:job0] 2016-10-04T01:10:13.799-0400 2016-10-04T01:10:13.799-0400 I CONTROL  [initandlisten] db version v3.3.15
[MongoDFixture:job0] 2016-10-04T01:10:13.799-0400 2016-10-04T01:10:13.799-0400 I CONTROL  [initandlisten] git version: 520f5571d039b57cf9c319b49654909828971073
[MongoDFixture:job0] 2016-10-04T01:10:14.261-0400 Waiting to connect to mongod on port 20000.
[MongoDFixture:job0] 2016-10-04T01:10:15.805-0400 2016-10-04T01:10:13.799-0400 I CONTROL  [initandlisten] allocator: system
[MongoDFixture:job0] 2016-10-04T01:10:15.805-0400 2016-10-04T01:10:13.799-0400 I CONTROL  [initandlisten] modules: none
[MongoDFixture:job0] 2016-10-04T01:10:15.806-0400 2016-10-04T01:10:13.799-0400 I CONTROL  [initandlisten] build environment:
[MongoDFixture:job0] 2016-10-04T01:10:15.806-0400 2016-10-04T01:10:13.799-0400 I CONTROL  [initandlisten]     distarch: x86_64
[MongoDFixture:job0] 2016-10-04T01:10:15.915-0400 Successfully contacted the mongod on port 20000.
[MongoDFixture:job0] 2016-10-04T01:10:17.812-0400 2016-10-04T01:10:13.799-0400 I CONTROL  [initandlisten]     target_arch: x86_64
[MongoDFixture:job0] 2016-10-04T01:10:17.812-0400 2016-10-04T01:10:13.799-0400 I CONTROL  [initandlisten] options: { net: { port: 20000 }, nopreallocj: true, setParameter: { enableTestCommands: "1" }, storage: { dbPath: "/data/db/job0/resmoke" } }
[MongoDFixture:job0] 2016-10-04T01:10:17.815-0400 2016-10-04T01:10:13.800-0400 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=7680M,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),



 Comments   
Comment by Githook User [ 06/Dec/16 ]

Author:

{u'username': u'visemet', u'name': u'Max Hirschhorn', u'email': u'max.hirschhorn@mongodb.com'}

Message: SERVER-26445 Make emit() not wait for flush() to complete.

Changes the BufferedHandler to always delegate the actual flushing of
the logs (perhaps via a long-running, blocking network operation) to the
timer thread that's running in the background.

(cherry picked from commit 93989f1023869e8640dcabbe59de00275e61dfa4)
Branch: v3.2
https://github.com/mongodb/mongo/commit/9d33f57c662bcae12326c693748ed11d07beb6ea

Comment by Githook User [ 06/Dec/16 ]

Author:

{u'username': u'visemet', u'name': u'Max Hirschhorn', u'email': u'max.hirschhorn@mongodb.com'}

Message: SERVER-26445 Make emit() not wait for flush() to complete.

Changes the BufferedHandler to always delegate the actual flushing of
the logs (perhaps via a long-running, blocking network operation) to the
timer thread that's running in the background.

(cherry picked from commit 93989f1023869e8640dcabbe59de00275e61dfa4)
Branch: v3.4
https://github.com/mongodb/mongo/commit/3eba0c15cacad6e64bcb8bf89f68ef65ec0014b5

Comment by Githook User [ 05/Dec/16 ]

Author:

{u'username': u'visemet', u'name': u'Max Hirschhorn', u'email': u'max.hirschhorn@mongodb.com'}

Message: SERVER-26445 Make emit() not wait for flush() to complete.

Changes the BufferedHandler to always delegate the actual flushing of
the logs (perhaps via a long-running, blocking network operation) to the
timer thread that's running in the background.
Branch: master
https://github.com/mongodb/mongo/commit/93989f1023869e8640dcabbe59de00275e61dfa4

Generated at Thu Feb 08 04:12:09 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.