-
Type: Improvement
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Testing Infrastructure
-
None
-
Fully Compatible
-
v3.4
-
TIG 2017-05-08, TIG 2017-05-29
-
0
Currently, for each child process spawned by resmoke.py (e.g. a mongo shell running a test), resmoke.py starts a LoggerPipe thread to asynchronously read from the child process's output pipe. (One LoggerPipe thread is created for the stdout of the child process and another is created for the stderr.) When resmoke.py is invoked with --log=buildlogger, the BufferedHandler logging handler causes an AlarmClock thread to be started. The AlarmClock thread is responsible for periodically sending logs to logkeeper via a POST request either when 2000 lines of output have been produced or when 10 seconds have passed. This means for each test that runs, there are typically 3 threads created for that child process (we don't expect anything to ever be written to stderr).
To avoid slowing down test execution when running tests in Evergreen, joining the AlarmClock thread is deferred until later. The AlarmClock thread is joined either by the FlushThread or by Python's atexit handler in the logging package, so it is possible for a significant number of these AlarmClock threads to accumulate is sending POST requests to logkeeper is taking a while. yves.duhem has found through empirical testing that thread contention appers to be responsible for why running the replica_sets_jscore_passthrough_WT task takes longer on Solaris although it isn't understood why similar behavior is not observed in the replica_sets_jscore_passthrough task. Additionally, most of these AlarmClock threads should be idle since they are waiting for 10 seconds to pass before doing any work after the test finishes.
As a way to reduce the number of threads in resmoke.py, we can change the logging model to make use of Python's sched module so that all POST requests are scheduled on a single thread. There is some risk that by making it no longer possible for resmoke.py to send concurrent requests to logkeeper that resmoke.py can fall behind should getting a response from logkeeper take a long time (e.g. on the order of multiple seconds).
- causes
-
SERVER-40514 Race between BufferedHandler.close() and BufferedHandler.flush() leads to resmoke.py hanging with --log=buildlogger
- Closed