Loading...

XML

Word

Printable

JSON

Type: Improvement
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 3.4.7, 3.5.8
Affects Version/s: None
Component/s: Testing Infrastructure
Labels:
None

Backwards Compatibility:
Fully Compatible
Backport Requested:

v3.4
Sprint:
TIG 2017-05-08, TIG 2017-05-29
Linked BF Score:
0
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Currently, for each child process spawned by resmoke.py (e.g. a mongo shell running a test), resmoke.py starts a LoggerPipe thread to asynchronously read from the child process's output pipe. (One LoggerPipe thread is created for the stdout of the child process and another is created for the stderr.) When resmoke.py is invoked with --log=buildlogger, the BufferedHandler logging handler causes an AlarmClock thread to be started. The AlarmClock thread is responsible for periodically sending logs to logkeeper via a POST request either when 2000 lines of output have been produced or when 10 seconds have passed. This means for each test that runs, there are typically 3 threads created for that child process (we don't expect anything to ever be written to stderr).

To avoid slowing down test execution when running tests in Evergreen, joining the AlarmClock thread is deferred until later. The AlarmClock thread is joined either by the FlushThread or by Python's atexit handler in the logging package, so it is possible for a significant number of these AlarmClock threads to accumulate is sending POST requests to logkeeper is taking a while. yves.duhem has found through empirical testing that thread contention appers to be responsible for why running the replica_sets_jscore_passthrough_WT task takes longer on Solaris although it isn't understood why similar behavior is not observed in the replica_sets_jscore_passthrough task. Additionally, most of these AlarmClock threads should be idle since they are waiting for 10 seconds to pass before doing any work after the test finishes.

As a way to reduce the number of threads in resmoke.py, we can change the logging model to make use of Python's sched module so that all POST requests are scheduled on a single thread. There is some risk that by making it no longer possible for resmoke.py to send concurrent requests to logkeeper that resmoke.py can fall behind should getting a response from logkeeper take a long time (e.g. on the order of multiple seconds).

causes

SERVER-40514 Race between BufferedHandler.close() and BufferedHandler.flush() leads to resmoke.py hanging with --log=buildlogger

Closed

Assignee:: Max Hirschhorn
Reporter:: Max Hirschhorn
Participants:: Githook User, Max Hirschhorn
Votes:: 0 Vote for this issue
Watchers:: 4 Start watching this issue

Created:: Apr 26 2017 09:07:48 PM UTC
Updated:: Oct 30 2023 11:17:10 PM UTC
Resolved:: May 16 2017 04:11:37 PM UTC
Confidence Status Last Update:: 26/Apr/17 9:08 PM

Details

Description

Attachments

Issue Links

Forms

Activity

People

Dates