[SERVER-28991] Eliminate having individual threads in resmoke.py for flushing log output to logkeeper Created: 26/Apr/17  Updated: 30/Oct/23  Resolved: 16/May/17

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 3.4.7, 3.5.8

Type: Improvement Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
causes SERVER-40514 Race between BufferedHandler.close() ... Closed
Related
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.4
Sprint: TIG 2017-05-08, TIG 2017-05-29
Participants:
Linked BF Score: 0

 Description   

Currently, for each child process spawned by resmoke.py (e.g. a mongo shell running a test), resmoke.py starts a LoggerPipe thread to asynchronously read from the child process's output pipe. (One LoggerPipe thread is created for the stdout of the child process and another is created for the stderr.) When resmoke.py is invoked with --log=buildlogger, the BufferedHandler logging handler causes an AlarmClock thread to be started. The AlarmClock thread is responsible for periodically sending logs to logkeeper via a POST request either when 2000 lines of output have been produced or when 10 seconds have passed. This means for each test that runs, there are typically 3 threads created for that child process (we don't expect anything to ever be written to stderr).

To avoid slowing down test execution when running tests in Evergreen, joining the AlarmClock thread is deferred until later. The AlarmClock thread is joined either by the FlushThread or by Python's atexit handler in the logging package, so it is possible for a significant number of these AlarmClock threads to accumulate is sending POST requests to logkeeper is taking a while. yves.duhem has found through empirical testing that thread contention appers to be responsible for why running the replica_sets_jscore_passthrough_WT task takes longer on Solaris although it isn't understood why similar behavior is not observed in the replica_sets_jscore_passthrough task. Additionally, most of these AlarmClock threads should be idle since they are waiting for 10 seconds to pass before doing any work after the test finishes.

As a way to reduce the number of threads in resmoke.py, we can change the logging model to make use of Python's sched module so that all POST requests are scheduled on a single thread. There is some risk that by making it no longer possible for resmoke.py to send concurrent requests to logkeeper that resmoke.py can fall behind should getting a response from logkeeper take a long time (e.g. on the order of multiple seconds).



 Comments   
Comment by Githook User [ 07/Jul/17 ]

Author:

{u'username': u'visemet', u'name': u'Max Hirschhorn', u'email': u'max.hirschhorn@mongodb.com'}

Message: SERVER-28991 Eliminate having an AlarmClock thread for each test.

Changes resmoke.py to use its FlushThread for managing all calls to
flush() buildlogger handlers.

(cherry picked from commit a4d2c52dbabc323f752776c7c86ac1d8a60187cb)
Branch: v3.4
https://github.com/mongodb/mongo/commit/4a7a14dac51d112bb235717882c0458668c1332b

Comment by Githook User [ 16/May/17 ]

Author:

{u'username': u'visemet', u'name': u'Max Hirschhorn', u'email': u'max.hirschhorn@mongodb.com'}

Message: SERVER-28991 Eliminate having an AlarmClock thread for each test.

Changes resmoke.py to use its FlushThread for managing all calls to
flush() buildlogger handlers.
Branch: master
https://github.com/mongodb/mongo/commit/a4d2c52dbabc323f752776c7c86ac1d8a60187cb

Generated at Thu Feb 08 04:19:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.