[SERVER-33504] Write performance tests for resmoke.py logging to console and to logkeeper Created: 10/Jan/18  Updated: 29/Oct/23  Resolved: 28/Feb/18

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 3.7.3

Type: Task Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-26445 resmoke.py's LoggerPipe threads can s... Closed
Backwards Compatibility: Fully Compatible
Sprint: TIG 2018-1-29, TIG 2018-02-12, TIG 2018-02-26, TIG 2018-03-12
Participants:

 Description   

SERVER-26445 was a bug in resmoke.py that caused arbitrarily long delays between the time when a process would write to its output pipe and the time when resmoke.py would read the log message. This would cause problems for programs such as mongod that rely on writing logs messages to be fast in order to avoid holding LockManager locks and/or mutexes for long periods of time, thereby blocking other threads from doing useful work. In particular, this would manifest as the primary stepping down due to the secondary being unable to respond to heartbeat messages while the ReplicationCoordinatorImpl::_mutex is held.

Even after the changes from SERVER-26445 where committed, we've seen other cases of time drift between the timestamp that resmoke.py reports and the timestamp that mongod reports.

BF-4604, BF-4970, BF-6290 are other Evergreen failures where there was an apparent time drift between the timestamp that the MongoDB process reports and the timestamp that resmoke.py reports despite having the changes from SERVER-26445.

https://jira.mongodb.org/browse/BF-6557?focusedCommentId=1674918&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1674918

Additionally, we've also see cases (namely those linked to MCI-2631) where resmoke.py itself gets blocked writing to its output pipe and becomes unable to consume log messages from its subprocesses. The GIL in CPython isn't held while doing I/O, so it is unexpected that the LoggerPipe threads wouldn't be able to continue consuming log messages.



 Comments   
Comment by Githook User [ 28/Feb/18 ]

Author:

{'email': 'max.hirschhorn@mongodb.com', 'name': 'Max Hirschhorn', 'username': 'visemet'}

Message: SERVER-33504 Add test suite for resmoke.py logging performance.

Runs a configurable number of "yes" processes for the specified amount
of time (in seconds).
Branch: master
https://github.com/mongodb/mongo/commit/e25af89eaa5fcfd3b5395b766eef4c7d93f44442

Comment by Max Hirschhorn [ 20/Feb/18 ]

The lower # log messages results for # executables = 5 indicates that the --log=buildlogger mechanism is causing yes processes to block more frequently than with the --log=console or --log=file mechanisms. It would be interesting to use a Python profile (e.g. cProfile, yappi, pyflame) to better understand where time is being spent by the LoggerPipe threads.

mechanism # executables message length duration (secs) # log messages time taken (secs)
console 1 100 10 41055 10.11
console 1 100 30 126220 30.12
console 1 100 60 247264 60.13
console 2 100 10 66572 10.26
console 2 100 30 196336 30.3
console 2 100 60 392772 60.27
console 5 100 10 75744 11.15
console 5 100 30 220225 31.57
console 5 100 60 424701 62.43
file 1 100 10 40977 10.11
file 1 100 30 126199 30.11
file 1 100 60 248686 60.12
file 2 100 10 68091 10.27
file 2 100 30 189994 30.29
file 2 100 60 380876 60.45
file 5 100 10 79739 11.66
file 5 100 30 217524 31.85
file 5 100 60 429161 61.41
buildlogger 1 100 10 40404 10.45
buildlogger 1 100 30 125959 30.24
buildlogger 1 100 60 246355 60.23
buildlogger 2 100 10 59056 10.54
buildlogger 2 100 30 182369 30.48
buildlogger 2 100 60 358960 60.51
buildlogger 5 100 10 25376 14.8
buildlogger 5 100 30 55766 34.93
buildlogger 5 100 60 104901 64.69

The following patterns were used to count the number of log messages under each of the different mechanisms. It is also worth noting that the build_num in the mci.buildlogger file were changed to ensure that a different build_id would be returned by logkeeper.

# For --log=console
grep -o -E "\[YesFixture:job0\] y{100}" temp.log | wc -l
 
# For --log=file
grep -o -E "\[YesFixture:job0\] y{100}" fixture.log | wc -l
 
# For --log=buildlogger
curl 'https://logkeeper.mongodb.org/build/{build_id}/all?raw=1' -o temp.log; grep -o -E "\[YesFixture:job0\] y{100}" temp.log | wc -l

Generated at Thu Feb 08 04:33:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.