-
Type: Task
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Testing Infrastructure
-
None
-
Fully Compatible
-
TIG 2018-1-29, TIG 2018-02-12, TIG 2018-02-26, TIG 2018-03-12
SERVER-26445 was a bug in resmoke.py that caused arbitrarily long delays between the time when a process would write to its output pipe and the time when resmoke.py would read the log message. This would cause problems for programs such as mongod that rely on writing logs messages to be fast in order to avoid holding LockManager locks and/or mutexes for long periods of time, thereby blocking other threads from doing useful work. In particular, this would manifest as the primary stepping down due to the secondary being unable to respond to heartbeat messages while the ReplicationCoordinatorImpl::_mutex is held.
Even after the changes from SERVER-26445 where committed, we've seen other cases of time drift between the timestamp that resmoke.py reports and the timestamp that mongod reports.
BF-4604, BF-4970, BF-6290 are other Evergreen failures where there was an apparent time drift between the timestamp that the MongoDB process reports and the timestamp that resmoke.py reports despite having the changes from
SERVER-26445.
Additionally, we've also see cases (namely those linked to MCI-2631) where resmoke.py itself gets blocked writing to its output pipe and becomes unable to consume log messages from its subprocesses. The GIL in CPython isn't held while doing I/O, so it is unexpected that the LoggerPipe threads wouldn't be able to continue consuming log messages.
- is related to
-
SERVER-26445 resmoke.py's LoggerPipe threads can stop consuming output from their pipes for long periods
- Closed