[SERVER-36582] stepdown hook should log the timestamp of any exceptions Created: 10/Aug/18  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: 4.1.1
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Robert Guo (Inactive) Assignee: Backlog - Server Tooling and Methods (STM) (Inactive)
Resolution: Unresolved Votes: 0
Labels: tig-resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Server Tooling & Methods
Backwards Compatibility: Fully Compatible
Sprint: TIG 2018-08-27
Participants:
Linked BF Score: 15

 Description   

The thread running the stepdown hook doesn't log anything when an exception occurs, there is only a stacktrace. This makes it hard to tell when the exception happened. We should log something explicitly on exceptions so there's a timestamp available, which can then be used to compare with the test logs to find what is happening at the time of the failure.



 Comments   
Comment by Steven Vannelli [ 10/May/22 ]

Moving this ticket to the Backlog and removing the "Backlog" fixVersion as per our latest policy for using fixVersions.

Comment by Max Hirschhorn [ 28/Aug/18 ]

I'm reopening this ticket because the _StepdownThread.logger is a child of the fixture logger, which doesn't include a timestamp in its log messages because one is expected to be present from the MongoDB server. yves.duhem, feel free to file a more generic SERVER ticket about having the loggers used by resmoke.py rather than the MongoDB processes have separate prefixes and also include timestamps in their messages. I just want to track that the goal of this ticket has yet to actually be achieved so we don't try and backport it right now.

[ContinuousStepdown:job4] Stepdown Thread threw exception
Traceback (most recent call last):
  File "/data/mci/320ed3830c9bfc5cfe1cd94a28960edb/src/buildscripts/resmokelib/testing/hooks/stepdown.py", line 188, in run
    self._step_down_all()
  File "/data/mci/320ed3830c9bfc5cfe1cd94a28960edb/src/buildscripts/resmokelib/testing/hooks/stepdown.py", line 257, in _step_down_all
    self._step_down(rs_fixture)
  File "/data/mci/320ed3830c9bfc5cfe1cd94a28960edb/src/buildscripts/resmokelib/testing/hooks/stepdown.py", line 355, in _step_down
    primary.await_ready()
  File "/data/mci/320ed3830c9bfc5cfe1cd94a28960edb/src/buildscripts/resmokelib/testing/fixtures/standalone.py", line 102, in await_ready
    self.port, MongoDFixture.AWAIT_READY_TIMEOUT_SECS))
ServerFailure: Failed to connect to mongod on port 21001 after 300 seconds

https://logkeeper.mongodb.org/lobster/build/615bbff2964c989bd71e2670ae8d7f0d/test/5b7c2939c2ab6832aa05449f#bookmarks=0%2C20381%2C20382%2C20383%2C20384%2C20385%2C20386%2C20387%2C20388%2C20389%2C20390%2C20391%2C20577

Comment by Githook User [ 20/Aug/18 ]

Author:

{'name': 'Robert Guo', 'email': 'robert.guo@10gen.com', 'username': 'guoyr'}

Message: SERVER-36582 stepdown hook should log exceptions explicitly
Branch: master
https://github.com/mongodb/mongo/commit/2ba04aa6156baf973a1ced2c7ec013c590458fc8

Comment by Max Hirschhorn [ 10/Aug/18 ]

Max Hirschhorn The uncaught exception is logged much later, not at the time of the failure.

Got it. The Python interpreter is going to write that message to stderr as soon as the exception happens but doesn't call flush on sys.stderr so it isn't guaranteed to be read from the output pipe by Evergreen agent immediately.

[2018/07/24 02:21:27.970] ServerFailure: The stepdown thread is not running.
...
[2018/07/24 02:21:34.807] Exception in thread StepdownThread:

Comment by Max Hirschhorn [ 10/Aug/18 ]

The Evergreen agent will fill in a timestamp of when the stepdown thread had an uncaught exception and logged it to resmoke.py's stderr. robert.guo, what else are you looking to have happen?

Generated at Thu Feb 08 04:43:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.