[SERVER-72860] Python exceptions in create_fixture_table() cause resmoke to incorrectly mark Evergreen tasks as setup failures Created: 14/Jan/23  Updated: 29/Oct/23  Resolved: 06/Feb/23

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: 5.0.0, 6.0.0, 6.2.0-rc6
Fix Version/s: 6.3.0-rc0

Type: Bug Priority: Critical - P2
Reporter: Max Hirschhorn Assignee: Tausif Rahman (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-64151 resmoke.py fails when running with --... Closed
is related to SERVER-55548 resmoke.py reports stale "Fixture sta... Closed
is related to SERVER-50085 Make it easier to correlate mongo pro... Closed
is related to SERVER-66045 Run an unbounded number of splits dur... Closed
Assigned Teams:
Server Development Platform
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

python buildscripts/resmoke.py run --suite=sharding_jscore_passthrough --log=buildlogger jstests/core/query/all/all.js

diff --git a/buildscripts/resmokelib/logging/buildlogger.py b/buildscripts/resmokelib/logging/buildlogger.py
index 1ff2689ea64..d1e89db4278 100644
--- a/buildscripts/resmokelib/logging/buildlogger.py
+++ b/buildscripts/resmokelib/logging/buildlogger.py
@@ -266,10 +266,7 @@ class BuildloggerServer(object):
     def __init__(self):
         """Initialize BuildloggerServer."""
         tmp_globals = {}
-        self.config = {}
-        exec(
-            compile(open(_BUILDLOGGER_CONFIG, "rb").read(), _BUILDLOGGER_CONFIG, 'exec'),
-            tmp_globals, self.config)
+        self.config = dict(username="u", password="p", builder="b", build_num="1")
 
         # Rename "slavename" to "username" if present.
         if "slavename" in self.config and "username" not in self.config:
diff --git a/buildscripts/resmokelib/logging/flush.py b/buildscripts/resmokelib/logging/flush.py
index 16335ef44ab..a2b32d5e9a2 100644
--- a/buildscripts/resmokelib/logging/flush.py
+++ b/buildscripts/resmokelib/logging/flush.py
@@ -35,7 +35,7 @@ def stop_thread():
     _FLUSH_THREAD.signal_shutdown()
     # Wait for 1min instead of _FLUSH_THREAD.await_shutdown() because we can
     # sometimes wait indefinitely for a response, causing a task timeout.
-    _FLUSH_THREAD.join(60)
+    _FLUSH_THREAD.join(5)
 
     success = not _FLUSH_THREAD.is_alive()
     return success
diff --git a/buildscripts/resmokelib/logging/handlers.py b/buildscripts/resmokelib/logging/handlers.py
index 29292a3bdef..5ff2d068762 100644
--- a/buildscripts/resmokelib/logging/handlers.py
+++ b/buildscripts/resmokelib/logging/handlers.py
@@ -192,6 +192,8 @@ class HTTPHandler(object):
         on the content type.
         """
 
+        return dict(id="fake_id")
+
         data = utils.default_if_none(data, [])
         data = json.dumps(data)
 
diff --git a/buildscripts/resmokelib/testing/fixtures/shardedcluster.py b/buildscripts/resmokelib/testing/fixtures/shardedcluster.py
index dddb01ca8d2..dbfdabf1018 100644
--- a/buildscripts/resmokelib/testing/fixtures/shardedcluster.py
+++ b/buildscripts/resmokelib/testing/fixtures/shardedcluster.py
@@ -267,6 +267,7 @@ class ShardedClusterFixture(interface.Fixture):
         output = []
         for shard in self.shards:
             output += shard.get_node_info()
+        raise AttributeError("Intentionally raised")
         for mongos in self.mongos:
             output += mongos.get_node_info()
         return output + self.configsvr.get_node_info()

Participants:
Linked BF Score: 159

 Description   

Some of the commits impacted by BF-27442 had a large number of setup failures (for example, the task linked below).

[2023/01/11 04:17:48.924] [executor:js_test:job0] 04:17:48.919Z The setup of ShardedClusterFixture (Job #0) failed.
[2023/01/11 04:17:48.932] [executor:js_test:job0] 04:17:48.928Z Encountered an error when tearing down the fixture.
[2023/01/11 04:17:48.932] Traceback (most recent call last):
[2023/01/11 04:17:48.932]   File "/data/mci/4d10ab542ccbbb72ec0cfdae37347ba8/src/buildscripts/resmokelib/testing/job.py", line 95, in __call__
[2023/01/11 04:17:48.932]     teardown_succeeded = self.manager.teardown_fixture(self.logger)
[2023/01/11 04:17:48.932]   File "/data/mci/4d10ab542ccbbb72ec0cfdae37347ba8/src/buildscripts/resmokelib/testing/job.py", line 384, in teardown_fixture
[2023/01/11 04:17:48.932]     self.report.logging_prefix = create_fixture_table(self.fixture)
[2023/01/11 04:17:48.932]   File "/data/mci/4d10ab542ccbbb72ec0cfdae37347ba8/src/buildscripts/resmokelib/testing/fixtures/interface.py", line 360, in create_fixture_table
[2023/01/11 04:17:48.932]     info: List[NodeInfo] = fixture.get_node_info()
[2023/01/11 04:17:48.932]   File "/data/mci/4d10ab542ccbbb72ec0cfdae37347ba8/src/buildscripts/resmokelib/testing/fixtures/shardedcluster.py", line 271, in get_node_info
[2023/01/11 04:17:48.932]     output += mongos.get_node_info()
[2023/01/11 04:17:48.932]   File "/data/mci/4d10ab542ccbbb72ec0cfdae37347ba8/src/buildscripts/resmokelib/testing/fixtures/shardedcluster.py", line 573, in get_node_info
[2023/01/11 04:17:48.932]     port=self.port, pid=self.mongos.pid)
[2023/01/11 04:17:48.932] AttributeError: 'NoneType' object has no attribute 'pid'
...
[2023/01/11 04:19:00.678] [resmoke] 04:19:00.678Z Failed to flush all logs within a reasonable amount of time, treating logs as incomplete
[2023/01/11 04:19:00.678] [resmoke] 04:19:00.678Z Exiting with code 75 rather than requested code 2 because we failed to flush all log output to logkeeper.

https://parsley.mongodb.com/evergreen/mongodb_mongo_master_enterprise_rhel_80_64_bit_dynamic_all_feature_flags_required_sharding_jscore_passthrough_3a842713b25c2945fe1884abd8e60203f37f6258_23_01_11_03_08_29/0/task?bookmarks=0,1522,1597,1598&selectedLine=1522

Setup failures are intentionally ignored by the Build Barons, so this can delay the identification of true failures. (Setup failures are ignored because Logkeeper instability has been generally accepted and accommodated within the testing infrastructure; see SERVER-35472. The concept of setup failures may be worth revisiting now that Logkeeper has moved to S3, but I consider that outside the scope of this issue.)

It looks like the changes to standalone.py in 3805148 as part of SERVER-66045 made it so get_node_info() no longer raises an exception when fixture setup has failed for mongod. However, the equivalent case where fixture setup fails for mongos is not handled, which is why the setup failures observed here all occur with the ShardedClusterFixture.
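
For illustration, a minimal sketch of what the equivalent guard for mongos could look like, mirroring the standalone.py change. The attribute names self.mongos, self.port, and the port/pid arguments follow the traceback above; the other NodeInfo arguments and the surrounding class are simplified assumptions:

# Hypothetical, simplified sketch; self.mongos is the process handle, which
# is still None when setup failed before the mongos process was launched.
def get_node_info(self):
    """Return node info for this mongos, or an empty list if it never started."""
    if self.mongos is None:
        self.logger.warning("The mongos fixture has not been set up yet.")
        return []
    return [
        interface.NodeInfo(full_name=self.logger.full_name, name="mongos",
                           port=self.port, pid=self.mongos.pid)
    ]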

Note: The uncaught exception at fixture teardown also causes resmoke to leak processes upon exit. It may be worthwhile to revisit whether the calls to create_fixture_table() in job.py should have their own try/except block too.
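
For illustration, a hedged sketch of what such a try/except at the job.py call site could look like (the surrounding code is simplified, not resmoke's actual implementation):

# Simplified sketch of the call site in job.py's teardown_fixture; catching
# broadly here reflects "table construction must never break teardown".
try:
    self.report.logging_prefix = create_fixture_table(self.fixture)
except Exception:
    self.logger.exception(
        "Error building the fixture table; proceeding with fixture teardown anyway.")
    self.report.logging_prefix = None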



 Comments   
Comment by Githook User [ 06/Feb/23 ]

Author:

Tausif Rahman <tausif.rahman@mongodb.com> (username: trahman1318)

Message: SERVER-72860 Python exceptions in create_fixture_table() cause resmoke to incorrectly mark Evergreen tasks as setup failures
Branch: master
https://github.com/mongodb/mongo/commit/1a9b7b1566e651d2694146ca170816d71cc7f44f

Comment by Tausif Rahman (Inactive) [ 01/Feb/23 ]

The challenging part of this was figuring out where the infrastructure failure was coming from, and it has to do with log flushing. That is the main fix here, but there is also a drive-by.

I took a deeper look into how we do logging in resmoke, and here are my findings:

  • In addition to the normal console logging via the built-in `logging` library, resmoke allows setting up logging handlers that send logs to other services (e.g. evergreen/s3/buildlogger).
  • This is how things run in CI, so that if a machine crashes or other issues occur, the logs are not lost.
  • There are two types of logging handlers: `Fixture` and `Test`. The handlers are set up so that they continually flush logs every [X] seconds unless explicitly closed (check "recursive flush" and "explicit close").
  • `Test` handlers get set up and closed via `TestReport`, which is passed to the execution of each test (check _execute_test). This reliably cleans up any `Test` logging handlers that were set up, regardless of whether the tests pass or fail.
  • When a test suite is executed, `Job`s are created to run tests independently. Each `Job` has its own `Fixture` to run tests against. `Fixture` handlers get set up when a `Job` instance's fixture is created (specifically here).
  • These `Fixture` handlers are not cleaned up until the `Fixture`'s teardown method is called.
  • The exception in `teardown_fixture` from create_fixture_table prevents the actual teardown method from being called, which keeps the "recursive flush" on the scheduler running indefinitely.
  • This never-ending queue causes the `FlushThread`'s `.join` to fail here, which causes the infra failure (see the sketch after this list).
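
A minimal, self-contained sketch of this "recursive flush" pattern (not resmoke's actual code; the class and variable names are invented for illustration):

import sched
import threading
import time

class RecursivelyFlushedHandler:
    """Toy handler that re-enqueues its own flush until explicitly closed."""

    def __init__(self, scheduler, interval):
        self.scheduler = scheduler
        self.interval = interval
        self.closed = False
        self._schedule_flush()

    def _schedule_flush(self):
        # Each flush enqueues the next one; this keeps the queue non-empty.
        self.scheduler.enter(self.interval, 1, self._flush)

    def _flush(self):
        if not self.closed:
            # ... send buffered log lines to the remote service here ...
            self._schedule_flush()

    def close(self):
        # An explicit close breaks the recursion so the queue can drain.
        self.closed = True

scheduler = sched.scheduler(time.monotonic, time.sleep)
handler = RecursivelyFlushedHandler(scheduler, interval=0.1)
flush_thread = threading.Thread(target=scheduler.run, daemon=True)
flush_thread.start()

# If close() is never reached (e.g. an exception skips fixture teardown), the
# queue never empties, scheduler.run() never returns, and join() times out;
# that is the "Failed to flush all logs" / exit-code-75 path in the logs above.
handler.close()
flush_thread.join(5)
print("flush thread alive:", flush_thread.is_alive())  # False once drained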

Proposed Fix: Add a `finally` block during fixture teardown, so that there are never any rogue logging handlers.
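
Roughly, the shape of that fix (a hypothetical sketch; close_fixture_logging_handlers stands in for whatever actually closes the `Fixture` handlers, and the real job.py code differs):

# Hypothetical sketch of teardown_fixture in job.py with the proposed finally.
def teardown_fixture(self, logger):
    try:
        self.report.logging_prefix = create_fixture_table(self.fixture)
        return self.fixture.teardown()
    finally:
        # Runs even when create_fixture_table() or teardown() raises, so no
        # handler is left re-enqueueing its own flush forever.
        close_fixture_logging_handlers(self.fixture)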

There is also a drive-by which mimics the change from here so that this failure no longer happens on sharded clusters.

I have https://github.com/10gen/mongo/pull/10226 ready for review to address this.
